分类目录归档:所有文章

Genomics HW4速成指南

Genomics HW4 速成指南

这次的作业中只有第一题有一些难度。因为要用动态规划进行序列比对。更细致的来讲,使用Smith-Waterman算法进行局部比对。

动态规划(英语:Dynamic programming,简称DP)是一种在数学、管理科学、计算机科学、经济学和生物信息学中使用的,通过把原问题分解为相对简单的子问题的方式求解复杂问题的方法。(From Wikipedia)

Smith-Waterman算法中涉及到一个打分矩阵,简单的来说就是先将第一行和第一列赋值为0,再按照下面的打分公式对每一个小格进行填充,并记录下来最大打分分数的来源,最后从最大分数开始回溯到0,就是两个序列之间的最佳匹配。设M_{ij}为这个以小格子的填充分数,那么打分的公式有:

F(i,j) = max \begin{cases} F(i-1,j-1)+s(x_i,y_j)\\\\ F(i-1,j)+GapPenalty \\\\F(i,j-1)+ GapPenalty \\\\ 0 \end{cases}

用python3实现:

#Genomics Homework4 
#Use DP algorithm to find the highest scored alignment
#1. Calculate the alignment score of seq1&seq2
#2. find the best alignment
#3. Calculate the best alignment score 

seq1 = "ACGCNCTTTTCTCGCGTACCTTTACTAATAGAATGCAAAGACGTCCCCCG"
seq2 = "CTCGCGTACCTTTACTAAGAGAATGCGNAGACGTCCCCCGGNGGACCGAT"

match_award = 1
mismatch_penalty = -2
N = -1
gap_penalty = -3
Extendgap_recalibration = 1 #延伸gap的罚分是-2,检测到上一个来自于上方向或者左方向,就给gap_penalty+1

def alignScore(seq1,seq2):
    score =0
    for i,j in zip(seq1,seq2):
        if(i == "N" or j == "N"):
            score -=1
        elif (i == "-" or j =="-"):
            score -= 2
        elif (i == j):
            score +=1
        else :
            score -=2 #mismatch
    return score

# 初始化打分表
def zeros(shape):
    retval = []
    for x in range(shape[0]):
        retval.append([])
        for y in range(shape[1]):
            retval[-1].append(0)
    return retval

# 计算得分
def match_score(alpha, beta):
    if alpha == beta:
        return match_award
    elif alpha == 'N' or beta == 'N':
        return N
    elif alpha == '-' or beta == '-':
        return gap_penalty
    else:
        return mismatch_penalty

def water(seq1, seq2):
    m, n = len(seq1), len(seq2)  # length of two sequences

    # Generate DP table and traceback path pointer matrix
    score = zeros((m+1, n+1))      # the DP table
    pointer = zeros((m+1, n+1))    # to store the traceback path

    max_score = 0        # initial maximum score in DP table

    # Calculate DP table and mark pointers
    for i in range(1, m + 1):
        for j in range(1, n + 1):#第一行第一列都是0,所以循环从1开始。
            score_diagonal = score[i-1][j-1] + match_score(seq1[i-1], seq2[j-1])
            #填充的格子从1开始,填充格子的1对应序列的0,所以match_score的下表要减去1
            score_up = score[i][j-1] + gap_penalty
            score_left = score[i-1][j] + gap_penalty

            if (pointer[i][j-1] == ( 2)):
                score_up = score[i][j-1] + gap_penalty + Extendgap_recalibration
            if (pointer[i-1][j] == (1 )):
                score_left = score[i-1][j] + gap_penalty + Extendgap_recalibration

            score[i][j] = max(0,score_left, score_up, score_diagonal)

            if score[i][j] == 0:
                pointer[i][j] = 0 # 0 means end of the path
            if score[i][j] == score_left:
                pointer[i][j] = 1 # 1 means trace up
            if score[i][j] == score_up:
                pointer[i][j] = 2 # 2 means trace left
            if score[i][j] == score_diagonal:
                pointer[i][j] = 3 # 3 means trace diagonal
            if score[i][j] >= max_score:
                max_i = i
                max_j = j
                max_score = score[i][j]

    align1, align2 = '', ''    # initial sequences

    i,j = max_i,max_j    # indices of path starting point

    #traceback, follow pointers
    while pointer[i][j] != 0:
        if pointer[i][j] == 3:
            align1 += seq1[i-1]
            align2 += seq2[j-1]
            i -= 1
            j -= 1
        elif pointer[i][j] == 2:
            align1 += '-'
            align2 += seq2[j-1]
            j -= 1
        elif pointer[i][j] == 1:
            align1 += seq1[i-1]
            align2 += '-'
            i -= 1
    #[align1,align2,symbol,maxscore] 
    answerlist= finalize(align1, align2)
    answerlist.append(max_score)
    return answerlist

def finalize(align1, align2):
    align1 = align1[::-1]    #reverse sequence 1
    align2 = align2[::-1]    #reverse sequence 2

    i,j = 0,0

    #输出结果: alignment 
    symbol = ''
    identity = 0
    for i in range(0,len(align1)):
        # if two AAs are the same, then output the letter
        if (align1[i] == align2[i]):                
            symbol += "*"
        # if one of them is N
        elif (align1[i] =="N" or align2[i]=="N"):
            symbol += "N"
        # if they are not identical and none of them is gap
        elif (align1[i] != align2[i] and align1[i] != '-' and align2[i] != '-'): 
            symbol += 'M'
        #if one of them is a gap, output a -
        elif (align1[i] == '-' or align2[i] == '-'):          
            symbol += '-'

    return[align1,align2,symbol]

def printer(answerlist):
    #answerlist: [align1,align2,symbol,maxscore]
    print("seq1 " + answerlist[0])
    print("seq2 " + answerlist[1])
    print("symb " + answerlist[2])
    print("Best Alignment score: " + str(answerlist[3]))


print("Alignment socre: " + str(alignScore(seq1,seq2)))
printer(water(seq1,seq2))

新年快乐

早有:

祸兮福之所倚,福兮祸之所伏。孰知其极?其无正。正复为奇,善复为妖。人之迷,其日固久!是以圣人方而不割,廉而不刿,直而不肆,光而不耀。

近来有:

世尊告诸比丘,我忆宿命未成正觉时,独一静处,专精禅思。作是念:何法有故?老死有,何法缘故?老死有。即正思惟,生如实无间等:生有故,老死有,生缘故,老死有。如是有、取、爱、受、触、六入处、名色。何法有故?名色有,何法缘故?名色有。即正思惟,如实无间等生:识有故,名色有,识缘故有,名色有。我作是思惟时,齐识而还,不能过彼,谓:缘识名色,缘名色六入处,缘六入处触,缘触受,缘受爱,缘爱取,缘取有,缘有生,缘生老、病、死、忧、悲、恼、苦,如是如是纯大苦聚集。
我时作是念:何法无故?则老死无,何法灭故?老死灭。即正思惟,生如实无间等:生无故,老死无,生灭故,老死灭。如是生、有、取、爱、受、触、六入处、名色、识、行广说。我复作是思惟,何法无故?行无,何法灭故?行灭,即正思惟,如实无间等:无明无故,行无,无明灭故,行灭,行灭故识灭,识灭故名色灭,名色灭故六入处灭,六入处灭故触灭,触灭故受灭,受灭故爱灭,爱灭故取灭,取灭故有灭,有灭故生灭,生灭故老、病、死、忧、悲、恼、苦灭,如是如是纯大苦聚灭。
我时作是念:我得古仙人道,古仙人迳,古仙人道迹,古仙人从此迹去,我今随去。……今我如是,得古仙人道,……谓八圣道。……我从彼道,见老、病、死,老、病、死集,老、病、死灭,老、病、死灭道迹。见生、有、取、爱、受、触、六入处、名色、识、行,行集,行灭,行灭道迹。我于此法,自知自觉,成等正觉。

我觉得他们说得有理。这算是19年末的一点收获。

不管怎么说,2020年一定要开开心心的。是的说你呢
哦对了,你明年也要开开心心的。(反正你也看不见)
新年快乐。

Skelviper 12.30于珞珈山