Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
600 views
in Technique[技术] by (71.8m points)

algorithm - how to match dna sequence pattern

I am getting a trouble finding an approach to solve this problem.

Input-output sequences are as follows :

 **input1 :** aaagctgctagag 
 **output1 :** a3gct2ag2

 **input2 :** aaaaaaagctaagctaag 
 **output2 :** a6agcta2ag

Input nsequence can be of 10^6 characters and largest continuous patterns will be considered.

For example for input2 "agctaagcta" output will not be "agcta2gcta" but it will be "agcta2".

Any help appreciated.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Explanation of the algorithm:

  • Having a sequence S with symbols s(1), s(2),…, s(N).
  • Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
  • So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
  • What we want to know is B(N).

To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is empty sequence, and and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)

To calculate B(i+1), we will pick the best sequence among the candidates:

  1. Candidate sequences where the last block has one element:

    B(i )s(i+1)1 B(i-1)s(i+1)2 ; only if s(i) = s(i+1) B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1) … B(1)s(i+1)[i-1] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1) B(0)s(i+1)i = s(i+1)i ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)

  2. Candidate sequences where the last block has 2 elements:

    B(i-1)s(i)s(i+1)1 B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1) B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1) …

  3. Candidate sequences where the last block has 3 elements:

  4. Candidate sequences where the last block has 4 elements:

  5. Candidate sequences where last block has n+1 elements:

    s(1)s(2)s(3)………s(i+1)

For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.

The algorithm will be some thing like this in psude-c code:

B(0) = “”
for (i=1; i<=N; i++) {
    // Calculate all the candidates for B(i)
    BestCandidate=null
    for (j=1; j<=i; j++) {
        Calculate all the candidates of length (i)

        r=1;
        do {
             Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
             If (   (BestCandidate==null) 
                      || (Candidate is shorter that BestCandidate)) 
                 {
            BestCandidate=Candidate.
                 }
             r++;
        } while (  ([i-j]*r <= i) 
             &&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))

    }
    B(i)=BestCandidate
}

Hope that this can help a little more.

The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.

EDIT I have restructured a little bit the code, changed the names of the variables and added some comment in order to be more readable.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>


// This struct represents a compressed segment like atg4, g3,  agc1
    struct Segment {
        char *elements;
        int nElements;
        int count;
    };
         // As an example, for the segment agagt3  elements would be:
         // {
         //      elements: "agagt",
         //      nElements: 5,
         //      count: 3
         // }

    struct Sequence {
        struct Segment lastSegment;
        struct Sequence *prev;      // Points to a sequence without the last segment or NULL if it is the first segment
        int totalLen;  // Total length of the compressed sequence.
    };
       // as an example, for the sequence agt32ta5, the representation will be:
       // {
       //     lastSegment:{"ta" , 2 , 5},
       //     prev: @A,
       //     totalLen: 8
       // }
       // and A will be
       // {
       //     lastSegment{ "agt", 3, 32},
       //     prev: NULL,
       //     totalLen: 5
       // }


// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.

char *sequence2string(struct Sequence *S) {
    char *Res=malloc(S->totalLen + 1);
    char *digits="0123456789";

    int p= S->totalLen;
    Res[p]=0;

    while (S!=NULL) {
            // first we insert the count of the last element.
            // We do digit by digit starting with the units.
            int C = S->lastSegment.count;
            while (C) {
                p--;
                Res[p] = digits[ C % 10 ];
                C /= 10;
            }

            p -= S->lastSegment.nElements;
            strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);

            S = S ->prev;
    }


    return Res;
}


// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
    int i,j;

    int N = strlen(in);;            // Number of elements of a in sequence.



    // B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
    // What we want to return is B[N];
    struct Sequence *B;
    B = malloc((N+1) * sizeof (struct Sequence));

    // We first do an initialization for i=0

    B[0].lastSegment.elements="";
    B[0].lastSegment.nElements=0;
    B[0].lastSegment.count=0;
    B[0].prev = NULL;
    B[0].totalLen=0;

    // and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth,  We will try different sequences and keep the minimum one.
    for (i=1; i<=N; i++) B[i].totalLen = INT_MAX;   // A very high value

    for (i=1; i<=N; i++) {

        // at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
        for (j=1; j<=i; j++) {

            // Here we will check all the candidates where the last segment has j elements

            int r=1;                  // number of times the last segment is repeated
            int rNDigits=1;           // Number of digits of r
            int rNDigitsBound=10;     // We will increment r, so this value is when r will have an extra digit.
                                      // when r = 0,1,...,9 => rNDigitsBound = 10
                                      // when r = 10,11,...,99 => rNDigitsBound = 100
                                      // when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.

            do {

                // Here we analitze a candidate B(i).
                // where the las segment has j elements repeated r times.

                int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
                if (CandidateLen < B[i].totalLen) {
                    B[i].lastSegment.elements = in + i - j*r;
                    B[i].lastSegment.nElements = j;
                    B[i].lastSegment.count = r;
                    B[i].prev = &(B[i-j*r]);
                    B[i].totalLen = CandidateLen;
                }

                r++;
                if (r == rNDigitsBound ) {
                    rNDigits++;
                    rNDigitsBound *= 10;
                }
            } while (   (i - j*r >= 0)
                     && (strncmp(in + i -j, in + i - j*r, j)==0));

        }
    }

    char *Res=sequence2string(&(B[N]));
    free(B);

    return Res;
}

int main(int argc, char** argv) {
    char *compressedDNA=dnaCompress(argv[1]);
    puts(compressedDNA);
    free(compressedDNA);
    return 0;
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...