This problem gives you practice with string methods and file input using an example from biology. The problem is modeled after “Finding a Motif in DNA,” one of the problems on the Rosalind site (http://rosalind.info). Solving problems on this site is a great way to practice using Python to solve problems that are frequently encountered in bioinformatics.
Finding the same interval of DNA in the genomes of two different organisms (often taken from different species) is highly suggestive that the interval has the same function in both organisms.
We define a motif as such a commonly shared interval of DNA. A common task in molecular biology is to search an organism's genome for a known motif.
Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).
The position of a symbol in a string is determined by its distance from the initial symbol of the string, which is given the position “0” (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 1, 4, 5, 14, 16, and 17). The symbol at position i of s is denoted by s[i].
A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[1:5] = "UGCU". (NOTE: the ending position is NOT included in the substring.)
The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s. Occurrences of a motif are allowed to overlap with each other (see the sample dataset and sample output below for an example).
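The position and slicing rules above can be checked directly in Python. The short sketch below reuses the strings from the text; the variable names (u_positions, dna) are chosen only for illustration.

```python
s = "AUGCUUCAGAAAGGUCUUACG"

# Zero-based positions of every 'U' in s.
u_positions = [i for i, symbol in enumerate(s) if symbol == "U"]
print(u_positions)  # [1, 4, 5, 14, 16, 17]

# s[j:k] starts at position j and stops just before position k.
print(s[1:5])  # UGCU

# Overlapping occurrences: "ATAT" starts at position 1 and again at position 3,
# so the two substrings share the symbols at positions 3 and 4.
dna = "GATATATGCATATACTT"
print(dna[1:5], dna[3:7])  # ATAT ATAT
```

Because the ending position is excluded, the length of s[j:k] is always k - j.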

Given: Two DNA strings s and t (each of length at most 1000 symbols). These strings are provided in the data file “motifFinding.txt”. Your program needs to read the data from the data file and assign the first string and the second string to different variables.

HINT: the following commands may be useful for this problem:

my_file = open(r"test.txt", "r")
my_data = my_file.read().split()
Return: All locations of t as a substring of s.
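One way the file-reading hint could be put to work is sketched below. It assumes the data file holds the two strings separated by whitespace (a space or a line break); read_two_strings and the throwaway file demo.txt are hypothetical names used only for this demonstration, not part of the assignment.

```python
import os
import tempfile


def read_two_strings(filename):
    """Return the first two whitespace-separated strings found in a file."""
    with open(filename, "r") as my_file:   # the file is closed automatically
        my_data = my_file.read().split()   # split on spaces and line breaks
    s, t = my_data[0], my_data[1]
    return s, t


# Demonstration with a throwaway file that holds the sample dataset.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.txt")
    with open(demo_path, "w") as demo_file:
        demo_file.write("GATATATGCATATACTT\nATAT\n")
    s, t = read_two_strings(demo_path)

print(s)
print(t)
```

Using a with block instead of a bare open() call is a design choice rather than a requirement: it guarantees the file is closed even if reading fails.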

Sample Dataset
GATATATGCATATACTT
ATAT
Sample Output

1 3 9
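The sample output can be reproduced with the string method find, which accepts an optional starting position. The sketch below is one possible approach, not the only one; find_locations is a hypothetical helper name. It uses the sample strings s = GATATATGCATATACTT and t = ATAT and reports zero-based locations, as this assignment requires.

```python
def find_locations(s, t):
    """Return every zero-based position at which t occurs in s, overlaps included."""
    locations = []
    start = s.find(t)                  # find returns -1 when t is not present
    while start != -1:
        locations.append(start)
        # Resume the search one symbol to the right of the last match
        # so that overlapping occurrences are not skipped.
        start = s.find(t, start + 1)
    return locations


positions = find_locations("GATATATGCATATACTT", "ATAT")
print(" ".join(str(p) for p in positions))  # 1 3 9
```

Note that s.count(t) would not be suitable here, because it counts only non-overlapping occurrences and does not report where they are.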
Submit your program for Problem 3 as a .py file. Include your solution to the problem as a commented line at the end of your program, using the output format shown in the sample output example.

motifFinding.txt:

CTCATGGTTTTCATGGTTCATGGTAGTTCGCCACGATCTGACTGTCATGGTTCATGGTTCATGGTGTCATGGTGAGTCAAGTCATGGTCCCCTCATGGTATCATGGTAAAAAATAAAGCGATGATCATGGTGTCATGGTGTCATGGTGATCATGGTTCATGGTCTCTGTCATGGTGCGGTCATGGTGTGCCATGCTTTCATGGTTCATGGTATCATGGTTCATGGTTCATGGTACTGTCATCATCATGGTCAGTCATGGTTCATGGTTCTCATGGTCGATCATGGTTCATGGTTTTGAGATCATGGTTTCATGGTGTAGTCATGGTCTGCTCATGGTTCATGGTTGTTTCATGGTAAATTCATGGTTTCATGGTTCATGGTGCAGCATCATGGTACGTCATGGTTGGTCATGGTATGTATCATGGTTACGATCATGGTGTTAACTTTCATGGTCTCATGGTGTTGCAGGGCATGTCTCTCTTATTGGCTTCATGGTATCATGGTTTATCATGGTATCCTCCTCATGGTAGTTCATGGTCATCATGGTACCATCATGGTCGGATCATGGTTTCATGGTTCATGGTTCATCATGGTTCATGGTCTTTATCATGGTTCATGGTGTTTCATGGTTGTCATGGTTTCATGGTCATGTCATGGTATATCATGGTGGGCTCATGGTTCATGGTCTCATGGTATCATGGTATCATGGTCGAGTCATGGTCTTCATGGTTTTTAATCATGGTGATCATGGTTCATGGTGCTAAAGTTCATGGTACGTCATGGTTCATGGTTCATGGTTTGGCACGATCATGGTCTAAATCATGGTATTCATGGTTCATGGTTC

Solution

import gzip
import sys
from itertools import islice
from os import path


def max_gc_content(fasta):
    r"""
    Computing GC Content
    http://rosalind.info/problems/gc/

    Identifying Unknown DNA Quickly

    A quick method used by early computer software to determine the language
    of a given piece of text was to analyze the frequency with which each
    letter appeared in the text. This strategy was used because each language
    tends to exhibit its own letter frequencies, and as long as the text under
    consideration is long enough, software will correctly recognize the
    language quickly and with a very low error rate. See Figure 1 for a table
    compiling English letter frequencies.

    You may ask: what in the world does this linguistic problem have to do
    with biology? Although two members of the same species will have different
    genomes, they still share the vast percentage of their DNA; notably, 99.9%
    of the 3.2 billion base pairs in a human genome are common to almost all
    humans (i.e., excluding people having major genetic defects). For this
    reason, biologists will speak of the human genome, meaning an average-case
    genome derived from a collection of individuals. Such an average case
    genome can be assembled for any species, a challenge that we will soon
    discuss.

    The biological analog of identifying unknown text arises when researchers
    encounter a molecule of DNA deriving from an unknown species. Because of
    the base pairing relations of the two DNA strands, cytosine and guanine
    will always appear in equal amounts in a double-stranded DNA molecule.
    Thus, to analyze the symbol frequencies of DNA for comparison against a
    database, we compute the molecule's GC-content, or the percentage of its
    bases that are either cytosine or guanine.

    In practice, the GC-content of most eukaryotic genomes hovers around 50%.
    However, because genomes are so long, we may be able to distinguish
    species based off very small discrepancies in GC-content; furthermore,
    most prokaryotes have a GC-content significantly higher than 50%, so that
    GC-content can be used to quickly differentiate many prokaryotes and
    eukaryotes by using relatively small DNA samples.

    Problem

    The GC-content of a DNA string is given by the percentage of symbols in
    the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG"
    is 37.5%. Note that the reverse complement of any DNA string has the same
    GC-content.

    DNA strings must be labeled when they are consolidated into a database. A
    commonly used method of string labeling is called FASTA format. In this
    format, the string is introduced by a line that begins with '>', followed
    by some labeling information. Subsequent lines contain the string itself;
    the first line to begin with '>' indicates the label of the next string.

    In Rosalind's implementation, a string in FASTA format will be labeled by
    the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between
    0000 and 9999.

    Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp
    each).

    Return: The ID of the string having the highest GC-content, followed by
    the GC-content of that string. Rosalind allows for a default error of
    0.001 in all decimal answers unless otherwise stated; please see the note
    on absolute error below.

    Sample Dataset
    >Rosalind_6404
    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
    TCCCACTAATAATTCTGAGG
    >Rosalind_5959
    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
    ATATCCATTTGTCAGCAGACACGC
    >Rosalind_0808
    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
    TGGGAACCTGCGGGCAGTAGGTGGAAT

    Sample Output
    Rosalind_0808
    60.919540%

    Note on Absolute Error

    We say that a number x is within an absolute error of y to a correct
    solution if x is within y of the correct solution. For example, if an
    exact solution is 6.157892, then for x to be within an absolute error of
    0.001, we must have that |x - 6.157892| < 0.001, or
    6.156892 < x < 6.158892.

    >>> s = ('>Rosalind_6404\n'
    ...      'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\n'
    ...      'TCCCACTAATAATTCTGAGG\n'
    ...      '>Rosalind_5959\n'
    ...      'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT\n'
    ...      'ATATCCATTTGTCAGCAGACACGC\n'
    ...      '>Rosalind_0808\n'
    ...      'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC\n'
    ...      'TGGGAACCTGCGGGCAGTAGGTGGAAT\n')
    >>> max_gc_content(s)
    Rosalind_0808
    60.919540%
    """
    # GC percentage of every record, paired with the record name.
    gc = [(seq_name, (seq.count("G") + seq.count("C")) * 100.0 / len(seq))
          for seq_name, seq in fasta_iter(fasta)]
    seq_name, value = max(gc, key=lambda ln: ln[1])
    result = "%s\n%.6f%%\n" % (seq_name, value)
    sys.stdout.write(result)
    sys.stdout.flush()


def fasta_iter(fa, buffsize=100000):
    r"""
    Iterate over a FASTA file, file-like object, or string.

    input: fa could be a file object, a filename or a string of fasta records
    return: record of fasta: ("sequence_name", "sequence")

    >>> s = ('>Rosalind_6404\n'
    ...      'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\n'
    ...      'TCCCACTAATAATTCTGAGG\n'
    ...      '>Rosalind_5959\n'
    ...      'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT\n'
    ...      'ATATCCATTTGTCAGCAGACACGC\n'
    ...      '>Rosalind_0808\n'
    ...      'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC\n'
    ...      'TGGGAACCTGCGGGCAGTAGGTGGAAT\n')
    >>> list(fasta_iter(s))
    [('Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'), ('Rosalind_5959', 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC'), ('Rosalind_0808', 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT')]
    """
    basestring_type = basestring if sys.version_info[0] == 2 else str
    if isinstance(fa, basestring_type):  # if fa is filename or fasta string
        if path.exists(fa):  # fa is a file on disk
            if fa.endswith(".gz"):
                fobj = gzip.open(fa, 'rb')
            else:
                fobj = open(fa)
        elif fa == '-':  # fa is standard input
            fobj = sys.stdin
            buffsize = 10000  # send output earlier if the input is sys.stdin
        elif fa[0] == ">" and "\n" in fa:  # fa is a fasta string
            fobj = iter(fa.splitlines())
        else:
            raise Exception("Don't recognize the format. Valid formats include: string, list/tuple of string, file-like object.")
    # if (fa is list or tuple of string), or (fa is file-like object)
    # in both cases, we can use "for line in fa" to iter over fa
    elif (hasattr(fa, '__getitem__') and callable(fa.__getitem__)) or (hasattr(fa, 'readline') and callable(fa.readline)):
        fobj = fa
    else:
        raise Exception("Don't recognize the format. Valid formats include: string, list/tuple of string, file-like object.")
    chunk = []
    while True:
        new_data = list(islice(fobj, buffsize))
        if not new_data:
            break
        chunk.extend(new_data)
        # Indices of the header lines currently held in the buffer.
        idx = [i for i, ln in enumerate(chunk) if ln[0] == '>']
        for i, j in enumerate(idx[:-1]):
            yield (chunk[j][1:].rstrip('\n'), "".join(chunk[j+1:idx[i+1]]).replace("\n", ""))
        # Keep the last, possibly incomplete, record for the next round.
        chunk = chunk[idx[-1]:]
    if chunk:
        yield (chunk[0][1:].rstrip('\n'), "".join(chunk[1:]).replace("\n", ""))

#-------------------------------------------------------------------------------

def count_point_mutation(s, t=None):
    """
    Counting Point Mutations
    http://rosalind.info/problems/hamm/

    Problem

    Given two strings s and t of equal length, the Hamming distance between s
    and t, denoted dH(s,t), is the number of corresponding symbols that differ
    in s and t. See Figure 2.

    Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

    Return: The Hamming distance dH(s,t).

    Sample Dataset
    GAGCCTACTAACGGGAT
    CATCGTAATGACGGCCT

    Sample Output
    7

    >>> count_point_mutation('GAGCCTACTAACGGGAT', 'CATCGTAATGACGGCCT')
    7
    """
    if t is None:
        s, t = s.split()
    if len(s) != len(t):
        raise Exception("two strings should be of the same length!")
    return sum(1 for i in range(len(s)) if s[i] != t[i])

#-------------------------------------------------------------------------------

def find_motif(seq, motif):
    """
    Finding a Motif in DNA
    http://rosalind.info/problems/subs/

    Combing Through the Haystack

    Finding the same interval of DNA in the genomes of two different organisms
    (often taken from different species) is highly suggestive that the
    interval has the same function in both organisms.

    We define a motif as such a commonly shared interval of DNA. A common task
    in molecular biology is to search an organism's genome for a known motif.

    The situation is complicated by the fact that genomes are riddled with
    intervals of DNA that occur multiple times (possibly with slight
    modifications), called repeats. These repeats occur far more often than
    would be dictated by random chance, indicating that genomes are anything
    but random and in fact illustrate that the language of DNA must be very
    powerful (compare with the frequent reuse of common words in any human
    language).

    The most common repeat in humans is the Alu repeat, which is approximately
    300 bp long and recurs around a million times throughout every human
    genome. However, Alu has not been found to serve a positive purpose, and
    appears in fact to be parasitic: when a new Alu repeat is inserted into a
    genome, it frequently causes genetic disorders.

    Problem

    Given two strings s and t, t is a substring of s if t is contained as a
    contiguous collection of symbols in s (as a result, t must be no longer
    than s).

    The position of a symbol in a string is the total number of symbols found
    to its left, including itself (e.g., the positions of all occurrences of
    'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at
    position i of s is denoted by s[i].

    A substring of s can be represented as s[j:k], where j and k represent the
    starting and ending positions of the substring in s; for example, if
    s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

    The location of a substring s[j:k] is its beginning position j; note that
    t will have multiple locations in s if it occurs more than once as a
    substring of s (see the Sample below).

    Given: Two DNA strings s and t (each of length at most 1 kbp).

    Return: All locations of t as a substring of s.

    Sample Dataset
    GATATATGCATATACTT
    ATAT

    Sample Output
    2 4 10

    >>> find_motif('GATATATGCATATACTT', 'ATAT')
    [2, 4, 10]
    """
    import re
    # The lookahead (?=...) matches without consuming symbols, so overlapping
    # occurrences are found; +1 converts to Rosalind's 1-based positions.
    return [match.start()+1 for match in re.finditer(r'(?=%s)'%re.escape(motif), seq)]

#-------------------------------------------------------------------------------

def cons(s):
    """
    Consensus and Profile
    http://rosalind.info/problems/cons/

    Finding a Most Likely Common Ancestor

    In “Counting Point Mutations”, we calculated the minimum number of symbol
    mismatches between two strings of equal length to model the problem of
    finding the minimum number of point mutations occurring on the
    evolutionary path between two homologous strands of DNA. If we instead
    have several homologous strands that we wish to analyze simultaneously,
    then the natural problem is to find an average-case strand to represent
    the most likely common ancestor of the given strands.

    Problem

    A matrix is a rectangular table of values divided into rows and columns.
    An m×n matrix has m rows and n columns. Given a matrix A, we write A[i,j]
    to indicate the value found at the intersection of row i and column j.

    Say that we have a collection of DNA strings, all having the same length
    n. Their profile matrix is a 4×n matrix P in which P[1,j] represents the
    number of times that 'A' occurs in the jth position of one of the strings,
    P[2,j] represents the number of times that 'C' occurs in the jth position,
    and so on (see below).

    A consensus string c is a string of length n formed from our collection by
    taking the most common symbol at each position; the jth symbol of c
    therefore corresponds to the symbol having the maximum value in the j-th
    column of the profile matrix. Of course, there may be more than one most
    common symbol, leading to multiple possible consensus strings.

                     A T C C A G C T
                     G G G C A A C T
                     A T G G A T C T
    DNA Strings      A A G C A A C C
                     T T G G A A C T
                     A T G C C A T T
                     A T G G C A C T

                 A   5 1 0 0 5 5 0 0
    Profile      C   0 0 1 4 2 0 6 1
                 G   1 1 6 3 0 1 0 0
                 T   1 5 0 0 0 1 1 6

    Consensus        A T G C A A C T

    Given: A collection of at most 10 DNA strings of equal length (at most
    1 kbp).

    Return: A consensus string and profile matrix for the collection. (If
    several possible consensus strings exist, then you may return any one of
    them.)

    Sample Dataset