{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
Preclass prep: Chapters 8 and 10 from “Think Python”
\n\n\nThis material uses examples from notebooks developed by Ben Langmead
\n
In Python, strings are sequences of characters enclosed in quotation marks (either single or double quotes). They can be assigned to a variable and manipulated using various string methods. For example:
\nstring1 = “Hello World!”\nstring2 = ‘Hello World!’
\nprint(string1) # Output: Hello World!\nprint(string2) # Output: Hello World!
\nYou can also use the + operator to concatenate strings, and the * operator to repeat a string a certain number of times:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "string3 = \"Hello \" + \"World!\"\n", "print(string3) # Output: Hello World!\n", "\n", "string4 = \"Hello \" * 3\n", "print(string4) # Output: Hello Hello Hello" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-2", "source": "Hello World!\n Hello Hello Hello
\nThere are also many built-in string methods such as upper()
, lower()
, replace()
, split()
, find()
, len()
, etc. These can be used to manipulate and extract information from strings:
HELLO WORLD!\n hello world!\n Jello World!\n [‘Hello’, ‘World!’]
\nYou can also use indexing, slicing and string formatting to access or manipulate substrings or insert dynamic data into a string.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "name = \"John\"\n", "age = 30\n", "\n", "print(\"My name is {} and I am {} years old.\".format(name, age))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-6", "source": "My name is John and I am 30 years old.
\nYou can also use f-strings, or formatted string literals, to embed expressions inside string literals.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "name = \"John\"\n", "age = 30\n", "\n", "print(f\"My name is {name} and I am {age} years old.\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-8", "source": "My name is John and I am 30 years old.
\nIn Python, strings are sequences of characters, and each character has a corresponding index, starting from 0.\nThis means that we can access individual characters in a string using their index. This is called “string indexing”.
\nHere is an example:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "string = \"Hello World!\"\n", "print(string[0]) # Output: H\n", "print(string[1]) # Output: e\n", "print(string[-1]) # Output: !" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-10", "source": "H\n e\n !
\nYou can also use negative indexing to access characters from the end of the string, with -1 being the last character, -2 the second to last and so on.
\nYou can also use slicing to extract a substring from a string. The syntax is string[start:end:step]
, where start
is the starting index, end
is the ending index (not included), and step is the number of characters to skip between each index.
Hello\n World\n HloWrd
\nYou can also use the string formatting method to get the string at the specific index.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "string = \"Hello World!\"\n", "index = 3\n", "print(f\"The character at index {index} is: {string[index]}\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-14", "source": "The character at index 3 is: l
\nIn Python, many built-in string methods can be used to manipulate and extract information from strings. Here are some of the most commonly used ones:
\nupper()
: Converts the string to uppercaselower()
: Converts the string to lowercasereplace(old, new)
: Replaces all occurrences of the old substring with the new substringsplit(separator)
: Splits the string into a list of substrings using the specified separatorfind(substring)
: Returns the index of the first occurrence of the substring, or -1 if the substring is not foundindex(substring)
: Returns the index of the first occurrence of the substring or raises a ValueError if the substring is not foundcount(substring)
: Returns the number of occurrences of the substringjoin(iterable)
: Concatenates the elements of an iterable (such as a list or tuple) with the string as the separatorstrip()
: Removes leading and trailing whitespaces from the stringlstrip()
: Removes leading whitespaces from the stringrstrip()
: Removes trailing whitespaces from the stringstartswith(substring)
: Returns True if the string starts with the specified substring, False otherwiseendswith(substring)
: Returns True if the string ends with the specified substring, False otherwiseisalpha()
: Returns True if the string contains only alphabetic characters, False otherwiseisdigit()
: Returns True if the string contains only digits, False otherwiseisalnum()
: Returns True if the string contains only alphanumeric characters, False otherwiseformat()
: Formats the string by replacing placeholders with specified valueslen()
: Returns the length of the stringHere is an example of some of these methods:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "string = \"Hello World!\"\n", "\n", "print(string.upper()) # Output: HELLO WORLD!\n", "print(string.lower()) # Output: hello world!\n", "print(string.replace(\"H\", \"J\")) # Output: Jello World!\n", "print(string.split(\" \")) # Output: ['Hello', 'World!']\n", "print(string.find(\"World\")) # Output: 6\n", "print(string.count(\"l\")) # Output: 3\n", "print(string.strip()) # Output: \"Hello World!\"\n", "print(string.startswith(\"Hello\")) # Output: True\n", "print(string.endswith(\"World!\")) # Output: True\n", "print(string.isalpha()) # Output: False\n", "print(string.isalnum()) # Output: False\n", "print(string.format()) # Output: Hello World!\n", "print(len(string)) # Output: 12" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-16", "source": "HELLO WORLD!\n hello world!\n Jello World!\n [‘Hello’, ‘World!’]\n 6\n 3\n Hello World!\n True\n True\n False\n False\n Hello World!\n 12
\n4
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "'' # empty string (epsilon)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-22", "source": "’’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "len('')" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-24", "source": "0
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "import random\n", "random.choice('ACGT') # generating a random nucleotide" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-26", "source": "‘G’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "# now I'll make a random nucleotide string by concatenating random nucleotides\n", "st = ''.join([random.choice('ACGT') for _ in range(40)])\n", "st" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-28", "source": "‘GCTATATCAATGTTATCCGTTTTCTGATGTCGCGAGGACA’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "st[1:3] # substring, starting at position 1 and extending up to but not including position 3\n", "# note that the first position is numbered 0" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-30", "source": "‘CT’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "st[0:3] # prefix of length 3" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-32", "source": "‘GCT’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "st[:3] # another way of getting the prefix of length 3" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-34", "source": "‘GCT’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "st[len(st)-3:len(st)] # suffix of length 3" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-36", "source": "‘ACA’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "st[-3:] # another way of getting the suffix of length 3" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-38", "source": "‘ACA’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "st1, st2 = 'CAT', 'ATAC'" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-40", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "st1" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-42", "source": "‘CAT’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-43", "source": [ "st2" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-44", "source": "‘ATAC’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "st1 + st2 # concatenation of 2 strings" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-46", "source": "‘CATATAC’
\nThis notebook explores [FASTQ], the most common format for storing sequencing reads.
\nFASTA and FASTQ are rather similar, but FASTQ is almost always used for storing sequencing reads (with associated quality values), whereas FASTA is used for storing all kinds of DNA,\nRNA or protein sequences (without associated quality values).
\nHere’s a single sequencing read in FASTQ format:
\n@ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1\nAGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATGAGAGCTCAGAAGTAACAGTTGCTTTCAGTCCCATAAAAACAGTCCTACAA\n+\nBDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGFEGFGFFDFFHBGDGFHGEFGHFGHGFGFFFEHGGFGGDGHGFEEHFFHGE\n
It’s spread across four lines. The four lines are:
\n@
” followed by a read name+
”, possibly followed by some info, but ignored by virtually all toolsDownload a sample small fastq file:
\n!wget https://zenodo.org/records/10602772/files/fastq_single_end_short.fq\n
Now we will use a very simple Python function to read this file and load fastq data into a list:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "def parse_fastq(fh):\n", " \"\"\" Parse reads from a FASTQ filehandle. For each read, we\n", " return a name, nucleotide-string, quality-string triple. \"\"\"\n", " reads = []\n", " while True:\n", " first_line = fh.readline()\n", " if len(first_line) == 0:\n", " break # end of file\n", " name = first_line[1:].rstrip()\n", " seq = fh.readline().rstrip()\n", " fh.readline() # ignore line starting with +\n", " qual = fh.readline().rstrip()\n", " reads.append((name, seq, qual))\n", " return reads\n", "\n", "with open('fastq_single_end_short.fq','r') as fq:\n", " reads = parse_fastq(fq)\n", "\n", "for read in reads:\n", " print(read)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-48", "source": "('ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1', 'AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG', 'BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF')\n('ERR294379.136275489 HS24_09441:8:2311:1917:99340#42/1', 'CTTAAGTATTTTGAAAGTTAACATAAGTTATTCTCAGAGAGACTGCTTTT', '@@AHFF?EEDEAF?FEEGEFD?GGFEFGECGE?9H?EEABFAG9@CDGGF')\n('ERR294379.97291341 HS24_09441:8:2201:10397:52549#42/1', 'GGCTGCCATCAGTGAGCAAGTAAGAATTTGCAGAAATTTATTAGCACACT', 'CDAF<FFDEHEFDDFEEFDGDFCHD=GHG<GEDHDGJFHEFFGEFEE@GH')\n
The nucleotide string can sometimes contain the character “N
”. N
essentially means “no confidence.” The sequencer knows there’s a nucleotide there but doesn’t know whether it’s an A, C, G or T.
\n\nComment: A note on while True\nIn Python, the while loop is used to repeatedly execute a block of code as long as a certain condition is true. The while True statement is a special case where the loop will run indefinitely until a break statement is encountered inside the loop.
\nHere is an example of a while True loop:
\n\nwhile True:\n user_input = input(\"Enter 'q' to quit: \")\n if user_input == 'q':\n break\n print(\"You entered:\", user_input)\n\nprint(\"Exited the loop\")\n
\nEnter 'q' to quit: q\nExited the loop\n
In this example, the loop will keep asking for user input until the user enters the ‘q’ character, which triggers the break statement, and the loop is exited.
\nIt is important to be careful when using while True loops, as they will run indefinitely if a break statement is not included. This can cause the program to crash or hang, if not handled properly.
\nAlso, It is recommended to use
\nwhile True
loop with abreak
statement, in case you want to execute the loop until some specific condition is met, otherwise, it’s not a good practice to usewhile True
.It’s a good practice to include a way for the user to exit the loop, such as the break statement in the example above, or a counter variable to keep track of the number of iterations.
\n
Read names often contain information about:
\nERR294379
(an SRA accession number) in the read names correspond to this study./1
you see at the end of the read names above indicates the read is the first end from a paired-end read.Quality values are probabilities. Each nucleotide in each sequencing read has an associated quality value. A nucleotide quality value encodes the probability that the nucleotide was incorrectly called by the sequencing instrument and its software. If the nucleotide is A
, the corresponding quality value encodes the probability that the nucleotide at that position is actually not an A
.
Quality values are encoded in two senses: first, the relevant probabilities are re-scaled using the Phread scale, which is a negative log scale.\nIn other words if p is the probability that the nucleotide was incorrectly called, we encode this as Q where Q = -10 * log10(p).
\nFor example, if Q = 30, then p = 0.001, a 1-in-1000 chance that the nucleotide is wrong. If Q = 20, then p = 0.01, a 1-in-100 chance. If Q = 10, then p = 0.1, a 1-in-10 chance. And so on.
\nSecond, scaled quality values are rounded to the nearest integer and encoded using ASCII printable characters. For example, using the Phred33 encoding (which is by far the most common), a Q of 30 is encoded as the ASCII character with code 33 + 30 = 63, which is “?
”. A Q of 20 is encoded as the ASCII character with code 33 + 20 = 53, which is “5
”. And so on.
Let’s define some relevant Python functions:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "def phred33_to_q(qual):\n", " \"\"\" Turn Phred+33 ASCII-encoded quality into Phred-scaled integer \"\"\"\n", " return ord(qual)-33\n", "\n", "def q_to_phred33(Q):\n", " \"\"\" Turn Phred-scaled integer into Phred+33 ASCII-encoded quality \"\"\"\n", " return chr(Q + 33)\n", "\n", "def q_to_p(Q):\n", " \"\"\" Turn Phred-scaled integer into error probability \"\"\"\n", " return 10.0 ** (-0.1 * Q)\n", "\n", "def p_to_q(p):\n", " \"\"\" Turn error probability into Phred-scaled integer \"\"\"\n", " import math\n", " return int(round(-10.0 * math.log10(p)))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-50", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "# Here are the examples I discussed above\n", "\n", "# Convert Qs into ps\n", "q_to_p(30), q_to_p(20), q_to_p(10)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-52", "source": "(0.001, 0.01, 0.1)
\np_to_q(0.00011) # note that the result is rounded\n
\n40\n
q_to_phred33(30), q_to_phred33(20)\n
\n('?', '5')\n
To convert an entire string Phred33-encoded quality values into the corresponding Q or p values, I can do the following:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "# Take the first read from the small example above\n", "name, seq, qual = parse_fastq(StringIO(fastq_string))[0]\n", "q_string = list(map(phred33_to_q, qual))\n", "p_string = list(map(q_to_p, q_string))\n", "print(q_string)\n", "print(p_string)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-54", "source": "[33, 35, 35, 36, 36, 37, 30, 37, 38, 37, 37, 37, 39, 38, 37, 37, 39, 39, 38, 39, 38, 38, 39, 34, 39, 31, 38, 39, 39, 39, 38, 37, 32, 39, 36, 38, 37, 36, 39, 38, 36, 37, 38, 39, 34, 34, 38, 38, 38, 37]\n [0.000501187233627272, 0.00031622776601683794, 0.00031622776601683794, 0.00025118864315095795, 0.00025118864315095795, 0.00019952623149688788, 0.001, 0.00019952623149688788, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.0001258925411794166, 0.00015848931924611126, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0001258925411794166, 0.0007943282347242813, 0.00015848931924611126, 0.0001258925411794166, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.000630957344480193, 0.0001258925411794166, 0.00025118864315095795, 0.00015848931924611126, 0.00019952623149688788, 0.00025118864315095795, 0.0001258925411794166, 0.00015848931924611126, 0.00025118864315095795, 0.00019952623149688788, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0003981071705534969, 0.00015848931924611126, 0.00015848931924611126, 0.00015848931924611126, 0.00019952623149688788]
\nYou might wonder how the sequencer and its software can know the probability that a nucleotide is incorrectly called. It can’t; this number is just an estimate. To describe exactly how it’s estimated is beyond the scope of this notebook; if you’re interested, search for academic papers with “base calling” in the title. Here’s a helpful video by Rafa Irizarry.
\nA final note: other ways of encoding quality values were proposed and used in the past. For example, Phred64 uses an ASCII offset of 64 instead of 33, and Solexa64 uses “odds” instead of the probability p. But Phred33 is by far the most common today and you will likely never have to worry about this.
\n\n\nComment: A note in map()\nIn Python, the
\nmap()
function is used to apply a given function to all elements of an iterable (such as a list, tuple, or string) and return an iterator (an object that can be iterated, e.g. in afor
-loop) that yields the results.The
\nmap()
function takes two arguments:A function that is to be applied to each element of the iterable\nAn iterable on which the function is to be applied
\nHere is an example:
\n\n# Using a function to square each element of a list\nnumbers = [1, 2, 3, 4, 5]\nsquared_numbers = map(lambda x: x**2, numbers)\nprint(list(squared_numbers)) # Output: [1, 4, 9, 16, 25]\n
\n[1, 4, 9, 16, 25]\n
In the example above, the
\nmap()
function applies the lambda functionlambda x: x**2
to each element of the numbers list, and returns an iterator of the squared numbers. Thelist()
function is used to convert the iterator to a list, so that the result can be printed.Another example is,
\n\n# Using the map() function to convert a list of strings to uppercase\nwords = [\"hello\", \"world\"]\nuppercase_words = map(lambda word: word.upper(), words)\nprint(list(uppercase_words)) # Output: ['HELLO', 'WORLD']\n
\n['HELLO', 'WORLD']\n
It’s important to note that the
\nmap()
function returns an iterator, which can be used in a for loop, but is not a list, tuple, or any other iterable. If you want to create a list, tuple, or other iterable from the result of themap()
function, you can use thelist()
,tuple()
, or any other built-in function that creates an iterable.In Python 3, the
\nmap()
function returns an iterator, which can be used in a for loop, but it’s not iterable. If you want to create a list, tuple, or other iterable from the result of themap()
function, you can use thelist()
,tuple()
, or any other built-in function that creates an iterable.In Python 2,
\nmap()
function returns a list, which can be used in a for loop, and it’s iterable.In python 3.x, there is an alternative way to use map() function is
\nlist(map(...))
ortuple(map(...))
etc.
Sequencing reads can come in pairs. Basically instead of reporting a single snippet of nucleotides from the genome, the sequencer might report a pair of\nsnippets that appear close to each other in the genome. To accomplish this, the sequencer sequences both ends of a longer fragment of DNA.
\nHere is simple Python code that mimics how the sequencer obtains one paired-end read:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "# Let's just make a random genome of length 1K\n", "import random\n", "random.seed(637485)\n", "genome = ''.join([random.choice('ACGT') for _ in range(1000)])\n", "genome" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-56", "source": "‘AGTACGTCATACCGTTATGATCTAGGTGGGATCGCGGATTGGTCGTGCAGAATACAGCCTTGGAGAGTGGTTAACACGATAAGGCCGATAATATGTCTGGATAAGCTCAGGCTCTGCTCCGAGGCGCTAAGGTACATGTTATTGATTTGGAGCTCAAAAATTGCCATAGCATGCAATACGCCCGTTGATAGACCACTTGCCTTCAGGGGAGCGTCGCATGTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAGGCACTTCGCTAGAAACGGCTTACTGCAGGTGATAAAATCAACAGATGGCACGCTCGCAACAGAAGCATAATATGCTTCCAACCAGGACCGGCATTTAACTCAATATATTAGCTCTCGAGGACAACGCACTACGTTTTCCAATTCAGCGGACTGGCGCCATTACAGTAAGTTGATTGTGCAGTGGTCTTTGACAGACAGCAGTTCGCTCCTTACTGACAATACCTGATACTTATAGTATGGCAGCGAGTCGTTGTCTAGGTTAGCCACCTCAGTCTACAGCAGGTAATGAAGCATTCCCACAAAGGCTGGTCCATACACCCGACTGCTACGATTCATGCTTCGCTCGAGAACTGCCCCTGCCTTAGATTCCCCCTCGTCTCCAATGAATACCCATTTTTTTAGATTGCTGAAAACCTTTCGTAAGACGCTTTCCAGTGATTACATGCCCTAACTGGGTACAGTTTGCCCAGGAGCTTTTTGGATGGAGGAGTATTAGTAGCGACCAAAACTCTTCCTCGACTGTTACTGTGTAGAGTCCCAAACGCTAAAGCGGTCCCAGAAAAACGGAACGGCCTACAGATTAAATTGCTCCGTGTTGCAGTTAAGGCGTACAAACCCCTCTGTGTATTAGTTTAAGTCTCTGAGTCTTCTTTGCTATGACGGATTGATGGGTGCCGGTTTGTAGTTCAAGAACCGTGAGTGAACC’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-57", "source": [ "# The sequencer draws a fragment from the genome of length, say, 250\n", "offset = random.randint(0, len(genome) - 250)\n", "fragment = genome[offset:offset+250]\n", "fragment" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-58", "source": "‘GTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAGGCACTTCGCTAGAAACGGCTTACTGCAGGTGATAAAATCAACAGATGGCACGCTCGCAACAGAAGCATAATATGCTTCCAACCAGGACCGGCATTTAACTCAATATATTAGCTCTCGAGGACAACGCACTACGTTTTCCAATTCAGCGGACTGGCGCCATTACAGTAAGTTGATT’
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "# Then it reads sequences from either end of the fragment\n", "end1, end2 = fragment[:75], fragment[-75:]\n", "end1, end2" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-60", "source": "(‘GTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAG’,\n ‘CAATATATTAGCTCTCGAGGACAACGCACTACGTTTTCCAATTCAGCGGACTGGCGCCATTACAGTAAGTTGATT’)
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "# And because of how Illumina sequencing works, the\n", "# second end is always from the opposite strand from the first\n", "# (this is not the case for 454 and SOLiD data)\n", "\n", "import string\n", "\n", "# function for reverse-complementing\n", "revcomp_trans = str.maketrans(\"ACGTacgt\", \"TGCAtgca\")\n", "def reverse_complement(s):\n", " return s[::-1].translate(revcomp_trans)\n", "\n", "end2 = reverse_complement(end2)\n", "end1, end2" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-62", "source": "(‘GTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAG’,\n ‘AATCAACTTACTGTAATGGCGCCAGTCCGCTGAATTGGAAAACGTAGTGCGTTGTCCTCGAGAGCTAATATATTG’)
\nFASTQ can be used to store paired-end reads. Say we have 1000 paired-end reads. We should store them in a pair of FASTQ files. The first FASTQ file (say, reads_1.fq
) would contain all of the first ends and the second FASTQ file (say, reads_2.fq
) would contain all of the second ends. In both files, the ends would appear in corresponding order. That is, the first entry in reads_1.fq
is paired with the first entry in reads_2.fq
and so on.
Here is a Python function that parses a pair of files containing paired-end reads.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "def parse_paired_fastq(fh1, fh2):\n", " \"\"\" Parse paired-end reads from a pair of FASTQ filehandles\n", " For each pair, we return a name, the nucleotide string\n", " for the first end, the quality string for the first end,\n", " the nucleotide string for the second end, and the\n", " quality string for the second end. \"\"\"\n", " reads = []\n", " while True:\n", " first_line_1, first_line_2 = fh1.readline(), fh2.readline()\n", " if len(first_line_1) == 0:\n", " break # end of file\n", " name_1, name_2 = first_line_1[1:].rstrip(), first_line_2[1:].rstrip()\n", " seq_1, seq_2 = fh1.readline().rstrip(), fh2.readline().rstrip()\n", " fh1.readline() # ignore line starting with +\n", " fh2.readline() # ignore line starting with +\n", " qual_1, qual_2 = fh1.readline().rstrip(), fh2.readline().rstrip()\n", " reads.append(((name_1, seq_1, qual_1), (name_2, seq_2, qual_2)))\n", " return reads\n", "\n", "fastq_string1 = '''@509.6.64.20524.149722/1\n", "AGCTCTGGTGACCCATGGGCAGCTGCTAGGGAGCCTTCTCTCCACCCTGA\n", "+\n", "HHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIHHIHFHHF\n", "@509.4.62.19231.2763/1\n", "GTTGATAAGCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTATTC\n", "+\n", "HHHHHHHHHHHHHHEHHHHHHHHHHHHHHHHHHHHHHHDHHHHHHGHGHH'''\n", "\n", "fastq_string2 = '''@509.6.64.20524.149722/2\n", "TAAGTCAGGATACTTTCCCATATCCCAGCCCTGCTCCNTCTTTAAATAAT\n", "+\n", "HHHHHHHHHHHHHHHHHHHH@HHFHHHEFHHHHHHFF#FFFFFFFHHHHH\n", "@509.4.62.19231.2763/2\n", "CTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAANAAGAGCTTACTA\n", "+\n", "HHHHHHHHHHHHHHHHHHEHEHHHFHGHHHHHHHH>@#@=44465HHHHH'''\n", "\n", "parse_paired_fastq(StringIO(fastq_string1), StringIO(fastq_string2))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-64", "source": "[((‘509.6.64.20524.149722/1’,\n ‘AGCTCTGGTGACCCATGGGCAGCTGCTAGGGAGCCTTCTCTCCACCCTGA’,\n ‘HHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIHHIHFHHF’),\n (‘509.6.64.20524.149722/2’,\n ‘TAAGTCAGGATACTTTCCCATATCCCAGCCCTGCTCCNTCTTTAAATAAT’,\n ‘HHHHHHHHHHHHHHHHHHHH@HHFHHHEFHHHHHHFF#FFFFFFFHHHHH’)),\n ((‘509.4.62.19231.2763/1’,\n ‘GTTGATAAGCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTATTC’,\n ‘HHHHHHHHHHHHHHEHHHHHHHHHHHHHHHHHHHHHHHDHHHHHHGHGHH’),\n (‘509.4.62.19231.2763/2’,\n ‘CTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAANAAGAGCTTACTA’,\n ‘HHHHHHHHHHHHHHHHHHEHEHHHFHGHHHHHHHH>@#@=44465HHHHH’))]
\n\n\nComment: A note on triple quotes\nIn Python, triple quotes (either single or double) are used to create multi-line strings. They can also be used to create doc-strings, which are used to document a function, class, or module.
\nFor example:
\n\nstring1 = \"\"\"This is a\nmultiline string\"\"\"\n\nstring2 = '''This is also\na multiline string'''\n\nprint(string1)\n# Output:\n# This is a\n# multiline string\n\nprint(string2)\n# Output:\n# This is also\n# a multiline string\n
This is a\n multiline string\n This is also\n a multiline string
\nTriple quotes can also be used to create docstrings, which are used to document a function, class, or module. The first line of a docstring is a brief summary of what the function, class, or module does, and the following lines provide more detailed information.
\n\ndef my_function():\n \"\"\"\n This is a docstring for the my_function.\n This function does not perform any operation.\n \"\"\"\n pass\n\nprint(my_function.__doc__)\n# Output:\n# This is a docstring for the my_function.\n# This function does not perform any operation.\n
\nThis is a docstring for the my_function.\n This function does not perform any operation.\n
Let’s reuse the parse_fastq
function from above:
☝️ This will load data into a list called reads
containing individual reads (name, nucleotides, and quality values) represented as a tuple.
☝️ This will take quality string only for each read from reads
and convert it into a list of values. The end result, a list of lists called run_qualities
, will contain as many elements as there as reads, where each element (also a list) contains all quality values for a read. We need to transpose this list, so that instead of representing every reads as a list of quality values we represent every position as a list of quality values for all reads at that position. This can be easily done with numpy’s T
:
Now we need to extract some key per-position statistics from the base_qualities
list generated above. We, again, use numpy for this:
☝️ The result of this, a dictionary called plotting_data
, contains statistics types as keys and corresponding values as lists. Now we load these into Pandas:
And plot:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-75", "source": [ "# Plot!\n", "import altair as alt\n", "\n", "base = alt.Chart(plotting_data_df).encode(\n", " alt.X('base:Q', title=\"Position in the read\")\n", ").properties(\n", " width=800,\n", " height=200)\n", "\n", "median = base.mark_tick(color='red',orient='horizontal').encode(\n", " alt.Y('median:Q', title=\"Phred quality score\"),\n", ")\n", "\n", "q = base.mark_rule(color='green',opacity=.5,strokeWidth=10).encode(\n", " alt.Y('q25:Q'),\n", " alt.Y2('q75:Q')\n", ")\n", "\n", "min_max = base.mark_rule(color='black').encode(\n", " alt.Y('min:Q'),\n", " alt.Y2('max:Q')\n", ")\n", "\n", "median + q + min_max" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-76", "source": "Below is the same thing as just one code blob:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-77", "source": [ "# Read the file into a list\n", "with open('fastq_single_end_short.fq','r') as fq:\n", " reads = parse_fastq(fq)\n", "\n", "# Extract qualities and convert them into numerical values\n", "run_qualities = []\n", "for read in reads:\n", " read_qualities = []\n", " for quality in read[2]:\n", " read_qualities.append(phred33_to_q(quality))\n", " run_qualities.append(read_qualities)\n", "\n", "# Transpose the quality values matrix\n", "\n", "import numpy as np\n", "base_qualities = np.array(run_qualities).T\n", "\n", "# Prep data for plotting\n", "plotting_data = {\n", " 'base':[],\n", " 'mean':[],\n", " 'median':[],\n", " 'q25':[],\n", " 'q75':[],\n", " 'min':[],\n", " 'max':[]\n", "}\n", "for i,base in enumerate(base_qualities):\n", " plotting_data['base'].append(i)\n", " plotting_data['mean'].append(np.mean(base))\n", " plotting_data['median'].append(np.median(base))\n", " plotting_data['q25'].append(np.quantile(base,.25))\n", " plotting_data['q75'].append(np.quantile(base,.75))\n", " plotting_data['min'].append(np.min(base))\n", " plotting_data['max'].append(np.max(base))\n", "\n", "# Load into Pandas\n", "import pandas as pd\n", "plotting_data_df = pd.DataFrame(plotting_data)\n", "\n", "# Plot!\n", "import altair as alt\n", "\n", "base = alt.Chart(plotting_data_df).encode(\n", " alt.X('base:Q', title=\"Position in the read\")\n", ").properties(\n", " width=800,\n", " height=200)\n", "\n", "median = base.mark_tick(color='red',orient='horizontal').encode(\n", " alt.Y('median:Q', title=\"Phred quality score\"),\n", ")\n", "\n", "q = base.mark_rule(color='green',opacity=.5,strokeWidth=10).encode(\n", " alt.Y('q25:Q'),\n", " alt.Y2('q75:Q')\n", ")\n", "\n", "min_max = base.mark_rule(color='black').encode(\n", " alt.Y('min:Q'),\n", " alt.Y2('max:Q')\n", ")\n", "\n", "median + q + min_max" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from \"Think Python\"" ], "id": "" } } }, { "id": "cell-78", "source": "In all the examples above, the reads in the FASTQ file are all the same length. This is not necessarily the case though it is usually true for datasets generated by sequencing-by-synthesis instruments. FASTQ files can contain reads of various lengths.
\nFASTQ files often have the extension .fastq
or .fq
.