{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Introduction to sequencing with Python (part one)\n\nby [Anton Nekrutenko](https://training.galaxyproject.org/hall-of-fame/nekrut/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Questions**\n\n- What are the origins of Sanger sequencing?\n- How did sequencing machines evolve?\n- How can we simulate Sanger sequencing with Python?\n\n**Objectives**\n\n- Have a basic understanding of the history of sequencing\n- Understand Python basics\n\n**Time Estimation: 1h**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "

\n

The problem

\n

The difficulty with sequencing nucleic acids is nicely summarized by Hutchison 2007:

\n
  1. The chemical properties of different DNA molecules were so similar that it appeared difficult to separate them.
  2. The chain length of naturally occurring DNA molecules was much greater than for proteins and made complete sequencing seem unapproachable.
  3. The 20 amino acid residues found in proteins have widely varying properties that had proven useful in the separation of peptides. The existence of only four bases in DNA therefore seemed to make sequencing a more difficult problem for DNA than for protein.
  4. No base-specific DNAases were known. Protein sequencing had depended upon proteases that cleave adjacent to certain amino acids.
\n

It is therefore not surprising that protein sequencing (Sanger and Tuppy 1951) was developed before DNA sequencing.

\n

tRNA was the first complete nucleic acid sequenced (see pioneering work of Robert Holley and colleagues and also Holley’s Nobel Lecture). Conceptually, Holley’s approach was similar to Sanger’s protein sequencing: break the molecule into small pieces with RNases, determine the sequences of the small fragments, and use overlaps between fragments to reconstruct (assemble) the final nucleotide sequence.

\n

The work on finding approaches to sequencing DNA molecules began in the late 60s and early 70s. One of the earliest contributions was made by Ray Wu (Cornell) and Dave Kaiser (Stanford), who used E. coli DNA polymerase to incorporate radioactively labelled nucleotides into the protruding ends of bacteriophage lambda DNA. It took several more years for Sanger and Maxam/Gilbert to develop more “high throughput” technologies. The Sanger technique ultimately won over Maxam/Gilbert’s protocol due to its relative simplicity (once dideoxynucleotides had become commercially available) and the fact that it required a smaller amount of starting material, as the polymerase was used to generate the fragments necessary for sequence determination.

\n

Sanger/Coulson plus/minus method

\n
\n
Comment: Two Nobel prizes
\n

Fred Sanger is one of only four people who have received two Nobel Prizes in their original form (for scientific, not societal, breakthroughs).

\n
\n

This method builds on the idea of Wu and Kaiser (for the minus part) and on a special property of DNA polymerase isolated from phage T4 (for the plus part). A schematic of the method is given in the following figure (from Sanger & Coulson: 1975):

\n

Figure 1: Plus/minus method (from Sanger & Coulson 1975)
\n

In this method, a primer and DNA polymerase are used to synthesize DNA in the presence of P32-labeled nucleotides (only one of the four is labeled). This generates P32-labeled copies of the DNA being sequenced. These are then purified and (without denaturing) separated into two groups: minus and plus. Each group is further divided into four equal parts.

\n

In the case of minus, polymerase and a mix of three nucleotides (all four minus one) are added to each of the four aliquots: ACG (-T), ACT (-G), CGT (-A), AGT (-C). As a result, in each case the DNA strand is extended up to the position where the missing nucleotide is required.

\n

In the case of plus, only one nucleotide is added to each of the four aliquots (+A, +C, +G, and +T) and T4 DNA polymerase is used. T4 DNA polymerase acts as an exonuclease that degrades DNA from the 3’-end up to the nucleotide that is supplied in the reaction.
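
The two reactions can be mimicked with a few lines of Python. This is only a toy sketch under simplifying assumptions: a strand is written as the sequence being synthesized (the convention used later in this lesson), and the function names `minus_reaction` and `plus_reaction`, as well as the example sequence, are made up for illustration.

```python
# Toy model of the plus/minus reactions (an illustration, not the real chemistry).

def minus_reaction(full_strand, start_length, missing):
    # Extend a strand of start_length bases until the missing nucleotide is required.
    end = start_length
    while end < len(full_strand) and full_strand[end] != missing:
        end += 1
    return full_strand[:end]

def plus_reaction(strand, supplied):
    # Degrade from the 3' end until the last base is the supplied nucleotide.
    end = len(strand)
    while end > 0 and strand[end - 1] != supplied:
        end -= 1
    return strand[:end]

full_strand = 'CTTGCGGCTATAGG'
print(minus_reaction(full_strand, 4, 'A'))  # CTTGCGGCT (stops just before the first A)
print(plus_reaction(full_strand, 'A'))      # CTTGCGGCTATA (now ends with an A)
```

In this simplified picture a minus (-A) product ends immediately before an A, while a plus (+A) product ends with an A, so the two sets of tracks complement each other on the gel.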

\n

The products of these are loaded into a denaturing polyacrylamide gel as eight tracks (four for minus and four for plus; from Sanger & Coulson: 1975):

\n

Figure 2: Plus/minus method gel radiograph (from Sanger & Coulson 1975)
\n

Maxam/Gilbert chemical cleavage method

\n

In this method DNA is terminally labeled with P32 and separated into four equal aliquots. Two of these are treated with dimethyl sulfate (DMS) and the remaining two are treated with hydrazine.

\n

DMS methylates G and A residues. Treatment of DMS-incubated DNA with alkali at high temperature will break DNA chains at G and A, with Gs being preferentially broken, while treatment of DMS-incubated DNA with acid will preferentially break DNA at As. Likewise, treating hydrazine-incubated DNA with piperidine breaks DNA at C and T, while DNA treated with hydrazine in the presence of NaCl preferentially breaks at Cs. The four reactions are then loaded on a gel generating the following picture (from Maxam & Gilbert: 1977):

\n

Figure 3: Radiograph of Maxam/Gilbert gel (from Maxam & Gilbert 1977)
\n

Sanger dideoxy method

\n

The original Sanger +/- method was not popular and had a number of technical limitations. In a new approach, Sanger took advantage of inhibitors that stop the extension of a DNA strand at particular nucleotides.\nThese inhibitors are dideoxy analogs of normal nucleotide triphosphates (from Sanger et al. 1977):

\n

Figure 4: Sanger ddNTP gel (from Sanger et al. 1977)
\n

Original approaches were laborious

\n

In the original Sanger paper the authors sequenced bacteriophage phiX174 by using its own restriction fragments as primers. This was an ideal setup for a proof of principle of the new method, because phiX174 DNA is homogeneous and can be isolated in large quantities. Now suppose that you would like to sequence a larger genome (say E. coli). Remember that the original version of the Sanger method can only sequence fragments up to 200 nucleotides at a time. So to sequence the entire E. coli genome (which, by the way, was not sequenced until 1997) you would need to split the genome into multiple pieces and sequence each of them individually. This is hard because to produce a readable Sanger sequencing gel each sequence must be amplified to a suitable amount (around 1 nanogram) and be homogeneous (you cannot mix multiple DNA fragments in a single reaction as it will be impossible to interpret the gel). Molecular cloning, enabled by commercially available restriction enzymes and cloning vectors, simplified this process. Until the onset of next generation sequencing in 2005 the process for sequencing looked something like this:

\n\n

Figure 5: Map of pGEM-3Z cloning vector (from Promega Inc.)
\n

Until the invention of NGS the above protocol was followed with some degree of automation. But you can see that it was quite laborious if a large number of fragments needed to be sequenced, because each of them needed to be subcloned and handled separately. This is in part why the Human Genome Project, which will be discussed in detail in future lectures, took so much time to complete.

\n

Evolution of sequencing machines

\n

The simplest possible sequencing machine is a gel rig with a polyacrylamide gel. Sanger used it in his protocol, obtaining the following results (from Sanger et al. 1977):

\n

Figure 6: Typical polyacrylamide gel separating DNA fragments generated with Sanger's dideoxy method (from Sanger et al. 1977)
\n

Here, sequencing each fragment requires four separate reactions (with ddA, ddT, ddC, and ddG) and four lanes on the gel. One simplification of this process that came in the 90s was to use fluorescently labeled dideoxy nucleotides. This is easier because everything can be performed in a single tube and uses a single lane on a gel (from Applied Biosystems support site):

\n

Figure 7: Comparison between a chromatogram and a polyacrylamide gel readouts (from Applied Biosystems)
\n

However, there is still substantial labor involved in pouring the gels, loading them, running machines, and cleaning everything post-run. A significant improvement was offered by the development of capillary electrophoresis, which allowed automation of liquid handling and sample loading. Although several manufacturers have been developing and selling such machines, a de facto standard in this area was (and still is) the Applied Biosystems (ABI) Genetic Analyzer and DNA Analyzer systems. The highest throughput ABI system, 3730xl, had 96 capillaries and could automatically process 384 samples.

\n

NGS!

\n

384 samples may sound like a lot, but it is nothing if we are sequencing an entire genome. The beauty of NGS is that these technologies are not bound by sample handling logistics. They still require the preparation of libraries,\nbut once a library is made (which can be automated) it is processed more or less automatically to generate multiple copies of each fragment\n(in the case of 454, Illumina, Ion Torrent, PacBio, Oxford Nanopore, Element, Complete Genomics etc…) and loaded onto the machine, where millions of individual fragments are sequenced simultaneously.\nWe will learn about these technologies later in this course.

\n

Reading

\n\n

A few classical papers

\n

In a series of now classical papers (Paper 1, Paper 2) Philip Green and co-workers developed a quantitative framework for the analysis of data generated by automated DNA sequencers:

\n

Figure 8: The first paper
\n

Figure 9: The second paper
\n

In particular, they developed a standard metric for describing the reliability of base calls:

\n

An important technical aspect of our work is the use of log-transformed error probabilities rather than untransformed ones, which facilitates working with error rates in the range of most importance (very close to 0).\nSpecifically, we define the quality value \\(q\\) assigned to a base-call to be:

\n\n\\[q = -10 \\times \\log_{10}(p)\\]\n

where \\(p\\) is the estimated error probability for that base-call. Thus a base-call having a probability of 1/1000 of being incorrect is assigned a quality value of 30. Note that high-quality values correspond to low error probabilities, and conversely.
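
The transformation is easy to try out in Python. The sketch below is only an illustration of the formula above; the function names `phred_quality` and `error_probability` are made up for this example and do not come from any library.

```python
import math

def phred_quality(p):
    # q = -10 * log10(p), where p is the estimated error probability
    return -10 * math.log10(p)

def error_probability(q):
    # Inverse transformation: p = 10 ** (-q / 10)
    return 10 ** (-q / 10)

for p in (0.1, 0.01, 0.001):
    # round() hides tiny floating point differences
    print(p, round(phred_quality(p)))

print(round(error_probability(20), 6))
```

Each tenfold drop in the error probability adds 10 to the quality value, which is why an error probability of 1/1000 corresponds to a quality of 30.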

\n

We will be using the concept of “quality score” or “phred-scaled quality score” repeatedly in this course.

\n

Myers - Green debate

\n

Can we sequence a genome using the shotgun approach?

\n

Figure 10: Gene Myers versus Phil Green
\n

Simulating Sanger sequencing with Python

\n

\n

In this lesson we will cover some of the fundamental Python basics including variables, expressions, statements, and functions.

\n

Prep

\n
  1. Start JupyterLab
  2. Within JupyterLab start a new Python3 notebook
\n

Preclass prep: Chapters 1, 2, 3 from “Think Python”

\n

Indentation is everything!

\n
\n
Warning: Indentation or bust!
\n

Python is an indented language: code blocks are defined using indentation with spaces!

\n
\n

In Python, indentation is used to indicate the scope of control structures such as for loops, if statements, and function and class definitions. The amount of indentation is not fixed, but it must be consistent within a block of code.\nThe recommended amount of indentation is 4 spaces, although some developers prefer to use 2 spaces. Indenting is important in Python because it is used to indicate the level of nesting and structure of the code,\nwhich makes it easier to read and understand. Additionally, indentation is also used to indicate which lines of code are executed together as a block.
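
As a small illustration (the variable name `log` and the messages are made up for this sketch), the indentation level of each line below decides whether it belongs to the `if` block, to the `for` loop, or to neither:

```python
log = []
for nucleotide in 'ACGT':
    if nucleotide == 'A':
        # Indented twice: runs only when the if condition is true
        log.append('found an A')
    # Indented once: runs on every pass through the loop
    log.append('checked ' + nucleotide)
# Not indented: runs once, after the loop has finished
log.append('done')
print(log)
```

Moving a line one level to the left or right changes which block it belongs to, and therefore how many times it is executed.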

\n

The storyline

\n

In this lecture, we will re-implement our Sanger sequencing simulator from the previous lecture and generate realistic gel images.

\n

Generate a random sequence

\n

First, we import a module called random, which contains a number of functions for generating and working with random numbers:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "import random" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-2", "source": "

Next, we will write a simple loop that generates a sequence of a pre-set length:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "seq = ''\n", "for _ in range(100):\n", " seq += random.choice('ATCG')" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-4", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "seq" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-6", "source": "
CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCAAGGTCGTGGCACC\n
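
Because the sequence is random, it will be different every time the cell is run. If a repeatable result is needed, the generator can be seeded first. The sketch below assumes an arbitrary seed of 42 and uses `random.choices`, which draws all 100 letters in one call; the names `seq_a` and `seq_b` are only for illustration.

```python
import random

random.seed(42)  # any fixed integer gives a repeatable sequence
seq_a = ''.join(random.choices('ATCG', k=100))

random.seed(42)  # re-seeding restarts the same stream of random numbers
seq_b = ''.join(random.choices('ATCG', k=100))

print(len(seq_a), seq_a == seq_b)  # 100 True
```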
\n

Simulate one polymerase molecule

\n

The code below iterates through each element of a sequence seq (assumed to be a string containing nucleotides) and it checks if the current nucleotide is equal to ‘A’. If it is, it generates a random number between 0 and 1 using the random.random() function.

\n

It then checks if the random number is greater than 0.5. If it is, the code does nothing special (the polymerase has incorporated a regular nucleotide) and carries on. If the random number is less than or equal to 0.5, the code adds the lowercase version of the nucleotide (‘a’) to a string called synthesized_strand and then breaks out of the loop.

\n

In every iteration in which the loop has not been broken, regardless of whether the nucleotide is ‘A’ or not, the code then adds the current nucleotide to the synthesized_strand string.

\n

This means that when the current nucleotide is ‘A’, the generated random number decides whether the code adds ‘A’ and continues, or adds ‘a’ (a chain-terminating nucleotide) and breaks out of the loop. To get a good idea of what is going on let’s visualize the code execution in

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "synthesized_strand = ''\n", "\n", "for nucleotide in seq:\n", " if nucleotide == 'A':\n", " d_or_dd = random.random()\n", " if d_or_dd > 0.5:\n", " None\n", " else:\n", " synthesized_strand += 'a'\n", " break\n", " synthesized_strand += nucleotide" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-8", "source": "

This can be simplified by first removing the d_or_dd variable:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "synthesized_strand = ''\n", "for nucleotide in seq:\n", " if nucleotide == 'A':\n", " if random.random() > 0.5:\n", " None\n", " else:\n", " synthesized_strand += 'a'\n", " break\n", " synthesized_strand += nucleotide\n", "print(synthesized_strand)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-10", "source": "
CTTGCGGCTATa\n
\n

and removing an unnecessary group of if ... else statements:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "synthesized_strand = ''\n", "for nucleotide in seq:\n", " if nucleotide == 'A' and random.random() > 0.5:\n", " synthesized_strand += 'a'\n", " break\n", " synthesized_strand += nucleotide\n", "print(synthesized_strand)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-12", "source": "
CTTGCGGCTATAGGAATa\n
\n

finally let’s make synthesized_strand += 'a' a bit more generic:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "synthesized_strand = ''\n", "for nucleotide in seq:\n", " if nucleotide == 'A' and random.random() > 0.5:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", "print(synthesized_strand)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-14", "source": "

CTTGCGGCTATAGGa

\n

Simulating multiple molecules

\n

To simulate 10 polymerase molecules we simply wrap the code from above into a for loop:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "for _ in range(10):\n", " synthesized_strand = ''\n", " for nucleotide in seq:\n", " if nucleotide == 'A' and random.random() > 0.5:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " print(synthesized_strand)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-16", "source": "

CTTGCGGCTa\n CTTGCGGCTa\n CTTGCGGCTa\n CTTGCGGCTa\n CTTGCGGCTa\n CTTGCGGCTa\n CTTGCGGCTATa\n CTTGCGGCTATAGGa\n CTTGCGGCTa\n CTTGCGGCTa

\n

One problem with this code is that it does not actually save the newly synthesized strands: it simply prints them. To fix this we will create a list called new_strands and initialize it by assigning an empty list to it:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "new_strands = []\n", "for _ in range(10):\n", " synthesized_strand = ''\n", " for nucleotide in seq:\n", " if nucleotide == 'A' and random.random() > 0.5:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " new_strands.append(synthesized_strand)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-18", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "new_strands" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-20", "source": "

[‘CTTGCGGCTa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATAGGAATAa’,\n ‘CTTGCGGCTATAGGAATa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTa’]

\n

Simulating multiple molecules and all nucleotides

\n

And to repeat this for the remaining three nucleotides we will do the following crazy thing:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "new_strands = []\n", "for _ in range(10):\n", " synthesized_strand = ''\n", " for nucleotide in seq:\n", " if nucleotide == 'A' and random.random() > 0.5:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " new_strands.append(synthesized_strand)\n", "\n", "for _ in range(10):\n", " synthesized_strand = ''\n", " for nucleotide in seq:\n", " if nucleotide == 'C' and random.random() > 0.5:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " new_strands.append(synthesized_strand)\n", "\n", "for _ in range(10):\n", " synthesized_strand = ''\n", " for nucleotide in seq:\n", " if nucleotide == 'G' and random.random() > 0.5:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " new_strands.append(synthesized_strand)\n", "\n", "for _ in range(10):\n", " synthesized_strand = ''\n", " for nucleotide in seq:\n", " if nucleotide == 'T' and random.random() > 0.5:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " new_strands.append(synthesized_strand)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-22", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "len(new_strands)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-24", "source": "

40

\n

Repeating the same code four times is just plain stupid so instead we will write a function called ddN. Here we need to worry about the scope of variables. The scope of a variable refers to the regions of the code where the variable can be accessed or modified. Variables that are defined within a certain block of code (such as a function or a loop) are said to have a local scope, meaning that they can only be accessed within that block of code. Variables that are defined outside of any block of code are said to have a global scope, meaning that they can be accessed from anywhere in the code.

\n

In most programming languages, a variable defined within a function has a local scope, and it can only be accessed within that function. If a variable with the same name is defined outside the function,\nit will have a global scope and can be accessed from anywhere in the code. However, if a variable with the same name is defined within the function, it will take precedence over the global variable and will be used within the function.

\n

There are also some languages that have block scope, where a variable defined within a block (such as an if statement or a for loop) can only be accessed within that block and not outside of it.

\n

In Python, variables defined at the top level of a module (or a notebook) have a global scope and can be read from any function in that module. Variables defined within a function have local scope, and they can only be accessed within that function. Unlike languages with block scope, Python does not create a new scope for a loop or an if block: a variable assigned inside a for loop is still accessible after the loop ends, within the enclosing function or module.
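
A minimal sketch of how this plays out in Python: a name assigned inside a function is local to it, while a for loop or an if block does not create a scope of its own. The names `count_a`, `total`, and `last_seen` are made up for this illustration.

```python
def count_a(template):
    total = 0                  # local to count_a
    for nucleotide in template:
        if nucleotide == 'A':
            total += 1
    return total

for i in range(3):
    last_seen = i              # assigned inside a loop body

print(count_a('GATTACA'))      # 3
print(i, last_seen)            # 2 2: both names are still visible after the loop

try:
    total
except NameError:
    # total only ever existed inside count_a
    print('total is not defined outside count_a')
```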

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "def ddN(number_of_iterations, template, base, ddN_ratio):\n", " new_strands = []\n", " for _ in range(number_of_iterations):\n", " synthesized_strand = ''\n", " for nucleotide in template:\n", " if nucleotide == base and random.random() > ddN_ratio:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " new_strands.append(synthesized_strand)\n", " return(new_strands)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-26", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "ddN(10,seq,'A',0.5)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-28", "source": "

[‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTATAGGAATa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATa’]
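
It is worth checking what the last argument does. As the function is written, a strand terminates when `random.random() > ddN_ratio`, so the chance of terminating at each matching base is 1 - ddN_ratio, and larger values give longer fragments. The sketch below repeats the function so that it runs on its own; the seed, the repetitive template, and the name `mean_lengths` are assumptions made for illustration.

```python
import random

def ddN(number_of_iterations, template, base, ddN_ratio):
    # Copy of the function defined above, repeated so this block is self-contained
    new_strands = []
    for _ in range(number_of_iterations):
        synthesized_strand = ''
        for nucleotide in template:
            if nucleotide == base and random.random() > ddN_ratio:
                synthesized_strand += nucleotide.lower()
                break
            synthesized_strand += nucleotide
        new_strands.append(synthesized_strand)
    return new_strands

random.seed(1)                 # arbitrary seed, only for repeatability
template = 'ACGT' * 25         # hypothetical 100-nt template with 25 As
mean_lengths = {}
for ratio in (0.5, 0.9):
    strands = ddN(1000, template, 'A', ratio)
    mean_lengths[ratio] = sum(len(s) for s in strands) / len(strands)
    print(ratio, round(mean_lengths[ratio], 1))
```

With a ratio of 0.9 only about one in ten As terminates the strand, so the average fragment is much longer than with 0.5.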

\n

To execute this function on all four types of ddNTPs we need to wrap it in a for loop iterating over the four possibilities:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "for nt in 'ATCG':\n", " ddN(10,seq,nt,0.5)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-30", "source": "

A bit about lists

\n

To store the sequences being generated in the previous loop we will create and initialize a list called seq_run:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "seq_run = []\n", "for nt in 'ATCG':\n", " seq_run.append(ddN(10,seq,nt,0.5))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-32", "source": "

You will see that seq_run is a two-dimensional list (a list of lists):

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "seq_run" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-34", "source": "

[[‘CTTGCGGCTa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATAGGAATAa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATa’],\n [‘CTTGCGGCt’, ‘Ct’, ‘Ct’, ‘Ct’, ‘Ct’, ‘CTt’, ‘Ct’, ‘Ct’, ‘CTTGCGGCt’, ‘Ct’],\n [‘CTTGc’,\n ‘c’,\n ‘c’,\n ‘c’,\n ‘c’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGc’,\n ‘c’,\n ‘CTTGCGGCTATAGGAATAAAAGGc’,\n ‘c’,\n ‘CTTGc’],\n [‘CTTg’,\n ‘CTTGCg’,\n ‘CTTg’,\n ‘CTTg’,\n ‘CTTGCg’,\n ‘CTTGCGGCTATAGGAATAAAAg’,\n ‘CTTGCGGCTATAGGAATAAAAg’,\n ‘CTTg’,\n ‘CTTg’,\n ‘CTTg’]]

\n

As you will read in your next home assignment, list elements can be addressed by “index”. The first element has the number 0:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "seq_run[0]" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-36", "source": "

[‘CTTGCGGCTa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATAGGAATAa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATa’]
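
Because this is a list of lists, indices can be chained to reach an individual strand. The short sketch below uses a small made-up list called `runs` (not the actual simulation output) to show a few common indexing patterns:

```python
# Hypothetical, shortened stand-in for a two-dimensional list of strands
runs = [['CTTa', 'CTTGCa'], ['Ct', 'CTt'], ['c', 'CTTGc'], ['CTTg', 'CTTGCg']]

print(runs[0])                  # first inner list: ['CTTa', 'CTTGCa']
print(runs[0][1])               # second strand of the first list: CTTGCa
print(runs[-1])                 # negative indices count from the end: ['CTTg', 'CTTGCg']
print(len(runs), len(runs[0]))  # 4 2
```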

\n

A bit about dictionaries

\n

Another way to store these data is in a dictionary, which is a collection of key:value pairs where a value can be anything and a key can be any hashable (immutable) object, such as a string or a number:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "seq_run = {}\n", "for nt in 'ATCG':\n", " seq_run[nt] = ddN(10,seq,nt,0.90)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-38", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "seq_run" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-40", "source": "

{‘A’: [‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTa’,\n ‘CTTGCGGCTATAGGAATAa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATAGGAATAAAa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCAa’,\n ‘CTTGCGGCTATAGGAa’],\n ‘T’: [‘CTTGCGGCTAt’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTt’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACt’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCAAGGTCGTGGCACC’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATt’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTAt’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCAt’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCAAGGTCGTGGCACC’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCAt’,\n ‘CTt’],\n ‘C’: [‘CTTGCGGCTATAGGAATAAAAGGc’,\n ‘CTTGCGGCTATAGGAATAAAAGGc’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTc’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGAc’,\n ‘c’,\n ‘CTTGCGGCTATAGGAATAAAAGGc’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTc’,\n ‘CTTGCGGc’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGAc’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGc’],\n ‘G’: [‘CTTGCGGCTATAGg’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTg’,\n ‘CTTGCg’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCAAGGTCGTGGCACC’,\n ‘CTTg’,\n ‘CTTGCGGCTATAGGAATAAAAGg’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTg’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCg’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGg’,\n ‘CTTGCGGCTATAGGAATAAAAg’]}

\n

dictionary elements can be retrieved using a key:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "seq_run['A']" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-42", "source": "

[‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTa’,\n ‘CTTGCGGCTATAGGAATAa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTa’,\n ‘CTTGCGGCTa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCa’,\n ‘CTTGCGGCTATAGGa’,\n ‘CTTGCGGCTATAGGAATAAAa’,\n ‘CTTGCGGCTATAGGAATAAAAGGCTTTGCGGGTAGTGACCGCGCCGCGTATGTAATTCATGGGTGTCGTCGCGCCCTCACAACTGCAa’,\n ‘CTTGCGGCTATAGGAa’]
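
Besides looking up a single key, a dictionary can be walked through pair by pair. The sketch below uses a small made-up dictionary called `run` (not the actual simulation output) to show iteration with `items()`, a membership test, and `get()` with a default value:

```python
# Hypothetical, shortened stand-in for a dictionary of strands keyed by base
run = {'A': ['CTTa', 'CTTGCa'], 'T': ['Ct'], 'C': ['c', 'CTTGc'], 'G': []}

for base, strands in run.items():
    # Each pass yields one key and its value
    print(base, len(strands))

print('A' in run)          # True: membership tests look at the keys
print(run.get('N', []))    # []: get returns a default instead of raising KeyError
```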

\n

Drawing a sequencing gel

\n

Now that we can simulate and store newly synthesized sequencing strands terminated with ddNTPs let us try to draw a realistic representation of the sequencing gel. For this, we will use\nseveral components that will be discussed in much greater detail in the upcoming lectures. These components are:

\n\n

These two libraries will be used in almost all lectures concerning Python in this class.

\n

Gel electrophoresis separates molecules based on mass, shape, or charge. In the case of DNA all molecules are universally negatively charged and thus will always migrate toward the (+) electrode. All our molecules are linear single-stranded pieces (our gel is denaturing) and so the only physical/chemical characteristic that distinguishes them is length. Therefore the first thing we will do is to convert our sequences into their lengths. For this, we will initialize a new dictionary called seq_lengths:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-43", "source": [ "seq_lengths = {'base':[],'length':[]}\n", "for key in seq_run.keys():\n", " for sequence in seq_run[key]:\n", " seq_lengths['base'].append(key)\n", " seq_lengths['length'].append(len(sequence))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-44", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "seq_lengths" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-46", "source": "

{‘base’: [‘A’,\n ‘A’,\n ‘A’,\n ‘A’,\n ‘A’,\n ‘A’,\n ‘A’,\n ‘A’,\n ‘A’,\n ‘A’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘T’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘C’,\n ‘G’,\n ‘G’,\n ‘G’,\n ‘G’,\n ‘G’,\n ‘G’,\n ‘G’,\n ‘G’,\n ‘G’,\n ‘G’],\n ‘length’: [81,\n 54,\n 19,\n 34,\n 10,\n 87,\n 15,\n 21,\n 88,\n 16,\n 11,\n 27,\n 84,\n 100,\n 57,\n 51,\n 60,\n 100,\n 60,\n 3,\n 24,\n 24,\n 67,\n 39,\n 1,\n 24,\n 67,\n 8,\n 39,\n 47,\n 14,\n 65,\n 6,\n 100,\n 4,\n 23,\n 28,\n 43,\n 31,\n 22]}

\n
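The nested loop above can also be written a little more compactly. The following is a minimal sketch, assuming a tiny made-up dictionary named toy_run in place of the real seq_run (lowercase letters mark the terminating ddNTP, as in the simulation); the names toy_run and toy_lengths are illustrative only.

```python
# Made-up miniature sequencing run; the real seq_run is built earlier in the lecture
toy_run = {
    'A': ['GTCGa'],
    'T': ['Gt', 'GTCGAt'],
    'C': ['GTc'],
    'G': ['g', 'GTCg'],
}

toy_lengths = {'base': [], 'length': []}
for base, strands in toy_run.items():
    # One label per strand in this lane
    toy_lengths['base'].extend([base] * len(strands))
    # Length of every terminated strand in this lane
    toy_lengths['length'].extend(len(strand) for strand in strands)

print(toy_lengths)
```

Both versions produce two lists of equal size, which is the shape pandas expects when building a dataframe from a dictionary.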

Now let’s import pandas:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "import pandas as pd" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-48", "source": "

and inject seq_lengths into a pandas dataframe:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "sequences = pd.DataFrame(seq_lengths)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-50", "source": "

It looks pretty:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "sequences" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-52", "source": "
| | base | length |
|---|---|---|
| 0 | A | 81 |
| 1 | A | 54 |
| 2 | A | 19 |
| 3 | A | 34 |
| 4 | A | 10 |
| 5 | A | 87 |
| 6 | A | 15 |
| 7 | A | 21 |
| 8 | A | 88 |
| 9 | A | 16 |
| 10 | T | 11 |
| 11 | T | 27 |
| 12 | T | 84 |
| 13 | T | 100 |
| 14 | T | 57 |
| 15 | T | 51 |
| 16 | T | 60 |
| 17 | T | 100 |
| 18 | T | 60 |
| 19 | T | 3 |
| 20 | C | 24 |
| 21 | C | 24 |
| 22 | C | 67 |
| 23 | C | 39 |
| 24 | C | 1 |
| 25 | C | 24 |
| 26 | C | 67 |
| 27 | C | 8 |
| 28 | C | 39 |
| 29 | C | 47 |
| 30 | G | 14 |
| 31 | G | 65 |
| 32 | G | 6 |
| 33 | G | 100 |
| 34 | G | 4 |
| 35 | G | 23 |
| 36 | G | 28 |
| 37 | G | 43 |
| 38 | G | 31 |
| 39 | G | 22 |
\n
\n

In our data, a number of DNA fragments have identical lengths (just look at the dataframe above). We can condense these by grouping dataframe entries first by nucleotide (['base']) and then by length (['length']). For each group we will then compute the count and put it into a new column named count:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "sequences_grouped_by_length = sequences.groupby(\n", " ['base','length']\n", ").agg(\n", " count=pd.NamedAgg(\n", " column='length',\n", " aggfunc='count'\n", " )\n", ").reset_index()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-54", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "sequences_grouped_by_length" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-56", "source": "
| | base | length | count |
|---|---|---|---|
| 0 | A | 10 | 1 |
| 1 | A | 15 | 1 |
| 2 | A | 16 | 1 |
| 3 | A | 19 | 1 |
| 4 | A | 21 | 1 |
| 5 | A | 34 | 1 |
| 6 | A | 54 | 1 |
| 7 | A | 81 | 1 |
| 8 | A | 87 | 1 |
| 9 | A | 88 | 1 |
| 10 | C | 1 | 1 |
| 11 | C | 8 | 1 |
| 12 | C | 24 | 3 |
| 13 | C | 39 | 2 |
| 14 | C | 47 | 1 |
| 15 | C | 67 | 2 |
| 16 | G | 4 | 1 |
| 17 | G | 6 | 1 |
| 18 | G | 14 | 1 |
| 19 | G | 22 | 1 |
| 20 | G | 23 | 1 |
| 21 | G | 28 | 1 |
| 22 | G | 31 | 1 |
| 23 | G | 43 | 1 |
| 24 | G | 65 | 1 |
| 25 | G | 100 | 1 |
| 26 | T | 3 | 1 |
| 27 | T | 11 | 1 |
| 28 | T | 27 | 1 |
| 29 | T | 51 | 1 |
| 30 | T | 57 | 1 |
| 31 | T | 60 | 2 |
| 32 | T | 84 | 1 |
| 33 | T | 100 | 2 |
\n
\n
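The grouping step above can also be pictured in plain Python. The sketch below is not how pandas implements groupby; it only reproduces the same kind of tally with collections.Counter on a small made-up list of (base, length) pairs. The names toy_rows and band_counts are illustrative.

```python
from collections import Counter

# Made-up (base, length) pairs standing in for rows of the dataframe
toy_rows = [('C', 24), ('C', 24), ('C', 39), ('T', 60), ('C', 24), ('T', 60)]

# Count how many fragments share the same base and length
band_counts = Counter(toy_rows)

# Sort by base, then by length, to mirror the order of the grouped dataframe
for (base, length), count in sorted(band_counts.items()):
    print(base, length, count)
```

Each distinct (base, length) pair corresponds to one band on the gel, and its count is what later drives the band intensity.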

The following chart is created by passing the data to alt.Chart(). The mark_tick() method then draws each row as a tick with a thickness of 4 pixels.

\n

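The chart is encoded with two main axes: y shows the fragment length as a quantitative value (length:Q) and x shows the base. In addition, color encodes the count using the greys scheme.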

\n\n

Finally, the chart properties are set to a width of 100 pixels and a height of 800 pixels.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-57", "source": [ "import altair as alt\n", "alt.Chart(sequences_grouped_by_length).mark_tick(thickness=4).encode(\n", " y = alt.Y('length:Q'),\n", " x = alt.X('base'),\n", " color=alt.Color('count:Q',legend=None,\n", " scale=alt.Scale(scheme=\"greys\"))\n", ").properties(\n", " width=100,\n", " height=800)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-58", "source": "
Gel rendering 1.

Figure 11: A simulated gel image
\n

And here is a color version of the same graph using just one lane of the gel:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "import altair as alt\n", "alt.Chart(sequences_grouped_by_length).mark_tick(thickness=4).encode(\n", " y = alt.Y('length:Q'),\n", " color=alt.Color('base:N',#legend=None,\n", " scale=alt.Scale(scheme=\"set1\"))\n", ").properties(\n", " width=20,\n", " height=800)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-60", "source": "
Gel rendering 2.

Figure 12: A simulated gel image using 'colored dyes'
\n

Putting everything together

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "# Generate random sequences\n", "\n", "seq = ''\n", "for _ in range(300):\n", " seq += random.choice('ATCG')" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-62", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "seq" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-64", "source": "

'GTCGATGCCTGTTTGACCTAACTGGCGTGAAGGCTATATCAGTTATCCCAAGCGTAGGCTTTCAATTCGCCCGGTTGCGTCGCCCGATTATCAATCGCGGAAGGTGGGTGCGATTGGAAGTCCAAAACCTTTATCCTGACACACTTTCTGACTCGGCTTGGCAATGGGAAGTGTAGAACGTAGCGGGGACCTACATCATATCGTACATAACTGAGACGTGCTCACCCGCAGAGATAAGAACTGCAATACCCGGGTGAATACTTGGGGAGTCTCACCCAGATGGTTGGCCTGATCCTCCCC'
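Because the template is drawn at random, rerunning the cell will give a different sequence from the one shown here. If a repeatable result is wanted, one option is to seed the random number generator first. This is a minimal sketch; the seed value 42 is an arbitrary assumption and is not part of the original lecture.

```python
import random

random.seed(42)  # arbitrary seed, chosen only for illustration
seq_a = ''.join(random.choice('ATCG') for _ in range(300))

random.seed(42)  # resetting the same seed replays the same random draws
seq_b = ''.join(random.choice('ATCG') for _ in range(300))

print(seq_a == seq_b)
```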

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-65", "source": [ "# Function simulating a single run of a single polymerase molecule\n", "\n", "def ddN(number_of_iterations, template, base, ddN_ratio):\n", " new_strands = []\n", " for _ in range(number_of_iterations):\n", " synthesized_strand = ''\n", " for nucleotide in template:\n", " if nucleotide == base and random.random() > ddN_ratio:\n", " synthesized_strand += nucleotide.lower()\n", " break\n", " synthesized_strand += nucleotide\n", " new_strands.append(synthesized_strand)\n", " return(new_strands)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-66", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-67", "source": [ "# Generating simulated sequencing run\n", "\n", "seq_run = {}\n", "for nt in 'ATCG':\n", " seq_run[nt] = ddN(100000,seq,nt,0.95)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-68", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-69", "source": [ "# Computing lengths\n", "\n", "seq_lengths = {'base':[],'length':[]}\n", "for key in seq_run.keys():\n", " for sequence in seq_run[key]:\n", " seq_lengths['base'].append(key)\n", " seq_lengths['length'].append(len(sequence))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-70", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-71", "source": [ "# Converting dictionary into Pandas dataframe\n", "\n", "sequences = pd.DataFrame(seq_lengths)" ], "cell_type": "code", "execution_count": null, 
"outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-72", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-73", "source": [ "# Grouping by nucleotide and length\n", "\n", "sequences_grouped_by_length = sequences.groupby(\n", " ['base','length']\n", ").agg(\n", " count=pd.NamedAgg(\n", " column='length',\n", " aggfunc='count'\n", " )\n", ").reset_index()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-74", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-75", "source": [ "# Plotting (note the quadratic scale for realism)\n", "\n", "import altair as alt\n", "alt.Chart(sequences_grouped_by_length).mark_tick(thickness=4).encode(\n", " y = alt.Y('length:Q',scale=alt.Scale(type='sqrt')),\n", " x = alt.X('base'),\n", " color=alt.Color('count:Q',legend=None,\n", " scale=alt.Scale(type='log',scheme=\"greys\")),\n", " tooltip='count:Q'\n", ").properties(\n", " width=100,\n", " height=800)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-76", "source": "
Gel rendering 3.

Figure 13: A simulated gel image using a square-root (sqrt) scale
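To get a feel for what the sqrt scale type used in the plotting code does to band spacing, the positions can be worked out by hand. This is a rough sketch under stated assumptions: band_position is a hypothetical helper that only mimics the shape of a square-root transform and does not reproduce the exact pixel mapping Altair performs.

```python
import math

def band_position(length, max_length=300, height=800):
    # Hypothetical helper: place a fragment of the given length on a
    # 0..height axis using a square-root transform
    return height * math.sqrt(length) / math.sqrt(max_length)

# Distance between two neighboring short fragments
gap_short = band_position(11) - band_position(10)
# Distance between two neighboring long fragments
gap_long = band_position(291) - band_position(290)

print(round(gap_short, 1), round(gap_long, 1))
```

Under this transform, neighboring short fragments end up farther apart than neighboring long fragments, so the long fragments are compressed at one end of the chart.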
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-77", "source": [ "# Plotting using color\n", "\n", "import altair as alt\n", "alt.Chart(sequences_grouped_by_length).mark_tick(thickness=4).encode(\n", " y = alt.Y('length:Q',scale=alt.Scale(type=\"sqrt\")),\n", " color=alt.Color('base:N',#legend=None,\n", " scale=alt.Scale(scheme=\"set1\")),\n", " opacity=alt.Opacity('count:N',legend=None),\n", " tooltip='count:Q'\n", ").properties(\n", " width=20,\n", " height=800)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "# The problem" ], "id": "" } } }, { "id": "cell-78", "source": "
Gel rendering 4.

Figure 14: A simulated gel image using a square-root (sqrt) scale with colors
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Sanger sequencing is sequencing by synthesis\n", "- Python is powerful\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/gnmx-lecture2/tutorial.html#feedback) and check there for further resources!\n" ] } ] }