Sunday, November 13, 2016

Google Sheets: Practicing the Principles of Deep Seq Alignment & Assembly

One of the more difficult concepts for me to help students initially grasp is how deep sequencing (next-generation sequencing) works, particularly in how millions of short (~35 nucleotide) DNA sequences can be used to identify the sequence of an entire chromosome (~millions of nucleotides). My perspective on why this topic is conceptually tricky is that deep seq involves huge amounts of data that are normally analyzed by "black box" computational algorithms. Thus, it has been intractable for me to develop in-class paper exercises that scaffold the process of deep seq assembly and engage each student in a group effort to solve an explicit problem.

Until DISCOVERe.

I'm currently teaching a graduate course in molecular biology as one of Fresno State's DISCOVERe (tablet computer) courses, in which each student is required to bring a tablet computer to class each week. As I've written previously, one way I'm leveraging this opportunity to improve student learning is by creating exercises that require students to use computers to understand concepts and solve problems.

I've written this post with appropriate biology background for the lay audience. Moreover, I've emphasized many aspects of the use of a shared spreadsheet that I think can be leveraged in disciplines other than biology. –JR

Last week, for the topic of genome sequencing, we read a primary literature article, published in Nature, describing the genome assembly of a 45,000-year-old modern human using a deep sequencing approach.

For you non-geneticists, here's the crash course on genome assembly:

Background
DNA comprises four chemical building blocks, called nucleotides, abbreviated A, C, G, and T, that are linked together in long chains we call chromosomes. The 23 human chromosomes are millions of nucleotides long, and the specific order of nucleotides encodes the instructions for building and operating cells and organisms, much like the order of letters in this blog post gives you instructions for building and operating the exercise I used in my class! By analogy, each of the 23 chromosomes contains different genes, most of which have unique roles in building and operating a living organism, and each chromosome can be considered as one chapter that comprises the book of life. "DNA sequencing" is the process of obtaining the identity and order of the nucleotides that comprise a chromosome.

The Problems

  1. We do not have the technology to take an entire chromosome and read the sequence of nucleotides from one end to the other. Rather, we might be able to read, at most, a few thousand nucleotides in a row at a time.
  2. Technologies we use to determine the sequence of a nucleotide chain (a chromosome) are prone to occasionally mis-reading a nucleotide.


Current Solutions

  1. We use technologies, like deep sequencing, that don't rely on obtaining long stretches of nucleotide sequence information.
  2. Modern approaches like deep sequencing take advantage of obtaining multiple experimental sequence reads of the same spot on the same chromosome to identify errors. For example, if one nucleotide on a chromosome is experimentally "read" twenty times, with nineteen of the sequence reads containing an "A" at that position, and one read containing a "T," then we conclude that the rare nucleotide T is an experimental error that doesn't reflect the true nucleotide found at that position on a chromosome (the A). This most-frequent value is called the consensus.
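For readers who like to see the idea in code, here is a minimal Python sketch of the majority-vote logic described in the second solution. The function name consensus_base is my own for illustration; real variant callers also weigh read quality, which this sketch ignores.

```python
from collections import Counter

def consensus_base(observed):
    """Return the most frequent nucleotide among the reads covering one position.

    Ties are broken by whichever nucleotide was seen first, which is an
    arbitrary choice made only for this sketch.
    """
    counts = Counter(observed)
    base, _ = counts.most_common(1)[0]
    return base

# Twenty reads of one position: nineteen "A" and one "T", as in the example.
reads_at_position = ["A"] * 19 + ["T"]
print(consensus_base(reads_at_position))
```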

A New Problem
Now, these solutions raise a new problem of their own. Continuing the analogy of a chromosome representing a chapter of information about how to build and operate a human, short DNA sequences obtained using deep seq technology might only comprise a few "words" from a chromosome. The major challenge a decade ago was to develop computer algorithms that would take the millions of short nucleotide sequences and look for shared overlaps that could be used to assemble the short sequences into a longer contiguous sequence of an entire chromosome.
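To make "shared overlaps" concrete, here is a rough Python sketch of the simplest possible overlap test: does the end of one read exactly match the start of another? The function names, the min_len parameter, and the toy reads are assumptions for illustration. Real assemblers tolerate read errors and use far more efficient data structures.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_len` (>= 1) characters exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0

def merge(left, right, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    k = overlap(left, right, min_len)
    return left + right[k:] if k else None

# Two toy reads (not real data) that share the three letters "TAC".
print(overlap("ACGTAC", "TACGGA"))
print(merge("ACGTAC", "TACGGA"))
```

Repeating this kind of merge across millions of reads is where the computational cost comes from, because in principle every read has to be compared against every other read.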

Example of Deep Sequence Assembly
For example, let's say we were sequencing the chromosome containing the paragraph I just wrote. I'll focus on the clause "look for shared overlaps that could be used to assemble the short sequences." From deep sequencing, we might have obtained a series of sequences as follows (in no particular order, as a geneticist would obtain from a deep seq experiment):

beusedtoass, rlapsthatcou, kforsharedo, letheshorts, lookforsha, atcouldbeu, laredoverla, toassembleth, ortsequences

Note that spaces have been removed, as the DNA language uses other formatting to distinguish "words" (genes) from each other. Now, with these short reads, our task is to assemble them into a meaningful clause. As I mentioned, computer algorithms look for unique overlaps to line up the short reads like this:

Example of a multiple sequence alignment

Where each of rows 1–9 contains one of our short reads. The algorithms then assemble a "consensus sequence" (which we then define as the sequence of the chromosome) by looking for the most common letter at each position. Note, for example, that the short read in row 3 begins with an "L," but the other two reads that overlap this position (column I) contain "H." Thus, the consensus sequence assembly (row 11) contains the more common letter (H) in column I. The L in sequence three represents one of those mistakes that are accidentally generated during the experimental reading of the chromosome.
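The column-by-column vote can be sketched in Python using the nine example reads. The offsets below are my own reading of where each read sits in the clause, so they stand in for the alignment step rather than perform it, and the function name consensus is chosen only for this sketch.

```python
from collections import Counter

# (offset, read) pairs. The offsets are assumed and play the role of the
# alignment; the reads are the nine fragments listed in the example.
aligned_reads = [
    (0, "lookforsha"),
    (3, "kforsharedo"),
    (8, "laredoverla"),   # the first letter is the simulated read error
    (16, "rlapsthatcou"),
    (23, "atcouldbeu"),
    (30, "beusedtoass"),
    (36, "toassembleth"),
    (44, "letheshorts"),
    (51, "ortsequences"),
]

def consensus(aligned):
    """Most common letter in each column; '-' marks columns no read covers."""
    length = max(offset + len(read) for offset, read in aligned)
    columns = [Counter() for _ in range(length)]
    for offset, read in aligned:
        for i, letter in enumerate(read):
            columns[offset + i][letter] += 1
    return "".join(col.most_common(1)[0][0] if col else "-" for col in columns)

print(consensus(aligned_reads))
```

The ninth letter (column I in the sheet) is covered by three reads, two with "h" and one with "l", so the vote recovers "h" there.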

One other technique used to improve assembly efficiency
When working with chromosomes, we can give ourselves one more piece of information that helps computers speed up the process of sequence assembly: we take a chromosome and break it into random pieces that are all the same length, say 1,000 nucleotides long, and we read the first and last 35 nucleotides of each piece. Thus, we know how far apart on each chromosome each pair of sequences should be placed (~1,000 nucleotides apart).

The classroom exercise
I imagine that by now you're starting to appreciate the depth of the issue: the degree of complexity inherent in producing a chromosome-length sequence assembly comprising millions of short sequence reads. To help my students get a feel for this process, I generated a problem similar to the Google Sheets screen shot above. In this exercise, I took the known DNA sequence of the human melanocortin-1 receptor (Mc1r) gene, which is specifically relevant to the manuscript we read, and divided it into simulated deep seq reads. I will note here, though, that I didn't tell the students what gene this was - this is how I framed the activity: that we had obtained deep sequence reads from an unknown part of an ancient genome, and our job was to assemble the short reads and then determine the identity of the region of the genome we had sequenced.

In this case, for efficiency, I made each sequence 100 nucleotides long and told the students that the paired sequences were from 300 nucleotide DNA fragments. Thus, in the final assembly of the Mc1r gene, each student has a unique 100 nucleotide sequence, followed by a gap of 100 (unknown) nucleotides (the middle of their DNA fragment), followed by 100 more nucleotides. I used the rand() function to generate a random distribution of fragments for my 20 students, and then shared the Google Sheet with all of them during class.
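A simulation along these lines could be sketched in Python as follows. This is not the procedure I used in the sheet, just one way to produce the same kind of paired reads. The function name, its parameters, and the randomly generated placeholder reference are all assumptions; the placeholder is not the Mc1r sequence.

```python
import random

def simulate_paired_reads(reference, n_pairs, read_len=100, fragment_len=300, seed=None):
    """Pick random fragments and return (start, first_read, last_read) tuples.

    Each fragment is `fragment_len` long; only its first and last `read_len`
    nucleotides are kept, leaving an unread gap in the middle.
    """
    rng = random.Random(seed)
    last_start = len(reference) - fragment_len
    pairs = []
    for _ in range(n_pairs):
        start = rng.randint(0, last_start)  # inclusive on both ends
        fragment = reference[start:start + fragment_len]
        pairs.append((start, fragment[:read_len], fragment[-read_len:]))
    return pairs

# Placeholder reference of 954 random nucleotides (NOT the real gene).
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(954))

for start, first, last in simulate_paired_reads(reference, n_pairs=3, seed=1):
    print(start, first[:10], last[:10])
```

Because the fragment length is fixed, the second read of each pair always begins 200 positions after the first read begins, which is the spacing information the students were given.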

In this case, because we worked on this exercise in class and were not using computer algorithms to find overlaps among each other's sequence reads (Google Sheets is not well-equipped to do this), I provided the consensus sequence of the Mc1r gene in the top row of the sheet. My immediate goal was not to have them practice the deep sequence assembly part of the process, but rather to use a spreadsheet as a tool that gives students the ability to hand-align short sequences to a larger consensus sequence. The students used the shared Google Sheet to align their short sequences (rows 4–27) to the consensus (row 2) in real time by cutting and pasting each sequence into the appropriate series of columns (the gene is 954 nucleotides - or columns - long).

Screenshot taken while students were editing the shared Google Sheet

After the multiple sequence alignment is produced, one region looks like this:

Example analysis of a multiple sequence alignment

where the top row is the known Mc1r consensus sequence, the second row is the consensus sequence derived from the short sequence reads the students are working with, and the bottom three sequences are individual deep seq reads. Note that position 919 has two different nucleotides: two short reads have a "G" (yellow), and one has an "A." I constructed this difference because our genome sequence manuscript noted that the ancient human genome contained a mutation in the Mc1r gene that might have produced red hair (a G instead of the usual A at position 919). Incorporated into this exercise, this finding allows students to discuss the concept of developing a consensus sequence and how low "read depth" (the consensus sequence at position 919 here is only generated from three independent sequences, only two of which agree) could produce a misleading consensus sequence.

Conclusion One
A benefit of using a spreadsheet like Google Sheets for this type of activity is that students can work interactively with more data, hand-aligning multiple sequences simultaneously by cut-and-paste (or drag-and-drop). Getting a grasp of the solution to the problem, as well as the magnitude of the problem in real life situations, is next to impossible (in my hands) to convey using a traditional paper-based exercise, or (worse) by drawing on a static white board.

Additional exercise components
Another benefit of using a Google Sheet, instead of showing students existing web-based sequence alignment tools, is that the interface is more interactive, intuitive, and also lets us easily practice calculating some basic information about our alignment. With our spreadsheet-based multiple sequence alignment, produced collaboratively by all of the students, we can:


  1. Calculate and plot the read depth of the sequence at every position
  2. Find the average, maximum and minimum read depths
  3. Quickly identify positions where a deep sequence read doesn't match the consensus


1. Read depth and plot
At each nucleotide position (1, 2, …), we just want to know how many sequences contain a letter at that position. This is easily accomplished using the countif() function. For example, to calculate read depth at position 919 (for this formula, we'll assume the sheet is laid out so that position 919 is in row 919 and the 22 deep seq reads are in columns C–X):

=countif(C919:X919,"?")

where each cell in the range C919:X919 that contains any single character (specified by the wildcard character ?) is tallied. The cell would display the number 3 (the read depth). This formula can be rapidly copied and pasted (filled down) to calculate read depth for every nucleotide position.
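Outside of a spreadsheet, the same tally can be sketched in a few lines of Python. Here each aligned read is a string padded with spaces, one character per position, and the three toy rows are invented for illustration rather than taken from the class data.

```python
def read_depth(aligned_rows):
    """Count, for every position, how many rows hold a letter rather than a blank."""
    width = max(len(row) for row in aligned_rows)
    padded = [row.ljust(width) for row in aligned_rows]
    return [sum(1 for row in padded if row[col] != " ") for col in range(width)]

# Three toy aligned reads; spaces mark positions a read does not cover.
rows = [
    "ACGTAC    ",
    "  GTACGG  ",
    "     CGGAT",
]
print(read_depth(rows))
```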

We can then plot the read depth across the sequence:



2. Maximum, minimum, and average read depths
Across all 954 nucleotides, we can quickly identify the maximum, minimum, and average read depths by entering the following three formulae into three different cells:

=max(<range-containing-read-depth-values>)
=min(<range-containing-read-depth-values>)
=average(<range-containing-read-depth-values>)

where <range-containing-read-depth-values> would be replaced with an expression like Z1:Z954, indicating the cell range where the read depth values are located. The brackets (< >) should not be included in the entry.
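For comparison, a Python equivalent of these three summary formulas might look like the sketch below. The depth values are a small made-up list, not the class results, and the function name summarize_depth is my own.

```python
from statistics import mean

def summarize_depth(depths):
    """Return the maximum, minimum, and average of a list of read depths."""
    return {"max": max(depths), "min": min(depths), "average": mean(depths)}

# Made-up read depths for ten positions.
depths = [1, 1, 2, 2, 2, 3, 2, 2, 1, 1]
print(summarize_depth(depths))
```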

3. Mismatches to the consensus sequence
It can be time-consuming (and error-prone) to visually compare each 100 nucleotide sequence to the consensus sequence to find any places where the two do not agree (like at position 919). A faster way to accomplish this using a spreadsheet is with the if() function, where you specify a condition that has to be met. When true, one value you specify is displayed in the cell; when false, a different value is displayed:

=if(<cell-containing-deep-seq-nucleotide> = <cell-containing-consensus-seq-nucleotide>, 0, 1)

In this case, we're testing the condition where our deep sequence nucleotide matches (is equal to) the consensus nucleotide at that position. After the first comma, we enter the value-if-true (the value we want displayed in the cell if the two nucleotides match): in this case, a zero. After the second comma, we enter the value-if-false (a one), which will be shown in every cell where the two nucleotides are not the same. Filling this formula across all 954 columns will let us visually scan across that row to spot any places where a "1" is shown, indicating a position of mismatch. There is a more advanced technique to even more quickly learn which columns contain a 1 and not a 0, but I'll leave that for another time.
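The same comparison can be sketched in Python. One deliberate difference from the sheet formula, which is my own choice here: positions the read does not cover (blanks) are reported as 0 instead of being flagged, so only true disagreements show up as 1. The sequences are toy data and the function names are hypothetical.

```python
def mismatch_flags(read_row, consensus):
    """Return 1 where the read disagrees with the consensus, else 0.

    Blank (uncovered) positions in the read are treated as 0, not as mismatches.
    """
    return [0 if base == " " or base == ref else 1
            for base, ref in zip(read_row, consensus)]

def mismatch_positions(read_row, consensus):
    """1-based positions at which the read and the consensus differ."""
    flags = mismatch_flags(read_row, consensus)
    return [i + 1 for i, flag in enumerate(flags) if flag]

consensus_seq = "ACGTACGGAT"
read = "  GTATGG  "   # toy read with one changed letter
print(mismatch_flags(read, consensus_seq))
print(mismatch_positions(read, consensus_seq))
```

Listing the positions directly saves scanning a long row of zeros by eye.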

The punch line of the exercise
At the end of the exercise, the students are able to copy the consensus sequence into the web-based Basic Local Alignment Search Tool (BLAST) to search a nucleotide sequence database for matches to known sequences. This is where students discover that the sequence they've been working with is the human Mc1r gene. Finally, based on the local alignment in the figure above, they work together to identify which of the individual deep seq reads came from a modern human (A at position 919) and which from the ancient human (G at position 919).

At this point, I concluded the exercise by demonstrating one last web tool for use in analyzing DNA sequences: Transeq, which is a program that translates a DNA sequence in all six reading frames. I translate both the original consensus sequence and the student consensus (with G at 919), so that students observe how this nucleotide sequence difference impacts the sequence of the encoded protein.
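For anyone curious what a six-frame translation involves, here is a small Python sketch using the standard genetic code. It is not how Transeq is implemented, only an illustration of the idea; the function names are my own, and the toy sequences are invented and unrelated to Mc1r.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate complete codons from the start of `dna`; 'X' marks unknown codons."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def reverse_complement(dna):
    """Reverse the sequence and swap A<->T and C<->G."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(dna):
    """Three forward frames followed by three frames of the reverse complement."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]

print(six_frames("ATGGCCTAA"))

# A single G-to-A change in a toy sequence alters one amino acid.
print(translate("ATGGACTGG"), translate("ATGAACTGG"))
```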

We then summarize by discussing how difficult the entire process would have been - and how much longer it would have taken - if I had not initially provided the known consensus sequence, which is always what the genomicist is seeking to identify with deep sequence reads in the first place.

Conclusion
By using a shared spreadsheet instead of a paper process, students can tackle an involved problem more quickly by working together on independent tablet computers. Further, I integrated the use of several web tools that practicing geneticists use on a regular basis as part of their research. The use of a spreadsheet to make calculations and graphs and to analyze data helps students develop quantitative analysis skills that they can apply in other courses (and, of course, in life).
