I have a confession. Box was the first app I ever used, on day one of my faculty professional development in tablet-based instructional approaches. That was in 2014, and the task was to record a video using the Box app. Many of my colleagues use it regularly in their classes, which is why I'm almost embarrassed to admit that it took me this long to discover a use for it: in my graduate course in molecular biology.
In its simplest form, Box (available as an app and via a web interface) is a great place to collect and share files. It has a robust system for setting access permissions when inviting others to access files you place in a Box folder you create. This I already knew.
But, then I had an issue to address in class: my travel conflicted with one of our class meetings. Once again, I started exploring how the use of technology could enhance my ability to conduct course meetings remotely. The format of my molecular biology course is that students read a published research manuscript in advance of class, and then we meet to discuss, interpret, and critique the study. How would I provide students with a rendition of this format if we have one week where we don't meet in person?
There are several possible approaches, including using Zoom (or other videoconferencing platforms), but I settled on the idea of having each student provide a digital presentation of one figure or table from the assigned manuscript. Instead of discussing in person, each student would produce a video in which s/he presents and interprets a figure. The grading in my class is based entirely on participation, so it would be relatively easy for me to assess participation in a digital discussion.
Thus, I assigned each student one figure to present. They are allowed either to record a movie or (as a backup approach, in case of technical hurdles) to produce a written description and interpretation of that figure. I discovered, while re-exploring Box, that the Box app lets students record a movie and upload it to the class Box folder directly from the tablet, and the workflow is easy. More importantly, for me, Box also incorporates a commenting feature and a way to post text.
Beyond posting one figure presentation (movie or text) to our Box folder, each student must view one peer's presentation of each figure in the manuscript and leave a critical comment or a question on it. This ensures that each student has engaged in the analysis of the data presented in the manuscript, as we do weekly in person in class. Finally, I also require that each student respond to at least one comment/question left by a peer on their own presentation.
Because students will use the Box Comment tool to leave (and respond to) both text and movie posts, this also makes it easy for me to assess student participation.
This plan kicks off in three days, and I'll post an update after I learn whether my best-laid plans work the way I hope!
My only concern is that, because the use of Box was a rather last-minute decision on my part, I didn't have time built into my course for students to practice using Box beforehand. This underscores an important best practice for any incorporation of a new technology into a class: build (and protect) time in your class to practice workflows before you employ them in any for-stakes assessment activities.
Saturday, November 19, 2016
Sunday, November 13, 2016
Google Sheets: Practicing the Principles of Deep Seq Alignment & Assembly
One of the more difficult concepts for me to help students initially grasp is how deep sequencing (next-generation sequencing) works, particularly in how millions of short (~35 nucleotide) DNA sequences can be used to identify the sequence of an entire chromosome (~millions of nucleotides). My perspective on why this topic is conceptually tricky is that deep seq involves huge amounts of data that are normally analyzed by "black box" computational algorithms. Thus, it has been intractable for me to develop in-class paper exercises that scaffold the process of deep seq assembly and engage each student in a group effort to solve an explicit problem.
Until DISCOVERe.
I'm currently teaching a graduate course in molecular biology as one of Fresno State's DISCOVERe (tablet computer) courses, in which each student is required to bring a tablet computer to class each week. As I've written previously, one way I'm leveraging this opportunity to improve student learning is by creating exercises that require students to use computers to understand concepts and solve problems.
I've written this post with appropriate biology background for the lay audience. Moreover, I've emphasized many aspects of the use of a shared spreadsheet that I think can be leveraged in disciplines other than biology. –JR
Last week, the primary literature article we read for the topic of genome sequencing was a manuscript, published in Nature, describing the genome assembly of a 45,000-year-old modern human using a deep sequencing approach.
For you non-geneticists, here's the crash course on genome assembly:
Background
DNA comprises four chemical building blocks, called nucleotides, abbreviated A, C, G, and T, that are linked together in long chains we call chromosomes. The 23 human chromosomes are millions of nucleotides long, and the specific order of nucleotides encodes the instructions for building and operating cells and organisms, much like the order of letters in this blog post gives you instructions for building and operating the exercise I used in my class! By analogy, each of the 23 chromosomes contains different genes, most of which have unique roles in building and operating a living organism, and each chromosome can be considered as one chapter that comprises the book of life. "DNA sequencing" is the process of obtaining the identity and order of the nucleotides that comprise a chromosome.
The Problems
- We do not have the technology to take an entire chromosome and read the sequence of nucleotides from one end to the other. Rather, we might be able to read, at most, a few thousand nucleotides in a row at a time.
- Technologies we use to determine the sequence of a nucleotide chain (a chromosome) are prone to occasionally mis-reading a nucleotide.
Current Solutions
- We use technologies, like deep sequencing, that don't rely on having to obtain long stretches of nucleotide sequence information.
- Modern approaches like deep sequencing take advantage of obtaining multiple experimental sequence reads of the same spot on the same chromosome to identify errors. For example, if one nucleotide on a chromosome is experimentally "read" twenty times, with nineteen of the sequence reads containing an "A" at that position, and one read containing a "T," then we conclude that the rare nucleotide T is an experimental error that doesn't reflect the true nucleotide found at that position on a chromosome (the A). This most-frequent value is called the consensus.
A New Problem
Now, these solutions raise a new problem of their own. Continuing the analogy of a chromosome representing a chapter of information about how to build and operate a human, short DNA sequences obtained using deep seq technology might only comprise a few "words" from a chromosome. The major challenge a decade ago was to develop computer algorithms that would take the millions of short nucleotide sequences and look for shared overlaps that could be used to assemble the short sequences into a longer contiguous sequence of an entire chromosome.
Example of Deep Sequence Assembly
For example, let's say we were sequencing the chromosome containing the paragraph I just wrote. I'll focus on the blue clause. From deep sequencing, we might have obtained a series of sequences as follows (in no particular order, as a geneticist would obtain from a deep seq experiment):
beusedtoass, rlapsthatcou, kforsharedo, letheshorts, lookforsha, atcouldbeu, laredoverla, toassembleth, ortsequences
Note that spaces have been removed, as the DNA language uses other formatting to distinguish "words" (genes) from each other. Now, with these short reads, our task is to assemble them into a meaningful clause. As I mentioned, computer algorithms look for unique overlaps to line up the short reads like this:
Example of a multiple sequence alignment
where each of rows 1–9 contains one of our short reads. The algorithms then assemble a "consensus sequence" (which we then define as the sequence of the chromosome) by looking for the most common letter at each position. Note, for example, that the short read in row 3 begins with an "L," but the other two reads that overlap this position (column I) contain "H." Thus, the consensus sequence assembly (row 11) contains the more common letter (H) in column I. The L in sequence three represents one of those mistakes that are accidentally generated during the experimental reading of the chromosome.
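To make the idea of "looking for overlaps" concrete, here is a minimal Python sketch (not the algorithm a real assembler uses, and not part of the class exercise). The function names and the minimum overlap of three letters are my own assumptions for illustration. It finds the longest exact match between the end of one read and the start of another, using the short reads listed above.

```python
def suffix_prefix_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_len` letters exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, min_len - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge(left, right, min_len=3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    size = suffix_prefix_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


# Two of the short reads from the example above.
print(suffix_prefix_overlap("lookforsha", "kforsharedo"))
print(merge("lookforsha", "kforsharedo"))

# The read containing the sequencing error ("L" instead of "H") has no
# exact overlap with its left-hand neighbor in this simple sketch.
print(suffix_prefix_overlap("kforsharedo", "laredoverla"))
```

Because this sketch insists on exact matches, the read with the mistaken "L" cannot be joined to the read on its left, which hints at why handling read errors makes the real problem harder.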
One other technique used to improve assembly efficiency
When working with chromosomes, we can give ourselves one more piece of information that helps computers speed up the process of sequence assembly: we take a chromosome and break it into random pieces that are all the same length, say 1,000 nucleotides long, and we read the first and last 35 nucleotides of each piece. Thus, we know how far apart on each chromosome each pair of sequences should be placed (~1,000 nucleotides apart).
The classroom exercise
I imagine that by now you're starting to appreciate the depth of the issue: the degree of complexity inherent in producing a chromosome-length sequence assembly comprising millions of short sequence reads. To help my students get a feel for this process, I generated a problem similar to the Google Sheets screenshot above. In this exercise, I took the known DNA sequence of the human melanocortin-1 receptor (Mc1r) gene, which is specifically relevant to the manuscript we read, and divided it into simulated deep seq reads. I will note here, though, that I didn't tell the students what gene this was - this is how I framed the activity: that we had obtained deep sequence reads from an unknown part of an ancient genome, and our job was to assemble the short reads and then determine the identity of the region of the genome we had sequenced.
In this case, for efficiency, I made each sequence 100 nucleotides long and told the students that the paired sequences were from 300 nucleotide DNA fragments. Thus, in the final assembly of the Mc1r gene, each student has a unique 100 nucleotide sequence, followed by a gap of 100 (unknown) nucleotides (the middle of their DNA fragment), followed by 100 more nucleotides. I used the rand() function to generate a random distribution of fragments for my 20 students, and then shared the Google Sheet with all of them during class.
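For readers who would rather script this step than use rand() in a spreadsheet, the following Python sketch shows one way the paired reads could be simulated. It is an assumption-laden illustration, not the procedure I used: the function name, the fixed random seed, and the placeholder reference (random letters, not the real Mc1r sequence) are all hypothetical.

```python
import random


def simulate_paired_reads(reference, n_fragments, fragment_len=300, read_len=100, seed=0):
    """Cut random fixed-length fragments and keep only their two ends.

    Returns a list of (start, left_read, right_read) tuples, where `start`
    is the 0-based position of the fragment in `reference`.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    last_start = len(reference) - fragment_len
    pairs = []
    for _ in range(n_fragments):
        start = rng.randint(0, last_start)  # inclusive on both ends
        fragment = reference[start:start + fragment_len]
        left = fragment[:read_len]      # first 100 nucleotides
        right = fragment[-read_len:]    # last 100 nucleotides
        pairs.append((start, left, right))
    return pairs


# Placeholder reference: random letters, NOT the real Mc1r sequence.
placeholder_rng = random.Random(1)
reference = "".join(placeholder_rng.choice("ACGT") for _ in range(954))

for start, left, right in simulate_paired_reads(reference, n_fragments=3):
    print(start, left[:10] + "...", right[:10] + "...")
```

Random placement like this does not guarantee that every position of the gene is covered by at least one read, so the fragments may need to be checked or regenerated before handing them to students.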
In this case, because we worked on this exercise in class and were not using computer algorithms to find overlaps among each other's sequence reads (Google Sheets is not well-equipped to do this), I provided the consensus sequence of the Mc1r gene in the top row of the sheet. My immediate goal was not to have them practice the deep sequence assembly part of the process, but rather to use a spreadsheet as a tool that gives students the ability to hand-align short sequences to a larger consensus sequence. The students used the shared Google Sheet to align their short sequences (rows 4–27) to the consensus (row 2) in real time by cutting and pasting each sequence into the appropriate series of columns (the gene is 954 nucleotides - or columns - long).
Screenshot taken while students were editing the shared Google Sheet
After the multiple sequence alignment is produced, one region looks like this:
Example analysis of a multiple sequence alignment
where the top row is the known Mc1r consensus sequence, the second row is the consensus sequence derived from the short sequence reads the students are working with, and the bottom three sequences are individual deep seq reads. Note that position 919 has two different nucleotides: two short reads have a "G" (yellow), and one has an "A." I constructed this difference because our genome sequence manuscript noted that the ancient human genome contained a mutation in the Mc1r gene that might have produced red hair (a G instead of the usual A at position 919). Incorporated into this exercise, this finding allows students to discuss the concept of developing a consensus sequence and how low "read depth" (the consensus sequence at position 919 here is only generated from three independent sequences, only two of which agree) could produce a misleading consensus sequence.
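The two-out-of-three situation at position 919 can be sketched in a few lines of Python. This is only an illustration of the discussion point: the function name is hypothetical, and the minimum depth of 5 is an arbitrary threshold I chose for the example, not a standard value.

```python
from collections import Counter


def call_column(bases, min_depth=5):
    """Summarize one alignment column.

    Returns (consensus base, votes for it, read depth, low_depth flag).
    `min_depth` is an arbitrary illustrative threshold.
    """
    counts = Counter(bases)
    # most_common(1) returns a one-item list of (base, count) pairs.
    base, votes = counts.most_common(1)[0]
    depth = len(bases)
    return base, votes, depth, depth < min_depth


# The three reads covering position 919 in the example: two "G" and one "A".
base, votes, depth, low_depth = call_column(["G", "G", "A"])
print(f"consensus={base} votes={votes}/{depth} low_depth={low_depth}")
```

Flagging low-depth columns this way gives students something concrete to argue about: the consensus is still "G," but it rests on only two agreeing reads.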
Conclusion One
A benefit of using a spreadsheet like Google Sheets for this type of activity is that students can work interactively with more data, hand-aligning multiple sequences simultaneously by cut-and-paste (or drag-and-drop). Getting a grasp of the solution to the problem, as well as the magnitude of the problem in real life situations, is next to impossible (in my hands) to convey using a traditional paper-based exercise, or (worse) by drawing on a static white board.
Additional exercise components
Another benefit of using a Google Sheet, instead of showing students existing web-based sequence alignment tools, is that the interface is more interactive, intuitive, and also lets us easily practice calculating some basic information about our alignment. With our spreadsheet-based multiple sequence alignment, produced collaboratively by all of the students, we can:
- Calculate and plot the read depth of the sequence at every position
- Find the average, maximum and minimum read depths
- Quickly identify positions where a deep sequence read doesn't match the consensus
1. Read depth and plot
At each nucleotide position (1, 2, …), we just want to know how many sequences contain a letter at that position. This is easily accomplished using the countif() function. For example, to calculate read depth at position 918 (for this formula, we'll assume that this position is in row 919 of the sheet, and that the 22 deep seq reads are in columns C–X):
=countif(C919:X919,"?")
where each cell in the range C919:X919 that contains a single character (specified by the wildcard ?) is tallied. The cell would display the number 3 (the read depth). This formula can be rapidly copied and pasted (filled) to calculate read depth for every nucleotide position.
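The same tally can be sketched outside the spreadsheet. In this minimal Python version (the function name and the toy alignment are made up for illustration), each read is a row, each position is a column, and an empty string stands for a blank cell.

```python
def read_depth(alignment):
    """Count, for each column, how many reads have a letter there.

    `alignment` is a list of equal-length rows; "" marks a blank cell.
    """
    n_columns = len(alignment[0])
    return [
        sum(1 for row in alignment if row[col] != "")
        for col in range(n_columns)
    ]


# Toy alignment: three reads (rows) across five positions (columns).
alignment = [
    ["A", "C", "G", "",  ""],
    ["",  "C", "G", "T", ""],
    ["",  "",  "A", "T", "C"],
]
print(read_depth(alignment))
```

One small difference from the countif() formula: the "?" wildcard counts only cells holding exactly one character, whereas this sketch counts any non-empty cell.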
We can then plot the read depth across the sequence:
2. Maximum, minimum, and average read depths
Across all 954 nucleotides, we can quickly identify the maximum, minimum, and average read depths by entering the following three formulae into three different cells:
=max(<range-containing-read-depth-values>)
=min(<range-containing-read-depth-values>)
=average(<range-containing-read-depth-values>)
where <range-containing-read-depth-values> would be replaced with an expression like Z1:Z954, indicating the cell range where the read depth values are located. The brackets (< >) should not be included in the entry.
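For comparison, the same three summary values can be computed in Python. The function name and the list of depths below are hypothetical values for illustration, not data from the class exercise.

```python
from statistics import mean


def summarize_depths(depths):
    """Return the maximum, minimum, and average of a list of read depths."""
    return {"max": max(depths), "min": min(depths), "average": mean(depths)}


# Hypothetical read depths, one value per nucleotide position.
depths = [1, 2, 3, 2, 1, 4, 4, 3]
print(summarize_depths(depths))
```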
3. Mismatches to the consensus sequence
It can be time-consuming (and error-prone) to visually compare each 100 nucleotide sequence to the consensus sequence to find any places where the two do not agree (like at position 919). A faster way to accomplish this using a spreadsheet is with the if() function, where you specify a condition that has to be met. When true, one value you specify is displayed in the cell; when false, a different value is displayed:
=if(<cell-containing-deep-seq-nucleotide> = <cell-containing-consensus-seq-nucleotide>, 0, 1)
In this case, we're testing the condition where our deep sequence nucleotide matches (is equal to) the consensus nucleotide at that position. After the first comma, we enter the value-if-true (the value we want displayed in the cell if the two nucleotides match): in this case, a zero. After the second comma, we enter the value-if-false, which will be shown in every cell where the two nucleotides are not the same: in this case, a one. Filling this formula across all 954 columns will let us visually scan across that row to spot any places where a "1" is shown, indicating a position of mismatch. There is a more advanced technique to even more quickly learn which columns contain a 1 and not a 0, but I'll leave that for another time.
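As a rough Python counterpart to the if() comparison, the sketch below lists the positions where an aligned read disagrees with the consensus. The function name is hypothetical, and the letters in the toy region are made up, apart from the A/G difference at position 919 described above.

```python
def mismatch_positions(read, consensus, start=1):
    """Return the positions (numbered from `start`) where a read disagrees.

    `read` and `consensus` are equal-length sequences; "" in `read` marks a
    column the read does not cover, and those columns are skipped.
    """
    return [
        start + i
        for i, (r, c) in enumerate(zip(read, consensus))
        if r != "" and r != c
    ]


# Toy region standing in for positions 917-921 (letters are illustrative).
consensus = ["C", "T", "A", "G", "C"]
read      = ["",  "T", "G", "G", "C"]
print(mismatch_positions(read, consensus, start=917))
```

Note that this sketch skips columns the read does not cover, whereas the if() formula as written would show a 1 wherever a blank cell is compared with a consensus letter.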
The punch line of the exercise
At the end of the exercise, the students are able to copy the consensus sequence into the web-based Basic Local Alignment Search Tool (BLAST) to search a nucleotide sequence database for matches to known sequences. This is where students discover that the sequence they've been working with is the human Mc1r gene. Finally, based on the local alignment in the figure above, they work together to identify which of the individual deep seq reads came from a modern human (A at position 919) and which from the ancient human (G at position 919).
At this point, I concluded the exercise by demonstrating one last web tool for use in analyzing DNA sequences: Transeq, which is a program that translates a DNA sequence in all six reading frames. I translate both the original consensus sequence and the student consensus (with G at 919), so that students observe how this nucleotide sequence difference impacts the sequence of the encoded protein.
We then summarize by discussing how difficult the entire process would have been - and how much longer it would have taken - if I had not initially provided the known consensus sequence, which is always what the genomicist is seeking to identify with deep sequence reads in the first place.
Conclusion
By using a shared spreadsheet instead of a paper process, students can tackle an involved problem more quickly by working together on independent tablet computers. Further, I integrated the use of several web tools that practicing geneticists use on a regular basis as part of their research. The use of a spreadsheet to make calculations and graphs and to analyze data helps students develop quantitative analysis skills that they can apply in other courses (and, of course, in life).
Sunday, November 6, 2016
Syllabi: mobile tech and the digital divide
I cannot believe how quickly this semester is flying by (and, likewise, how long it has been since I posted here!) My forays into tablet pedagogy have brought new opportunities (read: additional work), including serving as co-Chair of Fresno State's DISCOVERe Taskforce subcommittee on assessment, as well as now being the co-Coordinator of one of campus' newest Faculty Learning Communities: Advanced DISCOVERe. In fact, in both of these "co-" situations, my colleague Mary Paul and I are working together to help identify and then disseminate the value that educational technology can provide both students and instructors. Watch us discuss this briefly here.
As I teach more tablet-based classes, I have identified some specific items that I suggest including on any syllabus for a class that intends to have all students use any type of mobile technology (i.e. smartphones, tablets, laptops), whether it is a program like DISCOVERe or whether it is a BYOD (bring your own device) situation. I have written about such syllabi before, but here are some new thoughts:
If you intend to have students use technology to complete any assignment, be firm and state on your syllabus that the technology must be used, if that is your philosophy and truly your intention. If you can't defend the use of the technology over a traditional (e.g. paper) process, then the use of technology might not be warranted.
Even in the DISCOVERe program, where each student is required to bring a tablet to each class meeting, it is not uncommon to find smartphones and/or laptops being substituted. If, as the instructor, one feels particularly strongly about allowing one or two of these technologies, but not the other(s), then it is good to clearly articulate in the syllabus what will happen should a student not be using their tablet during class.
I've described before (e.g. here) how I moved test-taking into the digital realm, requiring students to annotate a PDF of the exam. What I realized at the end of last semester was that students were taking two liberties that I hadn't explicitly dealt with in my syllabus, and so (at the time), I felt like I had no recourse to intervene. Those were:
Use of multiple devices
I occasionally saw students using a laptop and tablet, and/or smartphone, during an exam. This gave me the uneasy feeling that some backchannel communication might have been going on during the exam. If you want to limit student digital access to each other during an exam, limiting the number of devices allowed to be used might at least make it less efficient to carry on digital conversations with others.
Use of tablet keyboards
Some students had purchased external keyboards for their tablets - something not required in the DISCOVERe program. This made me immediately concerned about whether this was putting some students (who might not be able or willing to afford that extra expense) at a disadvantage. I currently ban the use of audio during the course, because I feel that I would have to require students to use earphones so as not to distract other students. I don't think that requiring every student to purchase headphones is reasonable (even if most already have them - some will not), and so I am currently pondering whether to ban external keyboards as well.
I hope that providing these thoughts will help you strategize what you place in your syllabus for the 21st century classroom!