DNA sequencing is a field in molecular biology with uses ranging from understanding the genetic basis of cancer, to diagnosing genetic predispositions, to understanding the basic way in which cells function at the molecular level. I recently joined the International Cancer Genome Consortium (ICGC), where we aim to catalogue the genetic makeup of as many different cancer types as possible. To do this we must employ sequencing technologies, which allow genomes to be determined experimentally. When the Human Genome Project first sequenced the entire human genome, it employed a technique called Sanger sequencing, which essentially steps through every nucleotide, one after another, and determines whether it is an A, C, G or T - the four nucleotide types from which DNA sequences are constructed.
Unfortunately, the Sanger approach is both slow and costly. The Human Genome Project cost around $3 billion, took many years, and employed hundreds of scientists. With technological improvements this could now be done for several million dollars in a fraction of the time. However, that is still far too costly for sequencing hundreds or thousands of unique genomes.
Recently a different approach has emerged: high throughput sequencing technologies, which allow an entire human genome to be sequenced for $10,000, by a couple of scientists, in under a week. As technology improves it probably isn't unrealistic to expect that in the coming decade this could be done for mere hundreds of dollars. Several competing high throughput technologies have emerged, but they all operate in a similar fashion. Rather than sequencing an entire DNA strand from start to finish, they fragment the DNA into millions of small pieces, called 'reads', which are on the order of 50 nucleotides in length. Each of these fragments is sequenced independently, and many can be sequenced in parallel, which takes a fraction of the time and cost of traditional Sanger sequencing.
So now we've sequenced millions of tiny fragments - what do we do with them? There are two approaches to utilizing this data. The first approach is to map the fragments to a reference genome. Suppose we have the entire human genome, thanks to the Human Genome Project. Then we can take each read and look at where it maximally overlaps with the reference. By looking at where the mapped reads sit on the reference genome we can see what the differences are. For example, a read might differ from its mapped location on the reference by a few nucleotides. These differing nucleotides are called Single Nucleotide Polymorphisms, or SNPs (pronounced 'snips'), and tell us what's different between you and me. The SNPs, which constitute only a tiny fraction of the genome, give us all sorts of useful information, like predispositions, mutations and genetic traits.
The second approach to utilizing reads is to attempt 'de novo assembly'. Here, we take all the reads and look at how they overlap with one another, as opposed to how they overlap with a reference. If the last few nucleotides of one read match the first few of another, then we can conclude that those pieces fit together. We are essentially left with a huge jigsaw puzzle that we must piece together. The advantage of this approach is that it does not require a reference, so it can be used to sequence large chunks of a genome in the absence of any a priori information about that genome. Both of these approaches are useful. Mapping is useful when we have a reference genome to compare against, while de novo assembly is useful when we don't.
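To make the two approaches concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions of my own: reads are placed by brute force wherever they have the fewest mismatches, insertions and deletions are ignored, and the function names (map_read, differences, overlap, merge) and the tiny example sequences are all made up for this post. Real mapping and assembly tools use indexed data structures and far more careful scoring.

```python
from typing import List, Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str,
             max_mismatches: int = 2) -> Optional[Tuple[int, int]]:
    """Slide the read along the reference and return (position, mismatches)
    for the first placement with the fewest mismatches, or None if every
    placement has more than max_mismatches."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


def differences(read: str, reference: str,
                pos: int) -> List[Tuple[int, str, str]]:
    """List (reference position, reference base, read base) for every
    nucleotide where the mapped read disagrees with the reference."""
    return [(pos + i, reference[pos + i], base)
            for i, base in enumerate(read)
            if base != reference[pos + i]]


def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of
    `right`; 0 if the longest such overlap is shorter than min_len."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two reads along their overlap, or return None if they don't fit."""
    k = overlap(left, right, min_len)
    return left + right[k:] if k else None


# Mapping: place a read on a (made-up) reference and list the differences.
reference = "ACGTTGCATGACCGTA"
read = "GCATCACC"
print(map_read(read, reference))        # (5, 1)
print(differences(read, reference, 5))  # [(9, 'G', 'C')]

# De novo assembly: fit two reads together without using the reference.
print(merge("ACGTTGCA", "TGCATGAC"))    # ACGTTGCATGAC
```

In the mapping example the read sits best at position 5 of the reference with a single differing nucleotide, which is the kind of candidate difference described above. In the assembly example the last four nucleotides of the first read match the first four of the second, so the two pieces of the puzzle snap together.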
Unfortunately, both of these approaches have some limitations. In particular, data from present-day sequencing technologies have quite high error rates. This makes it more difficult to map a read against a reference, and it also makes it more difficult to perform de novo assembly, since the pieces of the puzzle don't always fit together. So error detection and correction mechanisms have to be employed. Having said that, significant improvements are being made to current high throughput technologies, which are incrementally reducing error rates and increasing throughput. That gives us more pieces of the puzzle, with fewer errors, and therefore a higher likelihood of finding matching pieces.
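One simple way to see why having more reads helps with errors is a majority vote: if several reads cover the same position, an isolated sequencing error gets outvoted. The sketch below is my own illustration of that idea, not a description of what any particular sequencing pipeline does. The function name consensus, the read placements and the example reads are hypothetical, and it assumes the reads have already been placed at known positions.

```python
from collections import Counter
from typing import List, Tuple


def consensus(placed_reads: List[Tuple[int, str]], length: int) -> str:
    """Majority-vote base at each position from reads placed at known
    offsets. Positions not covered by any read are reported as 'N'."""
    columns = [Counter() for _ in range(length)]
    for pos, read in placed_reads:
        for i, base in enumerate(read):
            if 0 <= pos + i < length:
                columns[pos + i][base] += 1
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


# Three overlapping reads of the made-up sequence ACGTTGCA.
# The second read carries one error (an A where a T should be).
reads = [(0, "ACGTT"), (2, "GTAGC"), (3, "TTGCA")]
print(consensus(reads, 8))  # ACGTTGCA
```

At the position with the error, two reads say T and one says A, so the vote recovers the T. With only a single read covering that position there would be nothing to outvote the error, which is why higher throughput and lower error rates both help.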
The future of DNA sequencing is looking very promising. We can do in a week what previously took years, and we can do it at a cost that is many orders of magnitude lower. As this technology becomes more widespread I'm sure that economies of scale and technological improvements will continue to put downward pressure on cost and upward pressure on throughput and data quality.
Of course, the accessibility of this technology carries with it some significant moral dilemmas. Who should be allowed access to this kind of technology? Employers? Health insurance companies? Life insurance agencies? Already there are private companies like 23andMe which allow a person's SNPs to be sequenced for a few hundred dollars, revealing a person's predisposition to physical and mental illnesses, racial background, likelihood of being intelligent, or any number of other traits that an unscrupulous employer might be interested in. There is enormous potential for misuse, which to my mind policy makers should consider addressing sooner rather than later.