There are 6 billion stories in the human genome. If you want to read one of them, you’re going to need a sequencer.
But what to sequence? Researchers interested in the mysteries of the human genome have three options: targeted sequencing of small panels, targeted whole exome sequencing (WES) and whole genome sequencing (WGS). Targeted sequencing of small panels typically covers anything from individual exons to hundreds of genes judged likely to be mutated in a given condition. WES is a targeted approach that decodes the “exome,” the most easily interpreted fraction of the human genome, which codes for protein and represents about 1% to 2% of the genome overall. WGS sequences the genome in its entirety, including those noncoding sequences whose functional roles are only now coming to light.
According to Joel Fellis, associate director of product marketing at Illumina, WES is currently the most popular strategy for human genome sequencing, but WGS is gaining rapidly. “It’s just a matter of the trade-offs: A whole genome giving you better genome quality, better coverage, more sensitivity [vs.] whole exome [sequencing] allowing you to process more samples at lower price point,” he says.
STAT-Seq relies heavily on data analysis to whittle the list of identified variants to a manageable number. Those algorithms cross-reference patient symptoms with known mutations to home in on the variants most likely to underlie a patient’s symptoms. For the most part, those mutations reside within protein-coding regions. So, why not simply sequence the exome? Kingsmore cites three reasons: “One, exomes are imperfect, they miss a lot of exons; two, they are slower, because you have to do enrichment steps; and three, we do pick up a lot of intronic and nongenic mutations.”
Just this past week (September 30), Kingsmore’s team published an update to its method that cuts the time from patient enrollment to diagnosis nearly in half, to just 26 hours. In part, that’s due to streamlining the sequencing process itself, Kingsmore says, but most of the improvements were algorithmic: analysis time was sliced from about 17.5 hours to 70 minutes.
The long and short of it
STAT-Seq, and most other WGS efforts, for that matter, rely on Illumina sequencing. That method produces an extraordinary number of reads, all of them short, on the order of a few hundred bases. In a sequence as complex and repetitive as the human genome, that short length can cause problems, as some reads cannot be unambiguously mapped to a single location in the genome. Those reads also can complicate the identification and mapping of structural variants, because they simply are too short to definitively determine if, say, a region is duplicated, inverted or translocated in the genome.
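The mapping ambiguity described above can be illustrated with a toy example. This is a minimal sketch, not how production aligners work: it assumes exact string matching on an invented 27-base reference, whereas real tools use indexed references and tolerate mismatches. The function name `find_mappings` and all sequences are hypothetical.

```python
def find_mappings(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Invented toy reference: the same repeat appears twice, separated and
# flanked by unique sequence.
REPEAT = "ACGTACGT"
reference = "TTG" + REPEAT + "CCAGG" + REPEAT + "AAC"

# A short read that lies entirely inside the repeat fits in two places.
short_read = REPEAT

# A longer read that extends into the unique sequence fits in only one.
long_read = REPEAT + "CCAGG"

short_hits = find_mappings(reference, short_read)
long_hits = find_mappings(reference, long_read)

print("short read maps to:", short_hits)
print("long read maps to:", long_hits)
```

The short read lands at two positions and so cannot be placed unambiguously, while the longer read reaches past the repeat into unique sequence and maps to a single location.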
According to Roopom Banerjee, CEO of RainDance Technologies, which has commercialized a droplet-based method for targeted sequencing, these problems are related to the fact that most genome-analysis tools require a reference genome. Some structural variations deviate so far from the norm that they simply cannot be mapped to the reference, he explains. “One of the holy grails of human genetics has been the ability to go reassemble genomes without having a reference.” Using longer reads, it may be possible to do just that.
Banerjee says RainDance has announced a collaboration with long-read sequencing firm Pacific Biosciences (whose reads now average about 10 to 15 kilobases) to “basically reassemble the genome in bigger chunks in a way that doesn’t require a reference genome.”
The company also is at work on a whole genome, single-cell genomics strategy, Banerjee says. No time line has been announced yet, “but suffice to say, at some point in 2015 we’ll be in beta testing and be generating data with lead customers on these applications.”
A base of knowledge
One of the challenges in WGS (and for that matter, WES) is that every genome contains millions upon millions of bases that differ from other genomes. According to the latest report from the 1000 Genomes Project, “a typical genome differs from the reference human genome at 4.1 million to 5.0 million sites,” the vast majority of which are single-nucleotide polymorphisms, plus some 2,500 structural variants.
However a genome sequence is obtained, it’s of little value until it is analyzed. And that can be especially challenging, says Rupert Yip, director of the Ingenuity Variant Analysis product line at QIAGEN, as the typical genome harbors some 3 to 5 million variants. “To understand which of those mutations actually is responsible for [causing] a disease is really difficult, a needle-in-the-haystack problem,” he explains.
The Ingenuity Variant Analysis web application uses an expert-curated “knowledge base” to help researchers sift through that genetic haystack. “We have an army of Ph.D.s and M.D.s who read the literature and document all known genes, variants and mutations, and the relationships these molecules have with each other,” says Yip. Researchers can simply upload a variant call file (VCF) describing the variants in a particular genome to the web-based application, at which point the knowledge base kicks in to filter them.
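The general idea of filtering a VCF against curated knowledge can be sketched in a few lines. This is an illustration only and does not represent how Ingenuity Variant Analysis works: the knowledge base is a hypothetical in-memory dictionary, the variants and the disease label are invented, and only the first five standard VCF columns (CHROM, POS, ID, REF, ALT) are read.

```python
# Invented three-variant VCF; columns are tab-separated as in the VCF format.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\n"
    "chr2\t2000\t.\tC\tT,G\t60\tPASS\t.\n"
    "chr7\t3000\t.\tG\tA\t45\tPASS\t.\n"
)

# Hypothetical knowledge base: (chrom, pos, ref, alt) -> curated annotation.
KNOWLEDGE_BASE = {
    ("chr2", 2000, "C", "T"): "hypothetical disease X",
}


def parse_vcf(text):
    """Yield (chrom, pos, ref, alt) for each alternate allele in the VCF text."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header row
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # ALT may list several alleles
            yield (chrom, int(pos), ref, allele)


def filter_known(variants, knowledge_base):
    """Keep only variants that have an entry in the knowledge base."""
    return [(v, knowledge_base[v]) for v in variants if v in knowledge_base]


all_variants = list(parse_vcf(VCF_TEXT))
hits = filter_known(all_variants, KNOWLEDGE_BASE)

print(len(all_variants), "alleles parsed;", len(hits), "with a known association")
for variant, annotation in hits:
    print(variant, "->", annotation)
```

A real pipeline would also weigh variant frequency, inheritance pattern and phenotype rather than rely on exact lookup alone.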
Still, Yip estimates that fewer than half the human genomes analyzed with his software yield a definitive cause. “If you sequenced 100 people, probably around 25% to 40% of the genomes will come back with a clear [clinical] determination,” he says. “The rest we simply don’t know enough about. But the application provides scientist[s] with clear clues for where to look, thus greatly speeding up their research toward clear conclusions.”
Of course, what scientists know is constantly changing. Thus, users can always retest datasets in light of new data or hypotheses, such as new genetic associations, variant frequencies, inheritance patterns or phenotypes. “We have all these knobs you can tweak to come up with a prioritized list,” states Yip.
But to tweak those knobs, you first need to collect your sequence, and there exists a wealth of reagents and tools to help. Whole genome amplification (WGA) kits, for instance, enable researchers to squeeze more data from their precious samples. WGA, says Cynthia Potter, global product manager for genomics at GE Healthcare’s Life Sciences business, “is an insurance policy”—a way to ensure there’s enough material from a sample to reanalyze later, if necessary. (GE Healthcare offers several such kits under its GenomiPhi brand name.)
In the 15 years since the first human genome was sequenced, WGS has transitioned from Herculean to (somewhat) humdrum, and researchers today barely bat an eye at the idea of sequencing one or more human genomes. Now, says Kingsmore, the emerging challenge is one of scale. “There are literally thousands of kids who need this technology,” he says. “How can we scale this to meet their needs, and then how can we get information to doctors who are not trained to analyze or understand genome information?”
For those who would make use of those 6 billion stories, that’s a pressing problem, indeed.
1000 Genomes Project Consortium, “A global reference for human genetic variation,” Nature, 526:68-74, 2015.
Saunders, CJ, et al., “Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units,” Sci Transl Med, 4(154):154ra135, 2012. [PMID: 23035047]
Miller, NA, et al., “A 26-hour system of highly sensitive whole genome sequencing for emergency management of genetic diseases,” Genome Med, 7(100):1-16, 2015.