Long Reads and the Return of Reference Genomes

The availability of inexpensive, massively parallel short-read sequencers shifted the focus away from high-quality reference genome production and toward resequencing and draft genome assemblies, emphasizing quantity over quality. Long-read sequencing, however, now makes it possible to achieve both quantity and quality.

Reference genomes were originally deemed a necessary tool to study an organism or family of organisms, but many scientists held that the production of a single reference assembly would be sufficient for any given species. As the field of genomics has matured, however, researchers have realized that multiple polished assemblies are critical for applications ranging from personalized medicine to comparative genomics. Current efforts to produce population-specific reference assemblies for the human genome are expected to reveal a significant amount of previously undetected natural genetic variation among individuals. Similarly, scientists are moving beyond the days of aligning, for instance, goat sequence to a bovine assembly for lack of a proper reference. They are now generating multiple reference-grade assemblies for agriculturally important plants and animals as a foundation for efforts to improve the health, hardiness and yield of these species.

These reference assemblies are providing critical new information about underlying biology, genetic mechanisms of interest and more. Scientists involved in precision medicine contend that eventually each person will be his or her own best reference: that comparing each of us to a third-party reference assembly, even a high-quality one, will be less effective for targeting treatments and understanding disease than sequencing every individual to reference quality.

The challenge with short-read sequence data lies in mapping and aligning the reads. Genomes of all sizes include highly repetitive regions, often hundreds or thousands of bases in length. Because short reads are typically shorter than 500 bases, they cannot span these challenging regions. These regions, which may have great importance for understanding disease or other phenotypes of interest, confound the assembly of short reads and collapse into themselves in the final assembly. Pseudogenes may not be distinguishable from the genes they mirror, and even short stretches of high-identity sequence can make it impossible for reads to be mapped accurately.
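The mapping problem described above can be illustrated with a minimal Python sketch. It assumes a made-up toy reference containing two copies of a 12-base repeat and uses exact string matching in place of a real aligner; the sequences and the helper name `find_all` are hypothetical and chosen only for illustration.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy reference: two copies of one repeat separated by unique sequence.
REPEAT = "GATTACAGGCTA"
reference = "TTGCC" + REPEAT + "AACGTTCG" + REPEAT + "CCATG"

# A short read that falls entirely inside the repeat matches both copies,
# so its true origin cannot be determined.
short_read = REPEAT[2:10]

# A longer read that spans the first copy plus unique flanking bases
# on each side matches in only one place.
long_read = reference[3:20]

print("short read positions:", find_all(reference, short_read))
print("long read positions:", find_all(reference, long_read))
```

In this toy case the short read is placed at two positions and the long read at one. Real aligners tolerate mismatches and score candidate placements, but the underlying ambiguity is the same.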

In contrast, long-read sequencing can produce reads that are tens of kilobases in length, fully spanning these difficult regions or including enough unique sequence information to facilitate accurate mapping. Scientists studying microbial genomes with long-read sequencing often produce fully closed assemblies on their first try, with the organism's genome in a single piece and the accessory genome represented as well. In more complex genomes, researchers have used long-read sequencing to generate the highest-quality assemblies, often filling gaps in existing reference genomes. Long-read sequencing has produced the most contiguous assemblies ever generated, providing an essential resource for scientists.

To maximize the information obtained from long-read sequencing, scientists have paired it with automated DNA size selection. This step removes smaller fragments from libraries, allowing sequencers to work with the longest fragments and produce longer reads. By combining these technologies, researchers have demonstrated the ability to significantly increase the average DNA fragment lengths sequenced, leading to even higher-quality assemblies.

Building better assemblies

Loomis, Eid et al.1 presented the first known sequence of the gene responsible for fragile X syndrome, a repeat expansion disorder. Previously intractable with short-read sequencers, the gene and its hundreds of triplet repeats were fully sequenced with Single Molecule, Real-Time (SMRT) Sequencing from Pacific Biosciences (Menlo Park, Calif.). The authors noted that getting an accurate repeat count is critical for patient prognosis: having more than 200 repeats causes fragile X syndrome, while 55–200 copies are indicative of a related but different syndrome.
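To make the repeat-count thresholds concrete, the following Python sketch counts the longest uninterrupted run of a triplet in a sequence and sorts the count into the ranges given above. It is an illustration only, not the authors' method: the function names are hypothetical, the CGG unit is an assumption (the triplet expanded in fragile X), the toy sequence is invented, and real alleles can contain interruptions that this simple count would split into separate runs.

```python
import re


def longest_triplet_run(sequence: str, unit: str = "CGG") -> int:
    """Count the copies of `unit` in the longest uninterrupted run within `sequence`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def classify_repeat_count(count: int) -> str:
    """Sort a repeat count into the ranges quoted in the text."""
    if count > 200:
        return "more than 200 repeats (fragile X syndrome range)"
    if count >= 55:
        return "55-200 repeats (related-syndrome range)"
    return "fewer than 55 repeats"


# Hypothetical read: 60 triplets flanked by a few unrelated bases.
toy_read = "ACGT" + "CGG" * 60 + "TTGA"
count = longest_triplet_run(toy_read)
print(count, "->", classify_repeat_count(count))
```

A count like this is only meaningful when a single read covers the whole repeat tract plus flanking sequence, which is why read length matters for this gene.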

In another example, scientists from the Icahn School of Medicine at Mount Sinai (New York, N.Y.) determined that many more structural variants could be detected in the human genome from long-read data than from short-read data.2 These variants are important for human health, especially in interrogating cancer. The findings suggest that estimates of structural variation based on short-read data alone may significantly underrepresent the actual variation across a genome.

In a large study of microbes, scientists at the U.S. Department of Agriculture (Washington, D.C.) and the National Biodefense Analysis and Countermeasures Center (Frederick, Md.) reported that long-read sequencing has made it possible to produce finished microbial genome assemblies in an automated pipeline.3 Based on a comprehensive assessment of repetitive sequence across more than 2200 microbes, the researchers predicted that SMRT Sequencing could be used to automatically close at least 70% of known bacterial and archaeal genomes. This could be crucial during a pandemic or foodborne outbreak, where correctly and rapidly identifying a microbial strain can inform the choice of treatment as well as epidemiological studies.

Long reads plus automated sizing

While long-read sequencing on its own can produce impressive assemblies, scientists have found that using automated DNA size selection to remove smaller fragments prior to sequencing has a marked impact on the read lengths generated. Pairing BluePippin size selection from Sage Science (Beverly, Mass.) with SMRT Sequencing has doubled the average length of sequenced DNA fragments in some cases, creating assemblies with unprecedented contiguity and accuracy. In an effort to improve the quality of the human reference genome, scientists at the University of Washington (Seattle, Wash.) used BluePippin-sized PacBio sequence data to close or shrink more than half of the remaining gaps in the assembly, most of which included highly repetitive sequence.4

Scientists at the Icahn School of Medicine at Mount Sinai, Cold Spring Harbor Laboratory (Cold Spring Harbor, N.Y.) and European Molecular Biology Laboratory (Heidelberg, Germany) conducted the first analysis of a diploid human genome using single molecule technologies, including SMRT Sequencing, and produced “the most contiguous clone-free human genome assembly to date.”5 The study focused on the well-characterized NA12878 genome and produced an assembly with scaffold N50 values close to 30 Mb. Scientists used the BluePippin platform to size-select libraries, removing fragments smaller than 7 kb, prior to sequencing with the PacBio system. “Without selection, smaller 2000–7000 bp molecules dominate the zero-mode waveguide loading distribution, decreasing the sub-read length” that can be achieved with the sequencer, the authors noted.
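The effect of removing fragments below a cutoff can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the 7 kb threshold comes from the study described above, but the fragment lengths are invented, the function name `size_select` is hypothetical, and a real size-selection run does not cut off as sharply as a list filter.

```python
from statistics import mean

MIN_FRAGMENT_BP = 7000  # cutoff quoted in the text


def size_select(fragment_lengths, min_length=MIN_FRAGMENT_BP):
    """Keep only fragments at or above the cutoff, as an idealized size-selection step."""
    return [n for n in fragment_lengths if n >= min_length]


# Hypothetical library: fragment lengths in base pairs.
library = [2000, 3500, 5000, 6500, 8000, 11000, 16000, 24000]
selected = size_select(library)

print("mean before selection:", mean(library))
print("mean after selection:", mean(selected))
```

With the small fragments gone, the remaining library is made up entirely of molecules long enough to yield long reads, which raises the average length of what is sequenced.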

In separate work, researchers from the Norwegian Sequencing Centre (Oslo, Norway) ran several libraries on the PacBio platform with and without BluePippin sizing to determine the impact of size selection.6 In one test, the average DNA insert length doubled from 3000 bases to 6000 (in the years since this analysis was done, typical DNA insert lengths sequenced have increased to about 10 kb without size selection and 15 kb with size selection). Without size selection, 50% of bases sequenced in this project were in reads 5 kb or smaller; after size selection, half of all bases were in reads of at least 10 kb.
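The "half of all bases" figures above correspond to a statistic commonly called read N50. A minimal Python sketch of the calculation follows. The read-length lists are invented so that they reproduce the 5 kb and 10 kb values from the text; they are not the Norwegian Sequencing Centre data, and the function name `n50` is chosen here for illustration.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical read lengths (bases) without and with size selection.
without_selection = [2000, 3000, 3000, 4000, 5000, 5000, 8000]
with_selection = [8000, 9000, 10000, 10000, 12000, 15000]

print("N50 without size selection:", n50(without_selection))
print("N50 with size selection:", n50(with_selection))
```

Unlike a simple average, this measure weights each read by the bases it contributes, so it is less affected by large numbers of very short reads.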

Looking ahead

As long-read sequencing and automated DNA size selection continue to advance, scientists can expect read lengths to increase further. That will allow for the automated completion and closure of even larger genomes, leading to push-button reference genomes. Ultimately, it will be feasible to generate platinum-quality genomes that will enable researchers to investigate the full universe of genetic diversity in any species.

References

  1. Loomis, E.; Eid, J. et al. http://genome.cshlp.org/content/23/1/121.full
  2. Ritz, A.; Bashir, A. et al. http://bioinformatics.oxfordjournals.org/content/30/24/3458.full
  3. Koren, S.; Harhay, G.P. et al. http://www.genomebiology.com/2013/14/9/R101
  4. Chaisson, M.J.P.; Huddleston, J. et al. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4317254/
  5. Pendleton, M.; Sebra, R. et al. http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3454.html
  6. Nederbragt, L. https://flxlexblog.wordpress.com/2013/06/19/longing-for-the-longestreads-pacbio-and-bluepippin/

T. Chris Boles, Ph.D., is chief scientific officer, Sage Science, Inc., Beverly, Mass., U.S.A. Jonas Korlach, Ph.D., is chief scientific officer, Pacific Biosciences, 1380 Willow Rd., Menlo Park, Calif. 94025, U.S.A.; tel.: 650-521-8000.
