Long-Range Sequencing is Required to Unlock the Full Genome

Next-generation sequencing (NGS) provides unprecedented access to one of biology’s greatest mysteries—the genome. Information obtained from sequencing allows researchers to identify changes in genes, associate them with diseases and phenotypes and uncover potential new therapeutic targets.

Sequencing technologies have become faster and less expensive, but limitations remain. Short-read sequencing generates data inexpensively and with low single-base-pair error rates, but produces only a partial picture of the underlying DNA. Thus, many diseases that are likely the result of a genetic change are not yet assigned to a mutation.

While much of the genome has remained inaccessible due to the confines of current tools, newer technologies are utilizing long-range sequencing to fill in the information gaps inherent with short-read sequencing.

Figure 1 – Critical long-range genomic information can be unlocked with the Chromium system, a microfluidics-based benchtop instrument powered by GemCode technology.

Compared with short-read sequencers, most long-range sequencers have a higher per-base cost and a higher error rate. In contrast, the GemCode platform (10X Genomics, Pleasanton, Calif.) builds on short-read sequencing technology, maintaining low cost and minimal error rates while providing long-range information on the scale of 100 kilobases and higher (Figure 1). Long-range sequencing retains the information present in the original DNA sample, making it possible to phase variants, improve detection of structural changes in human genomes, increase accuracy, achieve single-molecule sensitivity and quantitation, and assemble genomes de novo.

Phasing

Long-range sequencing allows phasing: characterization of the chromosomal origin of variants located within a diploid genome (diploid organisms, like humans, carry two copies of each chromosome). By identifying haplotype information, phased sequencing can be used to study complex traits that are often influenced by interactions among multiple genes and alleles. Whole-genome and whole-exome sequencing produce a single consensus sequence without differentiating between variants on homologous chromosomes. Phased sequencing can identify which alleles are on either maternal or paternal chromosomes—information that can be critical to understanding the genetics underlying a disease and for studying expression and gene regulation.
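
The core idea of phasing can be illustrated with a minimal sketch, which is not a description of any particular phasing software. Suppose each long molecule that spans two heterozygous sites reports which allele it carries at each site. Because a single molecule comes from one chromosome copy, counting molecules is enough to suggest whether the two variant alleles sit on the same copy (cis) or on opposite copies (trans). The function name and the "ref"/"alt" labels are illustrative assumptions.

```python
from collections import Counter


def phase_pair(molecules):
    """Decide whether two heterozygous variants are in cis or in trans.

    molecules: iterable of (allele_at_site1, allele_at_site2) tuples, one per
    long molecule spanning both sites; each allele is "ref" or "alt".
    Returns "cis", "trans", or "unresolved" (hypothetical labels).
    """
    counts = Counter(molecules)
    # Molecules carrying alt/alt or ref/ref support the variants sharing a copy.
    cis = counts[("alt", "alt")] + counts[("ref", "ref")]
    # Molecules carrying one alt and one ref support opposite copies.
    trans = counts[("alt", "ref")] + counts[("ref", "alt")]
    if cis > trans:
        return "cis"
    if trans > cis:
        return "trans"
    return "unresolved"


observed = [("alt", "alt"), ("ref", "ref"), ("alt", "alt"), ("alt", "ref")]
print(phase_pair(observed))  # cis
```

A simple majority vote is used here only for clarity; a real analysis would also need to weigh base quality and coverage.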

Phasing enables other downstream applications as well. Because reads can be separated by haplotype, variant calling can be performed in a haploid rather than a diploid context. Thus, mutations (which are consistent with haplotype structure) can be separated from errors introduced by the sequencing technology (which are orthogonal to haplotype structure), yielding variant calls with much higher accuracy than those made from short-read sequencing alone. This is especially important for identifying mutations present in only a fraction of the sample, as in cancer sequencing and noninvasive prenatal testing (NIPT).
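
The distinction between haplotype-consistent mutations and haplotype-independent errors can be sketched in a few lines. The example assumes each read covering a site has already been assigned to haplotype 1 or 2, and it measures how concentrated the variant-supporting reads are on a single haplotype. The function name and input layout are hypothetical, and this is an illustration of the principle rather than a production variant caller.

```python
from collections import Counter


def alt_haplotype_concentration(observations):
    """Fraction of variant-supporting reads that come from the single
    best-represented haplotype.

    observations: iterable of (haplotype, is_alt) pairs, one per read
    covering the site, where haplotype is 1 or 2.
    Returns None when no read supports the variant.
    """
    alt_counts = Counter(hap for hap, is_alt in observations if is_alt)
    total_alt = sum(alt_counts.values())
    if total_alt == 0:
        return None
    return max(alt_counts.values()) / total_alt


# A low-fraction mutation: few supporting reads, all on haplotype 1.
mutation_like = [(1, True)] * 4 + [(1, False)] * 16 + [(2, False)] * 20
# An error-like signal: supporting reads split evenly across haplotypes.
error_like = [(1, True)] * 2 + [(2, True)] * 2 + [(1, False)] * 18 + [(2, False)] * 18

print(alt_haplotype_concentration(mutation_like))  # 1.0
print(alt_haplotype_concentration(error_like))     # 0.5
```

A value near 1.0 is what a real mutation would be expected to produce, even at low allele fraction, while a value near 0.5 is more typical of random sequencing error.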

Structural variation

Short-read sequencing is sufficient for identifying point mutations in the genome, as well as small insertions and deletions, with reasonable accuracy. However, larger variants (known as structural variations) are more difficult and, in many cases, impossible to detect reliably with short-read sequencing. Examples include: 1) Large-scale deletions, in which tens of kilobases or more of DNA are lost. 2) Inversions, in which a section of a chromosome is neither lost nor amplified, but its orientation is flipped. These can be very long, on a multi-megabase scale, approaching the length of a chromosome. 3) Interchromosomal translocations, in which chromosomes that would not normally be connected become joined. Human genetics studies and cancer research require an understanding of these variants. In fact, structural variation is the fundamental driver of the oncogenic process in many types of cancer.

One of the challenges in identifying structural variation with short-read sequencing is that the variants are typically mediated by sequences that are repeated throughout the genome, so short reads that fall within those repeats cannot be placed uniquely. Long-range sequencing reaches beyond the immediate vicinity of the structural variant into a wider window that is likely to contain unique sequence.

Repeat-rich areas

Another interesting application for long-range sequencing is calling variants in the nearly 10% of the genome that is repeat-rich. These regions are highly similar to one another and thus difficult to interrogate with short-read sequencers. Sometimes known as “genomic dark matter,” repeat-rich areas originated from a duplication process in which a gene or a region around a gene is copied, leaving two or more copies in the genome. As a result, one copy may become a pseudogene: still very similar at the sequence level, but no longer a functional gene because of an acquired mutation. Alternatively, both copies may be maintained as genes but acquire different functions through one or more mutations.

The challenge in aligning 100–200 base reads from a short-read sequencer is that it will not be clear from which of these copies the read was generated. For example, in a gene with one pseudogene, whether reads were generated from the gene or from the pseudogene will not be known. Reads suggesting a mutation could have radically different implications, depending on which of the two copies they came from. A variant from a pseudogene likely affects nonfunctional DNA as the pseudogene no longer encodes a functional gene; if from a gene, on the other hand, it is more likely to have important biological implications.

With short-read sequencing, it is nearly impossible to find mutations confidently in these areas of the genome. What is particularly interesting about these regions is that frequently there has been a selective pressure in the regions where a duplication has occurred. Repeat-rich parts of the genome are therefore enriched for biologically and clinically important genes. For example, many of the most important pharmacogenomics genes, which regulate how drugs are metabolized, overlap repeated parts of the genome.

Long-range sequencing allows researchers to align many more of these ambiguous regions because it provides information across a broader window. If two regions are 99.9% identical, approximately every thousandth base will be different. Therefore, by gathering information across tens of thousands of bases or more, there is a clear indication as to where to place those reads. Short-read sequencing is limited to a couple of hundred bases at a time, so the placement of reads and any variants implied by these reads remains ambiguous.
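
The arithmetic behind this argument can be made concrete. The sketch below assumes, as a simplification, that the differences between the two copies are independent and evenly spread, and it uses the 99.9% identity figure from the text. The function names and the window sizes are illustrative choices.

```python
def expected_differences(window_bases, identity=0.999):
    """Expected number of bases in the window that differ between two copies."""
    return window_bases * (1.0 - identity)


def prob_distinguishing_base(window_bases, identity=0.999):
    """Probability that the window covers at least one differing base,
    assuming independent, evenly spread differences (a simplification)."""
    return 1.0 - identity ** window_bases


for window in (150, 10_000, 50_000):
    print(
        f"{window:>6} bases: "
        f"expected differences = {expected_differences(window):.2f}, "
        f"P(at least one) = {prob_distinguishing_base(window):.4f}"
    )
```

Under these assumptions, a 150-base read has only about a 14% chance of covering even one distinguishing base, whereas a window of 10,000 bases is expected to cover about ten, which is why long-range information resolves placements that short reads cannot.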

De novo assembly

Two major paradigms are used in genome sequencing: 1) an alignment or reference-based methodology, which involves aligning data to a reference genome and finding mutations with respect to that reference, and 2) de novo assembly, in which the genome is inferred entirely from the data itself. Most human genetic sequencing is done using the alignment methodology. For humans, the Human Genome Project was undertaken to create this reference, which is a compilation of a number of different human sequences but is a reasonable representation of an “average” human genome. When scientists run an experiment for an individual, they first align the reads to this reference, and then call changes in the sample relative to the reference, which are then identified as mutations.

The problem with this method is that, for many parts of the genome, the reference is not a good model for the underlying variation present in the population. For example, the genes associated with human leukocyte antigen (HLA) have so much variation that the underlying assumption that people are roughly similar is false. The HLA genes carry a high mutation and variation rate because those genes modify the immune system to protect an individual from pathogens.

In these cases, it is not sufficient to work from a reference; it is preferable to start with the data and reconstruct the sequence, i.e., perform a de novo assembly. This is vital for reconstructing complex regions like HLA in human germline samples, and for understanding the full set of variations in cancer, where rearrangements can be so dramatic that the reference no longer serves as a good model for the underlying genome. For species other than human, there is frequently no reference at all (this is true for the vast majority of species on earth), and de novo assembly is the only way to analyze the sequencing data.

Additionally, shotgun metagenomics, which involves collecting and sequencing the DNA from all the bacteria from an environmental or anatomical sample, is emerging as an important application for de novo assembly. Scientists have traditionally studied bacteria by isolating and growing individual strains in the laboratory. This method falls short because many species cannot be isolated and cultured in the laboratory. Shotgun metagenomics provides a more unbiased measurement of the species and genes present in environmental samples. These samples often contain a complex mix of species of microbes that are highly similar in multiple regions. De novo assembly with long-range sequencing effectively separates and reconstructs the genomes of the constituent microbes in environmental samples.

De novo assembly from short-read data yields incomplete results. Assemblies work on relatively short, unique stretches of the genome, but frequently encounter repeat regions that may come from multiple genomic locations. It is not possible to assemble anything close to a complete genome using short-read sequencing alone, whereas with long-range sequencing, researchers can stitch those pieces together correctly into chromosome-scale stretches.

The new definition of sequencing

The GemCode platform derives long-range information from sequencing data. It partitions large DNA molecules (on average 100 kilobases or more) into droplets, and then tags the resulting fragments with a specific oligo (a molecular barcode) that is sequenced along with the DNA. The oligo tags allow analysis software to reconstruct accurate, long-range genomic information. GemCode complements, but does not replace, existing technology, providing the additional benefits of high throughput and a low error rate. With this molecular barcoding and analysis platform, scientists can access complete and actionable data from which they can learn more about the genome than ever before.
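
How barcoded short reads can be turned back into long-range information may be easier to see in a small sketch. The example below is not the GemCode analysis software. It assumes reads have already been aligned, groups them by barcode and chromosome, and treats nearby reads sharing a barcode as coming from one original molecule. The `Read` type, the `infer_molecules` function and the 50-kilobase gap threshold are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    barcode: str  # oligo tag sequenced along with the DNA
    chrom: str    # chromosome the read aligned to
    pos: int      # alignment position


def infer_molecules(reads, max_gap=50_000):
    """Group reads by (barcode, chrom) and split each group wherever two
    consecutive reads are more than max_gap bases apart.

    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    """
    by_key = defaultdict(list)
    for read in reads:
        by_key[(read.barcode, read.chrom)].append(read.pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, start a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules


reads = [
    Read("AAAC", "chr1", 1_000),
    Read("AAAC", "chr1", 40_000),
    Read("AAAC", "chr1", 90_000),
    Read("AAAC", "chr1", 5_000_000),
    Read("GGTT", "chr1", 2_000),
]
for molecule in infer_molecules(reads):
    print(molecule)
```

In this toy input, the first three reads with barcode AAAC are linked into one inferred molecule spanning roughly 89 kilobases, while the distant fourth read is treated as a separate molecule.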

Michael Schnall-Levin, Ph.D., is vice president, Computational Biology and Applications, 10X Genomics, 7068 Koll Center Pkwy., Ste. 401, Pleasanton, Calif. 94566, U.S.A.; tel.: 925-401-7300; e-mail: [email protected]; www.10xgenomics.com
