Familial long-read sequencing increases yield of de novo mutations


  • M.D. Noyes
  • W.T. Harvey
  • D. Porubsky
  • A. Sulovari
  • R. Li
  • N.R. Rose
  • P.A. Audano
  • K.M. Munson
  • A.P. Lewis
  • K. Hoekzema
  • T. Mantere
  • T.A. Graves-Lindsay
  • A.D. Sanders
  • S. Goodwin
  • M. Kramer
  • Y. Mokrab
  • M.C. Zody
  • A. Hoischen
  • J.O. Korbel
  • W.R. McCombie
  • E.E. Eichler


  • American Journal of Human Genetics


  • Am J Hum Genet 109 (4): 631-646


  • Studies of de novo mutation (DNM) have typically excluded some of the most repetitive and complex regions of the genome because these regions cannot be unambiguously mapped with short-read sequencing data. To better understand the genome-wide pattern of DNM, we generated long-read sequence data from an autism parent-child quad with an affected female where no pathogenic variant had been discovered in short-read Illumina sequence data. We deeply sequenced all four individuals by using three sequencing platforms (Illumina, Oxford Nanopore, and Pacific Biosciences) and three complementary technologies (Strand-seq, optical mapping, and 10X Genomics). Using long-read sequencing, we initially discovered and validated 171 DNMs across two children, a 20% increase in the number of de novo single-nucleotide variants (SNVs) and indels when compared to short-read callsets. The number of DNMs further increased by 5% when considering a more complete human reference (T2T-CHM13) because of the recovery of events in regions absent from GRCh38 (e.g., three DNMs in heterochromatic satellites). In total, we validated 195 de novo germline mutations and 23 potential post-zygotic mosaic mutations across both children; the overall true substitution rate based on this integrated callset is at least 1.41 × 10⁻⁸ substitutions per nucleotide per generation. We also identified six de novo insertions and deletions in tandem repeats, two of which represent structural variants. We demonstrate that long-read sequencing and assembly, especially when combined with a more complete reference genome, increase the number of DNMs by >25% compared to previous studies, providing a more complete catalog of DNM than short-read data alone.
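
The abstract reports a per-nucleotide, per-generation substitution rate without showing how such a figure is derived. The sketch below illustrates one common way to estimate it: divide the number of de novo SNVs by the number of callable sites transmitted, counting two haploid genomes per child. This is not necessarily the authors' exact procedure. The function name `substitution_rate` and all input values are hypothetical; the abstract does not state the SNV-only count or the callable genome size, so the example is not expected to reproduce the reported 1.41 × 10⁻⁸.

```python
def substitution_rate(n_de_novo_snvs, callable_haploid_bp, n_children):
    """Estimate substitutions per nucleotide per generation.

    Each child inherits two haploid genomes (one from each parent), so the
    number of callable sites at which a de novo SNV could be observed is
    2 * callable_haploid_bp per child.
    """
    if callable_haploid_bp <= 0 or n_children <= 0:
        raise ValueError("callable_haploid_bp and n_children must be positive")
    return n_de_novo_snvs / (2 * callable_haploid_bp * n_children)


# Hypothetical inputs chosen for illustration only (not taken from the study):
# 150 de novo SNVs across 2 children, 2.7 Gbp of callable haploid sequence.
rate = substitution_rate(n_de_novo_snvs=150,
                         callable_haploid_bp=2.7e9,
                         n_children=2)
print(f"{rate:.3e} substitutions per nucleotide per generation")
```

Under this formula, widening the callable region (for example by using long reads or a more complete reference) raises both the numerator and the denominator, so the estimated rate depends on how many additional DNMs are recovered relative to the newly accessible sequence.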