Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads


  • D. Porubsky
  • P. Ebert
  • P.A. Audano
  • M.R. Vollger
  • W.T. Harvey
  • P. Marijon
  • J. Ebler
  • K.M. Munson
  • M. Sorensen
  • A. Sulovari
  • M. Haukness
  • M. Ghareghani
  • P.M. Lansdorp
  • B. Paten
  • S.E. Devine
  • A.D. Sanders
  • C. Lee
  • M.J.P. Chaisson
  • J.O. Korbel
  • E.E. Eichler
  • T. Marschall


  • Nature Biotechnology


  • Nat Biotechnol 39 (3): 302-308


  • Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing with continuous long-read or high-fidelity sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.
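
The three headline metrics in the abstract (quality value, contig N50 and switch error rate) follow standard definitions that can be sketched in a few lines of Python. This is a minimal illustration under common conventions, not the authors' evaluation pipeline: the function names and toy inputs are hypothetical, the quality value is assumed to be Phred-scaled, and the switch error rate is assumed to be counted between adjacent heterozygous sites.

```python
import math


def n50(lengths):
    """Return the contig N50: the largest length L such that contigs of
    length >= L together cover at least half of the total assembly size."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate
    (assumption: QV = -10 * log10(error rate))."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(rate):
    """Inverse of qv_to_error_rate."""
    return -10 * math.log10(rate)


def switch_error_rate(truth, predicted):
    """Fraction of adjacent heterozygous-site pairs at which the predicted
    haplotype flips relative to the truth.

    Both arguments are equal-length sequences of haplotype labels
    (for example the strings "0011" and "0000"), one label per site.
    """
    if len(truth) != len(predicted):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        raise ValueError("need at least two sites")
    agree = [t == p for t, p in zip(truth, predicted)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)


if __name__ == "__main__":
    print(n50([10, 20, 30, 40]))             # toy contig lengths
    print(qv_to_error_rate(40))              # QV 40 as an error rate
    print(switch_error_rate("0011", "0000"))
```

Under these definitions, a quality value above 40 corresponds to fewer than one error per 10,000 bases, and a switch error rate of 0.17% corresponds to roughly one phase switch per 600 adjacent pairs of heterozygous sites.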