driving its selection as the reference conifer genome.
Among conifers, its genetic resources are unsurpassed in
that three tree improvement cooperatives have been
breeding loblolly pine for more than 60 years and man-
age millions of trees in genetic trials. The current con-
sensus reference genetic map for loblolly pine is made
up of 2,308 genetic markers [4]. Extensive QTL and as-
sociation mapping studies in loblolly pine have revealed
a great deal about the genetic basis of complex traits
such as physical and chemical wood properties, disease
and insect resistance, growth, and adaptation to chan-
ging environments. Current research focuses on the
potential of genomic selection for continued genetic im-
provement [5].
The tree selected for sequencing, ‘20-1010’, is a mem-
ber of the North Carolina State University-Industry Co-
operative Tree Improvement Program and the property
of the Commonwealth of Virginia Department of For-
estry, which released this germplasm into the public
domain. In accordance with open access policies [6], we
released the first draft genome of loblolly pine in June
2012, which made it the first draft assembly available for
any gymnosperm. The draft described here represents a
significant advance over available gymnosperm reference
sequences [7,8].
Results and discussion
Sequencing and assembly
The loblolly pine genome [9] joins the two other conifer
reference sequences produced recently [7,8]. With an esti-
mated 22 billion base pairs [10], it is the largest genome
sequenced and assembled to date. Our experimental de-
sign leveraged a unique feature of the conifer life cycle
and new computational approaches to reduce the assem-
bly problem to a tractable scale [9,11]. From the first
whole genome shotgun (WGS) assembly of the 1.8 million
base pair Haemophilus influenzae genome in 1995 to the
orders-of-magnitude larger three-billion-base-pair mam-
malian genomes that followed years later [12], the WGS
protocol has been an efficient and effective method of pro-
ducing high quality reference genomes. This was in part
made possible by the overlap layout consensus (OLC) as-
sembly paradigm championed by Myers [13] and ubiqui-
tously implemented in first-generation WGS assemblers.
When next-generation sequencing (NGS) ushered in
a new era of WGS sequencing, the extremely large num-
bers of reads exceeded the capabilities of existing OLC
assemblers. To circumvent this, new assemblers were de-
veloped, using short k-mer based methods first described
by Pevzner [14]. The giant panda [15] was the first mam-
malian species to have its genome produced using strictly
NGS reads. For loblolly pine, we utilized a hybrid assembly
method that incorporates both k-mer based and OLC as-
sembly methods.
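To make the k-mer based paradigm concrete, the following toy Python sketch decomposes reads into k-mers and links each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix, then extends a contig while the path is unambiguous. It is an illustration only, not the method of any assembler cited here: the function names `build_de_bruijn` and `extend_while_unique`, the example reads, and the choice k = 4 are all assumptions, and the sketch ignores reverse complements, sequencing errors, and repeats.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_while_unique(graph, start):
    """Extend a contig from `start` while the current node has one successor.

    Simplification: only out-degree is checked; a real unitig walk
    would also check in-degree.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop if the walk enters a cycle (a repeat)
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads sampled from one short sequence.
reads = ["ATGGCG", "GGCGTA", "CGTACA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_while_unique(graph, "ATG")
print(contig)
```

An overlap-based (OLC) assembler would instead compare reads pairwise to find overlaps, which is why its cost grows quickly with read count and why reducing the number of input sequences matters for the hybrid approach.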
Figure 1A illustrates the two sources of DNA that
comprised the sequencing strategy. As outlined below
(see [9] for details), the majority of the WGS sequence
data in Table 1 was generated from a single pine seed
megagametophyte. The small quantity of genomic DNA
obtained from the haploid megagametophyte tissue was
used to construct a series of 11 Illumina paired-end libraries
with sufficient complexity to form the basis of a
high quality WGS assembly. The use of haploid DNA
greatly simplifies assembly, but the limited quantity of
haploid DNA was insufficient for the entire project. Dip-
loid needle tissue served as an abundant source of par-
ental DNA for the construction of long-insert linking
libraries. This included 48 libraries ranging from 1 to 5.5
kilobase pairs (Kb) and nine fosmid DiTag libraries span-
ning 35 to 40 Kb.
An overview of the assembly process is presented in
Figure 1B. The combined 63× coverage from megagametophyte
libraries (approximately 15 billion reads) was
used for error correction and for the construction of a
database of 79-mers appearing in the haploid genome.
This database was used to filter highly divergent haplotypes
from the diploid sequence data. The super-read reduction
implemented in the MaSuRCA assembler [11]
condensed most of the haploid paired-end reads into a
set of approximately 150 million longer ‘super-reads’. Each
super-read is a single contiguous haploid sequence that
contains both ends of one or more paired-end reads.
The construction process ensured that no super-read
was contained in another super-read. Critically, the
number of megagametophyte-derived reads was reduced
by a factor of 100. The combined dataset was 27-fold
smaller than the original, and was sufficiently reduced in
size to make overlap-based assembly using CABOG [16]
possible. The output of the MaSuRCA assembly pipeline
became assembly 1.0. Additional scaffolding methods
were implemented to improve the assembly by taking
advantage of the deeply sampled transcriptome data
[17], ultimately producing assembly v1.01. Finally, to further
assess completeness, a scan for the 248 conserved
core genes in the CEGMA database [18] was performed
on all conifer assemblies (Figure 1B). The resulting annotations
are classified as full length and partial. The
loblolly pine v1.01 assembly has the largest number of
total annotations (203) of the three conifers as well as
the largest fraction of full length annotations (91%).
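The haploid k-mer filtering step described earlier in this paragraph can be illustrated with a small Python sketch: build a set of k-mers from the haploid reads, then keep only those diploid reads whose k-mers are mostly present in that set. This is one plausible reading of the step, not the pipeline's actual implementation (see [9,11] for that). The function names, the `min_fraction` threshold of 0.9, the toy sequences, and k = 5 are assumptions chosen for readability; the text above specifies 79-mers, and a genome-scale database would need a dedicated k-mer counter rather than an in-memory Python set.

```python
def kmers(seq, k):
    """Yield every k-mer of `seq` in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_db(haploid_reads, k):
    """Collect all k-mers observed in the haploid reads."""
    return {km for read in haploid_reads for km in kmers(read, k)}


def fraction_known(read, db, k):
    """Fraction of the read's k-mers that occur in the haploid database."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for km in kmers(read, k) if km in db)
    return hits / total


def filter_reads(diploid_reads, db, k, min_fraction=0.9):
    """Keep diploid reads consistent with the haploid k-mer content.

    `min_fraction` is a hypothetical threshold for this sketch.
    """
    return [r for r in diploid_reads
            if fraction_known(r, db, k) >= min_fraction]


K = 5  # toy value; the text above describes a database of 79-mers
haploid_reads = ["ATGGCGTACATTG", "GCGTACATTGCCA"]
db = build_kmer_db(haploid_reads, K)

diploid_reads = [
    "GGCGTACATT",  # agrees with the haploid haplotype
    "GGCGAACTTT",  # divergent haplotype: none of its 5-mers are in db
]
kept = filter_reads(diploid_reads, db, K)
print(kept)
```

A longer k makes the filter more specific: a single mismatch against the haploid sequence removes up to k consecutive k-mers from the match count, so divergent haplotypes fall below the threshold quickly.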
For validation purposes, we used a large pool of ap-
proximately 4,600 fosmid clones to approximate a ran-
dom sample of the genome [9]. The sequenced and
assembled pool contained 3,798 contigs longer than
20,000 bp, each putatively representing more than half
of a fosmid insert, with a total span of 109 Mbp. When
aligned to the genome, 98.63% of the total length of these
contigs was covered by the WGS assembly. A total of
Neale et al. Genome Biology 2014, 15:R59 Page 2 of 13
http://genomebiology.com/2014/15/3/R59