Introductory Review
Population genomics: patterns
of genetic variation within
populations
Greg Gibson
North Carolina State University, Raleigh, NC, USA
1. Polymorphism
Polymorphism at the nucleotide level ranges over at least an order of magnitude
within species, and average polymorphism ranges over two orders of magnitude
between species. Homo sapiens is among the least polymorphic of all species,
with a heterozygous single nucleotide polymorphism (SNP) generally occurring
once every 500 to 1000 bp (International SNP Map Working Group, 2001). By
contrast, marine invertebrates such as the sea squirt and echinoderms have an
astonishing level of sequence diversity with a SNP every 5 to 10 bp (Dehal et al.,
2002). Diversity is a function of organism-level factors such as population size,
generation time, and breeding structure (Aquadro et al., 2001), but variation within
and among chromosomes signifies that recombination and mutation rates are also
critical (Begun and Aquadro, 1992; Charlesworth et al., 1995). In most species,
centromeric and telomeric regions are less recombinogenic, hence have smaller
effective population sizes, and tend to be less polymorphic (Nachman, 2002). Even
within a locus, polymorphism can vary over an order of magnitude, according
primarily to functional constraint: synonymous substitution rates tend to be uniform,
whereas replacements can be excluded from highly conserved domains. Noncoding
gene sequences are typically more polymorphic than exons and less polymorphic
than intergenic DNA, but core regulatory sequences up to several hundred basepairs
in length may often be the most conserved of all sequences (Wray et al., 2003).
Significant disparity between two measures of polymorphism, namely, the number
of segregating sites and the average heterozygosity, provides evidence for
departure from “neutrality” (Hudson et al., 1987; Kreitman, 2000). However, neu-
trality comes in many flavors, and demographic processes are just as likely to
affect the difference between these two measures as is selection (Nielsen, 2001).
Heterozygosity is a function of allele frequency as well as density, so unexpectedly
high or low numbers of heterozygotes relative to the number of SNPs in a popula-
tion can arise as a result of several processes that may be superimposed on random
drift. Thus, rapid population expansion or strong purifying selection both reduce
2 Genetic Variation and Evolution
heterozygosity, whereas admixture or balancing selection will increase heterozy-
gosity. Tests such as Tajima’s D (Tajima, 1989) have remained useful descriptors
of diversity, but have been joined by a new series of tests that are more firmly
rooted in coalescent theory (Wall and Hudson, 2001). Rather than strictly interpret-
ing test scores relative to theoretical expectations, comparison of the distribution of
test scores across tens or hundreds of loci among species emphasizes that diversity
is affected by a complex interplay of factors and that it is the location of a gene
at either extreme of the continuum that marks it as a candidate target of selection,
rather than a p-value per se (Hey, 1999; Bustamante et al., 2002).
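The contrast between the number of segregating sites and the average heterozygosity can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a hypothetical alignment of equal-length haplotype strings with no missing data, and the function name `diversity_summary` is invented for this example. It computes the number of segregating sites (S), the mean number of pairwise differences (pi), Watterson's estimator (S divided by the harmonic number of n - 1), and Tajima's D using the variance constants given by Tajima (1989).

```python
from itertools import combinations
from math import sqrt


def diversity_summary(haplotypes):
    """Return (S, pi, theta_w, tajima_d) for aligned, equal-length haplotypes.

    Assumes at least two haplotypes and no missing data (illustrative sketch).
    """
    n = len(haplotypes)
    length = len(haplotypes[0])

    # S: number of alignment columns that carry more than one allele.
    seg_sites = sum(
        1 for i in range(length) if len({h[i] for h in haplotypes}) > 1
    )

    # pi: mean number of differences over all pairs of haplotypes.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Watterson's estimator rescales S by a1 = 1 + 1/2 + ... + 1/(n - 1).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = seg_sites / a1
    if seg_sites == 0:
        # D is undefined when nothing segregates.
        return seg_sites, pi, theta_w, float("nan")

    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    tajima_d = (pi - theta_w) / sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))
    return seg_sites, pi, theta_w, tajima_d


# Hypothetical sample: three singleton variants among four chromosomes.
sample = ["AAAA", "AAAT", "AATA", "ATAA"]
s, pi, theta_w, d = diversity_summary(sample)
print(s, pi, round(theta_w, 3), d < 0)
```

In this toy sample every variant is a singleton, so pi falls below Watterson's estimator and D is negative. In keeping with the argument above, such a score is better read against the distribution of scores across many loci than against a fixed threshold.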
A trend toward empirical evaluation of significance by permutation in light
of genomic data is also seen in relation to population structure. Standard
F-statistics introduced by Sewall Wright based on differences in genotype frequencies
among populations (Weir and Hill, 2002) have been extended into an analysis
of molecular variance (AMOVA) framework, one popular implementation of which
is the Arlequin software (Schneider et al., 2000). Estimates of SNP, indel, haplotype,
or microsatellite allele frequency differences are sensitive to sample size,
so samples of at least 100 individuals per population are recommended. Using
genomic data, the multiple comparison issue also arises: in a set of 500 sites, a sin-
gle site with a testwise p-value of 0.0001 is not unexpected, but in a large sample
this may correspond to an allele frequency difference of just 10%. Consequently,
population structure is best estimated from multilocus data. For example, Pritchard
et al. (2000) have introduced Bayesian statistics to assign individuals to likely subpopulations
with numerous applications in evolutionary, conservation, quantitative,
and human genetics. It is well known that over 90% of all human polymorphism
is common to all populations, but the ability to genotype hundreds of loci has
led to the recognition that given sufficient data there is a detectable signature of
demographic history even in our species (Rosenberg et al., 2002). Similarly, long-
held assumptions of panmixia in Drosophila melanogaster are being challenged
by deeper sampling (Glinka et al., 2003), as are commonly held notions about the
genetic uniformity of crops such as maize (Matsuoka et al., 2002), and in fact the
power to discriminate population structure in most species will have a profound
impact on quantitative biology. An important implication of the ability to detect
population structure is inference of departure from neutrality, by comparison of
the observed F-statistics with those obtained from a collection of assumed neutral
markers (Lewontin and Krakauer, 1973; Rockman et al., 2003).
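The comparison of a focal locus with a collection of assumed neutral markers can be sketched as follows. This is a deliberately simplified illustration, not the AMOVA or Arlequin procedure: it assumes biallelic markers, known allele frequencies, and equally weighted subpopulations, and it uses the textbook heterozygosity form F_ST = (H_T - H_S) / H_T. The function names and the example frequencies are hypothetical.

```python
def fst_biallelic(subpop_freqs):
    """F_ST = (H_T - H_S) / H_T for one biallelic marker.

    subpop_freqs holds the frequency of one allele in each subpopulation;
    subpopulations are weighted equally (a simplifying assumption).
    """
    p_bar = sum(subpop_freqs) / len(subpop_freqs)
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # Marker is monomorphic overall, so there is no structure to measure.
        return 0.0
    h_within = sum(2 * p * (1 - p) for p in subpop_freqs) / len(subpop_freqs)
    return (h_total - h_within) / h_total


def empirical_tail(focal_value, neutral_values):
    """Fraction of neutral markers with a value at least as large as the focal one."""
    return sum(v >= focal_value for v in neutral_values) / len(neutral_values)


# Hypothetical data: a focal locus and four assumed-neutral reference markers.
focal = fst_biallelic([0.2, 0.8])
neutral = [0.01, 0.02, 0.05, 0.40]
print(round(focal, 2), empirical_tail(focal, neutral))
```

With real data the neutral reference set would contain tens or hundreds of markers, and sample-size corrections would matter. The point of the sketch is only that the focal locus is ranked against an empirical distribution rather than a theoretical one.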
The advent of new sequencing and genotyping technologies will only accel-
erate the data-driven nature of evolutionary genetic research (see Article 7, Sin-
gle molecule array-based sequencing, Volume 3). ABI 3730 automated DNA
sequencing machines routinely generate traces with over 1 kb of high-quality
sequence and have a throughput capacity exceeding 1 Mb per day. Single-molecule
sequencing methods are expected to make the sequencing of complete eukaryotic
genomes for $1000 each a reality, possibly in the next decade (Meldrum, 2000),
while massively parallel resequencing by hybridization to wafers of tiled oligonucleotides
has already been used to characterize polymorphism between primate
species (Frazer et al., 2003). Such studies have identified hundreds of loci that are
candidates for adaptive evolution in the recent human lineage, some of which are
likely to contribute to the etiology of common disease (Tishkoff and Verrilli, 2003;
Clark et al., 2003). Molecular evolutionary studies of single genes in samples of 30
individuals have been typical but will soon be dwarfed by genome-scale sampling,
and increasingly, attention will be placed on the efficient sampling design and for-
mulation of hypotheses that utilize patterns of variation across the genome to inter-
pret unusual patterns of variation at focal loci. Describing the variance of standard
population-genetic parameters at a genome-wide scale is unprecedented territory,
and developing approaches to quantify this variation across these expansive con-
tiguous regions is the challenge for the near future. This type of data will also allow
reexamination of some of the most basic assumptions underlying many population-
genetic approaches, such as the infinite sites and island migration models.
2. Recombination and linkage disequilibrium
Recombination and mutation are the two biochemical processes that influence
the distribution of molecular variation. Recombination can be directly measured
by monitoring the coinheritance of markers transmitted from parent to offspring,
but with the exception of technically demanding single sperm typing (Jeffreys
et al., 2000), the resolution of this method is of the order of just centimorgans
or hundreds of kilobases. Since an important consequence of recombination is its
effect on linkage disequilibrium over scales from tens of bases to tens of kilo-
bases, indirect methods for measuring recombination have been introduced based
on population-genetic measurement of the cosegregation of markers (Hudson and
Kaplan, 1985; Stumpf and McVean, 2003). Linkage disequilibrium (LD) is the
nonrandom assortment of genetic markers: given two alleles each at a frequency
of 20%, just 4% of individual chromosomes should have both alleles if assortment
is random, but physically adjacent markers will often cosegregate more frequently than this. In
this case, the maximum possible LD would have 20% of the chromosomes with
both less common alleles, and 80% with both common alleles. Two commonly
used statistics measure this departure from randomness, D and $r^2$, only the latter
of which explicitly takes allele frequencies into account (Hill and Robertson, 1966;
Weir, 1996). A further technical challenge in the measurement of LD is establishing
the linkage phase of double heterozygotes, which can be addressed directly by
studying trios of parents and their offspring (which is however impractical for many
species) or computationally with EM likelihood algorithms (Fallin and Schork,
2000; Stephens et al., 2001).
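The 20% example given earlier in this section can be worked through numerically. The sketch below assumes phased data, so that the frequency of chromosomes carrying both alleles is known, and it uses the standard definitions D = p_AB - p_A p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The function name `ld_stats` is hypothetical.

```python
def ld_stats(p_a, p_b, p_ab):
    """Return (D, r_squared) for two biallelic markers.

    p_a, p_b: frequencies of the alleles of interest at each marker.
    p_ab: frequency of chromosomes that carry both alleles.
    Both markers must be polymorphic, otherwise r_squared is undefined.
    """
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Random assortment: 4% of chromosomes carry both 20% alleles.
print(ld_stats(0.2, 0.2, 0.04))

# Maximum LD: all 20% of chromosomes with one allele also carry the other.
print(ld_stats(0.2, 0.2, 0.20))
```

Under random assortment both statistics are zero (up to floating-point rounding). At maximum LD, D is 0.16 while r^2 reaches 1, which illustrates how r^2 rescales the raw departure by the allele frequencies.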
Quantitative geneticists have long been interested in LD because detection of
association between markers and phenotypes is dependent on LD between anony-
mous markers and the causative disease or quantitative trait nucleotide(s) (Zonder-
van and Cardon, 2004). This idea has given rise to the human HapMap project,
which is an effort to describe the complete pattern of haplotypes in the human
genome (International HapMap Consortium, 2003). Haplotypes are sets of multilocus
alleles, and because of LD fewer distinct haplotypes tend to be observed than chance would
predict: there are 32 possible ways that the alleles at five biallelic sites can combine, but typically
just a handful of these will be at any appreciable frequency in a population.
Standard population-genetic theory predicts that LD should decay monotonically
with distance, but at least in the human genome it now appears that there are often
fairly discrete boundaries that define haplotype blocks that range in length from
10 to 100 kb or more (Gabriel et al., 2002; see also Article 12, Haplotype map-
ping, Volume 3 and Article 74, Finding and using haplotype blocks in candidate
gene association studies, Volume 4). Consequently, while there are in excess of
5 million SNPs in the human genome, there may be as few as 50 000 common
haplotype blocks, and it is therefore argued that a similar number of markers
will be sufficient to perform genome scans for association with disease (Risch and
Merikangas, 1996). According to the common disease–common variant hypothe-
sis, the polymorphisms that contribute to many complex human diseases are likely
to have arisen early in human history, but sufficiently recently that they remain
embedded in observable common haplotypes. Similarly, selected phenotypes or
polymorphic traits of interest to evolutionary biologists and ecologists may be due
to nucleotide variants that can be identified by LD mapping.
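The gap between possible and observed haplotypes described in this section can be illustrated with a short count. The sample below is invented for the purpose, with haplotypes written as strings of 0/1 alleles at five biallelic sites, and the 10% frequency cutoff is an arbitrary assumption rather than a standard definition of "common".

```python
from collections import Counter


def possible_haplotypes(n_sites, alleles_per_site=2):
    """Number of distinct allele combinations across n_sites markers."""
    return alleles_per_site ** n_sites


def common_haplotypes(haplotypes, min_freq=0.10):
    """Return {haplotype: frequency} for haplotypes at or above min_freq."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {h: c / total for h, c in counts.items() if c / total >= min_freq}


# Hypothetical sample of 20 phased chromosomes typed at five biallelic sites.
sample = ["00000"] * 9 + ["11111"] * 6 + ["00111"] * 4 + ["10101"] * 1

print(possible_haplotypes(5))
print(common_haplotypes(sample))
```

Here three haplotypes pass the cutoff out of 32 possible combinations, which mirrors the "handful" of haplotypes at appreciable frequency that LD is expected to produce.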
There is considerable debate over the reasons for the detection of haplotype
blocks, with explanations ranging from sampling variance to unequal recombination
rates and/or gene conversion hotspots within loci (Wall and Pritchard, 2003; Stumpf
and Goldstein, 2003), and studies of the population structure of haplotypes are in
their infancy. With respect to evolutionary and agricultural genetics, measurement
of haplotype structure is increasingly important. Domesticated crops and livestock
are likely to have strong haplotype structure as a result of their breeding history
(Flint-Garcia et al., 2003), whereas outbred and highly polymorphic species such
as Drosophila melanogaster are almost devoid of haplotypes (see Article 10,
Linking DNA to production: the mapping of quantitative trait loci in livestock,
Volume 3). More recent is the advent of population genetics in nonmodel systems
that are important with respect to epidemiology, particularly in humans, such as
HIV and Plasmodium (malaria). The frequency of outcrossing or mixing among
these species may contribute to these organisms’ ability to evade host immunity
(Awadalla, 2003). The ability to dissect quantitative traits to the nucleotide level in
any species is ultimately dependent on the thorough characterization of haplotype
diversity.
3. Mutation, gene content, and the transcriptome
Population genomics also encompasses several novel aspects of variation that were
beyond the technical reach of classical population genetics. For example, direct
measurement of mutation rates is now possible, and will complement a large body
of literature on the genetic consequences of mutation accumulation (Keightley and
Lynch, 2003). For many species, it has been estimated that new genetic variance for
fitness or morphological traits is generated at a rate within an order of magnitude
of 0.1% of the environmental variance per generation (Clayton and Robertson,
1955; Houle et al., 1996). Similarly, genetic evidence suggests that a typical per
locus spontaneous mutation rate is approximately $10^{-6}$ per generation, from which
nucleotides are inferred to substitute in each meiosis at a rate close to $10^{-9}$.
Microsatellites evolve at a much accelerated rate, but with a high variance, as
directly measured by comparison of parent and offspring genotypes in several
studies (Ellegren, 2000). Insertion–deletion (indel) polymorphism is prevalent,