Human genetic clustering

Human genetic clustering refers to a range of scientific and statistical methods used to characterize patterns and subgroups in human genetic variation.

Clustering studies are considered valuable for characterizing the general structure of genetic variation among human populations, and for informing research on ancestral origins, evolutionary history, and personalized medicine. Since the mapping of the human genome, and with the availability of increasingly powerful analytic tools, cluster analyses have revealed a range of ancestral and migratory trends among human populations and individuals.^[1]

The practice of defining clusters of human populations is largely arbitrary and variable, depending on the sampled data, genetic markers, and statistical methods applied to their construction. Nevertheless, studies of human genetic clustering have been implicated in discussions of race, ethnicity, and scientific racism, as some have controversially suggested that genetic clusters may represent genetically determined races.^[2]^[3]

Genetic clustering algorithms and methods

Since at least 2001, a wide range of methods has been developed to assess the structure of human populations using genetic data. Most commonly, genetic clusters are derived from analysis of single-nucleotide polymorphisms (SNPs), although other genetic data can be analyzed as well. Models for genetic clustering also vary in the algorithms and programs used to process the data. Most methods for determining clusters can be categorized as either model-based clustering methods or multidimensional summaries.^[4]^[5] Although they process large numbers of SNPs (or other genetic markers) in different ways, both approaches tend to converge on similar patterns, because both identify similarities among individual SNPs or haplotype tracts that reflect shared ancestry.^[5]

Model-based clustering

Common model-based clustering algorithms include STRUCTURE, ADMIXTURE, and HAPMIX. These algorithms operate by finding the best fit for genetic data among an arbitrary or mathematically derived number of clusters, such that differences within clusters are minimized and differences between clusters are maximized. This clustering method is also referred to as "admixture inference," as individual genomes (or individuals within populations) can be characterized by the proportions of alleles linked to each cluster.^[1]
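As a rough illustration of what "admixture inference" estimates, the sketch below fits a simplified admixture model by expectation-maximization. It is not the algorithm implemented in STRUCTURE, ADMIXTURE, or HAPMIX. It assumes unlinked biallelic markers coded as 0, 1, or 2 copies of an allele and a number of clusters K fixed in advance; the names `fit_admixture`, `simulate`, `Q`, and `F`, and the toy data, are hypothetical.

```python
import math
import random


def log_likelihood(G, Q, F):
    """Binomial log-likelihood of genotypes G, up to an additive constant.

    G[i][j] is the allele count (0, 1 or 2) of individual i at marker j,
    Q[i][k] the share of individual i's ancestry assigned to cluster k,
    F[k][j] the allele frequency of marker j in cluster k.
    """
    total = 0.0
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            p = sum(q * f[j] for q, f in zip(Q[i], F))
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def fit_admixture(G, K, n_iter=100, seed=0, eps=1e-6):
    """Estimate ancestry proportions Q and cluster frequencies F by EM."""
    rng = random.Random(seed)
    n, m = len(G), len(G[0])
    # Random start: each row of Q sums to 1, F stays strictly inside (0, 1).
    Q = []
    for _ in range(n):
        w = [rng.random() + 0.1 for _ in range(K)]
        Q.append([x / sum(w) for x in w])
    F = [[rng.uniform(0.1, 0.9) for _ in range(m)] for _ in range(K)]
    history = [log_likelihood(G, Q, F)]

    for _ in range(n_iter):
        carried = [[0.0] * m for _ in range(K)]  # expected counted-allele copies from cluster k
        copies = [[0.0] * m for _ in range(K)]   # expected copies of either allele from cluster k
        new_Q = [[0.0] * K for _ in range(n)]
        for i in range(n):
            for j in range(m):
                g = G[i][j]
                p1 = sum(Q[i][k] * F[k][j] for k in range(K))
                p0 = 1.0 - p1
                for k in range(K):
                    # E-step: expected number of this individual's allele
                    # copies at marker j that derive from cluster k.
                    a = g * Q[i][k] * F[k][j] / p1
                    b = (2 - g) * Q[i][k] * (1.0 - F[k][j]) / p0
                    carried[k][j] += a
                    copies[k][j] += a + b
                    new_Q[i][k] += (a + b) / (2 * m)
        # M-step: re-estimate proportions and frequencies from expected counts.
        Q = new_Q
        F = [[min(1.0 - eps, max(eps, carried[k][j] / copies[k][j]))
              if copies[k][j] > 0 else F[k][j]
              for j in range(m)] for k in range(K)]
        history.append(log_likelihood(G, Q, F))
    return Q, F, history


def simulate(n_per_group=8, n_markers=30, seed=1):
    """Toy data: two source groups plus 50/50 admixed individuals."""
    rng = random.Random(seed)
    freq_a = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    freq_b = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    G = []
    for share_a in (1.0, 0.0, 0.5):
        p = [share_a * a + (1.0 - share_a) * b for a, b in zip(freq_a, freq_b)]
        for _ in range(n_per_group):
            # Two independent allele draws per marker give a count of 0, 1 or 2.
            G.append([(rng.random() < pj) + (rng.random() < pj) for pj in p])
    return G
```

Calling `Q, F, history = fit_admixture(simulate(), K=2)` returns one row of estimated ancestry proportions per individual. The cluster labels are arbitrary, and the estimates can depend on the random starting point and on the chosen K.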

Multidimensional summary statistics

Where model-based clustering characterizes populations as proportions of discrete clusters, multidimensional summary statistics place them on a continuous spectrum. The most common multidimensional method used for genetic clustering is principal component analysis (PCA), which plots individuals along two or more axes (their "principal components"), each a combination of genetic markers chosen to account for the greatest share of variance. Clusters can then be identified from the distribution of the data: individuals may fall into discrete groups or occupy admixed positions between groups.^[1]^[5]
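The PCA step can be sketched in a few lines. The example below is a minimal illustration, assuming genotypes coded as allele counts (0, 1, 2) and using only column centering; published analyses often also rescale each marker and rely on optimized linear-algebra libraries. The function names and the toy genotypes are hypothetical.

```python
import math
import random


def top_eigenpair(M, n_iter=500, seed=0):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    rng = random.Random(seed)
    n = len(M)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, Mv))
    return eigenvalue, v


def pca_scores(G, n_components=2):
    """Coordinates of each individual on the leading principal components."""
    n, m = len(G), len(G[0])
    # Center each marker (column) on its mean allele count.
    means = [sum(row[j] for row in G) / n for j in range(m)]
    X = [[row[j] - means[j] for j in range(m)] for row in G]
    # Gram matrix between individuals; its eigenvectors give the PC scores.
    K = [[sum(a * b for a, b in zip(X[i], X[h])) for h in range(n)]
         for i in range(n)]
    scores = []
    for _ in range(n_components):
        lam, v = top_eigenpair(K)
        scores.append([math.sqrt(max(lam, 0.0)) * x for x in v])
        # Deflate: remove the component just found before extracting the next.
        K = [[K[i][h] - lam * v[i] * v[h] for h in range(n)] for i in range(n)]
    return scores


# Toy genotypes: rows are individuals, columns are markers.
genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0],  # hypothetical group A
    [0, 0, 2, 2], [0, 0, 2, 1], [0, 1, 2, 2],  # hypothetical group B
]
pc1, pc2 = pca_scores(genotypes)
```

In this toy data set the first component separates the two groups: the three individuals in each group share a sign on `pc1`, and the two groups have opposite signs. The sign itself is arbitrary, so only relative positions carry meaning.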

Caveats and drawbacks

Genetic clustering methods of every type have caveats and drawbacks, given the degree of admixture and relative similarity within the human population. All genetic cluster findings are biased by the sampling process used to gather the data, and by the quality and quantity of those data. Many clustering studies use data from populations that are geographically distinct from one another, which may create a misleading impression of clearly discrete clusters.^[1] STRUCTURE in particular can be misleading because it requires the data to be sorted into a predetermined number of clusters, which may or may not reflect the actual population's distribution.^[6] Sample size also plays an important role, as different sample sizes can influence cluster assignment, and more subtle relationships between genotypes may emerge only with larger samples.^[1]^[6]

Applications to human genetic data

Text of this section.

(Lawson & Falush 2012, Human Genome Diversity Project section; Novembre & Ramachandran 2011; Kalinowski 2011, criticism of STRUCTURE; Bamshad et al. 2004; Bamshad & Olson 2003, which covers Alu polymorphisms clearly)

Genetic clustering and race

Text of this section.

(Maglo et al. 2016; Jorde & Wooding 2004; Bamshad articles)

Related issues

Clusters vs. clines

Brief summary of human genetic variation?

Possibly this is a "see also" section?

[1] Novembre, John; Ramachandran, Sohini (2011-09-22). "Perspectives on Human Population Structure at the Cusp of the Sequencing Era". Annual Review of Genomics and Human Genetics. 12 (1): 245–274. doi:10.1146/annurev-genom-090810-183123. ISSN 1527-8204.
[2] Jorde, Lynn B; Wooding, Stephen P (2004-10-26). "Genetic variation, classification and 'race'". Nature Genetics. 36 (S11): S28–S33. doi:10.1038/ng1435. ISSN 1061-4036.
[3] Marks, Jonathan M. Is science racist?. ISBN 978-0-7456-8925-8. OCLC 1037867598.
[4] Novembre, John; Ramachandran, Sohini (2011-09-22). "Perspectives on Human Population Structure at the Cusp of the Sequencing Era". Annual Review of Genomics and Human Genetics. 12 (1): 245–274. doi:10.1146/annurev-genom-090810-183123. ISSN 1527-8204.
[5] Lawson, Daniel John; Falush, Daniel (2012-09-22). "Population Identification Using Genetic Data". Annual Review of Genomics and Human Genetics. 13 (1): 337–361. doi:10.1146/annurev-genom-082410-101510. ISSN 1527-8204.
[6] Kalinowski, S T (2010-08-04). "The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure". Heredity. 106 (4): 625–632. doi:10.1038/hdy.2010.95. ISSN 0018-067X.
