Republication from PLOS One


Fig 1. Ancient migrations of Iranic-speaking populations.

A) Area populated by Iranic speakers in the middle of the first millennium BC. States whose languages belonged to the Iranic and Armenian linguistic groups are shown in red (modified from [39]). B) Homeland and migration of Iranic speakers according to the major competing theories (modified from [34]). doi:10.1371/journal.pone.0122968.g001


Y-chromosomal haplogroup G1 is a minor component of the overall gene pool of South-West and Central Asia but reaches up to 80% frequency in some populations scattered within this area. We genotyped the G1-defining marker M285 in 27 Eurasian populations (n = 5,346), analyzed 367 M285-positive samples using 17 Y-STRs, and sequenced ~11 Mb of the Y-chromosome in 20 of these samples to an average coverage of 67×. This allowed detailed phylogenetic reconstruction. We identified five branches, all with high geographical specificity: G1-L1323 in Kazakhs, the closely related G1-GG1 in Mongols, G1-GG265 in Armenians and its distant brother clade G1-GG162 in Bashkirs, and G1-GG362 in West Indians.

The haplotype diversity, which decreased from West Iran to Central Asia, allows us to hypothesize that this rare haplogroup could have been carried by the expansion of Iranic speakers northwards to the Eurasian steppe and, via founder effects, became a predominant genetic component of some populations, including the Argyn tribe of the Kazakhs. The remarkable agreement between the genetic and genealogical trees of Argyns allowed us to calibrate the molecular clock using a historical date (1405 AD) of the most recent common genealogical ancestor. The mutation rate obtained for Y-chromosomal sequence data was 0.78 × 10⁻⁹ per bp per year, falling within the range of published rates. The mutation rate for Y-chromosomal STRs was 0.0022 per locus per generation, very close to the so-called genealogical rate. The “clan-based” approach to estimating the mutation rate provides a third, middle way between direct father-to-son comparisons and calibration on archeologically known migrations, whose dates are subject to revision and whose relationship to genetic events is uncertain.
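The clan-based calibration described above reduces to simple arithmetic: the mean number of mutations accumulated per lineage since the common ancestor, divided by the number of base pairs examined and the years elapsed. The following is a minimal Python sketch of that arithmetic, not the authors' pipeline. The function name, the per-lineage mutation counts, and the sampling year are hypothetical and chosen only for illustration. The sketch also assumes that every lineage descends independently from the ancestor; a real analysis would count mutations on the reconstructed tree.

```python
from statistics import mean


def snp_mutation_rate(mutations_per_lineage, callable_bp, years_elapsed):
    """Return a mutation rate in mutations per bp per year.

    Simplifying assumption: all lineages descend independently from one
    common ancestor (a star-like genealogy), so the mean count per lineage
    estimates the mutations accumulated over `years_elapsed`.
    """
    if callable_bp <= 0 or years_elapsed <= 0:
        raise ValueError("callable_bp and years_elapsed must be positive")
    return mean(mutations_per_lineage) / (callable_bp * years_elapsed)


# Illustrative inputs only; these are NOT data from the study.
ANCESTOR_YEAR = 1405        # historical date quoted in the text
SAMPLING_YEAR = 2015        # assumption made for this sketch
CALLABLE_BP = 11_000_000    # "~11 Mb" from the text, treated as exact here
hypothetical_counts = [5, 6, 4, 5, 6]  # made-up mutations per lineage

rate = snp_mutation_rate(
    hypothetical_counts, CALLABLE_BP, SAMPLING_YEAR - ANCESTOR_YEAR
)
print(f"{rate:.2e} mutations per bp per year")
```

Because the time span is fixed by a documented genealogy rather than inferred, the only quantities estimated from the data are the mutation counts and the size of the sequenced region.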


Despite multiple studies of the phylogeography of individual Y-chromosomal haplogroups, haplogroup G1-M285 has not received attention so far. This is partly explained by its relatively low frequency in its main area of distribution in South-West Asia [10,42], and partly by its uneven geographic distribution, with a maximum frequency in the Madjar population in Kazakhstan [5]. For this reason, the study of the phylogeography of haplogroup G [44] dealt mainly with the G2 sub-branch, and its only statement about G1 is an estimate of its age from Y-STR markers (19,000 ± 6,000 years). However, newly accumulated data indicate that G1 is present over a wider area of the Eurasian steppe than among Madjars alone [10], and it also reaches very high frequencies in geographically distant populations of the Armenian plateau (Table 1). Thus, haplogroup G1 might mark an ancient genetic link between Iranic speakers of South-West Asia and populations of the Central Asian steppes, where Iranian speech predominated in the second and first millennia BC (Fig 1A). However, the place of origin of this haplogroup remains unclear, and it is unknown whether South-West Asians and Madjars have the same or different subbranches of haplogroup G1, what the age of the branch(es) is, and which ancient migrations contributed to the contemporary distribution and diversity of this haplogroup.

These questions about haplogroup G1 phylogeography have been hard to answer, because existing methods allowed only slow progress in discovering phylogenetically informative SNPs. Fortunately, in recent years the possibility of full resequencing of the Y-chromosome [17,41,43,49,50], and more particularly the Y-capture technologies that became commercially available in 2013, has stimulated intensive discovery of phylogenetically informative SNPs. For example, in the preceding decade (from the first extensive papers in 2000 until 2011) only 485 SNPs were placed on the global Y-chromosomal phylogenetic tree, while in the three following years the number of SNPs has exceeded 9,000 (

Over the last decade, there has been significant uncertainty in dating Y-chromosomal haplogroups due to a three-fold difference between the so-called “genealogical” and “evolutionary” mutation rates of Y-STRs. The former rates were repeatedly obtained in a set of studies [18,22,46] comparing father-son pairs, while the latter was obtained in a single study [54] where calibration was done using population events with known historical dates. Growing datasets of complete Y-chromosomal sequences have allowed new calculations of the mutation rates, this time focused on SNPs. Four mutation rates have been suggested so far, ranging from 0.6 × 10⁻⁹ to 1.0 × 10⁻⁹ per bp per year: the pedigree-based rate [50], calibrations based on the peopling of the Americas [41] and Sardinia [17], and the rate adopted from the pedigree rate for autosomal SNPs [37]. The nearly two-fold difference between these rates makes further estimations necessary. In the current study we had the chance to calibrate the Y-chromosomal molecular clock using a historically reliable date of the most recent common genealogical ancestor of carriers of haplogroup G1 in Kazakh clans.
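To make the practical weight of the three-fold rate difference concrete, the sketch below dates a small set of Y-STR haplotypes under both rates. It uses the average squared distance (ASD) to a founder haplotype, a commonly used STR dating statistic that is swapped in here for illustration and is not necessarily the method used in this study. The haplotypes, the founder, and the 30-year generation time are hypothetical; the genealogical rate is the 0.0022 quoted in the abstract, and the evolutionary rate is simply assumed to be one third of it.

```python
def asd_to_founder(haplotypes, founder):
    """Average squared difference in repeat counts between each sampled
    haplotype and the founder haplotype, averaged over all loci."""
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    total = 0
    for hap in haplotypes:
        if len(hap) != len(founder):
            raise ValueError("haplotype and founder must cover the same loci")
        total += sum((allele - anc) ** 2 for allele, anc in zip(hap, founder))
    return total / (len(haplotypes) * len(founder))


def tmrca_years(asd, rate_per_locus_per_generation, years_per_generation):
    """Convert ASD to an age in years: generations = ASD / rate."""
    return asd / rate_per_locus_per_generation * years_per_generation


# Hypothetical three-locus haplotypes (repeat counts); NOT data from the study.
founder = (14, 12, 23)
haplotypes = [(14, 12, 23), (15, 12, 23), (14, 12, 24), (14, 11, 23)]

GENEALOGICAL_RATE = 0.0022                   # per locus per generation (from the text)
EVOLUTIONARY_RATE = GENEALOGICAL_RATE / 3    # assumption: "three-fold difference"
YEARS_PER_GENERATION = 30                    # assumption made for this sketch

asd = asd_to_founder(haplotypes, founder)
age_genealogical = tmrca_years(asd, GENEALOGICAL_RATE, YEARS_PER_GENERATION)
age_evolutionary = tmrca_years(asd, EVOLUTIONARY_RATE, YEARS_PER_GENERATION)
print(f"ASD = {asd:.3f}")
print(f"genealogical rate: {age_genealogical:.0f} years")
print(f"evolutionary rate: {age_evolutionary:.0f} years")
```

Since the age is inversely proportional to the rate, the same haplotypes yield an age three times older under the slower rate, which is why an independent calibration matters for dating.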

Migration of Iranic-speaking populations between the Central Asian steppes and South-West Asian uplands is an important issue in human population history, directly related to the much-debated problem of the homeland and early migrations of Indo-Europeans. Followers of the Kurgan theory propose that the carriers of Iranic languages expanded from the Eurasian steppe southward to present-day Iran, from which region these languages received their name (Fig 1B). The competing theory, which locates the Indo-European homeland in Eastern Anatolia, proposes that the Iranic branch migrated from the Iranian plateau northward to the steppes (Fig 1B). Thus, both theories agree on the area populated by ancient Iranic speakers (both the Iranian-Armenian plateau and the Central Asian steppes) and on the later replacement of Iranic languages in the steppes by Turkic ones. But the two theories suggest opposite directions of population movement between the steppes and the uplands [34].

This study presents a deep phylogeographic analysis of haplogroup G1, combining traditional approaches with the powerful new options emerging from complete sequencing of the Y-chromosome. We set out to provide a new, independent estimate of the mutation rate using the tight links between haplogroups and clans typical of patrilineal nomadic societies. In addition, we aimed to find which direction of the ancient migration of Iranic speakers better fits the haplogroup G1 phylogenetic pattern.