Introduction
As the last continent settled by modern humans, the peopling of the Americas and subsequent dispersals within the continent have been the focus of intense interest by geneticists.1,2,3,4,5,6 Previous studies have shown that the ancestors of Indigenous Americans, also called Native Americans (NAs), originated in Asia, most likely in the eastern part of Asia,3,6,7,8,9 and settled in the Americas by means of multiple dispersals through Siberia/Beringia10 via the coastal route and possibly the inland ice-free corridor,11 followed by later divergence into sub-groups.12
The origin of early NAs, to date, has been attributed to a complex process involving multiple dispersals from different source places. As indicated by substantial investigations, besides the widely recognized Siberian ancestry, ancestries from other places, although limited, have also been identified, including North Asia,6,9 East Asia,6,13 Southeast Asia,14 and even Australo-Melanesia.15 In agreement with these observations, evidence from uniparental markers further indicates that the majority of NAs show closer genetic affinity to Siberians, as manifested by NA founder types, e.g., mitochondrial DNA (mtDNA) haplogroups A2, B2, C1, C4c, D1, etc.,16,17,18,19 and Y chromosome haplogroups Q-L54 (Q-Z780, Q-M848, and Q-M4303) and C-L1373 (C-MBP373),19,20,21,22,23,24 and thus may trace their ancestral sources in Siberia. In contrast, a sister lineage of the NA matrilineal founder D4h3a,25,26 viz., D4h3b, has been so far observed only in China25 and Thailand,27,28 suggesting that the ancestral maternal sources for early NAs were not restricted to Siberia but were from a much wider Asian geographic range.
To address this issue, an investigation integrating all available D4h data from a large-scale dataset covering the whole of Eurasia is needed. Given that D4h3 and its ancestor type D4h are relatively rare in contemporary populations (∼0.5%),29 we surveyed a total of 101,319 Eurasian individuals and identified the mtDNAs belonging to D4h3 and its ancestral node D4h. These included 60,979 samples for which partial sequence data, mainly hypervariable segment (HVS) data (Table S1), were available and 40,340 samples with the complete (or almost complete) mitogenome sequence (Table S2; Figure 1). This survey identified 110 mtDNAs that could be assigned unambiguously to haplogroup D4h based on mitogenome information as well as 112 mtDNAs likely belonging to D4h based on their HVS or genotyping data (Table S3), whose complete sequencing revealed 106 additional D4h mitogenomes (Figure S1). Furthermore, to reconstruct the evolutionary history of D4h, we also searched this haplogroup in 15,460 ancient samples compiled by indo-european.eu (https://indo-european.eu/ancient-dna/),30 thus covering virtually all global reported ancient mtDNA data, as well as additional 232 recently reported ancient mtDNA data from East Asia.31,32 This survey yielded 39 ancient D4h samples (30 with the entire mitogenome and nine with HVS data) (Figure 1; Tables S4 and S5), which reflected the rarity of Dh4 in ancient times. Therefore, we integrated these ancient and contemporary data of this rare haplogroup to fully investigate its origin and expansion history.

- Download : Download high-res image (890KB)
- Download : Download full-size image
Figure 1. Geographic sources of mtDNA data employed in this study
Circles: populations surveyed for HVS variation are in light blue, while those surveyed for the variation of the entire mitogenome are in yellow. Only data from population surveys (99,722 samples from 1,135 populations) are shown. The remaining 1,597 mtDNAs are not shown on the map either because they were sporadic samples or because geographic information was lacking. For more details concerning the 101,319 samples, see Tables S1 and S2. Triangles: D4h samples, including published (hollow triangles) and newly sequenced samples (filled triangles). Ancient Asian samples harboring D4h mtDNAs were indicated by arrows, with the information shown on the right. The ancient samples from the Americas (see Table S4) are not shown.
ResultsDifferentiation of D4h3 and D4h in Central and North China
To shed light on the origin of the NA founder D4h3a, we explored its ancestor D4h3. Our findings allowed an update of the D4h3 phylogeny and its branches (Figures 2A and S2). Specifically, to avoid any confusion, we kept the names of D4h3a and D4h3b and tentatively named their upstream nodes 「pre-D4h3a」 and 「pre-D4h3b,」 respectively. Different from the NA founder D4h3a, the other branches of D4h3 are mainly distributed in China. In detail, D4h3b1 (root type in Hebei Province in North China) is found in North and Central China, while D4h3b2 (root type in Hubei Province) is mainly distributed in Central China. Coincidentally, among the reported ancient mtDNA data from different locations in Eurasia, we found three ancient samples belonging to D4h3 dated as early as 14–15 kilo years ago (kya) in the Amur River Valley (located in northern North China).33 One of these mtDNAs, sample NE-5 (∼14 kya), derives from pre-D4h3a and is phylogenetically the closest (six mutations apart; Figure S2) to the NA founder D4h3a mitogenome. The remaining two, samples NE34 (∼14 kya) and NE-18 (∼7 kya), are both members of pre-D4h3b. Overall these findings indicate that the ancestral homeland of D4h3 is most likely Central and North China and that both branches of D4h3 were there during the Paleolithic period. These branches locate in Central/North China and reflect the closest Asian matrilineal link to D4h3a, one of the founder pan-American mtDNA haplogroups.25,26

- Download : Download high-res image (1MB)
- Download : Download full-size image
Figure 2. Phylogeography of haplogroup D4h and its sub-lineages
(A) Phylogenetic tree of D4h, with branch lengths proportional to number of variants. Circles: mitogenomes from this study; diamonds: previously published mitogenomes; black outlines: present-day samples; red outlines: ancient samples. The different colors, consistent with those in (B), refer to the different geographic source regions.
(B) Geographic sources of D4h mitogenomes in (A). Numbers on the map refer to the codes of samples and correspond to those in Figure 2A and Table S3.
We then shifted our attention to haplogroup D4h, the most recent common ancestor of D4h3. Except for the NA D4h3a, the other D4h mtDNAs were predominantly found in China, mainly in North (48 out of 150 contemporary samples, discarding four with unknown geographic information) and Central China (44 out of 150) (Table S4; Figure S3). A relatively small number of D4h mtDNAs were also identified in Northwest China (n = 14), Southwest China (n = 16), South China (n = 5), Japan (n = 13), Southeast Asia (n = 7), and North Asia (n = 2) (Table S4; Figure S3). Interestingly, the majority of the ancient D4h samples were detected in the northern regions of China (Figure 1), supporting a similar D4h distribution in the past. Further phylogeographic analyses revealed that the ancient and current samples from the same geographic region tend to cluster together in the same sub-branch, e.g., D4h1a, D4h1c, D4h1d, and D4h3. Meanwhile, most sub-haplogroups of D4h are predominant in North/Central China (i.e., D4h1b, D4h1d, D4h1e, and pre-D4h3b) or showed connections between North/Central China and other regions, including western China (D4h1c and D4h4), Japan (D4h1a), North Asia and Japan (D4h2), and even the Americas (D4h3a) (Figures 2, 3A, and S4). Moreover, samples from South China, Southwest China, Northwest China, Southeast Asia, and North Asia were sporadically distributed across the whole D4h haplogroup and primarily located on the terminal branches (Figures 2A and S2), most likely as a result of gene flow. Finally, the peculiar distributions of certain lineages, for instance D4h1a in Japan and D4h1c in Southwest China (Figures 2A, 3A, and S2), likely indicate founder events.

- Download : Download high-res image (1MB)
- Download : Download full-size image
Figure 3. Geographic distribution and schematic tree of haplogroup D4h
(A) Geographic distributions of different branches of D4h. Each circle represents one sample, with geographic origin of samples shown by different colors, consistent with those in Figure 2B. Contour maps display spatial frequency distributions of haplogroups (see Table S7). Circles without outlines represent datasets from phylogenetic rather than population studies and thus were excluded in calculations of spatial frequencies.
(B) Bayesian age estimates using complete mitogenomes. Sizes of triangles are proportional to sub-haplogroup sample sizes. Colors represent different geographic regions, consistent with Figure 2B. Ancient samples are indicated in red. Green and yellow diamonds show the divergences within the LGM and the last deglaciation, respectively.
(C) Extended Bayesian skyline plot (EBSP) of D4h, showing effective population size changes through time.
Given that some of the mitogenome data from literature are from phylogenetic rather than population studies, and given the relative scarcity of mitogenomes from Siberia, which will introduce bias to the phylogeographic analyses, we also collected and analyzed mtDNA HVS data from population studies (Table S1). Only few potential D4h samples were found in North Asian samples (n = 4,176) (for example, two belonging to D4h1d, which is defined by T16172C and C16174T, and one belonging to D4h1e, which is defined by C16174T and A16343G) (Figure S4), lending support to its rarity in North Asia. The median-joining network based on HVS data (Figure S4) revealed instead a much wider distribution range of D4h in Asia. Indeed, the majority of Asian D4h mtDNAs are observed in Central (58/228; 25.43%) and North (44/228; 21.05%) China, followed by Southwest China (35/228; 15.35%), Northwest China (15/228; 6.57%), Japan (29/228; 12.72%), Southeast Asia (11/228; 4.82%), South China (6/228; 2.63%), and North Asia (9/228; 3.94%). Moreover, the root types of the major branches, e.g., D4h1b, D4h1c, D4h1d, D4h1e, and D4h3b, are primarily found in Central and North China, while the terminal branches mainly contain samples from other regions, e.g., Southwest China, Northwest China, Southeast Asia, South Asia, and Central Asia. Finally, D4h1a and D4h2 are restricted to Japan and its surroundings, lending support to the founder events.
Taken together, these results indicate that the phylogenetic differentiation of D4h occurred somewhere in Central or North China, most likely in a region geographically close to the northern coast of China. In fact, among the North/Central China samples, more than half (64/92, 69.57%) were found in provinces along (Hebei, Liaoning, Tianjin, Shandong, Jiangsu, Shanghai, and Zhejiang) or near (Heilongjiang, Jilin, Beijing, Anhui, and Jiangxi) the northern coast of China (Table S4). Therefore, we propose that the northern coast of China might have played a critical role in the divergence and spread of D4h and its sub-haplogroups.
Two radiation events from northern coastal China contributed to NA and Japanese gene pools
Coalescent age estimations, updated by calibrated radiocarbon dates of ancient DNA samples using tip dating in BEAST (Tables 1 and S6), indicate that the radiations of D4h lineages (aged 32.39 kya, 95% highest probability density [HPD], 24.04–41.45 kya) occurred mainly within two time periods (Figure 3B). The first period fell within the Last Glacial Maximum (LGM; 26.5–19.0 kya),34 during which D4h3 (26.39 kya, 95% HPD, 20.19–33.21 kya), pre-D4h3a (22.29 kya, 95% HPD, 17.24–27.68 kya), pre-D4h3b (21.55 kya, 95% HPD, 16.18–27.94 kya), D4h1 (21.83 kya, 95% HPD, 15.56–29.08 kya), and D4h2 (20.05 kya, 95% HPD, 12.12–29.48 kya) (Tables 1 and S6) differentiated into separate sub-haplogroups (Figure 3B). Among these sub-haplogroups, D4h3a further dispersed and became one of the pan-American haplogroup of NAs (Figure 4A). This radiation echoes well with the divergence of basal American branches from ancient eastern Asians 23–20 kya,3 which was likely due to the LGM』s inhospitable climate in the northern regions of Asia.35 During the last deglaciation (19.0–11.5 kya), after the LGM, a second radiation of D4h occurred somewhere near the northern coast of China, as documented by D4h4 (18.11 kya, 95% HPD, 12.67–24.28 kya), D4h1c (16.17 kya, 95% HPD, 10.66–22.36 kya), D4h1a (15.59 kya, 95% HPD, 11.43–20.92 kya), D4h3b (13.22 kya, 95% HPD, 7.55–19.93 kya), D4h1c1 (12.77 kya, 95% HPD, 8.21–17.79 kya), and D4h1e (12.10, 95% HPD, 7.16–17.50 kya) (Figure 4B). Concordant with this phylogenetic radiation, a rapid increase in the effective population size of D4h ∼15 kya was observed in the extended Bayesian skyline plot (EBSP) (Figure 3C), probably due to the post-LGM climate improvement. These results uncover two waves of previously unknown population dispersals along the northern coast of China during the LGM and last deglaciation, which led to the origin and expansion of different D4h lineages (Figure 4). The regions around the Bohai, Huanghai, and East China Seas, which were still connected by land along the northern coast before the Holocene,36 probably allowed these expansions to occur.
Table 1. Coalescent ages of D4h and its sublineages
Haplogroups/sub-haplogroups | Number of mitogenomesa | Age (mean [95% HPD]) (kya)b |
---|
D4h | 237 | 32.39 (24.04–41.45) |
>D4h1 | 112 | 21.83 (15.56–29.08) |
>>D4h1a | 13 | 15.59 (11.43–20.92) |
>>>D4h1a1 | 12 | 12.24 (6.72–15.87) |
>>>>D4h1a1a | 5 | 5.07 (1.83–8.56) |
>>>>D4h1a1b | 7 | 5.66 (2.87–8.87) |
>>D4h1b | 28 | 10.63 (6.26–15.53) |
>>>D4h1b1 | 25 | 7.56 (4.41–11.06) |
>>D4h1c | 40 | 16.17 (10.66–22.36) |
>>>D4h1c1 | 35 | 12.77 (8.21–17.79) |
>>>>D4h1c1a | 34 | 10.5 (7.01–14.61) |
>>>D4h1c2 | 5 | 7.54 (3.43–12.46) |
>>D4h1d | 18 | 10.26 (6.47–14.26) |
>>D4h1e | 13 | 12.10 (7.16–17.50) |
>D4h2 | 8 | 20.05 (12.12–29.48) |
>>D4h2a | 7 | 10.78 (6.57–15.46) |
>D4h3 | 96 | 26.39 (20.19–33.21) |
>>Pre-D4h3a | 73 | 22.29 (17.24–27.68) |
>>>D4h3a | 71 | 19.40 (15.11–24.05) |
>>Pre-D4h3b | 24 | 21.55 (16.18–27.94) |
>>>D4h3b | 22 | 13.22 (7.55–19.93) |
>>>>D4h3b1 | 3 | 1.93 (0.29–4.05) |
>>>>D4h3b2 | 19 | 6.18 (3.31–9.56) |
>D4h4 | 21 | 18.11 (12.67–24.28) |
a
Ancient mitogenome data were included in coalescent age estimations. Incomplete sequences were excluded from age estimations (see Table S4 for details).
b
The mutation rate was recalibrated using the tip dating method. The best-fitting model was evaluated as previously described.37

- Download : Download high-res image (516KB)
- Download : Download full-size image
Figure 4. Two subsequent population radiations in the northern coastal regions of China contributed to NA and Japanese matrilineal ancestry
(A) The first radiation occurred during the LGM and involved D4h3, pre-D4h3b, and pre-D4h3a (from which D4h3a, typical of NAs, was derived).
(B) A later population expansion in the same general geographic area occurred in the deglaciation period and involved D4h1a1 and D4h2, whose derivatives are found in modern Japanese and ancient Jomons.
Intriguingly, two haplogroups, D4h1a1 (12.24 kya, 95% HPD, 6.72–15.87 kya) and D4h2 (20.05 kya, 95% HPD, 12.12–29.48 kya), exhibited prevalent distributions in the Japanese Archipelago (Figures 2, 3A, and S4), suggesting that the expansions from the northern coast of China also exerted an influence in Japan. The discovery of D4h1a in ancient samples dated ∼11 kya from the Nenjiang River valley38 further supports its advent in regions close to Japan at least 11 kya. Similarly, D4h2 has been observed in ancient Jomons,39 who are considered the descendants of Paleolithic settlers in the Japanese Archipelago.40 The median-joining network (Table S5; Figure S4) showed that one branch of D4h2 (namely D4h2a) in China and Southeast Asia, while the other (D4h2b) is distributed in Siberians as well as the Ainu population (indigenous Japanese, 3 of 50 samples) and ancient Jomons. This further supports a genetic contribution possibly from China to different populations including Southeast Asians and ancient Japanese. Therefore, probably both D4h1a1 and D4h2 dispersed from China to Japan after the LGM, possibly via the land bridges that connected China and the Japanese Archipelago until 12 kya.41,42
Potential supports from Y chromosome data
The origin of mtDNA D4h in northern coastal China of NAs echoes well also with the differentiation of Y chromosome haplogroup C2a-L1373 (ancestor to NA founder lineages C-MPB373 and C-P39) in low-latitude regions of northern Asia.43 To further evaluate the potential radiation center of C2a-L1373, we assessed the frequencies of C2a-L1373 and its sub-lineages in different provinces of China based on Y chromosome genotyping data from 23Mofang Biotechnology Co., Ltd (totally 458,805 individuals, each with 33,000 Y chromosome SNPs genotyped). We detected the root type (C2a-L1373∗) only in North China (including Beijing [0.020%], Tianjin [0.031%], Henan [0.004%], Heilongjiang [0.030%], Jilin [0.063%], Liaoning [0.071%], Shaanxi [0.035%]) and northwest China (Gansu [0.016%]; Table S8). It is worth underscoring that the highest C2a-L1373∗ frequencies were observed in Liaoning, Jilin, Heilongjiang, Tianjin, and Beijing (Table S8), which are all located close to northern coastal China. Moreover, the majority of other C2a-L1373 sub-lineages, including C-FGC28903, which is a sister branch of C-P39, harbor their highest frequencies in North China (Table S8). Moreover, samples belonging to C2a-L1373 in other places like South Asia, Central Asia, Europe, etc., were sporadically found or mainly occupied the terminal branches.43 This evidence strongly suggests that C2a-L1373 differentiated in northern China, especially in the regions near the coast, similarly to mtDNA D4h.
In addition, two ancient samples from Songnen Plain in northern China, dated ∼14,000 years ago, were found to belong to mtDNA D4h3 and Y chromosome C2a-L1373,33 thus revealing the coexistence of both maternal and paternal ancestor lineages of NAs in northern coastal China. Interestingly, C2-M217 (∼39.3 [34.7–44.5] kya)22 and D4h (∼32.39 [24.04–41.45] kya) had similar coalescent ages, and the divergence time of C2a-L1373 (about 21.6 [19.1–24.4] kya)22 is similar to the time of the first D4h radiation estimated in this study, making it likely that an ancestral population from this region contributed to both the maternal and paternal gene pools of NAs. In fact, besides lineages of mtDNA D4h and Y chromosome C2-M217, substantial maternal and paternal lineages have also been observed in this region, e.g., Y chromosome lineages C-F106744 and mtDNA haplogroups A5, D4a, D4b, D4e, N9a, etc.,29 most of which arose around the LGM.44,45 This lends support to the scenario that this region was a differentiation center in East Asia after the LGM, which probably facilitated the expansions of different lineages including mtDNA D4h3 and Y chromosome C2a-L1373. Meanwhile, Y chromosome haplogroup C2-M217 has also been observed at a higher frequency in the Ainu (15%) than in other Japanese (3%).46 Additionally, the coexistence of mtDNA D4h3 and Y chromosome C2 had also been reported in the same archaeological site in South America (∼8 kya).12 These observations collectively suggest that an ancestral population from northern China carrying mtDNA D4h and Y chromosome haplogroup C2 also spread into the Americas and the Japanese Archipelago.
Discussion
In this study, by integrating ancient and contemporary mitogenomes of D4h from large-scale dataset covering virtually the whole of Eurasia, we traced the ancestry of one rare NA founder lineage (D4h3a) to a lower latitude region in northern coastal China around the Bohai and Huanghai Seas. This region is different from the geographical sources in Siberia hypothesized so far by the common maternal components, including mtDNA haplogroups A2, B2, C1, D4b1a2a1a, etc.7,17,19 Our study thus uncovered an additional ancestral source for the ancestors of NAs beyond Siberia from the matrilineal perspective. This ancestry, although only contributed to a small proportion of the mtDNA gene pool of NAs (D4h3a),25 would be important in complementing the whole picture of origination histories of early NAs.
Further support comes from the Y chromosome C2-M217, which harbors an age (∼40 kya) that is similar to the one of mtDNA D4h and also probably radiated in northern coastal China during the LGM (as indicated by C2a-L1373), when the first radiation of D4h occurred. Interestingly, these uniparental ancestries echo well with the ancient ancestry in eastern Asia (∼35 kya) that gave rise to East Asians, Siberians, and NAs at ∼26 kya.3 Meanwhile, it had also been inferred that about 40–23 kya, the ancestors of Jomons split from the ancient ancestry in eastern Asia.47 This evidence strongly supports the existence of an old ancestry source, arising between 40 and 23 kya, that contributed to populations including East Asians, Jomons, Eastern Siberians, and NAs (Figure S5). We propose that mtDNA D4h was one of the matrilineal lineages that witnessed these population splits and expansions. However, different from this East Asian ancestry contributing substantially to eastern Siberia,47 D4h has been rarely found in this area. One explanation would be the loss of D4h during the expansions by genetic drift or matrilineal replacement.48 More mtDNA data from Siberia will be of help to further assess this expansion process.
In addition, this genetic connection among China, the Americas, and Japan during the Pleistocene period parallels archaeological similarities, as early as Pleistocene period, among these regions. For instance, in the terminal Pleistocene period, Japanese microblades (18–17 kya), which exhibit similarities to those in Northeast Asia (including North China), display commonalities with contemporaneous stemmed points from incipient Jomon sites (∼15.5 kya).49 Importantly, stemmed points were well distributed around the Pacific rim from Japan to South America with close affinities with each other.50 Recent findings on stemmed projectile points in North America (Cooper』s Ferry site, ∼15–16 kya) show closer affinity to the nonfluted projectile points in Japan than to those in North Asia.51 We attribute this similarity in Paleolithic technology, as well as the phylogenetic relationships of D4h sub-lineages in China, the Americas, and Japan, to a probable Pleistocene connection among these regions (Figure 4B).
Our results also shed important light on the dispersal route of early NAs into the Americas. Given that mtDNA D4h radiated from northern coastal China, which is geographically close to Pacific coastal rim, we speculate that D4h would have documented LGM and post-LGM dispersals along the eastern Pacific coast. This echoes well with the dispersal D4h3a along the Pacific coastal path25 when the ice-free corridor was closed.11,52 Similarly, Y chromosome C-L1373, which probably radiated in parallel with mtDNA D4h, has also been reported in South Koreans (http://koreangenome.org/) and the Nivkh,53 thus lending support to a coastal population expansion scenario initiated from northern coastal China. This, together with the Paleolithic cultural affinities along the Pacific, e.g., stemmed points,51 and the palaeoecological feasibility of maritime dispersals (e.g., kelp highway hypothesis)54 lends further support to the coastal route hypothesis of early NAs.50,51,55
Limitations of the study
By dissecting the origin of a rare NA founder lineage, our study revealed an ancestral root of both NAs and the Japanese in northern coastal China. However, some detailed expansions from this region into the Americas need to be further dissected. First, more data concerning mtDNA D4h from both ancient and contemporary samples are needed to elucidate the detailed expansion history of this lineage, especially from Siberia, where a relatively low number of mitogenomes have been assessed. Second, high-resolution Y chromosome data of C-L1373 from large-scale population dataset will help to verify this radiation from paternal perspective. Third, investigations integrating mitogenomes, the Y chromosome, and autosomal genomes are also essential to explore whether there are differences between maternal, paternal, and autosomal markers and thus complement the whole picture of origination history of NAs.
STAR★MethodsKey resources table
Resource availabilityLead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Qing-Peng Kong (kongqp@mail.kiz.ac.cn).
Materials availability
This study did not generate new unique reagents.
Experimental model and subject details
Complete mitogenomes of 106 samples belonging to D4h were sequenced in this study. The information of the subjects including the geographical origin, gender and age are shown in Table S3. The sample collection and experimental protocol were approved by the Ethics Committee at the Kunming Institute of Zoology, Chinese Academy of Sciences (Approval No. KIZRKX-2021-011), as well as by the Office of Human Genetic Resource Administration (OHGRA), The Ministry of Science and Technology (MOST), China (Approval No. 2022SLCJ0017). Informed consent was obtained from each individual before the study. For the genotyping data from 23 Mofang Inc., we only used Y chromosome data from customers who had signed the informed consents online to participate in this study and agree to share their genotyping information.
Method detailsScreening of D4h mitogenomes from dataset
To unravel the evolutionary history of this haplogroup and the expansion of its sublineage pre-D4h3a into the Americas, we performed a search of D4h mtDNAs in a large-scale dataset: 101,319 Eurasian individuals, including 60,979 for which only hypervariable segment (HVS) data (Table S1) were available, and 40,340 samples with whole mtDNA (sequencing and genotyping) (Table S2). For the HVS and genotyping data, the motif-search strategy75 was adopted to identify mtDNAs harboring diagnostic variants of D4h and its sublineages. This allowed the identification of 112 potential D4h mtDNAs (Table S3), which after complete mitogenome sequencing revealed 106 new Asian D4h mitogenomes (see below). These were added to 110 previously published D4h mitogenomes from contemporary populations and 30 D4h mitogenomes (Table S4) screened out from ancient samples for further phylogenetic analyses and coalescent age estimations.
DNA extraction, library construction, sequencing, and quality control
Total genomic DNA was isolated by using the genomic DNA extraction kit (Axygen). DNA yield and purity were measured via UV spectroscopy. Libraries were prepared with a standard library kit (MyGenostics Inc., Beijing, China). Sequencing was carried out using an Illumina HiSeq X Ten platform at MyGenostics, with sequencing depths ranging from 3 823.63× to 15 727.02× (average of 7 359.76×; Figure S1). The Cutadapt software65 was used to trim adapters and to filter low quality sequences (including short reads, and reads with low mean quality score and many ambiguous (N) bases in fastq files. Reads were then aligned to the human reference genome version GRCH38 (which has the revised Cambridge Reference Sequence (rCRS)76 as mtDNA reference) by 「bwa mem」 (v0.7.10) (http://bio-bwa.sourceforge.net/). Duplications were detected and removed using the MarkDuplicates module of GenomeAnalysisTK (GATK) and the GATK HaplotypeCaller module was employed to generate the variant file (vcf) using standard parameters. The final variants of each sample relative to the revised rCRS were recorded (Table S3).
Ancient mtDNA acquisition
Fasta files of ancient mtDNA sequences were downloaded from the literature or public database, with the exception of the mtDNA sequence of I4012,77 which was extracted from the whole genome sequencing data of that sample. In detail, we downloaded raw fastqof I4012 from ENA (European Nucleotide Archive) and then trimed adapters using leeHom v1.2.1666 and aligned to rCRS by using aln and samse commands of BWA v0.7.867 with parameters -n 0.01, -o 2, and -l 16500. Reads with mapping quality score (<30) were filtered by samtools v1.13.68 Finally, we obtained endogenous mtDNA fasta file and allocated it to haplogroup D4h1c according to the variants. However, due to the contamination rate of about 0.99 (calculated by schmutzi v1.5.6),69 we removed this sample in the subsequent analyses.
Quantification and statistical analysisHaplogroup affiliations and phylogenetic updates based on newly obtained and previously published data
Haplogroup affiliation of each sample was carried out according to mtDNA tree Build 17 (http://phylotree.org/).78 The phylogeny of D4h was reconstructed manually and checked using mtPhyl v5.003 (https://sites.google.com/site/mtphyl/). Many previously classified branches were confirmed, including D4h4, D4h1, D4h1a, and D4h1b, while others were updated, including D4h1c, D4h1c1, D4h1d, and pre-D4h3b. In addition, several novel branches were also defined, e.g., D4h3b1, D4h3b2, D4h1c2, and D4h1e (Figure S2).
Coalescent age estimations
Modern and ancient sequences (Table S4) were aligned using MUSCLE v3.8.379 in MEGA6.80 Mutations including 309.1C(C), 315.1C(C), AC indels at 515–522, A16182C, A16183C, 16193.1C(C) and C16519T/T16519C were excluded in age estimations. bModelTest72 package implemented in BEAST 2.6.6 was used to select the most appropriate substitution model for our data. As a result, TN93 model (121,131) with gamma rate heterogeneity (G) and proportion of invariant sites (I) was supported through visualization output in Tracer v1.7.2. The Bayes Factor (logBF = 15) computed by BEAST NS package (32 particles) indicated the strict clock model is suitable for our data than the uncorrelated lognormal relaxed clock model. The midpoints of calibrated radiocarbon dates or archaeological periods of the ancient samples (Table S4) were used as the tip date.81 A date-randomization test82 using BEAST 2.6.6 showed the clockRate parameter from the original dataset of 95% HPD intervals (highest posterior distribution) did not overlap the date-randomized datasets, indicating there was sufficient tip date signal to calibrate the clock rate. The Chain Monte Carlo (MCMC) runs of 100,000,000 steps were performed with a sampling of parameters every 10,000 steps and the initial 10% steps were discarded as burn-in. Coalescent Constant Population was adopted as tree Prior.37 BEAUti within the package of BEAST was used to set the model and parameters. The convergences of MCMC were evaluated according to the effective sample size (ESS) by Tracer v1.7.2 (with ESS >200 as acceptable). As a result, whole mitogenomes without partitions into codon positions were adopted due to general higher ESS values (with only two ESS values between 100 and 200). The 95% HPD intervals of coalescent age estimates were recorded in FigTree v1.4.4.
In addition, Rho (ρ) statistics,83,84 which provides unbiased and overlapping estimates of coalescent ages,85 was also used to evaluate the coalescent ages of each clade in haplogroup D4h (Table S6).
Spatial geographic distribution
Geographic locations of mtDNAs belonging to D4h and its sublineages were plotted using Surfer v8.0 (Golden Software Inc. Golden, Colorado, USA). Contour maps of spatial frequencies were constructed using the Kriging algorithm in Surfer v8.0. Samples non-deriving from population studies were excluded.
Extended Bayesian skyline plots
An Extended Bayesian Skyline plot (EBSP)86 for effective population size (Nef) through time was reconstructed using BEAST v2.6.6,71 as described elsewhere.87,88 The midpoints of calibrated radiocarbon dates or archaeological periods of the ancient samples (Table S4) were used as tip dates,81 assuming 25 years for one generation. Each Markov Chain Monte Carlo (MCMC) simulation was run for 500,000,000 generations and sampled every 5,000 generations, with the first 50,000,000 generations discarded as burn-in. The EBSPs were reconstructed using EBSPAnalyser (10% buin-in) and visualized using an in-house R s**t.
Median-joining network reconstruction
A dataset of D4h HVS sequences (n = 62), which encompasses 53 mtDNAs from contemporary populations and nine from ancient samples (Table S5), together with the corresponding HVS sequences extracted from the complete mitogenomes, was used to reconstruct a D4h median-joining network (Figure S4). The median-joining network was firstly constructed by Network 4.510 (http://www.fluxus-engineering.com/sharenet.htm) and then checked and reconstructed manually.89
We thank Dr. Qiaomei Fu for providing the ancient mtDNA sequencing data. This work was supported by the National Natural Science Foundation of China (31620103907, 81625013, 32170633, and 32150410355); the National Key R&D Program of China (2022YFC3302004); the Second Tibetan Plateau Scientific Expedition and Research (2019QZKK0607); the Strategic Priority Research Program (XDA20040102); Young Scientists in Basic Research (YSBR-076); the Key Research Program of Frontiers Science (QYZDB-SSW-SMC020); CAS 「Light of West China」 Program (Y.-C.L.) of the Chinese Academy of Sciences; the Digitalization, Development, and Application of Biotic Resource Program (202002AA100007); the Italian Ministry of Education, University and Research (MIUR) for Dipartimenti di Eccellenza Program (2018–2022) - Department of Biology and Biotechnology 「L. Spallanzani,」 University of Pavia (A.A., A.T., and O.S.) and Progetti PRIN2017 20174BTC4R (A.A.); the High-level Talent Promotion and Training Project of Kunming (Spring City Plan; 2020SCP001); Yunling Scholar of the Yunnan Province (Q.-P.K.); the Yunnan Ten Thousand Talents Plan Young & Elite Talents Project (Y.-C.L.); and Yunnan Fundamental Research Projects (202201AW070012).
Q.-P.K. conceived the study; Z.-L.G., Y.-C.L., and Q.-P.K. performed the analyses; Z.-L.G., B.-Y.Y., and L.-Q.Y. performed the mitogenome sequencing; Y.-C.L., Z.-L.G., K.-J.L., J.-Y.T., and Z.U.R. collected the data; Y.-C.L., Z.-L.G., K.-J.L., Q.-P.K., S.-H.Z., C.-T.L., A.A., O.S., and A.T. contributed to interpretation of results; Y.-C.L. and Q.-P.K. wrote the manus**t.
The authors declare no competing interests. The authors from 23Mofang Biotechnology Co., Ltd, also declare no commercial or associative interests in connection with this work.
We support inclusive, diverse, and equitable conduct of research.