RandomWalk: Finding K

As I mentioned in an earlier post, the goal was to find the optimal K for running the ADMIXTURE utility on Razib Khan's "PHYLOCORE" genotype dataset, or in other words, to determine how many ancestral populations best describe the data. Another goal was to get an idea of how the choice of K affects the results, and to gain overall familiarity with the tools.

Well, I'm putting some of the results, such as they are, in boring table format here. Each 10-row table section represents a run for a single K, starting with K=2. For each cluster or "ancestral population" I've listed up to 10 samples from the analysis that were closest to the center of that population, i.e. most typical of it, as long as their estimated ancestry fraction is at least 0.5 (i.e. they are at least 50% of that population cluster).
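The selection rule above can be sketched in a few lines of Python. This is only a rough sketch, not the script I actually used: it assumes the ancestry fractions are already loaded as one row per sample with K columns (the layout of ADMIXTURE's .Q output), and it treats "closest to the center" as simply "highest fraction in that cluster". The function and parameter names are made up for illustration.

```python
from typing import Dict, List, Tuple


def top_representatives(
    names: List[str],
    q_rows: List[List[float]],
    max_per_cluster: int = 10,
    min_fraction: float = 0.5,
) -> Dict[int, List[Tuple[str, float]]]:
    """For each cluster, list the samples with the highest ancestry fraction.

    Assumption: q_rows[i][c] is the estimated fraction of sample i's
    ancestry assigned to cluster c, and names[i] is that sample's label.
    """
    if len(names) != len(q_rows):
        raise ValueError("need exactly one name per row of fractions")
    k = len(q_rows[0]) if q_rows else 0
    result: Dict[int, List[Tuple[str, float]]] = {}
    for cluster in range(k):
        # Keep only samples that are at least min_fraction of this cluster.
        members = [
            (name, row[cluster])
            for name, row in zip(names, q_rows)
            if row[cluster] >= min_fraction
        ]
        # Most typical samples (highest fraction) first.
        members.sort(key=lambda item: item[1], reverse=True)
        result[cluster] = members[:max_per_cluster]
    return result
```

With the thresholds left at their defaults, this gives at most 10 rows per cluster, and a cluster with no sample above 50% simply comes out empty.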

For the purposes of finding K, the ADMIXTURE manual is pretty clear. The program has a cross-validation switch (--cv), which makes it split the genotype data into (by default) 5 sections or "folds". It will then estimate the ancestry of all the individuals in the analysis while leaving one fold at a time out, and check how well the held-out genotypes are predicted. In theory, if the analysis has been successful, the predictions should agree closely with the held-out data, giving a low cross-validation error.
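For the curious, a sweep over K can be set up roughly like this. It is a minimal sketch: the --cv switch and the "input file, then K" argument order are how ADMIXTURE is invoked, but the file name phylocore.bed and the helper function are placeholders of mine.

```python
from typing import Iterable, List


def admixture_commands(
    bed_path: str, k_values: Iterable[int], folds: int = 5
) -> List[List[str]]:
    """Build one ADMIXTURE command line per K, with cross-validation enabled.

    Assumption: the 'admixture' binary is on PATH and bed_path points to a
    PLINK .bed file; both are placeholders here.
    """
    return [
        ["admixture", f"--cv={folds}", bed_path, str(k)]
        for k in k_values
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; each could be passed to
    # subprocess.run(cmd, capture_output=True, text=True) to keep its log.
    for cmd in admixture_commands("phylocore.bed", range(2, 6)):
        print(" ".join(cmd))
```

Each run is independent, so the commands can be spread over several cores or machines if the higher K values get slow.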

In practice, here's how it went:


[Figure: PHYLOCORE K Cross Validation]

According to the ADMIXTURE manual, the optimal choice is the K where the cross-validation error is at its smallest, which here is K=13. At K=13 the clusters form around Lithuania, Sardinia, Balochi, China, Georgia, Bedouin, Nganasan, Hadza, Koryak, Maya, Paniya, Yoruba and San Bushmen. At this level my Finnish ancestry maps to the Lithuanian cluster, but my Germanic ancestry maps to the Sardinian cluster, which is almost certainly incorrect. In fact, at different levels of K the center of the cluster with my ancestry jumps between Sardinian and Spanish, until at K=25 they break into separate clusters.
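Picking the minimum by hand from a pile of log files gets tedious, so here is a small sketch of how it could be automated. It assumes each run's log contains a line of the form "CV error (K=13): 0.52", which is what ADMIXTURE prints when run with --cv; the helper names are mine, and the numbers in the example are invented for illustration, not my actual results.

```python
import re
from typing import Dict

# Matches lines such as "CV error (K=13): 0.52345" in ADMIXTURE log output.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(log_text: str) -> Dict[int, float]:
    """Collect {K: cross-validation error} from concatenated log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors: Dict[int, float]) -> int:
    """Return the K with the smallest cross-validation error."""
    if not cv_errors:
        raise ValueError("no CV error lines found")
    return min(cv_errors, key=cv_errors.get)


if __name__ == "__main__":
    # Made-up values, only to show the expected input format.
    sample_log = (
        "CV error (K=2): 0.60\n"
        "CV error (K=3): 0.55\n"
        "CV error (K=4): 0.57\n"
    )
    errors = parse_cv_errors(sample_log)
    print(best_k(errors))
```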

I would suggest that the manual's recommendation of minimum CV error is a little counter-intuitive: because very few people are unadmixed (i.e. have a genome from just one ancestral population), the cross-validation error is going to keep increasing the more detailed the admixture analysis gets. K values which are a good fit to the underlying population structure are going to pull the cross-validation error downwards, but not enough to overcome the overall trend. Looking at the data, though, the local minima in the graph (at K=16, K=23, the wide one at K=25, etc.) seem generally to give better-quality results. (Admittedly, I'm mostly looking at the Sardinian cluster, which is doing weird things.)
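If you want to pick out such local minima without eyeballing the graph, something like the following would do. It is a bare-bones sketch with a hypothetical function name: it takes a {K: CV error} mapping and flags every K whose error is strictly lower than both of its neighbours, so a flat-bottomed dip spanning several K values would need extra tie handling.

```python
from typing import Dict, List


def local_minima(cv_errors: Dict[int, float]) -> List[int]:
    """Return every K whose CV error is lower than at both neighbouring K.

    Assumption: neighbours are the adjacent K values present in the
    mapping; the first and last K are never reported.
    """
    ks = sorted(cv_errors)
    minima = []
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if cv_errors[k] < cv_errors[prev_k] and cv_errors[k] < cv_errors[next_k]:
            minima.append(k)
    return minima
```

On a curve that trends upwards overall, the global minimum is just the first of these dips, which is why the later ones are worth a look as well.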

Unfortunately, at K=19, where 4 Dusadh form a cluster by themselves, there's a first hint of something being seriously wrong. At K=28, where the Spanish split in two and three other clusters have only a handful of people in them, this starts to become rather obvious. I just had a look at the PHYLOCORE data, and it would appear that it hasn't had even basic sample quality control done on it. So that'll have to be my next "project". I do consider this run to have been useful though, and at the very least it illustrates what happens if there are directly related individuals in an admixture analysis. I'll save other observations for later.

Sunday, September 1, 2013
