Tuesday, September 10, 2013

Quality Control 1

I've been procrastinating long enough, though as before my computer's been hard at work churning over the data, so the time's hardly been wasted. Except for the rash of short blackouts over last two days; unfortunately my current rig takes too much power to run off my old uninterruptable power supplies, so I've lost a few days worth of progress. Just when you thought they had this "power generation" thing down... oh well.

Last time around, I was analyzing Razib Khan's PHYLOCORE genotype dataset he provided for personal ADMIXTURE runs on the Gene Expression blog with different values of K - the number of expected ancestral populations, or clusters. At high K values, several clusters turned out to contain two or three individuals - people whose genetic variation was covered by each other. This is where I should claim I was planning for that all along, and offer a condescending tidbit of foresight like "This is why you never run an analysis before doing quality control".

Alas, I was simply hoping to cut around the corners a bit, as that dataset was explictly provided for such analysis, so I though I'd be okay running it without quality controls and get straight to the business. Well, for future reference, people who google for the dataset or concepts will hopefully find this blog and warning. Of course, there remains something of a practical question of quality control for what - or just "how much" quality control is enough. As the last data indicated, the original dataset may be just fine for runs with low K.

As we presumably want to get sensible results out of high K runs, how exactly do we perform quality control? One important question at this point would be what has already been done. Safest way to know their provenance would be to construct my own dataset, but I've chosen to use this dataset as a shortcut. Besides, individual genotypes are somewhat closely guarded - a couple of well known sources like Human Genetic Diversity Project and 1000 Genomes form the basis of PHYLOCORE, but there seem to be a few other sources I'm not familiar with.

So let's keep playing with the PHYLOCORE. The Swiss army knife of genetic analysis is PLINK which does pretty much everything that's needed - but it's not as simple as clicking a single button and waiting. A number of steps, and decisions with regards to thresholds in particular need to be made. First and most important are the missing lists, generated with:

plink --bfile PHYLOCORE --missing --out PHYLOCORE

After this, PHYLOCORE.lmiss will contain statistics on missing genotypes by SNP. A quick look at this file reveals that PHYLOCORE appears to have been cleaned up to contain only SNP's which have less than 1% missing, ie. couldn't be determined or weren't tested for. This seems good enough. The PHYLOCORE.imiss file on the other hand lists missing genotypes by individual.  While each SNP has a missing rate of at most 1%, it would be possible for 1% of the individuals to be missing ALL the genotypes, so we need to check for this, too. And here we find our first problem:

                            FID                IID MISS_PHENO   N_MISS   N_GENO   F_MISS
                          HADZA              BAR11          Y    21448   134459   0.1595
                          HADZA              END21          Y    20528   134459   0.1527
                          HADZA              END09          Y    19267   134459   0.1433
                          HADZA              BAR09          Y    17261   134459   0.1284
                          HADZA              BAR04          Y    13558   134459   0.1008
                          HADZA              END22          Y    11094   134459  0.08251
                          HADZA              BAR13          Y     7679   134459  0.05711
                          HADZA              BAR08          Y     7597   134459   0.0565
                          HADZA              BAR01          Y     6439   134459  0.04789
                         Yoruba            NA19248          Y     5149   134459  0.03829

A number of the Hadza in the dataset are missing upwards to 16% of the genotyped SNP's. In practice, the missing SNP's act like wildcards, in the extreme case of individuals with just one SNP genotyped, they would explain or cover everybody else in the data with that single SNP variation. Optimally, one would carefully analyze the algorithm and determine what effect the missing SNP's have, and then statistically determine a cut-off point for missing SNP's where expected error margins are met. More generally, this falls to age-old statistical method of "Whoa, that looks bad" which I've decided to employ here.

The Hadza of Tanzania, numbering around 1000 people, live 50 kilometers from the "Cradle of Humankind" and according to their oral tradition have lived there since the dawn of time when world was inhabited by hairy giants and fire was against the laws of nature. They're also likely the oldest branch of humankind still in existence, so while their significance to modern genealogy is small, I'd rather keep as many of them as possible in the analysis. I therefore opted for a cut-off point of 10% missing SNP's.

Are all the SNP's in the dataset even necessary? Well, the simple answer is no. As people may recall, the DNA strands get tangled at given spots during the meiosis, forming a crossover or recombination event. While the spots are random, they follow certain probability, and the strands of DNA and thus SNP's between those spots pass together to an offspring. The probability two SNP's are passed together is modeled by Linkage Disequilibrium. Estimating it with PLINK's indep-pairwise 50 5 X, I obtained following table for SNP's that would remain after pruning with specified R^2 values:

At LD R^2 1.0, the number of SNP's pruned would be 45, this low number is most likely because the beadchips used for genotyping are generally LD-pruned already and don't test many neighboring SNP's. It doesn't seem to me like the dataset has been LD pruned; the ADMIXTURE paper recommends pruning it, but on the other hand I'm currently experimenting with the dataset with very wide swath, so I will leave experimenting with LD pruning for later.

For this blog entry, finally, the last but not least quality control measure I'll mention is "Indentical By Descent" - or, whether individuals are relatives or not. This was the specific problem mentioned in the beginning, and admixture is meant to be run on individuals who are not related to each other. This can be determined with "plink --bfile PHYLOCORE --genome --out PHYLOCORE". After a long while, PHYLOCORE.genome is outputted, which contains the calculated amount of indentity by descent between each pair of individuals (this can, naturally, also be quite large file - for 2000 individuals there are 2000*2000/2 or 2 million pairs, since order doesn't matter, for example).

Here, we concern ourselves mostly with PI_HAT, the "Proportion of Identical by Descent". A parent and offspring, for example, are 50% identical. In typical Genome Wide Association Studies the cut-off point seems to be 25%, which is equivalent to first cousins, but I'm not running a GWAS right now, genotype data is hard to come by for an amateur researcher and I'm just exploring the dataset at wide range, so I decided to make the cut-off point 50%.

One quick script to pick out of related individuals the one with lower genotyping rate and a few Hadza with higher than 10% missing genotype rate later I pruned 127 individuals out of the the test data, and got back to crunching. That list is little long and as it depends on the thresholds chosen, I'll provide the list on request only. Unfortunately the power-cuts were AFTER that and I didn't save the script, so I'll have to return to the script later, when I need it again.

Sunday, September 1, 2013

Finding K

As I mentioned in earlier post, the goal was finding optimal K for Razid Khan's "PHYLOCORE" genotype dataset for use with ADMIXTURE utility. Or in other words, for determining ancestral populations. Another goal was to get an idea how choice of K, the number of ancestral populations, affects results and gain overall familiarity with the tools.

Well I'm putting some of the results such as they are in boring table format here. Each 10-row table section represents a run for single K, starting with 2. For each cluster or "ancestral population" I've listed up to 10 samples from the analysis that were closest to the center of said population, ie. most typical of it, as long as their estimated amount is at least 0.5 (ie. they're at least 50% of said population cluster).

For purposes of finding K, the admixture manual is pretty clear. There exists a switch called "Cross Verification" in the program, which makes it split the whole genome in (by default) 5 sections or "folds". It will the estimate the ancestry of all the individuals in the analysis leaving one fold at time out of the genome. In theory, if the analysis has been successful, there should be high degree of agreement between the estimates.

In practice, here's how it went:
PHYLOCORE K Cross Validation

By ADMIXTURE manual, 13 where the cross-validation error is at its smallest is the optimal choice. At K=13 the clusters form around Lithuania, Sardinia, Balochi, China, Georgia, Beduin, Nganassan, Hadza, Koryak, Maya, Paniya, Yoruba and San bushmen. At this level my Finnish ancestry maps to Lithuania, but my Germanic ancestry to Sardinian cluster, which is almost certainly incorrect. In fact at different levels of K the center of the cluster with my ancestry jumps between Sardinia and Spanish, until at K=25 they break to separate clusters.

I would suggest that the manual's recommendation for minimum CV error is little counter-intruitive; because very few people are unadmixed (ie. have genome from just one ancestral population) the cross-validation error is going to keep increasing the more detailed the admixture analysis gets. K values which are a good fit to the underlying population structure are going to pull the cross-validation downwards, but not enough to overcome the overall trend. But looking at the data, local minimas in the graph - at K=16, K=23, the wide one at K=25 etc. seem to generally be better quality (Admittedly, I'm mostly looking at the Sardinian cluster which is doing weird things).

Unfortunately, at K=19 where 4 Dusadh form a cluster by themselves there's first hint of something being seriously wrong. At K=28 where the Spanish split in two and three other clusters have only handful of people in them this starts to become rather obvious. I just had a look at the PHYLOCORE data, and it would appear it hasn't had even basic sample quality control done on it. So that'll have to be my next "project". I do consider this run to have been useful though, and at the very least it illustrates what happens if there are directly related individuals among the admixture analysis. I'll save other observations for later.