I've been procrastinating long enough, though as before my computer's been hard at work churning over the data, so the time's hardly been wasted. Except for the rash of short blackouts over last two days; unfortunately my current rig takes too much power to run off my old uninterruptable power supplies, so I've lost a few days worth of progress. Just when you thought they had this "power generation" thing down... oh well.
Last time around, I was analyzing Razib Khan's PHYLOCORE genotype dataset he provided for personal ADMIXTURE runs on the Gene Expression blog with different values of K - the number of expected ancestral populations, or clusters. At high K values, several clusters turned out to contain two or three individuals - people whose genetic variation was covered by each other. This is where I should claim I was planning for that all along, and offer a condescending tidbit of foresight like "This is why you never run an analysis before doing quality control".
Alas, I was simply hoping to cut around the corners a bit, as that dataset was explictly provided for such analysis, so I though I'd be okay running it without quality controls and get straight to the business. Well, for future reference, people who google for the dataset or concepts will hopefully find this blog and warning. Of course, there remains something of a practical question of quality control for what - or just "how much" quality control is enough. As the last data indicated, the original dataset may be just fine for runs with low K.
As we presumably want to get sensible results out of high K runs, how exactly do we perform quality control? One important question at this point would be what has already been done. Safest way to know their provenance would be to construct my own dataset, but I've chosen to use this dataset as a shortcut. Besides, individual genotypes are somewhat closely guarded - a couple of well known sources like Human Genetic Diversity Project and 1000 Genomes form the basis of PHYLOCORE, but there seem to be a few other sources I'm not familiar with.
So let's keep playing with the PHYLOCORE. The Swiss army knife of genetic analysis is PLINK which does pretty much everything that's needed - but it's not as simple as clicking a single button and waiting. A number of steps, and decisions with regards to thresholds in particular need to be made. First and most important are the missing lists, generated with:
plink --bfile PHYLOCORE --missing --out PHYLOCORE
After this, PHYLOCORE.lmiss will contain statistics on missing genotypes by SNP. A quick look at this file reveals that PHYLOCORE appears to have been cleaned up to contain only SNP's which have less than 1% missing, ie. couldn't be determined or weren't tested for. This seems good enough. The PHYLOCORE.imiss file on the other hand lists missing genotypes by individual. While each SNP has a missing rate of at most 1%, it would be possible for 1% of the individuals to be missing ALL the genotypes, so we need to check for this, too. And here we find our first problem:
FID IID MISS_PHENO N_MISS N_GENO F_MISS
HADZA BAR11 Y 21448 134459 0.1595
HADZA END21 Y 20528 134459 0.1527
HADZA END09 Y 19267 134459 0.1433
HADZA BAR09 Y 17261 134459 0.1284
HADZA BAR04 Y 13558 134459 0.1008
HADZA END22 Y 11094 134459 0.08251
HADZA BAR13 Y 7679 134459 0.05711
HADZA BAR08 Y 7597 134459 0.0565
HADZA BAR01 Y 6439 134459 0.04789
Yoruba NA19248 Y 5149 134459 0.03829
A number of the Hadza in the dataset are missing upwards to 16% of the genotyped SNP's. In practice, the missing SNP's act like wildcards, in the extreme case of individuals with just one SNP genotyped, they would explain or cover everybody else in the data with that single SNP variation. Optimally, one would carefully analyze the algorithm and determine what effect the missing SNP's have, and then statistically determine a cut-off point for missing SNP's where expected error margins are met. More generally, this falls to age-old statistical method of "Whoa, that looks bad" which I've decided to employ here.
The Hadza of Tanzania, numbering around 1000 people, live 50 kilometers from the "Cradle of Humankind" and according to their oral tradition have lived there since the dawn of time when world was inhabited by hairy giants and fire was against the laws of nature. They're also likely the oldest branch of humankind still in existence, so while their significance to modern genealogy is small, I'd rather keep as many of them as possible in the analysis. I therefore opted for a cut-off point of 10% missing SNP's.
Are all the SNP's in the dataset even necessary? Well, the simple answer is no. As people may recall, the DNA strands get tangled at given spots during the meiosis, forming a crossover or recombination event. While the spots are random, they follow certain probability, and the strands of DNA and thus SNP's between those spots pass together to an offspring. The probability two SNP's are passed together is modeled by Linkage Disequilibrium. Estimating it with PLINK's indep-pairwise 50 5 X, I obtained following table for SNP's that would remain after pruning with specified R^2 values:
At LD R^2 1.0, the number of SNP's pruned would be 45, this low number is most likely because the beadchips used for genotyping are generally LD-pruned already and don't test many neighboring SNP's. It doesn't seem to me like the dataset has been LD pruned; the ADMIXTURE paper recommends pruning it, but on the other hand I'm currently experimenting with the dataset with very wide swath, so I will leave experimenting with LD pruning for later.
For this blog entry, finally, the last but not least quality control measure I'll mention is "Indentical By Descent" - or, whether individuals are relatives or not. This was the specific problem mentioned in the beginning, and admixture is meant to be run on individuals who are not related to each other. This can be determined with "plink --bfile PHYLOCORE --genome --out PHYLOCORE". After a long while, PHYLOCORE.genome is outputted, which contains the calculated amount of indentity by descent between each pair of individuals (this can, naturally, also be quite large file - for 2000 individuals there are 2000*2000/2 or 2 million pairs, since order doesn't matter, for example).
Here, we concern ourselves mostly with PI_HAT, the "Proportion of Identical by Descent". A parent and offspring, for example, are 50% identical. In typical Genome Wide Association Studies the cut-off point seems to be 25%, which is equivalent to first cousins, but I'm not running a GWAS right now, genotype data is hard to come by for an amateur researcher and I'm just exploring the dataset at wide range, so I decided to make the cut-off point 50%.
One quick script to pick out of related individuals the one with lower genotyping rate and a few Hadza with higher than 10% missing genotype rate later I pruned 127 individuals out of the the test data, and got back to crunching. That list is little long and as it depends on the thresholds chosen, I'll provide the list on request only. Unfortunately the power-cuts were AFTER that and I didn't save the script, so I'll have to return to the script later, when I need it again.

No comments:
Post a Comment