Monday, December 9, 2013

Bitcoin: Which way is down?

Let's try to keep this short... I mean, shorter; I already had to try to keep the last Bitcoin post short :) It would seem I misjudged at least the short-term direction of Bitcoin, as Monday saw a moderate rise (which is generally good for stability, with less chance of a rapid correction in the other direction). I'm still not convinced the crash is over, however, as I can only anticipate more negative news as businesses react to the volatility and regulators think of new ways to rein Bitcoin in.

There are, however, opposing views, one just recently posted on a Washington Post blog: Here's Why Volatility Isn't a Big Problem for Bitcoin. The article stumbles slightly at the delivery, though, summarizing at the end: "Bitcoin's extreme volatility is a problem right now, but that won't last forever." So which is it, a problem or not? From my present viewpoint, if the volatility continues - maybe even if it doesn't - it'll be solved by the majority losing all interest in Bitcoin.

And here's Friday's Bitcoin flash-crash immortalized in the media. The author, of course, seemingly misjudges the action - the Bitcoin price didn't lose its mind; the last moments of the flash-crash were just playing in an auto-repeat loop during the almost 3-hour downtime, which, judging by the internet chatter, fooled a lot of people into thinking things were okay and there were huge profits to be made. I don't think that was Mt.Gox's doing, however - possibly just a flaw in the price-graphing site, as my own price feed showed transactions at a standstill.

For those who still don't know what Bitcoin is, I'll simply summarize: it's a decentralized digital cryptocurrency, an alternative to the established currencies, that's been around for over 4 years. Certain very complex calculations are performed to certify batches of verified transactions as valid. The first person to successfully do so gets a reward, currently set at 25 bitcoins. This reward halves every 4 years to keep inflation at bay, and the calculation difficulty is adjusted so that one batch is completed on average every 10 minutes.
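As a back-of-the-envelope illustration of that halving schedule (the parameters here - a 50-coin initial reward halving every 210,000 blocks, which is roughly 4 years at one block per 10 minutes - are the published Bitcoin protocol values):

```python
# Toy sketch of Bitcoin's block-reward schedule: the reward starts at
# 50 coins and halves every 210,000 blocks (roughly 4 years at one
# block per 10 minutes).
def block_reward(height, initial=50.0, halving_interval=210000):
    halvings = height // halving_interval
    return initial / (2 ** halvings)

# Late 2013 sits past the first halving, hence the 25-coin reward.
print(block_reward(270000))  # 25.0
```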


And that's that for now.

Sunday, December 8, 2013

On Bitcoin

Time to make a short digression from my usual subject. When I bought my current computer rig, I got it with a decent display card for the purpose of mining Bitcoins to offset the hardware and electricity costs of keeping it running. If you haven't heard of Bitcoins yet, you really should get out more (or less, depending). I'll not rehash the concept, as you're likely to find out all about them whether you want to or not just by looking at recent news (or by Googling, should you have found this blog entry in the distant future... Hi, future me!).

However, I was first sold defective memory (though I now hear all of OCZ's memory and SSDs were defective anyhow... though the computer store advised me it couldn't be the memory, and after trying three sets of OCZ I concurred) and couldn't get the rig running stable, so I barely got any run-time out of it. It wasn't until several years later that I got a different set of memory just to test it out, and got it running. By then, Bitcoin's mining difficulty had already inflated so high that mining could no longer offset the price of the electricity to generate it...

Or so I thought. Little did I know, as this week the Bitcoin price hit 1200 dollars, or 900 euros, per Bitcoin. Had I kept the computer churning out Bitcoins, I suppose I'd have more than made up for the electricity cost, at least until this week.

Anyway, despite the faulty memory I had managed to mint a few Bitcoins and then all but forgotten about them, but reading the news on the price and seeing the exponentially rising price graph, I figured it was high time to cash them out before the inevitable - even if possibly temporary - crash.

The news media were all over Mt.Gox, the trading-card-site-turned-money-exchange, reputed to be the most popular exchange with the highest exchange rate, so naturally I sent my hard-earned Bitcoins over to Mt.Gox. Now, if you've been following Bitcoins at all, you should know you need to check, double-check and then check once again the reputability and terms of anywhere you're sending your bitcoins.

I suppose I have only myself to blame; I tried to do this with Mt.Gox, though apparently a little poorly, as after I had exchanged my bitcoins to cash I found out it was practically impossible to get my money out of the service. It would take weeks to - possibly - clear my identification and proof of residence, and only then could I even enter a months-long waiting list to transfer part of my meager proceeds to my bank account.

Naturally, I then converted my money back to Bitcoins to get them out and cash them out through some other service - incurring another 0.6% exchange fee. It was then that I discovered you can't even get your Bitcoins out of Mt.Gox before your identification and proof of residence clear. This is the main reason for this blog post: as of this writing, Mt.Gox is basically a one-way street - you can check your Bitcoins in, but you can't get them out in reasonable time or amount - and I want to help get the word out.

Anyway, the next day China forbade official trade in and use of Bitcoins, leaving them legal for citizens "pending further review" and sending Bitcoin prices tumbling, as China's market is currently driving the prices. This, in turn, has led several major players in China, including Baidu (aka the "Google of China") and China Telecom, to announce they no longer accept Bitcoins as payment, due to the proven volatility and the very real risk of being left holding a worthless currency.

Naturally, this sent the Bitcoin price down for further tumbles, including, as of this counting, at least three or four flash-crashes. During the worst of them Bitcoin's price hit about 400 euros (for those keeping count, that's less than half of the highest point just a day earlier). In the ensuing panic it would probably have dropped to near zero, but Mt.Gox also has a bad (or, some would say, serendipitous) habit of lagging on price updates and rejecting transactions when that happens.

During the worst flash-crash it went down for almost 3 hours, showing a frozen price ticker everywhere until the other exchanges finally recovered to stability... Fortunately, one might say, as otherwise there might be no Bitcoin today (okay, I exaggerate; it keeps coming back, like a zombie). Almost as if by design - though it's certainly not unheard of for brick-and-mortar stock markets to suspend trading when faced with high volatility, nor for poorly designed web services to fail under load.

Unfortunately, that outage also deprived Mt.Gox users of the possible profits to be made from the volatility during that time, as much as it protected their money should the other exchanges have continued their fall towards zero. Consequently, from my short experience with Mt.Gox, I certainly can't recommend it to anyone for any purpose.

Presently, what remains of my Bitcoin assets is firmly in the euro camp. I could turn a profit from the volatility of the market, but that wasn't enough to fight the overall downward trend, and I found my net worth dwindling along with the value of Bitcoins. As a result I'm now down to half of where I intended to cash out.

Though I could re-buy at a low point and regain my losses in number of Bitcoins, I don't think I will risk it. As of this writing Bitcoins have been rising again, but I can't imagine that trend continuing past Monday, as more and more businesses are likely to announce they're suspending Bitcoin payments pending technical review due to the volatility. With a verified Bitcoin transaction taking about an hour, I can't imagine any sane business taking that risk.

About the only thing that could turn around the current downward spiral would be China announcing on Monday morning that, upon review, they've decided to re-allow banks and other government institutions to deal in Bitcoins and fully endorse them. That is, to say the least, extremely unlikely. There is an even higher chance that once the Bitcoin price has fallen low enough to make it palatable, China will announce that to "protect the citizens" they're completely forbidding dealing in Bitcoins.

Is this the end for Bitcoin? I doubt it; the technology is still solid (though comparatively slow and uneconomical), but the volatility and burst bubble seem almost by necessity to be forcing Bitcoin back to a nerd curiosity. There are numerous alt-coins based on the same principles as Bitcoin but with various problematic aspects fixed, and one of them could overtake Bitcoin, but it's hard to fix their prime failing - price volatility. With that fault they're neither a feasible currency nor an investment, but a pure gamble.

Saturday, November 23, 2013

Not quite dead, not quite alive

The analysis of the data took much longer than I anticipated, largely because the power outage I had didn't turn out to be an isolated incident. My computer rig outgrew my office uninterruptible power supplies long ago, but I never thought that would be an issue, because we experience power disruptions maybe once or twice a year at most. For some reason, the power disruptions continued daily for days, then stretched into weeks... Made me feel like I was living in some third-world country.

The analysis had reached a phase where each new run took almost a day, and with a daily power outage you can do the math. Worse, there was some filesystem corruption from the sudden computer reboots, so after a few days I just gave up and decided to wait out the blackouts. In truth, I finished the current run of the analysis a while ago already, but I've been keeping busy with my job - and still am, so I don't have time for much finesse (as usual).

Incidentally, I guess that's not a huge problem; this is more a research journal than a true blog, though I'm hoping to offer more digestible content and possibly regular updates as time goes by. But for now this resembles more a log of my research, and I don't expect to have many readers beyond those who may find this by googling some of the specific things mentioned here. I'll say welcome to any potential readers, and feel free to say hello.

Speaking of which, a rather sizeable dump of the results from my latest run is available. The time crunch means I still haven't got a decent way to visualize it; instead I'm just listing the 25 highest-affinity samples for each genetic similarity cluster for each tested K value. As before, the clusters form on their own, unsupervised, and vary - often back and forth - between K values, so this time I've simply ordered the clusters from left to right according to my own genetic makeup in the analysis rather than trying to line them up in any absolute sense. I've run the analysis for K 1 to 40 for now. Incidentally, 40 is also the first K value at which Orcadians separate into their own cluster in the analysis. That is a good point to pause and reflect on whether this analysis is generating anything sensible. I expect to have more to write on these things in the future.

There's a reason I'm hurrying this update out - 23andMe has finally announced and started to roll out their new Ancestry Composition update. Other genetic ancestry blogs have already broken the news all around, so I probably shouldn't waste many words on it. But the 23andMe Ancestry Composition is being expanded to recognize 31 different populations; the additional ones are Japanese, Korean, Yakut, Mongolian, Chinese, Southeast Asian, West African, South African and Central & South African. With these additions there will be less need for anyone to run their own ADMIXTURE analysis except for learning (or research) purposes. For now, though, I understand no customers have yet received their results for the new populations.

Tuesday, September 10, 2013

Quality Control 1

I've been procrastinating long enough, though as before my computer's been hard at work churning through the data, so the time's hardly been wasted. Except for the rash of short blackouts over the last two days; unfortunately my current rig draws too much power to run off my old uninterruptible power supplies, so I've lost a few days' worth of progress. Just when you thought they had this "power generation" thing down... oh well.

Last time around, I was analyzing Razib Khan's PHYLOCORE genotype dataset, which he provided for personal ADMIXTURE runs on the Gene Expression blog, with different values of K - the number of expected ancestral populations, or clusters. At high K values, several clusters turned out to contain just two or three individuals - people whose genetic variation was covered by each other. This is where I should claim I was planning for that all along, and offer a condescending tidbit of foresight like "This is why you never run an analysis before doing quality control".

Alas, I was simply hoping to cut some corners, as the dataset was explicitly provided for such analysis, so I thought I'd be okay running it without quality controls and getting straight to business. Well, for future reference, people who google for the dataset or the concepts will hopefully find this blog and its warning. Of course, there remains the practical question of quality control for what - or just "how much" quality control is enough. As the last data indicated, the original dataset may be just fine for runs with low K.

Assuming we want sensible results out of high-K runs, how exactly do we perform quality control? One important question at this point is what has already been done. The safest way to know the data's provenance would be to construct my own dataset, but I've chosen to use this dataset as a shortcut. Besides, individual genotypes are somewhat closely guarded - a couple of well-known sources like the Human Genome Diversity Project and 1000 Genomes form the basis of PHYLOCORE, but there seem to be a few other sources I'm not familiar with.

So let's keep playing with PHYLOCORE. The Swiss army knife of genetic analysis is PLINK, which does pretty much everything that's needed - but it's not as simple as clicking a single button and waiting. A number of steps are involved, and decisions need to be made, with regard to thresholds in particular. First and most important are the missingness reports, generated with:

plink --bfile PHYLOCORE --missing --out PHYLOCORE

After this, PHYLOCORE.lmiss will contain statistics on missing genotypes by SNP. A quick look at this file reveals that PHYLOCORE appears to have been cleaned up to contain only SNPs with less than 1% missing, i.e. genotypes that couldn't be determined or weren't tested for. This seems good enough. The PHYLOCORE.imiss file, on the other hand, lists missing genotypes by individual. While each SNP has a missing rate of at most 1%, it would still be possible for 1% of the individuals to be missing ALL of their genotypes, so we need to check this, too. And here we find our first problem:

                            FID                IID MISS_PHENO   N_MISS   N_GENO   F_MISS
                          HADZA              BAR11          Y    21448   134459   0.1595
                          HADZA              END21          Y    20528   134459   0.1527
                          HADZA              END09          Y    19267   134459   0.1433
                          HADZA              BAR09          Y    17261   134459   0.1284
                          HADZA              BAR04          Y    13558   134459   0.1008
                          HADZA              END22          Y    11094   134459  0.08251
                          HADZA              BAR13          Y     7679   134459  0.05711
                          HADZA              BAR08          Y     7597   134459   0.0565
                          HADZA              BAR01          Y     6439   134459  0.04789
                         Yoruba            NA19248          Y     5149   134459  0.03829

A number of the Hadza in the dataset are missing upwards of 16% of the genotyped SNPs. In practice, missing SNPs act like wildcards: in the extreme case of an individual with just one SNP genotyped, that individual would "explain" or cover everybody else in the data carrying that single SNP variant. Optimally, one would carefully analyze the algorithm to determine what effect the missing SNPs have, and then statistically determine a cut-off point at which the expected error margins are met. More practically, this falls to the age-old statistical method of "Whoa, that looks bad", which I've decided to employ here.

The Hadza of Tanzania, numbering around 1,000 people, live 50 kilometers from the "Cradle of Humankind" and according to their oral tradition have lived there since the dawn of time, when the world was inhabited by hairy giants and fire was against the laws of nature. They're also likely the oldest branch of humankind still in existence, so while their significance to modern genealogy is small, I'd rather keep as many of them as possible in the analysis. I therefore opted for a cut-off point of 10% missing SNPs.
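Applied to the .imiss figures quoted above, the 10% cut-off plays out like this (a quick sketch over the excerpted values; on the real files, as far as I understand, PLINK's --mind 0.1 filter does the same job in one step):

```python
# Apply the "more than 10% missing" cut-off to the F_MISS values from
# the PHYLOCORE.imiss excerpt above.
imiss = {
    "BAR11": 0.1595, "END21": 0.1527, "END09": 0.1433,
    "BAR09": 0.1284, "BAR04": 0.1008, "END22": 0.08251,
    "BAR13": 0.05711, "BAR08": 0.0565, "BAR01": 0.04789,
    "NA19248": 0.03829,
}
dropped = sorted(iid for iid, f_miss in imiss.items() if f_miss > 0.10)
print(dropped)  # ['BAR04', 'BAR09', 'BAR11', 'END09', 'END21']
```

So five of the listed Hadza fall over the threshold, while the Yoruba sample and the better-genotyped Hadza survive.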

Are all the SNPs in the dataset even necessary? The simple answer is no. As people may recall, DNA strands get entangled at certain spots during meiosis, forming a crossover or recombination event. While the spots are random, they follow certain probabilities, and the stretches of DNA - and thus the SNPs - between those spots pass together to an offspring. The probability that two SNPs are passed together is modeled by linkage disequilibrium. Estimating it with PLINK's --indep-pairwise 50 5 X for various thresholds X, I obtained counts of the SNPs that would remain after pruning at the specified R^2 values:

At LD R^2 1.0, the number of SNPs pruned would be only 45; this low number is most likely because the beadchips used for genotyping are generally LD-pruned already and don't test many neighboring SNPs. It doesn't seem to me like the dataset has been LD-pruned; the ADMIXTURE paper recommends pruning, but on the other hand I'm currently experimenting with the dataset over a very wide swath, so I will leave LD pruning for later.
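To make the R^2 statistic concrete, here's a toy computation of the genotypic r^2 that --indep-pairwise thresholds on - the squared correlation between minor-allele counts (0/1/2) at two SNPs across individuals. This is my own illustration of the statistic, not PLINK's actual code:

```python
# Squared Pearson correlation between allele-count vectors at two SNPs;
# this is the r^2 that LD pruning thresholds on.
def r_squared(snp_a, snp_b):
    n = len(snp_a)
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    # Unnormalized covariance and variances; the 1/n factors cancel.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    return cov * cov / (var_a * var_b)

# Two SNPs always inherited together are perfectly correlated, so one
# of them carries no extra information and would be pruned.
print(r_squared([0, 1, 2, 1, 0], [0, 1, 2, 1, 0]))  # 1.0
```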

Finally for this blog entry, the last but not least quality control measure I'll mention is "Identity By Descent" - that is, whether individuals are related to each other. This was the specific problem mentioned in the beginning: ADMIXTURE is meant to be run on individuals who are not related. Relatedness can be estimated with "plink --bfile PHYLOCORE --genome --out PHYLOCORE". After a long while, PHYLOCORE.genome is produced, containing the calculated amount of identity by descent for each pair of individuals (this can, naturally, be quite a large file - for 2,000 individuals there are 2000*2000/2, or 2 million, pairs, since order doesn't matter).

Here we concern ourselves mostly with PI_HAT, the "Proportion Identical by Descent". A parent and offspring, for example, are 50% identical. In typical Genome-Wide Association Studies the cut-off point seems to be 25%, which corresponds to second-degree relatives, but I'm not running a GWAS right now; genotype data is hard to come by for an amateur researcher, and I'm just exploring the dataset at a wide range, so I decided to set the cut-off point at 50%.

One quick script later - picking, out of each pair of related individuals, the one with the lower genotyping rate, plus the few Hadza with a higher than 10% missing genotype rate - I had pruned 127 individuals out of the test data and got back to crunching. That list is a little long, and as it depends on the thresholds chosen, I'll provide it on request only. Unfortunately the power cuts came AFTER that and I didn't save the script, so I'll have to rewrite it later, when I need it again.
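Since the script didn't survive the power cuts, here's a rough from-memory reconstruction of what it might have looked like - not the original, and with made-up example values. It assumes you've already pulled (IID1, IID2, PI_HAT) triples out of the .genome file and per-individual F_MISS values out of the .imiss file:

```python
# Hypothetical re-sketch of the lost pruning script: drop individuals
# over the missingness cut-off, then for each pair with PI_HAT >= 0.5
# drop the member with the worse (higher) missing rate.
def pick_removals(pairs, f_miss, pi_hat_cutoff=0.5, miss_cutoff=0.10):
    remove = {iid for iid, fm in f_miss.items() if fm > miss_cutoff}
    for iid1, iid2, pi_hat in pairs:
        if pi_hat >= pi_hat_cutoff and not ({iid1, iid2} & remove):
            # Keep the better-genotyped member of the related pair.
            remove.add(iid1 if f_miss[iid1] > f_miss[iid2] else iid2)
    return remove

# Made-up example: A and B are near-duplicates and B is genotyped
# worse; C is over the 10% missingness cut-off on its own.
f_miss = {"A": 0.01, "B": 0.05, "C": 0.12}
pairs = [("A", "B", 0.95)]
print(sorted(pick_removals(pairs, f_miss)))  # ['B', 'C']
```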

Sunday, September 1, 2013

Finding K

As I mentioned in an earlier post, the goal was finding the optimal K for Razib Khan's "PHYLOCORE" genotype dataset for use with the ADMIXTURE utility - in other words, for determining ancestral populations. Another goal was to get an idea of how the choice of K, the number of ancestral populations, affects results, and to gain overall familiarity with the tools.

Well, I'm putting some of the results, such as they are, in boring table format here. Each 10-row table section represents a run for a single K, starting with 2. For each cluster or "ancestral population" I've listed up to 10 samples from the analysis that were closest to the center of said population, i.e. most typical of it, as long as their estimated proportion is at least 0.5 (i.e. they're at least 50% of said population cluster).

For purposes of finding K, the ADMIXTURE manual is pretty clear. The program has a cross-validation switch, which makes it split the genotypes into (by default) 5 sections, or "folds". It will then estimate the ancestry of all the individuals in the analysis, leaving one fold out at a time. In theory, if the analysis has been successful, there should be a high degree of agreement between the estimates.

In practice, here's how it went:
[Graph: PHYLOCORE K cross-validation error]

By the ADMIXTURE manual, K=13, where the cross-validation error is at its smallest, is the optimal choice. At K=13 the clusters form around Lithuanians, Sardinians, Balochi, Chinese, Georgians, Bedouin, Nganasan, Hadza, Koryak, Maya, Paniya, Yoruba and San bushmen. At this level my Finnish ancestry maps to the Lithuanian cluster, but my Germanic ancestry to the Sardinian cluster, which is almost certainly incorrect. In fact, at different levels of K the center of the cluster with my ancestry jumps between Sardinian and Spanish, until at K=25 they break into separate clusters.
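The bookkeeping for that choice is simple enough to sketch. Assuming each run was done with ADMIXTURE's --cv switch and the output saved per K, the "CV error" lines it prints can be collected and the minimum picked out (the error values below are illustrative only, not my actual numbers):

```python
import re

# Pick the K with the smallest cross-validation error from lines in
# the format ADMIXTURE prints, e.g. "CV error (K=13): 0.48100".
def best_k(lines):
    errors = {}
    for line in lines:
        m = re.search(r"CV error \(K=(\d+)\): ([\d.]+)", line)
        if m:
            errors[int(m.group(1))] = float(m.group(2))
    return min(errors, key=errors.get)

sample_logs = [  # illustrative values only
    "CV error (K=12): 0.48213",
    "CV error (K=13): 0.48100",
    "CV error (K=14): 0.48152",
]
print(best_k(sample_logs))  # 13
```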

I would suggest that the manual's recommendation of the minimum CV error is a little counter-intuitive: because very few people are unadmixed (i.e. have a genome from just one ancestral population), the cross-validation error is going to keep increasing the more detailed the admixture analysis gets. K values that are a good fit to the underlying population structure will pull the cross-validation error downwards, but not enough to overcome the overall trend. Looking at the data, though, the local minima in the graph - at K=16, K=23, the wide one at K=25, etc. - seem to generally be of better quality (admittedly, I'm mostly looking at the Sardinian cluster, which is doing weird things).

Unfortunately, at K=19, where 4 Dusadh form a cluster by themselves, there's the first hint of something being seriously wrong. At K=28, where the Spanish split in two and three other clusters have only a handful of people in them, this becomes rather obvious. I just had a look at the PHYLOCORE data, and it would appear it hasn't had even basic sample quality control done on it. So that'll have to be my next "project". I do consider this run to have been useful, though; at the very least it illustrates what happens when there are directly related individuals in an admixture analysis. I'll save other observations for later.

Saturday, August 31, 2013

Admixtures - what they are, and aren't

Well, I promised some results for this week, so I guess I'd better get writing. Piecing together genetic genealogy sometimes recalls Firesign Theatre's old (and by old, I mean '60s) sketch: "Where're you from?" "Nairobi, ma'am. Isn't everybody?" Regardless, "genetic genealogy blogs" have been proliferating, and as hinted earlier, there are some tools to make things a lot easier.

In this post I'll turn our attention to ADMIXTURE; to most people who've looked at personal genetics this may be familiar from GEDmatch's "Ad-Mix Utilities". Unfortunately there are some misconceptions about these tools, which I'm in part trying to set straight here while exploring my own ancestry. The first tidbit should be self-evident, but doesn't always seem to be: since all of the admixture calculators on GEDmatch give different results, they can't possibly all be correct.

So what, then, does ADMIXTURE do? To give a fairly technical summary, it tries to determine the frequency of each Single Nucleotide Polymorphism in K different populations, and the contributions of those K populations to the genetic makeup of each individual in the analysis. For example, if SNP 1 has a frequency of 25% in population A and 75% in population B, then an individual with a copy of SNP 1 from both parents has likelihood 0.25^2 = 6.25% under population A and 0.75^2 = 56.25% under population B; normalized, that's a 10% chance of being from population A and a 90% chance of being from population B.

But if they had the SNP - a single base change - from just one parent, the odds would be 50/50. Still, this wouldn't necessarily mean they had one parent from each population. Now, if you apply this analysis to 100,000 different SNPs and determine their contributions to 20 different populations for 3,000 individuals, you get much closer to the real situation. The results are still more a probability than a fraction, though, and it all hinges on the correct selection of the sample individuals and the number of populations K.
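The single-SNP arithmetic above can be written out as a toy calculation - this is just Bayes' rule with equal priors, and emphatically not how ADMIXTURE itself is implemented (the real model estimates frequencies and ancestry proportions jointly over all SNPs):

```python
# Posterior probability of each population, given the number of copies
# (0, 1 or 2) of an allele and its frequency in each population.
def posterior(freq_by_pop, copies):
    # Binomial likelihoods; the coefficient cancels in normalization.
    likes = {pop: f ** copies * (1 - f) ** (2 - copies)
             for pop, f in freq_by_pop.items()}
    total = sum(likes.values())
    return {pop: like / total for pop, like in likes.items()}

freqs = {"A": 0.25, "B": 0.75}
print(posterior(freqs, copies=2))  # roughly {'A': 0.1, 'B': 0.9}
print(posterior(freqs, copies=1))  # 50/50 with one copy
```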

One good article on the Eurogenes blog deals mainly with my opening pitch: because ADMIXTURE compares (or classifies) individuals against different population clusters, it is only going to report differences from those clusters. Put another way, if there were a "Finnish" cluster, a Finnish individual might get a result of "100% Finnish" with no further information about their ancestral makeup. It turns out this is usually what people want out of genetic genealogy, i.e. to display only recent admixture, but it can still provide surprises.

With those caveats in mind, I set out to run some experiments myself. To cut to the chase and provide a gentle introduction both for myself and any readers, I opted to use a dataset prepared by Razib Khan over at Discover Magazine's Gene Expression. Incidentally, since there are limited sources of public SNP genotypes with accompanying ancestry information, that may also be pretty much the only way to start off. I expect to revisit that issue later, however.

Since I'm also interested in my own genetic origins, I merged in my own genotypes. A peculiarity here is that my known ancestry is about 3/4 Finnish and 1/4 Germanic, while Razib's dataset contains basically no Northern Europeans to compare with. I was curious to find out whether I would form my own population cluster (isn't everybody?) or what other populations I would be mapped to. So with that, onward to Finding K...

Thursday, August 29, 2013

New background

Huge news! Changed the blog background.

Well okay, maybe that's not so huge, but at least it may be a little easier on the eyes. I'll look at customizing the background further at some point.

In other news, GEDmatch's Neanderthal/Denisova genome comparison tool, which I covered in the last post, has been disabled for a few days. Wonder what's up with that? I doubt it can be due to high load, as the site suffers under high load all the time, and those features can't be more demanding than many of the others they offer.

Hopefully they'll be back soon, with a little more detailed breakdown of the similarity percentages.

No posts on the blog doesn't necessarily mean nothing's going on in the background... I've been running some experimental analyses on admixtures, as briefly explained in an earlier blog post. As of today, my computer's been crunching the data for 2 weeks straight. The idea was to get some basic sample data to illustrate a few points about admixtures, in particular as implemented by the ADMIXTURE utility.

I didn't expect to spend so much computer time on this initial exploration, but it's been producing somewhat more varied results and running a little faster than I expected, so I've decided to let it run longer - though I'm fast running out of reasonable scope. I may get around to posting some basics about it here this week.

The blog has so far been about genetic testing and genetic genealogy, and I'm considering keeping it that way, as this may make it easier for people interested in the topic to find it. The blog name refers to a class of algorithms often used in complex simulations and analyses (which was to be my major at university), but it's also oftentimes an apt description of my methods, particularly with regard to blogging and such :)