The Heritage Health Prize
I have started work on a second data competition, the Heritage Health Prize, which is well known in the community for its very large purse: $3 million to the winning team. The objective of this competition is to predict hospitalizations for patients, given those patients' health insurance claims data from previous years. It is a tremendous application of data analysis, as I think healthcare is extremely fertile ground for increasing efficiency by being smarter about care, prescriptions, and procedures. I may be off and on with this one, working on it for a stretch and then letting it sit; as before, my objective is to learn as much as I can rather than realistically to win, and if I feel like I'm spinning my wheels I'll set it aside for a while.
What I particularly like about this competition is the “Milestone Prizes” that the organizers also award. The competition will last for two years, and every 6 months the top 2 entrants win a much smaller but not insubstantial prize, in the five-digit dollar range. In order to claim the Milestones, the winning teams must submit a write-up of their methodology, to the organizers’ satisfaction. Here are links to the Milestone 1 and Milestone 2 papers. (You can only read those PDFs if you are in the competition, unfortunately, and I don’t intend to re-share them if the organizers don’t want them to be shared.)
Two Milestones have passed, with the third coming up in a few weeks. The papers have been tremendously helpful in getting started: my initial approach has been a highly simplified version of their procedures, and it's good enough to reach 211th place out of 1268 (though only 818 entries right now clear a naïve-ish benchmark in which every patient is assigned the same optimized constant prediction, and I say "naïve-ish" because the method for deducing that optimized constant is thoughtful). Unfortunately, my efforts to make my models more sophisticated along the lines of the papers have not yielded much improvement beyond my initial go, but hopefully I'll figure something out.
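For the record, if the scoring works the way I think it does (root-mean-squared error on log(1 + days in hospital)), the optimized constant falls out of a bit of algebra. A minimal sketch in R, with placeholder names for the data:

```r
# Sketch: compute the "optimized constant" benchmark, assuming the competition
# metric is root-mean-squared error on log(1 + DaysInHospital). Under that
# assumed metric, the best single constant c satisfies
#   log(1 + c) = mean(log(1 + actual)).
# 'train' and 'DaysInHospital' are placeholder names, not the real schema.
days <- train$DaysInHospital

best_constant <- exp(mean(log(1 + days))) - 1

# Score a prediction vector the way the leaderboard would under this assumption
rmsle <- function(pred, actual) {
  sqrt(mean((log(1 + pred) - log(1 + actual))^2))
}
rmsle(rep(best_constant, length(days)), days)
```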
Although two Milestones have passed, it helps to read the Milestone 1 papers before the Milestone 2 papers, because the later ones build on and refer back to the earlier ones. I was surprised by how similar the papers' structures are, given that they were written independently:
Features: from the raw data supplied by the competition, what variables become the input to your prediction algorithms? In some cases there is no transformation; you feed the competition data straight through. In other cases, the papers calculated per-patient averages, minimums, maximums, etc. and fed those through.
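For concreteness, here is roughly what the second case looks like in R. The data frame and column names below are placeholders, not the competition's actual schema:

```r
# Sketch: roll claims-level rows up to one row per patient, computing the
# count, mean, and max of a numeric field. 'claims', 'MemberID', and
# 'LengthOfStay' are illustrative names rather than the real columns.
n_claims <- aggregate(LengthOfStay ~ MemberID, data = claims, FUN = length)
mean_los <- aggregate(LengthOfStay ~ MemberID, data = claims, FUN = mean)
max_los  <- aggregate(LengthOfStay ~ MemberID, data = claims, FUN = max)

names(n_claims)[2] <- "n_claims"
names(mean_los)[2] <- "mean_los"
names(max_los)[2]  <- "max_los"

# One row per patient, one column per derived feature
features <- Reduce(function(a, b) merge(a, b, by = "MemberID"),
                   list(n_claims, mean_los, max_los))
```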
Algorithms: in general, strong entries use more than one (see "Ensembling" below). This is where some of the ornery mathematics comes into play, and to really do a good job here you need to read some academic papers. But many of the established statistical models have already been implemented in languages such as R, so if you simply want to get an entry on the leaderboard you don't need to know too much about the models: download them and run them as black boxes. (I'm still learning about these models, and yet I've managed to write entries that use them.) R in particular has strong community development of these statistical models and is what I've been using. The algorithms new to me that I've been trying to learn so far are gradient boosting and random forests.
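If you want to see what "running them as black boxes" looks like, here is a minimal sketch using the gbm and randomForest packages; the feature table and target names continue the placeholder sketch above:

```r
# Sketch: fit the two algorithms mentioned above on a per-patient feature
# table. 'features' and 'days_in_hospital' are placeholders for my real data.
library(gbm)
library(randomForest)

train_df <- data.frame(features, y = days_in_hospital)

# Gradient boosting: many shallow trees, each fit to the errors of the last
gbm_fit <- gbm(y ~ ., data = train_df, distribution = "gaussian",
               n.trees = 500, interaction.depth = 4, shrinkage = 0.05)

# Random forest: many trees fit to bootstrap samples, then averaged
rf_fit <- randomForest(y ~ ., data = train_df, ntree = 500)

gbm_pred <- predict(gbm_fit, newdata = train_df, n.trees = 500)
rf_pred  <- predict(rf_fit, newdata = train_df)
```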
Feature selection: a model is a combination of an algorithm and a subset of the available features, so you might run the same algorithm on two different subsets of the features and call those two separate models. Models sharing an algorithm may benefit less from the ensembling step (see below), because they tend to perform similarly well or similarly poorly on any given data point, but both papers seem to employ this strategy to generate better predictions.
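In code this just means fitting the same algorithm more than once on different columns. Continuing the sketch above, with hypothetical column names:

```r
# Sketch: the same algorithm run on two different feature subsets counts as
# two separate models. The column names here are hypothetical.
subset_a <- c("n_claims", "mean_los", "max_los")
subset_b <- c("n_claims", "age", "n_distinct_providers")

rf_a <- randomForest(x = train_df[, subset_a], y = train_df$y, ntree = 500)
rf_b <- randomForest(x = train_df[, subset_b], y = train_df$y, ntree = 500)
```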
Ensembling: it seems the established way to get a strong overall model is to harness many different prediction models and combine them with a top-level algorithm that weights each one. The idea is that different models may perform well on different subsets of the data (for whatever reason; the "why" may not be well understood), so if you can combine them in a way that leans on the best-suited model for each data point, you'll have a very strong predictor. I actually find the papers a little sparse on some of the details here (maybe because I'm inexperienced), but I believe the procedure followed by the Milestone winners is to run what's called a ridge regression to calculate a weighting for each model, so that the final prediction is a linear combination of the models.
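My reading of the mechanics (not a quote from the papers) is roughly the following, using the glmnet package for the ridge regression and, continuing the sketch above, the base models' predictions as the inputs:

```r
# Sketch: ensemble several base models via ridge regression. Each column of
# 'base_preds' holds one model's (ideally out-of-fold) predictions for the
# training patients; 'y' is the actual outcome. glmnet with alpha = 0 is ridge.
library(glmnet)

base_preds <- cbind(gbm = gbm_pred, rf = rf_pred)

ridge <- cv.glmnet(x = base_preds, y = train_df$y, alpha = 0)

# The final prediction is a linear combination of the base models' predictions
ensemble_pred <- predict(ridge, newx = base_preds, s = "lambda.min")
```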
Miscellany: one of the Milestone papers interestingly pointed out that the distribution of one feature changed sharply in the last year of available data; in finance we'd call this a "regime change." The authors decided to toss that feature entirely as a result. They illustrated what clearly does appear to be a change in the nature of the feature's statistical distribution, but they did not provide a concrete quantitative test for it, and my own efforts to write such a screen haven't been successful so far. The issue is that you may not worry about a change in the mean or variance or even a few higher-order moments of a feature's distribution, but you should worry if the variable's family of distributions changes: if something that used to be normally distributed suddenly becomes uniformly distributed, that's a real problem.
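For what it's worth, one natural candidate for such a screen is the two-sample Kolmogorov-Smirnov test, which compares whole distributions rather than individual moments. A minimal sketch with placeholder names:

```r
# Sketch: compare a feature's distribution in the last year against earlier
# years with a two-sample Kolmogorov-Smirnov test. A tiny p-value says the
# two samples are unlikely to come from the same distribution, which catches
# a change in shape (e.g. normal to uniform), not just a shift in the mean.
# 'claims', 'Year', and 'SomeFeature' are placeholder names.
earlier <- claims$SomeFeature[claims$Year < max(claims$Year)]
latest  <- claims$SomeFeature[claims$Year == max(claims$Year)]

ks.test(earlier, latest)
```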
There was little attempt to impose a real-world interpretation on the raw data. The winners generally didn't try to explain why their models do what they do with drug prescriptions, hospitalization locations, etc. With minor exceptions, they focused on getting good data and good data-mining algorithms. To some degree the selection of features implies some kind of interpretation (why did you calculate this feature? why are you picking this subset of features?), but that is not explained in much depth, and I take that to mean it was not done on the basis of heavy thinking about the real-world meaning of the data.
Having already been through a rookie stumbling phase with Amazon EC2, I am pleased to say I’m using it a bit more efficiently now. I’ve already got a “base” snapshot of a Linux install (Ubuntu) sitting around, and I’ve done all my work on a separate EC2 volume. If I ever want to cease work for a while, I can just detach the drive and stop the instance, and only pay for storage. If I want a lot of computing power or I want to try more than one thing in parallel, I can duplicate the volume, create some new higher-powered instances, attach the volumes to the new instances, and go. It’s pleasantly easy at this point to get started.