Posts Tagged ‘data’
The Heritage Health Prize
I started work on a second data competition, the Heritage Health Prize, which is well-known in the community as it has a very large purse, $3 million to the winning team. The objective of this competition is to predict hospitalizations for patients, given health insurance claims data for those patients in previous years. It is a tremendous application of data analysis, as I think healthcare is extremely fertile ground for increasing efficiency by being smarter about care and prescription and procedure. I may be off-and-on with this one, working on it for a while and then letting it sit for a while; as before, my objective is to learn as much as I can, not realistically to win, and if I feel like I’m spinning my wheels I’ll drop it for a while.
What I particularly like about this competition is the “Milestone Prizes” that the organizers also award. The competition will last for two years, and every 6 months the top 2 entrants win a much smaller but not insubstantial prize, in the five-digit dollar range. In order to claim the Milestones, the winning teams must submit a write-up of their methodology, to the organizers’ satisfaction. Here are links to the Milestone 1 and Milestone 2 papers. (You can only read those PDFs if you are in the competition, unfortunately, and I don’t intend to re-share them if the organizers don’t want them to be shared.)
Two Milestones have passed, with the third coming up in a few weeks. The papers have been tremendously helpful in getting started; my initial approach has been a highly simplified version of their procedures, and it’s good enough to get to 211th place out of 1268 (though only 818 entries right now clear a naïve-ish benchmark where every entry is predicted at an optimized constant value, and I say “naïve-ish” because the method for deducing that optimized constant is thoughtful). Unfortunately my efforts at sophisticating my models along the lines of the papers have not yielded much improvement beyond my initial go, but hopefully I’ll figure something out.
Although two Milestones have passed, it is helpful to read the first Milestone papers first, because the later ones build on/make reference to the previous ones. I was surprised by the similarity of the papers’ structure, despite being written independently:
Features: from the raw data supplied by the competition, what variables became the input into your prediction algorithms? In some cases, there is no transformation; you feed the competition data right through. In other cases, the papers calculated per patient averages, minimums and maximums, etc. and fed those through.
Algorithms: in general, strong entries use more than one (see “Ensembling” below). This is where some of the ornery mathematics comes into play, and to really do a good job here you need to read some academic papers. But many of the established statistical models have already been implemented in languages such as R, so if you simply want to get an entry on the leaderboard you actually don’t need to know too much about the models; download them and run them as a black box. (I’m still learning about these models and yet I’ve managed to write implementations that use them.) R in particular has strong community development of these statistical models and is what I’ve been using. The algorithms that are new to me that I’ve been trying to learn so far are called gradient boosting and random forests.
Feature selection: a model is a combination of an algorithm and a subset of the available features. You might run the same algorithm on two different subsets of the features and call those two separate models. Models with the same algorithms may benefit less from the ensembling step (see below) because they may perform similarly well or similarly poorly on a given data point, but the papers both seem to employ this strategy to generate better predictions.
Ensembling: it seems the established way to get a strong overall model is to harness many different prediction models and ensemble them with a top-level algorithm that weights the models accordingly. The idea is that different models may perform well on different subsets of the data (for whatever reason; the “why” may not be well understood), so if you can combine them in a manner that uses the best suited model for each data point, you’ll have a very strong predictor. I actually find the papers to be a little sparse on some details here (maybe because I’m inexperienced) but I think the procedure followed by the Milestone winners is to run what’s called a ridge regression to calculate weightings for each model and for the final prediction to be a linear combination of the models.
Miscellany: One of the Milestone papers interestingly pointed out that the distribution for one feature changed sharply in the last year of available data. In finance we’d call this a “regime change.” The authors decided to toss that feature entirely as a result. They illustrated what clearly does appear to be a change in the nature of the feature’s statistical distribution but did not provide a concrete quantitative test for it, and my own efforts to write such a screen haven’t been successful so far. The issue is that you may not worry about a change in the mean or variance or even a few higher-order moments of a feature’s statistical distribution, but you may be worried if the variable’s family of distributions changed; if something used to be normally distributed and suddenly becomes uniformly distributed, that’s a real problem.
There was little attempt to impose a real-world interpretation on the raw data. The winners generally didn’t try to say something about why their models do what they do with drug prescriptions, hospitalization locations, etc. With minor exceptions, they focused on getting good data and good data mining algorithms. To some degree the selection of features induces some kind of interpretation – why did you calculate this feature? why are you picking this subset of features? – but that is not explained in much depth, and I interpret that to mean that it was not done on the basis of heavy thinking about real-world meaning of the data.
Having already been through a rookie stumbling phase with Amazon EC2, I am pleased to say I’m using it a bit more efficiently now. I’ve already got a “base” snapshot of a Linux install (Ubuntu) sitting around, and I’ve done all my work on a separate EC2 volume. If I ever want to cease work for a while, I can just detach the drive and stop the instance, and only pay for storage. If I want a lot of computing power or I want to try more than one thing in parallel, I can duplicate the volume, create some new higher-powered instances, attach the volumes to the new instances, and go. It’s pleasantly easy at this point to get started.
Big data autodidacticism
The aforementioned Facebook data mining contest ends today. The contest was, given a directed graph with missing edges and a list of nodes, to predict up to 10 new edges for each node in the list to point to. This is the first time I’ve tried a Kaggle competition. I picked it up as a way to teach myself about machine learning and data analysis techniques. I’ve also done a bit of reading from Toby Segaran’s Programming Collective Intelligence (I also have Drew Conway and John Myles White’s Machine Learning for Hackers but haven’t really gone beyond the intros yet). And I’ve also been trying out a machine learning course from Coursera, given by Stanford professor Andrew Ng, which is just finishing up as well.
On Kaggle I’m somewhere around the 75th to 80th percentile, although I’m afraid to say my solution is essentially the same as one posted (possibly against the rules?) in the discussion forums, so not really an original idea on my part. For an early description of my attempts, see the previous post. As it turns out, those attempts all fared worse than a PageRank-like algorithm that operated as follows, given a node for which you want to predict outgoing edges:
- Every other node is initially scored zero.
- Send out a value of 1/(# of edges) out along each edge to each neighbor, both on outgoing and incoming edges. So both nodes that point to and are pointed to this node will receive this value, and a neighbor node that both points to and is pointed to by the node in question will receive 2x this value.
- Add the value received by each neighboring node to its score.
- Repeat steps 2 and 3 recursively twice, going out to the neighbors’ neighbors and the neighbors’ neighbors’ neighbors, but in these cases, if sending a value across an incoming edge (in the reverse direction that the edge points), do not add the value received by the neighbor to its score.
Note that this is not a probability distribution across nodes. I avoided looking at the forum-posted solution and implementation for a while, then finally when I thought I was kind of spinning my wheels I read it through and punted around a few random improvements, but none of them really worked. (I did re-implement the solution in my own code framework, of course.)
Prior to starting on Kaggle, I had been sort of following along and plugging away at the examples in Segaran’s book, reproducing the code, running the examples myself, etc. I was learning, but I think it really helps to have some kind of project or target to go after. It’s the difference between, say, learning music by listening to lots of songs and reading scores and charts and theory, and learning by actually picking up an instrument and playing. (During this time I actually picked up guitar as well – it’s not a bad change of pace when you need one, and it’s nice to fiddle around with one while a slow-moving program is running.) I do still plan to return to his book and continue along with more examples, hopefully with a better appreciation and faster learn rate now that I’ve tried a project.
Participating in the competition was definitely educational, but as mentioned, it does lend itself to some wheel spinning. When submitting predictions, the competition does compute your overall score (using a metric publicly defined in the rules), but no details about what you did right and what you did wrong, as you might actually have in a real-life situation. Obviously they have to do this so that people don’t just submit a solution that is overfit to the test data. But this does mean, I think, that you’re just going to learn at that much of a slower pace.
Kaggle did give me the chance to use Amazon EC2 for what is ostensibly its “real” purpose, which is to purchase computing power by the hour. The algorithm described above is slow (at least my implementation of it was slow, maybe someone out there has a smarter and speedier version), and would take hours and possibly days to run on my laptop (a MacBook Air). Once I started getting to the point where my algorithms were taking this long, I took it to the cloud, spinning up a high powered Linux instance, uploading the code, and running it there. It still would take a few hours by the end, but that’s a bearable runtime.
To take full advantage of the multiple cores on the high-end EC2 instances I had to rewrite the code to support multithreading, which was something I hadn’t done before, and which was in my opinion generally a frustrating experience, lending itself to unpredictable crashes and more challenging debugging.
A word or two about Coursera, whose machine learning course I’m finishing up now: I liked it enough to try some more courses, but at times it felt like I was just following along the motions. To extend my music analogies, it felt like I was indeed actually playing guitar, but someone was sitting behind me holding my hands making me strum and finger all the chords. I’m not positive how much I will retain and how much will slide out my ears within the coming weeks. The slides and the presentations are good reads, but the programming exercises aren’t all that. The benefits you get from taking an in-person, structured class is that you also have close contact and cooperation with classmates; maybe you realistically can’t do Courseras unless they’re coupled with Meetups.
Edge prediction
Recently I’ve been working on a Kaggle competition sponsored by Facebook. Kaggle is a website onto which firms and organizations can upload their own data mining competitions open to the public. They will provide some sort of input/output data set, named a training set in the lingo of machine learning, which competitors use to create their predictive algorithms. They will also provide a test set of inputs and some metric for scoring predicted output versus true output. Competitors submit their predictions and Kaggle scores them against the true outputs and ranks the leaders.
I don’t have a realistic hope of winning this competition – it’s my first time trying and there are pro data scientists working on this stuff – but it has been a good way to learn about the design of machine learning algorithms. Additionally, while it’s not a truly Big Data set (the uncompressed training data set is 142 MB), it’s big enough that you can’t go with brute force methods; you need to be thoughtful about what you do and do not spend time computing.
The Facebook competition is an edge prediction problem. Facebook provides a data file describing some kind of social network (it isn’t the Facebook graph, and obviously it’s anonymous, with graph nodes represented by numbers; someone in the forums put up a decent guess that it’s Instagram) that has had some of its edges deleted. The graph is directed, meaning that every connection is from one node and to another node; A can connect to B independently of B connecting to A. Facebook provides a list of nodes and asks you to make 0 to 10 ranked recommendations as to what other nodes it should follow, or in other words, what missing edges you would recommend drawing from that node to the rest of the graph.
The training set consists of about 1.86 million nodes connected by about 9.44 million edges. There are no self-connections; an edge always connects two distinct nodes. Theoretically you want to be able to assess a score to every pair of nodes (from-node, to-node) and grab the top several pairs for which edges do not already exist. However this requires a couple trillion score calculations, which for any computationally costly score calculation will become infeasible, and in any case will often produce a poor score that is subsequently discarded. So you have to cut your scope down; for each node, you might consider only nodes within a certain number of connections. (In fact my highest-ranking effort at present writing only attempts to connect node A to node B if node B is already connected to node A; perhaps this implies that my more sophisticated attempts are super lame, but hey, it’s currently 64th percentile, so you could do worse.)
There are two papers I’ve found informative for the same reason, namely, that they provide a broad overview of edge prediction methodology. Liben-Nowell and Kleinberg’s “The Link Prediction Problem for Social Networks” I found to be more readable. Cukierski, Hamner, and Yang’s “Graph-based Features for Supervised Link Predictions” I found to be drier, but it specifically addresses directed graphs and it directly recounts the authors’ successful entry into a similar competition for Flickr (in fact Hamner now works for Kaggle).
My first attempts (before really reading the above papers) were based off of a simple tip in the Kaggle forums. He proposed simply suggesting every connection A -> B for which B -> A (if A not already -> B). This actually would already get you to the 30th percentile as of the present writing, though this figure will of course drop over time. My best result is still a refined version of this approach, which simply ranks these predictions in a more intelligent fashion. Subsequent attempts at something “smarter” have not yielded improvements in score.
The general approach I’m taking is to define a relevant neighborhood for each node from-node in the test set and then assess a score on each potential edge (from-node, to-node). In the brute-force case each node’s relevant neighborhood would be the entire graph; in the aforementioned strategy of completing bilateral connections, the relevant neighborhood would be any parent nodes of from-node. If you’re just computing one feature, you can just rank the nodes by score, optionally truncate the list based on some kind of cutoff, and return the top 10 nodes as your recommendations (or fewer if there are fewer than 10). The determination of the cutoff is a problem with an unclear answer; I think you have to do some kind of analysis on the distribution of scores, but even then you’re ultimately drawing a line in the sand.
Alternatively, particularly if you’d like to combine more than one feature into your analysis, you could run a logistic regression, which is what I’ve been doing. Briefly, a linear regression attempts to fit a linear equation to a set of input variables to predict the value of an outcome variable; this can give distorted results if the outcome variables all fall within a band, such as if you’re trying to predict a 0-or-1 outcome. A logistic regression transforms the outcome variables from the range [0,1] to the full number line using a function called the logistic function; you would then invert it on any predictions back to the range [0,1] to get a meaningful number.
In our case we can say we’re trying to predict the probability that an edge has been deleted between two nodes, and score node pairs based on this predicted probability. If you are only using one feature to predict, the logistic regression will be trivial, since the logistic function is monotonic; if one pair scores higher than another then it’ll still score higher after being passed through the logistic function. But you can run a regression on multiple features, such as if you wanted to use both two nodes’ common neighbors and their combined number of neighbors, and you can also add square and cube terms and cross terms and all the usual jazz that people do with regressions. Viewing the ranking score as a probability also gives you some intuition behind where you might set a cutoff.
The highest score I’ve gotten so far involved plugging the nodes into a regression based on the numbers of children and parent connections on both the from-node and the to-node. There are a bunch of other methodologies in the above papers that I’d like to try – I’m currently working on a PageRank-based calculation, PageRank being the algorithm underlying how Google ranks web query relevancy.
Statistically translating phrases with unusual translations
Google Translate, according to Wikipedia and my own empirical observations, is based on the statistical machine translation paradigm. Rather than constructing its translations by learning dictionaries and rules of grammar, a statistical machine translator will analyze texts for which it has known good translations in multiple languages and will learn how to translate new phrases from them. Statistical translation is descriptivist, reflecting how people actually write and speak rather than how rules of syntax dictate they should write and speak. (To the extent that the source texts themselves reflect how people actually write and speak, of course.)
Consequently, phrases that are in practice translated in a manner that differs strongly from their literal translations are translated as done in practice, not in the literal sense. Idioms are certainly one type of phrase that match this criterion:
- L’habit ne fait pas le moine (French) translates to The clothes do not make the man (English), although the literal translation is The robe does not make the monk. (What is quite interesting is if you start with a lower case, l’habit ne fait pas le moine, you get a non-idiomatic translation, appearances can be deceiving.)
- I’m pulling your leg (English) translates to Yo estoy tomando el pelo (Spanish), with pulling your leg translated to a phrase that in Spanish literally means pulling your hair (but has the same meaning as the English idiom).
- I guess either Google did not source the news about the Costa Concordia disaster or faced too much diversity of translation when sourcing it, because the infamous phrase Vada a bordo, cazzo (Italian) is translated as Go on board, fucking (English), which is sort of broken and clearly was not parsed as a single phrase. This phrase was shouted by an Italian Coast Guard officer at the boat’s captain when the captain proved unwilling to go back and help the rescue; from what I’ve read it seems the right translation might be Get on board, dammit (what the press said) or Get the fuck on board (I suspect this is more unbowdlerizedly accurate, it sounds like the kind of thing a seafaring officer would have said in the stress of that situation if he were speaking English) or Get on board, you dick (apparently more literal, but I think the second phrase sounds slightly more natural).
Another kind of phrase falling into this category is titles. 千と千尋の神隠し (Japanese), a beautiful 2001 animated movie directed by Hayao Miyazaki, translates to Spirited Away (English), which was how the studios translated its title when releasing it to English-speaking countries. The same title also translates to Voyage de Chihiro (French), which was its title in French-speaking countries (almost; it was more precisely Le voyage de Chihiro, and I wonder if there’s a non-statistical rule at play on Google’s side that made it drop the article?).
I don’t speak a word of Japanese, but I found this article regarding the translation of the title, which more directly translates it into English as “Sen and Chihiro’s (experience of) being spirited away.” (In the film Chihiro is at one point renamed Sen, which has significance in her need to hold on to her identity.) Hence both of the above “official” translations differ from the literal translation, and in different ways; the English translation drops most of the title but retains the “spiriting away,” and the French translation drops Sen and converts the “spiriting away” into “the voyage.”
This poses a bit of a problem when you actually want a literal translation. On my tumblr I recently referenced the fact that the Chinese title of Infernal Affairs, the Hong Kong movie upon which The Departed is based, apparently more directly translates to “the non-stop path,” a reference to Buddhist/Chinese hell. But when you feed 無間道 into Google Translate, you get The Departed in English (Infernal Affairs is actually listed as an alternate translation and not the first choice! Interesting that that happened). To underscore Google’s proper-noun interpretation of this phrase, French and Spanish also translate this to The Departed, which I guess means that most of Google’s source text in these languages reused the untranslated English title. (Translating to Portuguese, on the other hand, produces Os Infiltrados, which according to IMDB was the title under which the film was released in Brazil.)
In any case you can’t get any other English translation from Google on this count. I do greatly prefer the data-driven, descriptivist approach of statistical translation over a rules-based approach (and the success of Google Translate is a testament to the validity of the statistics); this is a small but interesting area where it falls a little short. You’re only as good as your data.
Personal data organization
At work today I heard someone debating to himself whether a certain email belonged in this folder or that folder. I use Gmail for my personal mail and it kills me how much better my personal mail is than my work email, particularly in light of the fact that I get so much more mail at work than at home and need modern email client features more badly there. I don’t organize my personal email at all; everything hits my inbox and sits there, and I move nothing to a folder and tag nothing with a label. Every time I have needed to find email on a certain topic, all I have had to do is search my inbox.
Gmail’s biggest positive contribution to my life was freeing me from the need to impose a manual organizational scheme on my email. The value of folders as a method to help locate emails is already negated with a quality search engine. Tags and labels are certainly an improvement over folders because an email could logically belong to many different semantic categories, as my co-worker acknowledged. But a search engine that can parse emails and identify relevant to search strings is effectively tagging everything automatically. There’s no need to hand-label emails regarding your pickup soccer games if searching for “soccer” returns all the relevant email for you.
As far as personal data organization goes, email is a comparatively simple problem to solve (on the scale of things), because the data contained in emails is predominantly textual and is mostly written in conversational prose. The problem of searching one’s email can be generalized to searching one’s personal data, including email, documents, music, software, etc., at which point you encounter new and difficult challenges. How do you attack the problem of searching photos, music, movies, and other media? How do you search data that may require some contextual interpretation?
To the latter point, I think an interesting challenge to personal data organization would be to get computers to recognize relevance that may be true for you but not for others. As a simple example, if you search your data and documents for “soccer teammates,” you might want to see any emails and other data regarding your teammate John Doe even if the email makes no direct mention of soccer or a team; the search engine could have figured out that John is someone relevant to you as a soccer teammate. On the other hand, John’s parents probably don’t think of him as a soccer teammate but as their son. So if they were to query their personal data for “soccer teammates” the search engine should not return John’s emails.