Big data autodidacticism
The aforementioned Facebook data mining contest ends today. The task: given a directed graph with edges removed and a list of nodes, predict up to 10 of the removed outgoing edges for each node in the list. This is the first time I’ve tried a Kaggle competition. I picked it up as a way to teach myself machine learning and data analysis techniques. I’ve also done a bit of reading from Toby Segaran’s Programming Collective Intelligence (I have Drew Conway and John Myles White’s Machine Learning for Hackers too, but haven’t really gone beyond the intros yet), and I’ve been trying out a machine learning course from Coursera, given by Stanford professor Andrew Ng, which is just finishing up as well.
On Kaggle I’m somewhere around the 75th to 80th percentile, although I’m afraid to say my solution is essentially the same as one posted (possibly against the rules?) in the discussion forums, so not really an original idea on my part. For an early description of my attempts, see the previous post. As it turns out, those attempts all fared worse than a PageRank-like algorithm that operated as follows, given a node for which you want to predict outgoing edges:
1. Every other node starts with a score of zero.
2. Send a value of 1/(# of edges) along each edge to each neighbor, over both outgoing and incoming edges. So both nodes that point to this node and nodes that it points to receive this value, and a neighbor that both points to and is pointed to by the node in question receives 2x this value.
3. Add the value received by each neighboring node to its score.
4. Repeat steps 2 and 3 recursively twice, going out to the neighbors’ neighbors and the neighbors’ neighbors’ neighbors; in these later rounds, though, if a value is sent across an incoming edge (against the direction the edge points), do not add it to the receiving node’s score.
Note that the resulting scores do not form a probability distribution across nodes. I avoided looking at the forum-posted solution and implementation for a while; finally, when I felt I was spinning my wheels, I read it through and kicked around a few random improvements, but none of them really worked. (I did re-implement the solution in my own code framework, of course.)
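To make the procedure concrete, here’s a minimal sketch in Python. The representation (out_edges / in_edges as dicts of sets) and the function name are my own, and I’m reading “repeat steps 2 and 3” as splitting the value each node received across its own edges, PageRank-style; treat it as an illustration of the idea, not the forum-posted code.

```python
from collections import defaultdict

def score_candidates(out_edges, in_edges, source, max_depth=3):
    """Score candidate targets for `source` by pushing weight along
    both edge directions, per the numbered steps above."""
    scores = defaultdict(float)
    frontier = [(source, 1.0)]  # (node, value that arrived at that node)
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for node, value in frontier:
            degree = len(out_edges.get(node, ())) + len(in_edges.get(node, ()))
            if degree == 0:
                continue
            share = value / degree  # split what arrived across all edges
            for nbr in out_edges.get(node, ()):
                scores[nbr] += share  # forward edges always add to score
                next_frontier.append((nbr, share))
            for nbr in in_edges.get(node, ()):
                if depth == 1:            # reverse edges add to score
                    scores[nbr] += share  # only on the first hop
                next_frontier.append((nbr, share))
        # The frontier grows roughly as degree**depth, which is part of
        # why this gets slow on a dense graph.
        frontier = next_frontier
    scores.pop(source, None)  # never predict an edge back to the source
    return scores
```

To turn the scores into a submission, you’d rank candidates by score, drop the nodes the source already points to, and keep the top 10.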
Prior to starting on Kaggle, I had been sort of following along and plugging away at the examples in Segaran’s book, reproducing the code, running the examples myself, etc. I was learning, but I think it really helps to have some kind of project or target to go after. It’s the difference between, say, learning music by listening to lots of songs and reading scores and charts and theory, and learning by actually picking up an instrument and playing. (During this time I actually picked up guitar as well – it’s not a bad change of pace when you need one, and it’s nice to fiddle around with one while a slow-moving program is running.) I do still plan to return to his book and continue along with more examples, hopefully with a better appreciation and a faster learning rate now that I’ve tried a project.
Participating in the competition was definitely educational, but as mentioned, it does lend itself to some wheel spinning. When you submit predictions, the competition computes your overall score (using a metric publicly defined in the rules) but gives no detail about what you got right and what you got wrong, the kind of feedback you might actually have in a real-life situation. Obviously they have to do this so that people don’t just submit a solution that is overfit to the test data. But it does mean, I think, that you learn that much more slowly.
Kaggle did give me the chance to use Amazon EC2 for what is ostensibly its “real” purpose, which is to purchase computing power by the hour. The algorithm described above is slow (at least my implementation of it was slow; maybe someone out there has a smarter and speedier version) and would have taken hours and possibly days to run on my laptop (a MacBook Air). Once my algorithms started taking that long, I took them to the cloud, spinning up a high-powered Linux instance, uploading the code, and running it there. It would still take a few hours by the end, but that’s a bearable runtime.
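For what it’s worth, the spin-up itself can be scripted from Python with the boto library; here’s a rough sketch in which the region, AMI ID, key pair, and instance type are all placeholders, not necessarily what I used:

```python
import time
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')  # placeholder region
reservation = conn.run_instances(
    'ami-xxxxxxxx',             # placeholder 64-bit Linux AMI
    instance_type='c1.xlarge',  # one of the high-CPU instance types
    key_name='my-keypair',      # placeholder key pair
)
instance = reservation.instances[0]
while instance.update() != 'running':
    time.sleep(5)  # poll until the instance comes up
print(instance.public_dns_name)  # then scp the code up and ssh in
```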
To take full advantage of the multiple cores on the high-end EC2 instances, I had to rewrite the code to support multithreading, something I hadn’t done before. It was, in my opinion, a generally frustrating experience, prone to unpredictable crashes and harder debugging.
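In Python, one way to get the same parallelism while sidestepping shared-state threading bugs is the multiprocessing module, since worker processes share no memory (and CPython’s GIL keeps CPU-bound threads from running in parallel anyway). This is a sketch, not what my competition code did; it reuses the score_candidates function from earlier, and the data are toy stand-ins:

```python
from multiprocessing import Pool

# Toy stand-ins for the real adjacency data and the test node list.
out_edges = {1: {2}, 2: {3}, 3: {1, 2}}
in_edges = {1: {3}, 2: {1, 3}, 3: {2}}
test_nodes = [1, 2, 3]

def predict_for_node(node):
    # score_candidates is the sketch from earlier in the post.
    scores = score_candidates(out_edges, in_edges, node)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Skip nodes this one already points to; keep the top 10.
    return node, [n for n in ranked if n not in out_edges[node]][:10]

if __name__ == '__main__':
    pool = Pool()  # defaults to one worker process per core
    predictions = dict(pool.map(predict_for_node, test_nodes))
    pool.close()
    pool.join()
```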
A word or two about Coursera, whose machine learning course I’m finishing up now: I liked it enough to try some more courses, but at times it felt like I was just going through the motions. To extend my music analogies, it felt like I was indeed actually playing guitar, but with someone sitting behind me, holding my hands, making me strum and finger all the chords. I’m not positive how much I will retain and how much will slide out my ears in the coming weeks. The slides and the presentations are good reads, but the programming exercises aren’t all that. One benefit of taking an in-person, structured class is close contact and cooperation with classmates; maybe you realistically can’t do Courseras unless they’re coupled with Meetups.