Recently I’ve been working on a Kaggle competition sponsored by Facebook. Kaggle is a website where firms and organizations can host data mining competitions open to the public. The sponsor provides some sort of input/output data set, called a training set in the lingo of machine learning, which competitors use to create their predictive algorithms. It also provides a test set of inputs and some metric for scoring predicted output against true output. Competitors submit their predictions, and Kaggle scores them against the true outputs and maintains a leaderboard.
I don’t have a realistic hope of winning this competition – it’s my first time trying and there are professional data scientists working on this stuff – but it has been a good way to learn about the design of machine learning algorithms. Additionally, while it’s not truly a Big Data set (the uncompressed training data is 142 MB), it’s big enough that you can’t get away with brute-force methods; you need to be thoughtful about what you do and do not spend time computing.
The Facebook competition is an edge prediction problem. Facebook provides a data file describing some kind of social network (it isn’t the Facebook graph, and obviously it’s anonymized, with graph nodes represented by numbers; someone in the forums put up a decent guess that it’s Instagram) that has had some of its edges deleted. The graph is directed, meaning that every connection goes from one node to another; A can connect to B independently of B connecting to A. Facebook provides a list of nodes and asks you, for each one, to make up to 10 ranked recommendations for other nodes it should follow – in other words, the missing edges you would recommend drawing from that node to the rest of the graph.
The training set consists of about 1.86 million nodes connected by about 9.44 million edges. There are no self-connections; an edge always connects two distinct nodes. Ideally you would assign a score to every (from-node, to-node) pair and grab the top-scoring pairs for which edges do not already exist. But that requires a few trillion score calculations, which is infeasible for any computationally costly score, and in any case most of those scores would be poor and immediately discarded. So you have to cut your scope down; for each node, you might consider only nodes within a certain number of connections. (In fact my highest-ranking effort as of this writing only attempts to connect node A to node B if node B is already connected to node A; perhaps this implies that my more sophisticated attempts are super lame, but hey, it’s currently 64th percentile, so you could do worse.)
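To make this concrete, here’s a minimal sketch of how the graph might be loaded into memory. I’m assuming the training file is a CSV of (from-node, to-node) pairs with a header row; the file name and column layout are my assumptions for illustration, not the competition spec.

```python
# Minimal sketch: load the edge list into adjacency sets.
# Assumes "train.csv" holds (from_node, to_node) rows after a header.
import csv
from collections import defaultdict

children = defaultdict(set)  # children[a] = set of nodes that a follows
parents = defaultdict(set)   # parents[b] = set of nodes that follow b

with open("train.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for from_node, to_node in reader:
        a, b = int(from_node), int(to_node)
        children[a].add(b)
        parents[b].add(a)
```

Sets make the later neighborhood computations (intersections, differences) cheap, which matters when you’re scoring millions of candidate pairs.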
I’ve found two papers informative for the same reason: both provide a broad overview of edge prediction methodology. Liben-Nowell and Kleinberg’s “The Link Prediction Problem for Social Networks” is the more readable of the two. Cukierski, Hamner, and Yang’s “Graph-based Features for Supervised Link Prediction” is drier, but it specifically addresses directed graphs and directly recounts the authors’ successful entry in a similar competition for Flickr (in fact Hamner now works for Kaggle).
My first attempts (before really reading the above papers) were based on a simple tip someone posted in the Kaggle forums: suggest every connection A -> B for which B -> A already exists (and A -> B does not). This alone would already get you to the 30th percentile as of this writing, though that figure will of course drop over time. My best result is still a refined version of this approach, which simply ranks these predictions in a more intelligent fashion. Subsequent attempts at something “smarter” have not yielded improvements in score.
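Here’s a sketch of that baseline, assuming the children/parents adjacency sets from the loading snippet above. Note the candidates come back in arbitrary order, which is exactly what the refined ranking improves on.

```python
# Bilateral-completion baseline: recommend nodes that follow you
# but that you don't follow back. Assumes the children/parents
# defaultdicts built earlier.
def recommend_bilateral(node, children, parents, max_recs=10):
    candidates = parents[node] - children[node]
    return list(candidates)[:max_recs]  # arbitrary order, up to 10
```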
The general approach I’m taking is to define a relevant neighborhood for each from-node in the test set and then assign a score to each potential edge (from-node, to-node). In the brute-force case each node’s relevant neighborhood would be the entire graph; in the aforementioned strategy of completing bilateral connections, the relevant neighborhood would be the parent nodes of from-node. If you’re computing just one feature, you can rank the nodes by score, optionally truncate the list at some cutoff, and return the top 10 nodes as your recommendations (or fewer if there are fewer than 10). Determining the cutoff has no clean answer; I think you have to do some kind of analysis of the distribution of scores, but even then you’re ultimately drawing a line in the sand.
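A sketch of that pipeline, under the same assumptions as before. The score_edge feature here – common neighbors, ignoring direction – is just an illustrative stand-in for whatever feature you actually compute.

```python
# Illustrative feature: count of common neighbors, ignoring direction.
def score_edge(a, b, children, parents):
    neighbors_a = children[a] | parents[a]
    neighbors_b = children[b] | parents[b]
    return len(neighbors_a & neighbors_b)

# General pipeline: build a relevant neighborhood, score each
# candidate edge, and return the top 10 by score. The neighborhood
# here is the bilateral case: parents the node doesn't follow back.
def recommend(node, children, parents, max_recs=10):
    candidates = parents[node] - children[node]
    ranked = sorted(candidates,
                    key=lambda b: score_edge(node, b, children, parents),
                    reverse=True)
    return ranked[:max_recs]
```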
Alternatively, particularly if you’d like to combine more than one feature into your analysis, you could run a logistic regression, which is what I’ve been doing. Briefly, a linear regression fits a linear equation to a set of input variables to predict the value of an outcome variable; this can give distorted results when the outcome is bounded, such as a 0-or-1 outcome, because a linear fit will happily predict values outside [0, 1]. A logistic regression instead works on the log-odds scale: the logit function maps the outcome from the range (0, 1) onto the full number line, and its inverse, the logistic function, maps predictions back into (0, 1) so you get a meaningful number.
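To make the two transformations concrete, here’s a tiny numpy sketch: logit is the log-odds map out of (0, 1), and logistic is its inverse back in.

```python
import numpy as np

def logit(p):
    # Maps probabilities in (0, 1) onto the full number line.
    return np.log(p / (1 - p))

def logistic(x):
    # Inverse of logit: maps any real number back into (0, 1).
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
assert np.allclose(logistic(logit(p)), p)  # round-trips, up to floating point
```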
In our case we can say we’re trying to predict the probability that an edge between two nodes has been deleted, and score node pairs by this predicted probability. If you’re only using one feature, the logistic regression is trivial for ranking purposes, since the logistic function is monotonic; if one pair scores higher than another, it’ll still score higher after being passed through the logistic function. But you can run a regression on multiple features – say, two nodes’ count of common neighbors and their combined number of neighbors – and you can also add square and cube terms and cross terms and all the usual jazz that people do with regressions. Viewing the ranking score as a probability also gives you some intuition about where to set a cutoff.
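Here’s a hedged sketch of a multi-feature fit with scikit-learn. The two features (common neighbors, combined neighbor count) and the square term are illustrative choices, and the toy X and y are stand-ins; in practice they’d come from sampling node pairs and labeling each by whether its edge was deleted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: columns are (common_neighbors, total_neighbors).
X = np.array([[5, 40], [0, 12], [8, 30], [1, 55], [12, 60], [0, 8]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = this pair's edge was deleted

# Add a square term on common neighbors: one of "the usual jazz" options.
X_aug = np.column_stack([X, X[:, 0] ** 2])

model = LogisticRegression().fit(X_aug, y)
probs = model.predict_proba(X_aug)[:, 1]  # predicted deletion probabilities
```

The predicted probabilities are what you’d use both to rank candidate edges and to reason about a cutoff.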
The highest score I’ve gotten so far came from a regression on the numbers of child and parent connections of both the from-node and the to-node. There are a bunch of other methodologies in the above papers that I’d like to try – I’m currently working on a PageRank-based calculation, PageRank being the algorithm originally behind how Google ranks pages in web search results.
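For reference, here’s a minimal power-iteration PageRank sketch over the adjacency sets from earlier – textbook PageRank, not the competition-specific variant I’m still tinkering with. The damping factor of 0.85 is the conventional choice, and nodes is assumed to be the full set of node IDs in the graph.

```python
# Textbook PageRank by power iteration over the children adjacency sets.
# Rank mass from dangling nodes (no out-edges) is spread uniformly.
def pagerank(children, nodes, damping=0.85, iterations=30):
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1 - damping) / n for v in nodes}
        dangling = 0.0
        for v in nodes:
            out = children[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share
            else:
                dangling += damping * rank[v]
        for v in nodes:
            new_rank[v] += dangling / n
        rank = new_rank
    return rank
```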