Archive for June 2012
Recently I’ve been working on a Kaggle competition sponsored by Facebook. Kaggle is a website where firms and organizations can host data mining competitions open to the public. The sponsor provides an input/output data set, called a training set in the lingo of machine learning, which competitors use to build their predictive algorithms. The sponsor also provides a test set of inputs and a metric for scoring predicted output against true output. Competitors submit their predictions, and Kaggle scores them against the true outputs and ranks the leaders.
I don’t have a realistic hope of winning this competition – it’s my first time trying and there are pro data scientists working on this stuff – but it has been a good way to learn about the design of machine learning algorithms. Additionally, while it’s not a truly Big Data set (the uncompressed training data set is 142 MB), it’s big enough that you can’t go with brute force methods; you need to be thoughtful about what you do and do not spend time computing.
The Facebook competition is an edge prediction problem. Facebook provides a data file describing some kind of social network (it isn’t the Facebook graph, and obviously it’s anonymous, with graph nodes represented by numbers; someone in the forums put up a decent guess that it’s Instagram) that has had some of its edges deleted. The graph is directed, meaning that every connection is from one node and to another node; A can connect to B independently of B connecting to A. Facebook provides a list of nodes and asks you to make 0 to 10 ranked recommendations as to what other nodes it should follow, or in other words, what missing edges you would recommend drawing from that node to the rest of the graph.
The training set consists of about 1.86 million nodes connected by about 9.44 million edges. There are no self-connections; an edge always connects two distinct nodes. In theory you would like to assign a score to every pair of nodes (from-node, to-node) and grab the top several pairs for which edges do not already exist. However, this requires a couple trillion score calculations, which is infeasible for any computationally costly score, and in any case most of those scores would be poor and subsequently discarded. So you have to cut your scope down; for each node, you might consider only nodes within a certain number of connections. (In fact my highest-ranking effort as of this writing only attempts to connect node A to node B if node B is already connected to node A; perhaps this implies that my more sophisticated attempts are super lame, but hey, it’s currently 64th percentile, so you could do worse.)
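A quick back-of-envelope check, using the approximate node count above, confirms the scale:

```python
n = 1_860_000          # approximate number of nodes in the training set
pairs = n * (n - 1)    # ordered (from-node, to-node) pairs, excluding self-loops
print(f"{pairs:.2e}")  # on the order of 3.5 trillion candidate edges
```

Even at a million score evaluations per second, exhausting that space would take over a hundred years, hence the need to prune the candidate set aggressively.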
There are two papers I’ve found informative for the same reason, namely, that they provide a broad overview of edge prediction methodology. Liben-Nowell and Kleinberg’s “The Link Prediction Problem for Social Networks” I found to be more readable. Cukierski, Hamner, and Yang’s “Graph-based Features for Supervised Link Prediction” I found to be drier, but it specifically addresses directed graphs and it directly recounts the authors’ successful entry into a similar competition for Flickr (in fact Hamner now works for Kaggle).
My first attempts (before really reading the above papers) were based off of a simple tip someone posted in the Kaggle forums. The poster proposed simply suggesting every connection A -> B for which B -> A exists (if A does not already connect to B). This alone would already get you to the 30th percentile as of this writing, though that figure will of course drop over time. My best result is still a refined version of this approach, which simply ranks these predictions in a more intelligent fashion. Subsequent attempts at something “smarter” have not yielded improvements in score.
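That bilateral-completion tip is a few lines of code; here is a minimal sketch on a hypothetical toy edge list (the variable names are mine, not from the competition data):

```python
from collections import defaultdict

# Hypothetical toy graph; each tuple is a directed edge (from, to).
edges = [(1, 2), (2, 1), (3, 1), (1, 4), (4, 3)]

children = defaultdict(set)  # node -> nodes it points to
parents = defaultdict(set)   # node -> nodes pointing to it
for a, b in edges:
    children[a].add(b)
    parents[b].add(a)

def bilateral_candidates(a):
    """Nodes b such that b -> a exists but a -> b does not."""
    return parents[a] - children[a]

print(bilateral_candidates(1))  # {3}: node 3 follows 1, but 1 does not follow back
```

On the real data the only remaining work is ordering these candidates, which is where the ranking refinements come in.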
The general approach I’m taking is to define a relevant neighborhood for each from-node in the test set and then assign a score to each potential edge (from-node, to-node). In the brute-force case each node’s relevant neighborhood would be the entire graph; in the aforementioned strategy of completing bilateral connections, the relevant neighborhood would be the parent nodes of the from-node. If you’re computing just one feature, you can simply rank the nodes by score, optionally truncate the list based on some kind of cutoff, and return the top 10 nodes as your recommendations (or fewer if there are fewer than 10). The determination of the cutoff is a problem with an unclear answer; I think you have to do some kind of analysis on the distribution of scores, but even then you’re ultimately drawing a line in the sand.
Alternatively, particularly if you’d like to combine more than one feature into your analysis, you could run a logistic regression, which is what I’ve been doing. Briefly, a linear regression attempts to fit a linear equation to a set of input variables to predict the value of an outcome variable; this can give distorted results if the outcome variable is confined to a band, such as when you’re trying to predict a 0-or-1 outcome. A logistic regression instead models the outcome on the full number line via the logit function (the inverse of the logistic function), which maps the range [0,1] to the entire real line; you then apply the logistic function to any predictions to map them back to the range [0,1] and get a meaningful number.
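Concretely, the two functions involved are each a one-liner:

```python
import math

def logistic(x):
    """Map the full number line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the logistic: map (0, 1) back to the number line (log-odds)."""
    return math.log(p / (1.0 - p))

print(logistic(0.0))         # 0.5
print(logit(logistic(2.0)))  # recovers 2.0 (up to floating-point error)
```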
In our case we can say we’re trying to predict the probability that an edge between two nodes has been deleted, and score node pairs by this predicted probability. If you are only using one feature, the logistic regression is trivial, since the logistic function is monotonic: if one pair scores higher than another, it will still score higher after being passed through the logistic function. But you can run a regression on multiple features, such as the number of neighbors two nodes have in common and their combined number of neighbors, and you can also add square and cube terms and cross terms and all the usual jazz that people do with regressions. Viewing the ranking score as a probability also gives you some intuition about where to set a cutoff.
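For instance, a fitted model with two features and a cross term would score a candidate pair along these lines; the coefficients below are made-up placeholders for illustration, not values from my actual regression:

```python
import math

# Hypothetical fitted coefficients: intercept, common-neighbor count,
# combined neighbor count, and their interaction term.
B0, B1, B2, B3 = -3.0, 0.8, 0.01, -0.002

def edge_probability(common, combined):
    """Predicted probability that the (from, to) edge was deleted."""
    z = B0 + B1 * common + B2 * combined + B3 * common * combined
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

# More common neighbors -> higher predicted probability.
print(edge_probability(1, 40), edge_probability(5, 40))
```

A cutoff then has a natural reading: for example, only emit recommendations whose predicted probability clears some threshold.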
The highest score I’ve gotten so far came from plugging the node pairs into a regression on the numbers of child and parent connections of both the from-node and the to-node. There are a bunch of other methodologies in the above papers that I’d like to try; I’m currently working on a PageRank-based calculation, PageRank being the algorithm underlying how Google ranks web pages for relevance to a query.
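For reference, the core of PageRank is just a power iteration over the link structure. A minimal sketch on a toy graph (my own illustration, not the competition code; a link-prediction variant would likely be personalized to the from-node):

```python
def pagerank(children, damping=0.85, iters=50):
    """Basic PageRank by power iteration.

    `children` maps each node to the list of nodes it links to."""
    nodes = set(children)
    for targets in children.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = children.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                for t in nodes:  # dangling node: spread its rank uniformly
                    new[t] += damping * rank[v] / n
        rank = new
    return rank

r = pagerank({1: [2], 3: [2], 2: [1]})
print(r)  # node 2, with two in-links, ranks highest
```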
A travel game I came up with (and by “came up with” I mean “ripped off of Cranium and my college economics classes”) that I’ve been thinking about a lot in elevators and subways recently is what I call the coordination game. You need at least two people to play, as you need a partner, and three people to play competitively. To play a round, first come up with some kind of category. You and your partner each think of/write down a list of five items that fit that category and then share your lists. For each item that appears on both of your lists, you each get a point. With two players you can just try to get a high score; with three players you could rotate partners and try to individually get a high score, and so on. Because you are trying to coordinate with your partner, you will want to include the five items that he is most likely to include on his list, but at the same time he will be trying to do the same thing with you. It helps if you know a bit about your partner’s background, education, psychology, and likes and dislikes, and also any relevant shared experiences.
A good example is the category “elements of the periodic table.” There is a well-known natural ordering of the periodic table, by atomic number. If two chemists were playing each other, they might simply write the first five: hydrogen, helium, lithium, beryllium, and boron. However these are far from the five best-known elements, with the last two in particular being somewhat obscure. If one partner is not sure whether the other knows the periodic table or not, then she might be inclined to go with more commonly-known elements (and note that this may happen even if both partners do in fact know the periodic table, since both people are trying to guess what the other will do).
The organic chemistry elements might be the ones that jump most readily to mind for someone with a casual background in chemistry, since most people learn the periodic table in high school and those elements are the ones that appear most often in such classes. So someone might write hydrogen, carbon, nitrogen, and oxygen, with the fifth being a bit tricky because those first four are the Big Four elements that appear in organic contexts. You might try sulfur or phosphorus.
The context of a chemistry class taken in one’s past is important, because these aren’t the best-known elements either. Elements such as iron, gold, silver, copper, tin, and lead have been known to humanity for thousands of years and as such have “everyday,” non-scientific-sounding names. People with no chemistry background whatsoever know what these elements are. But in fact this may not be the best approach to take, because people do not generally think of these things as “elements of the periodic table,” a phrase which leads one to think of laboratory chemistry and not the silverware in your kitchen.
Another very different example is James Bond movies. You could again go with the first five: Dr. No, From Russia With Love, Goldfinger, Thunderball, and You Only Live Twice. Or you could go with the most recent five: Quantum of Solace, Casino Royale, Die Another Day, The World is Not Enough, and Tomorrow Never Dies. (Would you include the not-yet-released Skyfall in there? Your choice.) You could also try to judge what the most currently famous five Bond movies are, which has no objectively right answer; a film buff or an older partner would probably be more biased towards the “classic” films like Goldfinger, whereas the man on the street, especially if he is younger, might be more biased towards relatively recent hits like Casino Royale or GoldenEye.
Categories should be broad and include many qualified items, far more than five, and have no really obvious natural ordering that everyone will easily seize on (“numbers” is a bad category). They should not be so obscure as to make naming five a challenge; the fun is not in trying to stump people but in trying to tacitly determine where to coordinate out of a large pool of possibilities.
Google Translate, according to Wikipedia and my own empirical observations, is based on the statistical machine translation paradigm. Rather than constructing its translations by learning dictionaries and rules of grammar, a statistical machine translator will analyze texts for which it has known good translations in multiple languages and will learn how to translate new phrases from them. Statistical translation is descriptivist, reflecting how people actually write and speak rather than how rules of syntax dictate they should write and speak. (To the extent that the source texts themselves reflect how people actually write and speak, of course.)
Consequently, phrases that in practice are translated in a manner that differs strongly from their literal translations are rendered as done in practice, not in the literal sense. Idioms are certainly one type of phrase that matches this criterion:
- L’habit ne fait pas le moine (French) translates to The clothes do not make the man (English), although the literal translation is The robe does not make the monk. (What is quite interesting is if you start with a lower case, l’habit ne fait pas le moine, you get a non-idiomatic translation, appearances can be deceiving.)
- I’m pulling your leg (English) translates to Yo estoy tomando el pelo (Spanish), with pulling your leg translated to a phrase that in Spanish literally means taking your hair (but has the same meaning as the English idiom).
- I guess either Google did not source the news about the Costa Concordia disaster or faced too much diversity of translation when sourcing it, because the infamous phrase Vada a bordo, cazzo (Italian) is translated as Go on board, fucking (English), which is sort of broken and clearly was not parsed as a single phrase. This phrase was shouted by an Italian Coast Guard officer at the ship’s captain when the captain proved unwilling to go back and help the rescue. From what I’ve read, the right translation might be Get on board, dammit (what the press said), or Get the fuck on board (I suspect this is more unbowdlerizedly accurate; it sounds like the kind of thing a seafaring officer would have said in the stress of that situation if he were speaking English), or Get on board, you dick (apparently more literal, but I think the second phrase sounds slightly more natural).
Another kind of phrase falling into this category is titles. 千と千尋の神隠し (Japanese), a beautiful 2001 animated movie directed by Hayao Miyazaki, translates to Spirited Away (English), which was how the studios translated its title when releasing it to English-speaking countries. The same title also translates to Voyage de Chihiro (French), which was its title in French-speaking countries (almost; it was more precisely Le voyage de Chihiro, and I wonder if there’s a non-statistical rule at play on Google’s side that made it drop the article?).
I don’t speak a word of Japanese, but I found this article regarding the translation of the title, which more directly translates it into English as “Sen and Chihiro’s (experience of) being spirited away.” (In the film Chihiro is at one point renamed Sen, which has significance in her need to hold on to her identity.) Hence both of the above “official” translations differ from the literal translation, and in different ways; the English translation drops most of the title but retains the “spiriting away,” and the French translation drops Sen and converts the “spiriting away” into “the voyage.”
This poses a bit of a problem when you actually want a literal translation. On my tumblr I recently referenced the fact that the Chinese title of Infernal Affairs, the Hong Kong movie upon which The Departed is based, apparently more directly translates to “the non-stop path,” a reference to Buddhist/Chinese hell. But when you feed 無間道 into Google Translate, you get The Departed in English (Infernal Affairs is actually listed as an alternate translation and not the first choice! Interesting that that happened). To underscore Google’s proper-noun interpretation of this phrase, French and Spanish also translate this to The Departed, which I guess means that most of Google’s source text in these languages reused the untranslated English title. (Translating to Portuguese, on the other hand, produces Os Infiltrados, which according to IMDB was the title under which the film was released in Brazil.)
In any case you can’t get any other English translation from Google on this count. I do greatly prefer the data-driven, descriptivist approach of statistical translation over a rules-based approach (and the success of Google Translate is a testament to the validity of the statistics); this is a small but interesting area where it falls a little short. You’re only as good as your data.