Posts Tagged ‘statistics’
Weak assumptions
This post may be easier to read if you have some comfort with financial mathematics.
Thousands of people across the history of finance have dutifully memorized one of the most famous results in financial mathematics, the Black-Scholes formula for pricing a European option. For the sake of completeness (skip ahead if you like), here is the formula for pricing a European call (C) or put (P) on a non-dividend-paying asset, which you can also find in countless textbooks and on countless websites:
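$$C = S\,N(d_1) - K e^{-rt}\,N(d_2)$$

$$P = K e^{-rt}\,N(-d_2) - S\,N(-d_1)$$

where

$$d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\,t}{\sigma\sqrt{t}}, \qquad d_2 = d_1 - \sigma\sqrt{t}$$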
and S is the underlying asset price, K is the strike price of the option, t is the time to option expiry, r is the interest rate out to time t, σ is the volatility of the underlying asset, and N() represents the cdf of a standard normal distribution.
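For readers who would rather see the formula as code, here is a minimal sketch using only Python's standard library. The function names and the sample inputs are my own illustrative choices, not anything from a particular pricing library.

```python
import math


def norm_cdf(x):
    """Standard normal cdf, written N() in the text."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(S, K, t, r, sigma):
    """Return (call, put) prices for a European option on a non-dividend-paying asset."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    discounted_strike = K * math.exp(-r * t)
    call = S * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    put = discounted_strike * norm_cdf(-d2) - S * norm_cdf(-d1)
    return call, put


# Illustrative inputs only: an at-the-money option with one year to expiry.
call, put = black_scholes(S=100.0, K=100.0, t=1.0, r=0.05, sigma=0.2)
print(f"call = {call:.4f}, put = {put:.4f}")
```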
It is important to remember that while this is a ubiquitous formula used to price options, so much so that option prices are thought of by many traders in terms of their Black-Scholes volatility rather than their dollar price, it is only a mathematical model and is only correct insofar as its assumptions are met. And as with all models, real life matches the model assumptions imperfectly. You could come up with another option pricing model based off of different assumptions and in some sense it would be no more “right” or “wrong” than Black-Scholes; the area of debate would be how well those assumptions fit reality.
For example, let’s say that you had an option on a small pharmaceutical company that was awaiting FDA approval on its only product, a drug upon which the entire firm’s fortunes rested. If the FDA approved, the stock would go to $100, and if not, the stock would go to $0. In this case Black-Scholes’s assumptions about the dynamics of the stock price are very poorly met, and it would not be a great model to use.
Some financiers who are particularly dutiful have also memorized formulas for the basic Black-Scholes greeks. For example, the deltas (sensitivities to underlying asset price) of a call and a put are

$$\Delta_C = N(d_1), \qquad \Delta_P = N(d_1) - 1 = -N(-d_1)$$
The relationship between the delta of a call and a put of the same strike and expiry is therefore: call delta – put delta = 1. The formulas for the deltas are strictly Black-Scholes; you can get them by taking the derivative of the Black-Scholes pricing formula, and they might not be accurate under a different option pricing model. But the relationship between the two is not strictly Black-Scholes; it depends solely on put-call parity.
Put-call parity states that the price of a call minus the price of a put equals the asset price minus the discounted present value of the strike price. It is a much weaker assumption than those that underlie Black-Scholes. You don’t need to say anything about volatility, or Brownian motion, or continuous-time hedging. Not only that, it’s very intuitive and logical: if you have the right to buy a stock above $100 at some point in the future, and someone has the right to sell a stock to you below $100 at that same point in time, you essentially have a forward agreement to buy the stock at $100, which at that point in time will be worth the value of the stock less $100, and which today will be worth the stock price less the discounted value of $100 at expiry. It’s much harder to imagine scenarios in which put-call parity would be violated than in which Black-Scholes assumptions are violated (in fact Black-Scholes assumptions imply put-call parity).
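To see that the delta relationship survives outside Black-Scholes, here is a small sketch built on the pharmaceutical example from earlier: the stock ends at either $100 or $0. The pricing model is my own toy assumption (zero interest rate, with the probability of approval backed out of today's stock price), and the deltas are estimated by bumping the stock price rather than from any closed form.

```python
def binary_prices(S, K, high=100.0, low=0.0):
    """Call and put prices when the stock ends at `high` or `low` (zero interest rate)."""
    p = (S - low) / (high - low)  # probability of the high outcome implied by S
    call = p * max(high - K, 0.0) + (1.0 - p) * max(low - K, 0.0)
    put = p * max(K - high, 0.0) + (1.0 - p) * max(K - low, 0.0)
    return call, put


def binary_deltas(S, K, bump=1e-4):
    """Central finite-difference deltas of the call and the put."""
    up_call, up_put = binary_prices(S + bump, K)
    down_call, down_put = binary_prices(S - bump, K)
    return (up_call - down_call) / (2 * bump), (up_put - down_put) / (2 * bump)


S, K = 40.0, 60.0  # hypothetical stock price and strike
call, put = binary_prices(S, K)
call_delta, put_delta = binary_deltas(S, K)

parity_gap = (call - put) - (S - K)        # zero if put-call parity holds
delta_gap = (call_delta - put_delta) - 1.0  # zero if call delta - put delta = 1
print(parity_gap, delta_gap)
```

Nothing here looks like Black-Scholes (there is no volatility parameter at all), yet both gaps come out at zero up to floating-point noise.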
What this means is that any options model that accepts the weak and almost always realistic assumption of put-call parity must have the same relationship between call delta and put delta. Let’s look at another slightly trickier example, regarding vega (sensitivity to volatility) and theta (sensitivity to the passage of time). The Black-Scholes formulas for vega and theta of a call are:

$$\text{vega} = \frac{\partial C}{\partial \sigma} = S\,N'(d_1)\,\sqrt{t}$$

$$\text{theta} = -\frac{\partial C}{\partial t} = -\frac{S\,N'(d_1)\,\sigma}{2\sqrt{t}} - r K e^{-rt}\,N(d_2)$$

where N'() represents the pdf of a standard normal distribution.
(The negative sign in the theta is there because I have represented t as time to expiry, and theta is typically thought of as how value changes as time moves forward, in which case t would be decreasing.) Let’s further assume that the interest rate is zero, so that the theta simplifies to:

$$\text{theta} = -\frac{S\,N'(d_1)\,\sigma}{2\sqrt{t}}$$

In this case, the relationship between vega and theta is:

$$\text{theta} = -\frac{\sigma}{2t}\,\text{vega}$$

This relationship, though under the further assumption of a zero interest rate, holds under a weaker assumption than Black-Scholes: it requires that your volatility parameter (however you define that) and your time to expiry are used in the price solely in the form of an intermediate parameter σ * sqrt(t). To see this mathematically, let’s write the call price as some unspecified function of this intermediate parameter:

$$C = f(\sigma\sqrt{t})$$

Then if we take derivatives with the chain rule:

$$\text{vega} = \frac{\partial C}{\partial \sigma} = f'(\sigma\sqrt{t})\,\sqrt{t}$$

$$\text{theta} = -\frac{\partial C}{\partial t} = -f'(\sigma\sqrt{t})\,\frac{\sigma}{2\sqrt{t}} = -\frac{\sigma}{2t}\,\text{vega}$$

and you can see that the relationship holds. If interest rates are zero, Black-Scholes does satisfy this weaker assumption; if we define V = σ * sqrt(t), the d1 and d2 terms can be rewritten as:

$$d_1 = \frac{\ln(S/K)}{V} + \frac{V}{2}, \qquad d_2 = \frac{\ln(S/K)}{V} - \frac{V}{2}$$
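A quick numerical check of the vega-theta relationship, as a sketch: the call price below is Black-Scholes with a zero interest rate, written so that σ and t enter only through V = σ * sqrt(t), and the greeks are estimated by finite differences. The inputs and the bump size are arbitrary illustrative values.

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_price(S, K, t, sigma):
    """Black-Scholes call with zero interest rate, via total volatility V."""
    V = sigma * math.sqrt(t)
    d1 = math.log(S / K) / V + V / 2.0
    d2 = d1 - V
    return S * norm_cdf(d1) - K * norm_cdf(d2)


S, K, t, sigma = 100.0, 110.0, 0.5, 0.3  # hypothetical inputs
h = 1e-5                                 # finite-difference bump

vega = (call_price(S, K, t, sigma + h) - call_price(S, K, t, sigma - h)) / (2 * h)
# Theta carries a minus sign because t is time to expiry, which shrinks as time passes.
theta = -(call_price(S, K, t + h, sigma) - call_price(S, K, t - h, sigma)) / (2 * h)

gap = theta + sigma / (2.0 * t) * vega   # should be ~0 if theta = -(sigma / 2t) * vega
print(vega, theta, gap)
```

Halving σ while quadrupling t leaves V unchanged, so `call_price(S, K, 4 * t, sigma / 2)` gives the same price as `call_price(S, K, t, sigma)`.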
We might call V “total” volatility. The intuition behind tying σ and t together is that an option price depends on the probability distribution of the asset out to time t, which in turn depends on a) the value of t and b) how “innately” volatile the asset is, represented by σ. A high-volatility asset will have a wider distribution than a low-volatility asset over the same time frame, but the low-volatility asset will have a wider distribution at some point if you examine it over a sufficiently longer time frame than the high-volatility asset. Combining the two parameters as V = σ * sqrt(t) is to say that you’ve defined your σ as a per-root-time measure of volatility, or, more simply, you’ve defined σ² as a per-time measure of volatility. For those who have taken some stochastic math, you’ll know that this is indeed true of standard Brownian motion: variance at time t is σ²t.
Why might you be interested in this (which otherwise seems like a small mathematical exercise to kick at financial interview candidates)? Of course, the fewer assumptions your models need, the better, and we can more broadly and confidently apply any aspects of our modeling framework that depend on only a subset of the full assumptions. It’s not simply that we need to worry that much less about matching assumptions and reality, but also that these aspects of the model will be robust to changes in a real-world environment. In times of financial crisis, certain assumptions that were a very strong fit to reality for a long time may suddenly fall apart. Rather than either relying on violable assumptions or throwing out a model that does actually work most of the time, we can assess what aspects of our models rely on exactly what assumptions and be aware of what will and will not hold up in a changing environment.
The Heritage Health Prize
I started work on a second data competition, the Heritage Health Prize, which is well-known in the community as it has a very large purse, $3 million to the winning team. The objective of this competition is to predict hospitalizations for patients, given health insurance claims data for those patients in previous years. It is a tremendous application of data analysis, as I think healthcare is extremely fertile ground for increasing efficiency by being smarter about care and prescription and procedure. I may be off-and-on with this one, working on it for a while and then letting it sit for a while; as before, my objective is to learn as much as I can, not realistically to win, and if I feel like I’m spinning my wheels I’ll drop it for a while.
What I particularly like about this competition is the “Milestone Prizes” that the organizers also award. The competition will last for two years, and every 6 months the top 2 entrants win a much smaller but not insubstantial prize, in the five-digit dollar range. In order to claim the Milestones, the winning teams must submit a write-up of their methodology, to the organizers’ satisfaction. Here are links to the Milestone 1 and Milestone 2 papers. (You can only read those PDFs if you are in the competition, unfortunately, and I don’t intend to re-share them if the organizers don’t want them to be shared.)
Two Milestones have passed, with the third coming up in a few weeks. The papers have been tremendously helpful in getting started; my initial approach has been a highly simplified version of their procedures, and it’s good enough to get to 211th place out of 1268 (though only 818 entries right now clear a naïve-ish benchmark where every entry is predicted at an optimized constant value, and I say “naïve-ish” because the method for deducing that optimized constant is thoughtful). Unfortunately my efforts at sophisticating my models along the lines of the papers have not yielded much improvement beyond my initial go, but hopefully I’ll figure something out.
Although two Milestones have passed, it is helpful to read the first Milestone papers first, because the later ones build on/make reference to the previous ones. I was surprised by the similarity of the papers’ structure, despite being written independently:
- Features: from the raw data supplied by the competition, what variables became the input into your prediction algorithms? In some cases, there is no transformation; you feed the competition data right through. In other cases, the papers calculated per patient averages, minimums and maximums, etc. and fed those through.
- Algorithms: in general, strong entries use more than one (see “Ensembling” below). This is where some of the ornery mathematics comes into play, and to really do a good job here you need to read some academic papers. But many of the established statistical models have already been implemented in languages such as R, so if you simply want to get an entry on the leaderboard you actually don’t need to know too much about the models; download them and run them as a black box. (I’m still learning about these models and yet I’ve managed to write implementations that use them.) R in particular has strong community development of these statistical models and is what I’ve been using. The algorithms that are new to me that I’ve been trying to learn so far are called gradient boosting and random forests.
- Feature selection: a model is a combination of an algorithm and a subset of the available features. You might run the same algorithm on two different subsets of the features and call those two separate models. Models with the same algorithms may benefit less from the ensembling step (see below) because they may perform similarly well or similarly poorly on a given data point, but the papers both seem to employ this strategy to generate better predictions.
- Ensembling: it seems the established way to get a strong overall model is to harness many different prediction models and ensemble them with a top-level algorithm that weights the models accordingly. The idea is that different models may perform well on different subsets of the data (for whatever reason; the “why” may not be well understood), so if you can combine them in a manner that uses the best suited model for each data point, you’ll have a very strong predictor. I actually find the papers to be a little sparse on some details here (maybe because I’m inexperienced) but I think the procedure followed by the Milestone winners is to run what’s called a ridge regression to calculate weightings for each model and for the final prediction to be a linear combination of the models.
- Miscellany: One of the Milestone papers interestingly pointed out that the distribution for one feature changed sharply in the last year of available data. In finance we’d call this a “regime change.” The authors decided to toss that feature entirely as a result. They illustrated what clearly does appear to be a change in the nature of the feature’s statistical distribution but did not provide a concrete quantitative test for it, and my own efforts to write such a screen haven’t been successful so far. The issue is that you may not worry about a change in the mean or variance or even a few higher-order moments of a feature’s statistical distribution, but you may be worried if the variable’s family of distributions changed; if something used to be normally distributed and suddenly becomes uniformly distributed, that’s a real problem.
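To make the ensembling step concrete, here is a minimal sketch of ridge-regression blending in plain Python. It is my guess at the general shape of the procedure, not the Milestone winners' actual code: the toy predictions, the penalty `lam`, and all function names are hypothetical, and there is no intercept term. In practice you would fit the weights on held-out predictions rather than on the same rows you evaluate on.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ridge_weights(model_preds, actual, lam):
    """Weights w minimizing |X w - y|^2 + lam * |w|^2, one column of X per model."""
    k = len(model_preds)
    gram = [[sum(a * b for a, b in zip(model_preds[i], model_preds[j]))
             + (lam if i == j else 0.0) for j in range(k)] for i in range(k)]
    rhs = [sum(a * y for a, y in zip(model_preds[i], actual)) for i in range(k)]
    return solve_linear(gram, rhs)


def blend(model_preds, weights):
    """Final prediction: a linear combination of the individual models."""
    return [sum(w * preds[i] for w, preds in zip(weights, model_preds))
            for i in range(len(model_preds[0]))]


def sse(preds, actual):
    """Sum of squared errors."""
    return sum((p - y) ** 2 for p, y in zip(preds, actual))


actual = [0.0, 1.0, 2.0, 3.0]
model_a = [0.2, 0.8, 2.2, 2.8]    # made-up predictions that err in one direction
model_b = [-0.2, 1.2, 1.8, 3.2]   # made-up predictions that err in the other

weights = ridge_weights([model_a, model_b], actual, lam=0.1)
blended = blend([model_a, model_b], weights)
print(weights, sse(blended, actual), sse(model_a, actual), sse(model_b, actual))
```

Because the two toy models make opposite errors, the blend lands much closer to the truth than either model alone, which is the intuition behind ensembling.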
There was little attempt to impose a real-world interpretation on the raw data. The winners generally didn’t try to say something about why their models do what they do with drug prescriptions, hospitalization locations, etc. With minor exceptions, they focused on getting good data and good data mining algorithms. To some degree the selection of features induces some kind of interpretation – why did you calculate this feature? why are you picking this subset of features? – but that is not explained in much depth, and I interpret that to mean that it was not done on the basis of heavy thinking about real-world meaning of the data.
Having already been through a rookie stumbling phase with Amazon EC2, I am pleased to say I’m using it a bit more efficiently now. I’ve already got a “base” snapshot of a Linux install (Ubuntu) sitting around, and I’ve done all my work on a separate EC2 volume. If I ever want to cease work for a while, I can just detach the drive and stop the instance, and only pay for storage. If I want a lot of computing power or I want to try more than one thing in parallel, I can duplicate the volume, create some new higher-powered instances, attach the volumes to the new instances, and go. It’s pleasantly easy at this point to get started.
Big data autodidacticism
The aforementioned Facebook data mining contest ends today. The contest was, given a directed graph with missing edges and a list of nodes, to predict up to 10 new edges for each node in the list to point to. This is the first time I’ve tried a Kaggle competition. I picked it up as a way to teach myself about machine learning and data analysis techniques. I’ve also done a bit of reading from Toby Segaran’s Programming Collective Intelligence (I also have Drew Conway and John Myles White’s Machine Learning for Hackers but haven’t really gone beyond the intros yet). And I’ve also been trying out a machine learning course from Coursera, given by Stanford professor Andrew Ng, which is just finishing up as well.
On Kaggle I’m somewhere around the 75th to 80th percentile, although I’m afraid to say my solution is essentially the same as one posted (possibly against the rules?) in the discussion forums, so not really an original idea on my part. For an early description of my attempts, see the previous post. As it turns out, those attempts all fared worse than a PageRank-like algorithm that operated as follows, given a node for which you want to predict outgoing edges:
- Every other node is initially scored zero.
- Send a value of 1/(# of edges) along each edge to each neighbor, on both outgoing and incoming edges. So both the nodes that point to this node and the nodes it points to will receive this value, and a neighbor node that both points to and is pointed to by the node in question will receive 2x this value.
- Add the value received by each neighboring node to its score.
- Repeat steps 2 and 3 recursively twice, going out to the neighbors’ neighbors and the neighbors’ neighbors’ neighbors, but in these cases, if sending a value across an incoming edge (in the reverse direction that the edge points), do not add the value received by the neighbor to its score.
Note that this is not a probability distribution across nodes. I avoided looking at the forum-posted solution and implementation for a while, then finally when I thought I was kind of spinning my wheels I read it through and punted around a few random improvements, but none of them really worked. (I did re-implement the solution in my own code framework, of course.)
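The steps above can be sketched roughly as follows. This is my own reading of the procedure, not the forum implementation: I am assuming that at deeper levels each node splits the value it received evenly across all of its edges, and that values still travel backwards along incoming edges even when they are not added to the score. The dictionary-of-sets graph layout and the function names are hypothetical.

```python
from collections import defaultdict


def build_in_edges(out_edges):
    """Invert a {node: set of children} mapping into {node: set of parents}."""
    in_edges = defaultdict(set)
    for src, targets in out_edges.items():
        for dst in targets:
            in_edges[dst].add(src)
    return in_edges


def propagate_scores(source, out_edges, in_edges, depth=3):
    """Score nodes by pushing value outward from `source` for `depth` levels."""
    scores = defaultdict(float)  # every node starts at zero

    def send(node, value, level):
        children = out_edges.get(node, ())
        parents = in_edges.get(node, ())
        n_edges = len(children) + len(parents)
        if n_edges == 0:
            return
        share = value / n_edges  # assumption: split the value evenly over edges
        for child in children:
            scores[child] += share
            if level + 1 < depth:
                send(child, share, level + 1)
        for parent in parents:
            if level == 0:       # reverse-direction sends only score at level one
                scores[parent] += share
            if level + 1 < depth:
                send(parent, share, level + 1)

    send(source, 1.0, 0)
    return scores


def recommend(source, out_edges, in_edges, k=10):
    """Top-k scored nodes that `source` does not already point to."""
    scores = propagate_scores(source, out_edges, in_edges)
    existing = out_edges.get(source, set())
    ranked = sorted(((score, node) for node, score in scores.items()
                     if node != source and node not in existing),
                    key=lambda pair: (-pair[0], pair[1]))
    return [node for _, node in ranked[:k]]


# Tiny made-up graph: 4 -> 1 -> 2 -> 3
out_edges = {1: {2}, 2: {3}, 3: set(), 4: {1}}
in_edges = build_in_edges(out_edges)
print(recommend(1, out_edges, in_edges))
```

The plain recursion revisits nodes many times, which is one reason an implementation along these lines gets slow on a graph with millions of edges.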
Prior to starting on Kaggle, I had been sort of following along and plugging away at the examples in Segaran’s book, reproducing the code, running the examples myself, etc. I was learning, but I think it really helps to have some kind of project or target to go after. It’s the difference between, say, learning music by listening to lots of songs and reading scores and charts and theory, and learning by actually picking up an instrument and playing. (During this time I actually picked up guitar as well – it’s not a bad change of pace when you need one, and it’s nice to fiddle around with one while a slow-moving program is running.) I do still plan to return to his book and continue along with more examples, hopefully with a better appreciation and faster learn rate now that I’ve tried a project.
Participating in the competition was definitely educational, but as mentioned, it does lend itself to some wheel spinning. When submitting predictions, the competition does compute your overall score (using a metric publicly defined in the rules), but gives no details about what you did right and what you did wrong, as you might actually have in a real-life situation. Obviously they have to do this so that people don’t just submit a solution that is overfit to the test data. But this does mean, I think, that you’re just going to learn at that much of a slower pace.
Kaggle did give me the chance to use Amazon EC2 for what is ostensibly its “real” purpose, which is to purchase computing power by the hour. The algorithm described above is slow (at least my implementation of it was slow, maybe someone out there has a smarter and speedier version), and would take hours and possibly days to run on my laptop (a MacBook Air). Once I started getting to the point where my algorithms were taking this long, I took it to the cloud, spinning up a high powered Linux instance, uploading the code, and running it there. It still would take a few hours by the end, but that’s a bearable runtime.
To take full advantage of the multiple cores on the high-end EC2 instances I had to rewrite the code to support multithreading, which was something I hadn’t done before, and which was in my opinion generally a frustrating experience, lending itself to unpredictable crashes and more challenging debugging.
A word or two about Coursera, whose machine learning course I’m finishing up now: I liked it enough to try some more courses, but at times it felt like I was just going through the motions. To extend my music analogies, it felt like I was indeed actually playing guitar, but someone was sitting behind me holding my hands making me strum and finger all the chords. I’m not positive how much I will retain and how much will slide out my ears within the coming weeks. The slides and the presentations are good reads, but the programming exercises aren’t all that. One benefit you get from taking an in-person, structured class is that you also have close contact and cooperation with classmates; maybe you realistically can’t do Courseras unless they’re coupled with Meetups.
Edge prediction
Recently I’ve been working on a Kaggle competition sponsored by Facebook. Kaggle is a website onto which firms and organizations can upload their own data mining competitions open to the public. They will provide some sort of input/output data set, named a training set in the lingo of machine learning, which competitors use to create their predictive algorithms. They will also provide a test set of inputs and some metric for scoring predicted output versus true output. Competitors submit their predictions and Kaggle scores them against the true outputs and ranks the leaders.
I don’t have a realistic hope of winning this competition – it’s my first time trying and there are pro data scientists working on this stuff – but it has been a good way to learn about the design of machine learning algorithms. Additionally, while it’s not a truly Big Data set (the uncompressed training data set is 142 MB), it’s big enough that you can’t go with brute force methods; you need to be thoughtful about what you do and do not spend time computing.
The Facebook competition is an edge prediction problem. Facebook provides a data file describing some kind of social network (it isn’t the Facebook graph, and obviously it’s anonymous, with graph nodes represented by numbers; someone in the forums put up a decent guess that it’s Instagram) that has had some of its edges deleted. The graph is directed, meaning that every connection is from one node and to another node; A can connect to B independently of B connecting to A. Facebook provides a list of nodes and asks you to make 0 to 10 ranked recommendations as to what other nodes it should follow, or in other words, what missing edges you would recommend drawing from that node to the rest of the graph.
The training set consists of about 1.86 million nodes connected by about 9.44 million edges. There are no self-connections; an edge always connects two distinct nodes. Theoretically you want to be able to assess a score to every pair of nodes (from-node, to-node) and grab the top several pairs for which edges do not already exist. However this requires a couple trillion score calculations, which for any computationally costly score calculation will become infeasible, and in any case will often produce a poor score that is subsequently discarded. So you have to cut your scope down; for each node, you might consider only nodes within a certain number of connections. (In fact my highest-ranking effort at present writing only attempts to connect node A to node B if node B is already connected to node A; perhaps this implies that my more sophisticated attempts are super lame, but hey, it’s currently 64th percentile, so you could do worse.)
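For a sense of scale, and as a sketch of the "cut your scope down" idea, the snippet below counts the ordered (from-node, to-node) pairs and then collects only the nodes within a fixed number of connections of a starting node. The `neighbors` mapping (direction ignored) and the two-hop limit are my own illustrative assumptions.

```python
from collections import deque

n_nodes = 1_860_000                        # approximate node count from the text
ordered_pairs = n_nodes * (n_nodes - 1)    # no self-connections
print(f"{ordered_pairs:,} candidate pairs")  # roughly 3.46 trillion


def neighborhood(start, neighbors, max_hops=2):
    """Nodes reachable from `start` within `max_hops` connections (breadth-first)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    seen.discard(start)
    return seen


# Made-up chain a - b - c - d, with edge direction ignored.
neighbors = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(neighborhood("a", neighbors))
```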
There are two papers I’ve found informative for the same reason, namely, that they provide a broad overview of edge prediction methodology. Liben-Nowell and Kleinberg’s “The Link Prediction Problem for Social Networks” I found to be more readable. Cukierski, Hamner, and Yang’s “Graph-based Features for Supervised Link Predictions” I found to be drier, but it specifically addresses directed graphs and it directly recounts the authors’ successful entry into a similar competition for Flickr (in fact Hamner now works for Kaggle).
My first attempts (before really reading the above papers) were based on a simple tip in the Kaggle forums. The poster proposed simply suggesting every connection A -> B for which B -> A (if A not already -> B). This actually would already get you to the 30th percentile as of the present writing, though this figure will of course drop over time. My best result is still a refined version of this approach, which simply ranks these predictions in a more intelligent fashion. Subsequent attempts at something “smarter” have not yielded improvements in score.
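The forum tip is simple enough to sketch in a few lines. The dictionary-of-sets layout and the function name are hypothetical, and the ordering here (by node id) is only a placeholder for whatever smarter ranking you prefer.

```python
from collections import defaultdict


def reciprocal_suggestions(node, out_edges, in_edges, k=10):
    """Suggest node -> parent for each parent that `node` does not already follow."""
    already = out_edges.get(node, set())
    candidates = [p for p in in_edges.get(node, set()) if p not in already]
    return sorted(candidates)[:k]  # placeholder ordering, not a real ranking


# Made-up graph: 2, 3 and 4 all point to 1, and 1 already points back to 2.
out_edges = {1: {2}, 2: {1}, 3: {1}, 4: {1}}
in_edges = defaultdict(set)
for src, targets in out_edges.items():
    for dst in targets:
        in_edges[dst].add(src)

print(reciprocal_suggestions(1, out_edges, in_edges))
```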
The general approach I’m taking is to define a relevant neighborhood for each node from-node in the test set and then assess a score on each potential edge (from-node, to-node). In the brute-force case each node’s relevant neighborhood would be the entire graph; in the aforementioned strategy of completing bilateral connections, the relevant neighborhood would be any parent nodes of from-node. If you’re just computing one feature, you can just rank the nodes by score, optionally truncate the list based on some kind of cutoff, and return the top 10 nodes as your recommendations (or fewer if there are fewer than 10). The determination of the cutoff is a problem with an unclear answer; I think you have to do some kind of analysis on the distribution of scores, but even then you’re ultimately drawing a line in the sand.
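The rank, truncate, and return step might look like the sketch below. The score values and the cutoff are made up; as noted above, where to put the cutoff is a judgment call, not something the code can decide for you.

```python
def top_recommendations(scores, k=10, cutoff=None):
    """Rank candidate to-nodes by score, optionally drop low scores, keep at most k."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if cutoff is not None:
        ranked = [(node, score) for node, score in ranked if score >= cutoff]
    return [node for node, _ in ranked[:k]]


scores = {"b": 0.9, "c": 0.2, "d": 0.5}  # hypothetical scores for one from-node
print(top_recommendations(scores))              # all three, best first
print(top_recommendations(scores, cutoff=0.3))  # low-scoring "c" is dropped
```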
Alternatively, particularly if you’d like to combine more than one feature into your analysis, you could run a logistic regression, which is what I’ve been doing. Briefly, a linear regression attempts to fit a linear equation to a set of input variables to predict the value of an outcome variable; this can give distorted results if the outcome variables all fall within a band, such as if you’re trying to predict a 0-or-1 outcome. A logistic regression transforms the outcome variables from the range [0,1] to the full number line using the logit function (the inverse of the logistic function); you would then apply the logistic function to any predictions to map them back to the range [0,1] and get a meaningful number.
In our case we can say we’re trying to predict the probability that an edge has been deleted between two nodes, and score node pairs based on this predicted probability. If you are only using one feature to predict, the logistic regression will be trivial, since the logistic function is monotonic; if one pair scores higher than another then it’ll still score higher after being passed through the logistic function. But you can run a regression on multiple features, such as if you wanted to use both two nodes’ common neighbors and their combined number of neighbors, and you can also add square and cube terms and cross terms and all the usual jazz that people do with regressions. Viewing the ranking score as a probability also gives you some intuition behind where you might set a cutoff.
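Here is a bare-bones sketch of a two-feature logistic regression fit by plain gradient descent, using only the standard library. The training rows (common neighbors, combined neighbors), the labels, the learning rate, and the step count are all made up for illustration; a real run would use an established statistics package and far more data.

```python
import math


def logistic(z):
    """Map the full number line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the logistic function: map (0, 1) to the full number line."""
    return math.log(p / (1.0 - p))


def predict(weights, x):
    """Predicted probability; the last weight is the intercept."""
    z = weights[-1] + sum(w * xj for w, xj in zip(weights, x))
    return logistic(z)


def fit_logistic(rows, labels, steps=5000, rate=0.1):
    """Fit weights by gradient descent on the average log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical node pairs: (common neighbors, combined number of neighbors)
rows = [(0, 2), (1, 3), (1, 5), (2, 4), (3, 4), (4, 6)]
labels = [0, 0, 1, 0, 1, 1]  # 1 = an edge was deleted between the pair

weights = fit_logistic(rows, labels)
probabilities = [predict(weights, x) for x in rows]
print(probabilities)
```

Since `logistic` is monotonic, ranking pairs by the linear score or by the predicted probability gives the same order; the probability scale just makes a cutoff easier to reason about.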
The highest score I’ve gotten so far involved plugging the nodes into a regression based on the numbers of children and parent connections on both the from-node and the to-node. There are a bunch of other methodologies in the above papers that I’d like to try – I’m currently working on a PageRank-based calculation, PageRank being the algorithm underlying how Google ranks web query relevancy.
Statistically translating phrases with unusual translations
Google Translate, according to Wikipedia and my own empirical observations, is based on the statistical machine translation paradigm. Rather than constructing its translations by learning dictionaries and rules of grammar, a statistical machine translator will analyze texts for which it has known good translations in multiple languages and will learn how to translate new phrases from them. Statistical translation is descriptivist, reflecting how people actually write and speak rather than how rules of syntax dictate they should write and speak. (To the extent that the source texts themselves reflect how people actually write and speak, of course.)
Consequently, phrases that are in practice translated in a manner that differs strongly from their literal translations are translated as done in practice, not in the literal sense. Idioms are certainly one type of phrase that matches this criterion:
- L’habit ne fait pas le moine (French) translates to The clothes do not make the man (English), although the literal translation is The robe does not make the monk. (What is quite interesting is if you start with a lower case, l’habit ne fait pas le moine, you get a non-idiomatic translation, appearances can be deceiving.)
- I’m pulling your leg (English) translates to Yo estoy tomando el pelo (Spanish), with pulling your leg translated to a phrase that in Spanish literally means pulling your hair (but has the same meaning as the English idiom).
- I guess either Google did not source the news about the Costa Concordia disaster or faced too much diversity of translation when sourcing it, because the infamous phrase Vada a bordo, cazzo (Italian) is translated as Go on board, fucking (English), which is sort of broken and clearly was not parsed as a single phrase. This phrase was shouted by an Italian Coast Guard officer at the boat’s captain when the captain proved unwilling to go back and help the rescue; from what I’ve read it seems the right translation might be Get on board, dammit (what the press said) or Get the fuck on board (I suspect this is more unbowdlerizedly accurate, it sounds like the kind of thing a seafaring officer would have said in the stress of that situation if he were speaking English) or Get on board, you dick (apparently more literal, but I think the second phrase sounds slightly more natural).
Another kind of phrase falling into this category is titles. 千と千尋の神隠し (Japanese), a beautiful 2001 animated movie directed by Hayao Miyazaki, translates to Spirited Away (English), which was how the studios translated its title when releasing it to English-speaking countries. The same title also translates to Voyage de Chihiro (French), which was its title in French-speaking countries (almost; it was more precisely Le voyage de Chihiro, and I wonder if there’s a non-statistical rule at play on Google’s side that made it drop the article?).
I don’t speak a word of Japanese, but I found this article regarding the translation of the title, which more directly translates it into English as “Sen and Chihiro’s (experience of) being spirited away.” (In the film Chihiro is at one point renamed Sen, which has significance in her need to hold on to her identity.) Hence both of the above “official” translations differ from the literal translation, and in different ways; the English translation drops most of the title but retains the “spiriting away,” and the French translation drops Sen and converts the “spiriting away” into “the voyage.”
This poses a bit of a problem when you actually want a literal translation. On my tumblr I recently referenced the fact that the Chinese title of Infernal Affairs, the Hong Kong movie upon which The Departed is based, apparently more directly translates to “the non-stop path,” a reference to Buddhist/Chinese hell. But when you feed 無間道 into Google Translate, you get The Departed in English (Infernal Affairs is actually listed as an alternate translation and not the first choice! Interesting that that happened). To underscore Google’s proper-noun interpretation of this phrase, French and Spanish also translate this to The Departed, which I guess means that most of Google’s source text in these languages reused the untranslated English title. (Translating to Portuguese, on the other hand, produces Os Infiltrados, which according to IMDB was the title under which the film was released in Brazil.)
In any case you can’t get any other English translation from Google on this count. I do greatly prefer the data-driven, descriptivist approach of statistical translation over a rules-based approach (and the success of Google Translate is a testament to the validity of the statistics); this is a small but interesting area where it falls a little short. You’re only as good as your data.
Correlated cancer treatments
I’ve been reading a really great book, Siddhartha Mukherjee’s The Emperor of All Maladies, which is a history (or as the subtitle calls it, a “biography”) of cancer. I’m far from the first person to praise it; in fact, I started reading it on the basis of a strong recommendation from Marginal Revolution. I will say that Mukherjee is particularly good at distilling the history of oncology into meaningful themes: the ancient theory of humors, aggressive amputation as treatment, rivaling schools of thought on carcinogenesis, and so on. Bad history writing becomes a list of then-this, then-this, while good history writing finds cohesive narrative threads; this is good history writing.
At one point the book brings up a point I hadn’t thought about: since cancer cells divide far more rapidly than normal cells, they will also evolve far more rapidly in response to the selective pressure of medication. As bacteria evolve resistance to antibiotics, so too can cancer cells evolve resistance to chemotherapy. I’m no doctor and so I cannot comment on how important of a consideration this is in cancer treatment (perhaps it is actually quite minor?), but I do find it an interesting though unfortunate example of evolution at a sub-organism level.
The book also discusses the approach developed in the 1960s of treating cancer with an aggressive combination of chemotherapeutic drugs. One chapter describes oncologists trying first two, then three, then four cytotoxic drugs at once, seriously endangering patients’ lives in the hopes of eliminating every last trace of their cancers. (Many cancer treatments are harmful to healthy human cells in addition to cancerous ones, making treatment potentially lethal; at the same time there had been cases of cancer returning after having been reduced to undetectable levels, encouraging doctors to pursue forceful medication even after outward signs of the disease had disappeared.) Mukherjee describes a synergistic effect of combinative treatment: “Since different drugs elicited different resistance mechanisms, and produced different toxicities in cancer cells, using drugs in concert dramatically lowered the chance of resistance and increased cell killing.” (p. 141)
An important consideration in combinative treatments, then, is the correlation between the probabilities of cells evolving resistance against each of them. An ideal pair of treatments would be one where a mutation conferring resistance against one necessarily removed resistance against the other. For example, if one treatment’s chemical pathway relied on the presence of a certain protein and the other relied on its absence, the two in combination would be robust to a mutation that toggled production of that protein. Correspondingly, I would guess that it would be easier for cancers to evolve resistance against two chemotherapies with similar chemical pathways.
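To put rough numbers on the correlation argument, here is a minimal sketch that treats resistance to each drug as a yes/no event. The probabilities and correlations are made up purely for illustration (they are not from the book), and the function name is my own.

```python
import math

def joint_resistance_probability(p1, p2, rho):
    """Probability that a cell is resistant to both drugs.

    p1, p2: probability of resistance to each drug alone (hypothetical).
    rho: correlation between the two yes/no resistance events.
    """
    # For two yes/no events, P(both) = p1*p2 + covariance.
    cov = rho * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    joint = p1 * p2 + cov
    # Not every rho is attainable for given p1 and p2, so clamp the
    # result to the range a joint probability can actually take.
    lower = max(0.0, p1 + p2 - 1.0)
    upper = min(p1, p2)
    return min(max(joint, lower), upper)

# Hypothetical 1% chance of resistance to each drug on its own.
for rho in (-1.0, 0.0, 0.5):
    p = joint_resistance_probability(0.01, 0.01, rho)
    print(f"rho={rho:+.1f}  P(resistant to both)={p:.6f}")
```

Under these made-up inputs, similar pathways (positive correlation) raise the chance of double resistance well above the independent case, while opposing pathways push it toward zero.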
Knowing versus understanding
I was watching a little bit of spring training baseball and at one point the announcer mentioned that the infield fly rule was in effect. For whatever reason, the infield fly rule is sometimes held up as a baseball obscurity known only by devoted fans. It’s actually extremely logical and easy to remember. The problem is in how you remember it: if you simply remember the rule word by word, it will seem like a piece of arcana, but if you understand why it’s in place it’s quite simple.
The infield fly rule states that if there are fewer than two outs and runners on first and second (a runner may or may not be on third as well), the batter is automatically out on any easy pop fly to the infield. The reason it exists is to prevent cheap double plays. If such a ball were not an automatic out, the fielders could wait under the ball and watch the runners. If either runner strayed from his base, the fielders could catch the ball and throw to the vacated base for an easy two outs. But if both runners stayed near their bases, the fielders could intentionally drop the ball, quickly pick it up, and get easy force outs at third and second. To remember the infield fly rule, just remember that it applies any time there might be a cheap double play off of a pop fly.
A good mathematical example of this sort of knowing versus understanding is the normal distribution. The formula for its probability density function φ(x) is as follows:

$$\varphi(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where (μ, σ) are the mean and standard deviation of the distribution, respectively.
To someone unfamiliar with statistics, this seems like a painful thing to memorize. But it’s much easier if you break it into comprehensible pieces. The formula, currently with respect to x, can be re-expressed in terms of a “standardized” x, where you subtract the mean and then divide by the standard deviation. This variable will have a mean of 0 and a standard deviation of 1, since means are additive (if you add n to a random variable, its mean will increase by n) and standard deviations scale multiplicatively (if you multiply a random variable by a positive n, its standard deviation will also scale up by n).
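As a quick sanity check of the standardization step, here is a small sketch on a handful of arbitrary numbers (the sample values and the helper name are mine, chosen only for illustration). It uses sample statistics as a stand-in for the distribution’s mean and standard deviation.

```python
import statistics

def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

sample = [65, 73, 75, 80, 82]  # arbitrary illustrative numbers
z = standardize(sample)

# The standardized values should have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(statistics.fmean(z), statistics.pstdev(z))
```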
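So if we denote our standardized variable as x with a bar over it (a mathematical convention), we get:

$$\bar{x} = \frac{x-\mu}{\sigma}, \qquad \varphi(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\bar{x}^2/2}$$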
Now what about that term on the left? Well, the integral of a probability density function from -∞ to +∞ must have a value of 1, so that the total probability of all its possible outcomes sums to 1. The purpose of that term is simply to normalize the curve so that this condition is met. As it turns out:

$$\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}$$
This is perhaps something that you do need to memorize straight out unless you want to solve this integral every time you write down the normal distribution. But it’s at least a pretty cool relationship to memorize, relating two major mathematical constants with a bit of calculus. If we take this formula and use a change of variables we get:

$$\int_{-\infty}^{\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \sigma\sqrt{2\pi}$$
That’s why the normalizing factor on the left is what it is. Let’s call it A; since we know that it is for the single semantic purpose of normalizing the integral, we should be comfortable shoehorning it into a variable this way. We can now express the normal distribution as follows:

$$\varphi(x) = A \, e^{-\bar{x}^2/2}, \qquad A = \frac{1}{\sigma\sqrt{2\pi}}$$
This is the normal probability distribution function down to its bare bones. It’s simply the curve e^(-x^2/2), adjusted by the desired mean and standard deviation, and then normalized by a factor so that the total area under the curve is 1. If you understand this, not only will it be a lot easier for you to remember the formula, but you’ll have a much better comfort level with the function and will be able to apply it more readily elsewhere.
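The bare-bones view translates almost line for line into code. The sketch below builds the density from the standardized variable and the normalizing factor A, then checks numerically that the area under the curve is 1 and that the integral of e^(-x^2) is the square root of π. The mean and standard deviation (3 and 2) and the simple midpoint-rule integrator are my own choices for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    x_bar = (x - mu) / sigma                       # standardized variable
    a = 1.0 / (sigma * math.sqrt(2.0 * math.pi))   # normalizing factor A
    return a * math.exp(-x_bar ** 2 / 2.0)

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

mu, sigma = 3.0, 2.0  # arbitrary example parameters

# Ten standard deviations either side stands in for -infinity to +infinity.
area = integrate(lambda x: normal_pdf(x, mu, sigma),
                 mu - 10 * sigma, mu + 10 * sigma)
gaussian_integral = integrate(lambda x: math.exp(-x * x), -10.0, 10.0)

print(area)                                   # should be very close to 1
print(gaussian_integral, math.sqrt(math.pi))  # should nearly agree
print(normal_pdf(1.0, mu, sigma), NormalDist(mu, sigma).pdf(1.0))
```

The last line compares the hand-built density against the standard library’s NormalDist, which should give the same value.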
Layman’s explanation of PCA
(I started writing a post related to principal components analysis, and tried to write a brief layman’s explanation of it at its start. But I wasn’t able to come up with something short that was still adequate for the purposes of understanding the post. So I expanded my layman’s explanation to a full post, and will write my originally intended post next.)
Principal components analysis (PCA) is a statistical method in which you re-express a set of random data points in terms of basic components that explain the most variance in the data. For the layman, I think it is easiest to understand with an example data set. Below is some basic World Bank 2009 data for the G20 countries (19 data points, since one of the G20 “countries” is the EU):
| Country | GDP per capita ($) | Life expectancy (years) | Forested land area (%) |
| Argentina | 7,665 | 75 | 10.7 |
| Australia | 42,131 | 82 | 19.4 |
| Brazil | 8,251 | 73 | 61.4 |
| Canada | 39,644 | 81 | 34.1 |
| China | 3,749 | 73 | 22.2 |
| France | 40,663 | 81 | 29.1 |
| Germany | 40,275 | 80 | 31.8 |
| India | 1,192 | 65 | 23.0 |
| Indonesia | 2,272 | 68 | 52.1 |
| Italy | 35,073 | 81 | 31.1 |
| Japan | 39,456 | 83 | 68.5 |
| Mexico | 7,852 | 76 | 33.3 |
| Russia | 8,615 | 69 | 49.4 |
| Saudi Arabia | 13,901 | 74 | 0.5 |
| South Africa | 5,733 | 52 | 4.7 |
| South Korea | 17,110 | 80 | 64.1 |
| Turkey | 8,554 | 73 | 14.7 |
| United Kingdom | 35,163 | 80 | 11.9 |
| United States | 45,758 | 78 | 33.2 |
Each data point (GDP per capita, life expectancy, forested land area) can be expressed in terms of a linear combination of vectors (1,0,0), (0,1,0) and (0,0,1), which I’ll refer to as components. For example, Argentina’s data can be represented as 7665 * (1,0,0) + 75 * (0,1,0) + 10.7 * (0,0,1). Using these components as our “basis” is very straightforward, since the coefficients simply correspond to the values of the data points.
However, it is an algebraic fact that we could have used any three linearly independent vectors as our components (“linearly independent” vectors cannot be expressed as a sum of multiples of each other). For example, if our vectors had been (1,1,0), (1,0,1), and (0,1,1), then we could also have represented Argentina as 3864.65 * (1,1,0) + 3800.35 * (1,0,1) – 3789.65 * (0,1,1). These coefficients are not especially intuitive, but the components do work; we could re-express all of the countries’ data points in terms of this basis instead.
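If you want to check the Argentina coefficients yourself, finding them amounts to solving three linear equations in three unknowns. Here is a minimal sketch using Cramer’s rule with exact fractions; the function names are mine, and any linear solver would do the same job.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def coefficients(basis, point):
    """Solve c1*b1 + c2*b2 + c3*b3 = point using Cramer's rule."""
    # The basis vectors become the columns of the matrix.
    matrix = [[basis[col][row] for col in range(3)] for row in range(3)]
    d = det3(matrix)
    if d == 0:
        raise ValueError("basis vectors are not linearly independent")
    coeffs = []
    for col in range(3):
        # Replace one column with the data point, then take the ratio
        # of determinants.
        replaced = [row[:] for row in matrix]
        for row in range(3):
            replaced[row][col] = point[row]
        coeffs.append(det3(replaced) / d)
    return coeffs

# Argentina: GDP per capita, life expectancy, forested land area.
argentina = [Fraction("7665"), Fraction("75"), Fraction("10.7")]
basis = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]

coeffs = coefficients(basis, argentina)
print([float(c) for c in coeffs])
```

Passing three vectors that are not linearly independent, such as (1,0,0), (0,1,0) and (1,1,0), raises an error because the determinant is zero and no unique set of coefficients exists.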
PCA provides us with a way of finding basis vectors that explain the largest amount of variance in the data. For example, as you might expect, GDP per capita and life expectancy are correlated. Therefore a basis vector like (10000,4,0) would be useful because variation in its coefficient would explain a lot of the variation in the overall data. PCA produces a set of component vectors where the first vector is the one that explains the most variance possible, the second vector explains the most variance after accounting for the variance explained by the first vector, and so on.
We often standardize each variable by its standard deviation first, to avoid overweighting numerically larger variables; for example, we wouldn’t want to give undue weight to GDP per capita over life expectancy just because GDP figures are in the thousands and life expectancy figures are all below 100. (Separately, the component vectors themselves are conventionally scaled so that their lengths are all equal to 1.) Running a standardized PCA on the data above in R (using the function prcomp()) yields the following three component vectors:
| component | PC1 | PC2 | PC3 |
| GDP per capita ($) | 0.6539131 | -0.35020818 | -0.6706355 |
| Life expectancy (years) | 0.6925541 | -0.07977085 | 0.7169418 |
| Forested land area (%) | 0.3045760 | 0.93326980 | -0.1903749 |
Variation in the coefficients of the first vector explains 60.3% of the variance of the data; when you add the second vector you can explain an additional 31.5%, and when you add the third you explain the remaining 8.2%. (Since as we discussed, the data can be fully re-expressed with three vectors, the variance should be fully explained by the time we include the third vector.)
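For readers without R, here is a rough standard-library Python sketch of the same calculation. It standardizes each column, forms the correlation matrix, and pulls out the components one at a time with power iteration and deflation. That method is my own stand-in chosen to avoid extra dependencies (prcomp() itself works from a singular value decomposition), and the signs of the resulting vectors are arbitrary, so a component may come out flipped relative to the table.

```python
import math

# (GDP per capita in $, life expectancy in years, forested land in %)
# for the 19 countries in the table above, in the same order.
DATA = [
    (7665, 75, 10.7), (42131, 82, 19.4), (8251, 73, 61.4), (39644, 81, 34.1),
    (3749, 73, 22.2), (40663, 81, 29.1), (40275, 80, 31.8), (1192, 65, 23.0),
    (2272, 68, 52.1), (35073, 81, 31.1), (39456, 83, 68.5), (7852, 76, 33.3),
    (8615, 69, 49.4), (13901, 74, 0.5), (5733, 52, 4.7), (17110, 80, 64.1),
    (8554, 73, 14.7), (35163, 80, 11.9), (45758, 78, 33.2),
]

def standardize_columns(rows):
    """Return each column with mean 0 and sample standard deviation 1."""
    columns = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        columns.append([(v - mean) / sd for v in col])
    return columns

def covariance_matrix(columns):
    """Covariance of standardized columns, i.e. the correlation matrix."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]

def top_eigenpair(m, iterations=500):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    size = len(m)
    v = [1.0, 0.5, 0.25]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(size)) for i in range(size)]
    return sum(a * b for a, b in zip(v, mv)), v

def pca(rows):
    m = covariance_matrix(standardize_columns(rows))
    total_variance = sum(m[i][i] for i in range(len(m)))
    components, shares = [], []
    for _ in range(len(m)):
        eigenvalue, v = top_eigenpair(m)
        components.append(v)
        shares.append(eigenvalue / total_variance)
        # Deflation: remove the variance explained by this component so the
        # next round finds the next-largest one.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    return components, shares

components, shares = pca(DATA)
for k, (vector, share) in enumerate(zip(components, shares), start=1):
    print(f"PC{k}: {[round(x, 4) for x in vector]}  share={share:.1%}")
```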
This analysis tells us that the most important explanatory axis is that of GDP per capita and life expectancy, although forested land area is also correlated with these two to a weaker extent. You can see this by the fact that the first principal component has positive numbers for all three but very similar numbers for GDP per capita and life expectancy. If we had to simplify our data down to one single number per country while losing the least amount of information, the coefficient of the first principal component would be it.
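To make the “one single number per country” idea concrete, the sketch below computes each country’s coefficient on the first principal component: standardize the three variables, then take the weighted sum using the PC1 column from the table above. The helper names are mine; the loadings are copied from the table.

```python
import math

# (country, GDP per capita in $, life expectancy in years, forested land in %)
ROWS = [
    ("Argentina", 7665, 75, 10.7), ("Australia", 42131, 82, 19.4),
    ("Brazil", 8251, 73, 61.4), ("Canada", 39644, 81, 34.1),
    ("China", 3749, 73, 22.2), ("France", 40663, 81, 29.1),
    ("Germany", 40275, 80, 31.8), ("India", 1192, 65, 23.0),
    ("Indonesia", 2272, 68, 52.1), ("Italy", 35073, 81, 31.1),
    ("Japan", 39456, 83, 68.5), ("Mexico", 7852, 76, 33.3),
    ("Russia", 8615, 69, 49.4), ("Saudi Arabia", 13901, 74, 0.5),
    ("South Africa", 5733, 52, 4.7), ("South Korea", 17110, 80, 64.1),
    ("Turkey", 8554, 73, 14.7), ("United Kingdom", 35163, 80, 11.9),
    ("United States", 45758, 78, 33.2),
]

# First principal component, copied from the table above.
PC1 = (0.6539131, 0.6925541, 0.3045760)

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]

names = [row[0] for row in ROWS]
columns = [standardize([row[k] for row in ROWS]) for k in (1, 2, 3)]

# Each score is the loading-weighted sum of a country's standardized values.
scores = [sum(w * col[i] for w, col in zip(PC1, columns))
          for i in range(len(ROWS))]

for name, score in sorted(zip(names, scores), key=lambda p: p[1], reverse=True):
    print(f"{name:15s} {score:+.2f}")

# The variance of these scores, as a share of the total variance (3, one per
# standardized variable), is the fraction explained by the first component.
score_variance = sum(s * s for s in scores) / (len(scores) - 1)
print(f"share explained: {score_variance / 3:.1%}")
```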
The second principal component tells us that the variation that remains after the first component can be best explained with variation in forested land area, with some negative weight given to GDP per capita. This is as we might expect; once variation along the GDP-life expectancy axis is accounted for, the remaining variation is mostly in forested land area. (I included it specifically to be poorly correlated with the other two.) The fact that GDP per capita has a negative value on the second component suggests that it is less correlated with forested land area than the first component alone would suggest. This is indeed true; forested land area in our data set has a 28% correlation with life expectancy but only an 8% correlation with GDP per capita.
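The two correlations quoted above can be recomputed directly from the table. This is a small sketch with a hand-written Pearson correlation function (the name is mine) so that it needs nothing beyond the standard library.

```python
import math

# Columns of the table above, in the same country order.
GDP = [7665, 42131, 8251, 39644, 3749, 40663, 40275, 1192, 2272, 35073,
       39456, 7852, 8615, 13901, 5733, 17110, 8554, 35163, 45758]
LIFE = [75, 82, 73, 81, 73, 81, 80, 65, 68, 81,
        83, 76, 69, 74, 52, 80, 73, 80, 78]
FOREST = [10.7, 19.4, 61.4, 34.1, 22.2, 29.1, 31.8, 23.0, 52.1, 31.1,
          68.5, 33.3, 49.4, 0.5, 4.7, 64.1, 14.7, 11.9, 33.2]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(f"forest vs life expectancy: {pearson(FOREST, LIFE):.2f}")
print(f"forest vs GDP per capita:  {pearson(FOREST, GDP):.2f}")
```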
The third component shows that the remaining variance is mostly how life expectancy and GDP per capita differ beyond that which is predicted by variation in the first two components. Keep in mind, though, that by the time we’re here we have already explained 91.8% of the data variance; it is less valuable to read into the meaning of the least significant principal components.