Traveling Lands Beyond

"Beyond what?" thought Milo as he continued to read.

Layman’s explanation of PCA

(I started writing a post related to principal components analysis, and tried to write a brief layman’s explanation of it at its start. But I wasn’t able to come up with something short that was still adequate for the purposes of understanding the post. So I expanded my layman’s explanation to a full post, and will write my originally intended post next.)

Principal components analysis (PCA) is a statistical method in which you re-express a set of random data points in terms of basic components that explain the most variance in the data. For the layman, I think it is easiest to understand with an example data set. Below is some basic World Bank 2009 data for the G20 countries (19 data points, since one of the G20 “countries” is the EU):

Country           GDP per capita ($)   Life expectancy (years)   Forested land area (%)
Argentina                      7,665                        75                     10.7
Australia                     42,131                        82                     19.4
Brazil                         8,251                        73                     61.4
Canada                        39,644                        81                     34.1
China                          3,749                        73                     22.2
France                        40,663                        81                     29.1
Germany                       40,275                        80                     31.8
India                          1,192                        65                     23.0
Indonesia                      2,272                        68                     52.1
Italy                         35,073                        81                     31.1
Japan                         39,456                        83                     68.5
Mexico                         7,852                        76                     33.3
Russia                         8,615                        69                     49.4
Saudi Arabia                  13,901                        74                      0.5
South Africa                   5,733                        52                      4.7
South Korea                   17,110                        80                     64.1
Turkey                         8,554                        73                     14.7
United Kingdom                35,163                        80                     11.9
United States                 45,758                        78                     33.2

Each data point (GDP per capita, life expectancy, forested land area) can be expressed as a linear combination of the vectors (1,0,0), (0,1,0) and (0,0,1), which I’ll refer to as components. For example, Argentina’s data can be represented as 7665 * (1,0,0) + 75 * (0,1,0) + 10.7 * (0,0,1). Using these components as our “basis” is very straightforward, since the coefficients are simply the values of the data point itself.

However, it is an algebraic fact that we could have used any three linearly independent vectors as our components (vectors are “linearly independent” when none of them can be expressed as a sum of multiples of the others). For example, if our vectors had been (1,1,0), (1,0,1), and (0,1,1), then we could also have represented Argentina as 3864.65 * (1,1,0) + 3800.35 * (1,0,1) – 3789.65 * (0,1,1). These coefficients are not especially intuitive, but the components do work; we could re-express all of the countries’ data points in terms of this basis instead.
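To make the change of basis concrete, here is a minimal sketch in Python with NumPy (my choice for illustration; the variable names are made up). Finding Argentina's coefficients in the alternative basis amounts to solving a small linear system:

```python
import numpy as np

# Argentina's data point: GDP per capita, life expectancy, forested land area.
argentina = np.array([7665.0, 75.0, 10.7])

# The alternative basis from the text. Each basis vector is written as a row
# here, then transposed so that the vectors become the COLUMNS of the matrix.
basis = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]).T

# Solve basis @ coeffs = argentina for the three coefficients.
coeffs = np.linalg.solve(basis, argentina)

# Rebuilding the data point from the coefficients recovers the original values.
reconstructed = basis @ coeffs

print(coeffs)         # approximately [3864.65, 3800.35, -3789.65]
print(reconstructed)  # approximately [7665.0, 75.0, 10.7]
```

The solve only succeeds because the three vectors are linearly independent; with dependent vectors the matrix would be singular and there would be no unique set of coefficients.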

PCA provides us with a way of finding basis vectors that explain the largest amount of variance in the data. For example, as you might expect, GDP per capita and life expectancy are correlated. Therefore a basis vector like (10000,4,0) would be useful because variation in its coefficient would explain a lot of the variation in the overall data. PCA produces a set of component vectors where the first vector is the one that explains the most variance possible, the second vector explains the most variance after accounting for the variance explained by the first vector, and so on.

We often standardize the data first, centering each variable and dividing it by its standard deviation, to avoid overweighting numerically larger variables; for example, we wouldn’t want to give undue weight to GDP per capita over life expectancy just because GDP figures are in the thousands and life expectancy figures are all below 100. (By convention, the component vectors that PCA returns each have length 1.) Running a standardized PCA on the data above in R (using the function prcomp()) yields the following three component vectors:

Variable                          PC1           PC2           PC3
GDP per capita ($)          0.6539131   -0.35020818    -0.6706355
Life expectancy (years)     0.6925541   -0.07977085     0.7169418
Forested land area (%)      0.3045760    0.93326980    -0.1903749

Variation in the coefficients of the first vector explains 60.3% of the variance of the data; when you add the second vector you can explain an additional 31.5%, and when you add the third you explain the remaining 8.2%. (Since, as we discussed, the data can be fully re-expressed with three vectors, the variance is fully explained once we include the third vector.)
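For readers who would rather follow along in Python than R, here is a rough sketch of the same standardized PCA using NumPy. This is not the code behind the numbers above (those came from prcomp()). It takes the eigenvectors of the correlation matrix, whereas prcomp() works through a singular value decomposition, so the results should agree only up to rounding and up to the sign of each component, which is arbitrary. All variable names are my own.

```python
import numpy as np

# Columns of the table above, in the same country order.
gdp = [7665, 42131, 8251, 39644, 3749, 40663, 40275, 1192, 2272, 35073,
       39456, 7852, 8615, 13901, 5733, 17110, 8554, 35163, 45758]
life = [75, 82, 73, 81, 73, 81, 80, 65, 68, 81,
        83, 76, 69, 74, 52, 80, 73, 80, 78]
forest = [10.7, 19.4, 61.4, 34.1, 22.2, 29.1, 31.8, 23.0, 52.1, 31.1,
          68.5, 33.3, 49.4, 0.5, 4.7, 64.1, 14.7, 11.9, 33.2]

X = np.column_stack([gdp, life, forest]).astype(float)

# Standardize: center each column and divide by its sample standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of standardized data is the correlation matrix.
corr = np.cov(Z, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to put the
# component with the largest variance first.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]  # columns are PC1, PC2, PC3

# Share of total variance explained by each component.
explained = eigenvalues / eigenvalues.sum()

print(np.round(components, 4))
print(np.round(explained, 3))
print(np.round(np.cumsum(explained), 3))
```

The cumulative sum on the last line is the running total of variance explained as each component is added.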

This analysis tells us that the most important explanatory axis is that of GDP per capita and life expectancy, although forested land area is also correlated with these two to a weaker extent. You can see this by the fact that the first principal component has positive numbers for all three but very similar numbers for GDP per capita and life expectancy. If we had to simplify our data down to one single number per country while losing the least amount of information, the coefficient of the first principal component would be it.
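As an illustration of that "one single number per country", the sketch below (Python with NumPy, names of my own choosing) standardizes the data and multiplies it by the first component vector from the table above. I am assuming the sign convention shown there, under which a higher score means richer, longer-lived and more forested.

```python
import numpy as np

countries = ["Argentina", "Australia", "Brazil", "Canada", "China", "France",
             "Germany", "India", "Indonesia", "Italy", "Japan", "Mexico",
             "Russia", "Saudi Arabia", "South Africa", "South Korea", "Turkey",
             "United Kingdom", "United States"]
gdp = [7665, 42131, 8251, 39644, 3749, 40663, 40275, 1192, 2272, 35073,
       39456, 7852, 8615, 13901, 5733, 17110, 8554, 35163, 45758]
life = [75, 82, 73, 81, 73, 81, 80, 65, 68, 81,
        83, 76, 69, 74, 52, 80, 73, 80, 78]
forest = [10.7, 19.4, 61.4, 34.1, 22.2, 29.1, 31.8, 23.0, 52.1, 31.1,
          68.5, 33.3, 49.4, 0.5, 4.7, 64.1, 14.7, 11.9, 33.2]

X = np.column_stack([gdp, life, forest]).astype(float)

# Standardize each column (center, then divide by the sample standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# First principal component, copied from the prcomp() output in the post.
pc1 = np.array([0.6539131, 0.6925541, 0.3045760])

# Each country's coefficient on PC1: one number summarizing all three variables.
scores = Z @ pc1

# List countries from highest to lowest PC1 score.
ranking = sorted(zip(countries, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:15s} {score:6.2f}")
```

Because the data were centered, the scores average out to zero; a country near zero is roughly typical of the group along this axis.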

The second principal component tells us that the variation that remains after the first component can be best explained with variation in forested land area, with some negative weight given to GDP per capita. This is as we might expect; once variation along the GDP-life expectancy axis is accounted for, the remaining variation is mostly in forested land area. (I included it specifically to be poorly correlated with the other two.) The fact that GDP per capita has a negative value on the second component suggests that it is less correlated with forested land area than the first component alone would suggest. This is indeed true; forested land area in our data set has a correlation of 0.28 with life expectancy but only 0.08 with GDP per capita.
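The two correlations quoted above can be checked directly from the table. A minimal sketch in Python with NumPy (variable names are mine):

```python
import numpy as np

gdp = [7665, 42131, 8251, 39644, 3749, 40663, 40275, 1192, 2272, 35073,
       39456, 7852, 8615, 13901, 5733, 17110, 8554, 35163, 45758]
life = [75, 82, 73, 81, 73, 81, 80, 65, 68, 81,
        83, 76, 69, 74, 52, 80, 73, 80, 78]
forest = [10.7, 19.4, 61.4, 34.1, 22.2, 29.1, 31.8, 23.0, 52.1, 31.1,
          68.5, 33.3, 49.4, 0.5, 4.7, 64.1, 14.7, 11.9, 33.2]

# np.corrcoef treats each row as one variable and returns the 3x3 matrix of
# pairwise Pearson correlations (order: gdp, life, forest).
corr = np.corrcoef([gdp, life, forest])

forest_vs_life = corr[2, 1]
forest_vs_gdp = corr[2, 0]

print(f"forest vs life expectancy: {forest_vs_life:.2f}")
print(f"forest vs GDP per capita:  {forest_vs_gdp:.2f}")
```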

The third component shows that the remaining variance is mostly how life expectancy and GDP per capita differ beyond that which is predicted by variation in the first two components. Keep in mind, though, that by the time we’re here we have already explained 91.8% of the data variance; it is less valuable to read into the meaning of the least significant principal components.

Written by Andy

Mon 6 Feb 2012 at 2:00 am
