<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Traveling Lands Beyond</title>
	<atom:link href="http://travelinglandsbeyond.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://travelinglandsbeyond.com</link>
	<description>&#34;Beyond what?&#34; thought Milo as he continued to read.</description>
	<lastBuildDate>Wed, 05 Dec 2012 05:12:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='travelinglandsbeyond.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Traveling Lands Beyond</title>
		<link>http://travelinglandsbeyond.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://travelinglandsbeyond.com/osd.xml" title="Traveling Lands Beyond" />
	<atom:link rel='hub' href='http://travelinglandsbeyond.com/?pushpress=hub'/>
		<item>
		<title>Haversines</title>
		<link>http://travelinglandsbeyond.com/2012/11/20/haversines/</link>
		<comments>http://travelinglandsbeyond.com/2012/11/20/haversines/#comments</comments>
		<pubDate>Tue, 20 Nov 2012 15:44:34 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[cartography]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[navigation]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=454</guid>
		<description><![CDATA[Let’s suppose you want to calculate the distance between two points. You don’t have the ability to measure the distance directly, but you can measure the distance between where you’re standing and each of the points, and you can also measure the angle formed by straight lines out to those two points from where you’re [...]]]></description>
				<content:encoded><![CDATA[<p>Let’s suppose you want to calculate the distance between two points. You don’t have the ability to measure the distance directly, but you can measure the distance between where you’re standing and each of the points, and you can also measure the angle formed by straight lines out to those two points from where you’re standing. Can you calculate the distance between those points?</p>
<p>If you assume that all points and distances are calculated along a flat plane, you can determine the distance by applying the law of cosines to the triangle formed by you and those two points. The law of cosines states that</p>
<p><img src='http://s0.wp.com/latex.php?latex=c%5E2+%3D+a%5E2+%2B+b%5E2+-+2ab+%5Ccos+C+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='c^2 = a^2 + b^2 - 2ab &#92;cos C ' title='c^2 = a^2 + b^2 - 2ab &#92;cos C ' class='latex' /></p>
<p>where <em>a</em>, <em>b</em>, and <em>c</em> are the sides of a triangle, and <em>C</em> is the angle opposite side <em>c</em>. In this case, you are solving for <em>c</em>, and you have <em>a</em>, <em>b</em>, and <em>C</em>.</p>
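<p>As a minimal sketch of that calculation in Python (the function name <code>planar_distance</code> and the choice to take the angle in degrees are assumptions made for illustration):</p>

```python
import math

def planar_distance(a, b, angle_c_degrees):
    """Length of side c of a flat triangle, given sides a and b
    and the angle C between them (in degrees)."""
    angle_c = math.radians(angle_c_degrees)
    # Law of cosines: c^2 = a^2 + b^2 - 2ab cos C
    c_squared = a * a + b * b - 2.0 * a * b * math.cos(angle_c)
    return math.sqrt(c_squared)

# Two points 3 and 4 units away from you, separated by a right angle:
# the result should be very close to 5, the familiar 3-4-5 triangle.
print(planar_distance(3.0, 4.0, 90.0))
```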
<p>Of course, the Earth is not a flat plane. If you are interested in distances calculated along the surface of the Earth, the law of cosines will give you some error across large distances. Fortunately, there is another mathematical formula you can use, the <a href="http://en.wikipedia.org/wiki/Spherical_law_of_cosines">spherical law of cosines</a>, which states that:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos+c+%3D+%5Ccos+a+%5Ccos+b+%2B+%5Csin+a+%5Csin+b+%5Ccos+C+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='&#92;cos c = &#92;cos a &#92;cos b + &#92;sin a &#92;sin b &#92;cos C ' title='&#92;cos c = &#92;cos a &#92;cos b + &#92;sin a &#92;sin b &#92;cos C ' class='latex' /></p>
<p>where <em>a</em>, <em>b</em>, and <em>c</em> are the sides of a “triangle” whose sides are actually curves connecting points on the surface of a sphere, and <em>C</em> is the angle opposite side <em>c</em>. This law holds true for a unit sphere, i.e. a sphere with radius = 1, so when calculating real-life distances along Earth’s surface, divide all your distances by the radius of the Earth <em>r</em> before applying it, and multiply the resulting <em>c</em> by <em>r</em> to get back a distance. (The radius of the Earth is approximately 6371 km; this in fact varies according to location as the Earth is not a perfect sphere, and so you’ll have some error in your calculations.) <a href="http://www.amatyc.org/Events/conferences/2011Austin/proceedings/hutchinsonT1A.pdf">This document</a> is a really well-explained derivation of the spherical law of cosines, and if you want to brush up on your vector math skills I suggest you step through the proof yourself.</p>
<p>If you imagine that you’re standing at the North Pole, you can get a generic formula for the distance between two points given their latitudes and longitudes. The sides <em>a</em> and <em>b</em> equal 90° &#8211; their respective latitudes, and the angle <em>C</em> equals the difference between their longitudes. (Set all the latitudes and longitudes in terms of degrees north and degrees east, respectively – so 1° south = -1° north, and 1° west = -1° east.) The law of cosines then becomes:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos+%28d%2Fr%29+%3D+%5Csin+%5Cphi_1+%5Csin+%5Cphi_2+%2B+%5Ccos+%5Cphi_1+%5Ccos+%5Cphi_2+%5Ccos+%28%5Clambda_2+-+%5Clambda_1%29&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='&#92;cos (d/r) = &#92;sin &#92;phi_1 &#92;sin &#92;phi_2 + &#92;cos &#92;phi_1 &#92;cos &#92;phi_2 &#92;cos (&#92;lambda_2 - &#92;lambda_1)' title='&#92;cos (d/r) = &#92;sin &#92;phi_1 &#92;sin &#92;phi_2 + &#92;cos &#92;phi_1 &#92;cos &#92;phi_2 &#92;cos (&#92;lambda_2 - &#92;lambda_1)' class='latex' /></p>
<p>where <em>d</em> is the distance between the points, <em>r</em> is the radius of the Earth, and the φ and λ represent latitude and longitude, respectively. (To get this formula, substitute values for <em>a</em>, <em>b</em>, and <em>C</em>, and then use the fact that cos (90° &#8211; x) = sin x and sin(90° &#8211; x) = cos x.)</p>
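<p>A rough Python sketch of this formula follows. The function name <code>cosine_distance</code>, the degree-based arguments, and the clamping step are assumptions added for illustration; the 6371 km radius is the approximate figure quoted earlier.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate; the Earth is not a perfect sphere

def cosine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Distance along the sphere via the spherical law of cosines.
    Latitudes are degrees north, longitudes degrees east."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lambda = math.radians(lon2 - lon1)
    # cos(d/r) = sin(phi1) sin(phi2) + cos(phi1) cos(phi2) cos(lambda2 - lambda1)
    cos_central = (math.sin(phi1) * math.sin(phi2)
                   + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lambda))
    # Rounding can push the value a hair outside [-1, 1]; clamp before acos.
    cos_central = max(-1.0, min(1.0, cos_central))
    return radius * math.acos(cos_central)

# A quarter of the way around the equator.
print(cosine_distance(0.0, 0.0, 0.0, 90.0))
```

<p>The quarter-equator call should come out close to a quarter of the circumference, i.e. about 10,000 km.</p>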
<p>What I found most interesting is that people often prefer to use a transformed version of the spherical law of cosines, called the <a href="http://en.wikipedia.org/wiki/Law_of_haversines">law of haversines</a>. The haversine is a simple trigonometric function:</p>
<p><img src='http://s0.wp.com/latex.php?latex=h%28x%29+%3D+%5Csin%5E2+%28x%2F2%29+%3D+%281+-+%5Ccos+x%29%2F2+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='h(x) = &#92;sin^2 (x/2) = (1 - &#92;cos x)/2 ' title='h(x) = &#92;sin^2 (x/2) = (1 - &#92;cos x)/2 ' class='latex' /></p>
<p>If you substitute this into the law of cosines and apply some trig identities, you get the law of haversines:</p>
<p><img src='http://s0.wp.com/latex.php?latex=h%28c%29+%3D+h%28a+-+b%29+%2B+%5Csin+a+%5Csin+b+%5Ccdot+h%28C%29+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='h(c) = h(a - b) + &#92;sin a &#92;sin b &#92;cdot h(C) ' title='h(c) = h(a - b) + &#92;sin a &#92;sin b &#92;cdot h(C) ' class='latex' /></p>
<p>And if you again use the approach of standing at the North Pole, you get a practical definition of how to calculate the spherical distance between two points on Earth given their latitudes and longitudes:</p>
<p><img src='http://s0.wp.com/latex.php?latex=h%28d%2Fr%29+%3D+h%28%5Cphi_2+-+%5Cphi_1%29+%2B+%5Ccos+%28%5Cphi_1%29+%5Ccos+%28%5Cphi_2%29+h%28%5Clambda_2+-+%5Clambda_1%29&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='h(d/r) = h(&#92;phi_2 - &#92;phi_1) + &#92;cos (&#92;phi_1) &#92;cos (&#92;phi_2) h(&#92;lambda_2 - &#92;lambda_1)' title='h(d/r) = h(&#92;phi_2 - &#92;phi_1) + &#92;cos (&#92;phi_1) &#92;cos (&#92;phi_2) h(&#92;lambda_2 - &#92;lambda_1)' class='latex' /></p>
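<p>Here is a small Python sketch of that formula. The names <code>haversine</code> and <code>haversine_distance</code> are chosen for illustration, and the inverse haversine is written as 2·asin(√h), which follows from h(x) = sin²(x/2).</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate; the Earth is not a perfect sphere

def haversine(x):
    """h(x) = sin^2(x / 2), with x in radians."""
    return math.sin(x / 2.0) ** 2

def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Distance along the sphere via the law of haversines.
    Latitudes are degrees north, longitudes degrees east."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lambda = math.radians(lon2 - lon1)
    # h(d/r) = h(phi2 - phi1) + cos(phi1) cos(phi2) h(lambda2 - lambda1)
    h = (haversine(phi2 - phi1)
         + math.cos(phi1) * math.cos(phi2) * haversine(delta_lambda))
    # Guard against rounding pushing h slightly above 1 for antipodal points.
    h = min(1.0, h)
    # Invert the haversine: d/r = 2 asin(sqrt(h))
    return 2.0 * radius * math.asin(math.sqrt(h))

# A quarter of the way around the equator.
print(haversine_distance(0.0, 0.0, 0.0, 90.0))
```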
<p>Why would you use haversines rather than the spherical law of cosines? They are mathematically equivalent, and the haversine law doesn’t look any simpler than the cosine law. There are a few helpful resources on the subject &#8211; the <a href="http://en.wikipedia.org/wiki/Law_of_haversines">Wikipedia article on the law of haversines</a>, <a href="http://gis.stackexchange.com/questions/4906/why-is-law-of-cosines-more-preferable-than-haversine-when-calculating-distance-b">this Stack Exchange thread</a> and <a href="http://stackoverflow.com/questions/2096385/formulas-to-calculate-geo-proximity">this Stack Overflow thread</a> &#8211; but the one I found that explained it the most clearly was this article from 1984 called <a href="http://daimi.au.dk/~dam/thesis/Sky_and_Telescope_1984.pdf">&#8220;Virtues of the Haversine,&#8221;</a> by one R. W. Sinnott (complete with an old-school BASIC implementation of the haversine navigational formula!).</p>
<p>The way floating point numbers are represented in computers is with a certain number of digits of significance, plus an exponent, in the manner of scientific notation. For instance (to borrow an example from this <a href="http://en.wikipedia.org/wiki/Single_precision_floating-point_format">Wikipedia article on the IEEE binary32 format</a>), a number like 0.375 will in fact be represented as 375 for its significant digits and −3 for its exponent in a decimal format &#8211; or, in binary format, 11 as its significant digits and −11 (that is, −3 written in binary) as its exponent. The decimal point is “floated,” allowing representation of very small or very large numbers to as much accuracy as significant digits will allow.</p>
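<p>You can poke at both views of 0.375 from Python’s standard library. This is only an illustration of the significand-plus-exponent idea; actual IEEE formats normalize the significand and store the exponent with a bias, which the sketch does not show.</p>

```python
from decimal import Decimal

# Decimal view: 0.375 = 375 x 10^-3
sign, digits, exponent = Decimal("0.375").as_tuple()
print(digits, exponent)          # (3, 7, 5) -3

# Binary view: 0.375 = 3 / 8 = 0b11 x 2^-3
numerator, denominator = (0.375).as_integer_ratio()
binary_exponent = -(denominator.bit_length() - 1)
print(bin(numerator), binary_exponent)   # 0b11 -3
```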
<p>Over short distances, the spherical law of cosines takes the form cos <em>c</em> = something close to 1. This is necessarily true if the formula is to be accurate, because the only way for <em>c</em> to be very small is for the right-hand side of the equation to be very close to 1. On the other hand, the law of haversines over short distances yields something of the form h(<em>c</em>) = something close to 0. For example, over distances of 1 km (using 6371km as the Earth’s radius) and an angle of 45°, the law of cosines requires you to apply arccos to 0.9999999981237065, while the law of haversines requires you to apply an inverse haversine function to 9.381466938565206e-10.</p>
<p>Computers using floating-point architecture will have more difficulty with the cosine formula than with the haversine formula, because a number close to 1 requires many digits of precision that are in some sense unimportant to the problem. Continuing the example, let’s suppose we have 16 decimal digits of significance available in our architecture. With the law of cosines, we have to store 8 nines before we get to the part of the number that really “matters,” leaving us with only half the precision we might otherwise have. In contrast, with the law of haversines we have all 16 digits’ worth of precision available. If we were operating on a system with only 8 digits of precision, the value we pass into arccos might get rounded to 1, yielding an obviously incorrect distance of 0, whereas we would still have accuracy with haversines.</p>
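<p>To see the effect without hunting down an 8-digit machine, here is a rough Python sketch that rounds the intermediate value to single precision (about 7 decimal digits) before inverting it. The setup is assumed for illustration: two points each 1 km from the observer and 45° apart. Only the final intermediate is rounded, not every arithmetic step, so this understates what a true single-precision computation would do.</p>

```python
import math
import struct

def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

EARTH_RADIUS_KM = 6371.0
a = b = 1.0 / EARTH_RADIUS_KM   # 1 km expressed as an angle in radians
C = math.radians(45.0)

# Right-hand sides of the two laws, computed in double precision.
cos_c = math.cos(a) * math.cos(b) + math.sin(a) * math.sin(b) * math.cos(C)
hav_c = (math.sin((a - b) / 2.0) ** 2
         + math.sin(a) * math.sin(b) * math.sin(C / 2.0) ** 2)

# Squeeze each through single precision, then invert.
d_cos_single = EARTH_RADIUS_KM * math.acos(to_single(cos_c))
d_hav_single = 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(to_single(hav_c)))

print(d_cos_single, d_hav_single)
```

<p>With these inputs the cosine route collapses to 0 km, because the value handed to arccos rounds to exactly 1, while the haversine route still returns roughly 0.77 km.</p>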
<p>Modern computers can store enough digits of significance that for many practical purposes, this isn’t an issue. (In Python, using a 45° angle, I had to shrink <i>a</i> and <i>b</i> to sub-meter distances before the cosine method produced 0.) But as Sinnott points out in his article, increasing the digits merely postpones the problem to a smaller level. And it remains an interesting and relevant illustration of the sometimes-rough stitching between abstract mathematics and computational mathematics.</p>
]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/11/20/haversines/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>Big Bird and federal spending revisited</title>
		<link>http://travelinglandsbeyond.com/2012/10/09/big-bird-and-federal-spending-revisited/</link>
		<comments>http://travelinglandsbeyond.com/2012/10/09/big-bird-and-federal-spending-revisited/#comments</comments>
		<pubDate>Tue, 09 Oct 2012 16:49:27 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[deficit]]></category>
		<category><![CDATA[government]]></category>
		<category><![CDATA[public broadcasting]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=450</guid>
		<description><![CDATA[A while ago I wrote about the importance of broadly familiarizing yourself with the federal budget before forming any opinions about government spending. The subject was brought up several times during the last presidential debate. When watching one of these debates, in which the candidates are specifically interested in scoring points over each other even [...]]]></description>
				<content:encoded><![CDATA[<p>A while ago <a href="http://travelinglandsbeyond.com/2012/01/02/reading-the-federal-budget/">I wrote</a> about the importance of broadly familiarizing yourself with the federal budget before forming any opinions about government spending. The subject was brought up several times during the last presidential debate. When watching one of these debates, in which the candidates are specifically interested in scoring points over each other even if it comes at the expense of sound reasoning, it is particularly important to keep budget figures in front of you when evaluating claims. It is also important to standardize all numbers into one unit, as un-standardized statistics can be very psychologically <a href="http://xkcd.com/558/">misleading</a>. I’ll use billions.</p>
<p>One memorable sound bite involved Romney labeling PBS as a potential spending cut (I’m sure if he could do it again he would have phrased it differently, as the audience has generally remembered it as a threat to fire Big Bird). PBS, NPR, and other public broadcasting entities are <a href="http://www.cpb.org/funding/">funded in part</a> by the Corporation for Public Broadcasting (private donors also contribute). Its <a href="http://www.cpb.org/aboutcpb/leadership/board/resolutions/FY_2013_Operating_Budget.pdf">federal appropriation in 2012</a> was $0.444 billion. <a href="http://www.gpo.gov/fdsys/pkg/BUDGET-2013-BUD/pdf/BUDGET-2013-BUD-29.pdf">Total enacted federal spending in 2012</a> (see &#8220;Outlays&#8221; under table S-1) was approximately $3,796 billion. So defunding the CPB, which would include not just PBS’s funding but all federal public broadcasting subsidies, would cut about 0.01% of federal spending. I was talking with a friend after the debates and compared it with trying to lose weight by getting a haircut, which in terms of actual proportion of one’s hair weight to one’s overall weight is probably reasonably accurate.</p>
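<p>The arithmetic is a one-liner if you want to check it yourself (the variable names are just for illustration; the figures are the ones quoted above, both in billions of dollars):</p>

```python
cpb_appropriation_billion = 0.444   # CPB federal appropriation, 2012
total_outlays_billion = 3796.0      # total federal outlays, 2012

share = cpb_appropriation_billion / total_outlays_billion
print(f"{share:.4%}")   # 0.0117%
```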
<p>I think Romney’s “borrowing from China” test is a reasonable one, but at the same time anyone who wants to make government spending more manageable has to start off by looking at where we actually spend our money. And that means you’re basically looking at two things: defense and social safety nets (Social Security, Medicare, and Medicaid). Everything else is a sideshow in comparison. </p>
]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/10/09/big-bird-and-federal-spending-revisited/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>Baseball to cricket conversion guide</title>
		<link>http://travelinglandsbeyond.com/2012/09/19/baseball-to-cricket-conversion-guide/</link>
		<comments>http://travelinglandsbeyond.com/2012/09/19/baseball-to-cricket-conversion-guide/#comments</comments>
		<pubDate>Wed, 19 Sep 2012 20:01:42 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[baseball]]></category>
		<category><![CDATA[cricket]]></category>
		<category><![CDATA[sports]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=445</guid>
		<description><![CDATA[Cricket’s Twenty20 World Cup has just started. I certainly didn’t follow cricket growing up but a few years ago a co-worker who grew up in cricket-mad India taught me the rules and got me interested in the sport. I still only know the basics and I only really tune in at major events such as [...]]]></description>
				<content:encoded><![CDATA[<p>Cricket’s <a href="http://www.espncricinfo.com/icc-world-twenty20-2012/content/series/531597.html">Twenty20 World Cup</a> has just started. I certainly didn’t follow cricket growing up but a few years ago a co-worker who grew up in cricket-mad India taught me the rules and got me interested in the sport. I still only know the basics and I only really tune in at major events such as this one, but I think it’s actually a sport that American baseball fans can really learn to enjoy. (Conversely, I think cricket fans could learn to enjoy baseball as well.)</p>
<p>The Twenty20 format is the shortest format of cricket, with matches lasting a few hours rather than all day or all week. I understand that it represents cricket’s attempts to reach a broader, global audience and to make matches easier to televise. Below, I wrote a twenty-step process to transform a game of baseball into a game of cricket. Of course there are simplifications, and there are no discussions of strategy (perhaps you can deduce them from the rule changes?), but this offers a little bridge to cricket for the typical American, who is already familiar with the rules of baseball.</p>
<ol>
<li>Eleven players to a side.</li>
<li>The cricket bat is flatter, like a long paddle. The cricket ball has a single seam going in a circle around its middle.</li>
<li>Make the field oval-shaped.</li>
<li>Eliminate two bases and move the remaining two bases to the center of the oval. Replace the physical bases with lines behind which runners/batsmen are safe. The areas behind the lines are the “creases.”</li>
<li>Eliminate foul territory.</li>
<li>Keep runners on both bases at all times. As there are only two bases, the batting team scores a run every time they switch places.</li>
<li>Eliminate the pitcher’s mound; the pitcher (“bowler”) throws from behind one of the bases.</li>
<li>Give the bowler a running start, but also require him to throw with his arm unbent.</li>
<li>Give the batsman more freedom to move around, rather than remaining constrained in a batter’s box. However, if they run too far forwards they do risk being put out as they will be out of their crease.</li>
<li>Instead of an umpire calling three strikes and you’re out, put wickets in the ground at both bases (“ends”). When these are struck by the ball (even if the ball hits the bat first), the batsman is out.</li>
<li>If the ball hits you and the umpire judges that it would have otherwise hit the wicket, you’re out (“leg before wicket,” or “lbw”).</li>
<li>If a bowled ball is not batted and passes reasonably close to the wickets, play simply continues (the batsmen can try to score runs, called “byes”).</li>
<li>If a bowled ball is too far away to be reasonably batted, the umpire calls it a “wide,” the batting team gets a free run, and the bowler redoes the delivery (it does not count towards the over; see points 18 and 20 below).</li>
<li>A ball that strikes the batsman but is not lbw is treated like a batted ball; runs scored then are “leg byes.”</li>
<li>A batted ball that leaves the grounds on one or more bounces (like a ground-rule double) is called a “four” and is worth four runs. A ball that leaves the grounds in the air (like a home run) is called a “six” and is worth six runs. A bowled ball that leaves the grounds as a “wide” is worth five runs and the bowler redoes the delivery.  There are no fences on the boundaries.</li>
<li>Instead of fielders tagging or forcing out runners, the fielders get runners out by striking the wickets at each end with the ball before the runners can reach safety.</li>
<li>Ten outs (“wickets”) end the innings. Note that since there are eleven players to a side, once you’re out, you will not return for the remainder of the innings.</li>
<li>Every six bowled balls is an “over,” and for every over the fielding team can bring in a new bowler. A bowler who is relieved can bowl again later.</li>
<li>When a bowler is substituted, he switches places with a fielder. The bench only has injury substitutes.</li>
<li>In Twenty20 the number of overs is limited to 20, and each team bats only once. This differs slightly under the one-day international format (50 overs) and substantially in the traditional Test format (no overs limit, but teams can declare their innings over) but I won’t get into that here.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/09/19/baseball-to-cricket-conversion-guide/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>Improve the insurance card</title>
		<link>http://travelinglandsbeyond.com/2012/09/07/improve-the-insurance-card/</link>
		<comments>http://travelinglandsbeyond.com/2012/09/07/improve-the-insurance-card/#comments</comments>
		<pubDate>Fri, 07 Sep 2012 23:23:52 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data presentation]]></category>
		<category><![CDATA[healthcare]]></category>
		<category><![CDATA[insurance]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=441</guid>
		<description><![CDATA[I recently spent a lot of money on healthcare-related expenses. I had a bad ankle sprain about 6 weeks ago that necessitated a couple of trips to an orthopedist and some physical therapy, and I also had an eye checkup and bought a new pair of glasses because my current ones aren’t very good (they [...]]]></description>
				<content:encoded><![CDATA[<p>I recently spent a lot of money on healthcare-related expenses. I had a bad ankle sprain about 6 weeks ago that necessitated a couple of trips to an orthopedist and some physical therapy, and I also had an eye checkup and bought a new pair of glasses because my current ones aren’t very good (they get bent out of shape easily and even in shape don’t fit that well on my face and slide off a lot). Paying for healthcare is, in my opinion, a very confusing and opaque process. I think some very un-burdensome laws mandating insurers to provide clearer information could go a long way towards improving the industry.</p>
<p>If you’ve read <a href="http://www.amazon.com/Nudge-Improving-Decisions-Health-Happiness/dp/014311526X"><em>Nudge</em></a>, by Richard Thaler and Cass Sunstein, this concept should be familiar. For those who haven’t read <em>Nudge</em>, one of its recurrent themes is that people should be free to make their own economic choices, but that policymakers should do what they can to encourage people to make good decisions. One way in which they can do this is to require clear presentation of information: prices, pros and cons, simple recommendations, and so on. This approach emphasizes that the government’s role is never to tell people what they have to do but to encourage them to do the right thing, and it is a philosophy with which I broadly agree (see “<a href="http://en.wikipedia.org/wiki/Libertarian_paternalism">Libertarian paternalism</a>”).</p>
<p>Here’s the information on my medical insurance card (I have Aetna):</p>
<ul>
<li>The name of my insurer, the name of my insurance plan, and the employer through which I have obtained this plan</li>
<li>My name, ID number, and insurance group number</li>
<li>Some phone numbers, the insurer’s website address, and the insurer’s physical address</li>
<li>Some cryptic figures: “PCP 20%” and “SPC 20%”</li>
<li>Some fine print on the back</li>
</ul>
<p>The point of carrying around a card like this is to have a quick reference for anything you might need during a medical procedure so that you’re as well-informed as possible at the clinic, hospital, or wherever your point of action is. It’s great to have a nicely organized website you can look things up on at home, but you should still provide customers something that they can check on the fly. And it’s absurd to waste the real estate on the card with fine print. We should all be able to accept that an insurance card does not contain the full, definitive details of the insurance plan; a simple reference to an online document with the definitive details should do fine.</p>
<p>In its place, insurers ought to include more information about the plan itself. My plan is described as “BASIC – DED. $2,100/$4,200.” Does that mean my annual deductible is $2,100 or $4,200? Answer from looking on the website: it’s $2,100; the $4,200 refers to family coverage. What do those “PCP 20%” and “SPC 20%” numbers mean? Answer: I think this refers to the fact that in general I pay 20% for both primary care physicians (PCPs) and specialists (SPCs) after my annual deductible is met. Why not add a few extra words and spell this out more clearly on the card? There is blank space that could be used.</p>
<p>That’s it, by the way, for information about the plan. Left unstated on the card is the out of pocket maximum, which is $5,000 for me; I think it’s important to know the maximum amount of money you can spend on health care per year under your insurance plan. Other nice things to know: what are the coverage levels and frequency limits, if any, on physical exams/immunizations, cancer screenings, hospitalizations, and urgent care centers? I think coverage on these four services with some brief explanations would fit nicely on the card.</p>
<p>There’s only so much text you can fit on a card, but you can fit much more in non-human readable code, and you can fit much more code in a magnetic stripe or chip. If insurance formulas are standardized into a coded format, then what you could do is encode the full plan data into a magnetic stripe on an insurance card. Then, to calculate the cost to a patient, a medical services provider with a computer program capable of reading this format could simply enter the services that the patient will require, swipe the card, and have it output the total dollar price. This solves an issue that I found very annoying regarding medical bills, which is that I generally don’t know how much I’m going to be charged until days or weeks later, when I get a claim in my email. Part of the difficulty in telling patients what they’ll owe right away is variation in coverage levels. But if insurance plan coverage is standardized into a coded format, the receptionist just needs to itemize the bill and swipe, and a bill appears for the patient right then and there. The program could even automatically send claim information to the insurer as well. This would be more convenient for the patient and medical service office and, I think, less error-prone (I have an error on one of my bills that I’m going to have to sort out on the phone next Monday, and I can’t say that I am looking forward to the back-and-forth calls that this will entail).</p>
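<p>To make the idea concrete, here is a very rough Python sketch of what such a program might do once it has read the plan data off the card. Everything about the format is hypothetical: the dictionary keys, the function name <code>patient_cost</code>, and especially the simplifying assumption that every dollar the patient has already paid this year counts toward both the deductible and the out-of-pocket maximum. Real plans have many more rules. The numbers are the ones from my own plan described above.</p>

```python
# Hypothetical "coded" plan data, as it might be read from a card.
plan = {
    "deductible": 2100.0,          # individual annual deductible
    "coinsurance": 0.20,           # patient's share after the deductible
    "out_of_pocket_max": 5000.0,   # most the patient pays in a year
}

def patient_cost(plan, billed, already_paid=0.0):
    """What the patient owes on a bill, given what they have
    already paid out of pocket this year (simplified model)."""
    remaining_cap = max(0.0, plan["out_of_pocket_max"] - already_paid)
    remaining_deductible = max(0.0, plan["deductible"] - already_paid)
    # The patient pays the bill in full until the deductible is met...
    deductible_part = min(billed, remaining_deductible)
    # ...then only the coinsurance share of whatever is left...
    coinsurance_part = (billed - deductible_part) * plan["coinsurance"]
    # ...and never more than the out-of-pocket maximum allows.
    return min(deductible_part + coinsurance_part, remaining_cap)

# A 3,000 dollar bill at the start of the year.
print(patient_cost(plan, 3000.0))
```

<p>For that first bill of the year, the patient would owe the full $2,100 deductible plus 20% of the remaining $900.</p>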
]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/09/07/improve-the-insurance-card/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>Weak assumptions</title>
		<link>http://travelinglandsbeyond.com/2012/08/24/weak-assumptions/</link>
		<comments>http://travelinglandsbeyond.com/2012/08/24/weak-assumptions/#comments</comments>
		<pubDate>Fri, 24 Aug 2012 06:27:16 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[black-scholes]]></category>
		<category><![CDATA[finance]]></category>
		<category><![CDATA[modeling]]></category>
		<category><![CDATA[normal distribution]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=357</guid>
		<description><![CDATA[This post may be easier to read if you have some comfort with financial mathematics. Thousands of people across the history of finance have dutifully memorized one of the most famous results in financial mathematics, the Black-Scholes formula for pricing a European option. For the sake of completeness (skip ahead if you like), here is [...]]]></description>
				<content:encoded><![CDATA[<p>This post may be easier to read if you have some comfort with financial mathematics.</p>
<p>Thousands of people across the history of finance have dutifully memorized one of the most famous results in financial mathematics, the <a href="http://en.wikipedia.org/wiki/Black-Scholes">Black-Scholes formula</a> for pricing a European option. For the sake of completeness (skip ahead if you like), here is the formula for pricing a European call (<em>C</em>) or put (<em>P</em>) on a non-dividend-paying asset, which you can also find in countless textbooks and on countless websites:</p>
<p><img src='http://s0.wp.com/latex.php?latex=C+%3D+SN%28d_1%29+-+Ke%5E%7B-rt%7DN%28d_2%29+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='C = SN(d_1) - Ke^{-rt}N(d_2) ' title='C = SN(d_1) - Ke^{-rt}N(d_2) ' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=P+%3D+N%28-d_2%29Ke%5E%7B-rt%7D+-+SN%28-d_1%29+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='P = N(-d_2)Ke^{-rt} - SN(-d_1) ' title='P = N(-d_2)Ke^{-rt} - SN(-d_1) ' class='latex' /></p>
<p>where</p>
<p><img src='http://s0.wp.com/latex.php?latex=d_1+%3D+%5Cfrac%7B%5Cln%28S%2FK%29+%2B+%28r+%2B+%5Csigma%5E2%2F2%29t%7D%7B%5Csigma+%5Csqrt%7Bt%7D%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='d_1 = &#92;frac{&#92;ln(S/K) + (r + &#92;sigma^2/2)t}{&#92;sigma &#92;sqrt{t}} ' title='d_1 = &#92;frac{&#92;ln(S/K) + (r + &#92;sigma^2/2)t}{&#92;sigma &#92;sqrt{t}} ' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=d_2+%3D+%5Cfrac%7B%5Cln%28S%2FK%29+%2B+%28r+-+%5Csigma%5E2%2F2%29t%7D%7B%5Csigma+%5Csqrt%7Bt%7D%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='d_2 = &#92;frac{&#92;ln(S/K) + (r - &#92;sigma^2/2)t}{&#92;sigma &#92;sqrt{t}} ' title='d_2 = &#92;frac{&#92;ln(S/K) + (r - &#92;sigma^2/2)t}{&#92;sigma &#92;sqrt{t}} ' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=%28d_2+%3D+d_1+-+%5Csigma+%5Csqrt%7Bt%7D%29&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='(d_2 = d_1 - &#92;sigma &#92;sqrt{t})' title='(d_2 = d_1 - &#92;sigma &#92;sqrt{t})' class='latex' /></p>
<p>and <em>S</em> is the underlying asset price, <em>K</em> is the strike price of the option, <em>t</em> is the time to option expiry, <em>r</em> is the interest rate out to time <em>t</em>, <em>σ</em> is the volatility of the underlying asset, and <em>N()</em> represents the cdf of a standard <a href="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a>.</p>
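<p>As a rough illustration, the pricing formulas above can be sketched in a few lines of Python. The function names (norm_cdf, d1_d2, call_price, put_price) are hypothetical and chosen only for this example; <em>N()</em> is computed from the standard library’s error function.</p>

```python
import math


def norm_cdf(x):
    """Standard normal cdf N(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def d1_d2(S, K, t, r, sigma):
    """Return (d1, d2) for asset price S, strike K, time to expiry t,
    interest rate r and volatility sigma."""
    vol = sigma * math.sqrt(t)
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * t) / vol
    return d1, d1 - vol


def call_price(S, K, t, r, sigma):
    """European call on a non-dividend-paying asset: S N(d1) - K e^{-rt} N(d2)."""
    d1, d2 = d1_d2(S, K, t, r, sigma)
    return S * norm_cdf(d1) - K * math.exp(-r * t) * norm_cdf(d2)


def put_price(S, K, t, r, sigma):
    """European put on a non-dividend-paying asset: N(-d2) K e^{-rt} - S N(-d1)."""
    d1, d2 = d1_d2(S, K, t, r, sigma)
    return norm_cdf(-d2) * K * math.exp(-r * t) - S * norm_cdf(-d1)


if __name__ == "__main__":
    print(call_price(100.0, 100.0, 1.0, 0.05, 0.2))
    print(put_price(100.0, 100.0, 1.0, 0.05, 0.2))
```

<p>With <em>S</em> = <em>K</em> = 100, <em>t</em> = 1, <em>r</em> = 5% and <em>σ</em> = 20%, this gives a call price of about 10.45 and a put price of about 5.57.</p>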
<p>It is important to remember that while this is a ubiquitous formula used to price options, so much so that option prices are thought of by many traders in terms of their Black-Scholes volatility rather than their dollar price, it is only a mathematical model and is only correct insofar as its assumptions are met. And as with all models, real life matches the model assumptions imperfectly. You could come up with another option pricing model based off of different assumptions and in some sense it would be no more “right” or “wrong” than Black-Scholes; the area of debate would be how well those assumptions fit reality.</p>
<p>For example, let’s say that you had an option on a small pharmaceutical company that was awaiting FDA approval on its only product, a drug upon which the entire firm’s fortunes rested. If the FDA approved, the stock would go to $100, and if not, the stock would go to $0. In this case Black-Scholes’s assumptions about the dynamics of the stock price are very poorly met, and it would not be a great model to use.</p>
<p>Some financiers who are particularly dutiful have also memorized formulas for the basic <a href="http://en.wikipedia.org/wiki/Black-Scholes#The_Greeks">Black-Scholes greeks</a>. For example, the deltas (sensitivities to underlying asset price) of a call and a put are</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+S%7D+%3D+N%28d_1%29+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='&#92;frac{&#92;partial C}{&#92;partial S} = N(d_1) ' title='&#92;frac{&#92;partial C}{&#92;partial S} = N(d_1) ' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B%5Cpartial+P%7D%7B%5Cpartial+S%7D+%3D+N%28d_1%29+-+1+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='&#92;frac{&#92;partial P}{&#92;partial S} = N(d_1) - 1 ' title='&#92;frac{&#92;partial P}{&#92;partial S} = N(d_1) - 1 ' class='latex' /></p>
<p>The relationship between the delta of a call and a put of the same strike and expiry is therefore: call delta &#8211; put delta = 1. The formulas for the deltas are strictly Black-Scholes; you can get them by taking the derivative of the Black-Scholes pricing formula, and they might not be accurate under a different option pricing model. But the relationship between the two is not specific to Black-Scholes; it depends solely on put-call parity.</p>
<p>Put-call parity states that the price of a call minus the price of a put equals the asset price minus the discounted present value of the strike price. It is a much weaker assumption than those that underlie Black-Scholes. You don’t need to say anything about volatility, or Brownian motion, or continuous-time hedging. Not only that, it’s very intuitive and logical: if you have the right to buy a stock at $100 at some point in the future (which you would use if the stock is above $100), and someone has the right to sell the stock to you at $100 at that same point in time (which they would use if the stock is below $100), you essentially have a forward agreement to buy the stock at $100, which at that point in time will be worth the stock price less $100, and which today will be worth the stock price less the discounted value of $100 at expiry. It’s much harder to imagine scenarios in which put-call parity would be violated than in which Black-Scholes assumptions are violated (in fact Black-Scholes assumptions imply put-call parity).</p>
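<p>To make this concrete, here is a small sketch (with hypothetical function names and an arbitrary bump size) that checks two things numerically: that Black-Scholes prices satisfy put-call parity, <em>C</em> &#8211; <em>P</em> = <em>S</em> &#8211; <em>K</em>e<sup>&#8211;<em>rt</em></sup>, and that call delta minus put delta equals 1 when the deltas are estimated purely from price differences rather than from the <em>N(d<sub>1</sub>)</em> formula.</p>

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_prices(S, K, t, r, sigma):
    """Black-Scholes (call, put) prices for a non-dividend-paying asset."""
    vol = sigma * math.sqrt(t)
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * t) / vol
    d2 = d1 - vol
    disc_K = K * math.exp(-r * t)  # discounted strike
    call = S * norm_cdf(d1) - disc_K * norm_cdf(d2)
    put = disc_K * norm_cdf(-d2) - S * norm_cdf(-d1)
    return call, put


def parity_gap(S, K, t, r, sigma):
    """(C - P) - (S - K e^{-rt}); zero when put-call parity holds."""
    call, put = bs_prices(S, K, t, r, sigma)
    return (call - put) - (S - K * math.exp(-r * t))


def delta_difference(S, K, t, r, sigma, bump=1e-4):
    """Call delta minus put delta, by central finite differences in S."""
    c_up, p_up = bs_prices(S + bump, K, t, r, sigma)
    c_dn, p_dn = bs_prices(S - bump, K, t, r, sigma)
    return ((c_up - c_dn) - (p_up - p_dn)) / (2.0 * bump)
```

<p>Any other pricing function that respects parity could be dropped in place of bs_prices and the delta check would still come out to 1, since <em>C</em> &#8211; <em>P</em> is then linear in <em>S</em> with slope 1.</p>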
<p>What this means is that any options model that accepts the weak and almost always realistic assumption of put-call parity must have the same relationship between call delta and put delta. Let’s look at another slightly trickier example, regarding vega (sensitivity to volatility) and theta (sensitivity to the passage of time). The Black-Scholes formulas for vega and theta of a call are:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+%5Csigma%7D+%3D+SN%27%28d_1%29+%5Csqrt%7Bt%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='&#92;frac{&#92;partial C}{&#92;partial &#92;sigma} = SN&#039;(d_1) &#92;sqrt{t} ' title='&#92;frac{&#92;partial C}{&#92;partial &#92;sigma} = SN&#039;(d_1) &#92;sqrt{t} ' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=-%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+t%7D+%3D+-SN%27%28d_1%29+%5Cfrac%7B%5Csigma%7D%7B2+%5Csqrt%7Bt%7D%7D+-+rKe%5E%7B-rt%7DN%28d_2%29+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='-&#92;frac{&#92;partial C}{&#92;partial t} = -SN&#039;(d_1) &#92;frac{&#92;sigma}{2 &#92;sqrt{t}} - rKe^{-rt}N(d_2) ' title='-&#92;frac{&#92;partial C}{&#92;partial t} = -SN&#039;(d_1) &#92;frac{&#92;sigma}{2 &#92;sqrt{t}} - rKe^{-rt}N(d_2) ' class='latex' /></p>
<p>(The negative sign in the theta is there because I have represented <em>t</em> as time to expiry, and theta is typically thought of as how value changes as time moves forward, in which case <em>t</em> would be decreasing.) Let’s further assume that the interest rate is zero, so that the theta simplifies to:</p>
<p><img src='http://s0.wp.com/latex.php?latex=-%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+t%7D+%3D+-SN%27%28d_1%29+%5Cfrac%7B%5Csigma%7D%7B2+%5Csqrt%7Bt%7D%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='-&#92;frac{&#92;partial C}{&#92;partial t} = -SN&#039;(d_1) &#92;frac{&#92;sigma}{2 &#92;sqrt{t}} ' title='-&#92;frac{&#92;partial C}{&#92;partial t} = -SN&#039;(d_1) &#92;frac{&#92;sigma}{2 &#92;sqrt{t}} ' class='latex' /></p>
<p>In this case, the relationship between vega and theta is:</p>
<p><img src='http://s0.wp.com/latex.php?latex=-%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+%5Csigma%7D+%5Cfrac%7B%5Csigma%7D%7B2t%7D+%3D+-%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+t%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='-&#92;frac{&#92;partial C}{&#92;partial &#92;sigma} &#92;frac{&#92;sigma}{2t} = -&#92;frac{&#92;partial C}{&#92;partial t} ' title='-&#92;frac{&#92;partial C}{&#92;partial &#92;sigma} &#92;frac{&#92;sigma}{2t} = -&#92;frac{&#92;partial C}{&#92;partial t} ' class='latex' /></p>
<p>This relationship, though under the further assumption of a zero interest rate, holds under a weaker assumption than Black-Scholes: it requires that your volatility parameter (however you define that) and your time to expiry are used in the price solely in the form of an intermediate parameter σ * sqrt(<em>t</em>). To see this mathematically, let’s write the call price as some unspecified function of this intermediate parameter:</p>
<p><img src='http://s0.wp.com/latex.php?latex=C+%3D+f%28%5Csigma+%5Csqrt%7Bt%7D%29+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='C = f(&#92;sigma &#92;sqrt{t}) ' title='C = f(&#92;sigma &#92;sqrt{t}) ' class='latex' /></p>
<p>Then if we take derivatives with the chain rule:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+%5Csigma%7D+%3D+f%27%28%5Csigma+%5Csqrt%7Bt%7D%29+%5Csqrt%7Bt%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='&#92;frac{&#92;partial C}{&#92;partial &#92;sigma} = f&#039;(&#92;sigma &#92;sqrt{t}) &#92;sqrt{t} ' title='&#92;frac{&#92;partial C}{&#92;partial &#92;sigma} = f&#039;(&#92;sigma &#92;sqrt{t}) &#92;sqrt{t} ' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=-%5Cfrac%7B%5Cpartial+C%7D%7B%5Cpartial+t%7D+%3D+-f%27%28%5Csigma+%5Csqrt%7Bt%7D%29+%5Cfrac%7B%5Csigma%7D%7B2+%5Csqrt%7Bt%7D%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='-&#92;frac{&#92;partial C}{&#92;partial t} = -f&#039;(&#92;sigma &#92;sqrt{t}) &#92;frac{&#92;sigma}{2 &#92;sqrt{t}} ' title='-&#92;frac{&#92;partial C}{&#92;partial t} = -f&#039;(&#92;sigma &#92;sqrt{t}) &#92;frac{&#92;sigma}{2 &#92;sqrt{t}} ' class='latex' /></p>
<p>and you can see that the relationship holds. If interest rates are zero, Black-Scholes does satisfy this weaker assumption; if we define <em>V</em> = <em>σ</em> * sqrt(<em>t</em>), the d1 and d2 terms can be rewritten as:</p>
<p><img src='http://s0.wp.com/latex.php?latex=d_1+%3D+%5Cfrac%7B%5Cln%28S%2FK%29+%2B+V%5E2%2F2%7D%7BV%7D+&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='d_1 = &#92;frac{&#92;ln(S/K) + V^2/2}{V} ' title='d_1 = &#92;frac{&#92;ln(S/K) + V^2/2}{V} ' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=d_2+%3D+d_1+-+V&amp;bg=ffffff&amp;fg=1c1c1c&amp;s=0' alt='d_2 = d_1 - V' title='d_2 = d_1 - V' class='latex' /></p>
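<p>Here is a quick numerical sanity check of the vega-theta relationship, as a sketch under the zero-interest-rate assumption. It bumps <em>σ</em> and <em>t</em> by small amounts (finite differences) instead of using the closed-form greeks, and compares vega times <em>σ</em>/(2<em>t</em>) against the size of the time decay, measured as the change in call price per unit of time to expiry. The function names and the bump size are assumptions chosen for illustration.</p>

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_price_zero_rate(S, K, t, sigma):
    """Black-Scholes call price with the interest rate set to zero."""
    V = sigma * math.sqrt(t)  # sigma and t enter only through V
    d1 = (math.log(S / K) + 0.5 * V ** 2) / V
    return S * norm_cdf(d1) - K * norm_cdf(d1 - V)


def vega(S, K, t, sigma, bump=1e-5):
    """Sensitivity of the call price to sigma, by central differences."""
    up = call_price_zero_rate(S, K, t, sigma + bump)
    dn = call_price_zero_rate(S, K, t, sigma - bump)
    return (up - dn) / (2.0 * bump)


def time_sensitivity(S, K, t, sigma, bump=1e-5):
    """Change in call price per unit of time to expiry, by central differences."""
    up = call_price_zero_rate(S, K, t + bump, sigma)
    dn = call_price_zero_rate(S, K, t - bump, sigma)
    return (up - dn) / (2.0 * bump)


if __name__ == "__main__":
    S, K, t, sigma = 100.0, 105.0, 0.75, 0.3
    print(vega(S, K, t, sigma) * sigma / (2.0 * t))
    print(time_sensitivity(S, K, t, sigma))
```

<p>The two printed numbers should agree to several decimal places. Swapping in any other price function that depends on <em>σ</em> and <em>t</em> only through <em>σ</em> * sqrt(<em>t</em>) should leave the agreement intact.</p>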
<p>We might call <em>V</em> “total” volatility. The intuition behind tying <em>σ</em> and <em>t</em> together is that an option price depends on the probability distribution of the asset out to time <em>t</em>, which in turn depends on a) what the value of <em>t</em> is and b) how “innately” volatile the asset is, represented by <em>σ</em>. A high-volatility asset will have a wider distribution than a low-volatility asset over the same time frame, but the low-volatility asset will have a wider distribution at some point if you examine it over a sufficiently longer time frame than the high-volatility asset. Combining the two parameters as <em>V</em> = σ * sqrt(<em>t</em>) is to say that you’ve defined your <em>σ</em> as a per-root-time measure of volatility, or, more simply, you’ve defined <em>σ</em><sup>2</sup> as a per-time measure of volatility. For those who have taken some stochastic math, you’ll know that this is indeed true of standard Brownian motion: variance at time <em>t</em> is <em>σ</em><sup>2</sup><em>t</em>.</p>
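<p>A small sketch of the “total” volatility idea, assuming a zero interest rate: three hypothetical assets with different <em>σ</em>, each examined over a different time frame chosen so that <em>V</em> comes out the same, should all get the same call price. The scenario numbers and function names are made up for illustration.</p>

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def total_vol(sigma, t):
    """V = sigma * sqrt(t)."""
    return sigma * math.sqrt(t)


def call_from_total_vol(S, K, V):
    """Zero-rate Black-Scholes call price written as a function of V only."""
    d1 = math.log(S / K) / V + 0.5 * V
    return S * norm_cdf(d1) - K * norm_cdf(d1 - V)


# (sigma, t) pairs: high volatility over a short horizon down to
# low volatility over a long horizon, all with V = 0.2.
scenarios = [(0.4, 0.25), (0.2, 1.0), (0.1, 4.0)]
prices = [call_from_total_vol(100.0, 110.0, total_vol(s, t)) for s, t in scenarios]

if __name__ == "__main__":
    for (s, t), price in zip(scenarios, prices):
        print(s, t, price)
```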
<p>Why might you be interested in this (which otherwise seems like a small mathematical exercise to kick at financial interview candidates)? Of course, the fewer assumptions your models need, the better, and we can more broadly and confidently apply any aspects of our modeling framework that depend on only a subset of the full assumptions. It’s not simply that we need to worry that much less about matching assumptions and reality, but also that these aspects of the model will be robust to changes in a real-world environment. In times of financial crisis, certain assumptions that were a very strong fit to reality for a long time may suddenly fall apart. Rather than either relying on violable assumptions or throwing out a model that does actually work most of the time, we can assess what aspects of our models rely on exactly what assumptions and be aware of what will and will not hold up in a changing environment.</p>
]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/08/24/weak-assumptions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>The Heritage Health Prize</title>
		<link>http://travelinglandsbeyond.com/2012/08/21/the-heritage-health-prize/</link>
		<comments>http://travelinglandsbeyond.com/2012/08/21/the-heritage-health-prize/#comments</comments>
		<pubDate>Wed, 22 Aug 2012 02:20:57 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[amazon ec2]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[healthcare]]></category>
		<category><![CDATA[heritage health prize]]></category>
		<category><![CDATA[kaggle]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[prediction]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=354</guid>
		<description><![CDATA[I started work on a second data competition, the Heritage Health Prize, which is well-known in the community as it has a very large purse, $3 million to the winning team. The objective of this competition is to predict hospitalizations for patients, given health insurance claims data for those patients in previous years. It is [...]]]></description>
				<content:encoded><![CDATA[<p>I started work on a second data competition, the <a href="http://www.heritagehealthprize.com/c/hhp">Heritage Health Prize</a>, which is well-known in the community as it has a very large purse, $3 million to the winning team. The objective of this competition is to predict hospitalizations for patients, given health insurance claims data for those patients in previous years. It is a tremendous application of data analysis, as I think healthcare is extremely fertile ground for increasing efficiency by being smarter about care and prescription and procedure. I may be off-and-on with this one, working on it for a while and then letting it sit for a while; as before, my objective is to learn as much as I can, not realistically to win, and if I feel like I’m spinning my wheels I’ll drop it for a while.</p>
<p>What I particularly like about this competition is the “Milestone Prizes” that the organizers also award. The competition will last for two years, and every 6 months the top 2 entrants win a much smaller but not insubstantial prize, in the five-digit dollar range. In order to claim the Milestones, the winning teams must submit a write-up of their methodology, to the organizers’ satisfaction. Here are links to the <a href="http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1">Milestone 1</a> and <a href="https://www.heritagehealthprize.com/c/hhp/leaderboard/milestone2">Milestone 2</a> papers. (You can only read those PDFs if you are in the competition, unfortunately, and I don’t intend to re-share them if the organizers don’t want them to be shared.)</p>
<p>Two Milestones have passed, with the third coming up in a few weeks. The papers have been tremendously helpful in getting started; my initial approach has been a highly simplified version of their procedures, and it’s good enough to get to 211th place out of 1268 (though only 818 entries right now clear a naïve-ish benchmark where every entry is predicted at an optimized constant value, and I say “naïve-ish” because the <a href="http://www.heritagehealthprize.com/c/hhp/forums/t/661/the-optimized-constant-value-benchmark">method</a> for deducing that optimized constant is thoughtful). Unfortunately my efforts to make my models more sophisticated along the lines of the papers have not yielded much improvement beyond my initial go, but hopefully I’ll figure something out.</p>
<p>Although two Milestones have passed, it is helpful to read the first Milestone papers first, because the later ones build on/make reference to the previous ones. I was surprised by the similarity of the papers’ structure, despite being written independently:</p>
<p><strong>Features</strong>: from the raw data supplied by the competition, what variables became the input into your prediction algorithms? In some cases, there is no transformation; you feed the competition data right through. In other cases, the papers calculated per-patient averages, minimums and maximums, etc. and fed those through.</p>
<p><strong>Algorithms</strong>: in general, strong entries use more than one (see “Ensembling” below). This is where some of the ornery mathematics comes into play, and to really do a good job here you need to read some academic papers. But many of the established statistical models have already been implemented in languages such as R, so if you simply want to get an entry on the leaderboard you actually don’t need to know too much about the models; download them and run them as a black box. (I’m still learning about these models and yet I’ve managed to write implementations that use them.) R in particular has strong community development of these statistical models and is what I’ve been using. The algorithms that are new to me that I’ve been trying to learn so far are called gradient boosting and random forests.</p>
<p><strong>Feature selection</strong>: a model is a combination of an algorithm and a subset of the available features. You might run the same algorithm on two different subsets of the features and call those two separate models. Models with the same algorithms may benefit less from the ensembling step (see below) because they may perform similarly well or similarly poorly on a given data point, but the papers both seem to employ this strategy to generate better predictions.</p>
<p><strong>Ensembling</strong>: it seems the established way to get a strong overall model is to harness many different prediction models and ensemble them with a top-level algorithm that weights the models accordingly. The idea is that different models may perform well on different subsets of the data (for whatever reason; the “why” may not be well understood), so if you can combine them in a manner that uses the best suited model for each data point, you’ll have a very strong predictor. I actually find the papers to be a little sparse on some details here (maybe because I’m inexperienced) but I think the procedure followed by the Milestone winners is to run what’s called a ridge regression to calculate weightings for each model and for the final prediction to be a linear combination of the models.</p>
<p><strong>Miscellany</strong>: One of the Milestone papers interestingly pointed out that the distribution for one feature changed sharply in the last year of available data. In finance we’d call this a “regime change.” The authors decided to toss that feature entirely as a result. They illustrated what clearly does appear to be a change in the nature of the feature’s statistical distribution but did not provide a concrete quantitative test for it, and my own efforts to write such a screen haven’t been successful so far. The issue is that you may not worry about a change in the mean or variance or even a few higher-order moments of a feature’s statistical distribution, but you may be worried if the variable’s family of distributions changed; if something used to be normally distributed and suddenly becomes uniformly distributed, that’s a real problem.</p>
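<p>One possible shape for such a screen, offered only as a sketch and not as anything the Milestone papers did: standardize each year’s values of a feature (so that shifts in mean and variance are ignored) and then take the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cdfs. The function names are hypothetical, the standardization step is my own assumption about how to isolate a change in distribution shape, and no threshold for “too different” is suggested here.</p>

```python
from bisect import bisect_right
from statistics import mean, pstdev


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical cdfs of the two samples (between 0 and 1)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    gap = 0.0
    # The empirical cdfs only change at observed values, so checking
    # every observed value is enough to find the largest gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def standardize(sample):
    """Rescale to mean 0 and standard deviation 1.
    Assumes the sample is not constant (pstdev would be zero)."""
    m, s = mean(sample), pstdev(sample)
    return [(x - m) / s for x in sample]


def shape_gap(old_values, new_values):
    """KS statistic after standardizing both samples, so that the result
    reflects a change in shape more than a change in location or scale."""
    return ks_statistic(standardize(old_values), standardize(new_values))
```

<p>Two identical samples give a statistic of 0 and two samples that do not overlap at all give 1; where to draw the line in between is exactly the part that remains a judgment call.</p>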
<p>There was little attempt to impose a real-world interpretation on the raw data. The winners generally didn’t try to say something about why their models do what they do with drug prescriptions, hospitalization locations, etc. With minor exceptions, they focused on getting good data and good data mining algorithms. To some degree the selection of features induces some kind of interpretation – why did you calculate this feature? why are you picking this subset of features? – but that is not explained in much depth, and I interpret that to mean that it was not done on the basis of heavy thinking about real-world meaning of the data.</p>
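<p>The ensembling step described above can be sketched as follows. This is my reading of the general idea, not the winners’ actual code: a ridge regression on the individual models’ predictions gives one weight per model, and the final prediction is the weighted sum. The function names, the data layout and the penalty value are all assumptions made for the example, which uses only the Python standard library.</p>

```python
def solve_linear_system(A, rhs):
    """Gaussian elimination with partial pivoting (fine for a handful of models)."""
    n = len(rhs)
    M = [row[:] + [rhs[k]] for k, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def ridge_weights(model_preds, targets, penalty):
    """Solve (X^T X + penalty * I) w = X^T y for the blend weights w.

    model_preds[k][i] is model k's prediction for data point i,
    targets[i] is the known outcome for data point i.
    """
    n_models = len(model_preds)
    A = [[sum(p * q for p, q in zip(model_preds[r], model_preds[c]))
          + (penalty if r == c else 0.0)
          for c in range(n_models)]
         for r in range(n_models)]
    rhs = [sum(p * y for p, y in zip(model_preds[r], targets))
           for r in range(n_models)]
    return solve_linear_system(A, rhs)


def blend(model_preds, weights):
    """Final prediction: linear combination of the models' predictions."""
    return [sum(w * p[i] for w, p in zip(weights, model_preds))
            for i in range(len(model_preds[0]))]
```

<p>A larger penalty shrinks the weights toward zero, which keeps any single model from dominating the blend when several models make highly correlated predictions.</p>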
<p>Having already been through a rookie stumbling phase with Amazon EC2, I am pleased to say I’m using it a bit more efficiently now. I’ve already got a “base” snapshot of a Linux install (Ubuntu) sitting around, and I’ve done all my work on a separate EC2 volume. If I ever want to cease work for a while, I can just detach the drive and stop the instance, and only pay for storage. If I want a lot of computing power or I want to try more than one thing in parallel, I can duplicate the volume, create some new higher-powered instances, attach the volumes to the new instances, and go. It’s pleasantly easy at this point to get started.</p>
]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/08/21/the-heritage-health-prize/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>Getting started with Linux and EC2 &#8211; online doc</title>
		<link>http://travelinglandsbeyond.com/2012/07/30/getting-started-with-linux-and-ec2-online-doc/</link>
		<comments>http://travelinglandsbeyond.com/2012/07/30/getting-started-with-linux-and-ec2-online-doc/#comments</comments>
		<pubDate>Tue, 31 Jul 2012 03:56:33 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[amazon ec2]]></category>
		<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=351</guid>
		<description><![CDATA[Over the past few days I consolidated some of what I’ve learned about EC2 and Linux into a Google document. It aims to teach someone who has some comfort with computers but may not necessarily be experienced with Linux (especially from an administrator’s perspective) how to get started with EC2. You can read it here. [...]]]></description>
				<content:encoded><![CDATA[<p>Over the past few days I consolidated some of what I’ve learned about EC2 and Linux into a Google document. It aims to teach someone who has some comfort with computers but may not necessarily be experienced with Linux (especially from an administrator’s perspective) how to get started with EC2. You can read it <a href="https://docs.google.com/document/d/1umKE7HN1LDOPLr9E91MFP7z1ph7vDLdG102lWS8MqLE/edit">here</a>. If you are ever interested in learning about EC2 I hope you find it useful to get your feet wet.</p>
]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/07/30/getting-started-with-linux-and-ec2-online-doc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>My (very non-expert) two cents on PPACA</title>
		<link>http://travelinglandsbeyond.com/2012/07/10/my-very-non-expert-two-cents-on-ppaca/</link>
		<comments>http://travelinglandsbeyond.com/2012/07/10/my-very-non-expert-two-cents-on-ppaca/#comments</comments>
		<pubDate>Wed, 11 Jul 2012 00:13:28 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[government]]></category>
		<category><![CDATA[healthcare]]></category>
		<category><![CDATA[ppaca]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=349</guid>
		<description><![CDATA[The Supreme Court’s recent ruling on the federal individual health care mandate (National Federation of Independent Business v. Sebelius) was necessarily ideological; the validity of the law with respect to the Constitution can hinge on points such as whether health insurance as mandated in the Patient Protection and Affordable Care Act (PPACA) can be interpreted [...]]]></description>
				<content:encoded><![CDATA[<p>The Supreme Court’s recent ruling on the federal individual health care mandate (<a href="http://www.supremecourt.gov/opinions/11pdf/11-393c3a2.pdf"><em>National Federation of Independent Business v. Sebelius</em></a>) was necessarily ideological; the validity of the law with respect to the Constitution can hinge on points such as whether health insurance as mandated in the <a href="http://en.wikipedia.org/wiki/PPACA">Patient Protection and Affordable Care Act (PPACA)</a> can be interpreted as a tax. But the Court’s ruling should not be thought of as an approval or affirmation of the idea of universal health care but solely of the PPACA’s particular approach to implementing health policy.</p>
<p>Somewhere along the way, as I think sometimes happens in politics, the debate about federally mandated health care insurance became an ideological debate and lost sight of what should be the real goal of health policy, which is to improve healthcare outcomes. Unfortunately the necessarily ideological points of the Supreme Court ruling don’t help the issue; as far as I can tell (being a person of left-wing sympathies), the aspect of the PPACA that gets right-wingers fired up is that it <em>is</em> a tax and they don’t like taxes, in a religiously axiomatic way.</p>
<p>I dislike these <a href="http://en.wikipedia.org/wiki/Grover_Norquist">fervid anti-tax stances</a>. As I believe has been said by famous economist Milton Friedman, <a href="http://www.economist.com/blogs/freeexchange/2009/10/to_spend_is_to_tax">“To spend is to tax.”</a> A true anarchist may support a zero tax rate because he supports zero government, but everyone else should have some level of taxation that he is willing to support in order to keep his desired level of government running. The taxes are a means to the government’s ends defined by our laws and policies, and I think people who attack taxes rather than expenditures take the wrong approach. Rather than opposing enactments <em>because</em> they are taxes, we should be making judgments on whether their contents are worth the taxes that will be necessary to support them.</p>
<p>I think it’s perfectly fair for people to disagree about whether the benefits of the PPACA merit its costs. Here reasonable left-wingers and right-wingers may differ in their personal opinions of the value of healthcare; one person may believe that an annual expense of up to $400 per person is justified to provide a certain level of universal health coverage, whereas another person may consider the justified amount to be $200 per person. (In March 2012 the CBO <a href="http://www.cbo.gov/publication/43080">estimated</a> the cost of the PPACA to be $1.1 trillion over the next 10 years, which, assuming an average US population over that time of 325-350 million, comes out to $314-$339 per person per year.) For any reasonable person there is some cost that is low enough and some cost that is too high.</p>
<p>The point that people should not disagree on, though, is that any improvement in healthcare efficiency – any policy which can improve the quality of healthcare treatment for the same amount expended – is a desirable outcome. For the same reason that we cheer on the development of more potent drugs, innovative surgical techniques, scientific breakthroughs in biochemistry and genetics, etc., we should also cheer on any government policy that delivers better healthcare for the same cost.</p>
<p>The right-wing response is generally to say that the least government involvement results in the most efficient outcomes, but this is a policy guideline and not a mathematical law. It’s well-known that the US spends the most money in the world per capita on healthcare; this is not inherently problematic, but it does suggest the question of whether we are getting the best healthcare bang for our buck. And there’s nothing incompatible with supporting government action in one sector of the economy while advocating a hands-off approach in others. This sounds obvious but I think the casually educated free-market pundit has a tendency to shoehorn every industry and every economic situation into whatever toolbox or philosophy she learned in undergraduate economics 101. Understanding basic economic principles is important but reality does get more complicated.</p>
<p>The big challenge for the left is to keep itself honest about the true goal of healthcare reform, which is to get a healthier populace per dollar spent. It is not my opinion that universal health insurance is an end in and of itself. I think some leftists embrace and rally around the “X for Everyone” mindset but I think that’s far better achieved through the market if possible; rather than mandating “X for Everyone” and supporting it with taxes, let’s get good X so affordable that anyone can buy it. At least I continue to think that’s the ideal way to do things whenever you can, for commoditized products, and I definitely think that inadequate concern for efficiency, or excessive concern for equality over efficiency, leads to serious long-run institutional frailties in most economic situations.</p>
<p>Health insurance is not necessarily a commoditized product for which we can rely on all-private markets (I think comparisons of the individual mandate to requiring people to buy some random product like a car are totally fallacious). But maybe the individual mandate in general, or the individual mandate as implemented by PPACA, isn’t the right answer either; leftists need to be honest with themselves when monitoring its ongoing success or failure. They should not be content with saying that we’ve now achieved universality and can rest on our laurels, and they should not lose sight of the goal of improving outcomes.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/travelinglandsbeyond.wordpress.com/349/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/travelinglandsbeyond.wordpress.com/349/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=travelinglandsbeyond.com&#038;blog=31018121&#038;post=349&#038;subd=travelinglandsbeyond&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/07/10/my-very-non-expert-two-cents-on-ppaca/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>Big data autodidacticism</title>
		<link>http://travelinglandsbeyond.com/2012/07/10/big-data-autodidacticism/</link>
		<comments>http://travelinglandsbeyond.com/2012/07/10/big-data-autodidacticism/#comments</comments>
		<pubDate>Tue, 10 Jul 2012 19:17:36 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[graph theory]]></category>
		<category><![CDATA[kaggle]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=346</guid>
		<description><![CDATA[The aforementioned Facebook data mining contest ends today. The contest was, given a directed graph with missing edges and a list of nodes, to predict up to 10 new edges for each node in the list to point to. This is the first time I’ve tried a Kaggle competition. I picked it up as a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=travelinglandsbeyond.com&#038;blog=31018121&#038;post=346&#038;subd=travelinglandsbeyond&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>The aforementioned <a href="http://www.kaggle.com/c/FacebookRecruiting">Facebook data mining contest</a> ends today. The contest was, given a directed graph with missing edges and a list of nodes, to predict up to 10 new edges for each node in the list to point to. This is the first time I’ve tried a <a href="http://www.kaggle.com/">Kaggle</a> competition. I picked it up as a way to teach myself about machine learning and data analysis techniques. I’ve also done a bit of reading from Toby Segaran’s <a href="http://shop.oreilly.com/product/9780596529321.do"><em>Programming Collective Intelligence</em></a> (I also have Drew Conway and John Myles White’s <a href="http://shop.oreilly.com/product/0636920018483.do"><em>Machine Learning for Hackers</em></a> but haven’t really gone beyond the intros yet). And I’ve also been trying out a machine learning course from Coursera, given by Stanford professor Andrew Ng, which is just finishing up as well.</p>
<p>On Kaggle I’m somewhere around the 75th to 80th percentile, although I’m afraid to say my solution is essentially the same as one posted (possibly against the rules?) in the discussion forums, so not really an original idea on my part. For an early description of my attempts, see the <a href="http://travelinglandsbeyond.com/2012/06/20/edge-prediction/">previous post</a>. As it turns out, those attempts all fared worse than a <a href="http://en.wikipedia.org/wiki/PageRank">PageRank</a>-like algorithm that operated as follows, given a node for which you want to predict outgoing edges:</p>
<ol>
<li>Every other node is initially scored zero.</li>
<li>Send a value of 1/(# of edges) along each edge to each neighbor, on both outgoing and incoming edges. So nodes that point to this node and nodes that this node points to will both receive this value, and a neighbor that both points to and is pointed to by the node in question will receive 2x this value.</li>
<li>Add the value received by each neighboring node to its score.</li>
<li>Repeat steps 2 and 3 recursively twice, going out to the neighbors’ neighbors and the neighbors’ neighbors’ neighbors, but in these cases, if sending a value across an incoming edge (against the direction in which the edge points), do not add the value received by the neighbor to its score.</li>
</ol>
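<p>The steps above can be sketched in Python roughly as follows. This is only an illustration, not the forum solution or my actual code: the function names and the dict-of-lists graph layout are made up for the example, and it assumes that each node splits the value it received evenly across all of its edges and that a value which goes unscored still keeps propagating outward.</p>

```python
from collections import defaultdict


def propagate_scores(out_edges, in_edges, source, depth=3):
    """Score nodes around `source` by pushing value along edges.

    out_edges[n] lists the nodes n points to; in_edges[n] lists the nodes
    pointing to n. Both are plain dicts of lists (hypothetical layout).
    """
    scores = defaultdict(float)

    def send(node, value, level):
        # Pair each neighbor with a flag: True when we travel along the
        # edge's own direction, False when we travel against it.
        targets = [(n, True) for n in out_edges.get(node, ())]
        targets += [(n, False) for n in in_edges.get(node, ())]
        if not targets:
            return
        share = value / len(targets)  # 1/(# of edges) on the first pass
        for neighbor, forward in targets:
            # The first pass scores every neighbor; later passes only
            # score values that arrive along an edge's own direction.
            if forward or level == 1:
                scores[neighbor] += share
            # Assumption: unscored values still keep flowing outward.
            if level != depth:
                send(neighbor, share, level + 1)

    send(source, 1.0, 1)
    scores.pop(source, None)  # the source itself is never a candidate
    return dict(scores)


def recommend(out_edges, in_edges, source, limit=10):
    """Return up to `limit` nodes that `source` does not already point to."""
    scores = propagate_scores(out_edges, in_edges, source)
    existing = set(out_edges.get(source, ()))
    candidates = [n for n in scores if n not in existing]
    candidates.sort(key=lambda n: (-scores[n], n))  # best score first
    return candidates[:limit]
```

<p>A neighbor connected in both directions shows up twice in the target list, which is how it ends up with 2x the value. The plain recursion revisits nodes freely, which is part of why a naive version like this one is slow on a graph with millions of edges.</p>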
<p>Note that this is not a probability distribution across nodes. I avoided looking at the forum-posted solution and implementation for a while, then finally when I thought I was kind of spinning my wheels I read it through and punted around a few random improvements, but none of them really worked. (I did re-implement the solution in my own code framework, of course.)</p>
<p>Prior to starting on Kaggle, I had been sort of following along and plugging away at the examples in Segaran’s book, reproducing the code, running the examples myself, etc. I was learning, but I think it really helps to have some kind of project or target to go after. It’s the difference between, say, learning music by listening to lots of songs and reading scores and charts and theory, and learning by actually picking up an instrument and playing. (During this time I actually picked up guitar as well – it’s not a bad change of pace when you need one, and it’s nice to fiddle around with one while a slow-moving program is running.) I do still plan to return to his book and continue along with more examples, hopefully with a better appreciation and a faster learning rate now that I’ve tried a project.</p>
<p>Participating in the competition was definitely educational, but as mentioned, it does lend itself to some wheel spinning. When you submit predictions, the competition computes your overall score (using a metric publicly defined in the rules), but it gives you no details about what you got right and what you got wrong, as you might actually have in a real-life situation. Obviously they have to do this so that people don’t just submit a solution that is overfit to the test data. But this does mean, I think, that you’re going to learn at a slower pace.</p>
<p>Kaggle did give me the chance to use <a href="http://aws.amazon.com/ec2/">Amazon EC2</a> for what is ostensibly its “real” purpose, which is to purchase computing power by the hour. The algorithm described above is slow (at least my implementation of it was slow, maybe someone out there has a smarter and speedier version), and would take hours and possibly days to run on my laptop (a MacBook Air). Once I started getting to the point where my algorithms were taking this long, I took it to the cloud, spinning up a high powered Linux instance, uploading the code, and running it there. It still would take a few hours by the end, but that’s a bearable runtime.</p>
<p>To take full advantage of the multiple cores on the high-end EC2 instances I had to rewrite the code to support multithreading, which was something I hadn’t done before, and which was in my opinion generally a frustrating experience, lending itself to unpredictable crashes and more challenging debugging.</p>
<p>A word or two about <a href="https://www.coursera.org/">Coursera</a>, whose machine learning course I’m finishing up now: I liked it enough to try some more courses, but at times it felt like I was just going through the motions. To extend my music analogies, it felt like I was indeed actually playing guitar, but someone was sitting behind me holding my hands, making me strum and finger all the chords. I’m not positive how much I will retain and how much will slide out my ears within the coming weeks. The slides and the presentations are good reads, but the programming exercises aren’t all that. The benefit you get from taking an in-person, structured class is that you also have close contact and cooperation with classmates; maybe you realistically can’t do Courseras unless they’re coupled with <a href="http://www.meetup.com/">Meetups</a>.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/travelinglandsbeyond.wordpress.com/346/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/travelinglandsbeyond.wordpress.com/346/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=travelinglandsbeyond.com&#038;blog=31018121&#038;post=346&#038;subd=travelinglandsbeyond&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/07/10/big-data-autodidacticism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
		<item>
		<title>Edge prediction</title>
		<link>http://travelinglandsbeyond.com/2012/06/20/edge-prediction/</link>
		<comments>http://travelinglandsbeyond.com/2012/06/20/edge-prediction/#comments</comments>
		<pubDate>Wed, 20 Jun 2012 20:44:48 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[graph theory]]></category>
		<category><![CDATA[kaggle]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[prediction]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://travelinglandsbeyond.com/?p=344</guid>
		<description><![CDATA[Recently I’ve been working on a Kaggle competition sponsored by Facebook. Kaggle is a website onto which firms and organizations can upload their own data mining competitions open to the public. They will provide some sort of input/output data set, named a training set in the lingo of machine learning, which competitors use to create [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=travelinglandsbeyond.com&#038;blog=31018121&#038;post=344&#038;subd=travelinglandsbeyond&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Recently I’ve been working on a <a href="http://www.kaggle.com/">Kaggle</a> <a href="http://www.kaggle.com/c/FacebookRecruiting">competition</a> sponsored by <a href="http://www.facebook.com/">Facebook</a>. Kaggle is a website onto which firms and organizations can upload their own data mining competitions open to the public. They will provide some sort of input/output data set, named a training set in the lingo of machine learning, which competitors use to create their predictive algorithms. They will also provide a test set of inputs and some metric for scoring predicted output versus true output. Competitors submit their predictions and Kaggle scores them against the true outputs and ranks the leaders.</p>
<p>I don’t have a realistic hope of winning this competition – it’s my first time trying and there are pro data scientists working on this stuff – but it has been a good way to learn about the design of machine learning algorithms. Additionally, while it’s not a truly Big Data set (the uncompressed training data set is 142 MB), it’s big enough that you can’t go with brute force methods; you need to be thoughtful about what you do and do not spend time computing.</p>
<p>The Facebook competition is an edge prediction problem. Facebook provides a data file describing some kind of social network (it isn’t the Facebook graph, and obviously it’s anonymous, with graph nodes represented by numbers; someone in the forums put up a decent guess that it’s <a href="http://instagr.am/">Instagram</a>) that has had some of its edges deleted. The graph is directed, meaning that every connection goes from one node to another node; A can connect to B independently of B connecting to A. Facebook provides a list of nodes and, for each node in the list, asks you to make 0 to 10 ranked recommendations as to what other nodes it should follow, or in other words, what missing edges you would recommend drawing from that node to the rest of the graph.</p>
<p>The training set consists of about 1.86 million nodes connected by about 9.44 million edges. There are no self-connections; an edge always connects two distinct nodes. Theoretically you want to be able to assign a score to every ordered pair of nodes (from-node, to-node) and grab the top several pairs for which edges do not already exist. However, this requires roughly 3.5 trillion score calculations (1.86 million squared), which for any computationally costly score will be infeasible, and in any case most of those calculations will produce a poor score that is subsequently discarded. So you have to cut your scope down; for each node, you might consider only nodes within a certain number of connections. (In fact my highest-ranking effort at present writing only attempts to connect node A to node B if node B is already connected to node A; perhaps this implies that my more sophisticated attempts are super lame, but hey, it’s currently 64th percentile, so you could do worse.)</p>
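<p>To put a rough number on the brute-force approach, here is the arithmetic as a few lines of Python. The node and edge counts are the rounded figures quoted above; the scoring rate of one million scores per second is purely an assumption for illustration, not something I measured.</p>

```python
nodes = 1_860_000  # "about 1.86 million nodes"
edges = 9_440_000  # "about 9.44 million edges"

# Every ordered (from-node, to-node) pair with two distinct nodes.
ordered_pairs = nodes * (nodes - 1)

# Pairs that do not already have an edge and so could be recommended.
candidate_pairs = ordered_pairs - edges

# Hypothetical throughput, chosen only to get a feel for the scale.
assumed_scores_per_second = 1_000_000
days = candidate_pairs / assumed_scores_per_second / 86_400

print(f"{candidate_pairs:,} candidate pairs, about {days:.0f} days of scoring")
```

<p>That works out to about 3.46 trillion candidate pairs, or roughly 40 days of nonstop computation even at that optimistic rate, which is why trimming each node’s candidate list matters so much.</p>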
<p>There are two papers I’ve found informative for the same reason, namely, that they provide a broad overview of edge prediction methodology. Liben-Nowell and Kleinberg’s <a href="http://www.cs.cornell.edu/home/kleinber/link-pred.pdf">“The Link Prediction Problem for Social Networks”</a> I found to be more readable. Cukierski, Hamner, and Yang’s <a href="http://kaggle.academia.edu/BenHamner/Papers/1679603/Graph-based_features_for_supervised_link_prediction">“Graph-based Features for Supervised Link Prediction”</a> I found to be drier, but it specifically addresses directed graphs and it directly recounts the authors’ successful entry into a similar competition for Flickr (in fact Hamner now works for Kaggle).</p>
<p>My first attempts (before really reading the above papers) were based on a simple tip from a poster in the Kaggle forums, who proposed simply suggesting every connection A -&gt; B for which B -&gt; A already exists (provided A -&gt; B does not). This alone would already get you to the 30th percentile as of the present writing, though this figure will of course drop over time. My best result is still a refined version of this approach, which simply ranks these predictions in a more intelligent fashion. Subsequent attempts at something “smarter” have not yielded improvements in score.</p>
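<p>A minimal sketch of that forum tip in Python, assuming the graph is held as two dicts of lists (one for outgoing edges, one for incoming). The function name is invented for the example, and ordering the suggestions by how many followers each candidate has is only a placeholder ranking, not the refinement I actually used.</p>

```python
def reciprocal_candidates(out_edges, in_edges, node, limit=10):
    """Suggest node -> B for every B that points to node without a return edge."""
    following = set(out_edges.get(node, ()))  # nodes that `node` points to
    followers = set(in_edges.get(node, ()))   # nodes that point to `node`
    missing = followers - following           # one-way connections to complete
    # Placeholder ranking (an assumption): most-followed candidates first,
    # with ties broken by node id so the output is deterministic.
    ranked = sorted(missing, key=lambda b: (-len(in_edges.get(b, ())), b))
    return ranked[:limit]
```

<p>Since each node’s candidate list is just its set of one-way followers, the cost scales with the number of edges rather than the number of node pairs.</p>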
<p>The general approach I’m taking is to define a relevant neighborhood for each node from-node in the test set and then assess a score on each potential edge (from-node, to-node). In the brute-force case each node’s relevant neighborhood would be the entire graph; in the aforementioned strategy of completing bilateral connections, the relevant neighborhood would be any parent nodes of from-node. If you’re just computing one feature, you can just rank the nodes by score, optionally truncate the list based on some kind of cutoff, and return the top 10 nodes as your recommendations (or fewer if there are fewer than 10). The determination of the cutoff is a problem with an unclear answer; I think you have to do some kind of analysis on the distribution of scores, but even then you’re ultimately drawing a line in the sand.</p>
<p>Alternatively, particularly if you’d like to combine more than one feature into your analysis, you could run a <a href="http://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a>, which is what I’ve been doing. Briefly, a linear regression attempts to fit a linear equation to a set of input variables to predict the value of an outcome variable; this can give distorted results if the outcome variables all fall within a band, such as if you’re trying to predict a 0-or-1 outcome. A logistic regression instead models the probability of the outcome: it maps probabilities between 0 and 1 to the full number line using the logit function (the inverse of the logistic function) and fits the linear equation there; you then apply the logistic function to any prediction to bring it back into the range [0,1] and get a meaningful number.</p>
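<p>For concreteness, here is the pair of transformations involved, written as plain Python with only the standard library. The logit function stretches a probability strictly between 0 and 1 over the whole number line, and the logistic function squeezes any real number back into that range.</p>

```python
import math


def logit(p):
    """Map a probability strictly between 0 and 1 onto the real line."""
    return math.log(p / (1.0 - p))


def logistic(x):
    """Inverse of logit: map a real number back to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


# A linear fit can produce any number, say 2.5 or -4.0, but passing it
# through the logistic function always gives something usable as a probability.
for raw in (-4.0, 0.0, 2.5):
    print(raw, logistic(raw))
```

<p>Both functions are strictly increasing, so applying either one never changes the order in which candidates are ranked.</p>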
<p>In our case we can say we’re trying to predict the probability that an edge has been deleted between two nodes, and score node pairs based on this predicted probability. If you are only using one feature to predict, the logistic regression will be trivial, since the logistic function is monotonic; if one pair scores higher than another then it’ll still score higher after being passed through the logistic function. But you can run a regression on multiple features, such as if you wanted to use both two nodes’ common neighbors and their combined number of neighbors, and you can also add square and cube terms and cross terms and all the usual jazz that people do with regressions. Viewing the ranking score as a probability also gives you some intuition behind where you might set a cutoff.</p>
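<p>As a rough illustration of a regression on more than one feature, here is a bare-bones logistic regression fit by gradient descent in plain Python. It is a sketch, not what I actually ran: the function names, the learning rate, and the toy data are all invented for the example, and the two features are assumed to be already scaled to the 0 to 1 range.</p>

```python
import math


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    # weights[0] is the intercept; the rest line up with the features.
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def log_loss(weights, rows, labels):
    # Average negative log-likelihood; lower means a better fit.
    total = 0.0
    for features, label in zip(rows, labels):
        p = predict(weights, features)
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total / len(rows)


def fit_logistic(rows, labels, rate=0.5, epochs=2000):
    # Batch gradient descent on the log loss, starting from all zeros.
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            grads[0] += error
            for i, x in enumerate(features):
                grads[i + 1] += error * x
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grads)]
    return weights


def with_cross_terms(a, b):
    # Expand two features with their squares and their product.
    return (a, b, a * a, b * b, a * b)


# Invented toy data: (common-neighbor feature, combined-neighbor feature),
# both scaled to 0..1; label 1 means the edge had been deleted.
rows = [(0.0, 0.4), (0.1, 0.6), (0.2, 0.5), (0.3, 0.7), (0.4, 0.6), (0.5, 0.8)]
labels = [0, 0, 1, 0, 1, 1]
weights = fit_logistic(rows, labels)
```

<p>Calling predict(weights, features) on a candidate pair then gives the predicted probability used for ranking, and the same fit_logistic works unchanged on rows expanded with with_cross_terms if you want the square and cross terms in the mix.</p>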
<p>The highest score I’ve gotten so far involved plugging the nodes into a regression based on the numbers of children and parent connections on both the from-node and the to-node. There are a bunch of other methodologies in the above papers that I’d like to try – I’m currently working on a <a href="http://en.wikipedia.org/wiki/PageRank">PageRank</a>-based calculation, PageRank being the algorithm underlying how <a href="http://www.google.com/">Google</a> ranks web query relevancy.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/travelinglandsbeyond.wordpress.com/344/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/travelinglandsbeyond.wordpress.com/344/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=travelinglandsbeyond.com&#038;blog=31018121&#038;post=344&#038;subd=travelinglandsbeyond&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://travelinglandsbeyond.com/2012/06/20/edge-prediction/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/68289776b401e18400218025e7884910?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Andy</media:title>
		</media:content>
	</item>
	</channel>
</rss>
