Archive for April 2012
When Jamie Moyer became the oldest pitcher in history to win a major league baseball game this year at age 49, the Internet responded with a slew of fun statistical factoids that reflected just how long he’d been around the game. One such fact was that Moyer had pitched to 8.9% of all MLB players in history. When I first read it, it was in the context of an interview transcript and was presented without evidence. Where did this fact come from?
A Google search yielded somewhat self-referential results. The top hit was the Gawker sports blog Deadspin, which is where I first came across the factoid. It links to Sports Radio Interviews, but the original source was an interview with a Denver sports radio station, which of course did not drop a formal citation in the middle of its broadcast (though I did listen to the audio as well). Most other Google hits either carry no corroboration (just the statistic followed by “that’s amazing” or the like) or refer to another page that also carries no corroboration.
Of all places, a forum on Jeopardy! winner Ken Jennings’s website was the most useful in verification. A forum contributor comfortable with Retrosheet data found that Jamie Moyer has faced 1,412 unique batters out of 15,856 total people in history who have had at least one major league plate appearance between 1876 and 2011. In the list of pitchers who had faced the most batters, Moyer was actually fourth, behind Greg Maddux, Tom Glavine, and Nolan Ryan. (Of course those guys are all retired, and so Moyer has the opportunity to keep climbing.) I actually verified Moyer’s data with Baseball-Reference.com, although I’m not sure how to easily check the denominator of 15,856. (Retrosheet.org doesn’t seem to be functioning as of this writing and actually I got a malicious cookie warning from Norton Anti-Virus when visiting it, so I’m not linking to it.)
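The arithmetic behind the factoid checks out against the forum's counts. A one-liner to confirm, using the numbers quoted above:

```python
# Sanity check on the 8.9% factoid: 1,412 unique batters faced by Moyer,
# out of 15,856 players with at least one major league plate appearance
# between 1876 and 2011 (per the Retrosheet-derived forum post).
unique_batters_faced = 1412
players_with_plate_appearance = 15856

share = unique_batters_faced / players_with_plate_appearance
print(f"{share:.1%}")  # 8.9%
```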
The statistic of 8.9% sounds crazy at first. But if you consider that through 2011 Moyer had pitched for 25 out of 136 seasons, and that there have been more teams and therefore more unique players to face in the modern era than in decades past, it becomes believable. And Moyer is actually well behind Maddux in any case.
(I’ve been meaning to read the recent book The Lifespan of a Fact, by John D’Agata and Jim Fingal, who play the roles of “author” and “fact-checker,” respectively. I actually know Jim from my undergrad years; he was a DJ at the college radio station at the same time as me.)
Since Google eviscerated the social features of Reader in what I consider a misguided attempt to push people into Plus, I’ve been missing a way to curate and share the interesting web articles that I read. Starting today I am trying out tumblr for these purposes. If I read something and like it, I’ll post it at:
This blog will continue to house posts mostly of my own writing; the intent of my tumblr-reader is to simply share links with relatively short commentary and emphasized quotes. Please enjoy both!
Dominion is a board game that I really enjoy because it requires players to think carefully about strategy in the context of assessing probabilities and managing resources. Players start with identical decks of ten cards and draw five each turn, using the cards’ various abilities, acquiring new cards and sometimes eliminating existing cards, and reshuffling the deck when it runs out. The new cards are acquired out of a universe of (usually) 17 types. 10 of the card types are selected randomly each game out of a much broader universe (over 200 if you like), giving Dominion immense replay value because each game is different. Some cards that are useful in certain games are ineffective in others.
The basic economic conflict of the game is that cards that give you points generally don’t perform any other useful function, so players must manage the composition of their decks carefully, trying to finish the game with the most points without clogging up their hands. If you enjoy chess, poker, or strategic card games played with a standard 52-card deck, like bridge or hearts or spades, I strongly recommend trying Dominion, as I think it involves similar kinds of thinking. (I must admit I haven’t played bridge, but I do enjoy hearts and spades.) If you like the much more popular Settlers of Catan, I’d also recommend it on the basis of correlated interests, though I don’t think the games are that similar in nature. And if you are a little put off by its dorky medieval theme, I’d tell you that to me that’s all just trappings. If it were themed around sports or politics, or even had no theme at all, I would enjoy the game just as much.
It’s playable for free on the Dominion Isotropic server and has also spawned some really detailed online strategic analysis. I think it’s easiest to break in with a friendly opponent to help you on the learning curve (not with the basic rules, which aren’t that complicated, but with the strategic ideas and with getting familiar with card types).
Once you’ve gotten familiar with the game, here are a few basic guidelines for improved play that I’ve learned:
Don’t overvalue attacks. They definitely have their uses, but I think many beginners worry too much about them. Attacks consume actions that could otherwise be used to develop your deck, and of course the opportunity cost of buying the attack card itself is having another useful development card. Someone who focuses too much on attacks may slow his opponents but will still fall behind in the end.
Both excesses and insufficiencies are inefficient. If you’re ending turns with 3 action cards in your hand but no extra actions with which to use them, you either needed to buy cards that give you extra actions or you needed to have bought money instead of some of those action cards. Conversely, if you’re ending turns with 3 extra actions but no action cards to play, you may have overpaid for those extra actions. The same is true for money; having not enough money is obviously bad, but having $12 and only one buy (assuming you’re not playing with Colony) is also bad. When reflecting back on a game, try to figure out if you had excesses or insufficiencies and whether you could have avoided them.
The main difference between money and actions is that you can play all the money in your hand each turn, whereas you start with only one action per turn. This sounds like I’m just re-explaining a basic rule, but I think some people don’t adequately appreciate it and overvalue money-producing actions relative to plain money. For example, if you have $4 and can pick between Silver (worth $2 when played) and Militia (worth $2 plus an attack), it is not necessarily true that the Militia is the better buy. At some point you may hold that Militia alongside some other action you’d rather play, and you’ll wish the Militia were a Silver instead.
When in doubt, Silver/Silver (buying Silver on each of your first two turns) is not a bad opening. This is especially true if there are some good $5 cards out there. Under this opening, the probability distribution for how much money your third and fourth hands will have is as follows (doesn’t add to 100% due to rounding):
As you can see, you’re unlikely to miss out on $5 in both hands (only a 9% chance that neither hand reaches $5), and if you’re lucky you may have at least $5 in both hands (a 15% chance). It’s not a bad way to get into high-quality cards quickly. And since Silver/Silver is always available as an opening, any other opening has to be measured against it.
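Those two figures can be checked with a quick Monte Carlo sketch, assuming the standard starting deck of 7 Coppers and 3 Estates, with both purchased Silvers shuffled in before the turn-3 draw:

```python
import random

# After a Silver/Silver opening, the reshuffle before turn 3 contains the
# starting 7 Coppers and 3 Estates plus the 2 new Silvers. Hands 3 and 4
# are the first and second five cards off that 12-card shuffle.
# Card values: Copper = $1, Silver = $2, Estate = $0.
DECK = [1] * 7 + [2] * 2 + [0] * 3

def silver_silver_odds(trials=200_000, seed=42):
    rng = random.Random(seed)
    neither = both = 0
    for _ in range(trials):
        deck = DECK[:]
        rng.shuffle(deck)
        hand3, hand4 = sum(deck[:5]), sum(deck[5:10])
        neither += hand3 < 5 and hand4 < 5
        both += hand3 >= 5 and hand4 >= 5
    return neither / trials, both / trials

p_neither, p_both = silver_silver_odds()
print(f"P(neither hand reaches $5) ~ {p_neither:.1%}")  # close to the 9% above
print(f"P(both hands reach $5)     ~ {p_both:.1%}")     # close to the 15% above
```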
A little background for non-technical readers: HTTP is the name of the communications protocol that governs the Web. You probably recognize it as a common prefix to web addresses, although modern browsers usually don’t require users to enter it any more. When you visit a web page, what happens is that your web browser sends an HTTP request to a web server somewhere asking it to serve up a “resource,” typically a web page. The server then sends an HTTP response, hopefully containing the contents of the web page you wanted to see, although sometimes things don’t work out (for example, I think everyone has at some point seen an HTTP 404 error, given when a server couldn’t find a requested resource).
It’s not common these days for a given trip to a web page to involve just one HTTP transaction. Even simple pages often use images, stylesheets (defining formatting, fonts, colors, etc.), separated script files, and so on. Your web browser has to request all of these too. Even the raw content of the page may be produced through HTTP transactions separate from and in addition to your basic page request. An example of the kind of website that may do this is a sports site that tracks live games. Raw game data can be kept in files that are refreshed during games; any page that needs to present that data to users can request those files and process them.
Google Chrome has a nice feature called “Developer Tools” that allows you to monitor all of your browser’s HTTP transactions. You can access it under the Tools menu on either Mac or Windows. It looks like this:
Developer Tools, as I realized today, is handy for scraping data from websites structured in the manner described above. A time-honored way to scrape web data is to write a program that fetches pages, parses the HTML (the language that describes web pages), and extracts the relevant content. But HTML is designed for presentation, not for ease of data storage and recall, and writing a scraper can be a bit of an iterative process as you figure out how to systematically strip away the unneeded elements while retaining all of the ones you need to keep.
But for websites that segregate their data files in this manner, it’s much easier to access those files directly, since they’re generally in a format designed to be parsed easily. The only problem is locating them, and that is where Developer Tools can help out. If you open Developer Tools on a website that appears to import data you want, you can run through the record of HTTP transactions to see if any of them corresponds to a raw data file (typically in a format such as JSON or XML).
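Compared to HTML scraping, parsing one of these raw files takes only a couple of lines. A sketch with a made-up JSON payload (the field names here are invented for illustration, not the actual EPL or GameDay schema):

```python
import json

# A hypothetical payload in the general style of a live-match data file.
payload = """
{
  "match": {"home": "Bolton", "away": "Stoke", "score": [1, 1]},
  "events": [
    {"minute": 26, "type": "yellow_card", "team": "away"},
    {"minute": 58, "type": "goal", "team": "home"}
  ]
}
"""

data = json.loads(payload)
away_yellows = sum(e["type"] == "yellow_card" and e["team"] == "away"
                   for e in data["events"])
print(data["match"]["home"], "vs", data["match"]["away"])  # Bolton vs Stoke
print("away yellow cards:", away_yellows)                  # away yellow cards: 1
```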
I was already aware that MLB (Major League Baseball, for readers not familiar with American sport) makes their data available in this fashion on their GameDay servers. Yesterday with the help of Developer Tools I learned that the EPL (English Premier League, for readers not familiar with European sport) also does the same. Here is a sample; unfortunately they cannot be as easily browsed as the GameDay data, as their URLs contain an ID number that I could not obviously determine without parsing some EPL web pages as well.
In any case, I wrote a little code to scrape down a few years of EPL data and looked at the difference in refereeing calls between home and away teams. Here is the average per-game data, with p-values from a simple equal-variance t-test of the null hypothesis that the home and away means are equal; home-away pairs that differ at the 95% level are highlighted.
| Per-game statistic | Season 1 | Season 2 | Season 3 | Season 4 | Season 5 |
|---|---|---|---|---|---|
| Non-card fouls (home) | 11.776 | 12.612 | 11.937 | 12.261 | 12.639 |
| Non-card fouls (away) | 11.332 | 11.697 | 11.337 | 11.921 | 12.282 |
| Non-card fouls p-value | 0.0750 | **3.93e-4** | **0.0165** | 0.128 | 0.121 |
| Yellow cards (home) | 1.418 | 1.380 | 1.353 | 1.329 | 1.376 |
| Yellow cards (away) | 1.850 | 1.887 | 1.792 | 1.847 | 1.811 |
| Yellow cards p-value | **9.42e-7** | **1.33e-8** | **3.28e-7** | **9.51e-9** | **9.85e-7** |
| Red cards (home) [2] | 0.071 | 0.061 | 0.068 | 0.082 | 0.045 |
| Red cards (away) | 0.092 | 0.119 | 0.097 | 0.076 | 0.089 |
| Red cards p-value | 0.153 | **0.00423** | 0.0844 | 0.399 | **0.00787** |
| Total fouls (home) [3] | 13.266 | 14.053 | 13.358 | 13.671 | 14.061 |
| Total fouls (away) | 13.274 | 13.702 | 13.226 | 13.845 | 14.182 |
| Total fouls p-value | 0.491 | 0.114 | 0.334 | 0.294 | 0.355 |
1. The EPL’s website was missing game data for 2009-2010’s game Bolton 1-1 Stoke, so that season features only 379 observations.
2. The EPL’s website appears to record second-booking reds and straight reds identically, as red cards with no yellow card listed for the former type. That is how they are treated in the data above.
3. Total fouls do not include offside calls.
I think the most striking feature of the data is that although the total number of fouls per game is fairly similar between home and away sides, the away side is very consistently given 0.4 to 0.5 more yellows per game. The elevated yellow card count for the away side is often paired with a lower non-card foul count. In other words, while EPL refs tend to call the same number of fouls on home and away teams, they turn more of the away teams’ fouls into yellow cards.
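The test behind those p-values can be sketched as follows: a plain pooled-variance two-sample t-test, with a normal approximation for the two-sided p-value (reasonable at roughly 380 games per season). The sample data below are synthetic, not the actual EPL numbers:

```python
import math
import random

def pooled_t_test(x, y):
    """Two-sample equal-variance t-test. With ~380 observations per group,
    the t distribution is close enough to normal to take the two-sided
    p-value from the normal distribution via erfc."""
    nx, ny = len(x), len(y)
    mean_x, mean_y = sum(x) / nx, sum(y) / ny
    var_x = sum((v - mean_x) ** 2 for v in x) / (nx - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (ny - 1)
    pooled = ((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2)
    t = (mean_x - mean_y) / math.sqrt(pooled * (1 / nx + 1 / ny))
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided p-value
    return t, p

# Simulated per-game yellow-card counts with means resembling the table's.
rng = random.Random(1)
home = [rng.gauss(1.38, 1.2) for _ in range(380)]
away = [rng.gauss(1.85, 1.2) for _ in range(380)]
t, p = pooled_t_test(home, away)
print(f"t = {t:.2f}, p = {p:.2g}")
```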
Today I was messing around with web development on Amazon Web Services and was stuck on a system configuration problem that I ended up resolving in a way that, while conceptually primitive and intellectually unsatisfying, would not have been available to me if I had not been working in the cloud.
In non-technical terms, I was having trouble installing a program on a cloud-based webserver I’d created. (A more technical explanation is at the end.) I couldn’t find any quick fixes on Google and it seemed that I was going to have to spend a lot of time reading about the installation process and trying different things and potentially digging into some daunting configuration files. I was about to post a question on Stack Overflow, a Q&A website for computer programmers, when I decided that first I would see if I could replicate the problem on a completely new computer.
If I weren’t using Amazon EC2 (or another similar service, although I’ve only ever used Amazon’s), my webserver would have been a physical machine, perhaps located in a corner of my apartment, or perhaps lodged in a rack of servers at some web hosting company’s offices. But EC2 lets you create virtual computers (“instances”) from Amazon’s pool of computing power, untied to any specific physical setup. You can click through a few dialog screens to create a new computer that you can log into just like any other remote machine, and click through a few more to terminate it. And rather than paying by the month, you pay for the computing power you actually consume. The webserver I was working on was such a virtual computer.
One of the web pages I came across suggested the problem might lie with the operating system installation itself. So before asking Stack Overflow, I decided to create a brand-new instance of the same type, without any of the software I’d installed on my webserver, and try to install just the program in question to see if the problem recurred. Spinning up a new computer purely for troubleshooting is obviously not something you can do if you’re working with a physical machine; you could borrow another one, maybe. But on EC2 it was easy: I clicked through a dialog, launched a new instance, logged into it, and ran a handful of commands to install the program. And lo and behold, it installed properly.
I’m afraid the dénouement to this story is rather boring, because I never figured out what caused the problems on my old machine. I reinstalled the software that I’d need on the new instance, piece by piece, periodically checking to see if the install would succeed or fail, and it succeeded at every step. In fact I ended up terminating the old instance and using the new one going forward. And of course I admit that I did nothing especially clever or sophisticated. But I do think that it’s an example of how the cloud’s paradigm of “it’s easy to use as much or as little as you want, and pay for it as you use it” changes what you can do with the resources that you have (the added compute time of the new instance cost me less than 25 cents).
The exact problem, with technical details (skip if you’re uninterested in that stuff): I was running RHEL 6.2 and ultimately trying to install Django and deploy it to Apache, which requires you to install the Apache module mod_wsgi. When building mod_wsgi I was getting the error described here under “Mixing 32 Bit And 64 Bit Packages”. I don’t think the issue was that I had anything compiled for a 32-bit system; I’d actually built Python from source. So the next suggestion was to rebuild Python with --enable-shared (that is, ./configure --enable-shared), which sent me off on a course of reading about static vs. dynamic libraries. (I should add that I’m only mildly experienced with Linux and am often learning as I go.) That rebuild ended up failing with the exact same error message. I tried a few random fixes to no avail. But on the new instance, Python built fine with --enable-shared. I gradually installed my packages back, rerunning configure and make along the way, but I never reproduced the error, and I still don’t know what the problem was. Once everything was installed on the new instance, including mod_wsgi, Django installed fine as well.
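For the record, the successful sequence on the fresh instance looked roughly like this. The version numbers and install paths below are illustrative assumptions, not the exact ones I used; the rpath flag is one common way to let the runtime linker find the shared libpython when building Python with --enable-shared:

```shell
# Illustrative only: build Python as a shared library, then mod_wsgi
# against it. Versions and paths are assumptions, not from the post.
cd Python-2.7.2
./configure --enable-shared LDFLAGS="-Wl,-rpath,/usr/local/lib"
make && sudo make install

cd ../mod_wsgi-3.3
./configure --with-python=/usr/local/bin/python
make && sudo make install
```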
I’ve been reading a really great book, Siddhartha Mukherjee’s The Emperor of All Maladies, which is a history (or as the subtitle calls it, a “biography”) of cancer. I’m far from the first person to praise it; in fact, I started reading it on the basis of a strong recommendation from Marginal Revolution. I will say that Mukherjee is particularly good at distilling the history of oncology into meaningful themes: the ancient theory of humors, aggressive amputation as treatment, rivaling schools of thought on carcinogenesis, and so on. Bad history writing becomes a list of then-this, then-this, while good history writing finds cohesive narrative threads; this is good history writing.
At one point the book brings up something I hadn’t thought about: since cancer cells divide far more rapidly than normal cells, they will also evolve far more rapidly in response to the selective pressure of medication. Just as bacteria evolve resistance to antibiotics, so too can cancer cells evolve resistance to chemotherapy. I’m no doctor and cannot comment on how important a consideration this is in cancer treatment (perhaps it is actually quite minor?), but I do find it an interesting, though unfortunate, example of evolution at a sub-organism level.
The book also discusses the approach developed in the 1960s of treating cancer with an aggressive combination of chemotherapeutic drugs. One chapter describes oncologists trying first two, then three, then four cytotoxic drugs at once, seriously endangering patients’ lives in the hopes of eliminating every last trace of their cancers. (Many cancer treatments are harmful to healthy human cells in addition to cancerous ones, making treatment potentially lethal; at the same time there had been cases of cancer returning after having been reduced to undetectable levels, encouraging doctors to pursue forceful medication even after outward signs of the disease had disappeared.) Mukherjee describes a synergistic effect of combinative treatment: “Since different drugs elicited different resistance mechanisms, and produced different toxicities in cancer cells, using drugs in concert dramatically lowered the chance of resistance and increased cell killing.” (p. 141)
An important consideration in combinative treatments, then, is the correlation between the probabilities of cells evolving resistance to each of them. An ideal pair would be two treatments where a resistant mutation against one necessarily produced susceptibility to the other. For example, if one treatment’s chemical pathway relied on the presence of a certain protein and the other relied on its absence, the combination would be immune to a mutation that toggled production of that protein. Correspondingly, I would guess that it is easier for cancers to evolve resistance against two chemotherapies with similar chemical pathways.
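The arithmetic behind that synergy is worth making concrete. A toy calculation, where the one-in-a-million mutation probabilities and the cell count are illustrative assumptions, not figures from the book:

```python
# If resistance to each drug arises independently in roughly one cell in a
# million, a tumor of ~10^9 cells (a commonly cited ballpark for a
# clinically detectable tumor) likely already contains cells resistant to
# either single drug, but almost certainly none resistant to both.
p_resist_a = 1e-6   # assumed per-cell chance of resistance to drug A
p_resist_b = 1e-6   # assumed per-cell chance of resistance to drug B
tumor_cells = 1e9   # assumed tumor size

expected_single = tumor_cells * p_resist_a               # ~1,000 cells
expected_double = tumor_cells * p_resist_a * p_resist_b  # ~0.001 cells
print(f"{expected_single:.3g} singly resistant, "
      f"{expected_double:.3g} doubly resistant")
```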