Traveling Lands Beyond

"Beyond what?" thought Milo as he continued to read.

Hello GTFS, node-gtfs, Mongo, Node.js, nginx, d3.js, world


I spent parts of the last week and a half trying out and learning about a variety of software packages. This type of experience makes me constantly aware of how much stuff is out there to learn that I don’t know yet, which is rewarding and disheartening at the same time. What I initially sat down to learn about wasn’t at all what I ended up learning about.

My starting point was GTFS, which stands for General Transit Feed Specification. This is a format defined by Google to represent transit systems, such as a train, subway, or bus service. It’s a set of files that contains information such as stops, stop locations, schedules, holiday versus regular services, and so on. Many public transportation systems publish their data in this format. They adopt it in the first place because Google will then put their service into Google Maps, and they also post the files on their websites (example) simply to put the data out there for the programming community, in the hopes that someone will develop some nice apps. As I later found out, there is a server called GTFS Data Exchange that hosts GTFS-formatted data for many transit systems.

Anyway, I live in New York and thought it might be interesting to play around with the MTA’s data. The first thing I did was to try to write a sort of API in Python that would process a set of GTFS files. This was actually fairly straightforward because the GTFS files can each be processed in almost the same way. For example, the “stops” file contains data on stop locations, the “routes” file on train routes, and the “calendar” file on schedule changes for holidays and different days of the week. So (a little programming lingo ahead) if you define classes Stop, Route, and Calendar, and write a constructor for each, you can write a single file-reading function that takes the appropriate constructor as an argument and outputs a set of accordingly constructed objects.
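Here is a minimal sketch of that idea. It is a reconstruction for illustration, not the code I actually wrote at the time. The column names (stop_id, stop_name, stop_lat, stop_lon, route_id, route_short_name) are standard GTFS fields, but the class layout, the from_row helpers, the read_gtfs function, and the sample rows are all made up for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Stop:
    stop_id: str
    stop_name: str
    stop_lat: float
    stop_lon: float

    @classmethod
    def from_row(cls, row):
        # Build a Stop from one CSV row (a dict keyed by column name).
        return cls(row["stop_id"], row["stop_name"],
                   float(row["stop_lat"]), float(row["stop_lon"]))


@dataclass
class Route:
    route_id: str
    route_short_name: str

    @classmethod
    def from_row(cls, row):
        return cls(row["route_id"], row["route_short_name"])


def read_gtfs(lines, constructor):
    """Parse CSV lines with a header row; build one object per data row."""
    return [constructor(row) for row in csv.DictReader(lines)]


# Made-up sample data standing in for stops.txt and routes.txt.
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Example St,40.75,-73.99
S2,Sample Av,40.76,-73.98
"""
ROUTES_TXT = """route_id,route_short_name
R1,1
"""

# The same reader handles both files; only the constructor changes.
stops = read_gtfs(io.StringIO(STOPS_TXT), Stop.from_row)
routes = read_gtfs(io.StringIO(ROUTES_TXT), Route.from_row)
```

With real feeds you would pass an open file instead of io.StringIO, for example read_gtfs(open("stops.txt", newline=""), Stop.from_row).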

This worked in the end but it wasn’t exactly speedy, because in GTFS the “stop_times” file tends to be big and I was reading it in each time. The architecturally better solution would have been to write something that loads the GTFS data systematically into a database and then write some functions to read the database into objects. In any case I knew I was reinventing the wheel, as someone else had almost certainly done this before (I did it anyway to get some practice writing Python). So I ended up dropping it and looking around, and that’s how I came across node-gtfs.
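For what it’s worth, the database approach might look something like the sketch below. I never built this; it assumes SQLite (via Python’s built-in sqlite3 module) as the database, and the function names, the index name, and the sample rows are hypothetical. The column names are standard GTFS stop_times fields.

```python
import csv
import io
import sqlite3

# Made-up sample data standing in for a (much larger) stop_times.txt.
STOP_TIMES_TXT = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:30,S1,1
T1,08:05:00,08:05:30,S2,2
T2,09:00:00,09:00:30,S1,1
"""


def load_stop_times(conn, lines):
    """Load stop_times rows into a table once, so later reads skip the parsing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stop_times ("
        "trip_id TEXT, arrival_time TEXT, departure_time TEXT, "
        "stop_id TEXT, stop_sequence INTEGER)"
    )
    rows = (
        (r["trip_id"], r["arrival_time"], r["departure_time"],
         r["stop_id"], int(r["stop_sequence"]))
        for r in csv.DictReader(lines)
    )
    conn.executemany("INSERT INTO stop_times VALUES (?, ?, ?, ?, ?)", rows)
    # An index on stop_id keeps per-stop lookups fast on a big table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_stop_times_stop ON stop_times (stop_id)"
    )
    conn.commit()


def departures_at(conn, stop_id):
    """Return (trip_id, departure_time) pairs for one stop, earliest first."""
    cur = conn.execute(
        "SELECT trip_id, departure_time FROM stop_times "
        "WHERE stop_id = ? ORDER BY departure_time",
        (stop_id,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # a file path here would persist the data
load_stop_times(conn, io.StringIO(STOP_TIMES_TXT))
s1_departures = departures_at(conn, "S1")
```

The point of the design is that the expensive parse happens once at load time, and each later question touches only the rows it needs.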

node-gtfs is a program that automatically downloads data from the GTFS Data Exchange server (this is in fact how I learned about it) and stashes it in a database. Sounds great, but two problems: it is written for use with Node.js and it uses Mongo as its database, neither of which I’d ever used before. So that was two brand new things on my plate.

Mongo is a so-called NoSQL database that emphasizes scalability to large data sets and multiple servers (the name “Mongo” comes from the word “humongous”). I haven’t worked with any huge data sets yet, but I found the query language much nicer than SQL, which feels like a clumsy relic in comparison. Database entries are stored and retrieved like objects in a programming language. To get all the entries in a database db and a collection c (a collection being Mongo’s equivalent of a table in SQL) matching the criterion X = x, you would simply issue the call db.c.find({X: x}), and you’d get back an iterable set of matching objects. And to get, say, an attribute attr of an object obj, you would simply write obj.attr. If you’ve encountered JSON before, the objects come back in a JSON-like format (Mongo stores them internally in a binary variant called BSON, to be exact). And what I think is nice about JSON is that it’s clean and human-readable and represents data the same way we’re accustomed to representing objects in programming.
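To make the query-by-example idea concrete, here is a toy version in plain Python. To be clear, this is not Mongo or any Mongo driver; it only mimics the simplest case, where a query like {X: x} selects every document whose field X equals x. The find function, the field names, and the sample documents are all invented for illustration.

```python
def find(collection, query):
    """Yield each document whose fields equal every key/value pair in query."""
    for doc in collection:
        if all(doc.get(key) == value for key, value in query.items()):
            yield doc


# A "collection" here is just a list of dicts, which is roughly how
# JSON-like documents look once they reach your program.
stops = [
    {"stop_id": "S1", "route": "A", "name": "First St"},
    {"stop_id": "S2", "route": "B", "name": "Second St"},
    {"stop_id": "S3", "route": "A", "name": "Third St"},
]

matches = list(find(stops, {"route": "A"}))
names = [doc["name"] for doc in matches]  # attribute lookup on each result
```

An empty query, find(stops, {}), matches everything, since there are no conditions to fail.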

I’ll quite readily admit that this is my first foray into NoSQL anything and I’m hardly experienced enough with regular SQL databases that I can make a fair pronouncement about which one is better. Hell, the major attraction of Mongo is scalability and I didn’t work with a big enough dataset to even sniff that. Apparently there are some “religious” battles between SQL and NoSQL proponents and I don’t care to step into that (here’s a funny Xtranormal sketch from a SQL defender, language not safe for work, but with good points). There’s no arguing that SQL has been around for a long time and is a battle-tested solution that probably works fine for many use cases. And maybe I have some bias because I remember hammering together queries at my old job for data structures that weren’t a great fit for tabular storage (trees, objects with variably-sized attributes). But I really did like the NoSQL interface and the programmer-friendly way in which data was returned to me.

Node.js is not a traditional Javascript module to be embedded in a web page, as its name might suggest; it’s a server-side application that lets you write web apps in Javascript. The standard “Hello world” application is a bit of Javascript that actually launches a web server that sends a “Hello world” HTTP response; you can actually run a website with Node.js alone, without the typical servers like Apache or nginx (more on that later). It seems to be intended as an alternative to using Apache + a web app framework like Rails or Django, and although I’m not sure exactly why you’d use one versus the other, my understanding is that it is quick and easy and puts a small load on your servers. I guess it’s the Tumblr or Twitter to Apache + Rails/Django’s full-suite blogging software?

Node.js is based on Google’s V8 Javascript engine. Apparently V8 is phenomenally speedy. If you look at the afore-linked Mashable article, you’ll see a comment from a guy who used Node.js because it was fast and simple to develop in. Specifically, he said that “the Python virtual machine is incredibly slow” and that the V8 engine was “high performance.” This is actually the second time in the last few weeks that I’ve come across strong compliments on V8’s speed; the first was at a Meetup about the up-and-coming Julia programming language, which is pitched in part towards academics and mathematical applications. One of Julia’s co-creators, a very technically savvy guy named Stefan Karpinski, came to speak to us and at one point discussed the speed benchmarks displayed front-and-center on Julia’s home page.

You can see that Julia’s closest competitor is not one of the typical mathematical languages/packages, but actually Javascript running with V8. (Python beats most of the remaining competitors.) Karpinski said that V8 was “incredibly good”; I believe those were the exact words he used. If Node.js were built around whatever Javascript engines existed before V8, would it be a viable piece of software at all? Google’s interest in developing V8 and making it free was, I presume, to give its websites’ visitors a better browsing experience: simply by releasing it and building it into Chrome, they spur other browsers to make equally good Javascript engines. Now, fueled by V8, there is a server-side application that could theoretically supplant a large part of the traditional web server stack, independent of the browsing experience. Could someone write a virtual machine/engine for Python that competes with V8, or is there something about Javascript that lends itself to fast processing?

Anyway, on the subject of lightweight applications, one other piece of software that kept popping up in this respect was nginx, a lightweight web server. For those of you completely unfamiliar with web servers, Wikipedia’s article is nice, especially the “Path translation” section; the important takeaway is that visiting a website like http://en.wikipedia.org/wiki/Web_server can be construed as an HTTP request directed to a computer identified by en.wikipedia.org for a resource called /wiki/Web_server. (HTTP is the communications protocol used by the web.) It is the web server’s job, sitting on the other end at en.wikipedia.org, to fetch you that resource.
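That split of a URL into “which computer” and “which resource” can be shown in a few lines of Python using the standard library’s urlsplit. The request text built at the end is only a simplified sketch of what a browser sends; a real browser adds many more headers.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://en.wikipedia.org/wiki/Web_server")

host = parts.hostname  # the computer the request is directed to
path = parts.path      # the resource being asked for

# A bare-bones HTTP/1.1 request for that resource.
request_text = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
```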

Apache httpd is easily the most popular web server out there; according to this February 2012 survey, about 65% of websites serve their content with Apache. (Technically, Apache is the name of the software foundation that produces web server software called httpd. Apache makes other stuff too, like Hadoop. But many people metonymically call httpd “Apache.”) As such, when I first wanted to learn how to set up a web server, I started with Apache; whenever you’re learning something new, especially in software, there’s a lot to be said for going along with the way the majority does it because you’ll have a community and more Internet resources to help you if you’re stuck. And I suppose if you say “I know web servers” but you’ve never used Apache, people are going to laugh at you.

But honestly I’ve had a smooth and positive experience with nginx so far. It worked correctly pretty much right away (not that Apache didn’t; it did too). The configuration files are simple, compared to the dizzying (and imposing for beginners) contents of httpd.conf. nginx’s website has decent documentation and offers various sample configurations and a nice common-pitfalls guideline. nginx’s philosophy seems to be oriented around doing a few things well rather than building an all-purpose server like Apache; in fact it seems common to have nginx be the first point of contact for HTTP requests and to then have it proxy out some requests to an Apache server elsewhere. As such I don’t think nginx is necessarily a bad starting point for a newbie.

I’ll touch very briefly on D3.js, which, unlike Node.js, actually is a Javascript module that you include in your web pages with the <script> tag. D3.js provides an interface to bind data to graphics, which makes it easy to embed fancy charts in your web pages based on dynamic data. D3.js’s home page has a really beautiful gallery of examples. I only learned enough to create very simple graphics, but D3.js doesn’t seem particularly complicated; it seems to me that graphic design ability is a much bigger difference maker between a skilled and an unskilled D3.js user.

And finally, for all that learning (and various odds and ends that you learn by doing, things like learning to disable journaling on Mongo so that it didn’t blow through my 6 GB partition on Amazon EBS), the end product is actually nothing more than a very simple, nearly “hello, world”-level web page. It plots NYC subway stops’ geographic locations, colored by route, on a white background. It is, in fact, produced by D3.js, served by a Node.js app, proxied through nginx (which is the primary server for the domain), and reads its data from a Mongo database that was populated with node-gtfs. (And it’s all running on one low-powered Amazon EC2 instance and is not fast at all.) I guess this gets me back to my original point of rewarding and disheartening; it was great to learn this all, but I hope better output is down the road.
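The plotting itself boils down to scaling longitude and latitude into pixel coordinates and picking a color per route. The actual page does this with D3.js in the browser; the Python sketch below only illustrates the same arithmetic and writes the result out as an SVG string. The stop coordinates, route names, colors, and the make_scaler helper are all made up for the example.

```python
def make_scaler(values, out_min, out_max):
    """Return a function mapping [min(values), max(values)] linearly onto [out_min, out_max]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero if all values are equal

    def scale(v):
        return out_min + (v - lo) / span * (out_max - out_min)

    return scale


# Made-up stops; real ones would come from the database.
stops = [
    {"name": "A", "lat": 40.70, "lon": -74.00, "route": "1"},
    {"name": "B", "lat": 40.80, "lon": -73.95, "route": "1"},
    {"name": "C", "lat": 40.75, "lon": -73.90, "route": "7"},
]
ROUTE_COLORS = {"1": "red", "7": "purple"}  # hypothetical color choices
WIDTH, HEIGHT = 400, 400

x = make_scaler([s["lon"] for s in stops], 0, WIDTH)
# Pixel y grows downward, so the output range is flipped to keep north at the top.
y = make_scaler([s["lat"] for s in stops], HEIGHT, 0)

circles = []
for s in stops:
    cx, cy, color = x(s["lon"]), y(s["lat"]), ROUTE_COLORS[s["route"]]
    circles.append(f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="3" fill="{color}"/>')

svg = (
    f'<svg xmlns="http://www.w3.org/2000/svg" width="{WIDTH}" height="{HEIGHT}">'
    + "".join(circles)
    + "</svg>"
)
```

Treating longitude and latitude as plain x and y like this distorts distances a little, but over an area the size of one city it is good enough for a dot map.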

Written by Andy

Tue 15 May 2012 at 10:48 pm
