
Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance, by Todd W. Schneider

The whole article is fascinating. I enjoyed this nugget:

In January 2009, just over 20% of taxi fares were paid with a credit card, but as of June 2015, that number has grown to over 60% of all fares.

I like the concept of "medium data".

There are endless analyses to be done, and more datasets that could be merged with the taxi data for further investigation. The Citi Bike program releases public ride data; did the introduction of a bike-share system have a material impact on taxi ridership? And maybe we could quantify fair-weather fandom by measuring how taxi volume to Yankee Stadium and Citi Field fluctuates based on the Yankees’ and Mets’ records?

There are investors out there who use satellite imagery to make investment decisions, e.g. if there are lots of cars in a department store’s parking lots this holiday season, maybe it’s time to buy. You might be able to do something similar with the taxi data: is airline market share shifting, based on traffic through JetBlue’s terminal at JFK vs. Delta’s terminal at LaGuardia? Is demand for lumber at all correlated to how many people are loading up on IKEA furniture in Red Hook?

I’d imagine that people will continue to obtain Uber data via FOIL requests, so it will be interesting to see how that unfolds amidst increased tension with city government and constant media speculation about a possible IPO.

Lastly, I mentioned the “medium data revolution” in my previous post about Fannie Mae and Freddie Mac, and the same ethos applies here. Not too long ago, the idea of downloading, processing, and analyzing 267 GB of raw data containing 1.1 billion rows on a commodity laptop would have been almost laughably naive. Today, not only is it possible on a MacBook Air, but there is a growing set of open-source software tools available to aid in the process. I’m partial to PostgreSQL and R, but those are implementation details: increasingly, the limiting factor of data analysis is not computational horsepower, but human curiosity and creativity.
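The key to handling "medium data" on a laptop is streaming: you never hold all 1.1 billion rows in memory at once, whether the aggregation happens in PostgreSQL or in application code. As a minimal sketch of the idea in Python (not the author's actual pipeline, which used PostgreSQL and R), here is a streaming pass that computes the credit-card payment share quoted above. The column name `payment_type` and the code values `CRD`/`CREDIT` are assumptions for illustration, not the actual schema of the taxi dataset:

```python
# Hedged sketch: stream trip records row by row so memory use stays
# constant regardless of file size. Column name "payment_type" and the
# codes "CRD"/"CREDIT" are hypothetical, not the real dataset schema.
import csv
import io

def credit_card_share(lines):
    """Stream CSV trip records; return the fraction paid by credit card."""
    reader = csv.DictReader(lines)
    total = credit = 0
    for row in reader:
        total += 1
        if row["payment_type"].strip().upper() in {"CRD", "CREDIT"}:
            credit += 1
    return credit / total if total else 0.0

# Tiny in-memory sample standing in for a multi-GB trips file; in
# practice you would pass an open file handle instead.
sample = io.StringIO(
    "payment_type,fare_amount\n"
    "CRD,12.5\nCSH,8.0\nCRD,20.0\nCSH,5.5\nCRD,9.0\n"
)
print(credit_card_share(sample))  # 3 of 5 trips paid by card -> 0.6
```

Because the function only ever sees one row at a time, the same code works unchanged on a file hundreds of gigabytes large; the 60% figure in the excerpt is exactly this kind of single-pass aggregate.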


If you’re interested in getting the data and doing your own analysis, or just want to read a bit about the more technical details, head over to the GitHub repository.
