
How (And Why) I'm Circumventing Twitter's API Instead of Using It




I still like using Twitter. Though I have often complained about their shitty API, I still have respect for the product and the company. The story of Twitter is a valuable case study, from its humble beginnings as Twttr -- an idea so simple and crude that most people casually dismissed it -- to the hugely popular and valuable service we know today. It's the story of a team that persisted, improved its design, and kept pimping the product until it became viable.

@jack's early notes on Twttr

I believe that the API contributed hugely to Twitter's growth, which was really kind of miraculous since nothing about it was particularly well-designed. For one thing it has always been pretty fail-y, even by OAuth standards. And rather than scaling the service and improving availability, Twitter has consistently fallen back on limiting API usage through restrictive policies and rate limits. It's also pretty clear that Twitter never invested much time or thought into refining the API to be a good general resource for developers building arbitrary apps. Rather, the API was built more or less directly on top of the same backend infrastructure as the website itself. You have to do significant post-processing to get the data into a form that's useful for anything other than replicating the core Twitter experience (which they famously don't want you to do). Couple that with the fact that they also don't want you to deviate too far from the "core Twitter experience" and you have one fairly fucktarded API.

Twitter Fail

But Twitter has so much awesome data that many devs (including me) were willing to work through the many, many problems with Twitter's API. I've implemented several Twitter-based features on PandaWhale which might have been great cross-platform experiences, but between the constant failures, the rate limits, and the fact that Twitter may cut off API access at any time, Adam and I have decided to take a different route: circumventing the API and fetching the data through what you might call "less official" means.

The first plan we cooked up was to make an underground Twitter API. With the help of 1,000 or so Mechanical Turk workers we'd create thousands of Twitter apps (i.e., OAuth API key pairs) and tens of thousands of fake users to use those apps, so that our servers could consume the bots' feeds and aggregate them into a pseudo-firehose. We figured that if we had ~50K bot users following the top 250K authors on Twitter (human or bot, ranked by follower count), then the aggregated home page feeds of the bots would represent the bulk of the useful data on Twitter.

Of course we knew that Twitter would try to shut us down, but they would be forced to play the game of tracking down which API key pairs and accounts were in use. With Mechanical Turk churning out new dummy apps and users, EC2 giving us new IP addresses, and perhaps even real people "donating" their feeds by adding one of our read-only apps, we felt confident that Twitter could never shut us down. And we'd make the underground firehose available for free to anyone who wanted to develop Twitter apps. Pretty cool, huh?
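
For flavor, here's a hypothetical sketch of the aggregation loop we had in mind -- not code we ever shipped. fetchHomeTimeline() is a placeholder for an OAuth-signed call to one bot's home timeline using one of the Mechanical Turk-created key pairs:

```javascript
// Hypothetical sketch of the abandoned "underground firehose" plan.
// fetchHomeTimeline() is a placeholder for an OAuth-signed request to
// one bot account's home timeline; the key pairs and accounts would
// have come from the Mechanical Turk pipeline described above.

var botCreds = [ /* thousands of { consumerKey, consumerSecret, token, tokenSecret } */ ];
var seenTweetIds = {}; // dedupe: the bots' follow graphs overlap heavily

function fetchHomeTimeline(creds, callback) {
  // Placeholder: sign and send the request with `creds`, parse the JSON,
  // and hand back an array of tweet objects. Implementation elided.
  callback([]);
}

function emitToFirehose(tweet) {
  console.log(JSON.stringify(tweet)); // stand-in for a real pub/sub channel
}

function pollBot(creds) {
  fetchHomeTimeline(creds, function (tweets) {
    tweets.forEach(function (tweet) {
      if (!seenTweetIds[tweet.id_str]) {
        seenTweetIds[tweet.id_str] = true;
        emitToFirehose(tweet);
      }
    });
  });
}

// Round-robin the bots slowly enough that each stays under its own rate limit.
var i = 0;
setInterval(function () {
  if (botCreds.length) pollBot(botCreds[i++ % botCreds.length]);
}, 1000);
```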

firehose

The problem is that this plan is overly complex and expensive. My work on the PandaWhale bookmarklet taught me that the best way to scrape data from the big "permalink machine" consumer web sites (Twitter, FB, Tumblr et al.) is with Javascript in a browser context. Sites are DOM. Javascript is DOM. HTTP server-based scraping has a rich and noble history, but it doesn't work well in this brave new world of session-based feeds rendered with Javascript. But all those convenient programming hooks and DOM node patterns that some dev at Twitter created to make his job easier can (and should!) also be exploited by me to scrape the site. APIs are ultimately just a way to shape the scraping behavior of third-party developers by giving them a blessed way to scrape your site. If you have compelling data and your API is a turd, you're asking to be scraped. That's the web.
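
To make that concrete, here's a bookmarklet-style sketch of in-browser scraping. The selectors and data attributes are my guesses at Twitter's current markup, not anything blessed, and they'll need adjusting whenever the layout changes:

```javascript
// Runs in the context of a twitter.com page (e.g., injected by a bookmarklet).
// NOTE: 'div.tweet', 'data-tweet-id', etc. are illustrative assumptions about
// the page's markup, not a stable interface -- expect to tweak them.

var tweets = [];
var nodes = document.querySelectorAll('div.tweet');
for (var i = 0; i < nodes.length; i++) {
  var node = nodes[i];
  var textNode = node.querySelector('.tweet-text');
  tweets.push({
    id:     node.getAttribute('data-tweet-id'),
    author: node.getAttribute('data-screen-name'),
    text:   textNode ? textNode.textContent : ''
  });
}

// Ship the scraped tweets back to your own server from here.
console.log(JSON.stringify(tweets));
```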

So we can't use Twitter's API, due to a fatal combination of poor technology and poor policy, and for security reasons script injection only works on an ad hoc basis in response to a user action, like in a bookmarklet or browser extension. So how do we take advantage of script injection in an effective and ethical way? We already can (and do) let users stash tweets with our bookmarklet, in case they want some prayer in hell of finding the tweet again in a week's time. But while useful to the user, this approach doesn't let us do some pretty basic things, like stash the @-replies that arrive after the tweet was stashed. That's often not even possible through the API: if a user is trying to save someone else's tweet (and the author is not a PandaWhale user), we can't parse the author's entire @-reply history to find the few replies that were in response to that tweet. It's amazing that after all this time the API still has no way to fetch @-replies given a tweet id, so we're forced to resort to something as barbaric as parsing entire @-reply histories. (Side note: this inefficiency puts extra load on Twitter's servers and eats up rate limits -- a little refinement to the API would help everyone a lot here.)
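
For the record, here's roughly what that barbarism looks like. fetchMentionsPage() stands in for an authenticated, rate-limited call to the author's mentions timeline -- exactly the call we can't make when the author isn't a PandaWhale user:

```javascript
// Sketch of finding @-replies to one tweet by paging through the author's
// entire mentions timeline. fetchMentionsPage(page, cb) is a placeholder
// for an authenticated, rate-limited API call that yields tweet objects.

function findRepliesTo(tweetId, fetchMentionsPage, callback) {
  var replies = [];
  (function nextPage(page) {
    fetchMentionsPage(page, function (mentions) {
      if (mentions.length === 0) return callback(replies); // history exhausted
      mentions.forEach(function (m) {
        // Each tweet object carries the id of the tweet it replies to, if any.
        if (m.in_reply_to_status_id_str === tweetId) replies.push(m);
      });
      nextPage(page + 1); // burn another rate-limited request
    });
  })(1);
}
```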

needs moar refinez

So I took this idea a step further and decided that the best way to deal with Twitter was not to use their API nor create an underground one, but rather to run a headless browser like PhantomJS on a dedicated server. I can input some sockpuppet Twitter creds (hell, I could even input my own Twitter creds) and I've got a valid Twitter session running on a browser in the cloud. Any time PandaWhale needs to fetch additional data from Twitter (like to get @-replies to a tweet which a user stashed a few hours ago) I can just have my cloud browser navigate on over to the tweet's permalink, expand it, and look to see if there are any new replies -- much as I would if I were doing it manually for myself. I've done away with the pain of OAuth credentials. I've done away with the development pain (and the business risk) of relying on Twitter's API.
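
Here's a minimal sketch of the idea in PhantomJS, assuming the session is already logged in (say, cookies saved from a prior login run) and guessing at the permalink page's markup -- the selectors and the example URL are assumptions to tune against the live site:

```javascript
// phantomjs fetch-replies.js
// Minimal sketch: load a tweet permalink in a headless browser session
// and scrape whatever replies have been rendered into the DOM.

var page = require('webpage').create();
var tweetUrl = 'https://twitter.com/someuser/status/1234567890'; // hypothetical permalink

page.open(tweetUrl, function (status) {
  if (status !== 'success') {
    console.log('failed to load ' + tweetUrl);
    return phantom.exit(1);
  }
  // Give Twitter's own Javascript a few seconds to render the replies.
  setTimeout(function () {
    var replies = page.evaluate(function () {
      // Runs inside the page. 'div.tweet' and 'data-tweet-id' are
      // assumptions about the markup, not a stable interface.
      var out = [];
      var nodes = document.querySelectorAll('div.tweet');
      for (var i = 1; i < nodes.length; i++) { // index 0 is the stashed tweet itself
        out.push({
          id:   nodes[i].getAttribute('data-tweet-id'),
          text: nodes[i].textContent.replace(/\s+/g, ' ').trim()
        });
      }
      return out;
    });
    console.log(JSON.stringify(replies));
    phantom.exit();
  }, 3000);
});
```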

PhantomJS & Twitter

I believe that this way everyone will be happier, and all Twitter apps should do this. Twitter gets reduced loads on its servers and doesn't need to maintain its API anymore, devs get to roll out awesome features more easily and more effectively, and users get more reliable & performant apps which won't go belly up in a week when Twitter revokes their API keys :)

And perhaps we can stop becoming a web of apps and go back to being a web of links.

First of all, this is brilliant:

The best way to deal with Twitter was not to use their API nor create an underground one, but rather to run a headless browser like PhantomJS on a dedicated server. I can input some sockpuppet Twitter creds (hell, I could even input my own Twitter creds) and I've got a valid Twitter session running on a browser in the cloud...

My question is why Twitter would even want to maintain an API anymore.

The headless browser technique respects the Twitter terms of service but doesn't force developers to suffer through an unreliable, rate-limited API.

It's a win for everyone, IMHO.

Do not try and bend the rate limit. That's impossible. Instead... only try to realize the truth.

Matrix Spoon Boy

Lindsay Lohan: The Limit Does Not Exist

Not when you run PhantomJS in the cloud, baby!

PhantomJS in the Cloud

The PhantomJS ghost looks like he wants to take out Pacman.

Bonus points for The Matrix and Lindsay Lohan.

The limit does not exist!!!

Limits funny

Well, the limit shifted. As one of my esteemed colleagues says, it's still just scraping at the mercy of the page layout.

An API is just a social contract that the company agrees to keep things backwards compatible for a long time.

Given that Twitter has changed its social contract several times, we've always been at their mercy anyway.

Social contract

I want to renegotiate my social contract.


@greg Not to be rude, but based on that comment I think you and your "esteemed colleagues" are not particularly familiar with developing Twitter apps.

Yes, the rate limit "shifted." IT SHIFTED DOWN.

It used to be 150 requests per hour unauthenticated (based on IP) and 360 per hour authenticated (based on per-user OAuth tokens).

Now unauthenticated is still 150, but authenticated is down by 10 per hour to 350. They also added feature-based rate limiting for certain API methods, which stacks with the overall rate limits.

As to your second point about being "at the mercy of page layout": do you think you are any less at Twitter's mercy when using the API? How about the significant business risk of just being cut off altogether? What if they just shut down the API, period? Yes, I realize that (in theory) APIs are a contract -- but Twitter has already rewritten that contract several times.

And what's so bad about having to rewrite the scraper occasionally? So some dev at Twitter spends 2 hours tweaking a feature, and I spend 10 minutes adjusting the scraper logic.
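
And that 10-minute fix stays a 10-minute fix if you keep every layout-dependent selector in one place instead of sprinkling them through the scraper. Something like this, with illustrative values:

```javascript
// selectors.js -- the only file that needs editing when Twitter
// shuffles its markup. Values are illustrative, not gospel.
module.exports = {
  tweet:      'div.tweet',
  tweetId:    'data-tweet-id',
  screenName: 'data-screen-name',
  tweetText:  '.tweet-text'
};
```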

"An API is just a social contract that the company agrees to keep things backwards compatible for a long time."

I see your patronizing remark and raise you one: try developing a Twitter app, or even doing some cursory research on how Twitter treats third-party developers, and we can continue the discussion then.

Look, this isn't meant to be a one-size-fits-all solution that you implement once and that lasts forever.

But it's a feasible approach which will be more effective in the long run than using Twitter's shithead API. They don't understand what it means to be an API provider, and building out an app infrastructure based on the idea that Twitter will respect the social contract is foolhardy.