Scotland's vote likely to be a nail-biter - A Data-Driven Crystal Ball
If you only have a minute, read this: "Among the insights that Rothschild has documented and that he puts to considerable use in his methodology is that polls of voters' expectations (who they think will win) are a more accurate basis for forecasting than polls asking people how they intend to vote. 'This is because we are polling from a broader information set, and voters respond as if they had polled 20 of their friends,' he wrote in a 2013 paper co-written with Justin Wolfers of the University of Michigan. Not surprisingly, then, Rothschild regularly includes data from betting markets in generating his predictions, including his forecast of the Scottish independence vote."
"Scottish independence: polls show it's too close to call."
"Scotland's vote likely to be a nail-biter."
"Scottish independence vote on a knife edge as polls put both Yes AND No ahead."
If there was any consensus in the days running up to the momentous Sept. 18 vote in Scotland, it was that no one could predict the outcome. Headlines from Edinburgh, London, and across the globe were in complete agreement: It was impossible to say with any confidence what would happen.
And then there was David Rothschild, a Microsoft researcher and leading expert in a new kind of data-driven predictive methodology. Three days ahead of the vote in Scotland, he put the chances of a No outcome at 77.4 percent. Two days later, he inched it up to 79.5. On the morning of the vote, before any returns were announced, he went on record on his blog with an 84 percent chance of defeat for Scottish independence.
Miro Dudík (at whiteboard) confers with Microsoft Prediction Lab colleagues David Rothschild (left) and David Pennock.
This isn't a mere parlor game for Rothschild, who, along with colleagues at Microsoft and elsewhere, correctly predicted the winners of all 15 World Cup knockout games earlier this year and got the Obama vs. Romney outcome right in 50 of 51 jurisdictions (the states plus the District of Columbia) in the 2012 U.S. presidential election. It seems no contest is beyond the purview of Rothschild's predictive powers, whether it's congressional races, the Super Bowl, the Oscars, or the Eurovision Song Contest.
In an era in which traditional political polling is taking a huge reputational hit (just ask Eric Cantor, former majority leader of the U.S. House of Representatives, who lost his Republican primary election in Virginia by 11 percentage points despite his own pollster putting him 34 points ahead), Rothschild's success rate is gaining notice.
"The polls track the sentiment of the people who are answering the poll at the time," Rothschild said as he awaited the results in Scotland. "My forecast predicts what will happen on Election Day. Clearly, the sentiment of the people at the time of the polls is a critical component of any forecast of Election Day, but not the only one."
"It may actually be reasonably convincing," he said of the victory for the No side. And convincing it was: 55 to 45 percent.
The Problem with Representational Polling

Consider conventional political polling, which has a solid track record but is expensive and time-consuming. In recent decades, polling companies have relied on random-digit landline phone calls to determine voter sentiment. The accuracy of such results depends significantly on reaching a representative sample of people who actually will go to the polls. In the era of mobile phones and caller ID, the obstacles are mounting.
Among the insights that Rothschild has documented and that he puts to considerable use in his methodology is that polls of voters' expectations (who they think will win) are a more accurate basis for forecasting than polls asking people how they intend to vote.
"[T]his is because we are polling from a broader information set, and voters respond as if they had polled 20 of their friends," he wrote in a 2013 paper co-written with Justin Wolfers of the University of Michigan. Not surprisingly, then, Rothschild regularly includes data from betting markets in generating his predictions, including his forecast of the Scottish independence vote.
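The "as if they had polled 20 of their friends" intuition can be made concrete with a back-of-the-envelope calculation. The sketch below uses purely illustrative numbers, not the paper's actual model: if each expectation-poll respondent reflects roughly 20 acquaintances, a poll of N people draws on something like 20 × N observations, which shrinks the standard error of the estimate (ignoring, as a simplification, overlap and correlation among acquaintance networks).

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p estimated from n observations."""
    return math.sqrt(p * (1 - p) / n)

n_respondents = 500   # hypothetical poll size
p = 0.5               # assumed true vote share; worst case for variance

# Intention poll: each respondent contributes one observation.
se_intention = standard_error(p, n_respondents)

# Expectation poll, treated naively as ~20 observations per respondent.
se_expectation = standard_error(p, 20 * n_respondents)

print(round(se_intention, 4), round(se_expectation, 4))
```

This roughly 4x reduction in sampling noise is an upper bound on the benefit; in practice friends' views are correlated, so the true gain is smaller, but the direction of the effect matches what Rothschild and Wolfers report.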
Another major contribution from Rothschild, who has a doctorate in applied economics from the Wharton School of Business at the University of Pennsylvania, is that by applying the appropriate statistical adjustments, highly unrepresentative samples can be used to generate remarkably accurate forecasts.
He and several colleagues demonstrated this in a novel experiment that polled Xbox users before the 2012 U.S. presidential election. They conducted an opt-in poll in the 45 days before the election and enabled people to participate once a day. In addition to asking, "If the election were held today, who would you vote for?" they collected basic demographic information: sex, race, age, education, state of residence, party identification, political leanings, and how the respondent voted in the 2008 presidential election.
As you might expect, the vast majority of Xbox users (and thus survey respondents) were male and relatively young. They would make a terrible sample for standard polling. But they served the researchers' purposes.
"Standard polling looks at a respondent as, for example, a male from New York," Rothschild says. "The way we look at it is: a male and a person from New York. I hope to find other potential polltakers who are male and other potential polltakers who are from New York. And from that, by breaking people into their demographics, we're able to allow all users to inform the likely polling of all other users."
From there, the researchers "post-stratified" the Xbox responses to mimic a representative sample of likely voters, calculating cell weights by cross-tabulating with exit polls from the 2008 presidential election. So even though they were short on women older than 65, for example, they had a number of female respondents and some respondents older than 65, along with others who shared certain other characteristics with older women. As Election Day approached, they used the accumulated data to update their forecasts daily for each state. In the end, the data from more than 750,000 Xbox surveys taken by almost 350,000 unique respondents yielded 176,000 different demographic "cells," each with a distinct combination of characteristics.
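The core of post-stratification can be sketched in a few lines. The numbers below are hypothetical stand-ins (the real study used 176,000 cells and exit-poll cross-tabs): each demographic cell contributes its own average response, weighted by that cell's share of the target electorate rather than its share of the raw, unrepresentative sample.

```python
# Hypothetical demographic cells: (sex, age bracket) ->
#   (share of target electorate, average support for candidate A in the raw poll)
cells = {
    ("male", "18-29"):   (0.10, 0.58),
    ("male", "30-64"):   (0.28, 0.51),
    ("male", "65+"):     (0.10, 0.44),
    ("female", "18-29"): (0.11, 0.62),
    ("female", "30-64"): (0.29, 0.55),
    ("female", "65+"):   (0.12, 0.47),
}

def poststratified_estimate(cells):
    """Weight each cell's mean response by its population share.

    This corrects for a sample that over-represents some cells (e.g.,
    young men on Xbox) as long as every cell has at least some respondents.
    """
    total_share = sum(share for share, _ in cells.values())
    return sum(share * mean for share, mean in cells.values()) / total_share

print(round(poststratified_estimate(cells), 3))
```

In the actual study the cell means themselves were also smoothed with a multilevel regression model, so sparse cells (like older women) borrow strength from demographically similar cells; the weighting step shown here is the "post-stratification" half of that pipeline.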
"Not only did we match the accuracy of major polling companies," Rothschild says, "but we also provided a lot of insight that they weren't able to get, through the fact that we had people coming back again and again."
Each predictive exercise that Rothschild runs draws from a different pool of data, which is often a combination of polling data, historical results, Internet betting data, routinely collected statistics, and user-generated data. For Major League Baseball playoffs, for example, massive amounts of data are available from the regular season. World Cup soccer doesn't have that kind of buildup, so it makes sense to engage the crowd to collect new data to augment historical data about the players and teams and the results of the qualifying rounds.
"There's always something missing, always data we wish we had that didn't quite exist," Rothschild says. "So we've done a lot of fun experiments." These include Oscars prediction games and NFL prediction games that were designed to attract people with a high level of expertise in those areas.
"The way I've always looked at it," Rothschild says, "is that any individual (you, me, the guy on the street) has a certain amount of information about the things the person cares about, but no one has been unlocking it."
The conventional pollsters "don't think about somebody who is self-selected," he explains. "They go to random people. They also use very simple aggregation methods, rather than modeling the results they have. That's what computers are for. That's what our new knowledge is for."
Rothschild and his colleagues apply deep expertise in machine learning to test and calibrate their models against historical data, and they use advanced algorithms to account for a host of variables, such as the advantage of incumbency and the tendency of bettors to overvalue long shots.
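Two of the routine corrections applied to betting-market prices before treating them as probabilities can be sketched simply. This is a minimal illustration of the general technique, not Rothschild's actual model, and the prices and the exponent `k` are hypothetical: first normalize away the bookmaker's margin (the "overround"), then apply a power transform that shrinks long-shot prices to correct for bettors overvaluing long shots.

```python
def market_probabilities(prices, k=1.3):
    """Convert quoted market prices into probabilities summing to 1.

    k > 1 pushes small (long-shot) probabilities down relative to
    favorites; in practice k would be fit to historical outcomes.
    The final division also removes the bookmaker's margin.
    """
    adjusted = [p ** k for p in prices]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# Quoted implied prices for a hypothetical three-way market;
# they sum to 1.05, i.e., a 5 percent bookmaker margin.
quoted = [0.60, 0.30, 0.15]
probs = market_probabilities(quoted)
print([round(p, 3) for p in probs])
```

The effect is that the favorite's probability ends up above its margin-adjusted quote while the long shot's ends up below it, which is the direction of the favorite-longshot correction the paragraph above describes.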
Reinventing Survey Research

The interactive platform that Rothschild and other researchers launched today houses all of the ongoing predictive work that Rothschild has been featuring on his blog and in academic journals and presentations. The Microsoft Prediction Lab displays his data-driven predictions (some of them updated in real time) in a wide range of fields, from sports and entertainment to politics and economics.
"We're building an infrastructure," he says, "that's incredibly scalable, so we can be answering questions along a massive continuum."
Rothschild sees the new platform as "a great laboratory for researchers" as well as "a very socialized experience" for interested users. Among other contests, he plans to predict the results of every upcoming U.S. House, Senate, and gubernatorial race. Users will be able to customize views on the site based on their geographic location and their interests. The idea is to collect data quickly and update it as often as possible.
A sample of the Microsoft Prediction Lab interface users will see for every U.S. House, Senate, and gubernatorial race in 2014.
"It's also important to be agnostic and not be wed to one type of data," Rothschild says. He looks at any data that can contribute to the predictive model, whether it's stock-market data, Internet page views, or trending topics and word co-occurrence on social media. Collecting "crowd wisdom" will be a big component of the endeavor.
"By really reinventing survey research, we feel that we can open it up to a whole new realm of questions that, previously, people used to say you can only use a model for," Rothschild says. "From whom you survey to the questions you ask to the aggregation method that you utilize to the incentive structure, we see places to innovate. We're trying to be extremely disruptive."
That disruption has ramifications for the polling industry, and beyond.
"There are two reasons to experiment with nonprobability polling," he says. "First, I firmly believe the standard polling will reach a point where the response rate and the coverage is so low that something bad will happen. Then, the standard polling technology will be completely destroyed, so it is prudent to invest in alternative methods.
"Second, even if nothing ever happened to standard polling, nonprobability polling data will unlock market intelligence for us that no standard polling could ever provide. Ultimately, we will be able to gather data so quickly that the idea of a decision-maker waiting a few weeks for a poll will seem crazy."
The ready availability of such data will enable businesses to make strategic investment decisions, such as where to locate a data center or how to invest marketing resources to attain the optimal yield.
"We will be able," Rothschild says, "to gather so much detail from repeated users (and the quantity of users we can reach) that decision-makers will come to cherish the nearly infinite number of data points that can be efficiently generated to answer the exact questions the question-maker has, not the expedient question or the historical norm."
One caveat, though: The market intelligence derived from the nonprobability polling data must prove accurate.
"That is what this research is all about," he adds, "reaching that point where the quick, relevant, and cost-effective market intelligence is as accurate as what it supplants. At that point, the demise of standard polling becomes irrelevant, because it will become strictly dominated by nonprobability data collection and analytical techniques."
The new Microsoft Prediction Lab website draws on the expertise of researchers in Microsoft's New York City, Redmond, and India labs. Key contributors include noted computer scientists Miro Dudík and David Pennock, as well as a research team led by Harry Shum, Microsoft executive vice president of Technology and Research, and the office of Microsoft's chief economist, Preston McAfee.
"It has been," Rothschild confirms, "an incredibly collaborative effort."
"Most researchers get the opportunity to explore a much more narrow set of questions and a much more narrow set of data," he says. "But through collaboration with an awesome set of researchers, this really allows me to explore things that are so buried. And that's really the most exciting thing about this. It's not any individual outcome; it's the massive amount of questions that we'll be able to answer in the near future."