Apple's Differential Privacy
Three Pipe Problem stashed this in Programming & Math
I feel like it's awesome, and I have to hand it to Apple.
They really spent a long time thinking this through.
“We believe you should have great features and great privacy,” Federighi told the developer crowd. “Differential privacy is a research topic in the areas of statistics and data analytics that uses hashing, subsampling and noise injection to enable…crowdsourced learning while keeping the data of individual users completely private. Apple has been doing some super-important work in this area to enable differential privacy to be deployed at scale.”
Differential privacy, translated from Apple-speak, is the statistical science of trying to learn as much as possible about a group while learning as little as possible about any individual in it. With differential privacy, Apple can collect and store its users’ data in a format that lets it glean useful notions about what people do, say, like and want. But it can’t extract anything about a single, specific one of those people that might represent a privacy violation. And neither, in theory, could hackers or intelligence agencies.
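To make "learn about the group, not the individual" concrete, here is a minimal sketch of randomized response, a classic textbook technique from the differential privacy literature. It is not Apple's implementation, which the keynote describes only as "hashing, subsampling and noise injection." The function names, the coin-flip probabilities and the simulated population are all assumptions chosen for illustration.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the true answer with probability 3/4, otherwise its opposite."""
    # First coin flip: heads means answer honestly.
    if rng.random() < 0.5:
        return truth
    # Tails: ignore the truth and answer with a second, independent coin flip.
    return rng.random() < 0.5


def estimate_true_rate(reports: list[bool]) -> float:
    """Undo the noise in aggregate: P(report yes) = 0.25 + 0.5 * true_rate."""
    observed = sum(reports) / len(reports)
    return (observed - 0.25) / 0.5


def simulate(true_rate: float, n_users: int, seed: int = 0) -> float:
    """Hypothetical population: each user holds one private yes/no fact."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n_users)]
    reports = [randomized_response(t, rng) for t in truths]
    return estimate_true_rate(reports)


if __name__ == "__main__":
    # The collector sees only noisy reports, yet the group estimate is close.
    print(round(simulate(true_rate=0.3, n_users=100_000), 3))
```

Any single report is deniable, since a "yes" may have come from the coin rather than the user, but with enough users the noise averages out and the population rate can be recovered.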
“With a large dataset that consists of records of individuals, you might like to run a machine learning algorithm to derive statistical insights from the database as a whole, but you want to prevent some outside observer or attacker from learning anything specific about some [individual] in the data set,” says Aaron Roth, a University of Pennsylvania computer science professor whom Apple’s Federighi named in his keynote as having “written the book” on differential privacy. (That book, co-written with Microsoft researcher Cynthia Dwork, is The Algorithmic Foundations of Differential Privacy.) “Differential privacy lets you gain insights from large datasets, but with a mathematical proof that no one can learn about a single individual.”
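For readers who want to see what that guarantee looks like, the standard textbook definition can be written as follows. The symbols are not from the article: M is a randomized algorithm, D and D' are any two datasets that differ in one person's record, S is any set of possible outputs, and ε is a privacy parameter.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\quad \text{for all neighboring } D, D' \text{ and all output sets } S.
```

In words: adding or removing any one person changes the probability of any outcome by at most a factor of e^ε. The smaller ε is, the less an observer can infer about whether a particular individual's data was included.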
As Roth notes when he refers to a “mathematical proof,” differential privacy doesn’t merely try to obfuscate or “anonymize” users’ data. That anonymization approach, he argues, tends to fail. In 2006, for instance, Netflix released a large collection of its viewers’ film ratings as part of a competition to optimize its recommendations, removing people’s names and other identifying details and publishing only their Netflix ratings. But researchers soon cross-referenced the Netflix data with public review data on IMDb, matching similar patterns of ratings between the two sites to add names back into Netflix’s supposedly anonymous database.
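The cross-referencing idea can be sketched in a few lines. This is a toy illustration only: the data is invented, the names `overlap_score` and `link_records` are hypothetical, and matching on exact rating agreement is a deliberate simplification rather than the researchers' actual method.

```python
def overlap_score(anon: dict[str, int], public: dict[str, int]) -> int:
    """Count titles that both profiles rated with the same score."""
    return sum(1 for title, rating in public.items() if anon.get(title) == rating)


def link_records(
    anonymized: dict[str, dict[str, int]],
    public_profiles: dict[str, dict[str, int]],
    min_matches: int = 3,
) -> dict[str, str]:
    """Map each public name to the anonymized ID whose ratings agree most."""
    links = {}
    for name, reviews in public_profiles.items():
        best_id, best = None, 0
        for anon_id, ratings in anonymized.items():
            score = overlap_score(ratings, reviews)
            if score > best:
                best_id, best = anon_id, score
        # Only claim a match when enough ratings line up.
        if best >= min_matches:
            links[name] = best_id
    return links


# Invented data: names stripped from the "released" set, kept on the public site.
anonymized = {
    "user_0417": {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 1},
    "user_0932": {"Film A": 3, "Film B": 5, "Film E": 4, "Film F": 2},
}
public_profiles = {
    "alice": {"Film A": 5, "Film B": 2, "Film C": 4},
    "bob": {"Film B": 5, "Film E": 4, "Film F": 2},
}

links = link_records(anonymized, public_profiles)
print(links)
```

Removing the name column did nothing here, because the pattern of ratings itself acts as a fingerprint. In this example, "Film D" is a rating that "alice" never posted publicly, and the link exposes it.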
That sort of de-anonymizing trick has countermeasures—say, removing the titles of the Netflix films and keeping only their genre. But there’s never a guarantee that some other clever trick or cross-referenced data couldn’t undo that obfuscation. “If you start to remove people’s names from data, it doesn’t stop people from doing clever cross-referencing,” says Roth. “That’s the kind of thing that’s provably prevented by differential privacy.”
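One common way to get that kind of provable protection is to release noisy aggregate answers rather than per-user rows. Below is a minimal sketch of the Laplace mechanism for a counting query. It is a textbook technique and not something the article attributes to Apple or Netflix; the function names, the `epsilon` parameter and the sample records are assumptions for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer 'how many records satisfy predicate?' with Laplace noise added."""
    true_count = sum(1 for r in records if predicate(r))
    # One person joining or leaving changes a count by at most 1,
    # so noise with scale 1/epsilon is the standard calibration.
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical ratings table: (user_id, title, stars).
    ratings = [(i, "Film A", 5 if i % 4 == 0 else 3) for i in range(1000)]
    noisy = private_count(ratings, lambda r: r[2] == 5, epsilon=0.5, rng=rng)
    print(round(noisy, 1))  # close to the true count of 250, but not exact
```

Because the noise is large enough to hide any one person's contribution, an attacker who cross-references the output with outside data learns almost nothing more than they would if that person were absent. The aggregate stays useful as long as the group is large relative to the noise.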