De-Identification, Re-Identification and the risks therein
Does de-identification work or not?
How we answer this question boils down to whether we define de-identification as "working" only if it provides absolute privacy guarantees, or whether, as we do in many other areas of life (door locks, seatbelts and other protections), we accept a dramatic reduction from the original, unprotected risks as worthwhile.
As the authors of the whitepaper noted, Dr. Latanya Sweeney, a leading de-identification expert, has estimated the risk associated with HIPAA Safe Harbor de-identification at 0.04 percent (4 in 10,000), since that proportion of the population would be unique under her calculations and, therefore, potentially identifiable. This is quite a small risk, which, as I've mentioned elsewhere, falls somewhere between one's lifetime odds of being personally struck by lightning (about one in 10,000) and the risk of being affected because someone close to you has been struck (with ten people affected for every one struck). With a probability of re-identification this low, one really has to question whether anyone would be motivated to undertake a re-identification attempt with such an extremely small chance of success.
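The arithmetic behind these comparisons is simple enough to sketch. The 0.04 percent Safe Harbor estimate and the roughly one-in-10,000 lifetime lightning odds come from the figures above; the dataset size below is a made-up number purely for illustration.

```python
# Back-of-the-envelope arithmetic for the risk figures quoted above.
# The 0.04% Safe Harbor estimate and the ~1-in-10,000 lifetime
# lightning odds are from the text; the dataset size is hypothetical.

safe_harbor_risk = 4 / 10_000        # Sweeney's 0.04% estimate
lightning_odds = 1 / 10_000          # lifetime odds of being struck

records = 1_000_000                  # hypothetical de-identified dataset
expected_uniques = records * safe_harbor_risk

print(round(expected_uniques))                   # 400 potentially unique records
print(round(safe_harbor_risk / lightning_odds))  # 4 -- same order of magnitude
```

In other words, even in a dataset of a million records, only about 400 individuals would even be candidates for unique re-identification under this estimate.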
As I've written in a recent three-part essay for Harvard Law School's "Bill of Health" Online Symposium on the Law, Ethics & Science of Re-identification Demonstrations, when proper de-identification methods have been used to reduce re-identification risks to very small levels, it becomes highly unlikely that data intruders would conclude an attack is worth the time, effort and expense in the first place. Where de-identification best practices are followed, midnight dumpster diving for prescription bottles is likely to become the more economically viable way to violate our neighbors' privacy (for those inclined toward such malfeasance). Real-world re-identification faces very pragmatic economic disincentives: most "data intruders" would need to turn a profit to bother attempting to re-identify people, at least when the data has been properly de-identified with these intrusion motivations and considerations taken into account.
Fortunately, under the HIPAA de-identification requirements, re-identification is typically time-consuming, expensive (often requiring identified linking data from commercial data vendors), demanding of serious computational and mathematical skills, rarely successful and, most importantly, usually uncertain as to whether it has actually succeeded (due to the high probability of "false positive" re-identifications when re-identification probabilities are so low).
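The "false positive" point can be made concrete with Bayes' rule: when only a tiny fraction of records are truly re-identifiable, even a fairly accurate matching procedure produces mostly wrong matches. The 0.04 percent base rate is from the figures above; the matcher's sensitivity and false-positive rate below are purely illustrative assumptions.

```python
# Why low re-identification probabilities imply uncertain "successes":
# a Bayes-rule sketch. The 0.04% base rate is from the text; the
# matcher's sensitivity and false-positive rate are assumed values.

base_rate = 0.0004        # fraction of records truly re-identifiable
sensitivity = 0.90        # P(match claimed | truly unique)  -- assumed
false_positive = 0.01     # P(match claimed | not unique)    -- assumed

p_claim = sensitivity * base_rate + false_positive * (1 - base_rate)
ppv = sensitivity * base_rate / p_claim   # positive predictive value

print(round(ppv, 3))      # 0.035 -- most claimed "matches" would be wrong
```

Under these assumptions, fewer than 4 percent of claimed re-identifications would actually be correct, so an attacker can rarely be confident that any particular match has succeeded.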
Ms. Baker's article challenged us to ask ourselves, "What risk level is acceptable in our eyes?" and whether we would still give the same answer if it were we who might possibly be re-identified.
She went on to pose these important questions:
- How many data users can we count on to use "proper" de-identification techniques in their work every single time?
- Who, if anyone, is policing that policy in organizations or governing agencies? and
- Can we count on them to do that job properly?
Fortunately, HHS has provided some very useful guidance on the HIPAA de-identification requirements and best practices that should be considered as we ask ourselves these important questions. This guidance carefully covers both the requirements of the more easily implemented Safe Harbor (which requires removal of 18 types of potential identifiers), and the Expert Determination method, which is more flexible, but requires significant expertise and experience to implement.
As an HIV epidemiologist, I, like Ms. Baker, am a huge supporter of using individual data for public benefit, such as finding cures for cancer or helping to detect and prevent the next worldwide pandemic (such as MERS). I would also be the first to cheer a fail-safe way to de-identify data, but the practical reality is that we face, to quote Al Gore, an "inconvenient truth". By this I mean the simple fact that there is an inevitable trade-off between the quality and correctness of statistical analyses performed with de-identified data and the privacy protections that result from the de-identification process. So, while I would love to see de-identification yield perfect protections, I believe the wiser course for our society and for guiding public policy is to recognize that once we have achieved extremely small re-identification risks, trying to push them all the way to zero has its own set of very damaging outcomes. This is because some popular de-identification methods (e.g., k-anonymity) can unnecessarily, and often undetectably, degrade the accuracy of de-identified data for multivariate statistical analyses or data mining (by distorting variance-covariance matrices, or by masking heterogeneous sub-groups that have been collapsed in the search for generalization solutions).
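The distortion from generalization can be shown with a toy example. The sketch below coarsens exact ages into decade-bin midpoints (a common k-anonymity-style generalization step, simplified here) and shows that the within-bin variation silently disappears from the data; the ages themselves are made up for illustration.

```python
# A toy illustration of how k-anonymity-style generalization can
# quietly distort statistics: coarsening ages to decade-bin midpoints
# discards all within-bin variation. The data below are invented.
import statistics

ages = [21, 24, 28, 31, 35, 39, 42, 44, 48, 53, 57, 59]

def generalize(age, width=10):
    """Replace an exact age with the midpoint of its width-year bin."""
    lo = (age // width) * width
    return lo + width / 2          # e.g. 24 -> 25.0, 53 -> 55.0

coarse = [generalize(a) for a in ages]

print(statistics.pvariance(ages))    # true population variance
print(statistics.pvariance(coarse))  # smaller: within-bin spread is lost
```

An analyst who received only the generalized column would compute a variance that is too small, and nothing in the released data would reveal that the distortion had occurred, which is exactly the "undetectable degradation" problem described above.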
This problem of balancing disclosure risks against statistical accuracy is well understood by statisticians, but not as well recognized by privacy advocates and the general public. I recently gave a short talk to the National Academy of Sciences' Institute of Medicine about exactly these issues. I'd encourage FierceBigData readers to view the video of this presentation and examine the presentation slides (.pdf) in order to better understand my very serious concern that poorly conducted de-identification, particularly when it results in the over-de-identification of data that posed only de minimis re-identification risks, can lead all of us down the path to "bad science" and "bad decisions".
Unfortunately, all humans (whether members of the general public, data scientists, politicians or policy-makers) have an empirically demonstrated diminished capacity to rationally assess and respond to probabilities and risks when fear is invoked. So, when heavily publicized re-identification attacks are brought to life, like some Frankenstein monster, our subjective assessment of the probability of such an attack actually being carried out in the real world can jump to 100 percent, which badly distorts the true risk/benefit calculus that we face.
I hope that, in contemplating the very small risks associated with re-identification attacks directed at properly de-identified data (such as data that has been HIPAA de-identified), we will be able to rationally balance:
- Any remaining risks to privacy.
- The risks we might face from poorly conducted science based on over-distorted data.
- The considerable societal benefits we will forgo if we cannot choose wisely within the mathematically unavoidable trade-off imposed by insisting on absolute guarantees against re-identification risks.
As Ms. Baker wisely acknowledges in her article, "all of life entails some risk and this is no exception to the rule".