
The real problem in using big data is the privacy issue

August 18, 2014 | By Pam Baker


The real problem in using big data is the privacy issue. While many researchers and privacy advocates hail de-identification as the route to protecting privacy, others say that simply won't do, because anonymization makes a mess of the data sets. To make the data more open and useful, one group of researchers focused on the social sciences recommends we stop trying to de-identify private data and instead hold researchers responsible for protecting privacy.

"In this article, we show that these and other de-identification procedures necessitate changes to data sets that threaten replication and extension of baseline analyses," write the authors Jon P. Daries, Justin Reich, Jim Waldo, Elise M. Young, Jonathan Whittinghill, Daniel Thomas Seaton, Andrew Dean Ho, and Isaac Chuang. "To balance student privacy and the benefits of open data, we suggest focusing on protecting privacy without anonymizing data by instead expanding policies that compel researchers to uphold the privacy of the subjects in open data sets."

Read their entire paper in the Association for Computing Machinery's (ACM) Queue for more of the reasoning behind their argument. I would like to hear your thoughts on this matter. Please share them in the comments below or send me an email. I'm also hoping Dr. Barth-Jones, who penned a guest post on FierceBigData earlier on the subject of de-identification, and some of his peers who advocate de-identification will weigh in on this debate over which course best protects both privacy and data integrity.

Quality social science research and the privacy of human subjects require trust.


Open data has tremendous potential for science, but, in human subjects research, there is a tension between privacy and releasing high-quality open data. Federal law governing student privacy and the release of student records suggests that anonymizing student data protects student privacy. Guided by this standard, we de-identified and released a data set from 16 MOOCs (massive open online courses) from MITx and HarvardX on the edX platform. In this article, we show that these and other de-identification procedures necessitate changes to data sets that threaten replication and extension of baseline analyses. To balance student privacy and the benefits of open data, we suggest focusing on protecting privacy without anonymizing data by instead expanding policies that compel researchers to uphold the privacy of the subjects in open data sets. If we want to have high-quality social science research and also protect the privacy of human subjects, we must eventually have trust in researchers. Otherwise, we'll always have the strict tradeoff between anonymity and science illustrated here.

The open in massive open online course has many interpretations. Some MOOCs are hosted on open-source platforms, some use only openly licensed content, and most MOOCs are openly accessible to any learner without fee or prerequisites. We would like to add one more notion of openness: open access to data generated by MOOCs. We argue that this is part of the responsibility of MOOCs, and that fulfilling this responsibility threatens current conventions of anonymity in policy and public perception.

In this spirit of open data, on May 30, 2014, a team of researchers from Harvard and MIT (including this author team) announced the release of an open data set containing student records from 16 courses conducted in the first year of the edX platform. (In May 2012, MIT and Harvard launched edX, a nonprofit platform for hosting and marketing MOOCs. MITx and HarvardX are the two respective institutional organizations focused on MOOCs.)6 The data set is a de-identified version of that used to publish HarvardX and MITx: The First Year of Open Online Courses, a report revealing findings about student demographics, course-taking patterns, certification rates, and other measures of student behavior.6 The goal for this data release was twofold: first, to allow other researchers to replicate the results of the analysis; and second, to allow researchers to conduct novel analyses beyond the original work, adding to the body of literature about open online courses.

Within hours of the release, original analysis of the data began appearing on Twitter, with figures and source code. Two weeks after the release, the data journalism team at The Chronicle of Higher Education published "8 Things You Should Know about MOOCs," an article that explored new dimensions of the data set, including the gender balance of the courses.13 Within the first month of the release, the data had been downloaded more than 650 times. With surprising speed, the data set began fulfilling its purpose: to allow the research community to use open data from online learning platforms to advance scientific progress.

The rapid spread of new research from this data is exciting, but the excitement is tempered by a necessary limitation of the released data: it represents a subset of the complete data. To comply with federal regulations on student privacy, the released data set had to be de-identified. This article demonstrates tradeoffs between the need to meet the demands of federal regulations of student privacy, on the one hand, and our responsibility to release data for replication and downstream analyses, on the other. For example, the original analysis found that approximately 5 percent of course registrants earned certificates. Some methods of de-identification cut that percentage in half.

It is impossible to anonymize identifiable data without the possibility of affecting some future analysis in some way. It is possible to quantify the difference between replications from the de-identified data and original findings; however, it is difficult to fully anticipate whether findings from novel analyses will result in valid insights or artifacts of de-identification. Higher standards for de-identification can lead to lower-value de-identified data. This could have a chilling effect on the motivations of social science researchers. If findings are likely to be biased by the de-identification process, why should researchers spend their scarce time on de-identified data?

At the launch of edX in May 2012, the presidents of MIT and Harvard spoke about the edX platform, and the data generated by it, as a public good. If academic and independent researchers alike have access to data from MOOCs, then the progress of research into online education will be faster and results can be furthered, refined, and tested. These ideals for open MOOC data are undermined, however, if protecting student privacy means that open data sets are markedly different from the original data. The tension between privacy and open data is in need of a better solution than anonymized data sets. Indeed, the fundamental problem in our current regulatory framework may be an unfortunate and unnecessary conflation of privacy and anonymity. Jeffrey Skopek17 of Harvard Law School outlines the difference between the two as follows:

...under the condition of privacy, we have knowledge of a person's identity, but not of an associated personal fact, whereas under the condition of anonymity, we have knowledge of a personal fact, but not of the associated person's identity. In this sense, privacy and anonymity are flip sides of each other. And for this reason, they can often function in opposite ways: whereas privacy often hides facts about someone whose identity is known by removing information and other goods associated with the person from public circulation, anonymity often hides the identity of someone about whom facts are known for the purpose of putting such goods into public circulation (p. 1755).

Realizing the potential of open data in social science requires a new paradigm for the protection of student privacy: either a technological solution such as differential privacy,3 which separates analysis from possession of the data, or a policy-based solution that allows open access to possibly re-identifiable data while policing the uses of the data.
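
To make the differential-privacy idea concrete, here is a minimal, illustrative Python sketch (not the authors' proposal or any particular system): an analyst asks an aggregate question and receives an answer with calibrated noise, never holding the row-level records. The function and field names are hypothetical.

    import random

    def noisy_count(records, predicate, epsilon=0.5):
        # A count query changes by at most 1 when one student is added or
        # removed, so Laplace noise with scale 1/epsilon gives
        # epsilon-differential privacy for this single query.
        true_count = sum(1 for r in records if predicate(r))
        # The difference of two independent Exponential(epsilon) draws is a
        # Laplace(0, 1/epsilon) random variable.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

    # Example: a certification count released without sharing the records.
    records = [{"certified": True}, {"certified": False}, {"certified": True}]
    print(noisy_count(records, lambda r: r["certified"]))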

This article describes the motivations behind efforts to release learner data, the contemporary regulatory framework of student privacy, our efforts to comply with those regulations in creating an open data set from MOOCs, and some analytical consequences of de-identification. From this case study in de-identification, we conclude that the scientific ideals of open data and the current regulatory requirements concerning anonymizing data are incompatible. Resolving that incompatibility will require new approaches that better balance the protection of privacy and the advancement of science in educational research and the social sciences more broadly.

BALANCING OPEN DATA AND STUDENT PRIVACY REGULATIONS

As with open-source code and openly licensed content, support for open data has been steadily building. In the United States, government agencies have increased their expectations for sharing research data.5 In 2003 the National Institutes of Health became the first federal agency to require research grant applicants to describe their plans for data sharing.12 In 2013 the Office of Science and Technology Policy released a memorandum requiring the public storage of digital data from unclassified, federally funded research.7 These trends dovetailed with growing interest in data sharing in the learning sciences community. In 2006 researchers from Carnegie Mellon University opened DataShop, a repository of event logs from intelligent tutoring systems and one of the largest sources of open data in educational research outside the federal government.8

Open data has tremendous potential across the scientific disciplines to facilitate greater transparency through replication and faster innovation through novel analyses. It is particularly important in research into open, online learning such as MOOCs. A study released earlier this year1 estimates that more than 7 million people in the United States alone have taken at least one online course, and that that number is growing by 6 percent each year. These students are taking online courses at a variety of institutions, from community colleges to research universities, and open MOOC data will facilitate research that could be helpful to all institutions with online offerings.

Open data can also facilitate cooperation between researchers with different domains of expertise. As George Siemens, president of the Society for Learning Analytics Research, has argued, learning research involving large and complex data sets requires interdisciplinary collaboration between data scientists and educational researchers.16 Open data sets make it easier for researchers in these two distinct domains to come together.

While open educational data has great promise for advancing science, it also raises important questions about student privacy. In higher education, the cornerstone of student privacy law is FERPA (Family Educational Rights and Privacy Act). FERPA is a federal privacy statute that regulates access to and disclosure of a student's educational records. In our de-identification procedures, we aimed to comply with FERPA, although not all institutions consider MOOC learners to be subject to FERPA.11

FERPA offers protections for PII (personally identifiable information) within student records. Per FERPA, PII cannot be disclosed, but if PII is removed from a record, then the student becomes anonymous, privacy is protected, and the resulting de-identified data can be disclosed to anyone (20 U.S.C. § 1232g(b)(1) 2012; 34 C.F.R. § 99.31(b) 2013). FERPA thus equates anonymity—the removal of PII—with privacy.

FERPA's PII definition includes some statutorily defined categories, such as name, address, social security number, and mother's maiden name, but also

...other information that, alone or in combination, is linked or linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty (34 C.F.R. § 99.3, 2013).

In assessing the reasonable certainty of identification, the educational institution is supposed to take into account other data releases that might increase the chance of identification.22 Therefore, an adequate de-identification procedure must remove not only statutorily required elements, but also quasi-identifiers. These quasi-identifiers are pieces of information that can be uniquely identifying in combination with each other or with additional data sources from outside the student records. They are not defined by statute or regulatory guidance from the Department of Education but left up to the educational institution to define.22

The potential for combining quasi-identifiers to uniquely identify individuals is well established. For example, Latanya Sweeney,21 from the School of Computer Science at Carnegie Mellon University, has demonstrated that 87 percent of the U.S. population can be uniquely identified with a reasonable degree of certainty by a combination of ZIP code, date of birth, and gender. These risks are further heightened in open, online learning environments because of the public nature of the activity. As another example, some MOOC students participate in course discussion forums—which, for many courses, remain available online beyond the course end date. Students' usernames are displayed beside their posts, allowing for linkages of information across courses, potentially revealing students who enroll for unique combinations of courses. A very common use of the discussion forums early in a course is a self-introduction thread where students state their age and location, among other PII.

Meanwhile, another source of identifying data is social media. It is conceivable that students could verbosely log their online education on Facebook or Twitter, tweeting as soon as they register for a new course or mentioning their course grade in a Facebook post. Given these external sources, an argument can be made that many columns in the person-course data set that would not typically be thought of as identifiers could qualify as quasi-identifiers.

The regulatory framework defined by FERPA guided our efforts to de-identify the person-course data set for an open release. Removing direct identifiers such as students' usernames and IP addresses was straightforward, but the challenge of dealing with quasi-identifiers was more complicated. We opted for a framework of k-anonymity.20 A data set is k-anonymous if any one individual in the data set cannot be distinguished from at least k-1 other individuals in the same data set. This requires ensuring that no individual has a combination of quasi-identifiers different from k-1 others. If a data set cannot meet these requirements, then the data must be modified to meet k-anonymity, either by generalizing data within cases or suppressing entire cases. For example, if a single student in the data set is from Latvia, we can employ one of these remedies: generalize her location by reporting her as from Europe rather than Latvia, for example; suppress her location information; or suppress her case entirely.
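
As a rough illustration of the k-anonymity requirement (a sketch only, with assumed column names rather than the released data set's exact schema), the check amounts to counting how many rows share each combination of quasi-identifiers:

    from collections import Counter

    QUASI_IDENTIFIERS = ["course_id", "country", "gender", "year_of_birth",
                         "level_of_education", "n_forum_posts"]

    def violating_rows(rows, k=5):
        # Rows whose quasi-identifier combination occurs fewer than k times
        # break k-anonymity and must be generalized or suppressed.
        key = lambda row: tuple(row[q] for q in QUASI_IDENTIFIERS)
        counts = Counter(key(row) for row in rows)
        return [row for row in rows if counts[key(row)] < k]

Any rows this returns would need one of the remedies just described: coarsen a value (Latvia becomes Europe), blank the offending field, or drop the row.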

This begins to illustrate the fundamental tension between generating data sets that meet the requirements of anonymity mandates and advancing the science of learning through public releases of data. Protecting student privacy under the current regulatory regime requires modifying data to ensure that individual students cannot be identified. These modifications can, however, change the data set considerably, raising serious questions about the utility of the open data for replication or novel analysis. The next sections describe our approach to generating a k-anonymous data set, and then examine the consequences of our modifications to the size and nature of the data set.

DE-IDENTIFICATION METHODS

The original, identified person-course data set contained the following information (a rough schema sketch follows the list):

• Information about students (username, IP address, country, self-reported level of education, self-reported year of birth, and self-reported gender).

• The course ID (a string identifying the institution, semester, and course).

• Information about student activity in the course (date and time of first interaction, date and time of last interaction, number of days active, number of chapters viewed, number of events recorded by the edX platform, number of video play events, number of forum posts, and final course grade).

• Four variables computed to indicate level of course involvement (registered: enrolled in the course; viewed: interacted with the courseware at least once; explored: interacted with content from more than 50 percent of course chapters; and certified: earned a passing grade and received a certificate).
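
For readers who find a concrete structure easier to scan, the list above can be rendered roughly as the following Python record; the field names are paraphrased from the list, not the released file's exact column names.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class PersonCourseRecord:
        # Identifiers (removed or replaced during de-identification)
        username: str
        ip_address: str
        # Student demographics (quasi-identifiers)
        country: str
        level_of_education: Optional[str]
        year_of_birth: Optional[int]
        gender: Optional[str]
        # Course ID (institution, semester, and course)
        course_id: str
        # Activity in the course
        first_interaction: Optional[datetime]
        last_interaction: Optional[datetime]
        n_days_active: int
        n_chapters_viewed: int
        n_events: int
        n_video_plays: int
        n_forum_posts: int
        grade: Optional[float]
        # Computed involvement flags
        registered: bool
        viewed: bool
        explored: bool
        certified: bool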

Transforming this person-course data set into a k-anonymous data set that we believed met FERPA guidelines required four steps: 1) defining identifiers and quasi-identifiers; 2) defining the value for k; 3) removing identifiers; and 4) modifying or deleting values of quasi-identifiers from the data set in a way that ensures k-anonymity, while minimizing changes to the data set.

We defined two variables in the original data set as identifiers and six variables as quasi-identifiers. The username was considered identifying in and of itself, so we replaced it with a random ID. IP address was also removed. Four student demographic variables were defined as quasi-identifiers: country, gender, age, and level of education. Course ID was considered a quasi-identifier since students might take unique combinations of courses and because it provides a link between PII posted in forums and the person-course data set. The number of forum posts made by a student was also a quasi-identifier because a determined individual could scrape the content of the forums from the archived courses and then identify users with unique numbers of forum posts.

Once the quasi-identifiers were chosen, we had to determine a value of k to use for implementing k-anonymity. In general, larger values of k require greater changes to de-identify, and smaller values of k leave data sets more vulnerable to re-identification. The U.S. Department of Education offers guidance on the de-identification process in a variety of contexts, but it does not recommend or require specific values of k for specific contexts. In one FAQ, the department's Privacy Technical Assistance Center states that many "statisticians consider a cell size of 3 to be the absolute minimum" and goes on to say that values of 5 to 10 are even safer.15 We chose a k of 5 for our de-identification.

Since our data set contained registrations for 16 courses, registrations in multiple courses could be used for re-identification. The k-anonymity approach would ensure that no individual was uniquely identifiable using the quasi-identifiers within a course, but further care had to be taken to remove the possibility that a registrant could be uniquely identified based upon registering in a unique combination or number of courses. For example, if only three people registered for all 16 courses, then those three registrants would not be k-anonymous across courses, and some of their registration records would need to be suppressed in order to lower the risk of their re-identification.
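
A sketch of that cross-course check (with illustrative names, not the actual release code): collect each registrant's set of courses and flag any exact combination shared by fewer than k people.

    from collections import defaultdict

    def rare_course_combinations(rows, k=5):
        courses_by_user = defaultdict(set)
        for row in rows:
            courses_by_user[row["user_id"]].add(row["course_id"])

        # Count how many users hold each exact combination of courses.
        combo_counts = defaultdict(int)
        for courses in courses_by_user.values():
            combo_counts[frozenset(courses)] += 1

        return [user for user, courses in courses_by_user.items()
                if combo_counts[frozenset(courses)] < k]

Registrations for the flagged users would then be partially suppressed so that no one is identifiable by a unique combination or number of courses.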

The key part of the de-identification process was modifying the data such that no combination of quasi-identifiers described groups consisting of fewer than five students. The two tools employed for this task were generalization, the combining of more granular values into categories (e.g., 1, 2, 3, 4, and 5 become "1-5"); and suppression, the deletion of data that compromises k-anonymity.21 Many strategies for de-identification, including Sweeney's Datafly algorithm, implement both tools with different amounts of emphasis on one technique or the other.18 More generalization would mean that fewer records are suppressed, but the remaining records would be less specific than the original data. A heavier reliance on suppression would remove more records from the data, but the remaining records would be less altered.
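
Schematically, and only as a sketch of the general Datafly-style recipe rather than the authors' implementation, the two tools combine like this: coarsen selected columns first, then drop whatever still fails the k test.

    from collections import Counter

    def deidentify(rows, quasi_identifiers, generalizers, k=5):
        # `generalizers` maps a column name to a function that coarsens its
        # values (generalization); rows that are still not k-anonymous
        # afterward are dropped (suppression).
        generalized = [
            {col: (generalizers[col](val) if col in generalizers else val)
             for col, val in row.items()}
            for row in rows
        ]
        key = lambda row: tuple(row[q] for q in quasi_identifiers)
        counts = Counter(key(row) for row in generalized)
        return [row for row in generalized if counts[key(row)] >= k]

Shifting the emphasis between the two tools amounts to passing more aggressive generalizers (less suppression, coarser surviving data) or fewer (more suppression, more faithful surviving rows).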

The following section illustrates differential tradeoffs between valid research inferences and de-identification methods by comparing two de-identification approaches: one that favors generalization over suppression (hereafter referred to as the generalization emphasis, or GE, method), and one that favors suppression over generalization (hereafter referred to as the suppression emphasis, or SE, method). There are other ways of approaching the problem of de-identification, but these were two that were easily implemented. Our intent is not to discern the dominance of one technique over the other in any general case but rather to show that tradeoffs between anonymity and valid research inferences a) are unavoidable and b) will depend on the method of de-identification.

The SE method used generalization for the names of countries (grouping them into continent/region names for countries with fewer than 5,000 rows) and for the first- and last-event time stamps (grouping them into dates by truncating the hour and minute portion of the time stamps). Suppression was then employed for rows that were not k-anonymous across the quasi-identifying variables. For more information on the specifics of the implementation, please refer to the documentation accompanying the data release.10
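
In that spirit, illustrative SE-style generalizers might look like the following sketch; the 5,000-row threshold comes from the text, while the region lookup and names are placeholders.

    from collections import Counter
    from datetime import datetime

    REGION_OF = {"Latvia": "Europe", "Estonia": "Europe"}  # placeholder lookup

    def make_country_generalizer(rows, threshold=5000):
        # Countries with fewer than `threshold` rows are reported by
        # continent/region; larger countries are left as-is.
        counts = Counter(row["country"] for row in rows)
        def generalize(country):
            if counts[country] >= threshold:
                return country
            return REGION_OF.get(country, "Other")
        return generalize

    def truncate_to_date(timestamp):
        # Drop the hour and minute portion of an event time stamp.
        return timestamp.date() if isinstance(timestamp, datetime) else timestamp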

The GE method generalized year of birth into groups of two (e.g., 1980-1981), and number of forum posts into groups of five for values greater than 10 (e.g., 11-15). Suppression was then employed for rows that were not k-anonymous across the quasi-identifying variables. The generalizations resulted in a data set that needed less suppression than in the SE method, but also reduced the precision of the generalized variables.
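
Illustrative GE-style generalizers follow the same pattern; the bin boundaries below are assumptions, since the text gives only examples such as 1980-1981 and 11-15.

    def generalize_birth_year(year):
        # Group years of birth into two-year bins, e.g. 1980 or 1981 -> "1980-1981".
        if year is None:
            return None
        start = year - (year % 2)
        return f"{start}-{start + 1}"

    def generalize_forum_posts(n_posts):
        # Counts of 10 or fewer stay exact; larger counts fall into bins of
        # five, e.g. 13 -> "11-15".
        if n_posts <= 10:
            return str(n_posts)
        start = ((n_posts - 11) // 5) * 5 + 11
        return f"{start}-{start + 4}"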

Both de-identification processes are more likely to suppress registrants in smaller courses: the smaller a course, the higher the chances that any given combination of demographics would not be k-anonymous, and the more likely that this row would need to be suppressed. Furthermore, since an activity variable (number of forum posts) was included as a quasi-identifier, both methods were likely to remove users who were more active in the forums. Since only 8 percent of students had any posts in the forums at all, and since these students were typically active in other ways, the records of many of the most active students were suppressed.

