Enron Email Dataset
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form email@example.com whenever possible (i.e., recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to firstname.lastname@example.org when no recipient was specified.
I get a number of questions about this corpus each week, which I am unable to answer, mostly because they deal with preparation issues and such that I just don't know about. If you ask me a question and I don't answer, please don't feel slighted.
I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public. Other datasets are not public because of privacy concerns. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).
The March 2, 2004 version of the dataset and the August 21, 2009 version of the dataset are no longer being distributed. If you are using this dataset for your work, you are requested to replace it with the newer version of the dataset below, or make the appropriate changes to your local copy. A total of four messages have been removed since the original version of the dataset.
- August 21, 2009 Version of dataset (about 423Mb, tarred and gzipped).
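Since the dataset ships as a tarred and gzipped archive, the following is a minimal sketch of how one might stream messages out of it without unpacking everything to disk. It assumes each regular file in the archive holds one message in standard header/body form; the function name `iter_messages` is hypothetical and not part of the distribution.

```python
import tarfile
from email import message_from_bytes


def iter_messages(archive_path):
    """Yield (member_name, Message) for each regular file in a .tar.gz archive.

    Assumption: every regular file in the archive is a single email message.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and links
            handle = archive.extractfile(member)
            if handle is None:
                continue
            raw = handle.read()
            # Parse with the standard library's default (compat32) policy,
            # so header values come back as plain strings.
            yield member.name, message_from_bytes(raw)
```

Calling `iter_messages` on a local copy of the archive yields one parsed message at a time; headers can then be read with, for example, `msg["From"]` and the body with `msg.get_payload()`.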
Research uses of the dataset
This is a partial and poorly maintained list. If I've left your work out, don't take it personally, and feel free to send me a pointer and/or description.
- A paper describing the Enron data was presented at the 2004 CEAS conference.
- Some experiments associated with this data are described on Ron Bekkerman's home page.
- A social-network analysis of the data, including "useful mappings between the MD5 digest of the email bodies and such things as authors, recipients, etc", is available from Andres Corrada-Emmanuel.
- A group from SIMS, UC Berkeley provides search, visualization, and some email that has been labeled by topic and sentiment.
- Jitesh Shetty has put up a database of link-analysis results.
- A version of the dataset with all attachments is available from EDRM.
- Work at the University of Pennsylvania includes a query dataset for email search as well as a tool for generating spelling errors based on the Enron corpus.
- Kimmie Farrington and colleagues published a paper in 2011 that uses the Enron dataset as part of the test corpus for their work on crowdsourcing human- versus computer-generated classification explanations: see Hutton, Amanda, Alexander Liu, and Cheryl Martin. "Crowdsourcing evaluations of classifier interpretability." In Proceedings of the 2012 AAAI Spring Symposium on Wisdom of the Crowd.
- Parakweet has released an open source set of Enron sentence data, labeled for speech acts.
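One of the items above mentions mappings between the MD5 digest of email bodies and their authors and recipients. As a rough illustration of that idea only, the sketch below builds such an index from raw message strings. How the bodies were normalized before hashing in the published analysis is not stated here, so hashing the UTF-8 bytes of the unmodified body is an assumption, and `index_by_body_digest` is a hypothetical name.

```python
import hashlib
from email import message_from_string
from email.utils import getaddresses, parseaddr


def index_by_body_digest(raw_messages):
    """Map the MD5 hex digest of each message body to its authors and recipients."""
    index = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        body = msg.get_payload()
        if not isinstance(body, str):
            continue  # multipart message; not handled in this sketch
        # Assumption: hash the body exactly as parsed, with no normalization.
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        entry = index.setdefault(digest, {"authors": set(), "recipients": set()})
        _, author = parseaddr(msg.get("From", ""))
        if author:
            entry["authors"].add(author)
        for _, addr in getaddresses([msg.get("To", "")]):
            if addr:
                entry["recipients"].add(addr)
    return index
```

Messages that share an identical body collapse onto one digest, which is what makes this kind of index useful for spotting the same text sent by or to several people.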