Don't feel like writing a crawler? Someone has already done it for you: they crawled the web and published the output - 3.5 billion web pages
Mo Data stashed this in Data Sources
http://webdatacommons.org/hyperlinkgraph/index.html
This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.
We hope that the graph will be useful for
- researchers developing search algorithms that rank results based on the hyperlinks between pages.
- researchers developing spam detection methods that identify networks of web pages published in order to trick search engines.
- researchers developing graph analysis algorithms, who can use the hyperlink graph to test the scalability and performance of their tools.
- Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.
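To make the first use case concrete, here is a minimal sketch of link-based ranking (a plain PageRank iteration) over a tiny edge list. It assumes, purely for illustration, that each line holds a source node ID and a target node ID separated by whitespace; check the download instructions on the linked page for the actual file layout. The function names and the sample data are hypothetical, and an in-memory approach like this would not scale to the full 128 billion links.

```python
from collections import defaultdict


def parse_edges(lines):
    """Yield (source, target) integer pairs from whitespace-separated lines.

    The two-column layout is an assumption made for this sketch.
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        yield int(parts[0]), int(parts[1])


def pagerank(edges, damping=0.85, iterations=50):
    """Return a dict mapping node -> PageRank score (scores sum to 1)."""
    out_links = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out_links[src].append(dst)
        nodes.update((src, dst))

    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical five-link sample: pages 0, 1 and 3 all link to page 2.
sample = ["0\t1", "0\t2", "1\t2", "2\t0", "3\t2"]
ranks = pagerank(parse_edges(sample))
top = max(ranks, key=ranks.get)
print("highest-ranked page:", top)
```

For the real graph you would stream the files from disk and use a distributed or out-of-core graph framework rather than Python dictionaries.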
Stashed in: Big Data!
8:24 AM Nov 17 2013