Don't feel like writing a crawler? Someone has done it for you: crawled the web and published the output, covering 3.5 billion web pages.

Mo Data stashed this in Data Sources

http://webdatacommons.org/hyperlinkgraph/index.html

This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.

We hope that the graph will be useful for researchers who develop

search algorithms that rank results based on the hyperlinks between pages.
spam detection methods that identify networks of web pages published in order to trick search engines.
graph analysis algorithms and want to use the hyperlink graph to test the scalability and performance of their tools.
Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.
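
As a rough illustration of the first item above, link-based ranking can be sketched as a power-iteration PageRank over (source, target) pairs. This is only a minimal sketch, not the method used by the graph's publishers: the `pagerank` function, its `damping` and `iterations` parameters, and the toy page names are all made up for illustration, and a graph with 128 billion links would need a far more scalable tool than an in-memory dictionary.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            # Each page passes an equal share of its rank to every page it links to.
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph with hypothetical page names; not data from the published graph.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
scores = pagerank(toy_edges)
best_page = max(scores, key=scores.get)
```

Here `best_page` is the page that collects the most rank from its in-links, and the scores always sum to 1 because rank is only redistributed, never created or lost.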

Contents

<ul>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc0">1. Levels of Aggregation</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc1">2. Data Formats and Download</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc2">2.1 Index/Arc Format</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc3">2.2 WebGraph Framework Format</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc4">2.3 Pajek NET Format</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc5">3. Extraction Process and Source Code</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc6">4. Topology of the Hyperlink Graph</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc7">5. Other Public Hyperlink Graphs and Web Crawls</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc8">6. License</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc9">7. Feedback</a></li>
<li><a rel="nofollow" target="_blank" href="http://webdatacommons.org/hyperlinkgraph/index.html#toc10">8. Credits</a></li>
</ul>

Mo Data
8:24 AM Nov 17 2013

Stashed in: Big Data!


That is a LOT of web pages. Whew.

Adam Rifkin
9:01 AM Nov 17 2013
