
Common Crawl – Free Database Of The Entire Web, Competition For Google

Everyone knows Google. It started out as little more than a website indexer, developed a more efficient algorithm for ranking Web pages, and ultimately built its success on crawling the Web: running software that visits every page it can find in order to build up a vast index of online content.

A nonprofit called Common Crawl is now running its own Web crawler and making a giant copy of the Web freely available. Its database of more than five billion Web pages can be accessed and analyzed by anyone, in the hope that it will inspire new research or online services.

“The Web represents, as far as I know, the largest accumulation of knowledge, and there’s so much you can build on top,” says entrepreneur Gilad Elbaz, who founded Common Crawl. “But simply doing the huge amount of work that’s necessary to get at all that information is a large blocker; few organizations … have had the resources to do that.”

Elbaz says he noticed around five years ago that researchers with new ideas about how to use Web data felt compelled to take jobs at Google because it was the only place they could test those ideas. He says Common Crawl’s data will make it easier for novel ideas to gain traction, both in the world of startups and in academic research.

Common Crawl has so far indexed more than five billion pages, adding up to 81 terabytes of data, which it makes available through Amazon’s cloud computing service. For about $25, a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl’s director.
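To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python of the average amount of data stored per page. It assumes decimal terabytes (10^12 bytes) and exactly five billion pages; since the article says "more than five billion," the true average would be somewhat lower. The function name is purely illustrative and not part of any Common Crawl tooling.

```python
# Back-of-the-envelope estimate based only on the figures quoted above.
# Assumption: "terabyte" means 10**12 bytes (decimal), not 2**40.
TERABYTE = 10**12


def average_page_size_bytes(total_terabytes: float, page_count: int) -> float:
    """Return the average number of bytes stored per crawled page."""
    return total_terabytes * TERABYTE / page_count


if __name__ == "__main__":
    # 81 TB spread over 5 billion pages (a lower bound on the page count).
    avg = average_page_size_bytes(81, 5_000_000_000)
    print(f"Roughly {avg / 1000:.1f} KB per page")
```

On these assumptions, the archive works out to about 16 KB per page on average.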

More details at MIT Technology Review.

Prateek Panda

Prateek is the Founder of TheTechPanda. He's passionate about technology startups and entrepreneurship and enjoys speaking to new founders every day. Prateek has also been consistently regarded as one of the top marketing experts in the region.
