Where can you get public data? In this article, I will share a few of my personal favorite public data sources. Combining public data with the data in your own analytics projects can add powerful external context.

GDELT: Global Database of Events, Language, and Tone


GDELT might just be the most awesome big data, full-text analytics project in the entire world – no kidding. GDELT is an absolutely phenomenal project despite the controversy and growing pains it has encountered. If you love data and databases, you can’t help but appreciate the magnitude of work that must have been involved to deliver this wonderful public data resource.

The GDELT Project is the largest, most comprehensive open database of human society ever created. Its Event Database archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, making it one of the largest public spatio-temporal datasets in existence. GDELT uses sophisticated natural language and data mining algorithms to extract more than 300 categories of events and the networks of people, organizations, locations, themes, and emotions that tie them together. It is also one of the largest real-time streaming news machine translation deployments in the world, covering global news in 65 languages. GDELT truly pushes the boundaries of big data, weighing in at over a quarter-billion rows, with 59 fields per record, covering the entire planet over more than 35 years.

How can you query, explore, model, visualize, interact with, and even forecast using this vast archive? The entire GDELT database is 100% free and open, and you can download the raw text data files for use in your own projects. You totally need to play with this one! For data geeks who want a good read on how this project evolved, check out the excellent white paper presented at the International Studies Association.
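To give a feel for a first pass over a downloaded file, here is a minimal Python sketch. It assumes the raw files are tab-delimited text with no header row, so check the GDELT documentation for the actual layout. The helper name `count_by_column`, the column position, and the sample rows are all made up for illustration and are not real GDELT data.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column_index, delimiter="\t"):
    """Count how often each value appears in one column of delimited text."""
    # QUOTE_NONE treats quote characters as ordinary text, which is a
    # reasonable default for raw tab-delimited exports.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if len(row) > column_index:  # skip blank or short rows
            counts[row[column_index]] += 1
    return counts


# Tiny made-up sample standing in for a downloaded file (NOT real GDELT data).
# Column 2 is assumed, for illustration only, to hold a country code.
sample = io.StringIO(
    "1\t20150101\tUSA\t040\n"
    "2\t20150101\tFRA\t042\n"
    "3\t20150102\tUSA\t040\n"
)

counts = count_by_column(sample, column_index=2)
print(counts.most_common())  # [('USA', 2), ('FRA', 1)]
```

To try this on a real download, you could pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, and point `column_index` at whichever field you care about. Because it streams one row at a time, it should cope with files too large to load into memory at once.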



CrunchBase

Now I am giving away my market analysis secrets. I hope I won’t regret it! CrunchBase is an awesome resource for discovering innovative companies and learning more about the people behind them. Founded in 2007 by Mike Arrington, CrunchBase began as a simple database to track startups covered on TechCrunch.com. Today it contains tech market insights, activities, news, investments, funding rounds, IPOs, acquisitions, and raised funds drawn from over 400 sources. Unlike the other public data sources covered here, you will need to apply for an account, tell them how you plan to use their data, and obtain the required permissions.

So Many More to Explore

Here are a bazillion more data sources to explore, without my over-the-top sales pitch. Enjoy!



Data Mining

Miscellaneous and More Lists