Where can you get public data? In this article, I will share a few of my personal favorite public data sources. Combining public data with your own data in analytics projects can add powerful external context.
GDELT: Global Data on Events, Location and Tone
GDELT might just be the most awesome big data, full-text analytics project in the entire world – no kidding. I don’t care that Google technology is behind it. GDELT is an absolutely phenomenal project despite the controversy and growing pains it has encountered. If you love data and databases, you can’t help but appreciate the magnitude of work that must have gone into delivering this wonderful resource for public use.
The GDELT Project is the largest, most comprehensive open database of human society ever created. Its Event Database archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days making it one of the largest public spatio-temporal datasets in existence. GDELT uses sophisticated natural language and data mining algorithms to extract more than 300 categories of events and the networks of people, organizations, locations, themes, and emotions that tie them together. It is also one of the largest real-time streaming news machine translation deployments in the world covering global news in 65 languages. GDELT truly pushes the boundaries of big data, weighing in at over a quarter-billion rows with 59 fields for each record across the entire planet for more than 35 years.
How can you query, explore, model, visualize, interact with, and even forecast from this vast archive? The entire GDELT database is 100% free and open, and you can download the raw text data files for use in your own projects. You totally need to play with this one! For data geeks who want a good read on how this project evolved, check out the excellent white paper presented at the International Studies Association.
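If you want to poke at raw files like these, here is a minimal Python sketch of a first pass. It assumes the downloaded file is tab-delimited text with no header row, so check the GDELT documentation for the actual layout and column order before relying on it. The function name `count_by_column`, the column index, and the sample rows are all made up for illustration; they are not real GDELT records.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column_index):
    """Count rows per distinct value in one column of headerless,
    tab-delimited text. `lines` is any iterable of text lines,
    such as an open file or an io.StringIO."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Skip blank or short rows instead of raising IndexError.
        if len(row) > column_index:
            counts[row[column_index]] += 1
    return counts


# Made-up placeholder rows (id, date, code); NOT real GDELT data.
SAMPLE = "1\t20150101\tAAA\n2\t20150101\tBBB\n3\t20150102\tAAA\n"

if __name__ == "__main__":
    # Assumption: column 1 holds a date-like value in this toy sample.
    for value, n in count_by_column(io.StringIO(SAMPLE), 1).most_common():
        print(value, n)
```

For a real download, you would pass an open file instead of the sample string, for example `open(path, newline="", encoding="utf-8")`, where `path` points at whichever file you unzipped.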
CrunchBase
Now I am giving away my market analysis secrets. I hope I won’t regret it! CrunchBase is an awesome resource for discovering innovative companies and learning more about the people behind them. Founded in 2007 by Mike Arrington, CrunchBase began as a simple database to track startups covered on TechCrunch.com. Today it contains tech market insights, activities, news, investments, funding rounds, IPOs, acquisitions and raised funds from over 400 sources. Unlike the other public data sources here, you will need to apply for an account, tell them how you plan to use their data, and obtain the required permissions.
So Many More to Explore
Here are a bazillion more data sources to explore, without my over-the-top sales pitch. Enjoy!
- The World Bank http://www.worldbank.org/
- Gapminder http://www.gapminder.org/data/
- United Nations Datasets http://data.un.org/
- International Monetary Fund http://www.imf.org/external/data.htm
- Open Spending https://openspending.org/
- CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/
- NOAA/NCEI weather and climate data http://www.ncdc.noaa.gov/
- Data.gov http://www.data.gov/
- Federal Reserve Economic Research https://research.stlouisfed.org/fred2/
- U.S. Federal Statistics Data http://fedstats.sites.usa.gov/data-releases/
- U.S. Bureau of Labor Statistics http://www.bls.gov/
- U.S. Federal Agency Expenditures https://www.usaspending.gov/
- U.S. Energy Information Administration http://www.eia.gov/
- U.S. Census Bureau Data http://www.census.gov/
- U.S. Department of Health & Human Services https://www.healthdata.gov/
- U.S. Department of Education http://www2.ed.gov/
- OSCAR data.gov.uk https://data.gov.uk/dataset/oscar
- European Union Open Data Portal http://open-data.europa.eu/en/data/
- Eurostat European Statistics http://ec.europa.eu/eurostat/data/database
- UC Irvine Machine Learning list of data mining sets
- KD Nuggets list of data mining sets
- BigML list of data mining sets
Miscellaneous and More Lists
- Azure Data Marketplace http://datamarket.azure.com/browse/Data
- Amazon Web Services Public Data http://aws.amazon.com/datasets
- Google Public Data https://www.google.com/publicdata/directory
- Google Trends http://www.google.com/trends/explore
- Freebase People, Places, and Things http://www.freebase.com/
- Datahub 10K+ collection of datasets https://datahub.io/
- GitHub Public Datasets https://github.com/caesar0301/awesome-public-datasets
- Million Song Data Set http://aws.amazon.com/datasets/6468931156960467
- ESPN Sports API http://espn.go.com/apis/devcenter/
- Sports Reference Data http://www.sports-reference.com/