10 Big Data Sites to Watch
The websites that are changing the way we understand everything from higher education to climate patterns.
As Uri Friedman points out in his Anthropology of an Idea on "Big Data" in Foreign Policy’s November issue, the Internet has sparked an information explosion, to the point where the amount of new data created last year alone surpassed an estimated 1.8 trillion gigabytes, growing by a factor of nine in just five years. But while Web-based household names such as Facebook and Google may have pioneered the Big Data revolution by developing new technologies to help store, process, and mine the trillions of bits making up the foundation of their businesses, numerous startups and established technology companies have followed in their footsteps and discovered new ways of mining data.
After surveying a number of data scientists about their favorite Internet destinations and excluding websites of companies developing and selling Big Data technologies, I’ve selected 10 sites that explore this information revolution in interesting and innovative ways. By visiting them, you’ll get a chance not only to play with Big Data but also to learn more about this much-hyped phenomenon and its potential impact on society.
Is the U.S. presidential campaign drivel making you hungry for facts? Data.gov, which was launched by the Obama administration as part of its Open Government Initiative in 2009, offers access to data generated by the Executive Branch of the Federal Government. Enterprising government agencies and private citizens have built on the site’s hundreds of thousands of data sets (and other sources) to help you find everything from the most on-time flight between two airports to the latest product recalls.
I know what you’re thinking. You, too, would like to get in on the Big Data payday, if only you had some "computing for data analysis" skills. Or maybe studying improvisation with a renowned jazz musician is more your speed. Coursera offers these and 196 other online courses from top universities for free. But unlike other initiatives that simply make classroom lectures available on the Internet, Coursera has developed an educational platform at Big Data scale. "We see a future where world-renowned universities serve millions instead of thousands," Coursera co-founder Daphne Koller told ReadWriteWeb. LinkedIn’s Monica Rogati has called Coursera’s approach to assessment (tests are either computer-graded or peer-graded) and use of machine learning to provide feedback to students and instructors "a very interesting application of data science."
Last month, sports fans in the Big Data world could ease their frustration with NFL replacement referees by turning their attention to Kaggle, the people who "make data science a sport." The site allows users to participate in competitions, show off their data science skills, and even win fame and fortune. Or you can bait the data gladiators with a rich data set, a challenging question, and a generous prize. California’s Heritage Provider Network (HPN), for example, recently offered $3 million (and other prizes) to the data science team that can create an algorithm that predicts how many days a patient will spend in a hospital in the next year, based on (anonymized) historical claims data supplied by HPN. If the competition is successful, HPN hopes to use the winning algorithm to both keep people healthy and lower the cost of care.
Do predictions have a future? Philip Tetlock has demonstrated that experts are not very good at making predictions. But the folks at Recorded Future think they have found a substitute for experts: clever algorithms that unlock predictive signals from web chatter. You can sign up for free (premium service is $149 per month) and explore — with the help of nifty visualization tools — a comprehensive index of past, present, and predicted events discussed on the web. Take a look, for instance, at the data the site has collected on protests around the world over the last 12 months and planned demonstrations after the deadly attack on the U.S. Consulate in Benghazi. If all this makes you worried about computers replacing prognosticators, trust my gut-based prediction: We’ll always have pundits.
With over 80 million unique visitors and 1.5 billion job searches per month, Indeed.com knows a lot about where jobs are and which are in high demand, valuable knowledge during tough economic times. Indeed is also the biggest employer-review site in the world with more than 1 million reviews, and people in search of jobs now upload over 1 million new resumes each month. What’s more, the site offers easy access to some of the information it is amassing on employers and job-seekers; users, for example, can play with Indeed’s database to find out whether their skills are in demand or not. The most competitive job market (by city) in the United States today? Washington, D.C.
Which country has the highest ratio of sheep to humans? DataMarket provides the answer to this and other (more or less) urgent questions by housing thousands of data sets with hundreds of millions of facts and figures from a wide range of public and private sources, including Eurostat, the Economist Intelligence Unit, the International Monetary Fund, the United Nations, and the World Bank. You can search it all, visualize the data in a variety of ways, download your findings, and even publish your own data. Oh, and the country with the highest ratio of sheep to humans? It’s New Zealand.
Last July, the U.S. Census Bureau released its first-ever public Application Programming Interface (API), allowing developers to design web and mobile apps that explore and display data from the 2010 Census and the 2006-2010 American Community Survey. "This opens up our statistics beyond traditional uses," Census Bureau Director Robert Groves noted at the time, expressing hope that developers would create applications that show commuting patterns for American cities or provide local governments with socioeconomic statistics on their population (applications are now posted to the Census Bureau’s "App Gallery"). The Bureau, in other words, is transitioning from providing access to data to facilitating its consumption. A government agency that made waves in the late 19th century by using an electric machine to tabulate census data is still finding ways to innovate.
The first presidential debate generated 10.3 million tweets in 90 minutes, "a political-event record," according to Twitter. But even before that Big Data milestone, there was no dearth of Twitter commentary on the presidential race. Four hundred million tweets are posted to the social media site each day, and some of them — to a casual observer, it looks like all of them — mention Barack Obama and Mitt Romney. The Twitter Political Index displays the results of daily sentiment analysis conducted by Twitter’s @gov team to track the Twitterverse’s fluctuating feelings about each candidate.
ManyBills not only lets you search all the legislation passing through Congress but also employs machine-learning algorithms to analyze and categorize different parts of bills. Users help by organizing the legislation into thematic collections and color-coded sections. You can save your collection, share it with others, or even embed it in your blog or website.
Armed with mountains of climate data and machine-learning algorithms, the Climate Corporation predicts the unpredictable weather and offers customized insurance plans to U.S. farmers. Even if you have no desire to farm, you can still delve into the site’s 30 years of historical data on precipitation and temperatures to learn more about the weather where you live.