How we arrived at a term to describe the potential and peril of today's data deluge.
By Uri Friedman
Uri Friedman is deputy managing editor at Foreign Policy. Before joining FP, he reported for the Christian Science Monitor, worked on corporate strategy for Atlantic Media, helped launch the Atlantic Wire, and covered international affairs for the site. A proud native of Philadelphia, Pennsylvania, he studied European history at the University of Pennsylvania and has lived in Barcelona, Spain and Geneva, Switzerland.
Humans have been whining about being bombarded with too much information since the advent of clay tablets. The complaint in Ecclesiastes that “of making many books there is no end” resonated in the Renaissance, when the invention of the printing press flooded Western Europe with what an alarmed Erasmus called “swarms of new books.” But the digital revolution — with its ever-growing horde of sensors, digital devices, corporate databases, and social media sites — has been a game-changer, with 90 percent of the data in the world today created in the last two years alone. In response, everyone from marketers to policymakers has begun embracing a loosely defined term for today’s massive data sets and the challenges they present: Big Data. While today’s information deluge has enabled governments to improve security and public services, it has also sowed fears that Big Data is just another euphemism for Big Brother.
American statistician Herman Hollerith invents an electric machine that reads holes punched into paper cards to tabulate 1890 census data, revolutionizing the concept of a national head count, which had originated with the Babylonians in 3800 B.C. The device, which enables the United States to complete its census in one year instead of eight, spreads globally as the age of modern data processing begins.
President Franklin D. Roosevelt’s Social Security Act launches the U.S. government on its most ambitious data-gathering project ever, as IBM wins a government contract to keep employment records on 26 million working Americans and 3 million employers. “Imagine the vast army of clerks which will be necessary to keep these records,” Republican presidential candidate Alf Landon scoffs. “Another army of field investigators will be necessary to check up on the people whose records are not clear.”
At Bletchley Park, a British facility dedicated to breaking Nazi codes during World War II, engineers develop a series of groundbreaking mass data-processing machines, culminating in the first programmable electronic computer. The device, named “Colossus,” searches for patterns in intercepted messages by reading paper tape at 5,000 characters per second — reducing a process that had previously taken weeks to a matter of hours. Deciphered information on German troop formations later helps the Allies during their D-Day invasion.
The U.S. National Security Agency (NSA), a nine-year-old intelligence agency with more than 12,000 cryptologists, confronts information overload during the espionage-saturated Cold War, as it begins collecting and processing signals intelligence automatically with computers while struggling to digitize a backlog of records stored on analog magnetic tape in warehouses. (In July 1961 alone, the agency receives 17,000 reels of tape.)
The U.S. government secretly studies a plan to transfer all government records — including 742 million tax returns and 175 million sets of fingerprints — to magnetic computer tape at a single national data center, though the plan is later scrapped amid public concern about bringing “Orwell’s ‘1984’ at least as close as 1970,” as one report puts it. The outcry inspires the 1974 Privacy Act, which places limits on federal agencies’ sharing of personal information.
British computer scientist Tim Berners-Lee proposes leveraging the Internet, pioneered by the U.S. government in the 1960s, to share information globally through a “hypertext” system called the World Wide Web. “The information contained would grow past a critical threshold,” he writes, “so that the usefulness [of] the scheme would in turn encourage its increased use.”
“We are developing a supercomputer that will do more calculating in a second than a person with a hand-held calculator can do in 30,000 years.” –U.S. President Bill Clinton
NASA researchers Michael Cox and David Ellsworth use the term “big data” for the first time to describe a familiar challenge in the 1990s: supercomputers generating massive amounts of information — in Cox and Ellsworth’s case, simulations of airflow around aircraft — that cannot be processed and visualized. “[D]ata sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk,” they write. “We call this the problem of big data.”
After the 9/11 attacks, the U.S. government, which has already dabbled in mining large volumes of data to thwart terrorism, escalates these efforts. Former national security advisor John Poindexter leads a Defense Department effort to fuse existing government data sets into a “grand database” that sifts through communications, criminal, educational, financial, medical, and travel records to identify suspicious individuals. Congress shutters the program a year later due to civil liberties concerns, though components of the initiative are simply shifted to other agencies.
The 9/11 Commission calls for unifying counterterrorism agencies “in a network-based information sharing system” that is quickly inundated with data. By 2010, the NSA’s 30,000 employees will be intercepting and storing 1.7 billion emails, phone calls, and other communications daily. Meanwhile, with retailers amassing information on customers’ shopping and personal habits, Wal-Mart boasts a cache of 460 terabytes — more than double the amount of data on the Internet at the time.
As social networks proliferate, technology bloggers and professionals breathe new life into the “big data” concept. “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear,” Wired’s Chris Anderson writes in “The End of Theory.” Government agencies, some of the United States’ top computer scientists report, “should be deeply involved in the development and deployment of big-data computing, since it will be of direct benefit to many of their missions.”
The Indian government establishes the Unique Identification Authority of India to fingerprint, photograph, and take an iris scan of all 1.2 billion people in the country and assign each person a 12-digit ID number, funneling the data into the world’s largest biometric database. Officials say it will improve the delivery of government services and reduce corruption, but critics worry about the government profiling individuals and sharing intimate details about their personal lives.
U.S. President Barack Obama’s administration launches data.gov as part of its Open Government Initiative. The website’s more than 445,000 data sets go on to fuel websites and smartphone apps that track everything from flights to product recalls to location-specific unemployment, inspiring governments from Kenya to Britain to launch similar initiatives.
Reacting to the global financial crisis, U.N. Secretary-General Ban Ki-moon pledges to create an alert system that captures “real-time data on the impact of the economic crisis on the poorest nations.” The U.N. Global Pulse program has conducted research on how to predict everything from spiraling prices to disease outbreaks by analyzing data from sources such as mobile phones and social networks.
“There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.” –Google CEO Eric Schmidt
Scanning 200 million pages of information, or 4 terabytes of disk storage, in a matter of seconds, IBM’s Watson computer system defeats two human challengers on the quiz show Jeopardy! The New York Times later dubs this moment a “triumph of Big Data computing.”
The Obama administration announces a $200 million Big Data Research and Development Initiative in response to a U.S. government report calling for every federal agency to have a “‘big data’ strategy.” The National Institutes of Health puts a data set of the Human Genome Project in Amazon’s computer cloud, while the Defense Department pledges to develop “autonomous” defense systems that can “learn from experience.” CIA Director David Petraeus, marveling that the “‘digital dust’ to which we have access is being delivered by the equivalent of dump trucks,” discusses a post-Arab Spring agency effort to collect and analyze global social media feeds through cloud computing.
U.S. Secretary of State Hillary Clinton announces a public-private partnership called “Data 2X” to collect statistics on women and girls’ economic, political, and social status around the world. “Data not only measures progress — it inspires it,” she explains. “Once you start measuring problems, people are more inclined to take action to fix them because nobody wants to end up at the bottom of a list of rankings.” Let the Big Data race begin.