Think Again: Big Data
Why the rise of machines isn't all it's cracked up to be.
“Big data” is the jargon du jour, the tech world’s one-size-fits-all (so long as it’s triple XL) answer to solving the world’s most intractable problems. The term is commonly used to describe the art and science of analyzing massive amounts of information to detect patterns, glean insights, and predict answers to complex questions. It might sound a bit dull, but from stopping terrorists to ending poverty to saving the planet, there’s no problem too big for the evangelists of big data.
“The benefits to society will be myriad, as big data becomes part of the solution to pressing global problems like addressing climate change, eradicating disease, and fostering good governance and economic development,” crow Viktor Mayer-Schönberger and Kenneth Cukier in their modestly titled Big Data: A Revolution That Will Transform How We Live, Work, and Think.
So long as there are enough numbers to crunch — whether it’s data from your iPhone, grocery store purchases, online dating profile, or, say, the anonymized health records of an entire country — the insights that can be gleaned from our computing ability to decode this raw data are innumerable. Even Barack Obama’s administration has jumped with both feet on the bandwagon, releasing on May 9 a “groundbreaking” trove of “previously inaccessible or unmanageable data” to entrepreneurs, researchers, and the public.
“One of the things we’re doing to fuel more private-sector innovation and discovery is to make vast amounts of America’s data open and easy to access for the first time in history. And talented entrepreneurs are doing some pretty amazing things with it,” said President Obama.
But is big data really all it’s cracked up to be? Can we trust that so many ones and zeros will illuminate the hidden world of human behavior? Foreign Policy invited Kate Crawford of the MIT Center for Civic Media to go behind the numbers. —Ed.
“With Enough Data, the Numbers Speak for Themselves.”
Not a chance. The promoters of big data would like us to believe that behind the lines of code and vast databases lie objective and universal insights into patterns of human behavior, be it consumer spending, criminal or terrorist acts, healthy habits, or employee productivity. But many big-data evangelists avoid taking a hard look at the weaknesses. Numbers can’t speak for themselves, and data sets — no matter their scale — are still objects of human design. The tools of big-data science, such as the Apache Hadoop software framework, do not immunize us from skews, gaps, and faulty assumptions. Those factors are particularly significant when big data tries to reflect the social world we live in, yet we can often be fooled into thinking that the results are somehow more objective than human opinions. Biases and blind spots exist in big data as much as they do in individual perceptions and experiences. Yet there is a problematic belief that bigger data is always better data and that correlation is as good as causation.
For example, social media is a popular source for big-data analysis, and there’s certainly a lot of information to be mined there. Twitter data, we are told, informs us that people are happier when they are farther from home and saddest on Thursday nights. But there are many reasons to ask questions about what this data really reflects. For starters, we know from the Pew Research Center that only 16 percent of online adults in the United States use Twitter, and they are by no means a representative sample — they skew younger and more urban than the general population. Further, we know many Twitter accounts are automated response programs called “bots,” fake accounts, or “cyborgs” — human-controlled accounts assisted by bots. Recent estimates suggest there could be as many as 20 million fake accounts. So even before we get into the methodological minefield of how you assess sentiment on Twitter, let’s ask whether those emotions are expressed by people or just automated algorithms.
But even if you’re convinced that the vast majority of tweeters are real flesh-and-blood people, there’s the problem of confirmation bias. For example, to determine which players in the 2013 Australian Open were the “most positively referenced” on social media, IBM conducted a large-scale analysis of tweets about the players via its Social Sentiment Index. The results determined that Victoria Azarenka was top of the list. But many of those mentions of Azarenka on Twitter were critical of her controversial use of medical timeouts. So did Twitter love her or hate her? It’s difficult to trust that IBM’s algorithms got it right.
Once we get past the dirty-data problem, we can consider the ways in which algorithms themselves are biased. News aggregator sites that use your personal preferences and click history to funnel in the latest stories on topics of interest also come with their own baked-in assumptions — for example, assuming that frequency equals importance or that the most popular news stories shared on your social network must also be interesting to you. As an algorithm filters through masses of data, it is applying rules about how the world will appear — rules that average users will never get to see, but that powerfully shape their perceptions.
Some computer scientists are moving to address these concerns. Ed Felten, a Princeton University professor and former chief technologist at the U.S. Federal Trade Commission, recently announced an initiative to test algorithms for bias, especially those that the U.S. government relies upon to assess the status of individuals, such as the infamous “no-fly” list that the FBI and Transportation Security Administration compile from the numerous big-data resources at the government’s disposal and use as part of their airport security regimes.
“Big Data Will Make Our Cities Smarter and More Efficient.”
Up to a point. Big data can provide valuable insights to help improve our cities, but it can only take us so far. Because not all data is created or even collected equally, there are “signal problems” in big-data sets — dark zones or shadows where some citizens and communities are overlooked or underrepresented. So big-data approaches to city planning depend heavily on city officials understanding both the data and its limits.
For example, Boston’s Street Bump app, which collects smartphone data from drivers going over potholes, is a clever way to gather information at low cost, and more apps like it are emerging. But if cities begin to rely on data that only come from citizens with smartphones, it’s a self-selecting sample — it will necessarily have less data from those neighborhoods with fewer smartphone owners, which typically include older and less affluent populations. While Boston’s Office of New Urban Mechanics has made concerted efforts to address these potential data gaps, less conscientious public officials may miss them and end up misallocating resources in ways that further entrench existing social inequities. One need only look to the 2012 Google Flu Trends miscalculations, which significantly overestimated annual flu rates, to realize the impact that relying on faulty big data could have on public services and public policy.
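To see how a self-selecting sample can skew decisions, consider a minimal sketch in Python. Everything in it is hypothetical: the two neighborhoods, their pothole counts, the smartphone ownership shares, and the rule of assigning repair crews in proportion to app reports are invented for illustration and do not describe how Street Bump or the city of Boston actually works.

```python
# Hypothetical neighborhoods: the same number of real potholes, but
# different smartphone ownership. All figures are invented for illustration.
neighborhoods = {
    "A": {"potholes": 100, "smartphone_share": 0.8},
    "B": {"potholes": 100, "smartphone_share": 0.4},
}


def expected_reports(neighborhood):
    """Potholes that get reported if only smartphone owners can report."""
    return neighborhood["potholes"] * neighborhood["smartphone_share"]


def allocate_crews(total_crews, neighborhoods):
    """Split crews in proportion to app reports, not to real potholes."""
    reports = {name: expected_reports(n) for name, n in neighborhoods.items()}
    total = sum(reports.values())
    return {name: total_crews * r / total for name, r in reports.items()}


crews = allocate_crews(12, neighborhoods)
for name, share in crews.items():
    print(f"Neighborhood {name}: {share:.1f} of 12 crews")
```

Under these assumptions both neighborhoods have identical road damage, yet the one with fewer smartphones receives half as many crews. The data is not wrong about what was reported; it is silent about who could not report.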
The same is true for “open government” initiatives that post data about public sectors online, such as Data.gov and the White House’s Open Government Initiative. More data won’t necessarily improve any functions of government, including transparency or accountability, unless there are mechanisms to allow engagement between the public and their institutions, not to mention aid the government’s ability to interpret the data and respond with adequate resources. None of that is easy. In fact, there just aren’t many skilled data scientists around yet. Universities are currently scrambling to define the field, write curricula, and meet demand.
Human rights groups are also looking to use big data to help understand conflicts and crises. But here too there are questions about the quality of both the data and the analysis. The MacArthur Foundation recently awarded an 18-month, $175,000 grant to Carnegie Mellon University’s Center for Human Rights Science to investigate how big-data analytics are changing human rights fact-finding, such as through development of “credibility tests” to sort alleged human rights violations posted to sites like Crisis Mappers, Ushahidi, Facebook, and YouTube. The director of the center, Jay D. Aronson, notes that there are “serious questions emerging about the use of data and the responsibilities of academics and human rights organizations to its sources. In many cases, it is unclear whether the safety and security of the people reporting the incidents is enhanced or threatened by these new technologies.”
“Big Data Doesn’t Discriminate Between Social Groups.”
Hardly. Another promise of big data’s alleged objectivity is that there will be less discrimination against minority groups because raw data is somehow immune to social bias, allowing analysis to be conducted at a mass level and thus avoiding group-based discrimination. Yet big data is often deployed for exactly this purpose — to segregate individuals into groups — because of its ability to make claims about how groups behave differently. For example, a recent paper points to how scientists are allowing their assumptions about race to shape their big-data genomics research.
As Alistair Croll writes, the potential use of big data for price discrimination, a practice historically known as “redlining,” raises serious civil rights concerns. Under the rubric of “personalization,” big data can be used to isolate specific social groups and treat them differently, something that laws often prohibit businesses or humans from doing explicitly. Companies can choose to show online ads for a credit card offer to people who are most attractive to banks in terms of household income or credit history, leaving others completely unaware that a particular offer is available. Google even has a patent to dynamically price content: So if your past buying history indicates you are more likely to pay top dollar for shoes, your starting price the next time you shop for footwear online might be considerably higher. Now employers are trying to apply big data to human resources, assessing how to make employees more productive, all by analyzing their every click and tap. Employees may have no idea how much data is being gathered about them or how it is being used.
Discrimination can also take on other demographic dimensions. For example, the New York Times reported that Target started compiling analytic profiles of its customers years ago; it now has so much data on purchasing trends that it can predict under certain circumstances if a woman is pregnant with an 87 percent confidence rate, simply based on her shopping history. While the Target statistician in the article emphasizes how this will help the company improve its marketing to expectant parents, one can also imagine such determinations being used in other ways to discriminate that might have serious ramifications for social equality and, of course, privacy.
And recently, a big-data study from Cambridge University of 58,000 Facebook “likes” was used to predict very sensitive personal information about users, such as sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parents’ marital status, age, and gender. As journalist Tom Foremski observes of the study: “Easy access to such highly sensitive information could be used by employers, landlords, government agencies, educational institutes, and private organizations, in ways that discriminate and punish individuals. And there’s no way [to] fight it.”
Finally, consider the implications in the context of law enforcement. From Washington, D.C., to New Castle County, Delaware, police are turning to “predictive policing” models of big data in the hopes that they will shine investigative light on unsolved cases and even help prevent future crimes. However, focusing police activity on particular big data-detected “hot spots” runs the danger of reinforcing stigmatized social groups as likely criminals and institutionalizing differential policing as a standard practice. As one police chief has written, although predictive policing algorithms explicitly avoid categories such as race or gender, the practical result of using such systems without sensitivity to differential impact can be “a recipe for deteriorating community relations between police and the community, a perceived lack of procedural justice, accusations of racial profiling, and a threat to police legitimacy.”
“Big Data Is Anonymous, so It Doesn’t Invade Our Privacy.”
Flat-out wrong. While many big-data providers do their best to de-identify individuals from human-subject data sets, the risk of re-identification is very real. Cell-phone data, en masse, may seem fairly anonymous, but a recent study on a data set of 1.5 million cell-phone users in Europe showed that just four points of reference were enough to individually identify 95 percent of people. There is a uniqueness to the way that people make their way through cities, the researchers observed, and given how much can be inferred by the large number of public data sets, this makes privacy a “growing concern.” We already know, thanks to academics like Alessandro Acquisti, how to predict an individual’s Social Security number simply by cross-analyzing publicly available data.
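The logic behind that kind of result can be sketched with a toy example in Python. This is not the researchers' method or data: the four users, their (location, hour) points, and the helper functions `matching_users` and `share_unique` are all hypothetical, and the measure used here (whether any combination of k points from a person's own trace matches nobody else) is a simplified, best-case-for-the-snoop stand-in for the study's analysis.

```python
from itertools import combinations

# Hypothetical "anonymized" traces: each user ID maps to a set of
# (location, hour) points. No names are stored anywhere.
traces = {
    "u1": {("north", 8), ("center", 12)},
    "u2": {("north", 8), ("south", 12)},
    "u3": {("east", 8), ("center", 12)},
    "u4": {("east", 8), ("south", 12)},
}


def matching_users(known_points, traces):
    """Return every user whose trace contains all of the known points."""
    known = set(known_points)
    return sorted(user for user, points in traces.items() if known <= points)


def share_unique(traces, k):
    """Share of users singled out by at least one k-point subset of their trace."""
    unique = 0
    for user, points in traces.items():
        if any(
            matching_users(combo, traces) == [user]
            for combo in combinations(sorted(points), k)
        ):
            unique += 1
    return unique / len(traces)


print(share_unique(traces, 1))  # one known point
print(share_unique(traces, 2))  # two known points
```

In this invented data set, every single point is shared by two people, so one observation identifies nobody. Two observations together identify everyone. Stripping names does little when the pattern of movement is itself the fingerprint.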
But big data’s privacy problem goes far beyond standard re-identification risks. Currently, medical data sold to analytics firms has a risk of being used to track your identity. There is a lot of chatter about personalized medicine, where the hope is that drugs and other therapies will be so individually targeted that they work to heal an individual’s body as if they were made from that person’s very own DNA. It’s a wonderful prospect in terms of improving the power of medical science, but it’s fundamentally reliant on personal identification at cellular and genetic levels, with high risks if it is used inappropriately or leaked. But despite the rapid growth in personal health data collectors such as RunKeeper and Nike+, practical use of big data to improve health-care delivery is still more aspiration than reality.
Other kinds of intimate information are being collected by big-data energy initiatives, such as the Smart Grid. This effort looks to improve the efficiency of energy distribution to our homes and businesses by analyzing enormous data sets of consumer energy usage. The project has great promise but also comes with great privacy risks. It can predict not only how much energy we need and when we need it, but also minute-by-minute information on where we are in our homes and what we are doing. This can include knowing when we are in the shower, when our dinner guests leave for the night, and when we turn off the lights to go to sleep.
Of course, such highly personal big-data sets are prime targets for hackers or leakers. WikiLeaks has been at the center of some of the most significant big-data releases of recent times. And as we saw recently with the massive data leak from Britain’s offshore financial industry, the 1 percenters of the world are just as vulnerable as everyone else to having their personal data made public.
“Big Data Is the Future of Science.”
Partly true, but it has some growing up to do. Big data offers new roads for science, without a doubt. We only need look to the discovery of the Higgs boson particle, a result of the largest grid-computing project in history, with CERN using the Hadoop Distributed File System to manage all the data. But unless we recognize and address some of big data’s inherent weaknesses in reflecting on human lives, we may make major public policy and business decisions based on incorrect assumptions.
To address this, data scientists are starting to collaborate with social scientists, who have a long history of critically engaging with data: assessing sources, the methods of data collection, and the ethics of use. Over time, this means finding new ways to combine big-data approaches with small-data studies. This goes well beyond advertising and marketing approaches like focus groups or A/B testing (in which two versions of a design or outcome are shown to users in order to see which variation proves more effective). Rather, new hybrid methods can ask questions about why people do things, beyond just tallying up how often something occurs. That means drawing on sociological analysis and deep ethnographic insight as well as information retrieval and machine learning.
Technology companies recognized early on that social scientists could give them greater insight into how and why people engage with their products, such as when Xerox’s PARC hired pioneering anthropologist Lucy Suchman. The next stage will be a richer collaboration between computer scientists, statisticians, and social scientists of many stripes — not just to test the findings of each other’s work, but to ask fundamentally different kinds of questions, with greater rigor.
Given the immense amount of information collected about us every day (including Facebook clicks, GPS data, health-care prescriptions, and Netflix queues), we must decide sooner rather than later whom we can trust with that information, and for what purpose. We can’t escape the fact that data is never neutral and that it’s difficult to anonymize. But we can draw on expertise across different fields in order to better recognize biases, gaps, and assumptions, and to rise to the new challenges to privacy and fairness.