The Peril and Promise of Big Data

The White House released its 79-page report on Big Data on May 1, following a year of scandal and controversy as citizens, activists, and technologists dealt with the revelation that it wasn’t just Facebook and Google collecting data on U.S. citizens, but the NSA too.

With shockwaves from Edward Snowden’s defection continuing to reverberate across Washington, the broader public increasingly sees the U.S. government as having much too much technical capability for collecting, storing, and processing individual-level data, which in turn allows it unprecedented access into the everyday lives of millions of people — including U.S. citizens. Meanwhile, other federal employees process retirements with an archaic system where paper forms are filled out by hand, tens of thousands of government computers still run Windows XP (first released before 9/11 and no longer supported by Microsoft), and even the U.S. president’s leading policy initiative — healthcare — faltered on the shoals of a wrecked website.

There was a time, during the Cold War and up until the early 1990s, when the government defined the bleeding edge of innovation: GPS, voice recognition software, even the Internet itself began in government agencies. But the government has ceded that advantage and it is now largely playing catch-up with the private sector: catching up the IT systems on its workers’ desks, catching up on net-neutrality and other Internet regulations, and now catching up on the value (and hype) of big data. Concerns about privacy and big data pre-date the Snowden leaks, but the revelations about classified NSA programs have cast a very bright light on just how the U.S. government makes use of big data across its vast departments and agencies.

Here we see an odd and fascinating paradox of American government in the 21st century: incredible technical ability for collecting and analyzing digital information in parts of the intelligence community alongside a near-complete inability to leverage even the most basic commercial technologies to handle the more mundane elements of daily governance.

The White House report — written by U.S. President Barack Obama’s Council of Advisors on Science and Technology — is ultimately an accounting of government responsibilities, but it serves as a useful point of departure for a wider array of conversations regarding privacy, opportunity, and limitations of the new world of ubiquitous sensors and permanent digital data exhaust. It clearly and concisely provides some useful throat clearing on a vast topic, defining terms and framing key parts of the debate; the footnotes alone will be of great use to researchers and commentators. The report, however, is narrow in its scope, focusing on public policy and governmental applications rather than a more sweeping review of the myriad commercial innovations and intrusions most Americans encounter on a daily basis; as a result it is far from definitive — teeing up more questions than it answers. In fact, nearly all of its policy recommendations are calls for more research or public comment.

So what, then, if anything, can we take away from this multi-month review?

First, there are real perceived limits to the efficiencies of data integration. The tension between the convenience of data integration and fears of government overreach is palpable. We want the government to know enough to be useful and efficient (and clearly we lambast Washington when it isn’t), but we remain deeply uncomfortable with widespread data sharing. These are the twin promises and perils of the information revolution and they continue to dominate conversations in D.C., Silicon Valley, and points in between. The White House report itself highlights a case in point: the 2013 shooting at the Washington Navy Yard. The assailant gained access to the base by dint of holding an active security clearance, a clearance he retained despite multiple arrests.

Surely the information systems supporting law enforcement and clearance investigations are linked? Alas, no. In a scathing op-ed last year, the Center for Strategic and International Studies’ John Hamre decried the arcane (and seemingly ineffectual) clearance process. Calling it "pathetic," he noted, "I have dedicated 38 years of my life to America’s national security. I know there are spies in our midst. We can improve security and save money simultaneously. But our country needs a system built for the 21st century."

While I for one can personally see the appeal of a system that auto-populates my security clearance paperwork with my home address from the U.S. Postal Service (or Amazon), overseas travel data from Customs and Border Protection, and employer data from LinkedIn (or the IRS), I don’t need the Defense Department to know about my unpaid D.C. parking tickets. And while the White House report documents these underlying tensions, it proposes few real remedies.

Second, analytics don’t make policy. There is incredible opportunity for innovations in public policy based on advances in data collection and integration. The report highlights health care and public education as obvious entry points, including opportunities for online education in remote areas and new capabilities for researching student learning techniques alongside digitized and personalized health prescriptions. But there is nothing inherent in data — big, small, or otherwise — that determines appropriate policy interventions.

Everyone knows "correlation isn’t causation" but maybe we need to learn another mantra: "correlation isn’t policy." Or as Evgeny Morozov noted in the New York Times last year, "a problem tackled through correlations alone lends itself to a very different set of solutions than a problem mapped out in all its causal complexity." The report’s sections on discrimination are instructive here. For example, in Boston "the Street Bump team [developing an app to report road conditions] also identified a potential problem with deploying the app to the public. Because the poor and the elderly are less likely to carry smartphones or download the Street Bump app, its release could have the effect of systematically directing city services to wealthier neighborhoods populated by smartphone owners." The experience of thinking through the likely use-cases, target populations, and resulting data led the team to highlight possible inequalities which would have resulted in skewed policies. Bad data — no matter how big — yields flawed policies. A causal story, complete with deep appreciation of the underlying data generating process, is critical for designing policy interventions.  

Further, neither data nor data scientists can inform the public on how to balance the tradeoffs in these policy choices. Most major policy debates require us to confront questions of equality and opportunity, privacy and security, or individual rights versus societal benefit. We don’t just want to protect ourselves from terrorists; we also want free-flowing travel and trade alongside constitutional privacy protections. If crime and education levels are correlated, how should we design policy responses? If the protests and violence in Ukraine had been clearly forecast, should we have intervened? These are inherently political questions about risks and resources, and even the most sophisticated sensors and algorithms cannot adjudicate these discussions.

This brings us to our final take-away: The information revolution is too important to be left to engineers alone. (It may also be too important to be left to Congress, a related but separate issue.) The report surprised many by calling for new investments in "cross-cutting research that involves not only computer science and mathematics, but also social science, communications and legal disciplines." This means more federal research monies to academic social science departments and law schools, and (hopefully) more resources for providing technical training to those outside of traditional engineering schools. (The National Science Foundation’s IGERT program serves as a great inter-disciplinary model to build upon, as it provides robust, integrative data and technology training for graduate students in a variety of fields, including social science and public health.) To be clear, these are not problems that lawyers or social scientists or philosophers can resolve alone. The task of designing effective policies that adequately protect civil liberties and prevent discrimination while also fostering innovation is a tall order. But there are two reasons investments in non-engineering fields are necessary.

In my experience, engineers fixate on technical challenges. In general, that’s good: you don’t want the engineer building your dam to be distracted by Rawlsian arguments over who should pay for it. But you also wouldn’t let the engineer design the broader conservation strategy, lest you be left with water-tight dams and no wetlands or downstream drinking water. The same tensions play out in the defense and intelligence communities. The technical abilities of the NSA are forcefully impressive, and that’s the problem. Furthermore, when innovators push social and political boundaries, technical challenges aren’t the only limitations. And yet, those are really the only limits most engineers are trained to care about. Or worse, they see legal restrictions as no different from technical ones: merely obstacles to be overcome with cleverness and caffeine.

Conversely, those on Capitol Hill and elsewhere in Washington’s vast policy apparatus are not trained technologists. In fact, many revel in their ignorance. In the debate over the Stop Online Piracy Act, one congressman instructed the House Judiciary Committee: "Let’s bring the nerds in and get this right." The lack of technical savvy amongst staffers and lawyers doesn’t just limit their ability to craft meaningful policies; it also limits their ability to challenge the architects and operators of these systems, preventing a dialog among equals. In that same hearing, Rep. Mel Watt declared, "As one who acknowledged in his opening statement that he was not a nerd and didn’t understand a lot of the technological stuff, I’m not the person to argue about the technology part of this," only to then dismiss the opinions of engineers and network experts who highlighted the risks of SOPA. Conversely, those same lawmakers struggle to properly grill members of the intelligence community when they discuss intercepts and online surveillance.

The White House’s new data report is just the beginning of a long-overdue conversation on how we balance the opportunities for better and more efficient governance promised by big data with the intrusions on privacy and risks of discrimination. Its recommendations, while tepid, would constitute meaningful first steps in informing this discussion. But in order to move forward, political and industry leaders are going to have to move past the easy conversations and can-kicking studies to make some difficult decisions about how to even out the government’s ability to manage and collect complex data, modernizing the Department of Health and Human Services, the IRS, and the Department of Veterans Affairs, while possibly constraining the NSA. And ultimately we’ll have to engage in the painful work to build a consensus on how much efficiency we’re willing to forgo in the name of privacy, and vice versa.