Why Big Data Missed the Early Warning Signs of Ebola
Hint: Ils ne parlent pas le français. (They don’t speak French.)
With the Centers for Disease Control now forecasting up to 1.4 million new infections from the current Ebola outbreak, what could “big data” do to help us identify the earliest warnings of future outbreaks and track the movements of the current outbreak in real time? It turns out that monitoring the spread of Ebola can teach us a lot about what we missed, and about how data mining, translation, and the non-Western world can help to provide better early warning tools.
Earlier this month, Harvard’s HealthMap service made world headlines for monitoring early mentions of the current Ebola outbreak on March 14, 2014, “nine days before the World Health Organization formally announced the epidemic,” and issuing its first alert on March 19. Much of the coverage of HealthMap’s success has emphasized that its early warning came from using massive computing power to sift out early indicators from millions of social media posts and other informal media.
As one blog put it: “So how did a computer algorithm pick up on the start of the outbreak before the WHO? As it turns out, some of the first health care workers to see Ebola in Guinea regularly blog about their work. As they began to write about treating patients with Ebola-like symptoms, a few people on social media mentioned the blog posts. And it didn’t take long for HealthMap to detect these mentions.”
The U.S. government’s Intelligence Advanced Research Projects Activity (IARPA), which helps fund HealthMap, has used this success story as evidence that the approaches used in its Open Source Indicators program can indeed “beat the news” and provide the earliest warnings of impending disease outbreaks and conflict.
It’s an inspirational story that is a common refrain in the big data world: sophisticated computer algorithms sift through millions of data points and divine hidden patterns indicating a previously unrecognized outbreak, which are then used to alert unsuspecting health authorities and government officials. The problem is that this story isn’t quite true: By the time HealthMap monitored its very first report, the Guinean government had already announced the outbreak and notified the WHO.
The first public international warning of the impending epidemic came not from data mining or social media, but through more traditional channels: a news article in Xinhua’s French-language newswire titled “Guinée: une étrange fièvre fait 8 morts à Macenta” (“Guinea: a strange fever kills 8 in Macenta”), published late in the day (Eastern Standard Time) on March 13. The article reports that “a disease whose nature has not yet been identified has killed 8 people in the prefecture of Macenta in south-eastern Guinea … it manifests itself as a hemorrhagic fever….” This newswire article was, in turn, simply reporting on a press conference held earlier in the day by Dr. Sakoba Keita, director of the Division of Disease Prevention in the Guinea Department of Health. The press conference, broadcast nationally on state television, announced both the outbreak of the unknown hemorrhagic fever and the departure of a team of government medical personnel to the area to investigate it in more detail. The Xinhua article further notes that the government of Guinea had already formally notified the WHO of the unknown outbreak.
Thus, contrary to the narrative that data mining led to an intelligence coup, HealthMap’s earliest signals on March 14 were simply detections of this official government announcement in French. Despite all of the attention and hype paid to social media as a sensor network over human society, mainstream media still plays a critical role as an information stream in many areas of the world. This is not to say that there were no far earlier signals manifested in the myriad social conversations among medical workers and citizens in the region, only that it was not these indicators that HealthMap, or anyone else, detected.
Part of the problem is that the majority of media in Guinea is not published in English, while most monitoring systems today emphasize English-language material. The GDELT Project attempts to monitor and translate a cross-section of the world’s news media each day, yet it is not capable of translating 100 percent of global news coverage. It turns out that GDELT actually monitored the initial discussion of Dr. Keita’s press conference on March 13 and detected a surge in domestic coverage beginning on March 14, the day HealthMap flagged the first media mention (which was, it should be noted, in French). The problem is that all of this media coverage was in French, and it was not among the French material that GDELT was able to translate on those days.
To give an idea of the importance of monitoring across languages, through a grant from Google Translate for Research, GDELT has been feeding a portion of the Portuguese edition of Google News each day through Google Translate for the past year. It turns out that upwards of 70 percent of the events recorded in Portuguese-language news do not appear in English-language news anywhere else in the world. Further, a large portion of these events relate to situations outside of Portugal and Brazil, including former colonial states in Africa, as the map below shows. Increasing our ability to process all of this material would yield tremendous gains in monitoring local media of the sort that provided the first indicators of the Ebola outbreak.
On a panel I served on last week, we were asked to name what we thought was the greatest challenge to better understanding the world. A representative of a government-funded agency stated that, in his program’s view, it was the need for better computer science tools to extract patterns from data. That’s a worthwhile goal, but not if the data set is incomplete. While there is certainly great need for better data tools, even if one could perfectly extract every piece of information from the New York Times each day, it would likely not yield a picture of the emerging Ebola outbreak any more detailed than what American government officials already have. Instead, what we truly need is better, more local data (and expanded tools that can translate and process that material) to allow us to listen more closely to local communities and understand them.
There is a singular preoccupation in government today with forecasting the future. Yet, we must be careful that among investments of hundreds of millions of dollars in forecasting systems that have yet to produce useful results, we don’t miss the early warning signs of emerging pandemics that are quite literally broadcast for us on national television. Instead of trying to beat the international news through massive investments in computer models, we should instead be focusing on listening better.
Clarification: This article has been updated to clarify HealthMap’s initial flagging of the March 14 media mention of the Ebola outbreak.