Why We Can’t Just Read English Newspapers to Understand Terrorism

And how Big Data can help.


A few weeks ago, the White House convened the Summit on Countering Violent Extremism (CVE), a three-day event intended to “discuss concrete steps” that the United States and its allies can take to mitigate violent extremism around the globe. Yet, the sobering reality is that despite 75 years of monitoring the world’s media and spending hundreds of millions of dollars on global monitoring over just the last few years, much of the U.S. government’s understanding of patterns of violent extremism comes from reading Western English-language newspapers. Meanwhile, its intelligence agencies hoover up extremist communications, but find their archives of little use: it’s hard not to shake one’s head upon reading that analysts assigned to the Pakistani terrorist group Lashkar-e-Taiba complained that “most of [the intercepted communications] is in Arabic or Farsi, so I can’t make much of it.”

How can Washington hope to counter violent extremism when the analysts assigned to monitor extremist communications can’t even understand a word of what they are reading?

Helping to kick off the CVE summit was a presentation by William Braniff offering an overview of global terrorism trends from a dataset known as the Global Terrorism Database (GTD). The GTD dataset, created by the University of Maryland, is supported by the U.S. Department of Defense and is widely cited in government reports and the news media, from the New York Times to the Washington Post to CNN. Yet, while GTD is heavily utilized as a definitive source of information on global terrorism, it is actually based nearly exclusively on English-language news sources.

When the Los Angeles Times is cited among the primary source material on an Islamic State kidnapping in al-Bab, Syria, and the Chicago Tribune is listed as a primary source for a grenade attack on a market in Rajuri, India, one must question the comprehensiveness of GTD’s data. In fact, according to GTD, of the top 10 countries with the most terrorist attacks in 2013 (its most recent reporting year), just one features English as its primary language (Nigeria) and just two more (India and the Philippines) feature English among their official languages. Collectively, these three English-speaking countries account for just 17 percent of attacks and 16 percent of fatalities due to terrorism in 2013 in the top 10 countries. It’s difficult to understand GTD’s emphasis on English-language Western outlets in place of native-language local media outlets in the countries where the other 83 percent of those attacks allegedly take place.

Yet, GTD is far from alone: this focus on Western English-language sources pervades Washington’s efforts to monitor and understand the world. The Defense Advanced Research Projects Agency’s flagship $125 million Worldwide Integrated Crisis Early Warning System (W-ICEWS) program, which forecasts global events and instability, is based almost exclusively on English news outlets with a small amount of human-translated material. It has achieved an accuracy level of less than 25 percent. The small amount of translated material that W-ICEWS does incorporate is drawn primarily from the U.S. Open Source Center, which is responsible for monitoring and translating global news and social media.

Yet, even as CIA Director John Brennan announces his intentions to expand the agency, the Open Source Center still draws nearly half its material from English-language outlets, relies primarily on European news agencies for coverage of Africa, and has little coverage of Latin America. In fact, it monitors considerably more news from Russia than from all of Latin America, Spain, and Portugal combined. Its coverage of languages from regions at elevated risk of terrorism is particularly poor: The Open Source Center’s total monitoring volume of Bengali (the official language of Bangladesh, which the 2014 Global Terrorism Index places at high risk of terrorism) averaged just a single translated article per week for over 20 years.

There will never be sufficient human translators to monitor the combined output of all the world’s media in all the world’s languages each day. This is where machine translation, as imperfect as it may be, offers tremendous opportunity. While machine translation is still highly error-prone, it is capable of scaling almost without limit, processing the entirety of all globally accessible media in real time. Indeed, the same week as the president’s CVE summit, the GDELT Project announced one of the world’s largest deployments of streaming machine translation, translating into English the entirety of global news that it monitors in 65 languages, representing 98.4 percent of its daily non-English monitoring volume. Within 15 minutes of monitoring a breaking news report anywhere in the world, GDELT has translated it and processed it to identify events, counts, quotes, people, organizations, locations, themes, emotions, relevant imagery, video, and embedded social media posts. Leveraging the effectively unlimited capacity of Google Cloud, I built the entire system in under two and a half months as a “nights and weekends” project.

The ability to reach across 65 languages, coupled with a high-resolution local media inventory of the world, means that, unlike the Pentagon’s efforts, GDELT is able to operate across the world’s languages in real time, rather than being limited to a small cadre of Western English-language outlets to understand unfolding events in a remote corner of the world. For some languages like Russian and Estonian, GDELT uses translation models contributed by some of the leaders in the field and achieves accuracy on par with or surpassing that of Google Translate on the material it monitors. For other languages, especially those with few available computerized linguistic resources like Swahili, it is still able to robustly recognize locations, major person and organization names, themes, and key event types, but is often less able to discern slight nuance and sarcasm, though its dictionaries are designed to grow daily as it learns from open datasets like Wikipedia’s multilingual information. Machine translation cannot yet compete with the accuracy of expert human translation, but even at its worst, GDELT can flag an article as discussing a large violent protest in a specific city, along with the major ethnic, religious, social, and political groups and leaders mentioned, allowing it to be forwarded to a human analyst for further review. After all, the errors of machine translation can be fixed in post-processing, but it isn’t possible to fix or filter what hasn’t been monitored and flagged in the first place.
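
To make that flag-then-review idea concrete, here is a minimal Python sketch. It is not GDELT’s actual code or schema: the record fields, the theme labels, and the rule itself are hypothetical, chosen only to show how a rough machine translation can still route an article to a human analyst.

```python
from dataclasses import dataclass, field


@dataclass
class TranslatedArticle:
    """Hypothetical record for one machine-translated news article."""
    url: str
    source_language: str
    themes: set = field(default_factory=set)      # e.g. {"PROTEST"}
    locations: list = field(default_factory=list)  # e.g. ["Mombasa"]


# Hypothetical theme labels an analyst might care about.
REVIEW_THEMES = {"PROTEST", "VIOLENCE"}


def needs_human_review(article):
    """Flag an article that mentions a theme of interest and at least one place.

    Nuance may be lost in translation, but themes and locations are the
    pieces the text says survive even poor translations.
    """
    has_theme = bool(article.themes & REVIEW_THEMES)
    has_place = bool(article.locations)
    return has_theme and has_place


def triage(articles):
    """Return only the articles that should be forwarded to an analyst."""
    return [a for a in articles if needs_human_review(a)]
```

The rule is deliberately permissive: a false alarm costs an analyst a few minutes, while an article that is never flagged cannot be reviewed at all.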

Moreover, as the accuracy of machine translation continues to rapidly improve, and tools and training datasets become available for an ever-increasing number of languages, GDELT’s algorithms will be regularly upgraded. The goal of GDELT’s mass translation initiative is to demonstrate the feasibility of mass translation of global information in real time and to offer a living test-bed that can leverage new technologies and approaches for mass translation.

The map below illustrates why it is so critical to look across languages. All global news coverage monitored by GDELT from Feb. 19 through March 1 was scanned for mentions of geographic locations in Yemen. Locations mentioned in English-language news coverage are colored in blue, while locations mentioned in the 65 other languages recognized by GDELT are colored in red. Larger dots indicate greater volume of coverage mentioning that location. English coverage of Yemen largely focuses on several small clusters of locations around major cities, which is a common artifact of English coverage of the non-Western world, while the media of other languages (especially Arabic) discuss a much broader range of locations across the country. Understanding the current situation in Yemen beyond events in Sanaa or Aden clearly requires turning to local press.

Figure 1 - Locations mentioned in global news coverage of Yemen 2/19/2015 – 3/1/2015 (Blue = English news media, Red = Non-English news media)
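
The comparison behind a map like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format, a list of (language code, place name) pairs, and the function names are hypothetical and are not GDELT’s data model.

```python
from collections import Counter


def tally_locations(mentions):
    """Count location mentions separately for English and non-English coverage.

    `mentions` is an iterable of (language_code, location_name) pairs,
    a hypothetical format used only for this sketch.
    """
    english, other = Counter(), Counter()
    for lang, place in mentions:
        bucket = english if lang == "en" else other
        bucket[place] += 1
    return english, other


def only_in_other(english, other):
    """Places that appear in non-English coverage but never in English."""
    return sorted(set(other) - set(english))
```

The counts correspond to the dot sizes on the map, and the places returned by `only_in_other` are the red dots with no blue counterpart.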


Moreover, the emotional and thematic contextualization of ongoing events in the local press can yield critical insights: in the case of Russia, while Western governments paint Moscow as the aggressor in Ukraine, a recent poll suggests 81 percent of Russians have a negative view of the United States, the highest of the post-Soviet era, while Vladimir Putin’s approval rating sits at 86 percent. Much of this support is due to careful stage managing of the domestic media environment, and it takes an understanding of the Russian psyche to fully grasp the root underpinnings of Putin’s increasing popularity, even as sanctions devastate his country’s economy.

Similarly, much of the recent conversation on countering extremism has focused on the “root causes” of how individuals become radicalized or join extremist groups. In a much-maligned February interview with MSNBC (which led to the #JobsForISIS hashtag on Twitter), the State Department’s Marie Harf cited “a lack of opportunity for jobs” as the key root cause for radicalization, a position which the president himself expounded upon at length in his own speech later that week. As Foreign Policy’s South Asia Channel editor and CNN commentator Peter Bergen has noted, however, many of the extremists gracing international front pages, from Osama bin Laden to Umar Farouk Abdulmutallab, Mohamed Atta to “Jihadi John,” have come from relatively wealthy and privileged backgrounds, not abject poverty. Yet even Bergen acknowledges that the foot soldiers of the Islamic State often come from far more modest backgrounds, and as the Washington Post’s Adam Taylor writes, even a middle-class upbringing does not always equate to perceptions of infinite opportunity.

The truth is that there is no single “root cause” of extremism. Much as there is no single view on gun control or abortion in the United States, we must accept a far more fragmented and nuanced understanding of world views. Some may indeed join the Islamic State due to a perceived lack of opportunity, while others may join out of religious beliefs. Acknowledging that there is a continuum of rationales allows for the development of multiple tailored responses that transcend the over-simplicity of political soundbites and more precisely target culture-specific narratives.

Given the enormous complexity of the world’s cultures, how can the U.S. government even begin to interact with the views and beliefs associated with elevated levels or risk of extremism? In our chapter of a report on megacities published last spring and prefaced by Lt. Gen. Michael Flynn, Charles Ehlschlaeger and I noted that linguistic and cultural barriers form the primary obstacles to understanding a developing world that is “often characterized by complex tribal, ethnic, linguistic, religious, familial, and societal affiliations and interconnections” that are “foreign” to most Western analysts. In short, merely being able to read the language of an extremist group does not automatically provide the necessary insight to understand the underlying world views of that group.

Last fall, in a collaboration with Timothy Perkins and Chris Rewerts of the U.S. Army Corps of Engineers that was published in the journal D-Lib, we demonstrated that it was possible to use large-scale data mining to construct a socio-cultural index over a region of particular interest to CVE: Africa and the Middle East. More than 21 billion words of academic literature on Africa and the Middle East, the entirety of JSTOR, all unclassified/declassified reports from the U.S. Defense Technical Information Center (DTIC), and the Internet Archive’s 1.6 billion archived PDFs were computer-processed to identify all mentions of social, religious, and ethnic groups; locations; major themes; and citations. The index can be used to map the geographic footprint of topics, list the thematic grievances most associated with conflict between ethnic groups in a particular area, and even serve as a “find an expert” system to identify the most frequently cited researchers specializing in particular issues. For example, mapping all locations associated with food or water security produces the map below, which in the prototype interface allows an analyst to zoom into an area of interest and instantly access the combined scholarly and governmental output regarding that area.

Figure 2 - Map of locations mentioned in academic and U.S. Government articles on food and water security 1950-2014.
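
An index of this kind can be pictured as a lookup from themes to the places and cited authors that co-occur with them. The Python sketch below is a simplified illustration, not the system described in the D-Lib paper: the class name, method names, and the assumption that each document arrives with themes, locations, and cited authors already extracted are all hypothetical.

```python
from collections import Counter, defaultdict


class SocioCulturalIndex:
    """Toy theme index: maps each theme to location and author counts."""

    def __init__(self):
        self._locations = defaultdict(Counter)  # theme -> location counts
        self._authors = defaultdict(Counter)    # theme -> cited-author counts

    def add_document(self, themes, locations, cited_authors):
        """Record one document whose entities were extracted upstream."""
        for theme in themes:
            self._locations[theme].update(locations)
            self._authors[theme].update(cited_authors)

    def footprint(self, theme):
        """Geographic footprint of a theme: location -> mention count."""
        return dict(self._locations.get(theme, Counter()))

    def find_experts(self, theme, n=3):
        """The n most frequently cited authors for a theme."""
        ranked = self._authors.get(theme, Counter()).most_common(n)
        return [name for name, _ in ranked]
```

Plotting the result of `footprint("water security")` on a map would give a picture in the spirit of Figure 2, while `find_experts` corresponds to the “find an expert” use described above.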


Similarly, more than 110,000 human rights reports from Amnesty International, Human Rights Watch, the International Criminal Court, the United Nations, and related organizations were processed using the same system to generate a related index of the world’s human rights reports. Instead of keyword queries on the open web, this interface makes it possible to intelligently map relationships and patterns between specific groups, driving forces, human rights abuses, and geography. When combined with the academic literature index above and the GDELT news index, it is possible to track in near real time the spread of extremist ideologies, beliefs, and actions, and the undercurrents that drive and support them.

Figure 3 – Map of locations mentioned in Amnesty International Reports, Press Materials, Urgent Actions, Event, and “Other” documents published 1960-2014.


The digital era has made us exceptionally good at collecting the world’s information, but in doing so, we’ve emphasized archiving over analysis. Sometimes we have to put down the computer to better see the world, but when it comes to countering global extremism, big data offers the tantalizing ability to augment our human focus on the English language with the ability to listen to the whole world at once, transcending language barriers and reaching deeply into the reactions and emotional resonance of global events to add context and understanding.

If I can build machine translation for 65 languages in just two and a half months — and build an index over half a century and tens of billions of words of cultural knowledge in just half a year — what could the U.S. government achieve if it spent $125 million on listening to the world, rather than reading American newspapers?

Correction, April 16, 2015: The Intelligence Advanced Research Projects Activity-funded HealthMap project monitors news in 15 languages as part of its disease early-warning system. A sentence, now excised from this article, mischaracterized HealthMap’s Ebola alert, implying that it missed the earliest warnings of the disease due to an emphasis on English-language sources. HealthMap first flagged the appearance of Ebola in Guinea on March 14, 2014, as a result of an article in a French-language newspaper.


Kalev H. Leetaru is a senior fellow at the George Washington University Center for Cyber and Homeland Security and a council member of the World Economic Forum Global Agenda Council on the Future of Government. He created the GDELT Project and focuses on big data and global society.