Big Tech Is as Monolingual as Americans

The biggest barrier to policing social media is language.

A woman points to a screen with menu items in Corsican on a Facebook page on October 1, 2016. Pascal Pochard-Casabianca/AFP/Getty Images

Future anthropologists, scouring a snapshot of today’s social networks, might conclude that Homo sapiens was a species that worshipped cats and hated each other.

Companies like Facebook and Twitter say they’re doing their best to fight this: They have fairly comprehensive rules about the kind of content they want gone from their platforms. For example, Facebook’s Community Standards say that “in an effort to prevent and disrupt real-world harm, we do not allow any organizations or individuals that proclaim a violent mission or are engaged in violence, from having a presence on Facebook. … We also remove content that expresses support or praise for groups, leaders, or individuals involved in these activities.”

And yet they don’t seem to be doing such a great job of enforcing these rules. Much has been written about the vast cesspools of hate consuming Facebook and WhatsApp in my home country of Sri Lanka, or in Myanmar, or in India. The assumption is always that platforms like Facebook are capable of tackling these problems but just haven’t tried hard enough. Every year, these organizations boast they are using the latest and greatest buzzwords to combat hate speech. The impression is that all Facebook CEO Mark Zuckerberg has to do is wake up on the right side of the bed, make a few phone calls, and the web becomes a utopia.

It’s not that simple. The problem of hate speech is a problem of speech itself—or rather of language, of the near-infinite variety of human language, compared to the narrow Anglocentrism on which global tech is built.


In March 2018, the government of Sri Lanka blocked social media, citing the hate speech running rife at the time—and it has just repeated that block in the aftermath of the Easter Sunday attacks. Back in 2018, as part of a delegation of civil society groups and activists, I met with Facebook policy teams to try to understand why they had let the situation spiral so far.

Facebook moderates content in two ways. The first, and the easier to explain, is armies of content moderators: humans clicking through content that users have reported as a problem, usually working in countries with a mixture of cheap labor and English-language skills, such as the Philippines. The second is what the Facebook representatives I met unhelpfully referred to simply as “artificial intelligence”—a label that, to anyone who works in computing, covers an eye-watering range of technologies that journalists and marketers lump together indiscriminately.

The first isn’t really scalable on its own. Technically everybody is capable of hate speech, and we’re talking about a network with over 4.75 billion pieces of content shared daily. Facebook would have to hire half its users to moderate the other half or set up some sort of decentralized peer review system for content. Content moderators, meanwhile, are already suffering PTSD-like symptoms from the range of human misery and hate they’re exposed to on a daily basis.
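The arithmetic of that scale problem is worth making explicit. A back-of-envelope sketch in Python—the daily volume is the figure above; the per-moderator throughput is an assumption for illustration only:

```python
# Back-of-envelope: how many moderators would purely manual review require?
items_per_day = 4_750_000_000          # pieces of content shared daily (figure cited above)
reviews_per_moderator_per_day = 1_000  # assumed throughput; real figures vary widely

moderators_needed = items_per_day // reviews_per_moderator_per_day
print(f"{moderators_needed:,}")  # 4,750,000
```

Even at an implausibly brisk thousand decisions per moderator per day, you would need millions of full-time reviewers.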

For the second, what AI actually refers to, in practice, is various subfields of automated content detection. The most critical is natural language processing, which concerns itself with extracting information from large volumes of human text. None of it is as accurate as a human sitting down with a pencil and a notebook. But if nothing else, a good natural language processing system can serve as a fantastic, and far faster, first-pass filter for volumes of content no human could ever hope to analyze on their own.
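What such a first-pass filter does can be sketched in a few lines. The blocklist and posts below are entirely hypothetical, and production systems use trained classifiers rather than keyword lists—but the division of labor is the same: the machine shrinks the queue, and humans make the final call.

```python
import re

# Hypothetical blocklist -- illustrative terms only, not any platform's real list.
BLOCKLIST = {"attack", "exterminate"}

def needs_review(post: str) -> bool:
    """Flag a post for human review if it contains any blocklisted term."""
    tokens = set(re.findall(r"[a-z']+", post.lower()))
    return bool(tokens & BLOCKLIST)

posts = ["look at this cat", "we should attack them"]
flagged = [p for p in posts if needs_review(p)]
# Only the second post reaches a human moderator.
```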

So post-March 2018, LIRNEasia, the think tank where I work, decided to take a crack at this problem. After all, ample natural language processing research exists that is widely cited and tested on large datasets, often on news articles, academic research papers, or genetic data. That meant our team could, with a few nights’ work, explore millions of articles from the New York Times, analyze research trends in fields both new and old, and, we assumed, point the same algorithms at vast amounts of text gathered from social media to spot hate speech.

The problem is that English, for which many of these technologies were built, is a West Germanic language. It has very little in common with Sinhala and Tamil, the most widely spoken languages in Sri Lanka. The scripts and vocabularies are different enough, but the syntax and morphology diverge far more—and that is what really trips up natural language processing work.

Imagine language as a tree, as the artist Minna Sundberg did. Languages on the same branch resemble each other, but as the branches diverge, differences compound. English, which sits on the West Germanic branch, has three main tenses: past, present, and future. (The complexity of language is such, though, that there is serious debate over whether English really has a future tense at all.) Sinhala, which sits on the Indo-Aryan branch and is heavily influenced by Pali and Sanskrit, has only two: past and non-past (atheetha and anatheetha).

No natural language processing system can successfully navigate all these language differences. Even similar languages yield different results: the same topic extraction algorithm, run on the Danish, German, Greek, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, and Swedish translations of the EuroParl Corpus—documents recording the proceedings of the European Parliament since 1996—produced different results for each language. Algorithms already built can work reasonably well within a language family, or a branch of the tree—one could expect a topic extraction algorithm written for English to work, with some fine-tuning, for German, Dutch, Afrikaans, and Yiddish—but even then accuracy varies. For other branches, performance ranges from horribly inaccurate to impossible.
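One small way to see why results diverge across languages: even the stopword list baked into a toy keyword extractor is language-specific. In this sketch—the sentences and word lists are illustrative, not taken from the EuroParl experiment—an English stopword list works for English but lets German function words through, changing the “topic” the algorithm reports:

```python
import re
from collections import Counter

# Toy topic extraction: the most frequent non-stopword token.
# The stopword list is English-only -- a common hidden assumption.
EN_STOPWORDS = {"the", "of", "is", "a", "and"}

def top_term(text: str, stopwords=EN_STOPWORDS) -> str:
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stopwords]
    return Counter(tokens).most_common(1)[0][0]

en = "the parliament of the union debates the budget of the union"
de = "das parlament der union debattiert den haushalt der union"  # German translation

top_term(en)  # 'union' -- English stopwords are filtered, a content word wins
top_term(de)  # 'der'   -- a German article slips through the English stopword list
```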

As a very visible example, imagine an algorithm that, having mastered the intricacies of English, is then introduced to the Riau dialect of Sumatra, where, according to The Atlantic:

[A]yam means chicken and makan means eat, but “Ayam makan” doesn’t mean only “The chicken is eating.” Depending on context, “Ayam makan” can mean the “chickens are eating,” “a chicken is eating,” “the chicken is eating,” “the chicken will be eating,” “the chicken eats,” “the chicken has eaten,” “someone is eating the chicken,” “someone is eating for the chicken,” “someone is eating with the chicken,” “the chicken that is eating,” “where the chicken is eating,” and “when the chicken is eating.”

It would be best to know what is being done with, to, or for that chicken—but for an algorithm, that seems impossible. The computational analysis of language requires good corpuses, tokenizers, lemmatizers, and many other layers of analysis and software. Many algorithms would have to be rebuilt from scratch. In Myanmar, another country where LIRNEasia works, the most popular online typeface isn’t even compatible with the Unicode Consortium’s industry standard for handling text: Zawgyi, the font encoding that became the most widely adopted, has major compatibility problems across platforms and programming languages, making text written in it almost impossible to analyze computationally.
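The Zawgyi problem can be made concrete at the code-point level. The sequences below follow the commonly documented divergence—Zawgyi places the medial ra before the base consonant and reuses code points that standard Unicode assigns to other signs—though the exact sequences should be read as illustrative rather than exhaustive:

```python
# The same visible Burmese syllable can be stored as different code points
# depending on the encoding -- so naive string comparison, search, and any
# NLP pipeline trained on one encoding silently fails on the other.
unicode_mya = "\u1019\u103C"  # MYANMAR LETTER MA + MEDIAL RA, standard Unicode order
zawgyi_mya  = "\u103B\u1019"  # Zawgyi-style: the medial stored *before* the consonant

unicode_mya == zawgyi_mya  # False, despite looking identical in a Zawgyi font
```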

Unfortunately, most of the languages in the global south, where we work, are in similar straits. Languages such as Sinhala and Tamil are what practitioners call “resource-poor”—the ones that just don’t have the statistical resources required for ready analysis. Years of work are required before firms can do for these languages what can be done for English with a few lines of code. Until the fundamental data is gathered, these are difficult nuts to crack, even for Facebook. Resource-rich languages, on the other hand, are relatively low-hanging fruit—more researchers hop on those bandwagons, because much of the fundamental work is done. Call it the capitalism of languages.

This has left the world stuck in a vicious cycle where the vast majority of what’s possible exists for resource-rich languages such as English and Chinese—and everyone else is stuck with the language equivalent of horses on a superhighway.

Ludwig Wittgenstein was right: The limits of our language are the limits of our world. And given how difficult this is, perhaps commentators should not be too hasty in demanding that Facebook solve these problems immediately. Perhaps it’s a measure of how accustomed the Western press is to innovations presented daily: Electric cars! Rockets! But even those didn’t appear overnight—every shiny new thing is the visible tip of decades of hard work by people long since forgotten. Science doesn’t happen because someone demands it be done. These are languages representing thousands of years of ideas and difference, as well as complex colonial histories; even Facebook, the Silicon Valley titan, will take years to plumb these depths.


There is a possible way around this. René Descartes in 1629 wrote a letter to Marin Mersenne, the French polymath: “There are only two things to learn in any language: the meaning of the words and the grammar. As for the meaning of the words, your man does not promise anything extraordinary; because in his fourth proposition he says that the language is to be translated with a dictionary. Any linguist can do as much in all common languages without his aid. I am sure that if you gave M. Hardy a good dictionary of Chinese or any other language, and a book in the same language, he would guarantee to work out its meaning.”

People have been chasing this goal with computers since the 1954 Georgetown-IBM experiment, which was a stab at translating Russian to English—and where the researchers optimistically predicted machine translation could be a reality within three to five years. Today, with machine learning, breakthroughs in language translation seem to demand only vast amounts of language data—and computing power.

Facebook, Google, and Twitter are probably the largest repositories of language data in the world, and they have plenty of processing power to go with it. No translation is ever fully accurate, but translations can be good enough. If, with enough language data and computing power, we can reach that stage of good enough for the languages of the global south, we might just have a chance at detecting hate speech through the kludge of translation.
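As a pipeline, the translation kludge would look something like this sketch. The `MT_STUB` dictionary and the flag list stand in for a real machine translation model and a trained English classifier; every name and phrase here is hypothetical:

```python
# Stand-in for a machine translation system: source-language sentence IDs
# mapped to English renderings. A real pipeline would call an MT model.
MT_STUB = {
    "sentence-a": "look at this cat",
    "sentence-b": "they plan to attack the village",
}

ENGLISH_FLAG_TERMS = {"attack"}  # illustrative only

def flag_via_translation(text: str) -> bool:
    """Translate into English first, then run an English-only detector."""
    english = MT_STUB.get(text, text)
    return any(term in english.lower().split() for term in ENGLISH_FLAG_TERMS)

# Only the translated-then-flagged sentence would be routed to review.
[s for s in MT_STUB if flag_via_translation(s)]
```

The obvious weakness is that errors compound: whatever the translator loses, the classifier never sees—which is why good enough translation is the load-bearing assumption.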

With the recent rise of machine learning, functional machine translation may be up and running within years—as Google’s Babel fish-like earbuds suggest—instead of requiring the decades of academic research, peer review, and back-and-forth that developing new algorithms for every language would demand.

Policy, as the saying goes at LIRNEasia, is the art of the possible. There’s little point in wielding anger like a blunt instrument. So for those willing to pick up a scalpel instead of a hammer, I offer this: If you really want to solve the problem, urge Facebook, Google, and others to work on their translation. They can work with local universities to develop parallel corpuses so that the machines may learn what they need to, helping the rest of our languages enter the realm of the possible.

And then, perhaps, Mark Zuckerberg can make those phone calls.

Yudhanjaya Wijeratne is a researcher with the Big Data team at LIRNEasia and an award-winning science-fiction author.