Robots do not treat all humans equally.
This summer, Last Week Tonight host John Oliver and our very own Jorge Ramos both kvetched about how computers often don't understand them. It's not because they are particularly complicated men. It's because Oliver is British and Ramos is Mexican; both speak with accents.
“I have to often, with automated machines, do an American accent,” Oliver told Ramos. “It’s electronic imperialism.”
Yes, English, especially the American variety, is the web's lingua franca.
YouTube is full of videos of frustrated people, some with accents, trying to talk to voice recognition systems. This dude, who has an accent, wants to play music in his car, but the car hears "play all" as "redial." Another accented speaker gets so frustrated with his car's voice recognition system that he begs it to shut down; instead, it starts up the navigation feature. During a live demo, Microsoft's voice-to-text app typed up "dear aunt" when the speaker very clearly, but with an accent, said "dear mom." When the presenter tried to fix it, the machine kept making errors.
Marsal Gavalda, who recently joined messaging app YikYak as chief of machine learning, has been on a mission to make people more aware of "electronic imperialism," giving talks at conferences on the topic, like at SpeechTek in San Francisco this week.
"Speech technologies have proven so useful and successful at powering intelligent applications. At the same time, we need to be cognizant they don't work so well for everyone," Gavalda told me. "We need to prevent a 'speech divide,' a class of people for whom speech technologies work well and another for whom they don't. You're putting those people at a disadvantage."
Those of us who can effectively communicate with devices by voice are doing so more and more. Fifty-five percent of teens and 41% of adults use voice search more than once a day, according to a recent Google survey. When you're driving, it's safer to use your voice than your fingertips, but not if the voice recognition system doesn't work for you. And that's more likely to happen if you don't speak English.
That's because the more data any artificial intelligence system has, the better it is. American-style English has a huge advantage here. Most of the available voice data that feeds into virtual helpers like Siri, Google Now and Microsoft's Cortana is in standard U.S. English. Only Chinese even begins to rival it.
It shows in how well AI voice systems recognize what we're saying. Google told me that for its voice recognition system, which works on mobile and on the web, the word error rate for U.S. English is just 8%, a figure that factors in mistakes on tricky proper names, like street addresses or unusual restaurant names. For Spanish and British English, it's higher, around 10%.
For "tier 2 languages"—those that have gotten less attention from tech companies—error rates hover around or above 20%. That means the machine you're trying to talk to gets roughly one in every five words wrong, which renders it basically unusable. There are billions of people who are out of luck because they don't speak the default voice recognition languages—English, French, Spanish, or Chinese—or speak those languages with heavy accents.
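The word error rate these companies cite is a standard metric: the number of word-level substitutions, insertions, and deletions needed to turn the recognizer's transcript into the correct one, divided by the number of words actually spoken. Here's a minimal sketch in Python, assuming simple whitespace-separated transcripts (production systems do more careful text normalization first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance between what was said
    (reference) and what the recognizer heard (hypothesis), divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One misheard word ("mom" as "aunt") in a five-word sentence → 20% WER,
# roughly the error rate reported for "tier 2" languages.
print(word_error_rate("please send this to mom",
                      "please send this to aunt"))  # → 0.2
```

At 8% WER, a typical sentence comes through with at most one small error; at 20%, every sentence has a word or two wrong, which is why those systems feel broken.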
And voice recognition systems embedded in cars and in call centers are particularly awful because they're using an older, less sophisticated type of artificial intelligence, not the type that's powering Google search. That creates a terrible feedback loop: if machines aren't good at understanding you, you're less likely to talk to them, meaning that the systems used to train them aren't going to collect the data they need to get better.
The voice recognition market is estimated to be $2.5 billion this year and it's expected to grow as voice capabilities move from our handheld devices to appliances, robots, and cars. But corporate decisions about which languages and accents get better voice recognition aren't likely to be driven by a need to right an unfair "digital speech divide." Businesses will decide based on profit margins, market size, and how much money the people who speak that way have to spend.
Baidu, frequently referred to as the Google of China, is focusing on Chinese because that's where its core user base is. Google told me that it's set an end-of-the-year goal to make its voice recognition systems for Italian, German, Spanish, Japanese, Korean and Russian as good as English.
"After [tier 1] we'll pay more attention to tier 2, especially as the next billion users will come from emerging markets," like India, said Johan Schalkwyk, Google's chief of speech recognition. "These users will be coming online soon and voice may be the only way they communicate with the device."
Tech giants like Google, Microsoft and Facebook know that their next billion users will come from markets where "tier 2" languages, like Hindi, are dominant. Some of the work of adapting voice recognition systems for those new audiences is already happening. When Google started building an Indonesian speech recognition system, the word error rate was 40%; now, says Schalkwyk, it's around 18%. Part of what's driving these initial improvements is that Google is launching products in emerging markets, like Indonesia. Google's newly revamped Translate app, for instance, "knows" Indonesian.
Microsoft has similar goals. "We have a pretty aggressive and not fully disclosed language expansion plan for Cortana, over the next 18 months to two years…so we're not U.S. centric," said Spencer King, the principal program director for Cortana.
Companies are tackling accents too. Both Microsoft and Google are working on ways to automatically detect whether a user is speaking a given language with an accent. For Cortana, users can opt in to a feature that recognizes non-native accents. That cues Microsoft's AI to "listen" to the user differently, which can result in a significant increase in accuracy, King says.
"We use that information to train that model separately from the native speaker model," King told me. The company has been doing that for about a year and a half.
With Google Now, you can also select recognition for U.S. English or Australian English, but in the future, the search giant wants to make that unnecessary and to automatically detect whether an American or an Aussie is talking to it.
All this requires lots of data. For a system to work well, Google says it needs at least 5,000 hours of user voice data. That may not sound like a lot, but getting it is harder than it seems. For Afrikaans, for example, Google says it has a hard time collecting enough data to dramatically improve the system. And many recordings, in any language, are too noisy or too poor in quality to be usable.
And yes, making systems better means companies hold on to what you say to your virtual assistant to improve the AI. If you do a voice query with Google Now, your data is logged anonymously for two years. If you opt in to the Audio History feature, you can manage the audio data and delete it; once deleted, it's gone forever, Google's Schalkwyk said. Microsoft also says that it's careful to anonymize the user data it feeds into its voice recognition systems.
So, how long until every user can converse with a virtual assistant without the conversation ending in frustrated cursing? The experts I spoke to say that's hard to predict. The really big improvements will come first in products developed by the likes of Google, Apple and Microsoft. Call centers will take a little bit more time, said Tim Tuttle, the CEO of AI company ExpectLabs.
But Tuttle did offer a glimmer of hope. Companies won't have to start from scratch with every language. Often, they can take the models built for one language and use them to improve another, which should quicken the pace of development. Once you've learned Spanish, picking up Italian or Portuguese is easier. Luckily, machines can, at some level, do this too.
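The idea Tuttle describes is what machine learning researchers call transfer learning: instead of training a new language's model from random starting weights, you start from a related language's trained weights and fine-tune on whatever smaller dataset you have. Here's a toy sketch in Python with NumPy; everything in it (the logistic-regression "model," the synthetic "Spanish" and "Italian" data) is an illustrative stand-in, not how real acoustic models work:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, w_init, steps=200, lr=0.5):
    """Logistic-regression training loop, standing in for training a
    much larger speech model."""
    w = w_init.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient step
    return w

# "Spanish": a data-rich language, so the model trains from scratch.
w_true = rng.normal(size=8)
X_es = rng.normal(size=(1000, 8))
y_es = (X_es @ w_true > 0).astype(float)
w_es = train(X_es, y_es, w_init=np.zeros(8))

# "Italian": a related task (similar underlying pattern) with far less data.
X_it = rng.normal(size=(30, 8))
y_it = (X_it @ (w_true + 0.1 * rng.normal(size=8)) > 0).astype(float)

# Warm-start from the Spanish weights instead of from scratch: the model
# begins close to a good solution, so a short fine-tune can suffice.
w_it = train(X_it, y_it, w_init=w_es, steps=50)
```

The payoff is exactly the one Tuttle suggests: the closer the languages, the less new data and training a company needs before the system becomes usable.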
Daniela Hernandez is a senior writer at Fusion. She likes science, robots, pugs, and coffee.