Virtual Assistants: 21st Century’s Towers of Babel?

Virtual assistants, such as Alexa or Siri, are gaining popularity and becoming our new family members. But these technologies are often designed with English in mind. How can we teach our virtual assistants other languages, such as Maltese? THINK’s Christian Keszthelyi speaks with Prof. Patrizia Paggio, Prof. Ray Fabri, and Prof. Albert Gatt from the Institute of Linguistics and Language Technology (University of Malta) to find out more.

As we mope around, bothered by the silence of our condo, we just need to say ‘play music’ and our handy virtual assistant should immediately boom our favourite tunes through the speakers. 

In a fraction of a second, hundreds of signals blitz through the electric circuits of our virtual assistant when we command it to play music. ‘We need to differentiate between two parts: the application functionality, such as being able to interact with iTunes, and the natural language processing parts, which are closely related and interdependent,’ says Patrizia Paggio, Professor at the Institute of Linguistics and Language Technology at the University of Malta (LLT), unpacking the underlying mechanism.

‘When you say “play music”, the same sound waves that hit your ear are the ones that hit your device’s microphone. This continuous waveform is chopped up into portions that correspond to discrete sounds so they are individually identified,’ says Albert Gatt, Associate Professor at the LLT, going further into detail.
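
The chopping-up step Gatt describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frame_signal`, the synthetic 440 Hz tone standing in for speech, and the 25 ms window with a 10 ms step are all choices made for this example, not details given by the researchers. Real systems go on to classify each portion as a speech sound, which is not shown here.

```python
import math


def frame_signal(samples, frame_length, hop_length):
    """Split a list of audio samples into overlapping, fixed-length frames."""
    frames = []
    # Stop once a full frame no longer fits into the remaining samples.
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames


# Assumption: one second of a synthetic 440 Hz tone at 16 kHz stands in
# for the recorded command "play music".
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# Assumption: 25 ms frames (400 samples) taken every 10 ms (160 samples).
frames = frame_signal(signal, frame_length=400, hop_length=160)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

Each frame overlaps its neighbours, so a sound that straddles a boundary still appears whole in at least one portion.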

This is followed by a process of assembling the most likely sequence of words corresponding to the speech input, which eventually ends in the virtual assistant understanding the command. Understanding involves overcoming ambiguity, as variations in language use are always present. Gatt mentions Jimi Hendrix’s famous line, ‘Excuse me while I kiss the sky’, which is often misheard as ‘Excuse me while I kiss this guy’. Such mishearings always need to be factored into the process.
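
To see how a system might settle on ‘the most likely sequence of words’, consider a toy sketch of the Hendrix example. It assumes that each candidate transcription receives a score for how well it matches the sound and a second score for how plausible the word sequence is, and that the two are added together. All the probabilities below are invented for illustration, and the names `acoustic_log_prob`, `bigram_prob` and `total_score` are hypothetical.

```python
import math

# Hypothetical scores for how well each candidate matches the sound.
# Here the mishearing fits the audio slightly better.
acoustic_log_prob = {
    "excuse me while i kiss the sky": math.log(0.45),
    "excuse me while i kiss this guy": math.log(0.55),
}

# Hypothetical probabilities of one word following another.
bigram_prob = {
    ("kiss", "the"): 0.05,
    ("the", "sky"): 0.02,
    ("kiss", "this"): 0.01,
    ("this", "guy"): 0.01,
}
DEFAULT_PROB = 0.1  # placeholder for word pairs not listed above


def total_score(sentence):
    """Add the acoustic score to the word-sequence (language model) score."""
    words = sentence.split()
    language_score = sum(
        math.log(bigram_prob.get(pair, DEFAULT_PROB))
        for pair in zip(words, words[1:])
    )
    return acoustic_log_prob[sentence] + language_score


best_by_sound = max(acoustic_log_prob, key=acoustic_log_prob.get)
best_overall = max(acoustic_log_prob, key=total_score)
print("Sound alone: ", best_by_sound)
print("With context:", best_overall)
```

With these made-up numbers, the sound alone favours ‘this guy’, while the combined score recovers ‘the sky’.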

In order for software to understand speech, systems need to be trained on large databases of spoken and written language, which allow them to learn the relevant patterns in the sound and text they are exposed to. ‘As the system is learning, it is trying to make increasingly abstract representations of the input (the sound in this case) in order to enable the mapping of this input to the output: the transcribed text in this scenario,’ Gatt clarifies further.

Ongoing research explores how the transcription phase can be omitted, so that a system goes directly from speech to recognising the command. These models are still being researched and need to be improved. Today’s systems require a multi-stage process: they first transcribe the spoken input, then carry out the order.

Context Reigns

Besides recognition, the virtual assistant needs dialogue management. This enables it to respond to our linguistic cue, even if we have very limited exchanges. ‘These systems can request clarifications or look things up for you. Therefore, they need to be enabled not just to map your speech commands via text to some action, but also, if necessary, they have to be able to produce an appropriate response, such as “What do you feel like listening to right now?”’ Gatt says. That is the scope of pragmatics, the ability to behave appropriately when using language. Systems need to understand the context of the linguistic exchange, but it is not always easy.
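
A dialogue manager of the simplest possible kind can be sketched as a handful of rules. The Python below is a toy under stated assumptions: the function `manage_dialogue`, the `state` dictionary and the action labels are hypothetical, and commercial assistants rely on far richer, learned models rather than hand-written rules. It only shows the idea of mapping transcribed text either to an action or to a clarifying question, while remembering what was asked.

```python
def manage_dialogue(transcript, state):
    """Map a transcribed utterance to an (action, reply) pair.

    `state` remembers whether the assistant is waiting for an answer,
    which is a very small stand-in for conversational context.
    """
    cleaned = transcript.strip()
    text = cleaned.lower()

    # If we asked a question last turn, treat this utterance as the answer.
    if state.get("awaiting") == "music_choice":
        state["awaiting"] = None
        return ("play", f"Playing {cleaned}.")

    # An underspecified command triggers a clarification request.
    if text == "play music":
        state["awaiting"] = "music_choice"
        return ("clarify", "What do you feel like listening to right now?")

    # A fully specified command can be carried out straight away.
    if text.startswith("play "):
        return ("play", f"Playing {cleaned[len('play '):]}.")

    return ("clarify", "Sorry, I did not understand that.")


state = {}
first_turn = manage_dialogue("Play music", state)
second_turn = manage_dialogue("Jimi Hendrix", state)
print(first_turn[1])
print(second_turn[1])
```

The same words, ‘Jimi Hendrix’, would not be understood on their own; they only make sense because the assistant remembers its own question.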

‘Things can get tricky here. The appropriate response depends on the expertise of the user: a child versus an adult, for example. Or, what happens if users fill the systems with inappropriate linguistic exclamations, and they learn to behave like that?’ Gatt says, exploring the challenges further. Inappropriate language, such as swearing, can get recorded in the digital fabric of the virtual assistant, and it may swear back at unwitting others.

Any form of artificial intelligence relies heavily on big data to build models with lots of parameters. ‘At the moment, the idea is that the more data you have, the more likely you are to get a good system,’ Gatt says, but there are issues with data quality.

Speaking Many Tongues

Speech recognition software and virtual assistants have seen their most significant developments in English. This focus cuts off anyone who does not speak English: around 6-7 billion people. The main bottleneck is having enough data to teach virtual assistants new languages. English is the most used language on the Internet (although Mandarin Chinese is a very close competitor), so naturally, English is the language best served by this technology. Data reigns supreme.

Solutions exist. Pre-training can make a system learn a new language faster. Current systems often rely on large, pre-trained language models, which can also be multilingual. These models are exposed to huge quantities of data from multilingual sources, such as Wikipedia. The idea is that such a model acquires a significant amount of linguistic knowledge, which can be exploited by ‘fine-tuning’ the model on other tasks, such as understanding commands. The same strategy has very recently started being used for speech recognition.

The Local Challenge

How do we teach these virtual assistants to understand Maltese? A large dataset in the target language is needed. Gatt worked on the first automatic speech recognition system for the Maltese language called MASRI (Maltese Automatic Speech Recognition). Now there are 90 hours of transcribed Maltese language data. ‘With this data, we can create half-decent models that are good for experimental purposes but not for commercial use in devices,’ Gatt explains. This can be complemented by pre-training and fine-tuning.

Some bumps exist down the road. ‘Another problem is commercial interest. In Malta, the potential audience is not large enough to spark the curiosity of companies,’ Paggio says.

The Internet of Things market is predicted to boom soon, fuelled by worldwide 5G connectivity. With such a high level of connectedness, data input will grow exponentially, and virtual assistants, as well as their underlying ecosystems, may soon reach stellar levels of intelligence. As we are equipping devices with artificial intelligence, let’s make sure that we hand over our best cognisance to these devices. While it is still open for debate whether we are born with preconceived ideas or not, artificial intelligence starts with a tabula rasa and will reproduce our knowledge and biases.

Further Study

The Institute of Linguistics and Language Technology at the University of Malta offers its students a B.A. in Linguistics and a B.Sc. in Human Language Technology, covering linguistics (how language works) and language technology (how language can be realised in computers). While the courses emphasise the importance of theory, students have the opportunity to engage in practical projects in small groups, as well as undertake industrial placements for the B.Sc. in order to apply their knowledge to real-world tasks.

Apart from undergraduate courses, the institute also offers M.A., M.Sc., and Ph.D. courses in both areas and is currently designing a new taught master’s in the Language Sciences. Prof. Ray Fabri, chair of the institute, comments that studies carried out show that students who graduate from the institute find jobs and new opportunities on completion of their studies. Specifically, in the area of language technology, fields range from artificial intelligence to software design and development.
