
Articulations Speak Louder Than Words?

Reversing the Tower of Babel: inside the text-agnostic redesign of speech-to-speech translation.

For decades, speech translation systems have followed a simple pipeline: convert voice to text, translate it, and synthesize the result back into speech. Known as the cascaded approach, this chain has powered everything from early translation apps to modern voice assistants.

It also introduced friction. Each stage compresses speech into a simpler representation before passing it forward, accumulating errors, increasing latency, and losing nuances of speech such as emotion and pronunciation. But what if translation could happen directly from the voice itself, without relying on text at all? That’s the concept behind ArticulateX, a model developed at the Infosys Centre for AI at IIIT Delhi. Backed by a Nebius research grant, this system was built to translate speech through articulatory representations — features that describe how humans physically produce sounds.

The idea is to move closer to the mechanics of speech itself by modeling tongue position, lip movement, and airflow through the vocal tract. For example, when producing a “T” sound, the tongue touches the alveolar ridge or the upper teeth; for bilabial sounds like “B” or “P”, both lips close together. These physical patterns are shared across languages and are well described by established phonetic frameworks.
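To make this concrete, the toy sketch below describes a few phonemes purely by place of articulation, manner, and voicing. The feature inventory and phoneme set are simplified illustrations rather than the actual representation ArticulateX learns, but they show the kind of language-agnostic description the model builds on.

```python
# Toy sketch of language-agnostic articulatory features.
# The feature names and phoneme set are illustrative, not the
# actual inventory used by ArticulateX.

# Each phoneme is described by how it is physically produced:
# where the constriction happens (place), how air flows (manner),
# and whether the vocal folds vibrate (voicing).
ARTICULATORY_FEATURES = {
    "t": {"place": "alveolar", "manner": "plosive",   "voiced": False},
    "d": {"place": "alveolar", "manner": "plosive",   "voiced": True},
    "p": {"place": "bilabial", "manner": "plosive",   "voiced": False},
    "b": {"place": "bilabial", "manner": "plosive",   "voiced": True},
    "m": {"place": "bilabial", "manner": "nasal",     "voiced": True},
    "s": {"place": "alveolar", "manner": "fricative", "voiced": False},
}

def shared_features(a: str, b: str) -> set[str]:
    """Return the articulatory dimensions two phonemes have in common."""
    fa, fb = ARTICULATORY_FEATURES[a], ARTICULATORY_FEATURES[b]
    return {k for k in fa if fa[k] == fb[k]}

# "b" and "p" are both bilabial plosives and differ only in voicing,
# the kind of regularity that holds regardless of language.
print(shared_features("b", "p"))  # prints the shared dimensions: place and manner
```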

In a way, ArticulateX echoes the Tower of Babel story in reverse: the model attempts to reunite languages through a common foundation, creating a language-agnostic intermediate space. This design turns out to be efficient. The 70-million-parameter model matches or outperforms systems many times its size. “Every human has the same vocal tract and the same articulators,” says Vinayak Abrol, the project’s co-author and supervisor. “If you can map speech into articulatory space, you can translate across languages without depending on text.”

French to English Translation Example
Source: C'était une des objections faites à Boltzmann
Target: It was one of the objections that were made to Boltzmann
Prediction: It was one of the objections made to Boltzmann

From Controlled Tasks to Speech Science

Real-time speech translation began moving from dream to reality in the 1990s with ATR’s speech translation research in Japan and, most importantly, the JANUS system, the first to achieve speech-to-speech translation with workable accuracy on controlled tasks. The field was soon pushed further by the German Verbmobil project, which tackled spontaneous dialogue in German, English, and Japanese.

These systems relied on cascaded frameworks. In the 2000s and early 2010s, statistical models replaced many rule-based components, with larger datasets improving recognition and translation accuracy. Systems became faster and more robust, yet the fundamental architecture remained the same: speech recognition feeding text translation, often followed by synthesized speech.

In the late 2010s, a newer generation of models started moving beyond this cascaded architecture toward end-to-end neural speech translation. Transformer-family systems like Google’s Translatotron and Meta’s SeamlessM4T learn joint representations of sound and language, and process speech more holistically, without explicitly generating intermediate text.

Still, these systems face persistent challenges: autoregressive architectures yield slow inference, non-autoregressive replacements often degrade quality, and most state-of-the-art pipelines remain closed-source and prohibitively expensive to train. “Right now a lot of voice AI is just data and deep learning,” says Vinayak Abrol. “To be frank, there is very little speech science inside today’s models, even though we understand speech production quite well.”

Parameter-Efficient Model

ArticulateX directly embeds speech science knowledge into the model. Instead of letting neural networks invent arbitrary latent spaces, the researchers guide them using articulatory features derived from phonetics. While text is used as a training signal to align speech with articulatory sequences, it disappears at inference time and the model performs translation without relying on it.

The pipeline has three elements. First, a Speech-to-Articulator (S2A) encoder converts raw audio into articulatory representations of the target language using a Conformer-based neural architecture. Then an Articulator-to-MelSpectrogram generator translates those sparse articulatory features into spectrograms. Finally, a HiFi-GAN vocoder transforms the model’s output into speech, with optional diffusion-based post-processing to improve audio quality.
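As a rough sketch of how the three stages fit together, the placeholder PyTorch code below wires a toy S2A encoder, articulator-to-mel generator, and vocoder in sequence. The module internals are stand-ins (simple GRU and linear layers rather than the actual Conformer, mel generator, and HiFi-GAN), so treat this as an illustration of the data flow, not the published architecture.

```python
import torch
import torch.nn as nn

# Schematic sketch of the three-stage pipeline described above.
# All internals are placeholders with made-up dimensions.

class SpeechToArticulator(nn.Module):
    """S2A encoder stand-in: audio features -> articulatory trajectories."""
    def __init__(self, n_mels=80, n_articulators=24):
        super().__init__()
        self.net = nn.GRU(n_mels, n_articulators, batch_first=True)

    def forward(self, audio_feats):
        out, _ = self.net(audio_feats)
        return out  # (batch, time, n_articulators)

class ArticulatorToMel(nn.Module):
    """Maps sparse articulatory trajectories to target-language mel frames."""
    def __init__(self, n_articulators=24, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(n_articulators, n_mels)

    def forward(self, articulators):
        return self.proj(articulators)

class Vocoder(nn.Module):
    """Stand-in for HiFi-GAN: mel spectrogram -> waveform."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hop)

    def forward(self, mels):
        return self.proj(mels).flatten(1)  # (batch, samples)

# Inference is text-free: audio goes straight through the three stages.
# (During training, text transcripts only supervise the articulatory alignment.)
s2a, a2m, vocoder = SpeechToArticulator(), ArticulatorToMel(), Vocoder()
source_audio = torch.randn(1, 200, 80)      # e.g. 200 mel frames of source speech
waveform = vocoder(a2m(s2a(source_audio)))  # toy translated waveform
print(waveform.shape)
```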

All trainable layers in the S2A encoder are reparameterized with LoRA, reducing the encoder to 3.1 million trainable parameters. At just 70 million parameters overall, nearly 40× smaller than Hibiki’s 2.7 billion, ArticulateX reaches 37.24 BLEU on CVSS French–English, only about 1 BLEU point lower.
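The LoRA idea itself fits in a few lines: freeze a pretrained weight matrix and train only a small low-rank update on top of it. The dimensions and rank below are illustrative, not the values used in the S2A encoder.

```python
import torch
import torch.nn as nn

# Minimal LoRA sketch: the base weight is frozen and only two small
# low-rank matrices are trained, which is how a large encoder can be
# adapted with a few million trainable parameters.

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8,192 trainable parameters vs. 262,656 frozen ones
```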

One of the biggest advantages of ArticulateX is its ability to preserve speech dynamics, speaker identity, and expressiveness across languages. In human evaluation, the system scored 4.84 out of 5 for naturalness. It also achieved a real-time factor of 0.97, meaning it processes speech almost as fast as it is spoken. The non-autoregressive design lets the model generate output frames in parallel rather than one step at a time.
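The real-time factor is simply processing time divided by audio duration, so a value below 1.0 means the system keeps pace with live speech. A minimal way to measure it, with a stand-in for the actual translation pipeline:

```python
import time

# Real-time factor (RTF) = processing time / audio duration.
# `translate` is a placeholder for the full speech-to-speech pipeline.

def real_time_factor(translate, audio, audio_duration_s: float) -> float:
    start = time.perf_counter()
    translate(audio)
    return (time.perf_counter() - start) / audio_duration_s

# Toy demo: a "model" that takes 0.5 s on a 10-second clip gives RTF = 0.05.
rtf = real_time_factor(lambda a: time.sleep(0.5), audio=None, audio_duration_s=10.0)
print(f"RTF = {rtf:.2f}")
```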

Model                     Parameters   BLEU Fr-En   BLEU De-En
Translatotron-2 (Multi)   -            26.13        16.92
UnitY (Multi)             -            26.9         16.36
S2UT (Mono)               130M         22.23        2.99
DASpeech (Mono)           95M          25.03        -
ComSpeech (Multi)         112M         28.15        18.16
Hibiki (Mono)             2.7B         38.21        -
ArticulateX (Mono)        70M          37.24        20.07

Common Label Set

The Interspeech conference paper covers French-to-English and German-to-English translation, but the team has already moved well beyond that scope. According to Abrol, a multilingual prototype now handles 11 Indian languages, and early results are competitive with Meta’s Seamless model. “We have much better real-time factors as well as translation quality with less data,” Abrol said.

Despite the promising results, the approach still faces several limitations. Current experiments cover only a small set of language pairs and rely on short audio segments typical of benchmark datasets, leaving open questions about performance on longer, conversational speech. More extensive evaluation across diverse languages, speakers, and real-world conditions will be needed. Abrol notes that broader benchmarking is already underway.

The AI speech translation industry is at a crossroads, he says. On one side, industry labs can pour billions of parameters and millions of hours of data into ever-larger systems. On the other, smaller research groups are asking whether physics-informed priors can achieve comparable results at a fraction of the cost. ArticulateX offers early evidence that the answer might be positive.

Future directions include extending ArticulateX to what Abrol calls a “common label set”, a sound-unit-level representation system covering all phonemes across many languages. This could allow speech translation models to operate across dozens or even hundreds of languages with minimal retraining. In theory, a model trained in this space could translate speech directly between languages it has never seen paired before.
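A hypothetical illustration of what such a common label set could look like: language-specific phoneme labels mapped into one shared sound-unit inventory, so that any language described in that inventory lands in the same space. The labels and mappings below are invented for illustration and do not come from the paper.

```python
# Hypothetical "common label set": language-specific phoneme labels are
# mapped into one shared sound-unit inventory. All names here are invented.

COMMON_LABELS = {
    # (language, native label) -> shared sound unit
    ("fr", "t"):  "T_ALVEOLAR",
    ("en", "t"):  "T_ALVEOLAR",
    ("hi", "t̪"): "T_DENTAL",
    ("de", "b"):  "B_BILABIAL",
    ("en", "b"):  "B_BILABIAL",
}

def to_common(labels, lang):
    """Map a language-specific phoneme sequence onto the shared label set."""
    return [COMMON_LABELS[(lang, p)] for p in labels]

# Two languages land in the same space, so they can be compared directly.
print(to_common(["t", "b"], "en"))  # ['T_ALVEOLAR', 'B_BILABIAL']
print(to_common(["b"], "de"))       # ['B_BILABIAL']
```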
