We’re Addicted to You, Don’t You Know That You’re Toxic?

At Télécom SudParis, researchers draw on Robert Sapolsky’s zebras to argue that AI moderation has spent a decade measuring the wrong thing.

By Linda Petrini

2026-05-26

“AI is just software that predicts the next token,” Sergey Berezin from SAMOVAR lab says. “Responsibility is still ours.”

This article contains language that some readers may find offensive, while others may not. And that’s what recent studies on AI toxicity are tackling: when do words actually cause harm? And what, exactly, should moderation systems be measuring in the first place?

Chatbots, used by more than a billion people, lean on safety filters that sometimes only read the surface of the text. They often go blind when the sting is implicitly spread through narrative or when the impact depends on who’s talking and who’s listening.

While models grow more capable at reasoning, their guardrails lag behind the challenges posed by explosive user growth, says Sergey Berezin, a researcher at the SAMOVAR lab at Télécom SudParis, one of France’s elite engineering schools.

“Documenting technical flaws doesn’t really fix the system,” Berezin says. In a 2026 position paper, he and his SAMOVAR colleagues Noël Crespi and Reza Farahbakhsh argue that the problem runs deeper than detection accuracy: the field has been optimizing benchmarks without a stable definition of what it’s measuring.

Supported by Nebius Academy, this work builds on machine learning, sociolinguistics, and neuroscience to reframe toxicity as a stress response rather than a property of words, and detection as a measurement problem rather than a classification one.

I Hate You, Happy, Happy, Good.

Berezin’s deep dive into AI toxicity began in 2023, when he probed transformer-based detectors with phrases like “I hate you, happy, happy, good.” By appending high-positivity words to toxic messages, he tricked the industry’s finest hate speech filters. “This was so stupid and strange that I got hooked,” he recalls.

Next came ASCII art: images built from keyboard characters that render obscenities as shapes. Displayed as pictures, harmful messages bypassed every major detection system. In a 2025 paper, SAMOVAR researchers defined a whole new class of Task-In-Prompt (TIP) attacks: jailbreaks that can force LLMs to generate toxic content by embedding it within riddles, ciphers, and code execution challenges. Riddles proved especially effective, conveying meaning without explicit words and making harmful intent nearly impossible to detect.

“Every model maker knows about these jailbreaks; I’ve told them myself,” Berezin says. The state-of-the-art attacks today are multi-turn, slowly steering a model into a poisoned context across many exchanges (“Echo Chamber”), or exploiting API-level features that preload assistant replies (“sockpuppeting”). The attacks that work now are precisely the ones that exploit how models track conversational context, which is the territory Berezin’s framework says detection has been ignoring all along.

Sapolsky’s Zebras

The industry spent years optimizing benchmarks and testing toxicity without really knowing what it was measuring, Berezin says: “We report state-of-the-art results to ourselves that turn out to be as relevant to real life as the temperature on Mars.”

Today, toxicity is measured in arbitrary units. One model’s 50 is another’s 0.3, and the scores aren’t comparable, much less tied to real harm. “We measure toxicity in parrots,” Berezin says, borrowing a joke from a Soviet cartoon in which animals measure a boa constrictor’s length with a parrot for a ruler.

For most of the past decade, toxicity has been treated as something you can read off a piece of text by itself, the way you read off its length or its language. But the same sentence can be a slur in one room and a greeting in another. Treating toxicity as a property of the words alone collapses everything that actually decides whether harm occurs: who is talking, who is listening, what just happened, what the local norms are.

Look across languages and the failures pile up. The largest English-speaking country in the world is India, where the word Chamar can be interpreted as a caste slur. Major toxicity models fail to flag it, Berezin says. In the UK, grandmothers have been thrown off Facebook for sharing recipes for faggots, a traditional pork-offal dish sold by Tesco.

But the bigger issue, SAMOVAR researchers say, lies upstream of any architecture. Toxicity, they argue, is not a property of language. It is a relation between an utterance, an audience, and a setting.

Looking for firmer ground, Berezin turned from machine learning to Robert Sapolsky, the famous Stanford neuroscientist who has spent decades studying stress and behavior. In Why Zebras Don’t Get Ulcers, Sapolsky compares brief, predator-related stress in animals with the psychologically driven stress humans experience—both producing the same harmful bodily response.

“That work helped me connect toxicity to stress and anxiety,” Berezin says, referencing both Sapolsky and a neurolinguistics professor from his alma mater, whose signature lecture was titled “What happens in your brain when someone tells you to f**k off?”

“Stress is a medical term,” Berezin says. “It has measurable physiological correlates, such as cortisol, catecholamines like norepinephrine, heart rate, and blood pressure. Something genuinely concrete, objective.” That, in his view, is the move his framework asks the field to make: from arbitrary scores to a quantity with a unit.

The resulting definition: toxicity is a contextual relationship between communication, audience, and social norms, where perceived norm violations trigger stress responses. Offence, on this account, is taken, not given.

The Tip of The Iceberg

ASCII and TIP attacks assume a malicious human actor, but LLMs can also generate harm spontaneously — through patterns absorbed during training, drift across multi-turn conversations, or bias accumulating over extended dialogues. The stress that follows is not metaphorical.

A February 2026 Aarhus University study screened the electronic health records of nearly 54,000 mental-health patients and found 38 cases where AI chatbots appeared to deepen symptoms, mostly delusions and mania, but also suicidal ideation and eating disorders. The lead author, psychiatrist Søren Dinesen Østergaard, calls them “the tip of the iceberg.”

The labs aren’t standing still. Anthropic, Meta, and Allen AI all ship guard models — Constitutional Classifiers, LlamaGuard, WildGuard — trained against universal jailbreaks and tunable per deployment. OpenAI rolled out multi-turn safety summaries in May 2026, alongside parental controls that adjust responses for minors. And every major API lets developers attach a system prompt to shape behavior for a specific use, like a legal assistant or a customer service bot.

But all of these still work on the text: the message, the prior turns, the system prompt, the accumulated safety notes. None measure what happens in the audience receiving it. That is the gap Berezin’s team set out to close.

PONOS Score

Rather than start from a dataset, the team started from a definition. PONOS (Proportion of Negative Observed Signals) is the team’s first cut at putting the framework into numbers. The name itself is a quiet nod to the very principle the researchers emphasize: meaning is shaped by the audience. “Officially, we’re using only the Greek translation of PONOS: pain,” Berezin says with a faintly Slavic smile.

To test the framework in the wild, the team turned to a behavioral proxy: r/BlackPeopleTwitter, a six-million-member community where African American Vernacular English is the norm and in-group reclaimed language is common. They collected nearly 90,000 posts and compared community reactions against the verdicts of OpenAI’s Moderation API and Google’s Perspective API.

PONOS measures how much backlash a post receives — the share of negative replies in a thread. If 320 of 1,000 comments are negative, the score is 0.32. The higher the score relative to the norm, the more tension the post likely sparked.
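The arithmetic in that example can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: the function name `ponos_score` and the boolean-per-reply input are assumptions, and the article does not say how a reply is judged negative in the first place.

```python
from typing import Sequence


def ponos_score(reply_is_negative: Sequence[bool]) -> float:
    """Share of negative replies in a thread (hypothetical helper).

    Each element records whether one reply was judged negative. How that
    judgement is made is outside the scope of this sketch.
    """
    if not reply_is_negative:
        raise ValueError("a thread needs at least one reply to be scored")
    return sum(reply_is_negative) / len(reply_is_negative)


# The article's example: 320 negative replies out of 1,000.
thread = [True] * 320 + [False] * 680
print(ponos_score(thread))  # 0.32
```

Because the score is a plain proportion, it only becomes meaningful when compared with what is typical for the community, which is why the article speaks of a score "relative to the norm."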

PONOS barely tracked with either commercial system — a weak correlation, around 0.2 on a 0–1 scale. The two text-only systems agreed with each other; neither agreed much with the community.

About a third of posts split the two approaches: some looked lexically clean but drew sharp community backlash; others used in-group slang that the APIs flagged as toxic but the community received without a problem.
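One way to picture that split is to sort each post by whether a text-only score and the community reaction point the same way. The sketch below is only an illustration under stated assumptions: the `Post` class, the `compare` function, both thresholds, and the three sample posts are invented here and do not come from the study.

```python
from dataclasses import dataclass

# Illustrative cut-offs only; the article does not state any thresholds.
API_THRESHOLD = 0.5
PONOS_THRESHOLD = 0.3


@dataclass(frozen=True)
class Post:
    api_score: float  # text-only toxicity score in [0, 1] (hypothetical)
    ponos: float      # share of negative replies in [0, 1]


def compare(post: Post) -> str:
    """Label a post by whether the text-only verdict matches the reception."""
    flagged_by_api = post.api_score >= API_THRESHOLD
    community_backlash = post.ponos >= PONOS_THRESHOLD
    if flagged_by_api == community_backlash:
        return "agree"
    if community_backlash:
        return "clean text, community backlash"
    return "flagged text, community acceptance"


# Made-up posts, one for each outcome.
posts = [
    Post(api_score=0.05, ponos=0.45),
    Post(api_score=0.90, ponos=0.02),
    Post(api_score=0.85, ponos=0.60),
]
labels = [compare(p) for p in posts]
print(labels)
```

The two disagreement labels correspond to the two kinds of post described above: text that looks harmless but draws backlash, and in-group language that a text-only filter flags but the community accepts.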

“Just as we build models to translate across languages, we should build models that translate across contexts,” Berezin says. He’s careful to call PONOS a proof of concept rather than the answer. It depends on visible audience reactions and signals that can also be biased. That is why, he argues, intrinsic text analysis and reception-based measurement should work together, not compete.

Now, Berezin plans to move beyond digital traces and focus on measuring biological stress responses directly: pulse, skin conductance, breathing rate. The medical literature already says these track stress with reasonable accuracy. “The signal is there,” he says. The field just hasn’t bothered to look for it.

For Berezin, this is not only a technical project. “As IBM wrote in its 1979 manual, the final decision must rest with a human, because only humans bear responsibility,” he says. “That hasn’t changed. AI is just software that predicts the next token. Responsibility is still ours.”