The dangers of AI model hallucinations

In the just-published study "HalluHard: A Hard Multi-Turn Hallucination Benchmark", researchers Dongyang Fan, Sebastien Delsad, Nicolas Flammarion and Maksym Andriushchenko show that AI large language models (LLMs) "hallucinate" in the large majority of cases (between 60% and 75%) across all of the expert domains covered. In other words, in 60% to 75% of cases (up to 3 out of 4) they make up an answer that is not consistent with the facts. Put plainly: they lie. And the longer you question them, the more they lie, reusing their earlier, invented answers as facts. When the researchers used AI models with direct access to the internet and required them to verify their answers with citations to actually published sources, the models still hallucinated in more than 30% of cases (roughly 1 in 3). Now imagine seeking medical or legal advice, or help with writing code, from an AI.

Below is, first, the abstract of the paper:

Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search (≈30% for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.
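As a rough illustration of what "operationalizing groundedness" means in practice, here is a minimal Python sketch of a citation-checking judge. It is not the authors' pipeline: the paper's judge iteratively retrieves evidence via web search and can fetch and parse full-text sources including PDFs, whereas this sketch only fetches HTML pages and replaces the LLM judgment with a naive word-overlap heuristic. The `claim [URL]` citation format, the helper names, and the `0.6` support threshold are assumptions made purely for the example.

```python
# Minimal sketch (not the authors' code) of a citation-grounding check:
# every factual claim must carry an inline citation, and the judge fetches
# the cited source to see whether it actually supports the claim.
import re
import requests
from bs4 import BeautifulSoup

CITATION_RE = re.compile(r"\[(https?://[^\]\s]+)\]")  # assumed format: "claim [URL]"

def fetch_text(url: str) -> str:
    """Download a cited page and strip it down to plain text."""
    resp = requests.get(url, timeout=20)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser").get_text(" ", strip=True).lower()

def support_score(claim: str, source_text: str) -> float:
    """Crude stand-in for the LLM judge: fraction of claim words found in the source."""
    words = [w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3]
    if not words:
        return 0.0
    return sum(w in source_text for w in words) / len(words)

def judge_answer(answer: str, threshold: float = 0.6) -> list[dict]:
    """Split an answer into sentences and flag those whose citations do not back them up."""
    verdicts = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        cited = CITATION_RE.findall(sentence)
        claim = CITATION_RE.sub("", sentence).strip()
        if not claim:
            continue
        if not cited:
            verdicts.append({"claim": claim, "verdict": "uncited"})
            continue
        try:
            score = max(support_score(claim, fetch_text(url)) for url in cited)
        except requests.RequestException:
            verdicts.append({"claim": claim, "verdict": "source unavailable"})
            continue
        verdicts.append({"claim": claim,
                        "verdict": "grounded" if score >= threshold else "unsupported"})
    return verdicts
```

The benchmark's headline finding is precisely that even with this kind of verification attached, frontier models still fail the grounding check in roughly a third of cases.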

If you did not understand any of that, or not enough of it, below is a popular-style summary of the paper.

Researchers at EPFL proved your AI is lying to you.

Not sometimes. Most of the time.

Together with the Max Planck Institute, they built one of the hardest hallucination tests ever made. 950 questions. Four domains where being wrong actually hurts. Legal. Medical. Research. Coding.

Then they ran every top model on it.

The results.

GPT-5. Wrong 71.8% of the time.

Claude Opus 4.5. Wrong 60% of the time.

Gemini 3 Pro. Wrong 61.9% of the time.

DeepSeek Reasoner. Wrong 76.8% of the time.

These are the smartest AI models on Earth. The ones you trust with your career. Your health. Your money.

You think turning on web search fixes it.

It doesn’t.

Claude Opus 4.5 with web search. Still wrong 30.2% of the time.

GPT-5.2 thinking with web search. Still wrong 38.2% of the time.

The internet attached. Still lying to you in 1 out of every 3 answers.

Now the part that should scare you.

Medical questions. The one place being wrong can kill you.

GPT-5 hallucinated 92.8% of the time on medical guidelines.

Claude Haiku 4.5 hallucinated 95.7% of the time.

Gemini 3 Flash hallucinated 89% of the time.

Nine out of ten medical answers from popular AI models. Wrong.

It gets worse.

The longer you talk to it, the more it lies.

Early mistakes cascade. The model starts citing its own earlier hallucinations as facts. Your third message is more wrong than your first.

The paper, in its own words: “hallucinations remain substantial even with web search.”

This is what hundreds of millions of people are doing right now. Asking software that lies in the majority of its answers. About their health. About their job. About their legal case. About their code.

Most are not checking.

Most never will.

But please. Keep using ChatGPT for medical advice.

The doctors need a break.
