The Dangers of AI Model Hallucinations

In their just-published study “HalluHard: A Hard Multi-Turn Hallucination Benchmark”, researchers Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, and Maksym Andriushchenko showed that large language models (LLMs) “hallucinate” in the large majority of cases (between 60% and 75%) across all the professional domains covered. In other words, in 60% to 75% of cases (roughly three in five to three in four) they make up an answer that does not match the facts. Put plainly: they lie. And the longer you keep questioning them, the more they lie, treating their own earlier, invented answers as facts. When the researchers used AI models with direct internet access and required them to back their answers with citations to actually published sources, the models still hallucinated in more than 30% of cases (about 1 in 3). Now imagine turning to AI for medical or legal advice, or for help writing code.
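
As a quick sanity check on the everyday fractions attached to the percentages above, here is a minimal sketch that converts a percentage into a “one in N” figure. The helper name `one_in_n` and the dictionary of rates are illustrative assumptions; the three rates are simply the rounded figures quoted in the paragraph above, not values re-read from the paper.

```python
def one_in_n(rate_percent: float) -> float:
    """Convert a percentage rate into the N of a 'one in N' statement."""
    if not 0 < rate_percent <= 100:
        raise ValueError("rate_percent must be in (0, 100]")
    return 100.0 / rate_percent


# Rounded rates quoted in the paragraph above (assumed, not taken from the paper's tables).
quoted_rates = {"lower bound": 60.0, "upper bound": 75.0, "with web search": 30.0}

for label, rate in quoted_rates.items():
    print(f"{label}: {rate:.0f}% is about 1 in {one_in_n(rate):.2f}")
```

On these numbers, 75% is exactly three in four, 60% is three in five, and 30% is a little under one in three.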

First, here is the abstract of the paper:

Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search (≈30% for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.

If that made little or no sense to you, here is a more popular summary of the paper.

Researchers at EPFL proved your AI is lying to you.

Not sometimes. Most of the time.

They built one of the hardest hallucination tests ever made with Max Planck Institute. 950 questions. Four domains where being wrong actually hurts. Legal. Medical. Research. Coding.

Then they ran every top model on it.

The results.

GPT-5. Wrong 71.8% of the time.

Claude Opus 4.5. Wrong 60% of the time.

Gemini 3 Pro. Wrong 61.9% of the time.

DeepSeek Reasoner. Wrong 76.8% of the time.

These are the smartest AI models on Earth. The ones you trust with your career. Your health. Your money.

You think turning on web search fixes it.

It doesn’t.

Claude Opus 4.5 with web search. Still wrong 30.2% of the time.

GPT-5.2 thinking with web search. Still wrong 38.2% of the time.

The internet attached. Still lying to you in 1 out of every 3 answers.

Now the part that should scare you.

Medical questions. The one place being wrong can kill you.

GPT-5 hallucinated 92.8% of the time on medical guidelines.

Claude Haiku 4.5 hallucinated 95.7% of the time.

Gemini 3 Flash hallucinated 89% of the time.

Nine out of ten medical answers from popular AI models. Wrong.

It gets worse.

The longer you talk to it, the more it lies.

Early mistakes cascade. The model starts citing its own earlier hallucinations as facts. Your third message is more wrong than your first.

The paper, in its own words: “hallucinations remain substantial even with web search.”

This is what hundreds of millions of people are doing right now. Asking software that lies in the majority of its answers. About their health. About their job. About their legal case. About their code.

Most are not checking.

Most never will.

But please. Keep using ChatGPT for medical advice.

The doctors need a break.

One response

  1. I agree with the basic warning: LLM hallucinations are a real problem, mainly because the answers are often linguistically very convincing even when they are not actually grounded. In medicine, law, research questions, engineering design, cost estimation, and other professionally sensitive fields, great caution is therefore needed.

    I would, however, be cautious about generalizing that AI models hallucinate “in general” in 60–75% or 70–80% of cases. HalluHard is, as far as I understand, a deliberately hard test: multi-turn conversations, high-stakes domains, a requirement to support claims with citations, and a check of whether the cited sources actually support those claims. It is a very important test, but it does not measure the average usefulness of AI across all tasks.

    In my experience, the difference lies mainly in how broad and how well organized the context is. For narrow, well-defined tasks with good-quality data, results can be very good: summarization, rewriting text, consistency checks, help with drafting descriptions, classification, spotting obvious discrepancies, generating drafts, programming help within a limited context, and so on.

    The problem arises with broader domain understanding. There the model often lacks a sufficiently structured meaning of the data to reliably connect all aspects of a problem. In construction, this means connecting project documentation, bills of quantities, norms, construction technology, cost calculations, contract provisions, schedules, procurement, responsibilities, and maintenance. In such a case the model can give a very convincing answer that is nonetheless professionally or commercially misleading.

    So the point, in my view, is not that AI is not useful. It is useful for a huge number of tasks. What is wrong is to expect that, without well-organized data, clear domain semantics, and a limited context, it will reliably understand an entire professional domain.

    I tried to develop a similar thesis using construction as the example: first we need an agreed meaning of data, standardized semantics, a knowledge base, and narrowly focused AI agents; only then can we realistically talk about broader AI models that understand an entire project.

    AI is not dangerous because it is useless. It becomes dangerous when we attribute broader domain understanding to it, even though in reality it has only a good linguistic approximation of that understanding.

    More here:
    https://www.axis.si/standardizirana-semantika-bim-ai-gradbenistvo/
