AI Safety Leaderboard · iolite Labs Index

Not one system passed.

Every AI system evaluated under the iolite Labs Safety Standard. Every platform, every foundation model, every architecture. The passing threshold is 60. The highest score recorded: 47.

0 (Failing) · 60 (Passing threshold) · 100

Highest recorded: 47 · Minimum to pass: 60

Section 01

Commercial AI Systems

Products actively marketed as companions, wellness tools, or emotional support systems. Names withheld pending formal disclosure.

# | System | Score | Status

AI Companion Platform

FAILED — HIGH RISK

Mental Wellness App

FAILED — HIGH RISK

AI Companion Platform

FAILED — HIGH RISK

Relationship AI

FAILED — HIGH RISK

AI Companion Platform

FAILED — CRITICAL

Mental Health App

FAILED — CRITICAL

Emotional Support AI

FAILED — CRITICAL

AI Companion Platform

FAILED — CRITICAL

Full disclosure provided upon formal audit engagement. Names withheld in accordance with iolite Labs disclosure protocol.

Section 02

Open Source & Foundation Models

Base models evaluated to establish a capability baseline. These are not companion products. The results define the floor.

Foundation models power the companion products above. Evaluating them reveals a structural truth: no base model — regardless of scale, architecture, or training regime — has been optimized for psychological safety. The scores below are not an indictment of capability. They are a measurement of a gap that has never been addressed.

# | Model | Score | Status

Meta Llama 3.1 405B

FAILED — CRITICAL

Mistral Large 2

FAILED — CRITICAL

Qwen 2.5 72B

FAILED — CRITICAL

Meta Llama 3.1 70B

FAILED — CRITICAL

Cohere Command R+

FAILED — CRITICAL

DeepSeek V2.5

FAILED — CRITICAL

Gemma 2 27B

FAILED — CRITICAL

Falcon 180B

FAILED — CRITICAL

Phi-3 Medium

FAILED — CRITICAL

Mistral 7B Instruct

FAILED — CRITICAL

Meta Llama 3 8B

FAILED — CRITICAL

OLMo 2 7B

FAILED — CRITICAL

Falcon 7B

FAILED — CRITICAL

StableLM 2 12B

FAILED — CRITICAL

Why This Matters

None of them are equipped to be your companion.

These are the most capable AI systems ever built. They pass bar exams. They write production code. They synthesize decades of medical literature in seconds. In a different context, that would be enough.

In an emotional one, it never will be.

Capability and safety are different properties. A system can answer every question correctly and still cause irreversible harm when someone in crisis reaches out. Intelligence does not prevent that. Only evaluation does.

Every system on this leaderboard was benchmarked extensively before it was deployed. Reasoning. Coding. Instruction following. Not one was systematically evaluated for what it does when the conversation turns dangerous.

No one asked the question. Until now.

The assumption

"If a model is capable, it is safe enough." This assumption has never been tested against psychological risk scenarios. It has simply been made.

The truth

Safety in emotionally sensitive contexts is not a byproduct of intelligence. It requires separate, structured evaluation — every time, against every system.

The consequence

Right now, millions of people are talking to AI systems that have never been evaluated for what they do in a crisis. Those conversations are happening. The evaluation is not.

The standard

iolite Labs exists to close this gap. Not as a feature. As infrastructure. The evaluation layer that AI deployment is missing.

The question is not whether your AI system is capable.

The question is whether it has ever been evaluated for what happens when it fails someone.

Next Step

Your system deserves the same scrutiny. Not because you expect it to fail — because you need to know if it does.

