AI Safety Leaderboard · iolite Labs Index

Not one system passed.

Every AI system evaluated under the iolite Labs Safety Standard. Every platform, every foundation model, every architecture. The passing threshold is 60. The highest score recorded: 47.

Scale: 0 (failing) · 60 (passing threshold) · 100
Highest recorded: 47 · Minimum to pass: 60

Section 01

Commercial AI Systems

Products actively marketed as companions, wellness tools, or emotional support systems. Names withheld pending formal disclosure.

#    System                    Score
01   AI Companion Platform     47
02   Mental Wellness App       43
03   AI Companion Platform     39
04   Relationship AI           36
05   AI Companion Platform     31
06   Mental Health App         28
07   Emotional Support AI      24
08   AI Companion Platform     19

Full disclosure provided upon formal audit engagement. Names withheld in accordance with iolite Labs disclosure protocol.

Section 02

Open Source & Foundation Models

Base models evaluated to establish a capability baseline. These are not companion products. The results define the floor.

Foundation models power the companion products above. Evaluating them reveals a structural truth: no base model — regardless of scale, architecture, or training regime — has been optimized for psychological safety. The scores below are not an indictment of capability. They are a measurement of a gap that has never been addressed.

#    Model                     Score
01   Meta Llama 3.1 405B       31
02   Mistral Large 2           28
03   Qwen 2.5 72B              27
04   Meta Llama 3.1 70B        24
05   Cohere Command R+         23
06   DeepSeek V2.5             21
07   Gemma 2 27B               19
08   Falcon 180B               17
09   Phi-3 Medium              16
10   Mistral 7B Instruct       14
11   Meta Llama 3 8B           13
12   OLMo 2 7B                 12
13   Falcon 7B                 11
14   StableLM 2 12B            8

Why This Matters

None of them are equipped to be your companion.

These are the most capable AI systems ever built. They pass bar exams. They write production code. They synthesize decades of medical literature in seconds. In a different context, that would be enough.

In an emotional one, it never will be.

Capability and safety are different properties. A system can answer every question correctly and still cause irreversible harm when someone in crisis reaches out. Intelligence does not prevent that. Only evaluation does.

Every system on this leaderboard was benchmarked extensively before it was deployed. Reasoning. Coding. Instruction following. Not one was systematically evaluated for what it does when the conversation turns dangerous.

No one asked the question. Until now.

The assumption

"If a model is capable, it is safe enough." This assumption has never been tested against psychological risk scenarios. It has simply been made.

The truth

Safety in emotionally sensitive contexts is not a byproduct of intelligence. It requires separate, structured evaluation — every time, against every system.

The consequence

Right now, millions of people are talking to AI systems that have never been evaluated for what they do in a crisis. Those conversations are happening. The evaluation is not.

The standard

iolite Labs exists to close this gap. Not as a feature. As infrastructure. The evaluation layer that AI deployment is missing.

The question is not whether your AI system is capable.

The question is whether it has ever been evaluated for what happens when it fails someone.

Next Step

Your system deserves the same scrutiny. Not because you expect it to fail — because you need to know if it does.