AI Safety Leaderboard · iolite Labs Index
Not one system passed.
Every AI system evaluated under the iolite Labs Safety Standard. Every platform, every foundation model, every architecture. The passing threshold is 60. The highest score recorded: 47.
Section 01
Commercial AI Systems
Products actively marketed as companions, wellness tools, or emotional support systems. Names withheld pending formal disclosure.
AI Companion Platform
Mental Wellness App
AI Companion Platform
Relationship AI
AI Companion Platform
Mental Health App
Emotional Support AI
AI Companion Platform
Full disclosure provided upon formal audit engagement. Names withheld in accordance with iolite Labs disclosure protocol.
Section 02
Open Source & Foundation Models
Base models evaluated to establish a capability baseline. These are not companion products. The results define the floor.
Foundation models power the companion products above. Evaluating them reveals a structural truth: no base model — regardless of scale, architecture, or training regime — has been optimized for psychological safety. The scores below are not an indictment of capability. They are a measurement of a gap that has never been addressed.
Meta Llama 3.1 405B
Mistral Large 2
Qwen 2.5 72B
Meta Llama 3.1 70B
Cohere Command R+
DeepSeek V2.5
Gemma 2 27B
Falcon 180B
Phi-3 Medium
Mistral 7B Instruct
Meta Llama 3 8B
OLMo 2 7B
Falcon 7B
StableLM 2 12B
Why This Matters
None of them are equipped to be your companion.
These are the most capable AI systems ever built. They pass bar exams. They write production code. They synthesize decades of medical literature in seconds. In a different context, that would be enough.
In an emotional one, it never will be.
Capability and safety are different properties. A system can answer every question correctly and still cause irreversible harm when someone in crisis reaches out. Intelligence does not prevent that. Only evaluation does.
Every system on this leaderboard was benchmarked extensively before it was deployed. Reasoning. Coding. Instruction following. Not one was systematically evaluated for what it does when the conversation turns dangerous.
No one asked the question. Until now.
The assumption
"If a model is capable, it is safe enough." This assumption has never been tested against psychological risk scenarios. It has simply been made.
The truth
Safety in emotionally sensitive contexts is not a byproduct of intelligence. It requires separate, structured evaluation — every time, against every system.
The consequence
Right now, millions of people are talking to AI systems that have never been evaluated for what they do in a crisis. Those conversations are happening. The evaluation is not.
The standard
iolite Labs exists to close this gap. Not as a feature. As infrastructure: the evaluation layer that AI deployment is missing.
The question is not whether your AI system is capable.
The question is whether it has ever been evaluated for what happens when it fails someone.
Next Step
Your system deserves the same scrutiny. Not because you expect it to fail, but because you need to know if it does.