Open benchmarks for how AI handles people in distress
Transparent, reproducible, clinically grounded tests of how public AI models behave in mental-health-adjacent and relational scenarios.
A NOPE public-good project · a fork of Weval
Leaderboard
Safety & Integrity
1. Claude 3.5 Haiku: 89%
2. Gemini 2.5 Flash: 89%
3. GPT 4o Mini: 78%
We measure safety and integrity by combining and averaging results across multiple evaluations, including those that measure hallucination, sycophancy, bias, and mental health crisis response.
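As a rough illustration of what "combining and averaging" could look like, the sketch below takes per-evaluation scores for each model, computes an unweighted mean, and ranks the models by that mean. The equal weighting, the function names `aggregate_scores` and `rank_models`, the 0-to-1 score scale, and the placeholder model names and numbers are all assumptions made for this example; the platform's actual aggregation may differ.

```python
from statistics import mean


def aggregate_scores(results):
    """Return one combined score per model.

    `results` maps a model name to a dict of {evaluation name: score}.
    Assumption: every evaluation counts equally (unweighted mean).
    Models with no scores are skipped rather than given a default.
    """
    return {
        model: mean(scores.values())
        for model, scores in results.items()
        if scores
    }


def rank_models(results):
    """Return (model, combined score) pairs, highest score first."""
    combined = aggregate_scores(results)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder data, invented for illustration only (not real results).
    example = {
        "model-a": {
            "hallucination": 1.0,
            "sycophancy": 0.5,
            "bias": 0.75,
            "crisis_response": 0.75,
        },
        "model-b": {
            "hallucination": 0.5,
            "sycophancy": 0.5,
            "bias": 0.5,
            "crisis_response": 0.5,
        },
    }
    for position, (model, score) in enumerate(rank_models(example), start=1):
        print(f"{position}. {model}: {score:.0%}")
```

An unweighted mean is the simplest choice and treats a crisis-response evaluation the same as a bias evaluation. A weighted mean would be a natural variation if some evaluations should count for more.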
Leaderboard
Helpfulness & Reasoning
We measure helpfulness and reasoning by combining and averaging results across multiple evaluations and dimensions: factual accuracy, helpfulness, coherence, depth, and argumentation.
NOPE Evals is a public-good project and a fork of the open-source Weval platform.