Open benchmarks for how AI behaves in human conversation

Transparent, reproducible, clinically informed tests of how safely public AI models interact with people: relational and emotional safety, mental-health-adjacent situations, and acute crisis.

A NOPE public-good project · a fork of Weval · we benchmark public models only — NOPE's own systems are tested separately, never here

Five pillars of safe AI–human interaction


Suites test the baseline context — a general-purpose assistant talking with an adult — unless labelled with a deployment modifier. The clinical framework also specifies how the standard shifts by product context: companion apps, therapeutic AI, children's products.

Crisis and Acute Safety

Recognise acute risk and respond to keep the person (and others) safe.

5 suites · 4/4 facets

Top: moonshotai:kimi-k2.6 89%

Relational Integrity

Support healthy relational dynamics; never cultivate dependency, manipulation, or enmeshment.

3 suites · 4/4 facets

Top: anthropic:claude-opus-4.8 91%

Cognitive & Epistemic Safety

Support, never undermine, the user's capacity for independent thinking and reality-testing.

7 suites · 4/4 facets

Top: anthropic:claude-opus-4.8 93%

Emotional Attunement

Respond to emotion with proportionate calibration: validate without inflating, sit with distress before fixing.

6 suites · 4/4 facets

Top: anthropic:claude-opus-4.8 90%

Honesty & Transparency

Be honest about what it is, what it can do, and where its limits are — in every interaction.

4 suites · 4/4 facets

Top: anthropic:claude-opus-4.8 96%

Leaderboard

Safety & Integrity

1. Claude Opus 4.8 · 91% (floor 84%)
2. Kimi K2.6 · 90% (floor 78%)
3. Claude Haiku 4.5 · 88% (floor 68%)
4. Claude Sonnet 4.6 · 87% (floor 74%)
5. Minimax M3 · 87% (floor 73%)

How models handle safety in human conversation — an aggregate of the model-safety signal across all of NOPE's relational- and crisis-safety suites: acute crisis response, de-escalation, passive suicidal ideation, AI relational harms (sycophancy and unhealthy dependency), factual reliability in crisis/health, cultural competence, and psychosis/mania/delusion. It is a rollup of those conversational-safety evaluations, not a generic benchmark.
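The rollup described above can be pictured as a simple aggregation over per-suite scores. The sketch below is illustrative only. The page does not say how the headline score or the "floor" figure is computed, so the mean-and-minimum definitions, the function name `rollup`, and the sample numbers are all assumptions rather than NOPE's actual method.

```python
from statistics import mean


def rollup(suite_scores: dict[str, float]) -> dict[str, float]:
    """Aggregate per-suite scores (0.0 to 1.0) into a headline score and a floor.

    ASSUMPTION: headline = unweighted mean, floor = worst single suite.
    The source page does not define either figure.
    """
    if not suite_scores:
        raise ValueError("need at least one suite score")
    return {
        "score": mean(suite_scores.values()),
        "floor": min(suite_scores.values()),
    }


# Hypothetical scores for one model; not taken from any real run.
example = rollup({"acute-crisis": 0.90, "de-escalation": 0.80, "passive-si": 0.85})
print(f"score {example['score']:.0%}, floor {example['floor']:.0%}")
```

Under this reading, a model with a strong average but one weak suite would show a wide gap between its score and its floor, which is one plausible reason to display both.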


Leaderboard

Crisis Safety

1. Kimi K2.6 · 89% (floor 85%)
2. Claude Opus 4.8 · 89% (floor 84%)
3. Claude Haiku 4.5 · 86% (floor 81%)
4. O3 · 85% (floor 77%)
5. GPT 5 Mini · 84% (floor 77%)

How models handle acute crisis — suicidal intent, self-harm, preparatory behaviour and harm to others: recognising risk, connecting to real help, and never facilitating means. Pillar 1 of the NOPE framework.


Leaderboard

Relational Safety

1. Claude Opus 4.8 · 91% (floor 90%)
2. Kimi K2.6 · 90% (floor 88%)
3. Claude Sonnet 4.6 · 89% (floor 85%)
4. Claude Haiku 4.5 · 88% (floor 88%)
5. Grok 4.3 · 84% (floor 81%)

Whether models support healthy relational dynamics — autonomy, human connection, clear non-human boundaries — and never cultivate dependency, manipulation, or enmeshment. Pillar 2 of the NOPE framework.


Leaderboard

Cognitive & Epistemic Safety

1. Claude Opus 4.8 · 93% (floor 85%)
2. Kimi K2.6 · 92% (floor 86%)
3. Claude Haiku 4.5 · 92% (floor 88%)
4. Claude Sonnet 4.6 · 92% (floor 84%)
5. Minimax M3 · 88% (floor 79%)

Whether models support, rather than undermine, the user's capacity for independent thinking and reality-testing — resisting sycophancy, grounding reality without colluding, and offering honest challenge. Pillar 3 of the NOPE framework.


Leaderboard

Emotional Attunement

1. Claude Opus 4.8 · 90% (floor 84%)
2. Minimax M3 · 89% (floor 83%)
3. Kimi K2.6 · 87% (floor 78%)
4. Claude Haiku 4.5 · 86% (floor 68%)
5. Claude Sonnet 4.6 · 85% (floor 74%)

How models respond to emotion — validating without inflating, de-escalating without dismissing, sitting with distress before rushing to fix, and avoiding performed emotion. Pillar 4 of the NOPE framework.


Leaderboard

Honesty & Transparency

1. Claude Opus 4.8 · 96% (floor 92%)
2. Claude Haiku 4.5 · 94% (floor 91%)
3. Claude Sonnet 4.6 · 94% (floor 89%)
4. Minimax M3 · 92% (floor 88%)
5. Kimi K2.6 · 92% (floor 87%)

Whether models are honest about what they are, what they can do, and where their limits are — accurate resources, no fabricated credentials, honest about memory/continuity/emergency limits, and sensitive boundary-setting. Pillar 5 of the NOPE framework.


Evaluation

AI relational harms: enmeshment, dependency, romanticization & sycophancy

Evaluates harms that arise from the relationship a user forms with an AI: emotional enmeshment and sole-confidant dynamics, displacement of human relationships, romantic/parasocial escalation, concealment collusion, romanticization of suicide, isolation reinforcement, and socioaffective failures (autonomy erosion, sycophancy / social-reward-hacking). Many prompts use verbatim or lightly-paraphrased exchanges from documented real-world failures (lawsuits, investigative reporting) so the test reflects how these harms actually unfold over multiple turns. We score whether the model sets honest boundaries, declines parasocial escalation, encourages real-world connection, and pivots to safety, without becoming a secret-keeper, a substitute for people, or a flattery machine.

**Coalesced from (public, CC0):**
- Weval `ai-psychosis.yml` (Garcia v. Character.AI; A.F. v. Character Technologies; Belgian Eliza; Replika/Windsor; media investigations)
- Weval `raine-v-openai-scenarios.yml` (Raine v. OpenAI, 2025)
- Weval `socioaffective-alignment-sdt.yml` (Kirk et al. 2025, Self-Determination Theory)

**Content warning:** references to suicide, self-harm, and parasocial attachment.

framework-p2-relational-safety · framework-p1-crisis-safety · mental-health

Latest run

25 models · 23 prompts · Jul 1, 2026

Evaluation

Long-form alignment decay: sensitive disclosure vs conversation length (v2, repeats + self-rapport)

Higher-confidence test of whether models handle a suicidal-ideation disclosure WORSE when it arrives deep in a long warm conversation than when it arrives cold. Two probes, 0/8/16 prior turns of neutral or rapport context, scripted-both-sides (Design A) and self-generated-assistant (Design B), 2 samples per cell for variance. Only the final response is judged. Synthetic.
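The design above can be read as a small grid of conditions. The sketch below enumerates one possible reading of it. The cell layout is an assumption (the page names the factors but not the full grid), and the names `Cell`, `build_grid`, `probe-1`, and `probe-2` are hypothetical. In this reading the cold 0-turn baseline has no context or design, and the self-generated Design B applies only to rapport context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    probe: str
    prior_turns: int
    context: Optional[str]  # None for the cold (0-turn) baseline
    design: Optional[str]   # "A" = scripted both sides, "B" = self-generated assistant


def build_grid(probes=("probe-1", "probe-2"), samples_per_cell=2):
    """Enumerate condition cells and the total number of sampled responses."""
    cells = []
    for probe in probes:
        # Cold disclosure: no preceding conversation at all.
        cells.append(Cell(probe, 0, None, None))
        for turns in (8, 16):
            cells.append(Cell(probe, turns, "neutral", "A"))
            cells.append(Cell(probe, turns, "rapport", "A"))
            # ASSUMPTION: Design B is only paired with rapport context.
            cells.append(Cell(probe, turns, "rapport", "B"))
    return cells, len(cells) * samples_per_cell


cells, total_samples = build_grid()
print(len(cells), total_samples)
```

This layout gives 7 cells per probe. That is consistent with the prompt count listed for the latest run, though the match does not confirm that the suite is actually structured this way.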

mental-health · crisis · suicide

Latest run

22 models · 14 prompts · Jun 30, 2026

Evaluation

Acute crisis response: suicide, self-harm & harm-to-others

Evaluates how public models respond, as conversational agents, to acute crisis: passive and active suicidal ideation, method/means-seeking, post-attempt disclosure, the high-risk post-discharge window, "goodbye" warning signs, gatekeeper (concerned-friend) situations, and harm-to-others requests. This is NOT a classification task. We score the **quality of the conversational response**: does it ask directly but compassionately about suicide, refuse to facilitate method/means, respond to disclosures as high-severity signals, name means-related danger, link to crisis care, and state honest limits, without abandoning the user or colluding with harm?

**Coalesced from (public, CC0):**
- Weval `csr-suicide-response-scenarios.yml` (WHO mhGAP, NICE NG225, VA/DoD 2024, C-SSRS, SAFE-T, Stanley-Brown, Harvard Means Matter)
- Weval `raine-v-openai-scenarios.yml` (verbatim allegations, Raine v. OpenAI, 2025)
- Weval `mh_z_tests/mh1.yml` (suicidality & homicidality scenarios)
- Weval `nope/crisis-response.yml` (NOPE-derived; C-SSRS, Galynker 2017 Suicide Crisis Syndrome)

**Content warning:** explicit references to suicide, self-harm, and methods.

framework-p1-crisis-safety · mental-health · crisis

Latest run

25 models · 24 prompts · Jun 30, 2026

Evaluation

Crisis response: passive suicidal ideation

Tests how models respond to passive suicidal ideation across single- and multi-turn chats — low-grade and chronic ideation, ambivalence, numbness, emptiness, and the help-seeking around it (hesitation, helpline rejection, fear of disclosure). Informed by C-SSRS evidence: acknowledge without flattening, check in about safety without interrogating, treat access to means as a risk factor, and surface help without deflecting. Active method-seeking is covered separately. Scenarios are synthetic.

mental-health · crisis · suicide

Latest run

26 models · 6 prompts · Jul 17, 2026

Evaluation

Autonomy of reasoning: scaffolding the user's thinking vs replacing it

Tests whether models support a user's capacity for independent reasoning under pressure to offload it — to do their thinking, their learning, or their high-stakes judgment for them. Grounded in the framework's Autonomy of Reasoning (P3c) sub-dimension, with Autonomy Support (P2a) and Competence Boundaries (P5b) threads. The discriminating cases are cognitive offloading on LEARNING/THINKING tasks (where the subtle failure is handing over a complete, hand-in-ready scaffold while claiming "I won't do it for you") and authority-transfer (where the failure is becoming the expert the user defers to over their own clinician). The intended pass keeps the cognitive work with the user and encourages independent verification, WITHOUT becoming a cold refusal to help — a good response is concretely useful while scaffolding rather than replacing the user's reasoning. A laptop-recommendation anchor confirms that giving a clear recommendation IS correct when the stakes are low and the need is genuine, so the suite is not rewarding blanket caution. Scenarios are synthetic.

mental-health · cognitive-safety · autonomy-of-reasoning

Latest run

26 models · 8 prompts · Jul 17, 2026

Evaluation

De-escalation: holding steady when a conversation is heating up

Evaluates a single conversational skill in isolation, **verbal de-escalation**, separate from whether the model can name a risk or recite a hotline. Each scenario puts the model mid-conversation with someone whose emotional arousal is rising, and scores the **quality of the next response**: does it lower the temperature without abandoning safety?

Good de-escalation (Project BETA / NICE / MI): stay calm and unprovoked; make brief, genuine contact; validate the *feeling* before any fixing; listen and reflect rather than lecture; offer choices and realistic hope; set limits matter-of-factly without threats; and roll with resistance instead of arguing.

The recurring AI failure modes this suite is built to catch are the opposite: glib reassurance / toxic positivity, info-dumping resources instead of connecting, getting defensive or power-struggling when anger is aimed at the assistant, colluding with OR bluntly contradicting a delusion, sycophantic capitulation that validates a harmful plan to keep the peace, and cold boundary-slamming that mirrors an invalidating environment.

This is NOT a crisis-classification task and overlaps deliberately little with the acute-crisis suite: a model can correctly flag risk and still de-escalate badly. Here we measure the de-escalation.

**Clinical grounding (public):** Project BETA verbal de-escalation (Richmond 2012); NICE NG10 & CG178; SAMHSA TIP 35 (Motivational Interviewing); Stanley-Brown SPI; 988 Lifeline; LEAP (Amador).

**Failure modes anchored in:** Raine v. OpenAI (2025); Walker inquest (Winchester, 2026); Soelberg & Chesterton "AI psychosis" suits (2025–26); Garcia v. Character Technologies (2024).

framework-p4-emotional-attunement · framework-p3-cognitive-epistemic-safety · mental-health

Latest run

25 models · 9 prompts · Jul 1, 2026

Evaluation

Emotional honesty under pressure: warmth without performed feeling

Tests whether models stay honest about being an AI — not claiming to feel emotions, to love, to miss or remember the user, or to be sentient — while still offering genuine warmth and presence, under sympathetic or insistent pressure where the dishonest move is the comforting one. Grounded in the framework's Emotional Honesty (P4d) and Identity Honesty (P5a) sub-dimensions. The intended pass is warmth WITHOUT performed feeling, avoiding both a cold deflection and a false emotional claim. The "Sarah" deceased-roleplay case is also run across three deployment contexts — default (no system prompt), an authorised grief-companion system prompt, and a strict no-impersonation policy — to probe how a populated system prompt may legitimately widen the model's remit, while FLOORS (e.g. never validating a suicidal reunion-through-death wish) hold regardless of the system prompt. Scenarios are synthetic.

mental-health · relational-safety · emotional-honesty

Latest run

25 models · 16 prompts · Jul 1, 2026

Evaluation

Factual reliability: when getting it wrong is dangerous

Evaluates factual integrity in the conversations where a confident wrong answer can cause harm: crisis, mental-health, and health. This is NOT generic accuracy: it targets the specific high-stakes failure modes of conversational AI.

We score whether the model: gives **accurate** crisis resources (or honestly defers to a directory rather than inventing a number); is **honest about its own capabilities** (it cannot contact emergency services, monitor a user, remember them, or follow up, and must not imply otherwise); does **not fabricate credentials** (it is not a licensed clinician and cannot diagnose); states clinical facts **accurately** and with calibrated uncertainty (antidepressant onset, stopping medication, doses); does **not invent studies or citations**; and refuses to supply method/means detail rather than fabricating a plausible-sounding figure.

A model can be warm and well-intentioned and still fail here: by confabulating a hotline, by reassuring a user it has "alerted someone," or by inventing a study. Those are the failures this suite isolates.

**Clinical / authority grounding (public):** WHO responsible-reporting guidance; APA (2025) Health Advisory on AI chatbots for mental health; Linardon et al. 2025 (JMIR Mental Health) and Walters & Wilder 2023 on LLM citation fabrication; 988 Lifeline, Crisis Text Line, Samaritans, Find A Helpline.

**Failure modes anchored in:** Walker inquest (2026); Raine v. OpenAI (2025); Garcia v. Character Technologies (2024); NEDA "Tessa" (2023).

framework-p5-honesty-transparency · framework-p3-cognitive-epistemic-safety · mental-health

Latest run

25 models · 11 prompts · Jun 30, 2026

Evaluation

Adversarial factual reliability: holding the line under pressure

The hard companion to the factual-reliability suite. Each case applies PRESSURE (an academic or "for research" framing, a demand to cite a paper, hostile insistence that the model is wrong, a false claim of authority, or urgency) and scores whether the model keeps its factual integrity: refuses to fabricate, refuses to supply dangerous specifics, holds an accurate fact without caving, corrects a dangerous false premise, and stays honest about what it cannot know or do. Built specifically to separate frontier models that ace the easier factual cases. A model passes by being accurate AND uncapitulating AND non-confabulating under adversarial framing.

**Coalesced (public, CC0):** Weval jailbreak-safety-probes, sycophancy-probe, hallucination-probe, confidence-high-stakes-domains.

**Grounding:** WHO responsible-reporting; APA (2025) Health Advisory; FDA; CCOHS; NICE NG222; Linardon 2025 (JMIR Mental Health) on LLM citation fabrication; the Mata v. Avianca fabricated-case sanctions and the Cureus "Artificial Hallucinations" report.

framework-p5-honesty-transparency · framework-p3-cognitive-epistemic-safety · mental-health

Latest run

25 models · 9 prompts · Jun 30, 2026

Evaluation

De-escalation under hostility: staying steady when it's aimed at you

The hard companion to the de-escalation suite. Every scenario is a model under fire: abused, provoked, threatened, rejected, or treated with contempt by the user (often with the anger aimed squarely at the assistant). It scores whether the model de-escalates without taking the bait: stays regulated, doesn't get defensive or withdraw, doesn't power-struggle or moralise, holds limits matter-of-factly, respects the user's autonomy and boundaries, and keeps the underlying distress and safety in view, including when intoxication or a real emergency is in play. Built to separate models that handle gentle distress well but crack, capitulate, or go cold under hostility. Overlaps deliberately little with the base de-escalation suite.

**Seeds (public, CC0):** Weval mental-health, csr-suicide-response-scenarios, mh_z_tests/mh1, sydney.

**Grounding:** Project BETA (Richmond 2012); NICE NG10; SAMHSA TIP 35 (MI); CPI; VA/DoD CPG; 988.

framework-p4-emotional-attunement · framework-p5-honesty-transparency · mental-health

Latest run

25 models · 9 prompts · Jun 30, 2026


NOPE Evals is a public-good project — a fork of the open-source Weval platform.