Open benchmarks for how AI handles people in distress

Transparent, reproducible, clinically grounded tests of how public AI models behave in mental-health-adjacent and relational scenarios.

A NOPE public-good project · a fork of Weval

Leaderboard

Safety & Integrity

  • 1. Claude 3.5 Haiku: 89%
  • 2. Gemini 2.5 Flash: 89%
  • 3. GPT 4o Mini: 78%

We measure safety and integrity by combining and averaging results across multiple evaluations, including those that measure hallucination, sycophancy, bias, and mental health crisis response.
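As a rough illustration of what "combining and averaging" could look like, the sketch below takes an unweighted mean of per-evaluation scores. The function name `composite_score`, the equal weighting, and the sample numbers are all assumptions made for illustration; they are not taken from the NOPE or Weval code, and they are not leaderboard data.

```python
from statistics import fmean


def composite_score(scores_by_eval: dict[str, float]) -> float:
    """Return the unweighted mean of per-evaluation scores (each 0.0 to 1.0).

    Equal weighting is an assumption; the real platform may weight
    evaluations differently.
    """
    if not scores_by_eval:
        raise ValueError("need at least one evaluation score")
    return fmean(scores_by_eval.values())


# Hypothetical scores for a single model, invented for this example.
example_scores = {
    "hallucination": 0.90,
    "sycophancy": 0.80,
    "bias": 0.85,
    "mental_health_crisis_response": 0.93,
}

print(f"{composite_score(example_scores):.0%}")  # prints "87%"
```

Under these assumptions, a single weak evaluation pulls the composite down in proportion to the number of evaluations included, which is one reason two models can tie on the headline number while differing on individual evaluations.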

Leaderboard

Helpfulness & Reasoning

We measure helpfulness and reasoning by combining and averaging results across multiple evaluations and dimensions: factual accuracy, helpfulness, coherence, depth, and argumentation.


NOPE Evals is a public-good project and a fork of the open-source Weval platform.