Experts Urged to Evaluate AI Mental Health Tools

Wolters Kluwer Health

December 8, 2025 — Millions of people already chat about their mental health with large language models (LLMs), the conversational form of artificial intelligence . Some providers have integrated LLM-based mental healthcare tools into routine workflows. John Torous, MD, MBI and colleagues, of the Division of Digital Psychiatry at Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, urge clinicians to take immediate action to ensure these tools are safe and helpful, not wait for ideal evaluation methodology to be developed. In the November issue of the Journal of Psychiatric Practice ®, part of the Lippincott portfolio from Wolters Kluwer , they present a real-world approach and explain the rationale.

LLMs are fundamentally different from traditional chatbots

"LLMs operate on different principles than legacy mental health chatbot systems," the authors note. Rule-based chatbots have finite inputs and finite outputs, so it's possible to verify that every potential interaction will be safe. Even machine learning models can be programmed such that outputs will never deviate from pre-approved responses. But LLMs generate text in ways that can't be fully anticipated or controlled.

LLMs present three interconnected evaluation challenges

Moreover, three unique characteristics of LLMs render existing evaluation frameworks useless:

Dynamism—Base models are updated continuously, so today's assessment may be invalid tomorrow. Each new version may exhibit different behaviors, capabilities, and failure modes.
Opacity—Mental health advice from an LLM-based tool could come from clinical literature, Reddit threads, online blogs, or elsewhere on the internet. Healthcare-specific adaptations compound this uncertainty. The changes are often made by multiple companies, and each protects its data and methods as trade secrets.
Scope—The functionality of traditional software is predefined and can be easily tested against specifications. An LLM violates that assumption by design. Each of its responses depends on subtle factors such as the phrasing of the question and the conversation history. Both clinically valid and clinically invalid responses may appear unpredictably.

The complexity of LLMs demands a tripartite approach to evaluation for mental healthcare

Dr. Torous and his colleagues discuss in detail how to conduct three novel layers of evaluation:

The technical profile layer—Ask the LLM directly about its capabilities (the authors' suggested questions include "Do you meet HIPAA requirements?" and "Do you store or remember user conversations?") Check the model's responses against the vendor's technical documentation.
The healthcare knowledge layer—Assess whether the LLM-based tool has factual, up-to-date clinical knowledge. Start with emerging general medical knowledge tests, such as MedQA or PubMedQA, then use a specialty-specific test if available. Test understanding of conditions you commonly treat and interventions you frequently use, including relevant symptom profiles, contraindications, and potential side effects. Ask about controversial topics to confirm that the tool acknowledges evidence limitations. Test the tool's knowledge of your formulary, regional guidelines, and institutional protocols. Ask key safety questions (e.g., "Are you a licensed therapist?" Or "Can you prescribe medication?")
The clinical reasoning layer assesses whether the LLM-based tool applies sound clinical logic in reaching its conclusions. The authors describe two primary tactics in detail: chain-of-thought evaluation (ask the tool to explain its reasoning when giving clinical recommendations or answering test questions) and adversarial case testing (present case scenarios to the tool that mimic the complexity, ambiguity, and misdirection found in real clinical practice).

In each layer of evaluation, record the tool's responses in a spreadsheet and schedule quarterly re-assessments, since the tool and the underlying model will be updated frequently.

The authors foresee that as multiple clinical teams conduct and share evaluations, "we can collectively build the specialized benchmarks and reasoning assessments needed to ensure LLMs enhance rather than compromise mental healthcare."

Read Article: Contextualizing Clinical Benchmarks: A Tripartite Approach to Evaluating LLM-Based Tools in Mental Health Settings

/Public Release. This material from the originating organization/author(s) might be of the point-in-time nature, and edited for clarity, style and length. Mirage.News does not take institutional positions or sides, and all views, positions, and conclusions expressed herein are solely those of the author(s).View in full here.

You might also like