The Growing Challenge of Evaluating AI in Healthcare
As the integration of AI-powered solutions in healthcare accelerates, so do concerns about their safety and effectiveness. Despite growing reliance on these technologies, definitive standards for answering crucial questions remain elusive: What constitutes quality in modern generative health technology? How can that quality be reliably assessed and documented to protect patients without stifling innovation? The White House has acknowledged this pressing issue in its AI Action Plan, tasking NIST with developing a scientific approach to measuring and evaluating AI models, a response to what experts have termed "the evaluation crisis."
AI in Healthcare: A Spectrum of Risks
The challenges associated with evaluating AI technologies span a wide range, from basic medical chatbots to advanced generative systems used for medical documentation and data integration across electronic health records (EHRs). As the potential risks escalate, from low-stakes administrative roles to high-stakes medical interventions, the demand for robust quality assurance metrics intensifies. Moreover, risks can arise from unexpected sources. For example, the EU AI Act classified ordinary chatbots as low-risk AI, only for regulators to later discover that many were being used in therapeutic contexts. Innovators must understand the benchmarks for medical AI so that both providers and patients can trust these tools to be safe and effective.
Stakeholder Responsibilities and Regulatory Developments
Recent actions by regulatory bodies and researchers have sought to clarify the responsibilities of various stakeholders in the healthcare sector regarding the evaluation of AI products. Federal health regulators such as the FDA, state medical boards, and legislative bodies are all actively exploring frameworks for assessing emerging AI technologies, including chatbots. These efforts illustrate how new technologies can expose regulatory gaps and redundancies, with each category of stakeholder facing unique challenges. Gaps in regulation create uncertainties, while overlapping jurisdictions complicate compliance efforts, ultimately hindering innovation and undermining patient safety.
Roadmap for Safe AI Implementation
In October 2025, the Journal of the American Medical Association (JAMA) published a Summit Report on Artificial Intelligence, outlining a roadmap for ensuring safe and effective AI in healthcare. The report emphasizes the urgent need for comprehensive methods and tools to assess the efficacy and safety of AI products in various clinical and administrative settings. The authors caution that while AI tools can significantly impact health outcomes, these effects are often not clearly quantified, primarily because evaluations can be challenging or even absent for tools not under the FDA’s oversight. They also raise concerns that FDA clearance does not guarantee improved clinical outcomes and that the multifunctional nature of generative AI tools may undermine traditional device regulation frameworks.
Clinical Validation Concerns
Research from Johns Hopkins, Georgetown, and Yale highlights substantial gaps in clinical validation for many AI-enabled medical devices (AIMDs). Their analysis of roughly 1,000 AIMDs cleared by the FDA found that many devices later recalled had lacked proper clinical testing, with a significant share of recalls occurring within the first year after clearance. One researcher poignantly noted, "We just thought, 'wow—if AI hasn't been tested on people, then people become the test.'"
Regulatory Engagement and Public Outreach
The FDA has sought public input on evaluating the real-world performance of AIMDs through a Request for Public Comment, although it did not directly invite feedback on its regulatory processes. Addressing AIMDs in mental health, the FDA’s Digital Health Advisory Committee convened a public meeting on November 6, 2025, dedicated to “Generative Artificial Intelligence-Enabled Digital Mental Health Medical Devices.” Previous discussions emphasized the need for a regulatory balance that promotes innovative health products while minimizing potential risks to users.
Anticipating Regulatory Actions
It remains to be seen how the FDA will respond to the Advisory Committee’s recommendations. The agency has not yet approved devices employing generative AI for mental health; however, it has designated at least two such devices as “breakthrough devices,” expediting their regulatory review. Some companies are likely bypassing the FDA process entirely by classifying their chatbots as wellness tools rather than medical devices. In 2022, the agency announced it would “exercise enforcement discretion” over specific software functions related to psychiatric conditions due to their lower risk profile.
Regulatory Trends and Legislative Actions
State legislatures have begun passing laws that restrict the marketing and deployment of mental health and companion bots, including a recent law in California. Notably, Governor Newsom vetoed a separate bill that might have led to more stringent pre-market evaluations, particularly regarding child safety and chatbot conduct.
Colorado’s Regulatory Framework
A notable development comes from Colorado, where the Colorado AI Act will treat medical AI tools as high-risk systems beginning June 30, requiring developers to conduct impact assessments and risk management.
The Need for Medical Oversight
In 2024, the Federation of State Medical Boards (FSMB) published a paper emphasizing the necessity for verifying the accuracy of AI-generated clinical information. They encouraged the development of documentation detailing the capabilities and limitations of commonly used AI tools and called for processes to regularly review their effectiveness in clinical settings.
Broader Implications of AI Evaluation
Challenges in evaluating AI’s safety and efficacy are not confined to healthcare; other sectors, including law, are also grappling with similar issues. Prominent researchers have long been scrutinizing existing benchmarks and testing methods for both predictive and generative models.
NIST’s Commitment to Improved AI Assessment
On December 2, 2025, the National Institute of Standards and Technology (NIST) released a blog post from its Center for AI Standards and Innovation (CAISI) addressing the critical need for better measurement science in AI. CAISI pointed out that many evaluations do not make clear what is being measured or whether the measurement is valid, underscoring the importance of construct validity and generalization. It also noted the challenges posed by benchmark testing, such as the train-test problem and the potential for developers to intentionally optimize for benchmarks.
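The train-test problem CAISI describes, where benchmark items leak into a model's training data and inflate its apparent performance, can be made concrete with a simple overlap check. The sketch below is purely illustrative (it is not a NIST method, and real contamination audits use far larger corpora and fuzzier matching): it flags benchmark questions that share a long word sequence with a training corpus.

```python
# Minimal sketch of a train-test contamination check: flag benchmark
# items whose 8-word sequences also appear in the training corpus.
# Illustrative only; the corpus and benchmark items are invented.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercased word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_items(benchmark: list, corpus: str, n: int = 8) -> list:
    """Indices of benchmark items sharing any word n-gram with the corpus."""
    corpus_grams = ngrams(corpus, n)
    return [i for i, item in enumerate(benchmark)
            if ngrams(item, n) & corpus_grams]

corpus = ("a patient presenting with chest pain radiating to the left arm "
          "should be evaluated for acute coronary syndrome immediately")
benchmark = [
    "a patient presenting with chest pain radiating to the left arm requires what workup",
    "what is the first-line treatment for community acquired pneumonia",
]
print(contaminated_items(benchmark, corpus))  # → [0]: the first item overlaps
```

A model trained on the corpus above could answer the first benchmark item from memorization rather than capability, which is exactly why contamination screening matters before treating a benchmark score as evidence of what a model can do.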
Ongoing Efforts for Robust AI Evaluation
The difficulties of establishing effective evaluation methods for generative AI are increasingly recognized, with notable figures like computer scientist Andrej Karpathy describing an "evaluation crisis." Research published by the Oxford Internet Institute examined 445 benchmarks for large language models and found that many lacked the scientific rigor necessary to draw reliable conclusions about AI capabilities or safety.
Enhancing Validity in Evaluations
Several studies are addressing the need for more scientifically sound evaluations of AI systems. A March 2025 paper from the University of California and the University of Virginia critiqued the arbitrary construction of large language benchmarks for medical applications and advocated for benchmarks that accurately represent real-world tasks. A June 2025 paper echoed these concerns and recommended incorporating measurement theory from social sciences into evaluating generative AI systems.
Emerging Collaboratives and Evaluation Frameworks
Despite significant challenges, numerous initiatives have emerged to enhance AI evaluation across sectors, particularly healthcare. The EvalEval Coalition is working to identify gaps in AI evaluation science, while the AI Evaluator Forum aims to establish rigorous standards and share valuable resources.
Support for Healthcare Organizations
The Health AI Partnership (HAIP) is actively engaged in supporting healthcare delivery organizations (HDOs) with AI evaluation. Among its initiatives, HAIP has piloted a supportive network for local HDOs lacking the resources for AI assessment and developed the AI Vendor Disclosure Framework to guide thorough evaluations of AI systems.
The Importance of Post-Deployment Monitoring
While pre-deployment evaluation is vital, ongoing monitoring of AI tools after implementation is equally crucial. An Aspen Policy Fellow has developed a framework that focuses on post-deployment evaluations for AI-enabled clinical decision support tools.
Assessing LLMs in Patient Engagement
Recent research has also targeted large language models (LLMs) used directly by patients rather than by medical professionals. South Korean researchers created a benchmark, PatientSafeBench, and found that none of the evaluated models met safety standards for patient use, with even medical-specific LLMs underperforming more general models.
Conclusion: The Path Forward for AI in Healthcare
As the landscape of AI in healthcare evolves, stakeholders—including developers, regulators, and healthcare institutions—must remain vigilant. Effective evaluation and oversight are critical to ensure that these technologies genuinely enhance patient care. For developers, it is essential to substantiate marketing claims with rigorous testing and to proactively address potential risks. As both regulatory frameworks and public scrutiny evolve, maintaining a commitment to transparency and accountability will be vital for the successful integration of AI tools in healthcare.