Anthropic Hands Petri AI Alignment Tool to Meridian Labs

Anthropic has officially handed over development of Petri, its open-source AI alignment testing tool, to Meridian Labs. The move places the project under the management of an independent nonprofit dedicated to AI evaluation.

Petri is a collection of tests for evaluating large language models for behaviors such as deception, sycophancy, and cooperation with harmful requests. Anthropic says it has used the tool in the alignment assessments of every Claude model since Claude Sonnet 4.5.

New Features

The transfer coincides with the launch of Petri 3.0, which introduces significant architectural changes and expands the range of settings the tool can be used in. The updated version decouples the auditor model from the target model, so researchers can modify each one independently.

Petri also now includes a component called Dish, which aims to bring evaluations closer to real-world deployments. Dish runs tests using a model's actual system prompt and the surrounding setup the model operates within in production.

This matters because alignment testing has long faced a persistent problem: models can sometimes tell when they are being evaluated and may change their behavior accordingly. Such changes can skew the results, making them a weaker guide to how a model would behave outside a controlled environment.

Anthropic has also connected Petri with Bloom, another open-source tool, to support more detailed assessments of specific behaviors. The pairing combines Petri's broad test coverage with Bloom's deeper, targeted analyses, producing a more comprehensive evaluation process.

Originally launched in October 2025 through the Anthropic Fellows program, Petri can be applied to any large language model. The tool uses a separate auditor model to probe the target model in alignment-relevant scenarios, while a judge model reviews the resulting transcripts for signs of misaligned behavior.

Petri's adoption has already spread beyond Anthropic. The UK's AI Security Institute has used the tool as a key part of its evaluations of whether models tend to undermine AI research.

Independent Oversight

The decision to move Petri to Meridian Labs reflects a growing debate in the AI community over who should control the tools used to assess model behavior. Developers, regulators, and external researchers increasingly argue that trust in evaluation results depends on whether the methods are seen as independent of the organizations building the systems under evaluation.

Anthropic likens this move to its earlier decision to donate the Model Context Protocol to the Linux Foundation. In the case of Petri, the objective is to ensure that the tool remains independent from any single AI lab, allowing its findings to be regarded as neutral across the industry and among regulatory bodies.

Meridian Labs specializes in AI evaluation and already hosts tools such as Inspect and Scout. By adding Petri to its repertoire, the nonprofit is developing a more comprehensive suite of software tailored for laboratories, researchers, and governments seeking standardized methods for testing model behavior.

The transfer also highlights the growing importance of open-source infrastructure in AI governance. As developers face heightened scrutiny over safety and reliability, shared testing tools can let external groups conduct their own assessments instead of relying solely on the internal methods of the organizations that build the models.

However, open tools do not automatically resolve more complex issues, such as defining acceptable model behavior, designing appropriate tests, or enabling comparisons across different systems. Evaluation frameworks can reveal tendencies but still rely on subjective decisions concerning scenarios, scoring, and interpretation.

Petri's architecture partly addresses this challenge by separating the generation of test scenarios from the evaluation of responses. As Anthropic describes it, one model simulates alignment-relevant situations while another reviews the resulting exchanges for problematic behavior.

Dish specifically targets a longstanding weakness in model evaluation: a lack of realism. When test environments differ significantly from production settings, a model's behavior can diverge between the two, which makes the findings less useful for predicting how it will actually behave once deployed.

By moving Petri's development outside the company, Anthropic gives up direct control over a tool that has played a central role in the internal assessment of its Claude models. The move positions Meridian Labs to establish Petri as a shared community resource rather than one closely tied to a single commercial company.

Anthropic aims to foster alignment tools that are both open and beneficial for the broader AI development community, thereby advancing the field as a whole.
