AI Tool Generates Millions of New Molecules

Chemists have long grappled with the sheer number of possible molecules: the space of potentially useful compounds is so large that the molecules known today represent only a minuscule fraction of what could exist.

A research team at the Universitat Rovira i Virgili in Spain has developed an artificial intelligence system designed to venture into this uncharted territory. The system can generate millions of molecular structures not found in existing databases while adhering to the fundamental principles of chemistry. The findings, published in Nature Machine Intelligence, could speed up the exploration of chemical space, the vast domain of atom combinations that may eventually yield new drugs, materials, refrigerants, and other compounds.

The size of this molecular space is staggering. The authors note that the number of drug-like molecules may be on the order of 10⁶⁰, vastly more than the number of water molecules in all of Earth’s oceans. Discovering new molecules is therefore less like searching a library and more like navigating an entire universe.

The heart of this groundbreaking research lies in a model named CoCoGraph, which operates similarly to image-generating AI. Instead of creating visuals from random noise, it constructs molecular structures by learning how valid molecules can be deconstructed and reassembled.

Roger Guimerà, an ICREA research professor within the Department of Chemical Engineering at URV, explained it succinctly: “Our algorithm functions in the same manner as those used for images, but focuses on molecules.”

Roger Guimerà, Manuel Ruiz-Botella, and Marta Sales-Pardo from the Department of Chemical Engineering are spearheading this research. (CREDIT: Universitat Rovira i Virgili)

Establishing Rules Before Innovation

This project addresses a significant shortcoming of earlier molecular generation systems. Models such as variational autoencoders, generative adversarial networks, and graph neural networks advanced the field but often struggled with scalability, efficiency, or chemical accuracy. Some produced imaginative structures that violated fundamental chemical principles.

CoCoGraph takes a distinct approach. Rather than requiring the model to independently learn the rules, the researchers integrated these principles directly into the generation process. Each atom maintains the appropriate number of bonds, ensuring that valence is preserved and the molecular formula remains intact during the entire process.

This design decision is crucial, as it guarantees that every molecule produced by the system is chemically valid within the framework used for the study. The authors note that CoCoGraph achieved 100% chemical validity while generating highly novel outputs.

Marta Sales-Pardo, also in the Department of Chemical Engineering at URV, elucidated the process: “We begin with an actual molecule, break its bonds, and then randomly create new ones. The model learns to reverse this process and reconstruct coherent structures.”

Unlike images, molecules are not smooth, continuous data; they are discrete structures composed of atoms and bonds, which complicates the mathematics of diffusion. To address this, CoCoGraph employs what the team calls a constrained discrete diffusion process based on double-edge swapping. In essence, the process repeatedly rewires pairs of bonds while leaving the number of bonds at every atom unchanged.
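To make the bond-swapping idea concrete, here is a minimal sketch of a single double-edge swap on a toy molecular graph. It is an illustration under stated assumptions, not the authors' implementation: the names `key`, `bonded_valence`, `double_edge_swap`, and `max_order`, and the five-atom example graph, are all hypothetical. It covers only the random "noising" move; the learned model that reverses the process is not shown.

```python
import random
from collections import Counter


def key(u, v):
    """Store each bond under a sorted atom-index pair."""
    return (u, v) if u < v else (v, u)


def bonded_valence(bonds):
    """Sum of bond orders at each atom; a swap must leave this unchanged."""
    valence = Counter()
    for (u, v), order in bonds.items():
        valence[u] += order
        valence[v] += order
    return valence


def double_edge_swap(bonds, rng, max_order=3):
    """Propose one swap: drop a-b and c-d, add a-c and b-d.

    Returns a new Counter of bond orders, or the input unchanged when the
    proposal would create a self-loop or exceed `max_order` (an assumed cap).
    """
    unit_edges = list(bonds.elements())  # a double bond counts as two units
    (a, b), (c, d) = rng.sample(unit_edges, 2)
    if rng.random() < 0.5:  # choose between the two possible rewirings
        c, d = d, c
    if a == c or b == d:  # would bond an atom to itself
        return bonds
    proposal = Counter(bonds)
    proposal[key(a, b)] -= 1
    proposal[key(c, d)] -= 1
    proposal[key(a, c)] += 1
    proposal[key(b, d)] += 1
    if max(proposal.values()) > max_order:
        return bonds
    return +proposal  # unary plus drops bonds whose order fell to zero


# Hypothetical five-atom skeleton with one double bond between atoms 1 and 2.
rng = random.Random(0)
start = Counter({(0, 1): 1, (1, 2): 2, (2, 3): 1, (3, 4): 1})
noised = start
for _ in range(50):
    noised = double_edge_swap(noised, rng)
```

Each accepted swap removes one bond unit from four atom ends and adds one back to the same four ends, so every atom keeps its total bond count and the set of atoms never changes. This sketch does not check that the graph stays connected; whether and how CoCoGraph handles that is not covered here.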

The system also includes a second model, known as a time model, which evaluates how close a partially reconstructed graph is to an actual molecule. This additional feedback helps the main diffusion model enhance its denoising process.

The constrained collaborative graph diffusion model, CoCoGraph. (CREDIT: Nature Machine Intelligence)

Compact Model, Enhanced Realism

The researchers compared CoCoGraph with six other leading molecule generators using the GuacaMol benchmark, a standard test suite in the field. They assessed all models against a filtered PubChem reference database containing 94.7 million molecules with no overlap with the training data.

Two variations of CoCoGraph were evaluated. The smaller BASE model comprised a total of 534,000 parameters, while a fingerprint-enhanced version, known as FPS, used 4.4 million parameters. Even the larger model had fewer parameters than many competing systems.

Despite its streamlined design, CoCoGraph performed strongly. Both variants attained 100% chemical validity, with uniqueness rates of 99.8% and 99.9% and a novelty rate of 95.7%. Within the GuacaMol benchmark, the KL divergence scores for property matching reached 95.7% for the BASE model and 96.3% for the FPS version, outperforming the models the team used for comparison.

These results matter because novelty alone is not enough. A useful model should generate new molecules that remain plausible and have physicochemical properties resembling those found in real chemistry.
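For readers unfamiliar with these three rates, the sketch below shows one common way they are computed. The definitions are an assumption, not a quotation of the paper or of the benchmark: validity as the share of outputs that pass a validity check, uniqueness as the share of valid outputs that are distinct, and novelty as the share of distinct outputs absent from a reference set. The function name, the toy strings, and the `toy_is_valid` check are hypothetical placeholders and not real chemistry.

```python
def generation_metrics(generated, reference, is_valid):
    """Compute validity, uniqueness, and novelty for a batch of outputs.

    `generated` is a non-empty list of molecule identifiers (assumed to be
    canonical strings), `reference` is a set of known identifiers, and
    `is_valid` is a caller-supplied validity check.
    """
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(reference)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid),
        "novelty": len(novel) / len(unique),
    }


def toy_is_valid(identifier):
    """Stand-in check (balanced parentheses only); NOT a chemical validator."""
    return identifier.count("(") == identifier.count(")")


# Toy batch: one duplicate, one malformed string, one already-known molecule.
toy_generated = ["CCO", "CCO", "CCN", "c1ccccc1", "C(("]
toy_reference = {"c1ccccc1"}
metrics = generation_metrics(toy_generated, toy_reference, toy_is_valid)
```

In this toy batch four of five strings pass the check, three of those four are distinct, and two of those three are absent from the reference set. A model that builds chemical rules into generation, as the article describes for CoCoGraph, would score 100% on the first ratio by construction.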

The authors also expanded their evaluation beyond the conventional test of ten properties. Across a broader set of 36 chemical properties, CoCoGraph surpassed competing models on at least 66.6% of them. The team observed particular strengths in topological features, electronic characteristics, and structural descriptors, all of which matter for medicinal chemistry and drug discovery.

Can Chemists Differentiate?

One of the more intriguing parts of the study came when the team moved from automated benchmarks to human judgment.

They constructed a database of 8.2 million synthetic molecules, noting a redundancy rate of 7.1%. Based on the reported novelty rate, this database encompasses approximately 7.3 million unique, chemically valid molecules that are not represented in PubChem.
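The 7.3 million figure can be reproduced from the numbers quoted in this article. The short calculation below assumes that the redundancy rate is the share of duplicate entries and that the 95.7% novelty rate applies to the remaining unique molecules; the variable names are illustrative.

```python
generated = 8.2e6    # molecules in the generated database
redundancy = 0.071   # assumed to mean the fraction of duplicate entries
novelty = 0.957      # novelty rate quoted for the benchmark comparison

unique = generated * (1 - redundancy)  # about 7.62 million distinct molecules
novel_unique = unique * novelty        # about 7.29 million not in the reference

print(f"{unique / 1e6:.2f} million unique, {novel_unique / 1e6:.2f} million novel")
```

Rounded to one decimal place, the result matches the roughly 7.3 million molecules cited above.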

Next, they conducted an experiment resembling a molecular Turing test. A total of 121 participants with backgrounds in organic chemistry, biochemistry, and related fields evaluated 20 pairs of molecules. Each pair consisted of one molecule from the original dataset and another generated by CoCoGraph, both sharing the same molecular formula. This design compelled participants to assess structure rather than size or composition.

Across 2,420 evaluations, participants correctly identified the authentic molecule 62% of the time. Undergraduate participants scored 60%, while graduate participants scored 64%.

That is better than random guessing, which would yield 50%, but not by a wide margin.

In some categories, including acyclic and predominantly aliphatic molecules, the results were close to random guessing. The authors are cautious and stop short of claiming complete indistinguishability, but they argue that the findings suggest many generated molecules are convincing even to experienced chemists.
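A rough back-of-the-envelope check helps put the overall 62% figure in context. The sketch below is not the authors' analysis: it treats every evaluation as an independent coin flip, which is a simplification because each participant judged 20 pairs, and it uses a plain normal approximation.

```python
import math

participants = 121
pairs_per_participant = 20
evaluations = participants * pairs_per_participant  # 2,420, as reported

observed = 0.62  # share of correct identifications
chance = 0.50    # expected share under pure guessing

# Standard error of a proportion under the guessing hypothesis.
standard_error = math.sqrt(chance * (1 - chance) / evaluations)
z_score = (observed - chance) / standard_error
margin = observed - chance  # absolute gap in accuracy

print(f"z = {z_score:.1f}, margin over chance = {margin:.0%}")
```

Under these simplifying assumptions the gap is many standard errors wide, so it is unlikely to be a fluke, yet the margin itself is only 12 percentage points. Both halves of that picture are consistent with the cautious reading above: chemists do better than chance, but not by much.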

A Step Toward Targeted Design

Currently, CoCoGraph does not enable chemists to input specific requests and receive a tailored molecule in return. It is not yet capable of directly designing compounds to meet particular functions.

Nonetheless, the study includes early demonstrations of potential applications. The team searched its database of 8.2 million molecules for structures exhibiting physicochemical properties analogous to paracetamol, identifying top candidates based on nine key properties. They also experimented with an inpainting-style approach, preserving a portion of an existing molecule while adding small or medium fragments to create related variations.

A selection of 50 randomly generated molecules from CoCoGraph FPS. (CREDIT: Nature Machine Intelligence)

This type of controlled editing could be significant for drug optimization, where researchers typically aim to maintain a molecular scaffold while modifying other portions of the structure.
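The property-based search described above can be pictured as a nearest-neighbour lookup. The sketch below is one plausible way to do it, not the procedure from the study: it standardizes each property across the candidates and ranks them by Euclidean distance to a target. The function name, the three-property vectors (the study used nine), and every number are arbitrary placeholders rather than real measurements.

```python
import math
from statistics import mean, pstdev


def rank_by_property_similarity(target, candidates, top_k=3):
    """Return the names of the `top_k` candidates closest to `target`.

    `target` is a list of property values and `candidates` maps a name to a
    list of the same length. Each property is z-scored across the candidates
    so that no single large-valued property dominates the distance.
    """
    columns = list(zip(*candidates.values()))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]  # guard against zero

    def scale(vector):
        return [(x - c) / s for x, c, s in zip(vector, centers, spreads)]

    scaled_target = scale(target)
    distance = {
        name: math.dist(scaled_target, scale(vector))
        for name, vector in candidates.items()
    }
    return sorted(distance, key=distance.get)[:top_k]


# Placeholder property vectors; the values carry no chemical meaning.
toy_target = [150.0, 0.5, 2.0]
toy_candidates = {
    "cand_a": [151.0, 0.6, 2.0],
    "cand_b": [300.0, 3.0, 5.0],
    "cand_c": [149.0, 0.4, 1.0],
    "cand_d": [90.0, -1.0, 0.0],
}
closest = rank_by_property_similarity(toy_target, toy_candidates, top_k=2)
```

Standardizing first is a design choice: without it, a property measured in the hundreds would swamp one measured in fractions. On a database of millions of molecules, a vectorized or indexed search would be needed in place of this plain loop.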

“At this stage, we are solely generating molecules,” noted Manuel Ruiz-Botella, a doctoral student involved in the project. “The next objective will be to incorporate specific goals into this process.”

The study also acknowledges its limitations. CoCoGraph fixes the molecular formula during generation, which might restrict its applications. The model was designed for molecules containing up to 70 atoms, and extending this capacity to larger structures will necessitate retraining and additional computing resources. Furthermore, the authors highlight future applications in mass spectrometry and conditional molecule generation, although these remain areas for further exploration rather than proven capabilities.

Real-World Implications of the Research

The primary immediate benefit of CoCoGraph is not that it has already produced a novel drug or material; it has yet to do so. Instead, it provides a faster and more efficient means of exploring a chemical landscape too immense for humans to navigate manually.

By generating only chemically valid structures and utilizing fewer parameters than many of its competitors, this system may significantly lower the computational costs associated with large-scale molecular exploration. Furthermore, its database of 8.2 million molecules could serve as a valuable starting point for identifying realistic candidates in drug development and materials research.

More broadly, this work suggests that AI systems in chemistry can become more effective when they do not merely replicate known data but are structured around the rigorous constraints inherent to the field. In this instance, the foundational chemistry rules are integral to the model’s success.
