Recent advances in artificial intelligence (AI) have led to US Food and Drug Administration (FDA) clearance of six platforms for the ultrasound evaluation of thyroid nodules. Studies suggest that these AI tools offer diagnostic performance comparable to or even better than that of less experienced physicians, according to a review article in the journal Thyroid. Ongoing research into systems for assessing lymph nodes and slide specimens also indicates that AI’s role in diagnostic evaluation is likely to expand.
In a comprehensive review, endocrinologist Dr. Johnson Thomas from Mercy Hospital in Springfield, Missouri, and Dr. Franklin N. Tessler, a radiologist at the University of Alabama at Birmingham, revisited their 2022 overview of AI’s role in assessing thyroid nodules. They summarized the latest evidence surrounding ultrasound techniques, cervical lymph node assessments, evaluations of cytology and histology specimens, and molecular testing. The authors noted that the pace of research and commercialization has accelerated since their initial publication, emphasizing that these AI systems are intended to enhance, rather than replace, clinical judgment.
FDA-Cleared Systems
All currently available AI platforms analyze ultrasound images to estimate the risk of malignancy, employing established risk stratification systems such as the American College of Radiology’s Thyroid Imaging, Reporting and Data System (ACR TI-RADS), Korean TI-RADS, European TI-RADS, and the American Thyroid Association guidelines.
Multiple studies indicate that these tools improve diagnostic accuracy for physicians across varying levels of expertise, including early-career clinicians. One study involving 130 pathologically confirmed nodules showed that the AmCAD-UT system significantly enhanced accuracy among junior readers. Similarly, the Koios DS™ Thyroid system increased the overall area under the curve (AUC) from 0.776 to 0.817 in a retrospective study of 172 nodules. It improved sensitivity from 82% to 86% and specificity from 38% to 45% when combined with physician interpretations. In a cohort comprising 28 indeterminate nodules, the system decreased the number of nodules that met biopsy criteria from 24 to 10 while successfully identifying 6 out of 7 malignancies.
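The arithmetic behind the 28-nodule cohort can be made explicit. The Python sketch below uses only the counts reported above; the constant and function names are illustrative and do not come from the review, and the percentages are simple proportions rather than figures reported by the study authors.

```python
# Counts reported for the cohort of 28 indeterminate nodules.
NODULES_TOTAL = 28
BIOPSY_CANDIDATES_BEFORE = 24  # met biopsy criteria without the AI tool
BIOPSY_CANDIDATES_AFTER = 10   # met biopsy criteria with the AI tool
MALIGNANCIES_TOTAL = 7
MALIGNANCIES_IDENTIFIED = 6


def fraction(part: int, whole: int) -> float:
    """Return part / whole, rejecting impossible inputs."""
    if whole <= 0 or not 0 <= part <= whole:
        raise ValueError("expected whole > 0 and 0 <= part <= whole")
    return part / whole


# Share of the cohort meeting biopsy criteria, before and after AI.
share_before = fraction(BIOPSY_CANDIDATES_BEFORE, NODULES_TOTAL)
share_after = fraction(BIOPSY_CANDIDATES_AFTER, NODULES_TOTAL)

# Relative drop in biopsy candidates and share of malignancies identified.
avoided = BIOPSY_CANDIDATES_BEFORE - BIOPSY_CANDIDATES_AFTER
relative_reduction = fraction(avoided, BIOPSY_CANDIDATES_BEFORE)
share_identified = fraction(MALIGNANCIES_IDENTIFIED, MALIGNANCIES_TOTAL)

print(f"Meeting biopsy criteria: {share_before:.1%} -> {share_after:.1%}")
print(f"Relative reduction in biopsy candidates: {relative_reduction:.1%}")
print(f"Malignancies identified: {share_identified:.1%}")
```

With only seven malignancies in the cohort, each missed cancer shifts the identified share by about 14 percentage points, so these proportions are best read as rough indications.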
S-Detect, the first FDA-cleared AI product, was examined in a prospective study of 312 nodules from 236 patients, achieving a sensitivity of 95% and a specificity of 56%. Its performance was comparable to that of seasoned radiologists and superior to that of residents, and it reduced the rate of unnecessary biopsies by up to 28% compared with interpretations made by residents.
Large Language Models
By contrast, multimodal large language models have shown inconsistent results. For instance, in a prospective study involving 106 nodules, ChatGPT-4o misclassified 26 benign and 11 malignant nodules based on ultrasound images, and its accuracy declined when shear wave elastography data were included. In another assessment of 202 nodules, the models tested exhibited high specificity but low sensitivity across most ACR TI-RADS categories.
The review authors described the subpar performance of these models as “not surprising” and recommended against their current use in clinical decision-making.
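The misclassification counts from the 106-nodule ChatGPT-4o study imply an overall accuracy that can be worked out directly. The sketch below, in Python, assumes that each nodule was classified exactly once and that the 26 and 11 errors both refer to the ultrasound-image-only reading; the variable names are illustrative. Sensitivity and specificity cannot be derived the same way, because the split between benign and malignant nodules is not given here.

```python
# Counts reported for the prospective ChatGPT-4o study.
TOTAL_NODULES = 106
MISCLASSIFIED_BENIGN = 26     # benign nodules labeled incorrectly
MISCLASSIFIED_MALIGNANT = 11  # malignant nodules labeled incorrectly

# Assumption: every nodule received exactly one classification,
# so correct calls are simply the total minus all errors.
errors = MISCLASSIFIED_BENIGN + MISCLASSIFIED_MALIGNANT
correct = TOTAL_NODULES - errors
accuracy = correct / TOTAL_NODULES

print(f"Errors: {errors} of {TOTAL_NODULES}")
print(f"Implied overall accuracy: {accuracy:.1%}")
```

Under these assumptions, roughly two of every three nodules were classified correctly.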
Lymph Nodes and Pathology
While there are presently no commercially available systems for lymph node assessment, research is actively investigating this capability. A meta-analysis of 27 studies involving patients with papillary thyroid carcinoma revealed that AI systems achieved 80% sensitivity and 83% specificity in detecting cervical lymph node metastases via ultrasound, compared to a sensitivity of 51% and a specificity of 84% for medical professionals.
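One way to compare the two pairs of pooled figures is to convert them into likelihood ratios, which combine sensitivity and specificity into a single measure of how much a result shifts the odds of metastasis. The Python sketch below applies the standard definitions, LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity, to the percentages quoted above. The likelihood ratios themselves are not reported in the source, and the `Performance` class is a hypothetical helper introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Performance:
    """Pooled sensitivity and specificity, expressed as fractions."""
    sensitivity: float
    specificity: float

    def positive_lr(self) -> float:
        # How much a positive result raises the odds of disease.
        return self.sensitivity / (1.0 - self.specificity)

    def negative_lr(self) -> float:
        # How much a negative result lowers the odds of disease.
        return (1.0 - self.sensitivity) / self.specificity


# Pooled estimates quoted from the 27-study meta-analysis.
ai_systems = Performance(sensitivity=0.80, specificity=0.83)
clinicians = Performance(sensitivity=0.51, specificity=0.84)

for label, perf in (("AI systems", ai_systems), ("Clinicians", clinicians)):
    print(f"{label}: LR+ = {perf.positive_lr():.2f}, "
          f"LR- = {perf.negative_lr():.2f}")
```

On these figures, the AI systems' LR+ is about 4.7 against about 3.2 for clinicians, and their LR- is about 0.24 against about 0.58, so the gap is larger for ruling metastases out than for ruling them in. Because the inputs are pooled estimates, the derived ratios are approximate.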
With regard to cytology, a multicenter retrospective and prospective trial involving 537 thyroid nodules found that an AI model achieved an AUC of 0.977 in distinguishing benign from malignant nodules on whole-slide images. This model also enhanced junior cytopathologists’ specificity from 89% to 99% and accuracy from 88% to 95%. Additionally, a proteomics-based machine learning model evaluated in 294 nodules reached 85% accuracy and 92% sensitivity. However, no AI tools currently hold FDA clearance for thyroid cytopathology or histology images.
Dr. Tessler speculated, “This is just a guess, but I think cytology AI applications will become routine first, driven by the trend toward digital pathology. Nodal assessment will likely take longer.”
Dr. Thomas mentioned that reimbursement will be a crucial factor in adoption. While “many pathology labs are already utilizing AI for slide assessment,” he noted that the costs of whole-slide scanners and related software are barriers to broader implementation. He also suggested that large language models could be applied more widely in clinical operations, such as creating medical notes and managing order entries.
Implementation Considerations
The authors indicated that FDA-cleared AI systems for thyroid nodule assessment have yet to see widespread adoption. Dr. Thomas identified “last-mile” obstacles, including workflow friction, uncertain return on investment (ROI), and insufficient prospective validation in settings where most thyroid ultrasounds are performed. He said that “clear reimbursement and/or value-based justification could facilitate adoption,” noting that practices may be more inclined to move forward when there is a predictable path for covering software costs or when operational benefits, such as fewer unnecessary biopsies, are more clearly demonstrated.
Dr. Tessler pointed to the need for “compelling evidence” that AI systems can reduce workload while maintaining or even improving diagnostic accuracy, which would justify the costs of implementation and continued use. He added that such evidence should come from independent, unbiased trials and that direct comparisons between AI software products would help practices choose the most suitable system.
Beyond the findings of published studies, both experts stressed the importance of thoughtful workflow integration. Dr. Thomas stated, “Integrating AI into thyroid risk stratification can minimize subjectivity. However, multicenter prospective trials are necessary to ensure the accuracy of risk stratification and to reduce unnecessary biopsies.”
Dr. Tessler pointed out that implementation strategies should vary across specialties due to differences in workflows. He suggested that all practice types should outline the entire process—from patient arrival to the integration of ultrasound findings in medical records—and assess how AI systems fit into that pathway. “It’s beneficial to discuss this with prospective vendors during the selection and trial process,” he advised.
In conclusion, AI tools show significant potential for evaluating thyroid nodules and, eventually, lymph nodes as well as biopsy and surgical specimens. Nevertheless, clinicians and healthcare leaders must understand AI’s limitations and should select and implement technologies that are compatible with existing infrastructure, applicable to the intended patient population, and easy to learn, and that enhance current workflows.
Disclosure: Dr. Thomas holds intellectual property rights/patents related to the application of AI for thyroid nodule risk stratification (AIBx, not included in the review). Dr. Tessler reported no conflicts of interest.
Source: Thyroid