As large medical imaging datasets become widely available, researchers are increasingly turning to artificial intelligence to extract useful information from scans that were never manually annotated. Automated "anatomy segmentation" tools, programs that label organs and structures in images such as CT scans, promise to make large-scale studies feasible. But as new models proliferate, an important question remains: how can researchers compare these tools when no ground truth exists?
A recent study published in the Journal of Medical Imaging tackles this problem by introducing a practical framework for comparing AI-based anatomy segmentation models in the absence of expert reference annotations. The work focuses on chest CT scans from the National Lung Screening Trial (NLST), a major public dataset used for cancer research, and evaluates how consistently different open-source models label the same anatomical structures.
The challenge of comparison without ground truth
Most public imaging datasets, including NLST, contain thousands of scans but lack detailed annotations of organs or bones. Creating such labels by hand would require years of expert effort. While AI models can generate these labels automatically, their results vary. Models may use different naming conventions, define boundaries differently, or include anatomy that others exclude. Without a reference standard, judging which model performs best is difficult.
To address this, the researchers designed a workflow based on agreement rather than accuracy. Instead of asking which model is "correct," they examined where models converge and diverge when applied to the same scans. The idea is simple: if several independent models produce similar results for a structure, that result is more likely to be reliable.
A standardized foundation
The first step was to make the models comparable. Six widely used, open-source segmentation models were selected: two versions of TotalSegmentator, plus Auto3DSeg, MOOSE, MultiTalent, and CADS. Each model originally produced output in its own format, using its own structure names.
The team converted all results into a standardized DICOM segmentation format. They harmonized labels using SNOMED CT, a standard medical terminology, and assigned consistent colors and identifiers to each structure. This made it possible to load segmentations of the same structure, produced by different models, side by side and compare them directly.
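The study's actual mapping tables ship with its released code, but conceptually the harmonization step is a lookup from each model's private structure names to a shared terminology entry. Here is a minimal Python sketch with hypothetical model output names and illustrative colors; only the two SNOMED CT codes shown are real concepts (heart structure and left lung structure):

```python
# Illustrative label harmonization: map each model's private structure names
# to one shared display name, SNOMED CT concept, and color. The model names
# and color values are hypothetical; the two SNOMED CT codes are real.

MODEL_LABELS = {
    "model_a": {"heart": "Heart", "lung_left": "Left lung"},
    "model_b": {"Heart": "Heart", "LeftLung":  "Left lung"},
}

CANONICAL = {
    "Heart":     {"snomed_ct": "80891009", "color": (206, 110, 84)},   # Heart structure
    "Left lung": {"snomed_ct": "44029006", "color": (197, 165, 145)},  # Left lung structure
}

def harmonize(model: str, raw_label: str) -> dict:
    """Resolve a model-specific label to the shared terminology entry."""
    name = MODEL_LABELS[model][raw_label]
    return {"name": name, **CANONICAL[name]}

print(harmonize("model_b", "LeftLung"))
# {'name': 'Left lung', 'snomed_ct': '44029006', 'color': (197, 165, 145)}
```

Once every model's output resolves to the same entry, a viewer can overlay "the heart according to model A" and "the heart according to model B" without any manual bookkeeping.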
To support visual review, the researchers extended two open-source tools. They integrated results into OHIF Viewer, a web-based viewer that runs in a browser, and they developed a new plugin for 3D Slicer, a desktop imaging platform widely used in medical research. Together, these tools allow users to inspect the same organ across multiple models with a few clicks.
Measuring agreement across models
The study analyzed 18 chest CT scans from four NLST participants. After filtering out structures that were inconsistently present or only partially imaged, the analysis focused on 24 anatomical structures, including lung lobes, the heart, ribs, thoracic vertebrae, and the sternum.
For each structure, the researchers identified a "consensus" region: the set of image voxels labeled by all models that included that structure. Individual model results were then compared to this consensus using two measures: one captured how much the shapes overlapped, the other compared the segmented volumes. The results were displayed in interactive plots that made it easy to spot outliers and problematic cases, and the researchers published them on a public website where anyone can explore them firsthand.
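The article does not spell out the metrics, but the overlap measure is most plausibly a Dice coefficient and the volume comparison a simple ratio against the consensus. Here is a NumPy sketch of the agreement computation under those assumptions, using randomly perturbed toy masks in place of real segmentations:

```python
import numpy as np

def consensus_region(masks):
    """Voxels labeled by every model that segmented this structure."""
    out = masks[0].astype(bool)
    for m in masks[1:]:
        out &= m.astype(bool)
    return out

def dice(a, b):
    """Dice coefficient: 2*|A∩B| / (|A| + |B|); 1.0 means perfect overlap."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def volume_ratio(mask, consensus):
    """Segmented volume relative to the consensus volume (voxel size cancels)."""
    return mask.sum() / consensus.sum()

# Toy stand-ins for real segmentations: three models that agree on a base
# shape but each flip roughly 2% of voxels.
rng = np.random.default_rng(0)
base = rng.random((32, 32, 32)) > 0.5
masks = [base ^ (rng.random(base.shape) > 0.98) for _ in range(3)]

cons = consensus_region(masks)
for i, m in enumerate(masks):
    print(f"model {i}: Dice vs consensus = {dice(m, cons):.3f}, "
          f"volume ratio = {volume_ratio(m, cons):.3f}")
```

Note one property of this design: a model that defines a structure more narrowly than the rest, like the heart case described below, shrinks the consensus and inflates every other model's volume ratio, which is exactly the kind of pattern the interactive plots are built to surface.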
What worked — and what didn't
The lung segmentations showed the strongest agreement. Across all models and lung lobes, overlap was consistently high, and visual inspection confirmed that boundary differences were minor. This suggests that lung segmentation is a mature task, even in low-dose screening CT scans.
Other structures were more challenging. Heart segmentation showed only moderate agreement at first, largely because one model defined the heart more narrowly than the others. When that model was excluded, agreement among the remaining tools improved substantially.
The ribs and thoracic vertebrae revealed more serious issues. Four of the six models showed frequent errors, such as merging adjacent bones or labeling the wrong vertebra. In contrast, two models, trained on different data, produced more consistent and anatomically complete results. These differences were not obvious from summary statistics alone but became clear through side-by-side visualization.
Why this matters
The study shows that even well-regarded AI tools can fail in systematic ways, especially when they share training data. It also demonstrates that meaningful evaluation is possible without ground truth, using a combination of standardization, quantitative agreement measures, and targeted visual review.
Beyond the specific findings, the authors emphasize the broader value of their framework. All software, mappings, and example datasets are openly available, and the approach can be applied to other imaging collections and other segmentation tasks. As researchers increasingly rely on AI-generated annotations to study population-scale datasets, such tools will be essential for choosing models wisely—and for understanding their limits.
Rather than offering a single "best" model, the work highlights a more realistic goal: enabling informed decisions, grounded in evidence, when perfect answers are not available.
For details, see the original Gold Open Access article by L. Giebeler et al., "In search of truth: evaluating concordance of AI-based anatomy segmentation models," J. Med. Imaging 13(6), 062204 (2026), doi: 10.1117/1.JMI.13.6.062204.