Jan 20, 2026
WCO-IOF-ESCEO
Abstract
OSTEO
DEEP LEARNING BASED OPPORTUNISTIC OSTEOPOROSIS SCREENING ON CHEST RADIOGRAPHS WITH UNCERTAINTY AND CALIBRATION
K. W. Kim1, M. Kim1, G. Lee1, N. Kim2
1Promedius Inc., Seoul, South Korea; 2University of Ulsan College of Medicine, Asan Medical Center, Seoul, South Korea
Objective: To evaluate the performance of AI-based opportunistic osteoporosis (OP)
screening with predictive uncertainty and calibration in heterogeneous clinical environments.
Material and Methods: Two AI-based screening models were developed using chest
radiographs (CXRs): one distinguishing non-OP from OP (Model M1), and one
classifying normal, osteopenia (OPe), and OP (Model M2). Both models were trained on CXRs
from a tertiary hospital (Hospital A; N = 62,420; OPe 34.82%; OP 5.85%). Validation was
performed on an internal set (Hospital A; N = 15,357; OPe 34.42%; OP 6.21%), an external set
from a secondary hospital in South Korea (Hospital B; N = 3,338; OPe 34.90%; OP 8.78%), and
a second external set from a global platform in America (Hospital Cs; N = 1,026; OPe 53.31%;
OP 19.49%). To
enable fair comparison, performance was evaluated using the area under the receiver
operating characteristic curves (AUCs) and the area under the precision-recall curves
(AUPRCs) for screening OP. Uncertainty estimation was conducted using the Laplace
approximation, and calibration was assessed using expected calibration error (ECE).
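As an illustration of the two discrimination metrics named above, the following is a minimal sketch, not the authors' evaluation code. It assumes binary OP labels (1 = OP) and one OP score per radiograph. The helper `op_score_from_three_class` is hypothetical: it assumes a three-class model such as M2 is compared with a binary model by taking its predicted OP probability as the score, which the abstract does not specify.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC AUC via the Mann-Whitney U statistic (tied scores get average ranks)."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n = len(y_score)
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(n, dtype=float)
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < n and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    n_pos = int(y_true.sum())
    n_neg = n - n_pos
    u_stat = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))

def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    y_true = np.asarray(y_true).astype(int)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="mergesort")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision over the ranks at which a positive is retrieved.
    return float((precision_at_k * hits).sum() / hits.sum())

def op_score_from_three_class(probs, op_index=2):
    """Hypothetical reduction: use the OP-class probability as the screening score."""
    return np.asarray(probs, dtype=float)[:, op_index]
```

For a score with no discriminative power, the area under the precision-recall curve is approximately the positive-class prevalence, so AUPRC values are best read against the OP prevalence of each cohort.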
Results: Both models achieved identical AUCs of 0.96 and 0.92 in A and B, respectively. In
Cs, M1 demonstrated a significantly higher AUC than M2 (0.84 vs. 0.81; DeLong p = 0.01).
Despite comparable AUC performance, M2 consistently achieved higher AUPRC across all
cohorts (0.73, 0.53, and 0.61) than M1 (0.70, 0.52, and 0.60). In addition, M2 exhibited higher
uncertainty than M1 across the internal, external, and global validations (48.83 ± 17.36%,
48.86 ± 18.05%, and 57.41 ± 22.03% vs. 40.77 ± 12.32%, 43.34 ± 20.57%, and
49.89 ± 22.77%, respectively; Wilcoxon p < 0.01). Despite this increased uncertainty, M2
demonstrated superior calibration with lower ECE values in Hospitals A and B (2.83% and
3.70%) than M1 (6.71% and 7.01%), respectively. However, under a strong domain shift in Cs,
the ECE of M2 degraded substantially (23.40%), whereas that of M1 remained relatively stable
(6.15%).
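For readers unfamiliar with ECE, the sketch below shows one common way to compute it: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and observed accuracy is averaged with weights proportional to bin size. The binning scheme and the choice of 10 bins are assumptions here; the abstract does not state which variant was used.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1]; returns a fraction (x100 for %).

    confidences: predicted probability of the predicted class, per sample.
    correct:     1 if the prediction was right, 0 otherwise.
    n_bins:      number of bins (assumed value, not taken from the abstract).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; a confidence of exactly 0 falls in the first bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```

A perfectly calibrated model yields an ECE of 0; larger values indicate that stated confidence and observed accuracy diverge.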
Conclusion: Both models demonstrated comparable performance across cohorts for
opportunistic OP screening; however, M2 showed improved performance under class
imbalance and more conservative behavior in ambiguous cases, as reflected by higher
predictive uncertainty and better calibration.


