Jan 20, 2026

WCO-IOF-ESCEO

Abstract

OSTEO

DEEP LEARNING BASED OPPORTUNISTIC OSTEOPOROSIS SCREENING ON CHEST RADIOGRAPHS WITH UNCERTAINTY AND CALIBRATION

K. W. Kim1, M. Kim1, G. Lee1, N. Kim2

1Promedius Inc., Seoul, South Korea, 2University of Ulsan College of Medicine, Asan Medical Center, Seoul, South Korea


Objective: To evaluate the performance of AI-based opportunistic osteoporosis (OP) screening with predictive uncertainty and calibration in heterogeneous clinical environments.

Material and Methods: Two AI-based screening models were developed using chest radiographs (CXRs): a binary model distinguishing non-OP from OP (Model M1) and a three-class model classifying normal, osteopenia (OPe), and OP (Model M2). Both models were trained on CXRs from a tertiary hospital (Hospital A; N = 62,420; OPe 34.82%; OP 5.85%). Internal validation (Hospital A; N = 15,357; OPe 34.42%; OP 6.21%), external validation 1 from a secondary hospital (Hospital B; N = 3,338; OPe 34.90%; OP 8.78%) in South Korea, and external validation 2 from a global platform in America (Hospital Cs; N = 1,026; OPe 53.31%; OP 19.49%) were performed. To enable a fair comparison between the two models, performance was evaluated on the OP screening task using the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPRC). Uncertainty was estimated using the Laplace approximation, and calibration was assessed using the expected calibration error (ECE).
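For readers unfamiliar with ECE, the following is a minimal sketch of the common equal-width binned estimator: predictions are grouped by confidence, and the gap between mean confidence and observed accuracy in each bin is averaged, weighted by bin size. The abstract does not state the number of bins or the binning scheme, so `n_bins=10`, the function name, and the toy inputs are assumptions made for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted probability of the predicted class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    n_bins:      number of equal-width bins (assumed; not given in the abstract).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    sum_conf = [0.0] * n_bins   # summed confidence per bin
    sum_hits = [0.0] * n_bins   # number of correct predictions per bin
    counts = [0] * n_bins       # number of predictions per bin

    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sum_conf[idx] += conf
        sum_hits[idx] += 1.0 if ok else 0.0
        counts[idx] += 1

    ece = 0.0
    for s_conf, s_hits, k in zip(sum_conf, sum_hits, counts):
        if k:
            # (k / n) * |s_hits / k - s_conf / k| simplifies to |s_hits - s_conf| / n
            ece += abs(s_hits - s_conf) / n
    return ece


if __name__ == "__main__":
    # Toy data, not taken from the study.
    toy_conf = [0.95, 0.85, 0.75, 0.65]
    toy_correct = [True, True, False, True]
    print(f"ECE = {expected_calibration_error(toy_conf, toy_correct):.3f}")
```

An ECE of 0 means that, within every bin, the stated confidence matches the observed accuracy; multiplying the returned value by 100 gives a percentage like those reported here.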


Results: Both models achieved identical AUCs of 0.96 and 0.92 in A and B, respectively. In Cs, M1 demonstrated a significantly higher AUC than M2 (0.84 vs. 0.81; DeLong p = 0.01). Despite comparable AUC performance, M2 consistently achieved higher AUPRC across all cohorts (0.73, 0.53, and 0.61) than M1 (0.70, 0.52, and 0.60). In addition, M2 exhibited higher uncertainty in the internal, external, and global validations (48.83 ± 17.36%, 48.86 ± 18.05%, and 57.41 ± 22.03%, respectively) than M1 (40.77 ± 12.32%, 43.34 ± 20.57%, and 49.89 ± 22.77%, respectively; Wilcoxon p < 0.01). Despite this increased uncertainty, M2 demonstrated superior calibration, with lower ECE values in Hospitals A and B (2.83% and 3.70%, respectively) than M1 (6.71% and 7.01%, respectively). However, under a strong domain shift in Cs, the ECE of M2 degraded substantially (23.40%), whereas that of M1 remained relatively stable (6.15%).
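To illustrate how an AUPRC for OP screening can be computed, here is a minimal sketch that uses the step-wise average-precision estimator. The abstract does not say which AUPRC estimator was used or how the three-class output of M2 was reduced to a single OP score; taking the predicted OP probability as that score, along with the function name and the toy data, is an assumption made for illustration.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision at the rank of each positive case.

    scores: higher means more likely OP (for a three-class model this sketch
            assumes the predicted OP probability is used as the score).
    labels: 1 for OP, 0 for non-OP.
    Tied scores are ordered arbitrarily here; a full implementation would
    group them into a single threshold.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank cases from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


if __name__ == "__main__":
    # Toy three-class probabilities (normal, OPe, OP); not taken from the study.
    toy_probs = [(0.1, 0.2, 0.7), (0.6, 0.3, 0.1), (0.2, 0.3, 0.5), (0.7, 0.2, 0.1)]
    toy_labels = [1, 0, 1, 0]
    op_scores = [p[2] for p in toy_probs]  # assumed reduction to an OP score
    print(f"AP = {average_precision(op_scores, toy_labels):.3f}")
```

Unlike AUC, the chance-level value of this metric is approximately the prevalence of the positive class, so AUPRC values are best compared between models within the same cohort rather than across cohorts with different OP prevalence.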


Conclusion: Both models demonstrated comparable performance across cohorts for opportunistic OP screening; however, M2 showed improved performance under class imbalance and more conservative behavior in ambiguous cases, as reflected by higher predictive uncertainty and better calibration in Hospitals A and B.

PROMEDIUS INC.

Copyright 2025 PROMEDIUS INC. All rights reserved.

13, Olympic-ro 35da-gil, Songpa-gu, Seoul, 05510 Republic of Korea
