Interpretable machine learning model for digital lung cancer prescreening in Chinese populations with missing data
Data source and setting
This is an observational retrospective study. The data was obtained from the Cheeloo Lifespan Electronic-Health Research Data-library (Cheeloo LEAD) and the Boxing database. The Cheeloo LEAD database integrates data from 5,152,597 individuals across 4909 hospitals and primary health-care institutions in 39 urban areas from Shandong Province, China. This collection includes 1214 secondary and tertiary hospitals as well as 3695 primary medical institutions. Each individual can be linked to their comprehensive data across all 4909 medical institutions through an encrypted unique ID. Additionally, the Cheeloo LEAD database retrospectively collected EHR data from 2009 to 2023, covering a follow-up period of around 14 years. The Boxing database, serving as one of the external validation sets, has compiled EHR data from three sub-district offices and nine towns in Boxing County, Shandong Province, China, for the period from 2018 to 2023. This dataset does not overlap with the Cheeloo LEAD database. Both datasets comprise demographic, diagnostic, medication, and surgery information. We follow the common data model33 to conduct data quality control and standardization on data from various sources and formats, thereby establishing a unified terminology.
Study population
Participants for the training and internal validation sets were selected from the Cheeloo LEAD database, covering the period from Jan 1, 2015, to Oct 31, 2017. Inclusion criteria were individuals aged 40 years and above, with exclusions for those previously diagnosed with lung cancer before recruitment and individuals with abnormal lung cancer diagnosis dates. The lung cancer diagnosis code was determined based on the International Classification of Diseases, 10th edition (ICD-10), specifically code C34. Data were randomly divided into the training set and internal validation set using a 7:3 ratio.
For external validations, the study conducted both temporal and geographic validations. Temporal validation involved participants from the Cheeloo LEAD database between Jan 1, 2018, and Oct 31, 2020, adhering to the same inclusion and exclusion criteria as the training set. Geographic validation included individuals aged 40 years and above from the Boxing database, spanning Jan 1, 2019, to Dec 31, 2020, with similar exclusions for those with prior lung cancer diagnoses.
Figure 1 illustrates the flowchart of enrollment. The study adhered to the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD).
Variable selection
In this study, we collected 4148 potential predictors, including demographic variables, diagnoses, surgical interventions, and medication usage. Age was categorized into a five-level variable, while all other variables were dichotomized (see Supplementary Note 2 for detailed values of variables). All candidate variables were initially evaluated using a univariate logistic regression model. Given the multitude of variables, to control the overall risk of Type I error (false positive rate), p-values underwent Bonferroni correction, selecting variables with p < 0.05/4,148. To fully leverage the interrelationships among variables and enhance prescreening accuracy in the presence of missing data, we implemented Incremental Feature Selection (IFS), which selects an appropriate number of variables. For more details of the IFS, please see Supplementary Note 3. Definitions and coding formats of predictors for BOUND can be located in the Supplementary Table 6.
Development of the BOUND model
We developed BOUND using the complete training set sourced from the Cheeloo LEAD database, which has the advantage of linking comprehensive data from 4909 hospitals and primary health-care institutions. Each individual’s records across these institutions are indexed using an encrypted unique ID, ensuring a dataset with no missing entries.
The development of BOUND is divided into three steps: First, learning the BN structure; second, acquiring the BN parameters; and finally, conducting BN’s uncertainty inference to get the lung cancer risk and reverse inference of high-risk factors. BN’s uncertainty inference can predict risk even when data is missing. Figure 2 illustrates the workflow involved in constructing the BOUND model.
Step 1: Learning the BN structure of BOUND
Firstly, we used the TABU search algorithm to learn the BN structure, allowing us to intuitively display the dependency relationships between variables26. The TABU search algorithm optimizes BN structures by iteratively modifying initial structures and evaluating them with a scoring function like BIC. The best-scoring structure that isn’t on the TABU list, which prevents revisiting recent structures, is selected to continue the search. This process repeats until a stopping criterion is met, resulting in an optimized BN structure. The following formula represents the principle of the TABU algorithm:
The structure learning algorithm of BN explores the entire feasible structure space \(\mathbbG\) using the given data D and identifies the optimal structure \(\mathcalG\) by maximizing a defined objective score function (Equation (1)), which measures the fit between the structure and the data.
$$\mathcalG=\arg \mathop\max \limits_\mathbbG^* \in \mathbbGS(\mathbbG^* ,\bfD)$$
(1)
During the network structure learning, we integrated some prior knowledge to enhance the network’s interpretability and accuracy (see Supplementary Note 4 for details).
Step 2: Learning the BN parameter of BOUND
Faced with the challenge of sparse data in EHR, the traditional BN parameter learning method, specifically the maximum likelihood estimation, often encounters limitations in prediction due to the generation of missing values. In Supplementary Note 5, we comprehensively explain the occurrence of missing values, detailed through formula derivation. To overcome this challenge, we developed a new parameter learning algorithm, BN-logistic. BN-logistic learns the effects β between variables and applies them across all possible combinations of variable values, generating conditional probability tables for all scenarios.
Here is an explanation of the principles behind BN-logistic:
The BOUND model is represented by \(\mathcalB=\langle \mathcalG,\boldsymbol\theta \rangle\), where \(\mathcalG\) denotes the network structure learned in Step 1, and θ represents the network parameters. Assume the network contains p discrete variables X = x1, x2, ⋯, xp. Each node xi in the network can take ri possible values, specifically 1, 2, ⋯, ri. The value combinations of the parents \(\bfpa_i^\mathcalB\) of xi amount to qi, ranging from 1, 2, ⋯, qi. If xi has no parents, then qi is 1. Consequently, the parameters of BOUND are defined by the following conditional probability:
$$\boldsymbol\theta _ \bfpa_i^\mathcalB=P_\mathcalB(x_i| \bfpa_i^\mathcalB)$$
(2)
Here, \(\bfpa_i^\mathcalB\) denotes the set of parents of variable xi in \(\mathcalB\). The formula can be expanded as:
$$\theta _ijk=P(x_i=k| \bfpa_i^\mathcalB=j),$$
(3)
where xi represents the i-th variable among the p variables, i ranges from 1 to p. k denotes the k-th value of xi, with a range from 1 to ri. j signifies the j-th value combination of the parents \(\bfpa_i^{\mathcalB}\) of xi, where j ranges from 1 to qi. θ represents the vector comprising all θijk combinations.
BN-logistic involves treating each child node as a dependent variable and its parent nodes as independent variables, fitting logistic models for each node in the network, as delineated in Equation (4).
$$\log \left(\fracP_x_i = 11-P_x_i = 1\right)=\beta _i0+\mathop\sum \limits_m_i=1^M_i\beta _im_i\textpa_im_i,$$
(4)
where xi represents the i-th node. Mi is the count of parent nodes associated with xi, \(\boldsymbol\beta _i=\\beta _i0,\beta _i1,\ldots ,\beta _iM_i\^T\) is the coefficient of the parent nodes pai of xi to xi. \(\textpa_im_i\) denotes the m-th parent node of xi. \(P_x_i = 1\) is used to represent the probability when xi = 1.
Subsequently, we determine the conditional probability for each node by enumerating all possible parent node combinations for every node:
$$P_x_i = 1=P(x_i=1| \bfpa_i)=\frac11+\exp \left(-\left[\beta _i0+\mathop\sum \nolimits_m_i = 1^M_i\beta _im_i\textpa_im_i\right]\right).$$
(5)
When xi = 1, the probability of that class is P(xi = 1∣pai); otherwise, the probability is 1 − P(xi = 1∣pai). Consequently, regardless of the values of new data, the corresponding conditional probability tables, which are the parameters of BN, can be derived based on the effects β learned. BN-logistic effectively addresses the issue of certain variable combinations remaining unlearned due to sparse data, which consequently cannot be predicted. We evaluated BN-logistic against traditional parameter learning methods using simulation data across various levels of sparsity and real-world data, including internal and external validation sets (see Supplementary Note 6 for details).
Step 3: BN’s uncertainty inference for BOUND prediction with missing data
To ensure the BOUND model remains versatile and effective even with missing data, we utilize BN uncertainty inference capabilities. Specifically, we employ the likelihood weighting method within BN inference algorithms, which converges quickly and efficiently handles large networks24. This method generates a large number of samples from the network, where the known variables are fixed and the missing variables are sampled according to the network’s conditional probabilities, i.e., the parameter of BOUND24. Each sample is assigned a weight that reflects the likelihood of the evidence variables given the sampled values. By aggregating these weighted samples, we can approximate the posterior distribution, providing the probability of lung cancer risk. The formula for likelihood weighting is as follows23,24:
$$P(Q=1| \bfE=\bfe)\approx \frac\mathop\sum \nolimits_s = 1^SI_Q = 1(D_s)\omega (D_s)\mathop\sum \nolimits_s = 1^S\omega (D_s),\quad D_s \sim P^\prime (\bfX),$$
(6)
where, the observed variables are referred to as evidence variables, denoted as E = e; Q = 1 denotes a diagnosis of lung cancer; P(Q = 1∣E = e) represents the posterior probability of lung cancer risk, calculated using existing known variables’ data E = e; IQ=1 is the indicator function;
$$\omega (D_s)={\left.\mathop\prod \limits_Y\in \bfEP(Y| {{\bfpa}}_Y)\right| }_\bfX = D_s.$$
(7)
Thus, the likelihood weighting method essentially assigns a likelihood rate weight, denoted as ω(Ds), to each sample Ds. In this paper, we calculate the lung cancer risk for each individual by performing 19,760 samplings (i.e., S = 19, 760) based on their known variables. The sample size is the default value in the R package bnlearn: 5000*log10(nparams(fitted_model)).
Moreover, what’s even more interesting is that BOUND can leverage BN’s uncertainty reasoning to backtrack from predicted outcomes and known medical history to identify high-risk factors. This capability enables doctors to take timely preventive measures. The underlying principle is as follows:
According to the Markov property34, once the parent nodes of lung cancer are given, the lung cancer risk becomes conditionally independent of other risk factors. This allows us to directly control the risk factors associated with lung cancer, which correspond to the parent nodes of lung cancer in the network. Therefore, we use the predicted outcomes and existing medical history (i.e., the set of variables where X = 1) as evidence variables to infer high-risk factors that are directly related to lung cancer but have not yet been clearly diagnosed. The calculation formula is as follows:
$$P(\bfQ=1| \bfE=\\bfX=1,Y=y\)\approx \frac\mathop\sum \nolimits_s = 1^SI_Q = 1(D_s)\omega (D_s)\mathop\sum \nolimits_s = 1^S\omega (D_s),\quad D_s \sim P^\prime (\bfX),$$
(8)
where, Q represents a vector set corresponding to risk factors that are directly related to lung cancer in the network but have not yet been clearly diagnosed. E = X = 1, Y = y denotes the set of evidence variables, which includes the confirmed variables in the network and the value of Y (where Y = 1 indicates a high predicted risk of lung cancer, and Y = 0 indicates a low risk). P(Q = 1∣E = X = 1, Y = y) represents the probability of risk factors taking values of 1 given the evidence variables. ω(Ds) represents the likelihood weight assigned to each Ds. We define variables with an output probability greater than 0.5 as high-risk factors.
In summary, the BOUND model’s output, including lung cancer risk predictions and identified high-risk factors, serves as a reference for doctors, who can then take preventive measures based on the patient’s specific circumstances.
Validation of the BOUND model
We conducted both internal and external validations of the model, as described in the Study population section. The external validations assessed the model’s performance from both temporal and geographical perspectives.
Lung cancer risk scorecard
To enhance the convenience of lung cancer prescreening, we have developed a lung cancer risk scorecard based on the BOUND model:
$$\,\textScore\,=A-B\log \left(\fracp1-p\right),$$
(9)
where p represents the probability of having lung cancer predicted by the BOUND model; A and B, model parameters determined through training, are used to calibrate the scoring scale35. This scorecard maps the risk probability into a 0–100 point scoring system, achieving a quantitative assessment of the patient’s lung cancer risk, where a lower score indicates poorer health and a higher risk of lung cancer. Due to the BOUND’s ability to effectively predict risk in the presence of missing data, our scorecard can also be applied to risk assessments with incomplete information, significantly enhancing its practicality and flexibility. Furthermore, by identifying the optimal Area Under the ROC Curve (AUC) cut-off value, lung cancer risk is categorized into high and low levels. To thoroughly evaluate the public health value of the scorecard, we analyzed the positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, and Youden’s index at different scores, comparing these with the current higher detection rate of 2.3% in lung cancer screening6,13,14, calculating the detection rate ratio at different scores.
Model visualization and online software
We utilized Cytoscape.js for visualizing the BOUND model. We developed a model usage interface specifically designed for hospitals or governments to obtain lung cancer risk levels. This platform also allows for the identification of high-risk factors based on predicted outcomes and known medical history.
Statistical analysis
In this study, the variables used were all categorical, described by frequency (n) and proportion (%). For the baseline comparison between the training set and two external validation sets, the chi-square test was utilized for binary variables, while the Wilcoxon rank-sum test was applied for the ordered categorical variable of age.
We evaluated BN-logistic against traditional parameter learning methods (maximum likelihood and Bayesian estimation) on internal and external validation sets, using the Delong test to assess whether BN-logistic achieved a higher AUC. We also compared these methods on simulated data at various sparsity levels (0.005, 0.015, 0.025, 0.035, 0.045, 0.055). For each sparsity level, 1000 data sets were generated to evaluate these methods. After confirming that each evaluation index approximated a normal distribution, we conducted a paired t-test to assess whether BN-logistic outperformed traditional methods. Sensitivity analysis of BOUND was conducted on the external validation set in Boxing, setting a missing data rate between 10% and 70% (notably, age and gender data were complete without missing values), aiming to evaluate the model’s performance under various missing data conditions.
For comparison, we used the logistic regression model, which cannot directly handle missing data, was combined with different imputation methods to make predictions. The imputation techniques used were multiple imputation36 and random forest imputation37. As shown in Supplementary Fig. 1, the model’s AUC stabilized after incorporating the top eight features. Therefore, these eight features were selected to construct the logistic model. Like BOUND, the logistic model was trained on the Cheeloo LEAD training set and then applied to the Boxing external validation set under different missing data conditions.
This study comprehensively assessed BOUND’s performance in terms of discrimination, calibration, and clinical utility. The model’s discriminative ability was evaluated using AUC, assessing the model’s effectiveness in distinguishing between lung cancer patients and those non-lung cancer population. Calibration refers to the agreement between predicted risks and observed outcomes. Additionally, decision curve analysis was employed to evaluate the clinical utility of the model. Furthermore, the rescaled Precision-Recall (PR) curve was used to evaluate BOUND’s prediction performance for imbalanced data. The detailed process of rescaled PR curve25 can be found in Supplementary Note 1. All analyses were performed in R, version 4.3.0, and PostgreSQL, version 15.0.
Data approval
This study was approved by the Institutional Review Board of the Shandong University School of Public Health, China. No informed consent was required, because the EHR in this study were anonymized.
link