Data-driven Prediction of Early-onset Colorectal Cancer using Electronic Health Records and Machine Learning

Authors: Xu J, Mobley EM, Quillen MB, Parker M, Awad ZT, Daly MC, Fishe JN, Parker AS, George TJ, Bian J

Category: Early Detection & Risk Prediction
Conference Year: 2023

Abstract Body:
Purpose of the study: To build prediction models that use electronic health record (EHR) data and machine learning (ML) techniques to identify patients at higher risk of developing early-onset colorectal cancer (CRC), i.e., CRC diagnosed before 50 years of age. Methods: We obtained EHR data from the OneFlorida+ Clinical Research Consortium and extracted demographics, diagnoses, vitals, medications, medical procedures, and lab tests. Diagnoses were mapped to PheWAS groups, and medications were mapped to the ingredient level. We encoded all categorical features using one-hot encoding. Prediction models were built separately for two outcomes: colon cancer (CC) and rectal cancer (RC). Cases and controls were matched 1:5 using propensity score matching. We defined prediction windows at 0, 1, 3, and 5 years prior to the index date (i.e., the date of the first CRC diagnosis). We tested two common ML methods: logistic regression (LR) and gradient boosting trees (GBT). We evaluated model performance using AUC, sensitivity, and specificity, and applied the SHAP (SHapley Additive exPlanations) approach to identify the risk factors that contributed to the prediction of CRC diagnosis. Results: A total of 751 CC and 249 RC patients were included. The GBT model using patient data from the 0-year prediction window before early-onset CRC diagnosis showed the best results (CC: AUC [95% CI] = 0.792 [0.788, 0.795], sensitivity [95% CI] = 0.656 [0.642, 0.670], specificity [95% CI] = 0.804 [0.791, 0.817]; RC: AUC [95% CI] = 0.828 [0.822, 0.834], sensitivity [95% CI] = 0.700 [0.683, 0.718], specificity [95% CI] = 0.857 [0.842, 0.872]). Risk factors differed somewhat across prediction windows; the top-ranked risk factors included essential hypertension, diabetes, obesity, and renal dysfunction (e.g., acute renal failure, abnormal creatinine). Preventive care, such as routine medical exams, was negatively associated with the risk of CRC, possibly because preventive care is a surrogate marker for social determinants of health. Conclusions: ML models built on EHR data can help predict the risk of early-onset CRC. Future work should externally validate the proposed model and features to better guide clinical support of those who may be at risk of developing early-onset CRC.
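The prediction-window and one-hot encoding steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the feature codes, and the example dates are hypothetical, and it assumes that a record counts toward a window when it is dated on or before the index date minus the window length (the abstract does not state how boundary dates were handled).

```python
from datetime import date

WINDOW_YEARS = (0, 1, 3, 5)  # prediction windows named in the abstract


def cutoff_date(index_date, years):
    """Return the date `years` before index_date (Feb 29 falls back to Feb 28)."""
    try:
        return index_date.replace(year=index_date.year - years)
    except ValueError:
        return index_date.replace(year=index_date.year - years, day=28)


def codes_in_window(records, index_date, years):
    """Keep only codes recorded on or before the cutoff (assumed boundary rule)."""
    cutoff = cutoff_date(index_date, years)
    return {code for recorded_on, code in records if recorded_on <= cutoff}


def one_hot(patient_codes, vocabulary):
    """Encode the presence of each vocabulary code as 1, absence as 0."""
    return [1 if code in patient_codes else 0 for code in vocabulary]


# Hypothetical patient: (record date, mapped diagnosis group) pairs.
records = [
    (date(2015, 3, 1), "essential_hypertension"),
    (date(2018, 6, 10), "obesity"),
    (date(2019, 12, 20), "acute_renal_failure"),
]
index_date = date(2020, 1, 15)  # hypothetical first CRC diagnosis date
vocabulary = sorted({code for _, code in records})

# One feature vector per prediction window.
vectors = {
    years: one_hot(codes_in_window(records, index_date, years), vocabulary)
    for years in WINDOW_YEARS
}

for years, vector in vectors.items():
    print(f"{years}-year window: {dict(zip(vocabulary, vector))}")
```

In this sketch, longer windows see fewer records, which is one plausible reason the 0-year window would perform best: it has access to the most recent clinical history before diagnosis.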

Keywords: Colorectal cancer (CRC), early prediction, risk factors, electronic health records (EHR), machine learning (ML)