Lessons Learned from Leveraging Cancer Epidemiology Cohort Data for AI/ML Applications

Authors: Lacey JV, Spielfogel ES, Savage KS, Anderson CA, Benbow JE, Clague-DeHart J, Duffy CN, Park HL, Thompson C, Wang SS, Martinez EM, Chandra S

Category: Early Detection & Risk Prediction
Conference Year: 2023

Abstract Body:
Background: Many cancer epidemiology cohorts (CECs) have uniquely valuable real-world data on lifestyle, environment, and cancer risks and outcomes. Cohorts often use multivariate regression to evaluate associations between exposures and outcomes in hypothesis-driven research. Many cohorts also include large-scale data that could be used in artificial intelligence (AI) or machine learning (ML) projects. Subtle differences between data strategies for AI/ML and those for cohort research could influence the outcomes of cohort-based AI/ML projects. We recently conducted two collaborative projects to evaluate how ready our cohort's data were for AI/ML applications. Purpose: Our goals were to 1) use data from the California Teachers Study (CTS), a prospective CEC, to assess readiness for AI/ML modeling; and 2) identify and evaluate aspects of our CTS data strategy that should be improved to better facilitate AI/ML applications. Methods: Since 1995, the CTS has collected survey data and linked hospitalization, cancer, and mortality data on N=133,477 adult female volunteers. Approximately 33,000 participants have died during follow-up, and 45% of those deaths occurred among participants who had been discharged from the hospital less than 30 days before their date of death. Using CTS data, we trained and tested predictive models to assess factors associated with deaths occurring less than 30 days after hospital discharge. Results: Three key "data readiness

Keywords: cohort studies; artificial intelligence