Understanding pre-diagnosis and overall experiences of young women with breast cancer: a machine learning approach using social media posts

Authors: Ulanday KT, Topaz M, Lewis S, Walker D, Terry MB, Houghton LC

Category: Behavioral Science & Health Communication
Conference Year: 2022

Abstract Body:
Background: Breast cancer is increasing in women under 55 years, and at a faster rate in women under 40 years. Because guidelines for population-based breast cancer screening start after age 40, we aimed to identify how women under 40 first detect their breast cancer and how they navigate the health system.
Methods: This study used natural language processing and machine learning to detect deductive themes related to the pre-diagnosis and overall healthcare experiences of young women with early-onset breast cancer who were members of the online forum of the "Young Survival Coalition," an international organization serving young women diagnosed with breast cancer. For the training dataset, we reviewed text from 750 of the forum's 571,602 posts published between March 2009 and December 2019. Using qualitative content analysis, posts were coded for the presence of "first signs and symptoms," "steps to diagnosis," "health-care interactions," "patient-provider feelings," and "staging type." Next, using the open-source analytics platform KNIME, we implemented three algorithms (support vector machine (SVM), random forest (RF), and decision tree (DT)) to build classification models. For each model, we calculated accuracy statistics, summarized as the F-measure (the harmonic mean of the positive predictive value and sensitivity of the classification model). Finally, for each of the three classification models, we averaged the F-measure across the five codes and identified the model with the highest average.
Results: In the training dataset, about 16% of posts were coded for the presence of "first signs and symptoms," 25% for "steps to diagnosis," 39% for "health-care interactions," 17% for "patient-provider feelings," and 48% for "staging type." The average F-measure across codes was 79%, 77%, and 72% for the SVM, DT, and RF models, respectively.
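The F-measure used to compare the models can be illustrated with a minimal Python sketch. This is not the study's KNIME workflow: the function names and the confusion-matrix counts below are hypothetical, chosen only to show how a per-code F-measure and its average across codes would be computed.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of positive predictive value and sensitivity.

    tp, fp, fn: true positive, false positive, and false negative counts
    for one code (for example, "steps to diagnosis").
    """
    if tp == 0:
        # With no true positives, both PPV and sensitivity are zero
        # (or undefined), so the F-measure is reported as 0.
        return 0.0
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    sensitivity = tp / (tp + fn)   # sensitivity (recall)
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def mean_f_measure(counts_by_code):
    """Unweighted average of the per-code F-measures for one model."""
    scores = [f_measure(tp, fp, fn) for tp, fp, fn in counts_by_code.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts for one model; not data from the study.
example_counts = {
    "first signs and symptoms": (8, 2, 2),
    "steps to diagnosis": (15, 5, 3),
    "health-care interactions": (30, 6, 8),
    "patient-provider feelings": (9, 4, 3),
    "staging type": (40, 5, 5),
}

if __name__ == "__main__":
    for code, (tp, fp, fn) in example_counts.items():
        print(f"{code}: {f_measure(tp, fp, fn):.2f}")
    print(f"average: {mean_f_measure(example_counts):.2f}")
```

Under these assumptions, the same averaging would be repeated for each of the three models, and the model with the highest average would be selected.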
Conclusions: The SVM classification model best fit the training dataset. Next steps include applying the classification model to the larger dataset and further evaluating each code. Examining the pre-diagnosis experiences of early-onset breast cancer patients may offer initial data to guide further research and inform clinical practice, including the screening of young women.

Keywords: machine learning, breast cancer, early-onset, cancer screening