Machine Learning–Based Diabetes Risk Prediction Using Questionnaire Data

This project develops a machine learning–based diabetes risk prediction framework using non-invasive, questionnaire-derived health indicators as an alternative to traditional blood-based diagnostics. The study uses data from the 2015 Behavioral Risk Factor Surveillance System (BRFSS), released by the U.S. Centers for Disease Control and Prevention, comprising 253,680 observations that cover lifestyle behaviors, health conditions, and basic demographic features. The primary objective is to assess whether low-cost, highly accessible survey data can support early diabetes risk screening and provide practical value for public health decision-making.

From a technical perspective, the project implements the entire modeling pipeline from scratch: data preprocessing, model training, threshold selection, and evaluation. The models explored include Ordinary Least Squares, Naive Bayes, and Logistic Regression trained via gradient descent, along with an alternative pipeline that uses random undersampling to analyze the effect of class imbalance. The workflow applies stratified data splitting, selective feature standardization, and strict separation of training, validation, and test sets to prevent data leakage. Model performance is evaluated using precision, recall, and the $F_\beta$-score, with the decision threshold calibrated on the validation set so that recall is prioritized.
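The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the 70/15/15 split, the learning rate, the threshold grid, and the synthetic data are all hypothetical stand-ins chosen for the example.

```python
import numpy as np

def stratified_split(y, fractions=(0.7, 0.15), seed=0):
    """Index arrays for train/val/test that preserve the class ratio of y."""
    rng = np.random.default_rng(seed)
    parts = ([], [], [])
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_tr = int(fractions[0] * len(idx))
        n_va = int(fractions[1] * len(idx))
        parts[0].append(idx[:n_tr])
        parts[1].append(idx[n_tr:n_tr + n_va])
        parts[2].append(idx[n_tr + n_va:])  # remainder goes to the test set
    return [np.concatenate(p) for p in parts]

def standardize(train, *others, cols=None):
    """Scale selected columns using training-set statistics only (no leakage)."""
    cols = np.arange(train.shape[1]) if cols is None else np.asarray(cols)
    mean, std = train[:, cols].mean(axis=0), train[:, cols].std(axis=0)
    std = np.where(std == 0, 1.0, std)
    scaled = []
    for a in (train, *others):
        a = a.astype(float).copy()
        a[:, cols] = (a[:, cols] - mean) / std
        scaled.append(a)
    return scaled

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def fbeta(y_true, y_pred, beta=2.0):
    """F-beta from confusion counts; returns 0.0 when nothing is positive."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom > 0 else 0.0

def select_threshold(y_val, p_val, beta=2.0):
    """Pick the probability cutoff that maximizes F-beta on the validation set."""
    grid = np.linspace(0.05, 0.95, 91)
    scores = [fbeta(y_val, (p_val >= t).astype(int), beta) for t in grid]
    return float(grid[int(np.argmax(scores))])

# Synthetic, imbalanced stand-in data (NOT the BRFSS dataset):
# two continuous features and one binary indicator.
rng = np.random.default_rng(42)
n = 5000
X = np.column_stack([rng.normal(size=(n, 2)), rng.random(n) < 0.4]).astype(float)
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 2] - 2.3
y = (rng.random(n) < sigmoid(logits)).astype(int)

train, val, test = stratified_split(y)
# Standardize only the continuous columns; leave the binary indicator as-is.
X_tr, X_va, X_te = standardize(X[train], X[val], X[test], cols=[0, 1])
w, b = fit_logistic_gd(X_tr, y[train])

# The threshold is tuned on validation data; the test set is touched once.
threshold = select_threshold(y[val], sigmoid(X_va @ w + b))
test_pred = (sigmoid(X_te @ w + b) >= threshold).astype(int)
test_f2 = fbeta(y[test], test_pred)
print(f"threshold={threshold:.2f}, test F2={test_f2:.3f}")
```

The ordering is the point of the sketch: scaling statistics come from the training split alone, the cutoff is chosen on the validation split, and the test split is used only for the final score.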

The $F_2$-score is adopted as the primary evaluation criterion because it weights recall more heavily than precision. This reflects the public health objective of minimizing false negatives in early screening, where missing a high-risk individual may delay intervention and raise downstream healthcare costs. Empirical results show that the best-performing models achieve recall above 0.84 with $F_2$-scores around 0.59 on the held-out test set. Ordinary Least Squares and Logistic Regression offer the most balanced performance under this recall-oriented framework, while Naive Bayes further increases sensitivity at the cost of additional false positives. Overall, the project demonstrates that effective diabetes risk prediction is feasible using only questionnaire data, highlighting the potential of interpretable, low-cost machine learning models for preventive healthcare and policy-level risk assessment.
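For reference, the standard definition of the $F_\beta$-score in terms of precision $P$ and recall $R$ (symbols introduced here for illustration) is:

$$
F_\beta = (1 + \beta^2)\,\frac{P\,R}{\beta^2 P + R},
\qquad
F_2 = \frac{5\,P\,R}{4\,P + R}.
$$

As a rough worked example, take the round figures $R = 0.84$ and $F_2 = 0.59$ as illustrative inputs rather than exact reported values. Solving $0.59\,(4P + 0.84) = 5 \cdot 0.84\,P$ gives $1.84\,P = 0.4956$, so $P \approx 0.27$. Under these assumed inputs, roughly one in four flagged individuals would be a true positive, which is the kind of precision cost a recall-oriented threshold accepts in exchange for fewer missed cases.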