Project information

  • Category: Machine Learning / Healthcare Analytics
  • Project duration: Sep '24 - Dec '24
  • Team size: 3
  • Dataset: CDC Diabetes Health Indicators (253,680 samples)
  • Report URL: Detailed Report

Diabetes Risk Prediction

Technology(s) Used: Python, Scikit-learn, Pandas, GridSearchCV, Random Forest, SVM

  • Data Engineering: Cleaned and processed the massive CDC Diabetes Health Indicators dataset. Addressed significant class imbalance using oversampling techniques to ensure the minority (diabetes/prediabetes) class was properly represented.
  • Feature Selection: Conducted multi-collinearity analysis and used Random Forest Gini importance to distill 21 variables down to 7 key health indicators, optimizing model efficiency.
  • Model Development: Implemented and compared Logistic Regression, SVM, Random Forest, and KNN. Utilized GridSearchCV for hyperparameter tuning and cross-validation to prevent overfitting.
  • Performance: Achieved an 88.79% accuracy and 89.24% precision. The model provides a reliable framework for early screening based on behavioral and health metrics.