AI-Based Heart Failure Risk Prediction

Graduate capstone project integrating polygenic risk scores, clinical phenotypes, and proteomic data to predict heart failure risk using five ML models — demonstrating that multi-omics integration significantly outperforms genetic risk scores alone.

Machine Learning · Python · Genomics · Proteomics · Bioinformatics

Overview

  • Graduate capstone project for an M.S. in Bioinformatics
  • Developed and benchmarked machine learning models for heart failure risk prediction
  • Integrated three data types from the UK Biobank: polygenic risk scores (PRS), clinical/phenotypic variables, and proteomic expression data
  • Central question: does adding phenotypic and proteomic layers significantly improve prediction accuracy beyond genetic risk scores alone?

Data Sources

  • UK Biobank genotype data (502,151 individuals) — processed with PLINK and PRSice-2
  • Phenotypic covariates extracted via Hail, selected by correlation with heart failure incidence
  • Proteomic data: ~2,924 proteins measured across 53,018 participants; 5 literature-validated proteins selected (HAVCR1, IGFBP7, LTBP2, NTproBNP, TNXB)
  • Final dataset: 45,920 individuals after filtering (45,692 controls, 228 cases)
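The three data layers above are keyed on UK Biobank participant IDs, so assembling the final dataset reduces to an inner join. A minimal sketch of that merge step (all file-free toy frames; column names like `prs_hf` and `sbp` are assumptions, not the project's actual field names):

```python
import pandas as pd

# Toy frames standing in for the three UK Biobank layers; real inputs
# would come from PRSice-2 output, Hail phenotype exports, and the
# proteomics release. Column names here are hypothetical.
prs = pd.DataFrame({"eid": [1, 2, 3], "prs_hf": [0.12, -0.40, 0.85]})
pheno = pd.DataFrame({"eid": [1, 2, 3], "age": [61, 55, 70], "sbp": [142, 128, 150]})
proteins = pd.DataFrame({"eid": [1, 3], "NTproBNP": [1.8, 3.2], "IGFBP7": [0.9, 1.4]})

# Inner join on participant ID keeps only individuals present in all
# three layers, mirroring the filtering down to 45,920 individuals
merged = prs.merge(pheno, on="eid").merge(proteins, on="eid")
print(merged.shape)  # (2, 6)
```

Using an inner join (pandas' default for `merge`) makes the "complete data only" filtering explicit rather than imputing missing layers.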

Models & Methods

  • Five algorithms trained across two parallel pipelines — one with PRS + phenotypes, one adding proteomic data
  • Ridge Regression, Logistic Regression, Random Forest, XGBoost, Neural Network (Keras/TensorFlow)
  • Class imbalance handled with SMOTE and focal loss
  • Common optimizations: cross-validation, threshold tuning, StandardScaler normalization
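The training recipe above (scaling, resampling, cross-validation) can be sketched as a scikit-learn pipeline. The project used SMOTE (from imbalanced-learn) and focal loss; this stand-in uses plain random oversampling to stay scikit-learn-only, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the merged PRS + phenotype + proteomic matrix,
# with a severe class imbalance (~99% controls)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.99], random_state=0)

# Random oversampling of the minority class (SMOTE substitute);
# done before the split here for brevity -- in practice resampling
# should happen inside each CV fold to avoid leakage
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=int((y == 0).sum()),
                      random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])

# StandardScaler + model in one pipeline, scored with 5-fold CV on F1
model = make_pipeline(StandardScaler(), RidgeClassifier())
scores = cross_val_score(model, X_bal, y_bal, cv=5, scoring="f1")
print(scores.mean())
```

Wrapping the scaler inside the pipeline ensures it is refit on each training fold, so scaling statistics never leak from the held-out fold.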

Key Results

  • All five models showed improved performance when proteomic data was included
  • F1 score used as primary metric given severe class imbalance (228 cases vs. 45,692 controls)

| Model | Test AUC | F1 Score |
|---|---|---|
| Ridge Regression | 0.949 | 0.397 |
| Neural Network | 0.960 | 0.126 |
| Logistic Regression | 0.950 | 0.180 |
| XGBoost | 0.916 | 0.258 |
| Random Forest | 0.905 | 0.174 |

  • Ridge Regression achieved the best precision-recall balance (F1 = 0.397)
  • XGBoost achieved the highest cross-validation AUC (1.000)
  • Both Ridge Regression and XGBoost showed well-generalized learning curves without overfitting
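With 228 cases against 45,692 controls, the default 0.5 probability cutoff is a poor operating point, which is why threshold tuning matters for the F1 scores above. A small sketch of that tuning step on toy predictions (the probabilities below are synthetic, not the project's model output):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels and predicted probabilities standing in for one model's
# test-set output; real values would come from model.predict_proba
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.005, size=5000)           # ~0.5% cases
y_prob = np.clip(0.3 * y_true + rng.normal(0.1, 0.1, 5000), 0, 1)

# Sweep candidate decision thresholds and keep the F1-maximizing one
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, y_prob >= t, zero_division=0) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(best, max(f1s))
```

In deployment the threshold would be chosen on a validation split, never on the test set, to keep the reported F1 honest.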

Conclusions

  • Proteomic integration consistently improved prediction across all model families — suggesting genuine biological signal rather than algorithmic artifact
  • Multi-omics approaches significantly outperform genetic-only (PRS) models for complex disease prediction
  • Ridge Regression is the most practical model for clinical deployment, given its best-in-class F1 score (0.397) and stable generalization
  • Class imbalance at this scale remains a core challenge for rare disease prediction pipelines

Tools & Technologies

  • Python, R, scikit-learn, XGBoost, Keras/TensorFlow
  • Hail, PLINK, PRSice-2, Google Cloud Storage