Week 1 - Machine Learning Focus - Gene-Disease Association Prediction

The following suggestions could improve the robustness, performance, and interpretability of the gene-disease association prediction model:

  1. Data quality and preprocessing:

    • Add a step for handling missing values and outliers in the OpenTargets and STRING data.
    • Normalize or scale features before model training, fitting the scaler on the training split only to avoid leaking test-set statistics.
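
As a rough illustration of the preprocessing step, the sketch below imputes missing values with the training median and then standardizes with training statistics. It uses only the Python standard library; the function names, the use of `None` for missing values, and the sample scores are assumptions made for this example, not part of the existing pipeline. Outlier handling (for instance clipping) is not shown.

```python
from statistics import mean, median, pstdev

def fit_preprocessor(train_column):
    """Learn imputation and scaling parameters from the training split only."""
    observed = [x for x in train_column if x is not None]
    fill = median(observed)
    filled = [fill if x is None else x for x in train_column]
    mu = mean(filled)
    sigma = pstdev(filled) or 1.0  # guard against constant features
    return {"fill": fill, "mean": mu, "std": sigma}

def transform(column, params):
    """Impute missing values, then standardize with the training statistics."""
    filled = [params["fill"] if x is None else x for x in column]
    return [(x - params["mean"]) / params["std"] for x in filled]

# Hypothetical association scores; None marks a missing value.
train_scores = [0.2, None, 0.4, 0.6]
test_scores = [None, 0.8]

params = fit_preprocessor(train_scores)
train_scaled = transform(train_scores, params)
test_scaled = transform(test_scores, params)  # reuses training parameters
```

Reusing `params` for the test split keeps the evaluation honest: the test data never influences the imputation value or the scaling.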
  2. Feature engineering:

    • Explore more complex network features like betweenness centrality or eigenvector centrality.
    • Consider creating interaction terms between features.
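
To make the network-feature idea concrete, here is a minimal sketch of betweenness centrality on an unweighted, undirected graph, using Brandes' algorithm and only the standard library. The adjacency-dict format and the toy gene names are assumptions for illustration; in practice a graph library such as NetworkX, which provides `betweenness_centrality` and `eigenvector_centrality`, would likely be the more convenient choice for a STRING-scale network.

```python
from collections import deque

def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph.

    `adj` maps every node to a list of its neighbors (symmetric).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Toy interaction graph (hypothetical genes): a hub with three partners.
toy_graph = {
    "hub": ["g1", "g2", "g3"],
    "g1": ["hub"],
    "g2": ["hub"],
    "g3": ["hub"],
}
centrality = betweenness_centrality(toy_graph)
```

In the toy graph, every shortest path between two partner genes passes through the hub, so the hub receives all of the centrality and the leaves receive none.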
  3. Model selection and evaluation:

    • Include cross-validation for more robust performance estimation.
    • Report metrics that stay informative on imbalanced datasets, such as the Matthews Correlation Coefficient (MCC), alongside the existing ones.
    • Consider using SHAP (SHapley Additive exPlanations) values for more interpretable feature importance.
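
The sketch below shows why MCC is useful under class imbalance: a classifier that always predicts "no association" can score high accuracy while MCC stays at zero. The helper names and the label vectors are made up for this example; scikit-learn's `sklearn.metrics.matthews_corrcoef` offers the same metric if that library is already in use.

```python
from math import sqrt

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc_score(y_true, y_pred):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical labels: 2 true associations among 20 gene-disease pairs.
y_true = [1, 1] + [0] * 18
always_negative = [0] * 20

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
mcc = mcc_score(y_true, always_negative)
```

Here `accuracy` is 0.9 even though the classifier finds no true association, while `mcc` is 0.0.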
  4. GNN approach:

    • Experiment with other GNN architectures such as GraphSAGE and GAT, or more recent ones such as graph transformers.
    • Implement early stopping to prevent overfitting.
    • Use k-fold cross-validation for more reliable GNN performance estimation.
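
Early stopping can be sketched independently of any GNN framework. The class below is a minimal, assumed design (the name `EarlyStopping`, the `patience` and `min_delta` parameters, and the loss values are all illustrative): it tracks the best validation loss and signals a stop once the loss has failed to improve for `patience` consecutive epochs.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.64]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    # In a real loop, one training epoch and a validation pass happen here.
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

In a real training loop the model weights would typically be checkpointed whenever `best_epoch` is updated, so that the best model can be restored after stopping.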
  5. Biological interpretation:

    • Run gene set enrichment analysis (GSEA) on the ranked predictions, or over-representation analysis on the top predictions, to identify enriched pathways or functions.
    • Validate top predictions against recent literature or experimental data.
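
As a simplified stand-in for the enrichment step, the sketch below computes a one-sided hypergeometric p-value for the overlap between the top predicted genes and one gene set. Note that this is over-representation analysis, not rank-based GSEA proper. The function name and all counts are hypothetical; with many gene sets, the resulting p-values would also need multiple-testing correction.

```python
from math import comb

def enrichment_p_value(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_background: genes in the background universe
    n_in_set:     background genes annotated to the gene set
    n_selected:   top predicted genes being tested
    n_overlap:    selected genes that fall in the gene set
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 100 background genes, a 10-gene pathway,
# 10 top predictions, 4 of which belong to the pathway.
p_value = enrichment_p_value(100, 10, 10, 4)
```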
  6. NLP integration:

    • Consider using biomedical-specific language models like BioBERT or PubMedBERT for feature extraction.
    • Implement named entity recognition to extract specific biological entities from text.
  7. Ensemble methods:

    • Implement stacking or blending of different models (traditional ML, GNN, and NLP-based) for potentially improved performance.
  8. Explainability:

    • Implement techniques like LIME or SHAP for explaining individual predictions, which is crucial in biomedical applications.
  9. External validation:

    • Include a step to validate the model on an independent external dataset to assess generalizability.
  10. Time-based splitting:

    • If the data has a temporal component, use time-based splitting instead of random splitting, so that the model is evaluated on associations reported after those it was trained on, as it would be in real-world use.
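
A time-based split can be as simple as the sketch below, which assumes each association record carries a `year` field (for example, the year the evidence was first reported). The record layout, the gene and disease names, and the cutoff are hypothetical.

```python
def temporal_split(records, cutoff_year):
    """Train on records up to and including cutoff_year, test on later ones."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test

# Hypothetical gene-disease association records.
records = [
    {"gene": "geneA", "disease": "diseaseX", "year": 2015},
    {"gene": "geneB", "disease": "diseaseX", "year": 2017},
    {"gene": "geneC", "disease": "diseaseY", "year": 2019},
    {"gene": "geneD", "disease": "diseaseY", "year": 2021},
    {"gene": "geneE", "disease": "diseaseZ", "year": 2022},
]

train_set, test_set = temporal_split(records, cutoff_year=2019)
```

Any features derived from the network or the literature would also need to be restricted to information available before the cutoff. Otherwise the split still leaks future knowledge into training.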