Parallel Tasks:
Project Scoping and Disease Selection
Objective: Define project scope and select one target disease for gene-disease association prediction.
Starting point: OpenTargets and STRING database data.
Details: Bioinformatics Focus - Gene-Disease Association Prediction
Tasks:
- Select and justify one target disease
- Analyze data characteristics for chosen disease
- Define project scope and objectives
- Create project documentation
Expected outcome:
- A clear, written project objective specifying the prediction task and evaluation criteria
Data Processing, Feature Extraction, and ML Pipeline
Objective: Create a flexible pipeline for data processing, feature extraction, and ML model training that can be applied to any disease.
Starting point: OpenTargets and STRING database data
Details: Machine Learning Focus - Gene-Disease Association Prediction (post #2 by stemaway)
Tasks:
- Develop modular code for:
  - OpenTargets data extraction and processing
  - STRING database data processing
  - Graph construction and feature extraction (e.g., centrality measures, clustering coefficients)
  - Dataset preparation (positive and negative examples)
  - Traditional ML model training and evaluation (e.g., Random Forest, Gradient Boosting, Logistic Regression)
- Implement the pipeline for the initially selected diseases
- Ensure the pipeline can easily accommodate new diseases
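To make the graph-feature and dataset-preparation steps above more concrete, here is a minimal sketch using only the Python standard library. It assumes the STRING interactions have already been reduced to a list of (gene, gene) pairs and that the OpenTargets associations for the chosen disease are available as a set of gene IDs. The gene names (`G1` to `G5`), the function names, and the choice to treat unlabeled genes as negatives are all illustrative assumptions, not something fixed by the project. A graph library such as NetworkX provides equivalent measures (`degree_centrality`, `clustering`) and would likely be the more practical choice on the full network.

```python
import random
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {gene: set(neighbors)} from gene pairs."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def extract_features(adj):
    """Return {gene: [degree_centrality, clustering_coefficient]}."""
    n = len(adj)
    features = {}
    for gene, nbrs in adj.items():
        k = len(nbrs)
        # Degree centrality: degree divided by the maximum possible degree.
        degree_c = k / (n - 1) if n > 1 else 0.0
        # Clustering coefficient: fraction of neighbor pairs that are connected.
        if k < 2:
            clustering = 0.0
        else:
            links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
            clustering = 2 * links / (k * (k - 1))
        features[gene] = [degree_c, clustering]
    return features


def prepare_dataset(features, positive_genes, negatives_per_positive=1, seed=0):
    """Label known disease genes 1 and a random sample of the other genes 0.

    Assumption: genes without a known association are treated as negatives,
    although some of them may simply be undiscovered positives.
    """
    rng = random.Random(seed)
    positives = sorted(g for g in positive_genes if g in features)
    candidates = sorted(g for g in features if g not in positive_genes)
    n_neg = min(len(candidates), negatives_per_positive * len(positives))
    negatives = rng.sample(candidates, n_neg)
    rows = [(g, features[g], 1) for g in positives]
    rows += [(g, features[g], 0) for g in negatives]
    return rows


# Toy data: hypothetical gene IDs, not real STRING or OpenTargets records.
edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4"), ("G4", "G5")]
disease_genes = {"G1", "G3"}

adj = build_adjacency(edges)
features = extract_features(adj)
rows = prepare_dataset(features, disease_genes)
for gene, feats, label in rows:
    print(gene, feats, label)
```

Because nothing in these functions depends on a particular disease, switching to a new disease only means passing a different set of positive genes.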
Expected outcome:
- A complete, modular pipeline that can process data, extract features, and train traditional ML models for gene-disease association prediction
- Initial results for the selected diseases
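For the training and evaluation step, one possible starting point is sketched below with scikit-learn, using the three model families named in the task list. The feature matrix here is synthetic placeholder data generated only so the example runs; in the real pipeline `X` and `y` would come from the dataset preparation step. The function name `evaluate_models`, the 70/30 split, and the choice of ROC AUC and average precision as metrics are assumptions, and the team may settle on different evaluation criteria during scoping.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_models(X, y, seed=0):
    """Train three traditional models on one split and score them on held-out genes."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Probability of the positive class (gene associated with the disease).
        scores = model.predict_proba(X_test)[:, 1]
        results[name] = {
            "roc_auc": roc_auc_score(y_test, scores),
            "average_precision": average_precision_score(y_test, scores),
        }
    return results


# Synthetic placeholder data (NOT real gene features): 200 genes, 4 features,
# with positive examples shifted so the classes are separable.
rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
X = rng.normal(size=(200, 4)) + y[:, None] * 1.0

results = evaluate_models(X, y)
for name, metrics in results.items():
    print(name, metrics)
```

Keeping the split and the metrics inside one function makes it easier to score the GNN on the same held-out genes later, so the comparison between approaches stays fair.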
Graph Neural Network Implementation
Objective: Develop a GNN model for gene-disease association prediction using the same data as the traditional ML approach.
Starting point: OpenTargets and STRING database data (same as traditional ML approach)
Details: Machine Learning Focus - Gene-Disease Association Prediction (post #2 by stemaway)
Tasks:
- Adapt the graph representation created in the traditional ML pipeline for GNN input
- Implement a basic GNN architecture (e.g., Graph Convolutional Network)
- Develop node feature integration mechanism using features extracted in the traditional ML pipeline
- Train the GNN using the prepared dataset
- Implement evaluation metrics for the GNN model
Expected outcome:
- A functional GNN model capable of predicting gene-disease associations
- Comparative results between GNN and traditional ML approaches
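As a rough illustration of what a basic Graph Convolutional Network does with the shared graph and node features, the sketch below implements the forward pass of a two-layer GCN in plain NumPy, using the standard propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W). The adjacency matrix, feature values, layer sizes, and weights are toy assumptions, and the weights are random and untrained. An actual implementation would more likely use a framework with automatic differentiation (for example PyTorch Geometric's `GCNConv`) so the weights can be learned.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^-1/2 (A + I) D^-1/2, the symmetric normalization used by GCNs."""
    A_hat = A + np.eye(A.shape[0])  # add self-loops so each gene keeps its own features
    deg = A_hat.sum(axis=1)         # always >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(A_norm, H, W, activation=True):
    """One graph convolution: aggregate neighbor features, then apply a linear map."""
    Z = A_norm @ H @ W
    return np.maximum(Z, 0.0) if activation else Z


def gcn_forward(A, X, W1, W2):
    """Two-layer GCN returning one association probability per gene."""
    A_norm = normalize_adjacency(A)
    H = gcn_layer(A_norm, X, W1)
    logits = gcn_layer(A_norm, H, W2, activation=False)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid


# Toy graph with 5 genes (hypothetical), edges: 0-1, 0-2, 1-2, 2-3, 3-4.
A = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # node features, e.g. centrality and clustering
W1 = rng.normal(size=(2, 4)) * 0.5   # untrained weights, for illustration only
W2 = rng.normal(size=(4, 1)) * 0.5

A_norm = normalize_adjacency(A)
probs = gcn_forward(A, X, W1, W2)
print(probs.ravel())
```

The node feature matrix `X` is where the features from the traditional ML pipeline plug in: one row per gene, in the same order as the rows of the adjacency matrix.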
Integration Plan:
- Both traditional ML and GNN teams will work in parallel using the same initial data and graph representation
- Teams will compare results and insights from both approaches
Alternate Path
A student proposed an interesting idea: use a pretrained GNN and fine-tune it for gene-disease association prediction. Mentors raised two concerns: compute requirements (pretrained GNNs are usually very large) and learning value (fine-tuning may lead to treating the model as a black box). Even so, this path could be a great step toward a final product. If you have feedback, please reply in this thread.