Please reply with your choice of task and any questions or comments.

stemaway · July 29, 2024, 4:23pm

Parallel Tasks:

Project Scoping and Disease Selection

Objective: Define project scope and select one target disease for gene-disease association prediction.
Starting point: OpenTargets and STRING database data.

Details: Bioinformatics Focus - Gene-Disease Association Prediction

Tasks:

Select and justify one target disease
Analyze data characteristics for chosen disease
Define project scope and objectives
Create project documentation

Expected outcome:

A clear, written project objective specifying the prediction task and evaluation criteria

Data Processing, Feature Extraction, and ML Pipeline

Objective: Create a flexible pipeline for data processing, feature extraction, and ML model training that can be applied to any disease.
Starting point: OpenTargets and STRING database data

Details: Machine Learning Focus - Gene-Disease Association Prediction (post #2 by stemaway)

Tasks:

Develop modular code for:
- OpenTargets data extraction and processing
- STRING database data processing
- Graph construction and feature extraction (e.g., centrality measures, clustering coefficients)
- Dataset preparation (positive and negative examples)
- Traditional ML model training and evaluation (e.g., Random Forest, Gradient Boosting, Logistic Regression)
Implement the pipeline for the initially selected diseases
Ensure the pipeline can easily accommodate new diseases
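
The graph-feature and dataset-preparation steps above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not the project's actual code: the toy edge list, the gene names, and all function names are hypothetical, and a real pipeline would more likely use a graph library on the actual STRING and OpenTargets files. It also assumes that genes without a known association can be sampled as negatives, which is a simplification worth discussing.

```python
import random
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (gene_a, gene_b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n <= 1:
        return {node: 0.0 for node in adj}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def clustering_coefficient(adj):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0
            continue
        nbr_list = sorted(nbrs)
        links = sum(
            1
            for i, u in enumerate(nbr_list)
            for v in nbr_list[i + 1:]
            if v in adj[u]
        )
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


def make_dataset(adj, positive_genes, seed=0):
    """Label known disease genes 1 and an equal-sized random sample of others 0.

    Assumption: unlabeled genes are treated as negatives.
    """
    deg = degree_centrality(adj)
    clu = clustering_coefficient(adj)
    positives = sorted(g for g in positive_genes if g in adj)
    candidates = sorted(g for g in adj if g not in positive_genes)
    rng = random.Random(seed)
    negatives = rng.sample(candidates, min(len(positives), len(candidates)))
    rows = [(g, [deg[g], clu[g]], 1) for g in positives]
    rows += [(g, [deg[g], clu[g]], 0) for g in negatives]
    return rows


# Toy interaction network (hypothetical gene names, not real STRING data).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
graph = build_graph(edges)
dataset = make_dataset(graph, positive_genes={"A", "C"})
```

Because the disease only enters through `positive_genes`, switching to a new disease means passing a different set of labeled genes while the graph and feature code stay unchanged.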

Expected outcome:

A complete, modular pipeline that can process data, extract features, and train traditional ML models for gene-disease association prediction
Initial results for the selected diseases
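
As one illustration of the traditional ML training step, here is a small logistic regression fitted by batch gradient descent. Logistic Regression is one of the models named above, but everything else is an assumption made for the sketch: the feature values are invented, the function names are hypothetical, and in practice the team would probably rely on an established ML library instead of hand-written training code.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            for j, v in enumerate(xi):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_proba(weights, bias, x):
    """Predicted probability that a gene is associated with the disease."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical per-gene features: [degree centrality, clustering coefficient].
X = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(X, y)
probs = [predict_proba(weights, bias, x) for x in X]
```

The toy data is deliberately easy to separate, so this only demonstrates the mechanics; real results need a held-out test set.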

Graph Neural Network Implementation

Objective: Develop a GNN model for gene-disease association prediction using the same data as the traditional ML approach.
Starting point: OpenTargets and STRING database data (same as traditional ML approach)

Details: Machine Learning Focus - Gene-Disease Association Prediction (post #2 by stemaway)

Tasks:

Adapt the graph representation created in the traditional ML pipeline for GNN input
Implement a basic GNN architecture (e.g., Graph Convolutional Network)
Develop node feature integration mechanism using features extracted in the traditional ML pipeline
Train the GNN using the prepared dataset
Implement evaluation metrics for the GNN model
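
To make the "basic GNN architecture" task more concrete, here is a sketch of a single graph-convolution step using the commonly cited GCN propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W). It is written in plain Python only to show the arithmetic; the toy graph, feature values, and fixed weights are hypothetical, training is omitted, and an actual implementation would normally use a deep learning framework with learnable weights.

```python
import math


def normalized_adjacency(adj_matrix):
    """Return D^-1/2 (A + I) D^-1/2 for a square 0/1 adjacency matrix."""
    n = len(adj_matrix)
    # Add self-loops so each node keeps its own features.
    a_hat = [
        [adj_matrix[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


def matmul(x, y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def gcn_layer(adj_matrix, features, weights):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    a_norm = normalized_adjacency(adj_matrix)
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Toy 3-gene path graph 0 - 1 - 2 with two input features per gene.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# 2 inputs -> 2 hidden units; fixed values for illustration, not learned.
weights = [[0.5, -1.0], [0.25, 1.0]]
embeddings = gcn_layer(adjacency, features, weights)
```

The `features` matrix is where the node features from the traditional ML pipeline (for example, centrality and clustering values) would be plugged in, one row per gene.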

Expected outcome:

A functional GNN model capable of predicting gene-disease associations
Comparative results between GNN and traditional ML approaches
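
For the comparison to be fair, both teams need to score their predictions with the same metric on the same held-out genes. The post does not fix a metric, so the sketch below assumes AUROC purely as an example; the function name, labels, and scores are hypothetical placeholders, not project results.

```python
def roc_auc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation, with tie handling."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Placeholder predictions on the same four held-out genes (made-up numbers).
labels = [0, 0, 1, 1]
ml_scores = [0.1, 0.4, 0.35, 0.8]
gnn_scores = [0.2, 0.3, 0.6, 0.9]
results = {
    "traditional_ml": roc_auc(labels, ml_scores),
    "gnn": roc_auc(labels, gnn_scores),
}
```

Whatever metric is chosen in the project scoping task, keeping it in one shared function like this avoids the two teams computing it slightly differently.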

Integration Plan:

Both traditional ML and GNN teams will work in parallel using the same initial data and graph representation
Teams will compare results and insights from both approaches

Alternate Path

One student proposed an interesting alternative: use a pretrained GNN and fine-tune it for gene-disease association prediction. Mentors raised two concerns: the compute requirements (pretrained GNNs are usually very large) and the learning value (fine-tuning may lead to treating the model as a black box). Still, this path could be a great step toward a final product. If you have feedback, please reply in this thread.

stemaway · July 29, 2024, 4:37pm

Suggestions from our end, based on meetings and pre-internship code-along tasks:

Note that this is not the full team! For some of you, we are not yet sure where you will fit best. Please reply to this post with your choices, or send a DM to stemaway.

For those who are listed, please note that these are just suggestions; feel free to reply with your own choices!