Week 2 - Gene-Disease Association Prediction

As @ayahashim16 mentioned, we performed link prediction on the composition of two graphs: one representing disease-gene interactions and another representing protein-protein interactions (PPI). We followed the code snippet that labels existing edges as positive samples (1) and randomly samples non-edges as negative samples (0). Calling random.sample(non_edges, len(edges)) draws exactly as many negative samples as there are positive ones, which yields a balanced dataset.
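The labelling step described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the snippet we actually followed: the function name `build_balanced_samples`, the node names, and the toy edges are all hypothetical, and enumerating every non-edge like this is only practical for small graphs.

```python
import random
from itertools import combinations


def build_balanced_samples(nodes, edges, seed=0):
    """Return (pairs, labels): every edge labelled 1 plus an equal number of sampled non-edges labelled 0."""
    # Treat the graph as undirected: store each edge as a sorted tuple.
    edge_set = {tuple(sorted(e)) for e in edges}
    # Every unordered node pair that is not an edge is a candidate negative.
    non_edges = [p for p in combinations(sorted(nodes), 2) if p not in edge_set]
    # random.sample raises ValueError if there are fewer non-edges than edges.
    rng = random.Random(seed)
    negatives = rng.sample(non_edges, len(edge_set))
    positives = sorted(edge_set)
    pairs = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return pairs, labels


# Hypothetical toy graph: one disease node and four gene nodes.
nodes = ["disease", "g1", "g2", "g3", "g4"]
edges = [("disease", "g1"), ("disease", "g2"), ("g1", "g2"), ("g3", "g4")]
pairs, labels = build_balanced_samples(nodes, edges)
```

Because the number of negatives is tied to the number of edges, the two classes always have the same size, whatever the density of the graph.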

Experiment 1:

I used a subset of the STRING database with the same number of rows as the Alzheimer’s disease dataset (6000 rows). After creating and composing the graphs, I split the dataset into training, validation, and test sets. I then applied different classifiers; the metrics evaluated on the validation set are provided below.
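For reference, a three-way split of labelled node pairs could look like the sketch below. The 70/15/15 ratios, the function name `three_way_split`, and the toy data are assumptions made for illustration; the split used in the notebook may differ, and this version is a plain shuffle rather than a stratified split.

```python
import random


def three_way_split(samples, labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices once, then cut them into train / validation / test parts."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_frac)
    n_val = int(len(idx) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]

    def pick(ids):
        # Keep each sample paired with its label.
        return [samples[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(val_idx), pick(test_idx)


# Hypothetical labelled node pairs (1 = edge, 0 = sampled non-edge).
samples = [(f"n{i}", f"n{i + 1}") for i in range(100)]
labels = [i % 2 for i in range(100)]
train, val, test = three_way_split(samples, labels)
```

Keeping the validation set separate from the test set means the classifiers can be compared on validation metrics without touching the data reserved for the final evaluation.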

[Figures: lr_results, rf_results, gb_results]

Experiment 2:

I used the full STRING database to construct a PPI graph and combined it with the Alzheimer’s disease graph. The resulting graph was approximately 900MB in size and contained 19801 nodes. For visualization purposes, I extracted a 1000-node sample and displayed it. However, for training, validation, and testing, I used the entire graph with logistic regression.
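One way to extract a small sample for plotting is to pick nodes uniformly at random and keep only the edges between them (an induced subgraph). The sketch below assumes that approach; the post does not say how the 1000-node sample was actually drawn, and the function name and toy graph are hypothetical. With networkx, the equivalent step would be G.subgraph(sampled_nodes).

```python
import random


def sample_induced_subgraph(nodes, edges, k, seed=0):
    """Pick k nodes uniformly at random and keep only edges with both endpoints among them."""
    rng = random.Random(seed)
    kept = set(rng.sample(sorted(nodes), k))
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges


# Hypothetical toy graph: a ring of 50 nodes.
nodes = list(range(50))
edges = [(i, (i + 1) % 50) for i in range(50)]
kept, sub_edges = sample_induced_subgraph(nodes, edges, k=10)
```

A uniform node sample of a sparse graph can leave many isolated nodes, so the picture shows only a fragment of the structure; the model itself is still trained on the full graph.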

[Figure: lr_results2]

The notebook has also been pushed to GitHub on my branch, ahmed.