As @ayahashim16 mentioned, we performed link prediction by combining two graphs: one representing disease-gene interactions and another representing protein-protein interactions (PPI). We followed the code snippet that labels existing edges as positive samples (1) and randomly samples non-edges as negative samples (0). Calling random.sample(non_edges, len(edges)) draws exactly as many negative samples as there are positive ones, resulting in a balanced dataset.
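For readers who have not seen the snippet, here is a minimal standard-library sketch of that labeling step. It is not the notebook's code: the function name `build_balanced_samples`, the fixed seed, and the toy node names are all made up for illustration, and the notebook most likely works on a graph object rather than a plain edge list.

```python
import random
from itertools import combinations

def build_balanced_samples(edges, seed=0):
    """Label existing edges 1 and an equal number of sampled non-edges 0.

    Hypothetical helper for illustration; treats the graph as undirected.
    """
    # Normalise each edge so (u, v) and (v, u) are counted once.
    edge_set = {tuple(sorted(e)) for e in edges}
    nodes = sorted({n for e in edge_set for n in e})
    # Every node pair that is not an existing edge is a candidate negative.
    non_edges = [p for p in combinations(nodes, 2) if p not in edge_set]
    # Same call shape as in the text: as many negatives as positives.
    # Raises ValueError if the graph has fewer non-edges than edges.
    rng = random.Random(seed)
    negatives = rng.sample(non_edges, len(edge_set))
    pairs = sorted(edge_set) + negatives
    labels = [1] * len(edge_set) + [0] * len(negatives)
    return pairs, labels

# Toy placeholder graph, not real interaction data.
toy_edges = [("AD", "g1"), ("AD", "g2"), ("g1", "g3"), ("g2", "g4")]
pairs, labels = build_balanced_samples(toy_edges)
```

Enumerating all node pairs is quadratic in the number of nodes, so this form is only practical for small graphs. On a graph the size of full STRING, rejection sampling of random pairs would be the more realistic choice.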
Experiment 1:
I used a subset of the STRING database that matches the size of the Alzheimer’s disease dataset (6000 rows). After creating and composing the graphs, I split the dataset into training, validation, and test sets. I then applied different classifiers; the metrics evaluated on the validation set are provided below.
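A three-way split of the labeled pairs could look roughly like the sketch below. The 70/15/15 ratio, the seed, and the helper name `split_three_way` are assumptions for illustration only; the actual proportions used in the notebook are not stated here.

```python
import random

def split_three_way(samples, labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test partitions.

    Hypothetical helper; the fractions are assumed, not taken from the notebook.
    """
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(idx) * test_frac)
    n_val = round(len(idx) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]

    def pick(ids):
        # Keep samples and labels aligned by selecting with the same indices.
        return [samples[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(val_idx), pick(test_idx)

# Toy placeholder pairs with alternating labels.
samples = [(f"u{i}", f"v{i}") for i in range(20)]
labels = [i % 2 for i in range(20)]
(train_x, train_y), (val_x, val_y), (test_x, test_y) = split_three_way(samples, labels)
```

Keeping the test partition untouched until the end and comparing classifiers on the validation partition only, as described above, avoids tuning against the test data.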
Experiment 2:
I used the full STRING database to construct a PPI graph and combined it with the Alzheimer’s disease graph. The resulting graph was approximately 900 MB in size and contained 19,801 nodes. For visualization purposes, I extracted a 1,000-node sample and displayed it. For training, validation, and testing, however, I used the entire graph with logistic regression.
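The compose-then-sample step can be sketched with plain edge sets, as below. This assumes undirected graphs whose shared node names should merge. The helper names and toy edges are hypothetical, and the notebook itself presumably uses a graph library rather than raw sets.

```python
import random

def compose_edges(*edge_lists):
    """Union of undirected edge lists; identical node names merge the graphs."""
    return {tuple(sorted(e)) for edges in edge_lists for e in edges}

def sample_induced_subgraph(edges, k, seed=0):
    """Pick up to k random nodes and keep only edges with both ends among them."""
    nodes = sorted({n for e in edges for n in e})
    keep = set(random.Random(seed).sample(nodes, min(k, len(nodes))))
    sub_edges = {e for e in edges if e[0] in keep and e[1] in keep}
    return keep, sub_edges

# Toy placeholder data, not real interactions.
disease_gene = [("AD", "g1"), ("AD", "g2")]
ppi = [("g1", "g3"), ("g2", "g3"), ("g3", "g4")]
combined = compose_edges(disease_gene, ppi)
# Stands in for the 1,000-node sample used for display.
sub_nodes, sub_edges = sample_induced_subgraph(combined, k=3)
```

A uniformly sampled node set from a sparse graph often induces very few edges, so the displayed sample can look far less connected than the full graph.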
The notebook has also been pushed to GitHub on my branch, ahmed.