Week 2 - Gene-Disease Association Prediction

Our next meeting is with Anubhav. Please let us know whether Thursday at 6pm Pacific or Thursday at 7pm Pacific works better. We will go with whichever time slot has more votes by Wed noon Pacific:

  • Thursday 6pm Pacific
  • Thursday 7pm Pacific

@Updates-2024 @Preinternship

Before joining this meeting, make sure you have followed the discussions from Week 1.

Summary from Week 1:

  • The majority of students voted to work on a project involving Machine Learning in the domain of Bioinformatics.

  • In the Bioinformatics focus meeting led by Anya, an interesting addition was proposed: incorporating network structure as a feature in the machine learning model (credit: @moneuron).

  • The Machine Learning focus meeting led by Sam covered the initial steps in detail. All details can be found in the post.

  • GitHub setup was started by @Prasun_Sharma.

  • A more detailed summary will be shared by end of day for those who need additional guidance. For the next meeting, all members should try to complete:

    • Downloading data manually or via API
    • Creating a network
    • Visualizing the network (see entries from @Moh_Saiger and @hahaharsini)
    • Coding a basic ML model for gene-disease association (a starter sketch follows this list)
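If you are unsure where to start with these tasks, here is a minimal sketch of the network-creation and visualization steps, assuming a manually downloaded STRING edge list; the file name and column names are placeholders, so adjust them to your actual download:

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Placeholder file and column names -- adjust to your STRING/OpenTargets exports
ppi = pd.read_csv("string_ppi.csv")   # e.g. columns: protein1, protein2, combined_score

# Create a network from the edge list
G = nx.from_pandas_edgelist(ppi, source="protein1", target="protein2")

# Visualize a small subgraph (drawing the full network is too slow)
sub = G.subgraph(list(G.nodes())[:200])
nx.draw(sub, node_size=20, with_labels=False)
plt.show()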

Please try out this example if you are new to ML or need a refresher: 🔸 Code Along for F-IE-1: Text Classification using NLP

The votes are exactly 50-50!

Confirming tomorrow’s meeting at 6:30pm Pacific.

Launch Meeting - Zoom. Feel free to attend a portion of the meeting if you can’t stay for the full hour.

Notes

  • Data Acquisition and Initial Network Construction.
    All members are expected to have completed this step. If you have any difficulties, please reply to this post and we will help you catch up.

    • Data sources:
      • OpenTargets: For disease-gene associations
      • StringDB: For protein-protein interactions (PPI)
    • Initial networks should be constructed with the following elements:
      • Nodes: Representing diseases and genes/proteins
      • Edges: Representing disease-protein associations and protein-protein interactions (PPI)
  • Foundational Machine Learning
    This will be the main focus of this meeting.

    • Binary classification
      If you haven’t started this task, that is okay. But it is absolutely essential that you have some knowledge of an ML pipeline before this meeting (a minimal pipeline sketch follows these notes).
  • Advanced Feature Implementation
    Our next focus will be on incorporating more sophisticated network analysis techniques:

    • Network Cluster Information as a Feature
      • We’ll shift from individual gene/protein analysis to gene set analysis
      • We’ll utilize network cluster information of disease-associated genes as a feature
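For anyone who has not built a classifier before, the sketch below shows the basic shape of a binary-classification pipeline in scikit-learn. The features X and labels y are random placeholders here, not the real edge data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Placeholder data: replace with your edge features and 0/1 link labels
X = np.random.randn(1000, 4)
y = np.random.randint(0, 2, size=1000)

# Split, train, evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))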

@Updates-2024 @Preinternship


GitHub Information and Procedure:

  1. Commit Process: Based on the results shared on this forum, one team member will be designated to commit the main file. Other members can then pull this file into their branches, make additions, and merge their changes back into the main branch.

  2. Contribution Details: Each member is responsible for clearly detailing their contributions in the commit message when committing code. It is entirely possible that subsequent commits may alter parts of the original commit.

  3. Initial Commit: For the first piece of code, Ahmed and Aya will be invited to commit as they have successfully merged the two datasets. @Prasun_Sharma, please coordinate with @Ahmedz and @ayahashim16 to ensure their code is properly committed. Once this is done, please inform the rest of the team.

  4. Approval Process: As leads, @Prasun_Sharma and @Moh_Saiger will be responsible for approving future commits. They will use the forum or chat channel to gather the team’s votes before finalizing any changes.

  5. Next Steps: The upcoming code will focus on machine learning, specifically on binary classification (determining whether a link is present) and regression (predicting the probability of a link being present); a small illustration of the distinction follows below.
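As a rough illustration of that distinction (toy placeholder data, not the project data), the same fitted model can return hard 0/1 link labels or link probabilities:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy placeholder features and 0/1 link labels
X = np.random.randn(200, 3)
y = np.random.randint(0, 2, size=200)

model = LogisticRegression().fit(X, y)
print(model.predict(X[:5]))               # classification: is a link present?
print(model.predict_proba(X[:5])[:, 1])   # probability that a link is present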

I’ve pushed the latest code to the GitHub repo BI-ML_Disease-Prediction_2024, specifically on the dev branch. I’ve included a step-by-step walkthrough notebook to help clarify any potential confusion. Please note that the repo is private, but I hope everyone has access.

I apologize for missing the last meeting due to the time difference. Thanks to the @stemaway moderators and the team for the continuous updates and feedback.


Awesome! Thanks Ahmed!

Main points from the ML meeting: Attendees, please reply with any other salient points you captured.

  • Visualization vs ML: Use a small STRINGdb dataset for the visualization. For the ML work, the entire Homo sapiens dataset needs to be utilized.

  • ML Data Balance: The main issue with the ML data is that it will be imbalanced, with significantly more negatives than positives (pointed out by @Thuraya_Ayman). This is a real challenge, so please research ways to tackle it (one simple starting point is sketched after this list). Addressing this in a team publication could be very valuable.

  • Modeling Approach: Start with logistic regression before moving to more complex models, to reduce the risk of overfitting.

  • Threshold for Performance Metrics: The default threshold is 0.5. However, for medical data, it is often better to accept more false positives in order to avoid false negatives. Explore what this means in the context of your project; this is another potential topic for publication.

  • Multi-Classification: After each team member has selected and completed the classification for one disease, combine the data for a multi-classification problem. Anubhav will provide further guidance. This approach could yield interesting results.

  • Features for Initial Models: For the initial models, we are not using any features beyond the association scores. However, if the team is interested, network analysis could be conducted to add additional features. This would be a complex but very interesting study! Anya and Sam will provide further guidance.
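As a starting point for the imbalance and threshold points above, here is a minimal sketch on placeholder data; class_weight='balanced' and the 0.3 threshold are illustrative choices, not recommendations:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Placeholder imbalanced data: roughly 90% negatives, 10% positives
X = np.random.randn(2000, 4)
y = (np.random.rand(2000) < 0.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Lower the decision threshold to reduce false negatives (at the cost of more false positives)
probs = model.predict_proba(X_test)[:, 1]
y_pred = (probs >= 0.3).astype(int)
print(classification_report(y_test, y_pred))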

@Updates-2024 @Preinternship


Hi @Ahmedz,

The template repo is not to be modified; it is a repo to derive project repos from. Please reach out to your project lead to create a new project repo under your team on GitHub and push the code there, using the formats shared with @Prasun_Sharma and @Moh_Saiger.

There’s also a readme on our mentor chains profile page. It has resources for leads but is open to everyone, covering how to create a project repo from the template and how to structure code. Let me know if you have any questions.

Hi @Samuel_bharti, sorry for the inconvenience caused; that was on me.

No worries Ahmed! Keep pushing! Sam is there to help with questions and/or if any mistakes are made.


No worries Ahmed, here to guide you guys. I’ll fix the template.

This is a new setup for creating project repositories from a template. I’m in the process of setting the right permissions for everyone. We’ll have more documentation and resources for easy navigation on GitHub.


Dears, @Ahmedz and I are confused about building the classifier for level 1. As per our understanding, I tried to build a classifier for lung cancer using the composed graph (which combines both the disease-gene graph and the protein-protein graph). I assigned label 1 to each edge in the graph and label 0 to non-edges (randomly sampled, with a size equal to the number of edges to balance the dataset), to predict whether a link exists or not. The association score was the only feature. I tried three different classifiers and two of them scored 100% (Logistic Regression scored 80.04%, Gradient Boosting scored 100%, and Random Forest scored 100%). It seems that the models overfit the training data. Can you please advise whether my labelling strategy is correct, or whether this is happening just because we are using one feature?


As @ayahashim16 mentioned, we performed link prediction by combining two graphs: one representing disease-gene interactions and another representing protein-protein interactions. We followed the code snippet that labels existing edges as positive samples (1) and randomly samples non-edges as negative samples (0). This approach ensures that positive and negative samples are of equal length by using random.sample(non_edges, len(edges)), resulting in a balanced dataset.
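For reference, a self-contained sketch of that labelling approach (with a toy graph standing in for the composed disease-gene + PPI graph):

import random
import networkx as nx

# Toy composed graph standing in for the disease-gene + PPI graph
G = nx.Graph()
G.add_edges_from([("lung_cancer", "EGFR"), ("lung_cancer", "KRAS"),
                  ("EGFR", "KRAS"), ("KRAS", "TP53")])

# Positives: existing edges (label 1); negatives: sampled non-edges (label 0)
edges = list(G.edges())
non_edges = list(nx.non_edges(G))
negative_samples = random.sample(non_edges, min(len(edges), len(non_edges)))

samples = edges + negative_samples
labels = [1] * len(edges) + [0] * len(negative_samples)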

Experiment 1:

I used a subset of the STRING database that matches the length of the Alzheimer’s disease dataset (6000 rows). After creating and composing the graphs, I split the dataset into training, test, and validation sets. I then applied different classifiers; the metrics evaluated on the validation set are provided below.

[Screenshots: logistic regression, random forest, and gradient boosting results on the validation set]

Experiment 2:

I used the full STRING database to construct a PPI graph and combined it with the Alzheimer’s disease graph. The resulting graph was approximately 900 MB in size and contained 19,801 nodes. For visualization purposes, I extracted a 1,000-node sample and displayed it. However, for training, testing, and validation, I used the entire graph with logistic regression.

[Screenshot: logistic regression results on the full graph]

The notebook has also been pushed to GitHub on my branch, ahmed.

@ayahashim16 and @Ahmedz, we have sent your query to Anubhav and will get back to you shortly.

Hi @ayahashim16 @Ahmedz, can you try running with a bigger set of PPI edges? This is a link to the complete Homo sapiens STRING-DB PPI from a previous intern: ppi.csv - Google Drive. Add code to discard rows that may have NaN or empty values. We are also trying out the code ourselves and will update you shortly!
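A minimal cleaning sketch for that file; the column names GeneName1/GeneName2 are assumed from the snippets later in this thread, so adjust if the file differs:

import pandas as pd

ppi_df = pd.read_csv("ppi.csv")

# Treat empty strings as missing, then drop rows with missing values in the key columns
ppi_df = ppi_df.replace("", pd.NA)
ppi_df = ppi_df.dropna(subset=["GeneName1", "GeneName2"])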

Hi @ayahashim16 @Ahmedz, we believe the issue is with how the positive and negative edges are created. In the meetings, we talked about how disease-gene links from OpenTargets are positive edges, and links from the disease to other genes from STRING-DB are negative edges (i.e., genes with no known association with that disease).

However, the code snippet we shared doesn’t show this distinction; it does the sampling on the full graph.

You will need to add code to create the correct edges (a sketch follows the list below):

  1. Positive Edges:
  • Should be disease-gene associations from OpenTargets only.
  • Do not include protein-protein interactions as positive examples.
  2. Negative Edges:
  • Should be potential disease-gene pairs with no known association.
  • Use genes from the STRING database (PPI network) that are not associated with a given disease in OpenTargets.
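A sketch of that edge construction for a single disease, assuming a dataframe of OpenTargets associations and a cleaned PPI table; the file names, column names, and the balanced sampling are placeholders:

import random
import pandas as pd

disease = "lung_cancer"  # placeholder disease identifier

# Positive edges: disease-gene associations from OpenTargets only
ot_df = pd.read_csv("opentargets_lung_cancer.csv")   # placeholder file; 'gene' column assumed
positive_genes = set(ot_df["gene"])
positive_edges = [(disease, g) for g in positive_genes]

# Negative edges: STRING genes with no known OpenTargets association to this disease
ppi_df = pd.read_csv("ppi.csv")                       # columns: GeneName1, GeneName2
string_genes = set(ppi_df["GeneName1"]) | set(ppi_df["GeneName2"])
candidate_genes = list(string_genes - positive_genes)
negative_genes = random.sample(candidate_genes, min(len(positive_edges), len(candidate_genes)))
negative_edges = [(disease, g) for g in negative_genes]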

If you were already constraining edges correctly:

  • The issue might be related to the dataset size.
  • Try using the entire PPI network and check your results.

Sorry about any confusion this may have caused. Please share this information with your entire team. And let us know if you have any questions.

Dear all, thanks for your response. I have tried the models for the lung cancer disease and set the labels as follows, based on your recommendation:

  • Positive labels: disease-gene association from open target DB.
  • Negative labels: non-associated genes from the entire STRING DB.

I only used the association score as a feature, and I set it to zero when the gene has no association with the disease. Unfortunately, I faced the same issue and the model overfits the training data. I think the model learns to assign a negative label when the association score is equal to zero and a positive label when it is greater than zero. Please advise.

Hi @ayahashim16, your analysis looks correct. Can you remove the association score and use network features instead? There were some suggestions in the original code snippets, and you could also try something like this:

  • Generate node embeddings to capture graph structure (G is the composed networkx graph):
from node2vec import Node2Vec

node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200)
model = node2vec.fit(window=10, min_count=1)
  • Create edge features from node embeddings (element-wise product of the two node vectors):
def get_edge_features(edge):
    return model.wv[edge[0]] * model.wv[edge[1]]

X = [get_edge_features(edge) for edge in positive_edges + negative_edges]
y = [1] * len(positive_edges) + [0] * len(negative_edges)

Let me also ping Anya/Sam to see if we can add biological features.

One more thing: apply stronger regularization in your model to prevent overfitting. For example, in logistic regression:

from sklearn.linear_model import LogisticRegression

# A smaller C means stronger L2 regularization
model = LogisticRegression(C=0.01, penalty='l2', solver='liblinear')
model.fit(X_train, y_train)

Let me know what you find!


Firstly, a big thank you to @ayahashim16 and @Ahmedz for sharing their findings.

Key Points to Address:

1. Issue with Edge Construction:

  • One issue we’ve identified is related to how positive and negative edges were being constructed in the code snippets we previously shared. The examples showed edges across the entire network, but to correctly build these edges, we need to constrain them to relevant subsets of the network. See details in previous replies.

2. Switch from Association Scores to Network Properties:

  • Using the association score as the only feature gives the model too much information, since negative edges all have a score of zero (identified by Aya). So, it is time to jump to network-based features for building your edge representations. Here’s a code snippet that demonstrates a simpler way to generate these features:
import numpy as np

def simple_node_embedding(G, dim=64):
    # Cheap stand-in for Node2Vec: degree-based features plus random padding
    embeddings = {}
    for node in G.nodes():
        degree = G.degree(node)
        neighbor_degrees = [G.degree(n) for n in G.neighbors(node)]
        avg_neighbor_degree = np.mean(neighbor_degrees) if neighbor_degrees else 0
        embedding = np.zeros(dim)
        embedding[0] = degree                      # node degree
        embedding[1] = avg_neighbor_degree         # average neighbour degree
        embedding[2:] = np.random.randn(dim - 2)   # random padding for the remaining dimensions
        embeddings[node] = embedding / np.linalg.norm(embedding)
    return embeddings
  • While using more advanced methods like Node2Vec can yield better embeddings, it requires substantial RAM and might only be feasible on a paid tier of Colab or another cloud platform.
  • By using network properties (like node degrees or embedding vectors), we’re leveraging the structure of known interactions to predict unknown ones, which is more generalizable and biologically meaningful.

3. Play with PPI Data:

  • Experiment with the PPI (Protein-Protein Interaction) data and use as much as your computational resources can handle. The more data you can include, the richer the network will be, and the more informative your models will become. Here’s a snippet to help you manage large datasets:
import pandas as pd

# Cap the number of PPI interactions
# ppi_df: full PPI table; ot_genes: genes with OpenTargets associations
# ppi_ot: PPI rows where at least one gene is in ot_genes
MAX_PPI = 100000  # adjust to what your computational resources can handle
if len(ppi_ot) > MAX_PPI:
    ppi_filtered = ppi_ot.sample(n=MAX_PPI, random_state=42)
else:
    # Top up with PPI rows that do not touch OpenTargets genes
    additional_needed = MAX_PPI - len(ppi_ot)
    ppi_non_ot = ppi_df[~((ppi_df['GeneName1'].isin(ot_genes)) | (ppi_df['GeneName2'].isin(ot_genes)))]
    additional_sample = ppi_non_ot.sample(n=min(additional_needed, len(ppi_non_ot)), random_state=42)
    ppi_filtered = pd.concat([ppi_ot, additional_sample])

4. Model Selection and Tuning:

  • When trying out different models, pay attention to the performance metrics. If a model is overfitting or if its performance metrics are too perfect (all metrics = 1), you’ll want to make the model “weaker” by reducing its complexity. For example, in Random Forest, you can reduce the number of trees, limit the depth of each tree, or increase the minimum number of samples required to split a node. Here’s an example:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=50,        # Fewer trees
    max_depth=10,           # Limit depth
    min_samples_split=10,   # Larger splits
    min_samples_leaf=5,     # Larger leaf nodes
    random_state=42
)
rf.fit(X_train, y_train)

5. Training:

Use separate test sets and cross-validation to further help prevent overfitting (cross-validation helps assess how well your model generalizes across different subsets of the data):

from sklearn.model_selection import train_test_split, cross_val_score

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def cross_validate_and_evaluate(model, X_train_val, y_train_val, X_test, y_test, model_name, cv=5):
    # Perform cross-validation on the training/validation portion
    cv_scores = cross_val_score(model, X_train_val, y_train_val, cv=cv, scoring='f1')
    print(f"\n{model_name} Cross-Validation Results:")
    print(f"Mean F1-score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

    # Train on the full training set and evaluate on the held-out test set
    model.fit(X_train_val, y_train_val)
    y_pred = model.predict(X_test)
    return y_pred

Finally, here are some results we generated. Play around with the model parameters and the amount of data and see what you can get!

[Screenshot: model comparison results]

And this is a rough analysis:

Model Performance:

a) Logistic Regression:

  • Consistent performance between cross-validation (CV) and test set.
  • F1-score: 0.8049 (CV) vs 0.8088 (Test)
  • Good balance of precision and recall, but room for improvement

b) Random Forest:

  • High performance, but potential overfitting.
  • F1-score: 0.9752 (CV) vs 0.9668 (Test)
  • The small drop in test set performance suggests it’s generalizing well, despite high scores.

c) Gradient Boosting:

  • Best performing model with consistent results.
  • F1-score: 0.9914 (CV) vs 0.9916 (Test)
  • Excellent balance of precision and recall.

d) SVM:

  • Solid performance, consistent between CV and test set.
  • F1-score: 0.8476 (CV) vs 0.8437 (Test)
  • Good balance, but not as high as ensemble methods.

@Updates-2024 @Preinternship
