Meetings and Pre-meeting tasks for this week (08/19)

Hi all,

We hope you’ve had enough time to experiment with different models! This week, let’s focus on a more structured deep dive. We are sharing sample code, not just snippets. Please remember that simply copying the code won’t lead to much learning; it’s crucial that you put in your best effort to understand the bigger picture, both in terms of the domain and the machine learning concepts involved.

Task 1: Data Preparation and Graph Creation

This code prepares the dataset and creates a graph for disease-gene association prediction. Focus on adjusting these key parameters:

  • Global Score Cutoff:

    • Purpose: Controls number of positive edges
    • Adjust: Increase to reduce positive edges if too many
  • Max Number of PPI Interactions:

    • Purpose: Limits protein-protein interaction edges
    • Adjust: Reduce if facing resource constraints
  • Negative-to-Positive Sample Ratio:

    • Purpose: Determines balance between negative and positive edges
    • Adjust: Increase for larger overall dataset

Overall Objective: Tune parameters so logistic regression shows low-medium performance, while other models (Random Forest, Gradient Boosting, SVM) perform better. Avoid near-perfect performance, as it may indicate oversimplification or overfitting.
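For reference, the three tunables above appear inline in the code below (the 0.1 cutoff, max_ppi_interactions, and desired_ratio). One way to keep them easy to adjust is to hoist them into named constants; this is only a minimal sketch, and the constant names are illustrative rather than part of the shared code:

# The three key tunables for Task 1; the values shown match the defaults in the shared code.
GLOBAL_SCORE_CUTOFF = 0.1         # raise to keep fewer, stronger positive (disease-gene) edges
MAX_PPI_INTERACTIONS = 5_000_000  # lower if you hit memory/runtime constraints
NEG_TO_POS_RATIO = 10             # raise for a larger (more imbalanced) dataset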

Even if you can’t attend the meeting, please make sure to submit your results for this piece of code (Task 1), and stick to the disease you’ve chosen.


import pandas as pd
import networkx as nx
import random
from tqdm import tqdm

# Load data
print("Loading data...")
ot_df = pd.read_csv('/content/drive/MyDrive/ot.tsv', sep='\t')
ppi_df = pd.read_csv('/content/drive/MyDrive/ppi.csv')

# Filter PPI data
print("Filtering PPI data...")
ppi_df = ppi_df.dropna()
ppi_df = ppi_df[(ppi_df['GeneName1'] != '') & (ppi_df['GeneName2'] != '')]
print(f"Filtered PPI data shape: {ppi_df.shape}")

# Sample PPI data if necessary (max number of PPI interactions; reduce under resource constraints)
max_ppi_interactions = 5000000
if len(ppi_df) > max_ppi_interactions:
    ppi_filtered = ppi_df.sample(n=max_ppi_interactions, random_state=42)
else:
    ppi_filtered = ppi_df
print(f"Final PPI data shape: {ppi_filtered.shape}")

# Create graph
print("Creating graph...")
G = nx.Graph()

# Add OpenTargets disease-gene edges (positive examples)
diseases = ['lung cancer']  # replace with your chosen disease
positive_edges = []
for disease in diseases:
    for _, row in ot_df.iterrows():
        if row['globalScore'] > 0.1:  # global score cutoff: raise to reduce positive edges
            G.add_edge(disease, row['symbol'], weight=row['globalScore'], type='disease-gene')
            positive_edges.append((disease, row['symbol']))

print(f"Number of positive edges: {len(positive_edges)}")

# Add PPI edges
ppi_edges_added = 0
for _, row in tqdm(ppi_filtered.iterrows(), total=len(ppi_filtered), desc="Adding PPI edges"):
    if row['GeneName1'] != row['GeneName2']:  # Avoid self-loops
        G.add_edge(row['GeneName1'], row['GeneName2'], type='ppi')
        ppi_edges_added += 1

print(f"Number of PPI edges added: {ppi_edges_added}")

# Create negative examples
print("Creating negative examples...")
all_genes = set(ppi_filtered['GeneName1']).union(set(ppi_filtered['GeneName2']))
print(f"Total unique genes in PPI network: {len(all_genes)}")

negative_edges = []
for disease in diseases:
    associated_genes = set(G.neighbors(disease))
    print(f"Genes associated with {disease}: {len(associated_genes)}")
    non_associated_genes = all_genes - associated_genes
    print(f"Genes not associated with {disease}: {len(non_associated_genes)}")
    negative_edges.extend([(disease, gene) for gene in non_associated_genes])

print(f"Number of potential negative edges: {len(negative_edges)}")

# Maintain class imbalance (negative-to-positive sample ratio)
desired_ratio = 10  # 10 times more negative than positive edges
random.seed(42)  # make the negative sampling reproducible
num_negative_samples = min(len(negative_edges), desired_ratio * len(positive_edges))
negative_edges = random.sample(negative_edges, num_negative_samples)

print(f"Final number of positive edges: {len(positive_edges)}")
print(f"Final number of negative edges: {len(negative_edges)}")

# Additional diagnostics
print("\nAdditional Diagnostics:")
print(f"Total nodes in graph: {G.number_of_nodes()}")
print(f"Total edges in graph: {G.number_of_edges()}")
print(f"Nodes with 'lung cancer' as neighbor: {len(list(G.neighbors('lung cancer')))}")
ot_genes = set(ot_df['symbol'])
ppi_genes = set(ppi_filtered['GeneName1']).union(set(ppi_filtered['GeneName2']))
print(f"Genes in OpenTargets: {len(ot_genes)}")
print(f"Genes in PPI network: {len(ppi_genes)}")




Task 2: Model Implementation and Tuning

This code introduces simplified node embeddings based on basic graph features. These embeddings are used to create edge features for machine learning models.

Instructions:

  • Choose at least one model: Logistic Regression, Random Forest, Gradient Boosting, or SVM.
  • Experiment with hyperparameters:
    • Logistic Regression: Adjust C
    • Random Forest: Vary n_estimators and max_depth
    • Gradient Boosting: Modify learning_rate and n_estimators
    • SVM: Tweak C and kernel
  • Document changes in F1-score, precision, recall, and accuracy as you adjust parameters.
  • Compare tuned model performance with default settings.
  • Explain the impact of hyperparameter changes in the context of gene-disease association.

Focus on understanding how these simpler embeddings and different hyperparameters affect model performance in predicting disease-gene associations. (A grid-search sketch for systematizing these experiments follows the code below.)

# Import necessary libraries
import networkx as nx
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import random
from tqdm import tqdm

def simple_node_embedding(G, dim=64):
    print("Generating simple node embeddings...")
    embeddings = {}
    for node in G.nodes():
        # Use node degree as a feature
        degree = G.degree(node)
        
        # Use average neighbor degree as another feature
        neighbor_degrees = [G.degree(n) for n in G.neighbors(node)]
        avg_neighbor_degree = np.mean(neighbor_degrees) if neighbor_degrees else 0
        
        # Create a simple embedding vector
        embedding = np.zeros(dim)
        embedding[0] = degree
        embedding[1] = avg_neighbor_degree
        
        # Fill the rest with random values (you could add more graph-based features here)
        embedding[2:] = np.random.randn(dim-2)
        
        embeddings[node] = embedding / np.linalg.norm(embedding)  # Normalize
    
    return embeddings

# Replace Node2Vec with this simpler embedding
node_embeddings = simple_node_embedding(G, dim=64)

# Use the embeddings in your existing pipeline
def get_edge_features(edge):
    # Edge feature = element-wise (Hadamard) product of the two node embeddings
    return node_embeddings[edge[0]] * node_embeddings[edge[1]]

X_positive = np.array([get_edge_features(edge) for edge in tqdm(positive_edges, desc="Processing positive edges")])
X_negative = np.array([get_edge_features(edge) for edge in tqdm(negative_edges, desc="Processing negative edges")])

X = np.vstack((X_positive, X_negative))
y = np.array([1] * len(positive_edges) + [0] * len(negative_edges))

# First, let's split our data into training+validation set and a separate test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def cross_validate_and_evaluate(model, X_train_val, y_train_val, X_test, y_test, model_name, cv=5):
    # Perform cross-validation
    cv_scores = cross_val_score(model, X_train_val, y_train_val, cv=cv, scoring='f1')
    
    print(f"\n{model_name} Cross-Validation Results:")
    print(f"Mean F1-score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    
    # Train on full training set and evaluate on test set
    model.fit(X_train_val, y_train_val)
    y_pred = model.predict(X_test)
    
    print(f"\n{model_name} Test Set Evaluation:")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score: {f1_score(y_test, y_pred):.4f}")

# Define models; remember, these hyperparameters may not be the best for your dataset!
models = {
    "Logistic Regression": LogisticRegression(class_weight='balanced'),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=10, class_weight='balanced', random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel='linear', C=1.0, class_weight='balanced', random_state=42)
}

# Perform cross-validation and evaluation for each model
for name, model in models.items():
    cross_validate_and_evaluate(model, X_train_val, y_train_val, X_test, y_test, name)
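
If you want a more systematic way to run the Task 2 hyperparameter experiments, a small grid search over one model is one option. This is only a minimal sketch reusing X_train_val, y_train_val, X_test, and y_test from the code above; the parameter values are starting points, not recommendations:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hypothetical grid for one model; swap in the model and parameters you chose
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [5, 10, None],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_grid,
    scoring='f1',   # match the cross-validation metric used above
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_val, y_train_val)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV F1-score: {search.best_score_:.4f}")

# Evaluate the refitted best model on the held-out test set
y_pred = search.best_estimator_.predict(X_test)
print(f"Test F1-score: {f1_score(y_test, y_pred):.4f}")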

Pinging members active in meetings/forums: @Prasun_Sharma @Moh_Saiger @ayahashim16 @ahmedsalim @Huikun_Li @Thuraya_Ayman @hahaharsini.

Please forward to your team channel to include other members who are working on the code but are unable to make mentor meetings.


I found two great libraries. Nodevectors uses CSR matrices to quickly generate node embeddings and supports multiple algorithms, like Node2Vec. I tested three of its algorithms to generate embeddings and then applied classification. I conducted two experiments with the Open Targets dataset: one where I collected the dataset in TSV format from the Open Targets platform and another where I used BigQuery to retrieve data from their database. Despite using several classifiers, including TensorFlow’s Sequential API, the results weren’t ideal, but if I’m not mistaken the second data source gave better results.

Another promising library I came across is GRAPE. It’s a package with Rust bindings that offers most of the networkx algorithms along with additional graph ML tools. It’s incredibly fast and can handle billion-scale graphs. It also provides a way to compute embeddings through its pipeline, though I haven’t figured out how to use these embeddings effectively yet. Their documentation isn’t the best, but they have many Jupyter notebook tutorials that showcase its capabilities, and more detail in their paper.

I’ve uploaded two notebooks on GitHub to show the experiments I mentioned.


Hi @ahmedsalim ,

Apologies for the late reply; I was out for a conference.

These libraries look very interesting.

I looked at your notebooks on the GitHub branch, including the outputs and the two datasets you picked.

The main difference I observed between your two datasets was:

Scores:

  • Dataset 1 uses the globalScore (which seems to include both direct and indirect association scores).
  • Dataset 2 uses the overall_direct_score (which seems to be a score based on direct associations for a specific disease).

Check this link for score definitions.

Consider the following insight about why dataset 2 might be performing better for you:

Global Score:

  • The global score is a combined score that takes into account multiple pieces of evidence from different datasets to represent the overall strength of the association between a gene and a disease. This score provides a comprehensive view of the gene-disease link, but it may include noise due to indirect evidence, leading to less precision in classification and potentially introducing false positives.

Direct Score:

  • On the other hand, the direct association score focuses on specific evidence that directly connects a gene to a disease, such as experimental validation through genetic association studies. Using direct association scores is likely to result in higher precision for predictions as the evidence is more reliable.

Another important point is which score you are using from StringDB. The use of indirect/direct association scores can significantly increase or decrease the amount of noise present.

Additional Notes:

  • I may have missed it, but I can’t see the edge weight in the PPI network. It should correspond to the score from the PPI dataset. Also, ensure that you normalize scores when integrating them.

  • Besides your current metrics, look at PR-AUC scores, especially when we have a potentially imbalanced dataset like this (see the sketch after this list). Check more at this link.

  • For now, it’s good to work with a single association score, but you can consider adding more features (other direct/indirect scores) instead of aggregating them.

  • Consider moving common functions to a separate .py file in the final version. For notebooks, it’s totally fine to experiment.

  • When working with a single disease for now, consider using an API call within your notebook instead of locally stored data. This will make it more reproducible for people to try when we publish on GitHub.
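
For the PR-AUC point above, here is a minimal sketch of how it could slot into your current evaluation, assuming a fitted model that exposes predict_proba (the linear SVC in the shared code would need probability=True or decision_function instead):

from sklearn.metrics import average_precision_score, precision_recall_curve
import matplotlib.pyplot as plt

# Scores for the positive class on the held-out test set
y_scores = model.predict_proba(X_test)[:, 1]

pr_auc = average_precision_score(y_test, y_scores)
print(f"PR-AUC (average precision): {pr_auc:.4f}")

# Optional: plot the precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision, label=f"PR-AUC = {pr_auc:.4f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()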

Final note: You’ve all done a great job and I’m sure you have learned the overall concepts. Experimenting and getting to the desired output takes time. Even if this is a work in progress, let’s wrap up with the current findings/learnings and host it on our GitHub. We can update it as we go.

– Sam

Hi @Samuel_bharti,

No worries at all, hope everything went well with the conference.

Thanks for taking the time and reviewing the notebooks/datasets.

Earlier, I was confused about the score differences: even though the gene symbols were the same across both datasets, I thought the global scores were actually the direct scores. Later, @anya and I realized that the global score from the TSV dataset (dataset 1) is actually the indirect score from the BigQuery dataset (dataset 2), which lines up with your insights about noise from indirect evidence. I’ve sent you an invite to that conversation for more details.

That said, can you confirm which dataset and score we should stick with? Also, since you suggested adding more features (other direct/indirect scores), I’d like to know which score to keep as the target and which as a feature.

As for the edge weights, you are right that I didn’t add the PPI scores (as weights) to the graph, because the idea was to use the PPI dataset only to collect the full set of genes and then subtract the associated ones from the OT dataset to get the non-associated genes. However, I’ll work on adding this weight, normalizing it, and refining that process.

Again, I really appreciate all your feedback! I’ll re-experiment based on your suggestions, update the evaluation metrics (PR-AUC), and explore API use for real-time data access. I’ll also organize the Python files into a source folder and update the GitHub branch soon.


Hello @ahmedsalim

This is for you and anyone who is coding along with the team.

Let’s do the following:

  1. Since we now have disease-specific data and different types of scores, we can use multiple methods and show them as an app. Check out the Streamlit.io documentation to quickly set up a Python web application with your current code (a minimal skeleton is sketched after this list). Start here for docs.
  2. Add API-call functionality in your current notebook/Python file if possible; otherwise we can move forward with CSV files.
  3. Create an app directory and put all your app-related files in it within your branch.
  4. This app should include the following:
    • Input options to select which dataset to use (two should be fine: one with only direct scores and the other with global (indirect + direct) scores).
    • Ability to select some model tuning parameters in the following section.
    • Then a section with 3-4 performance-metric plots from the model predictions, along with stats.
  5. Selecting which scores to use is a bit experimental. The final product here will be an app that helps researchers decide which data performs better with which ML model. Selection can be user-based.
  6. We’ll set up a simple GitHub webpage with 1-2 paragraphs of documentation for your app.
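
To make point 1 concrete, here is a minimal Streamlit skeleton to start from; it is only a sketch, and the file name, widget labels, and run_pipeline function are placeholders to adapt to your existing code, not part of it:

# app/app.py -- hypothetical entry point; run with `streamlit run app/app.py`
import streamlit as st

st.title("Disease-Gene Association Prediction")

# Input options (point 4): dataset and model choices
dataset = st.selectbox("Dataset", ["Direct scores", "Global (indirect + direct) scores"])
model_name = st.selectbox("Model", ["Logistic Regression", "Random Forest", "Gradient Boosting", "SVM"])
neg_pos_ratio = st.slider("Negative-to-positive ratio", min_value=1, max_value=20, value=10)

if st.button("Run analysis"):
    # run_pipeline is a placeholder for your existing graph/embedding/model code
    metrics, fig = run_pipeline(dataset, model_name, neg_pos_ratio)
    st.write(metrics)  # performance metrics table/stats
    st.pyplot(fig)     # e.g. a precision-recall plot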

Bonus (if we have time): We can also write a blog post on STEMAway and/or Medium to showcase the methodology and provide a boilerplate for researchers to build on.


Hi @Samuel_bharti and everyone,

So, I have built the app (not yet deployed) and updated the source scripts based on your earlier suggestions. These are some of the key changes I made:

  1. I updated the graph building pipeline by adding combined scores (from PPI) as weights to the edges.
  2. Since there’s a ratio indicating the number of negative edges relative to positive ones (e.g., 10:1), I sorted the potential negatives by their combined scores in descending order so we get negative edges that represent meaningful interactions rather than a random selection (a rough sketch of this follows the list).
  3. I added an additional evaluation metric: PR-AUC scores for both classes, with a visualization.
  4. I included edge splitting in the data processing and enabled predictions after model training.
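
For point 2, this is roughly what the score-ranked negative selection looks like; the variable names here are simplified for illustration and are not the exact code from my branch:

# Candidate negatives carry the combined PPI score: (disease, gene, combined_score) tuples
negative_candidates.sort(key=lambda edge: edge[2], reverse=True)

# Keep the highest-scoring candidates up to the desired negative:positive ratio
num_negative_samples = min(len(negative_candidates), desired_ratio * len(positive_edges))
negative_edges = [(disease, gene) for disease, gene, _ in negative_candidates[:num_negative_samples]]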

These changes have indeed made a difference, and the results are now somewhat acceptable, especially when using the advanced embedding models, like ProNE, and a disease that has numerous protein associations. Have a look at the main notebook.

As for the app, it has multiple input options to select data, embedding mode, model hyperparameters, and more. I’ve pushed the code to my branch and updated the README.md with instructions on how to run it, along with requirements.txt and other files. For a quick glance, check the screenshots here.

Oh, and @ayahashim16 also contributed to this work!


Thanks for the update! Let me take a look and get back to you.


Hi, @ahmedsalim @ayahashim16 and all other team members,

I was out sick for a few days and just completed your code review.

First of all, an excellent job! Creating an app itself is very impressive, and I can see it’s very thorough in its various input/output selections.

It took me a while to run the app and resolve package conflicts, but I got it working. So here are the last few steps and suggestions.

App design:

  • All input options and outputs (plots/tables) should go on the right side (main panel), not on the left.
  • The left is for Pages/Major Steps only.
  • I saw you have some documentation on GitHub and in some parts of the app, but it would be helpful to add a separate page for documentation and FAQs in the app itself. The pages for the app can be:

* Home Page
* Analysis
* Documentation/Tutorial
* FAQs
* About

On the left-side navigation/menu, either create different sections or use nested pages under “Analysis”. Check this package if useful: GitHub - blackary/st_pages: An experimental version of Streamlit Multi-Page Apps

(You can add sub-pages if possible in Streamlit; otherwise just adding a section named “Analysis” should be fine.) Here, make pages like Data import, Graph visualization, Model/embedding selection, Prediction, etc. Think of these steps as the workflow of your analysis and app.

Data selection

I liked that you were using APIs. It’s a good start. Consider providing a small dataset or a preloaded disease example in your app so that a new user can easily and quickly understand the different steps while navigating. You don’t necessarily have to load example data; you can also simply lower your params to make navigating the app quick.

Backend (This is just a note for the future; it doesn’t need priority right now.)

I saw there are a lot of packages used, some of which are only compatible with older versions. This makes the app more resource-heavy. Also, cleaning your code (removing unnecessary functions), loading things only when required, using parallelization, etc., can reduce resource utilization in the long run. Essentially, learn ways to increase the performance and efficiency of your app.

Conflict/Compatibility/Deployment (Not a priority, but preferred; aids reproducibility)

I mentioned I had to fix some conflicts while trying to run your app and install packages. While this could very well be a problem with my prior installations, it is a very common problem among developers. A quick and simple way to tackle it is to make a Docker container.

I created a new branch in the same GitHub repo. See here. After debugging, I added a Dockerfile you can use. See the instructions in the comments at the end of the file, and Google around in case you are new to Docker.


Project website

This is important. I created a repo for this project’s website using a template we had. Here’s the new repo: GitHub - mentorchains/BI-ML_Disease-Prediction_2024_Site: Project Site for BI-ML_Disease-Prediction_2024

The main files to edit are:

  • config.yml (edit the project title and short description)
  • _docs (this is where all the editable content goes; edit as per your project)

The site is already deployed here: Home | Project Title. It will be updated as you change the code.

To get a clearer idea of how to add content to the project website, see our project website template with the sMAP example. Check here.

As always, for good practice: create a branch for all development (then open a pull request and assign me or someone on your team as a reviewer before merging). Don’t commit directly to main.


Last, but very important:

Please write up contributions and acknowledgments carefully: who contributed which parts of the analysis, app, data collection, etc., in the project.

Kudos team!

@stemaway

– Sam
