Machine Learning - Gene-Disease Association Prediction
Part 1: Data Collection and Preprocessing (Common for Both Groups)
Step 1: Data Collection
- OpenTargets Data:
- Download from OpenTargets Platform
- Focus on the associations dataset, use the JSON format, and download a few files to start with. Coordinate within the group so that members pick different files.
- Example data format:
{"diseaseId":"DOID_0050890","targetId":"ENSG00000004478","score":0.0022,"evidenceCount":1} {"diseaseId":"DOID_0050890","targetId":"ENSG00000005381","score":0.0022,"evidenceCount":1} {"diseaseId":"DOID_0050890","targetId":"ENSG00000006128","score":0.0022,"evidenceCount":1}
- STRING Database:
- Download from STRING Database
- Select Homo sapiens as the organism
- Files needed: “9606.protein.links.detailed.v11.5.txt.gz” and “9606.protein.info.v11.5.txt.gz”
- Example data format:
protein1 protein2 combined_score
9606.ENSP00000000233 9606.ENSP00000356607 173
9606.ENSP00000000233 9606.ENSP00000427567 154
9606.ENSP00000000233 9606.ENSP00000253413 151
Step 2: Data Preprocessing
- Load the OpenTargets association files (JSON Lines) and the gzipped STRING interaction file into pandas DataFrames, and apply basic cleaning such as removing duplicates and filtering low-confidence interactions. A minimal loading sketch is shown below.
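A minimal loading sketch, assuming the files sit in the working directory. The file name associations_part_00000.json is a placeholder for whichever OpenTargets file you downloaded, and the 700 cutoff for "high-confidence" STRING edges is an optional assumption, not a requirement of this assignment:

import pandas as pd

# OpenTargets association files are JSON Lines (one JSON object per line)
opentargets_df = pd.read_json('associations_part_00000.json', lines=True)  # placeholder file name

# STRING links files are space-separated; pandas handles the .gz compression automatically
string_interactions = pd.read_csv('9606.protein.links.detailed.v11.5.txt.gz', sep=' ')

# Optional cleaning: drop duplicate associations and keep only higher-confidence interactions
opentargets_df = opentargets_df.drop_duplicates(subset=['diseaseId', 'targetId'])
string_interactions = string_interactions[string_interactions['combined_score'] >= 700]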
Step 3: Graph Construction
- Construct the graph using NetworkX.
import networkx as nx

# Build the graph from the DataFrames loaded in Step 2
G = nx.Graph()

# Add disease-gene edges from OpenTargets, weighted by the association score
for _, row in opentargets_df.iterrows():
    G.add_edge(row['diseaseId'], row['targetId'],
               weight=row['score'], type='disease-gene')

# Add protein-protein edges from STRING; combined_score is rescaled from 0-1000 to 0-1
for _, row in string_interactions.iterrows():
    G.add_edge(row['protein1'], row['protein2'],
               weight=row['combined_score'] / 1000, type='protein-protein')

# Note: OpenTargets targets use Ensembl gene IDs (ENSG...) while STRING nodes use
# protein IDs (ENSP...); map one namespace onto the other (e.g. via the STRING
# protein info file or Ensembl BioMart) so the two edge sets share nodes.
Step 4: Feature Extraction
- Extract node and edge features.
# Node-level features: degree centrality, clustering coefficient, PageRank
nx.set_node_attributes(G, nx.degree_centrality(G), 'degree_centrality')
nx.set_node_attributes(G, nx.clustering(G), 'clustering_coefficient')
nx.set_node_attributes(G, nx.pagerank(G), 'pagerank')

# Edge-level feature: number of neighbors shared by the two endpoints
def common_neighbors(G, u, v):
    return len(list(nx.common_neighbors(G, u, v)))

for u, v, d in G.edges(data=True):
    d['common_neighbors'] = common_neighbors(G, u, v)
Step 5: Dataset Preparation
- Create positive and negative samples, then split the data.
import random
from sklearn.model_selection import train_test_split

# Positive samples: existing edges; negative samples: an equal number of non-edges
edges = list(G.edges(data=True))
# Warning: enumerating all non-edges is O(n^2); for large graphs, sample non-edges instead
non_edges = list(nx.non_edges(G))
positive_samples = [(u, v, 1, d) for u, v, d in edges]
negative_samples = [(u, v, 0, {'weight': 0, 'common_neighbors': common_neighbors(G, u, v)})
                    for u, v in random.sample(non_edges, len(edges))]
samples = positive_samples + negative_samples
random.shuffle(samples)

# Feature vector per node pair: centrality features of both endpoints plus edge features
X = [[G.nodes[u]['degree_centrality'], G.nodes[u]['clustering_coefficient'], G.nodes[u]['pagerank'],
      G.nodes[v]['degree_centrality'], G.nodes[v]['clustering_coefficient'], G.nodes[v]['pagerank'],
      d['weight'], d['common_neighbors']] for u, v, _, d in samples]
y = [label for _, _, label, _ in samples]
pairs = [(u, v) for u, v, _, _ in samples]  # keep the node pairs for later interpretation

# 70/15/15 train/validation/test split
X_train, X_test, y_train, y_test, pairs_train, pairs_test = train_test_split(
    X, y, pairs, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test, pairs_val, pairs_test = train_test_split(
    X_test, y_test, pairs_test, test_size=0.5, random_state=42)
Part 2: Traditional Machine Learning Approach (Group 1)
Step 1: Model Selection and Training
- Choose models and train using GridSearchCV.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

models = {
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    # liblinear supports both the l1 and l2 penalties used in the grid below
    'Logistic Regression': LogisticRegression(solver='liblinear', max_iter=1000)
}

param_grids = {
    'Random Forest': {'n_estimators': [100, 200, 300], 'max_depth': [5, 10, None]},
    'Gradient Boosting': {'n_estimators': [100, 200, 300], 'learning_rate': [0.01, 0.1, 0.3]},
    'Logistic Regression': {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
}

results = {}
best_models = {}  # keep the fitted best estimator for each model family
for name, model in models.items():
    grid_search = GridSearchCV(model, param_grids[name], cv=5, scoring='roc_auc')
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    best_models[name] = best_model
    y_pred = best_model.predict(X_test)
    y_pred_proba = best_model.predict_proba(X_test)[:, 1]
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'auc': roc_auc_score(y_test, y_pred_proba)
    }

for name, metrics in results.items():
    print(f"{name}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
Step 2: Feature Importance Analysis
- Visualize feature importance.
import matplotlib.pyplot as plt

# Use the tuned (fitted) Random Forest from the grid search, not the unfitted template
rf_model = best_models['Random Forest']
feature_importance = rf_model.feature_importances_
feature_names = ['u_degree', 'u_clustering', 'u_pagerank',
                 'v_degree', 'v_clustering', 'v_pagerank',
                 'weight', 'common_neighbors']

plt.figure(figsize=(10, 6))
plt.bar(feature_names, feature_importance)
plt.title('Feature Importance in Random Forest Model')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Part 3: Graph Neural Network Approach (Group 2)
Step 1: GNN Data Preparation
- Prepare data for GNN using PyTorch Geometric.
import torch
from torch_geometric.data import Data

# Node IDs are strings, so map each node to an integer index for PyTorch Geometric
node_index = {n: i for i, n in enumerate(G.nodes())}

edge_index = torch.tensor([[node_index[u], node_index[v]] for u, v in G.edges()],
                          dtype=torch.long).t().contiguous()
x = torch.tensor([[G.nodes[n]['degree_centrality'],
                   G.nodes[n]['clustering_coefficient'],
                   G.nodes[n]['pagerank']] for n in G.nodes()],
                 dtype=torch.float)

data = Data(x=x, edge_index=edge_index)
Step 2: GNN Model Implementation
- Implement a GCN model.
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GCN(in_channels=3, hidden_channels=64, out_channels=32)
Step 3: Model Training
- Train the GNN for link prediction: score candidate edges with the dot product of their endpoint embeddings and optimise a binary cross-entropy loss against sampled negative edges.
from torch_geometric.utils import negative_sampling

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()

def train():
    model.train()
    optimizer.zero_grad()
    z = model(data.x, data.edge_index)  # node embeddings
    # Positive examples: observed edges; negative examples: sampled non-edges
    neg_edge_index = negative_sampling(edge_index=data.edge_index,
                                       num_nodes=data.num_nodes,
                                       num_neg_samples=data.edge_index.size(1))
    pair_index = torch.cat([data.edge_index, neg_edge_index], dim=1)
    labels = torch.cat([torch.ones(data.edge_index.size(1)),
                        torch.zeros(neg_edge_index.size(1))])
    # Edge score = dot product of the two endpoint embeddings
    scores = (z[pair_index[0]] * z[pair_index[1]]).sum(dim=-1)
    loss = criterion(scores, labels)
    loss.backward()
    optimizer.step()
    return loss

for epoch in range(200):
    loss = train()
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')
Step 4: Model Evaluation
- Evaluate the GNN by scoring positive edges against sampled negative edges and computing AUC. For an unbiased estimate, hold out edges before training, as sketched after the code below.
from sklearn.metrics import roc_auc_score

model.eval()
with torch.no_grad():
    z = model(data.x, data.edge_index)
    # Score observed edges against an equal number of freshly sampled negative edges
    neg_edge_index = negative_sampling(edge_index=data.edge_index,
                                       num_nodes=data.num_nodes,
                                       num_neg_samples=data.edge_index.size(1))
    eval_pair_index = torch.cat([data.edge_index, neg_edge_index], dim=1)
    eval_labels = torch.cat([torch.ones(data.edge_index.size(1)),
                             torch.zeros(neg_edge_index.size(1))])
    pred = torch.sigmoid((z[eval_pair_index[0]] * z[eval_pair_index[1]]).sum(dim=-1))
    auc = roc_auc_score(eval_labels.cpu(), pred.cpu())
print(f'Test AUC: {auc:.4f}')
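The code above scores the same edges the model was trained on, so the reported AUC is optimistic. A minimal sketch of a fairer protocol, assuming PyTorch Geometric's RandomLinkSplit transform (the split fractions are illustrative, not prescribed by this assignment):

from torch_geometric.transforms import RandomLinkSplit

# Hold out 10% of edges for validation and 20% for testing; negative pairs are sampled for you
transform = RandomLinkSplit(num_val=0.1, num_test=0.2, is_undirected=True,
                            add_negative_train_samples=True)
train_data, val_data, test_data = transform(data)

# Train by message-passing over train_data.edge_index and scoring the pairs in
# train_data.edge_label_index against train_data.edge_label; report AUC on the
# pairs in test_data.edge_label_index.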
Step 5: Visualization
- Visualize node embeddings using t-SNE.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = model(data.x, data.edge_index).detach().cpu().numpy()
tsne = TSNE(n_components=2)
node_embeddings_2d = tsne.fit_transform(embeddings)

plt.figure(figsize=(10, 8))
plt.scatter(node_embeddings_2d[:, 0], node_embeddings_2d[:, 1])
plt.title('2D Visualization of Node Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
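If you want to see whether diseases, genes, and proteins occupy different regions of the embedding space, a small optional extension is to colour points by node type. The prefix-based typing below is a heuristic that matches the example identifiers in this guide, not an official mapping:

# Heuristically assign a type to each node from its identifier prefix
node_types = ['protein' if str(n).startswith(('9606.', 'ENSP'))
              else 'gene' if str(n).startswith('ENSG')
              else 'disease' for n in G.nodes()]
colors = {'disease': 'tab:red', 'gene': 'tab:blue', 'protein': 'tab:green'}

plt.figure(figsize=(10, 8))
for t, c in colors.items():
    idx = [i for i, nt in enumerate(node_types) if nt == t]
    plt.scatter(node_embeddings_2d[idx, 0], node_embeddings_2d[idx, 1], s=10, c=c, label=t)
plt.legend()
plt.title('Node Embeddings by Node Type')
plt.show()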
Final Steps (Both Groups)
Step 1: Results Comparison
- Compare the performance metrics of the traditional ML models and the GNN.
# Traditional ML performance
for name, metrics in results.items():
    print(f"{name} Performance Metrics:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")

# GNN performance
print(f"GNN Test AUC: {auc:.4f}")
- Analyze which approach performed better in terms of accuracy, precision, recall, F1 score, and AUC (a small tabulation sketch follows below). Discuss potential reasons for the observed differences, considering the complexity of relationships captured by GNNs versus traditional ML models.
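One convenient way to lay the numbers side by side before writing the discussion (a small sketch; the GNN pipeline above only reports AUC, so its other cells stay empty):

import pandas as pd

# Rows: models, columns: metrics; add the GNN's AUC as an extra row
comparison = pd.DataFrame(results).T
comparison.loc['GNN', 'auc'] = auc
print(comparison.round(4))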
Step 2: Biological Interpretation
- Analyze the top predictions from each method.
# For traditional ML (e.g., the tuned Random Forest): rank test pairs by predicted probability
rf_proba = best_models['Random Forest'].predict_proba(X_test)[:, 1]
top_predictions = sorted(zip(pairs_test, rf_proba), key=lambda x: x[1], reverse=True)[:10]
for (u, v), score in top_predictions:
    print(f"Pair: {u} - {v}, Prediction Score: {score:.4f}")

# For the GNN: rank the scored pairs from the evaluation step
idx_to_node = {i: n for n, i in node_index.items()}
top_gnn_predictions = sorted(zip(eval_pair_index.t().tolist(), pred.tolist()),
                             key=lambda x: x[1], reverse=True)[:10]
for (i, j), score in top_gnn_predictions:
    print(f"Pair: {idx_to_node[i]} - {idx_to_node[j]}, Prediction Score: {score:.4f}")

# Interpret the biological significance of the top-ranked disease-gene pairs
Step 3: Incorporating Additional Features via NLP
- Extract additional features using NLP (e.g., from scientific literature or clinical notes).
- Use libraries like spaCy or NLTK to process text data and extract relevant features.
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en_core_web_sm')
documents = [...]  # List of documents to process, one text per sample, aligned with the rows of X

# Lemmatise and drop stop words with spaCy before computing TF-IDF features
def preprocess(text):
    doc = nlp(text)
    return ' '.join(tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop)

tfidf = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf.fit_transform(preprocess(d) for d in documents)

# Add these features to the existing feature set
additional_features = tfidf_matrix.toarray()
X_enhanced = np.hstack((np.array(X), additional_features))
- Incorporate these additional features into the dataset and retrain the models.
# Split the enhanced data (same split sizes and random seeds as before)
X_train, X_test, y_train, y_test = train_test_split(X_enhanced, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

# Retrain the traditional ML models and the GNN with the enhanced feature set,
# following the same training and evaluation steps as before
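As one illustration of that retraining, a minimal sketch that refits only the tuned Random Forest on the enhanced features. The hyperparameters are carried over from the earlier grid search as an assumption; rerunning GridSearchCV on X_enhanced is equally valid:

from sklearn.base import clone
from sklearn.metrics import roc_auc_score

# Copy the hyperparameters found earlier, but fit on the enhanced feature matrix
rf_enhanced = clone(best_models['Random Forest'])
rf_enhanced.fit(X_train, y_train)

proba_enhanced = rf_enhanced.predict_proba(X_test)[:, 1]
print(f"Random Forest (enhanced features) AUC: {roc_auc_score(y_test, proba_enhanced):.4f}")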
Step 4: Future Directions
- Explore incorporating more diverse features such as gene expression data or pathway information.
- Experiment with more advanced GNN architectures, like GraphSAGE or GAT.
- Develop an ensemble method combining traditional ML and GNN predictions to leverage the strengths of both approaches.
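For the last bullet, a minimal late-fusion sketch, assuming pair_features is the eight-dimensional hand-crafted vector for a candidate pair, the weighting alpha=0.5 is an arbitrary starting point, and best_models, model, data, and node_index from the earlier steps are available:

import torch

def ensemble_score(u, v, pair_features, alpha=0.5):
    # Probability from the tuned Random Forest on the hand-crafted features
    rf_score = best_models['Random Forest'].predict_proba([pair_features])[0, 1]

    # Score from the GNN: sigmoid of the dot product of the two node embeddings
    model.eval()
    with torch.no_grad():
        z = model(data.x, data.edge_index)
        gnn_score = torch.sigmoid((z[node_index[u]] * z[node_index[v]]).sum()).item()

    # Weighted average of the two predictions
    return alpha * rf_score + (1 - alpha) * gnn_score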