Updated! 📢 Project Kickoff and Team Assignments for Upcoming Virtual-Internships

Teams Based on MCQ Scores

We have extended our deadline, so more participants will be added on a rolling basis. The passing score is 8 out of 10. Although we have accepted a couple of scores of 7, please make sure you read the project details thoroughly.

Foundational Projects:

Proceed to the code-along for your project and submit your replies to the specified tasks. If you encounter any issues, ask questions via the forum, and mentors will respond as soon as possible.

Advanced Projects:

We will contact you soon with details about the AI Evaluator and a kick-off preparatory meeting with our mentors.

Collaborative Opportunities:

Our projects are designed to encourage dynamic collaboration among teams within the same area. For example, the Foundational ML team can dive into NLP tasks highlighted by the Recommender team, and the Recommender team can compare their findings with those of the GNN-based Recommender team to develop a comprehensive study.

In addition, those who have honed their skills in R Shiny can collaborate with the advanced Bioinformatics team to implement sophisticated projects. Such cross-team interactions not only enhance learning but also lead to innovative solutions.

For the Survey of AI Coding Assistants, team members have the opportunity to familiarize themselves with specific AI coding assistants by working on tasks such as Text Classification using NLP or the PPI Visualization App. This hands-on approach provides a thorough understanding of AI tools and their applications in various projects.

Machine Learning

GNN-Based Recommender Systems

NLP Pipeline - Recommender Systems

Text Classification using NLP

Bioinformatics

Biomedical Knowledge Graph

PPI Visualization Graph

Survey of AI Coding Assistants

Roadmap based on mentor meeting 07/27:

Parallel Tasks:

Bioinformatics

Objective: Define the specific project goals and select target diseases for prediction.
Starting point: General project idea of predicting gene-disease associations.

Tasks:

  • Research OMIM database structure and content
  • Investigate STRINGdb content and structure
  • Study the biological basis of gene-disease associations
  • Identify and prioritize specific diseases for the prediction task

Expected outcome:

  • A clear, written project objective specifying the prediction task and evaluation criteria
  • A prioritized list of diseases for the project focus

Data Processing, Feature Extraction, and ML Pipeline

Objective: Create a flexible pipeline for data processing, feature extraction, and ML model training that can be applied to any disease.
Starting point: Select any one disease for development, and write the code so that it works for any other disease selected later.

Tasks:

  • Develop modular code for:
    • OMIM data extraction (API or web scraping)
    • STRINGdb data processing (data is available for direct download)
    • Text preprocessing and NLP feature extraction
    • Dataset preparation (positive and negative examples)
    • ML model training and evaluation
  • Implement the pipeline for the initial selected disease
  • Ensure the pipeline can easily accommodate new diseases
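
As a starting point for the STRINGdb processing and dataset preparation steps above, here is a minimal sketch in plain Python. It assumes the downloaded links file uses the space-separated layout `protein1 protein2 combined_score` with a header row, so please check that against the file you actually download. The function names, the `min_score` threshold of 700, and the negative sampling ratio are illustrative choices, not project requirements.

```python
import random


def parse_string_links(lines, min_score=700):
    """Parse STRINGdb link rows into (protein1, protein2, score) tuples.

    Assumes rows look like 'protein1 protein2 combined_score' (space-separated,
    with one header row). min_score=700 is an illustrative cutoff only.
    """
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] == "protein1":
            continue  # skip the header and malformed rows
        score = int(parts[2])
        if score >= min_score:
            edges.append((parts[0], parts[1], score))
    return edges


def build_examples(all_genes, disease_genes, negatives_per_positive=1, seed=0):
    """Label known disease genes 1 and a random sample of the other genes 0."""
    positives = sorted(set(disease_genes) & set(all_genes))
    candidates = sorted(set(all_genes) - set(disease_genes))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_negatives = min(len(candidates), negatives_per_positive * len(positives))
    negatives = rng.sample(candidates, n_negatives)
    return [(gene, 1) for gene in positives] + [(gene, 0) for gene in negatives]
```

Note that the sampled negatives are only genes with no known association, not confirmed non-associations, so it is worth keeping that in mind when interpreting evaluation results.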

Expected outcome:

  • A complete, modular pipeline that can process data, extract features, and train ML models for any given disease
  • Initial results for the selected development disease
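
One possible way to keep the pipeline disease-agnostic is to pass the disease in as a parameter and make every stage a swappable function. The sketch below is only an illustration: `DiseasePipeline`, the toy stages, and placeholder names such as `GENE_A` and `disease_x` are all hypothetical and carry no biological meaning. Real stages would call the OMIM and STRINGdb code instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DiseasePipeline:
    """Chains the stages; each stage is a plain function that can be swapped."""
    fetch_text: Callable[[str], Dict[str, str]]      # disease -> {gene: description}
    extract_features: Callable[[str], List[float]]   # description -> feature vector
    label_gene: Callable[[str, str], int]            # (gene, disease) -> 0 or 1

    def build_dataset(self, disease: str) -> List[Tuple[str, List[float], int]]:
        rows = []
        for gene, text in sorted(self.fetch_text(disease).items()):
            rows.append((gene, self.extract_features(text), self.label_gene(gene, disease)))
        return rows


# Toy stand-ins (hypothetical data) so the sketch runs end to end.
TOY_TEXT = {"disease_x": {"GENE_A": "kinase signalling kinase", "GENE_B": "membrane transport"}}
TOY_LABELS = {("GENE_A", "disease_x")}
VOCAB = ["kinase", "membrane", "signalling", "transport"]


def toy_fetch(disease):
    return TOY_TEXT.get(disease, {})


def bag_of_words(text):
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def toy_label(gene, disease):
    return int((gene, disease) in TOY_LABELS)


pipeline = DiseasePipeline(toy_fetch, bag_of_words, toy_label)
rows = pipeline.build_dataset("disease_x")
# rows[0] is ("GENE_A", [2.0, 0.0, 1.0, 0.0], 1)
```

Supporting a new disease then means calling `build_dataset` with a different name, without touching the stage code.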

Graph Neural Network Implementation

Objective: Develop a GNN model for gene-disease association prediction that can incorporate features from the ML pipeline.
Starting point: STRINGdb protein-protein interaction data.

Tasks:

  • Construct a graph representation of the protein-protein interaction network
  • Implement a basic GNN architecture (e.g., Graph Convolutional Network)
  • Develop a flexible node feature integration mechanism
  • Train the GNN using initial node features (e.g., protein IDs)
  • Implement evaluation metrics for the GNN model
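
To make the first two tasks above concrete, here is a small sketch of building the interaction graph and running one GCN propagation step, ReLU(D^-1/2 (A + I) D^-1/2 X W), in plain Python. It is a forward pass only, with no training, and the function names are illustrative. For the real model, a library such as PyTorch Geometric offers trainable layers like `GCNConv`; this version is just meant to show what such a layer computes.

```python
import math


def build_adjacency(edges):
    """Return (sorted node list, adjacency matrix with self-loops) for undirected edges."""
    nodes = sorted({node for u, v in edges for node in (u, v)})
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    # Start from the identity matrix so that every node has a self-loop (A + I).
    adj = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for u, v in edges:
        adj[index[u]][index[v]] = 1.0
        adj[index[v]][index[u]] = 1.0
    return nodes, adj


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One GCN step: symmetric normalization, neighbor aggregation, linear map, ReLU."""
    size = len(adj)
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in adj]  # degree includes the self-loop
    norm = [[inv_sqrt_deg[i] * adj[i][j] * inv_sqrt_deg[j] for j in range(size)]
            for i in range(size)]
    out = matmul(matmul(norm, features), weights)
    return [[max(0.0, value) for value in row] for row in out]
```

With one-hot protein IDs as `features`, the feature matrix is simply the identity, which matches the "initial node features" idea in the task list.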

Expected outcome:

  • A functional GNN model capable of predicting gene-disease associations
  • A mechanism to integrate additional node features when they become available from Task 2 (the Data Processing, Feature Extraction, and ML Pipeline task)
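
One simple way to keep node feature integration flexible is to start from one-hot protein IDs and concatenate any extra feature vectors once they exist, padding with zeros for proteins that have none yet. The sketch below assumes features are stored as dictionaries keyed by protein ID. Both function names are hypothetical, and zero-padding is just one possible choice.

```python
def one_hot_features(nodes):
    """Initial features: one-hot encoding of each node's position in the list."""
    size = len(nodes)
    return {node: [1.0 if i == j else 0.0 for j in range(size)]
            for i, node in enumerate(nodes)}


def integrate_features(base, extra=None):
    """Append extra feature vectors to the base features.

    Nodes missing from `extra` are padded with zeros so that every node ends
    up with a vector of the same length. With no extras, base is copied as-is.
    """
    if not extra:
        return {node: list(vec) for node, vec in base.items()}
    extra_dim = len(next(iter(extra.values())))
    merged = {}
    for node, vec in base.items():
        addition = extra.get(node, [0.0] * extra_dim)
        if len(addition) != extra_dim:
            raise ValueError(f"inconsistent extra feature length for {node}")
        merged[node] = list(vec) + list(addition)
    return merged


base = one_hot_features(["P1", "P2"])
merged = integrate_features(base, {"P1": [0.5, 2.0]})
# merged["P1"] is [1.0, 0.0, 0.5, 2.0]; merged["P2"] is [0.0, 1.0, 0.0, 0.0]
```

Because the GNN only sees the merged vectors, the model code does not need to change when new features arrive. Only the input dimension of the first layer does.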

Integration Plan:

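  • Once Task 1 (Bioinformatics) finalizes the disease selections, the Task 2 (Data Processing, Feature Extraction, and ML Pipeline) team will apply its pipeline to these diseases
  • The Task 3 (Graph Neural Network Implementation) team will integrate the features generated by Task 2 into its GNN model as they become available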

Alternate Path

An interesting idea proposed by a student was to use a pretrained GNN and fine-tune it for a specific use case. Mentors had concerns about the compute power requirements (pretrained GNNs are usually very large) and the learning aspect (it may lead to treating the model as a black box). However, this path could be a great step towards a final product. If anyone has feedback, please reply in this thread.

We have created a new subcategory: Project - July 2024 - STEM-Away® for this project.

Group members of Updates-2024 - STEM-Away® have read and reply permissions. This is an open group; anyone interested can join.

Group members of https://stemaway.com/g/Preinternship have full read, reply, and create permissions. Members are added by the STEM-Away team.