Bioinformatics - Level 1 - Module 1 - Kelly Zhang

Module 1 Overview:

Technical Area:

  • Installing RStudio, along with ggplot and Bioconductor packages in R
  • Familiarizing myself with R fundamentals, such as coding, debugging, and syntax
  • Understanding how to read scientific papers
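A minimal sketch of the package setup step, assuming an internet connection and a working R installation. The helper name `missing_packages` is hypothetical (it is not part of R or Bioconductor); it only reports which packages still need installing, and the commented lines show the usual install calls.

```r
# Report which of the requested packages are not yet installed.
# `missing_packages` is a hypothetical helper name used only for illustration.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" ships with R, so it should never be reported as missing.
print(missing_packages(c("stats", "ggplot2", "BiocManager")))

# To install whatever is missing (requires an internet connection):
# install.packages(c("ggplot2", "BiocManager"))  # CRAN packages
# BiocManager::install("affy")                   # a Bioconductor package
```

Checking first avoids reinstalling packages that are already present, which is useful when an install takes several minutes.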

Tools:

  • STEM-Away
  • RStudio
  • R

Soft Skills:

  • Time Management: I’m also juggling a lab position, so I had to manage my time wisely to make sure I was effectively completing my duties as PM leader and as a STEM-Away participant.
  • Project Management: I set up my team’s Trello, managed our Google Suite workspaces (Calendar, Drive, Google Group), and I made a few (hopefully) helpful STEM-Away posts.
  • Leadership, Co-Mentoring: I learned to work with both project leads and the leadership of the other Bioinformatics teams to ensure our team participants started out on the right foot.

Achievement Highlights:

  • I had an older version of R installed, so I updated to the latest version (v4.1.0). This let me install the latest version of Bioconductor and get past several “file not found” errors along the way.
  • I now have a basic understanding of bioinformatics.
  • I familiarized myself with R syntax and code and with working in RStudio.

Difficulties Completing Tasks:

  • The STEM-Away platform was confusing to navigate at first, but as I interacted with it more, I found it easier to find the posts and information I was looking for.

Module 2 Overview:

Technical Area:

  • Installing several necessary packages in R
  • Working with RShiny, RStudio, and RStudio Cloud
  • Familiarizing myself with the GEO database and understanding how to view extracted data

Tools:

  • RStudio, R, RShiny
  • WinRAR
  • R Packages: affy, affyPLM, simpleaffy, etc.
  • Bioconductor
  • GitHub

Soft Skills:

  • Time/Task Management - I’m still figuring out how to juggle my lab internship with the STEM-Away tasks and my project management lead duties. I was able to get all the necessary things done, but I will create an hourly schedule for myself next week so that I can stay on top of everything.
  • Project Management - I created a Google Spreadsheet detailing the Bioinformatics Pathway for my team. I’ve also been updating our team’s Trello and Google Calendar.
  • Leadership - I worked with our Project Leads on setting up the GitHub repository and on making sure our meeting agendas covered everything. I also make sure to post important information so that our team has all the tools they need to be successful.
  • Collaboration - I worked with Shravya last week on our Journal presentation and I am working with Leila this week on our module 3 deliverables.

Achievement Highlights (3):

  • Ivan and I played around with our GitHub repository. I was able to get a handle on creating branches, forking the repository, pulling and pushing changes, refreshing my fork, etc. I also familiarized myself with GitHub Desktop and with using GitHub from the Windows Command Prompt.
  • I successfully unzipped and opened the GEO datasets in RStudio. I have a Windows machine, so this was a bit of a challenge for me initially.
  • I kept constant communication with the other leads (from my own team and other BI teams) to coordinate the R Shiny project and to help guide and mentor my team members.

Difficulties Completing Tasks:

  • Since I installed R v4.1.0 initially, I had some trouble with further package installations. With Sam’s help, I reverted to R v4.0.5.
  • I kept getting a “path not writable” error when I tried installing the packages, but I was able to work through this.
  • Completing batch corrections on multiple GEO datasets - I have not yet been able to do this, but I’d like to figure it out.
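A small diagnostic sketch for the “path not writable” error, assuming the cause is a read-only library directory (a common cause on Windows, but an assumption here, since the actual fix is not recorded above). It prints the running R version and checks whether each library directory is writable.

```r
# Show the running R version and whether each library directory is writable.
cat(R.version.string, "\n")

lib_dirs <- .libPaths()
# file.access(mode = 2) returns 0 when the directory is writable.
writable <- file.access(lib_dirs, mode = 2) == 0
print(data.frame(library = lib_dirs, writable = writable))

# If no library is writable, a per-user library can be created and put first:
# user_lib <- Sys.getenv("R_LIBS_USER")
# dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)
# .libPaths(c(user_lib, .libPaths()))
```

Packages are installed into the first entry of `.libPaths()` by default, so putting a writable directory first is usually enough.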

Module 3 Overview:

Technical Area:

  • Installed several Bioconductor packages in R via RStudio
  • Pulled, committed, and pushed code from/to GitHub
  • Analyzed the GSE19804 dataset

Tools:

  • RStudio
  • R
  • simpleaffy
  • arrayQualityMetrics
  • affyQCReport
  • affyPLM
  • PCA
  • pheatmap()

Soft Skills:

  • Time Management - Juggled PM duties, learning the ins and outs of data analysis, and my personal lab projects
  • Project Management - Worked closely with Sam and Disha on organizing the Bioinformatics-team-wide R Shiny project. Worked with BI-Team 1 leads on preparing team members, structuring meetings, and ensuring completion of deliverables
  • Leadership - Coordinated and prepared meetings, instructed fellow team members on module timeline/RShiny overview
  • Virtual Collaboration - Met with different groups throughout the week, working on deliverables, R Shiny, and the STEMAway modules.

Achievement Highlights (3):

  • Successfully worked with R and R libraries! I have been able to generate analyses of the GSE19804 dataset via simpleaffy, arrayQualityMetrics, affyQCReport, affyPLM, PCA, and heatmaps.
  • I now understand batch correction conceptually, although I have not yet performed one myself.
  • Contributed to PM duties by working with Sam and Disha to create a foundation for the RShiny project using Slack, Airtable, Trello, Spacetime, and Google Calendar.
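To illustrate the PCA step mentioned above, here is a minimal sketch on simulated data rather than GSE19804. The matrix, the sample names, and the size of the group effect are all made up for illustration; only base R's `prcomp()` is used.

```r
set.seed(1)
# Simulated log-expression matrix: 200 genes (rows) x 6 samples (columns).
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               c("N1", "N2", "N3", "T1", "T2", "T3")))
# Shift the first 50 genes upward in the "tumor" samples to create a group effect.
expr[1:50, 4:6] <- expr[1:50, 4:6] + 3

# prcomp() expects samples in rows, so transpose the genes-x-samples matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Sample coordinates on PC1/PC2; samples from the same group should sit together.
print(round(pca$x[, 1:2], 2))
# plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```

In a QC setting, a sample that sits far from the rest of its group on the first components is a candidate outlier or a sign of a batch effect.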

Difficulties Completing Tasks:

  • Understanding how to read the plots output by the many data analysis tools, specifically heat map plots

Module 4 Overview:

Technical Area:

  • Gene annotation and filtering
  • Limma analysis: statistical method for multifactorial analysis of differentially expressed genes (DEG)
  • Volcano plots: visualizing DEG
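As a sketch of what a volcano plot encodes, the example below labels genes as up, down, or not significant from a log fold change and an adjusted p-value. The data are simulated, the column names mirror those returned by limma's `topTable()`, and the cutoffs (|logFC| > 1, adjusted p < 0.05) are illustrative assumptions, not the values used in the module.

```r
set.seed(42)
# Simulated results table standing in for real differential expression output.
n_genes <- 1000
results <- data.frame(
  gene    = paste0("gene", seq_len(n_genes)),
  logFC   = rnorm(n_genes, mean = 0, sd = 1),
  P.Value = runif(n_genes)
)
# Make the first 30 genes look strongly changed (15 up, 15 down).
results$logFC[1:30]   <- c(rep(3, 15), rep(-3, 15))
results$P.Value[1:30] <- 1e-6

# Benjamini-Hochberg adjustment for multiple testing.
results$adj.P.Val <- p.adjust(results$P.Value, method = "BH")

# Illustrative cutoffs (assumptions, not values from the module).
lfc_cutoff <- 1
p_cutoff   <- 0.05
results$status <- "not significant"
results$status[results$adj.P.Val < p_cutoff & results$logFC >  lfc_cutoff] <- "up"
results$status[results$adj.P.Val < p_cutoff & results$logFC < -lfc_cutoff] <- "down"
print(table(results$status))

# A volcano plot puts logFC on the x-axis and -log10(p) on the y-axis:
# plot(results$logFC, -log10(results$P.Value), col = factor(results$status))
```

Genes in the upper left and upper right corners of the plot are the ones that pass both cutoffs.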

Tools:

  • RStudio
  • R
  • hgu133plus2.db
  • Limma analysis
  • Volcano plot (volcanoplot())
  • Heatmap (pheatmap())

Soft Skills:

  • Time Management - Juggled PM duties, learning the ins and outs of data analysis, and my personal lab projects
  • Project Management - Worked with Sam, Disha, and the BI-Team 1 Leads to coordinate meetings, break down project details, and address member questions/concerns.
  • Leadership - Provided resources to my team members and assisted in leading meetings.
  • Virtual Collaboration - Met with BI-Team 1 this week to present deliverables, and met with my group to work through them (generating the top 10 DEG).

Achievement Highlights (3):

  • Gained a good understanding of how to identify outliers. Together with my group members, I figured out how to exclude outliers from the raw data and the metadata.
  • Successfully subset my raw data and metadata to build a design matrix, fit a linear model, and apply an empirical Bayes model.
  • Figured out how to plot a heatmap with both raw and normalized data, and learned what the heatmap means.

Difficulties Completing Tasks:

  • I had some trouble figuring out how to use certain Bioconductor packages. For example, I wasn’t sure what arguments to use for topTable(). I was able to figure this out thanks to my group members, Aditi and Shreya.
  • I got a few errors while running my code and had to figure out what the error was. I occasionally had to go back into previous code, change a few things, and rerun the same sections a few times.
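For reference, a minimal sketch of the limma workflow and the `topTable()` arguments that caused trouble, run on simulated data. The expression matrix, sample names, and group labels are made up; this is not the code my group used. The limma calls run only if the package is installed.

```r
set.seed(7)
# Simulated normalized log2 expression: 100 genes x 6 samples.
expr <- matrix(rnorm(100 * 6, mean = 8), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("S", 1:6)))
expr[1:5, 4:6] <- expr[1:5, 4:6] + 2   # five genes raised in the tumor group

# Metadata: one row per sample, in the same order as the columns of expr.
meta <- data.frame(group = factor(rep(c("normal", "tumor"), each = 3),
                                  levels = c("normal", "tumor")))

# Design matrix: intercept (normal baseline) plus a tumor-vs-normal column.
design <- model.matrix(~ group, data = meta)
print(colnames(design))

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::lmFit(expr, design)   # per-gene linear model
  fit <- limma::eBayes(fit)           # empirical Bayes moderation
  # coef selects the design column to test; number limits the rows returned.
  top <- limma::topTable(fit, coef = "grouptumor", number = 10,
                         adjust.method = "BH", sort.by = "P")
  print(top)
} else {
  message("limma is not installed; try BiocManager::install('limma')")
}
```

The `coef` argument must match a column name (or index) of the design matrix, which is why printing `colnames(design)` first is a useful habit.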

Module 5 Overview:

Technical Area:

  • Utilized code from previous module to construct a list of top DEG. In the process, I ended up changing and debugging some of my code so that it would run.
  • Performed gene set enrichment analysis (GSEA) using GSEA-MSigDB and our DEG vector
  • Installed several Bioconductor packages
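To make the GSEA plot easier to interpret, here is a base R sketch of the weighted running-sum statistic behind the enrichment score for a single gene set. The function name `enrichment_score`, the toy gene names, and the statistics are all hypothetical; real analyses would use a dedicated package and also estimate significance, which this sketch does not do.

```r
# Enrichment score for one gene set from a ranked list of gene statistics.
# `enrichment_score` is a hypothetical helper written for illustration only.
enrichment_score <- function(ranked_stats, gene_set, p = 1) {
  # ranked_stats: named numeric vector sorted in decreasing order.
  stopifnot(!is.null(names(ranked_stats)), !is.unsorted(rev(ranked_stats)))
  in_set <- names(ranked_stats) %in% gene_set
  stopifnot(any(in_set), any(!in_set))
  # Hits step up in proportion to |statistic|^p; misses step down uniformly.
  hit_weights <- abs(ranked_stats)^p * in_set
  p_hit  <- cumsum(hit_weights) / sum(hit_weights)
  p_miss <- cumsum(!in_set) / sum(!in_set)
  running <- p_hit - p_miss
  # The score is the running sum's largest deviation from zero, keeping its sign.
  unname(running[which.max(abs(running))])
}

# Toy ranked list (e.g. t-statistics), highest first.
stats <- c(a = 4, b = 3, c = 2, d = 1, e = -1, f = -2)
print(enrichment_score(stats, c("a", "b")))  # set concentrated at the top
print(enrichment_score(stats, c("e", "f")))  # set concentrated at the bottom
```

A positive score means the set's genes cluster near the top of the ranked list, and a negative score means they cluster near the bottom; the peak of the curve in a GSEA plot is this value.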

Tools:

  • R, RStudio
  • GitHub
  • GSEA-MSigDB
  • R Packages: msigdbr, tidyr, magrittr

Soft Skills:

  • Virtual collaboration - worked with Roman over Zoom to construct code and work on the GSEA plot together
  • Presentation - presented our deliverables to the BI-Team 1. I explained what the GSEA plot showed us and possible analyses, as well as concerns.
  • Time Management - planned my time so I could attend meetings with my team and the mentors and work on the deliverables, while still making time to familiarize myself with functional analysis and keep up with my personal lab projects.

Achievement Highlights (3):

  • I successfully debugged my code for constructing a list of top DEG (the code was carried over from the previous module but was not running as expected).
  • I learned how to read and analyze a GSEA plot.
  • Roman and I experimented with different p-value cutoffs for the GSEA plot and found the range of maximum and minimum p-values that best visualized our results.

Difficulties Completing Tasks:

  • My team collectively had trouble understanding which dataset and meta dataset we should be using, as we all filtered our data differently in the previous module. We ended up using the meta dataset that Shreya, Aditi, and I had modified in module 4.

Module 6 Overview:

Technical Area:

  • Survival Analysis using GEPIA database

Tools:

  • Gene Expression Profiling Interactive Analysis database

Soft Skills:

  • Virtual Collaboration - Arian, Ivan and I live in different time zones and did not have a time when all three of us could meet up, so we divvied up tasks and communicated our findings via Slack
  • Teamwork - We each chose different genes to analyze with lung cancers (LUSC and LUAD).
  • Time Management - Although this week’s task was simpler, I made sure to partition my time so that I completed my deliverables in time for our weekly meeting.
  • Presentation - I presented my findings on NCKAP5 to the BI-Team 1 at our weekly presentation meeting.

Achievement Highlights (3):

  • I familiarized myself with the GEPIA database: I played around with Survival Analysis, Single Gene Analysis, and Multiple Gene Analysis; selected several different genes from our constructed file of top DEG; and chose various cancer types to look into with those genes.
  • I read several papers that used the GEPIA survival analysis plots in their research so that I could learn how to understand and analyze a survival analysis plot.
  • I chose to analyze the impact of NCKAP5 on lung cancer and, despite not finding much literature on NCKAP5, I was able to determine that positive expression of the gene is correlated with a higher survival rate for those with lung cancer.
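As an aid to reading survival plots, the sketch below computes a Kaplan-Meier survival curve by hand in base R. The follow-up times, event flags, and the "high" and "low" expression groups are invented for illustration and are not GEPIA data; `km_estimate` is a hypothetical helper name.

```r
# Kaplan-Meier estimate: at each event time, survival is multiplied by
# (1 - deaths / number still at risk). Censored patients leave the risk set
# without causing a drop in the curve.
km_estimate <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  out <- data.frame(time = event_times, at_risk = NA_integer_,
                    deaths = NA_integer_, survival = NA_real_)
  surv <- 1
  for (i in seq_along(event_times)) {
    t <- event_times[i]
    n <- sum(time >= t)               # patients still under observation
    d <- sum(time == t & event == 1)  # deaths at this time
    surv <- surv * (1 - d / n)
    out$at_risk[i]  <- n
    out$deaths[i]   <- d
    out$survival[i] <- surv
  }
  out
}

# Hypothetical follow-up times (months) and event flags (1 = death, 0 = censored).
high <- km_estimate(time = c(6, 12, 18, 24, 30), event = c(0, 1, 0, 1, 0))
low  <- km_estimate(time = c(2, 3, 3, 5, 8),     event = c(1, 1, 0, 1, 0))
print(high)
print(low)
```

Each step down in a survival plot corresponds to one row of this table, and a curve that stays higher for longer indicates better survival in that group.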

Difficulties Completing Tasks:

  • Due to time zone differences and no overlap in schedule, it was difficult to meet up with Arian and Ivan to work on deliverables. Despite this, we were each able to successfully complete survival analyses on our individually chosen genes.

Module 7 Overview (Weeks 7-8):

Technical Area:

  • Performed the following on GEO dataset GSE4107:
    • Statistical analysis
    • Quality control report analysis
    • DGE analysis

Tools:

  • R/RStudio
  • GitHub
  • R/Bioconductor packages: affy-packages, ggplot2, pheatmap, limma, EnhancedVolcano, clusterProfiler, enrichplot, msigdbr
  • GSEA database
  • GEO/GEOquery
  • DAVID, STRING-DB, GEPIA

Soft Skills:

  • Project Management/Task Management - During our meetings, I recorded what tasks we had to finish and assigned various members to different tasks.
  • Time Management - We had a little over a week to perform analyses on GSE4107 samples, so I managed my time by dividing up my jobs and scheduling time for myself to work on specific tasks (ex: literature review, plot analysis, code review, etc).
  • Teamwork - I worked with my team members @ivanlam27 @veyssi @Ananya_Kaushik @Roman_Ramirez @Leila to analyze GSE4107 for genes significant to colorectal cancer.
  • Virtual Collaboration - My teammates and I met over Zoom calls, shared ideas and papers over Slack, and worked together on a Google Document report and our Google Slides final presentation. We shared/collaborated on our code via GitHub.
  • Literature review - I looked over multiple papers to research any correlations between the FOS gene and colorectal cancer.
  • Presentation - My Capstone project team and I presented a full report of our analysis of GSE4107, including the genes of interest we identified in relation to colorectal cancer, to mentors Anya and Ali.

Achievement Highlights (3):

  • I am very proud of my teammates and me for completing a review/report on the significance of the GSE4107 samples to colorectal cancer within a week. We worked together to create data visualization plots, analyze our output, and draw significant conclusions.
  • During my literature review of FOS, I found an interesting paper describing two polymorphisms that enhanced expression of the FOS gene, leading to cell differentiation/tumor formation and a higher risk of colorectal cancer (Chen et al. 2019).
  • I have a really good understanding of reading data analysis plots: Normalization boxplots, PCA plots, Heatmaps, and Volcano Plots.

Difficulties Completing Tasks:

  • Working around time zones was difficult, as this was a highly collaborative project and our team was spread across 4 different time zones.
  • Without the structure of the modules, a difficulty I encountered was figuring out what to do for my final project/presentation. Luckily, I had a great team working with me and they helped me find the motivation and urgency to complete my tasks for our capstone project.