EunahYang - Bioinformatics Pathway

Week 3 [July 20th - July 26th]

concise overview of things learned:
technical area

  • Learned what batch correction is and how to build a model matrix for it
  • Understanding various quality control/normalization techniques
  • Comparing the data before and after each step to judge the effectiveness of normalization and batch correction
  • Labeling and plotting PCA plots, box plots, and heatmaps
  • Converting an object to the required class and passing it to the appropriate function
  • How to deal with memory issues when loading/processing datasets

tools

  • R, RStudio
  • Stemaway forum
  • GitHub/Stack Overflow/Google: to get help

soft skills

  • Communicated with my team via the Stemaway forum to solve problems
  • Finished deliverables a day before the deadline and compared the results with teammates

three achievements:

  • Learned how a model matrix works and what format it should have, and built one with help from Anya
  • Successfully got in touch with my teammates and corrected my heatmap with Xuewen's help - plenty of communication!
  • Answered one question on the Stemaway forum

list of meetings attended including team events:

  • 7/20 team meeting
  • 7/23 Office Hour
    I wanted to join the happy hour, but it was very late for me, and I was too tired to stay up until 5 am. I will try to be at the happy hour this week.

goals for upcoming week:

  • Get my tasks done before office hours and use OH to get more help
  • Communicate with teammates on Slack and get to know each other more!
  • Try to attend the happy hour!
  • Start new deliverables earlier than last week
  • Use Slack as a communication tool

detailed statement of tasks done:

It was my first time programming with raw data, and my first time using R as a technical tool. So doing the deliverables by myself was pretty challenging, but I was glad to finish all four. I used simpleaffy and affyPLM for quality control, and gcrma for normalization. I struggled to make my plots look like the example ones. My biggest obstacle this week was batch correction. I understood the concept, but had no idea how to work with the functions and the model matrix. I searched GitHub, Stack Overflow, and Google, and used the Stemaway troubleshooting posts. I learned that ComBat takes a single batch covariate, and that it should be passed as a vector, not a data frame. So I used ~factor and c(,) to build the model matrix I needed. This was my biggest achievement this week. I plotted the data with PCA and heatmaps. I made tons of mistakes and hit many errors doing this week’s deliverables, but I am happy to have finished them with many people’s help.
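
To illustrate what a model matrix built from a factor looks like, here is a minimal sketch in plain Python (not the R code used for the deliverable). It mimics the treatment coding that a formula like ~factor(x) leads to in R: an intercept column of ones, plus one 0/1 column for every level except the reference level. The function name design_matrix and the sample labels are made up for illustration.

```python
def design_matrix(labels):
    """Return (column_names, rows) for an intercept plus 0/1 indicator columns."""
    levels = sorted(set(labels))   # assumption: levels ordered by sorting, first one is the reference
    others = levels[1:]            # every non-reference level gets its own column
    columns = ["(Intercept)"] + others
    rows = [[1] + [1 if label == level else 0 for level in others] for label in labels]
    return columns, rows


# Hypothetical group labels for four samples.
groups = ["cancer", "normal", "cancer", "normal"]
columns, rows = design_matrix(groups)

print(columns)
for row in rows:
    print(row)
```

Each sample becomes one row, so the matrix has as many rows as there are samples and one fewer indicator column than there are groups.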

Week 4 [July 27th – Aug 2nd]
// This post will be revised after the tag suggestions post for BI pathway is uploaded //

Overview of things I learned:

Technical area:

  • Understand the concept of gene filtering
  • Know how limma works for differential gene expression analysis

Tools:

  • R, RStudio, Slack, Stemaway forum

Soft skills:

  • Frank conversation about my personal situation, and staying in the loop with the work within my given circumstances
  • Discussing how to interpret the volcano plot and figuring out which genes are upregulated in the colorectal cancer group
  • Forming a hypothesis after analyzing the data plots and reading papers related to our datasets

Achievement highlights:

  • First time using Slack for real-time team communications

  • Formed a hypothesis from the analyzed data and papers
    CEMIP and FOXQ1 were upregulated in the colorectal cancer (CRC) group, but the role of those genes in CRC cells was unknown. I read the original paper our datasets come from, and searched for additional papers related to those gene loci. Then I set the hypothesis that CEMIP and FOXQ1 have properties common to tumorigenesis and malignancy.

  • Discussed cultural competence in the Micro-Away presentation
    My personal experience of cultural diversity

List of meetings attended including team events:

  • Jul/27/2020 team meeting
  • Jul/28/2020 team presentation
  • Jul/29/2020: BI GitHub Webinar

Goals for upcoming week:

  • Set up a suitable and stable workspace for coding (I’ve struggled to code because of laptop hardware problems)
  • Review the week 4 deliverables (Annotation, Gene filtering, Limma)
  • Understand precisely which variables I am using and what they look like
  • Interpret the data for my own analysis
  • Attend the happy hour

Detailed statements of tasks done:
This week my laptop died and RStudio crashed, so I was not able to go through the deliverables. But I tried to keep up with the conversations, the analysis, and the presentation preparation. We discussed how to interpret the resulting plots and figured out which genes are differentially expressed in cancer cells. For our team, the significantly upregulated genes in CRC cells were CEMIP and FOXQ1 (with gcrma normalization), and we set those as targets for further analysis. We hypothesized that CEMIP and FOXQ1 play a key role in tumorigenesis, or are at least useful as markers of it. I searched for CEMIP and FOXQ1 in journals, but there were few details about the mechanisms of those genes. So we concluded that those genes may promote angiogenesis, take part in anti-apoptotic signaling, be co-upregulated with cell proliferation markers, and be related to changes in cell morphology.

Week 5 [Aug 3rd - Aug 9th]

Overview of things I learned:
Technical area:

  • Functional enrichment analysis: Gene ontology analysis, Kyoto Encyclopedia of Genes and Genomes analysis, Gene concept network, STRING DB
  • Global/universal gene set enrichment analysis: MSigDB_GSEA (Hallmark gene sets)
  • Transcription factor analysis: MSigDB_GSEA (C3 gene sets)
  • Survival analysis
  • Plot with cnetplot, gseaplot2

Tools:

  • R, RStudio, GitHub, Slack, Stemaway forum

Soft skills:

  • Team discussion on interpreting the cnetplot and figuring out which genes to dig into further
  • Comparing my plots with my teammates' and seeing what causes the differences

Achievement highlights:

  • Attended office hours with a question about plot interpretation, not the coding itself
  • Reviewed the week 4 deliverables (Annotation, Gene filtering, Limma) and finished this week’s deliverables
  • Understood the differences between the functional analyses and interpreted the plots

List of meetings attended including team events:

  • Aug/3/2020: Micro-Away (Team 2) Meeting
  • Aug/3/2020: Python Webinar for Beginners #1
  • Aug/4/2020: GitHub Webinar
  • Aug/4/2020: Micro-Away (Team 2) Presentation
  • Aug/5/2020: Deliverables Webinar with Q&A

Goals for upcoming week:

  • Finish my project with new gene sets, and analyze them
  • Name variables with understandable terms so the code is accessible to others
  • Upload my data to GitHub as I planned
  • Understand the GSEA plot
  • Do survival plotting for the gene of interest

Detailed statements of tasks done:

  1. Gene ontology analysis
    Converted my limma data into a gene vector annotated with ENTREZID.
    Set a fold change threshold and removed genes with only minor expression changes in the cancer group compared to control.
    I misunderstood at first: I made a data frame with logFC and ENTREZID, but then figured out that I needed a vector of logFC values named by ENTREZID.
    Plotting the CC, MF, and BP ontologies was repetitive, so I wrote a function to shorten my code.
  2. KEGG analysis
    Used enrichKEGG and plotted the results, and confirmed that my differentially expressed genes are mostly related to cell cycle and proliferation. This fits last week’s hypothesis that differentially expressed genes would play an essential role in tumorigenesis.
  3. Gene-Concept network
    Double-checked the results of the KEGG analysis.
    Learned to use cnetplot in two layouts: circular and non-circular. For me, the non-circular layout made it easier to check the interactions of each gene.
  4. STRING DB
    My gene list was highly aggregated on STRING, and a huge hub gene cluster was found. I went to office hours and learned ways to reduce the number of nodes, but that did not help with the overall node distribution. The highly clustered proteins were related to DNA replication and the cell cycle.
  5. GSEA with hallmark gene set
    Learned that I should use the whole data set, not the filtered one, for GSEA.
    Had a hard time understanding the arguments of the enricher and GSEA functions.
    Need more time to understand the GSEA plot.
  6. Transcription factor analysis with regulatory target gene sets
    Did the transcription factor analysis with the top 5 significant terms.
    Genes on the cnetplot overlapped heavily, meaning they share similar non-coding regions (mostly transcription factor binding sites).
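
The data-frame-versus-named-vector point from item 1, and the whole-versus-filtered point from item 5, can be sketched in plain Python (not the R code used for the deliverable). A dict keyed by ENTREZID stands in for an R named vector. The IDs, logFC values, threshold, and the function name named_logfc are all invented for illustration.

```python
# Hypothetical limma-style rows: (ENTREZID, logFC). IDs and values are invented.
limma_rows = [("1001", 2.4), ("1002", -0.3), ("1003", -1.8), ("1004", 0.9)]


def named_logfc(rows, min_abs_logfc=0.0):
    """Map ENTREZID -> logFC, keep |logFC| >= threshold, order by decreasing logFC."""
    kept = {gene_id: lfc for gene_id, lfc in rows if abs(lfc) >= min_abs_logfc}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Whole ranked list: every gene kept, as for the GSEA step.
all_genes = named_logfc(limma_rows)

# Filtered list: only genes passing the fold change threshold, as for the GO step.
filtered = named_logfc(limma_rows, min_abs_logfc=1.0)

print(list(all_genes))
print(filtered)
```

The same helper serves both cases, which is also the idea behind wrapping the repeated CC/MF/BP plotting in one function.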

Week 6 [Aug 10th - Aug 16th]

Overview of things I learned:

Technical area:

  • Learn which gene to put in GEPIA for survival analysis
  • Analyze the GSEA plot

Tools:

  • R, RStudio, GitHub, Slack, Stemaway forum, Biostars

Soft skills:

  • Presenting with an interpretation of the data
  • Communicate with my teammates

Achievement highlights:

  • Understood the differences between the functional analyses and interpreted the plots
  • Made metadata with a new gene set
  • Interpreted the whole dataset and connected it to the biological background

List of meetings attended including team events:

  • Aug/10/2020: Micro-Away (Team 2) Meeting
  • Aug/11/2020: Office Hours: Functional Analysis
  • Aug/13/2020: Office Hours
  • Aug/14/2020: Micro-Away Presentation Meeting

Goals for upcoming week:

  • Finish my project with new gene sets, and analyze them
  • Upload my data to GitHub
  • Do the final project presentation

Detailed statements of tasks done:

  1. Troubleshooting during data collection
    My teammates and I used exactly the same dataset and a similar procedure for processing the data. But our plots were slightly different, so we compared our code and found that we had used different functions to remove duplicates. I used !duplicated while others used collapseRows. !duplicated keeps the first probe ID among those sharing the same name, while collapseRows selects the row with the highest mean. That single line of code significantly changed our plots.

  2. Admit that my hypothesis can be wrong
    Our team selected FOXQ1 and CEMIP as the top differentially expressed genes by looking at the volcano plot. But I overlooked that a volcano plot doesn’t consider the function of each gene. The genes may have a high fold change and high reliability, but this doesn’t mean they also have a high expression level. It is possible that these two genes are not that important in the tumorigenesis process.
    Several genes have fold changes and p values similar to those of FOXQ1 and CEMIP, though those values were slightly lower. It turns out that some of those genes have a much higher expression level and play a more important role in vivo.
    We had hypothesized that FOXQ1 and CEMIP may play a key role in tumorigenesis, but that proved not to be the case.

  3. Start the personal project and made metadata
    This week I started a new project on breast cancer. Before that, I had used metadata that the team lead provided. I looked at the ‘series matrix files’ in GEO and learned how to make metadata.
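
The duplicate-handling difference described in item 1 can be sketched in plain Python (not the R code we actually compared). One function keeps the first probe seen for each gene, as filtering with !duplicated does. The other keeps the probe with the highest mean expression, which is the behavior described above for collapseRows. The probe IDs, gene names, and values are invented for illustration.

```python
from statistics import mean

# Hypothetical probe rows: (probe_id, gene_symbol, expression values per sample).
probes = [
    ("p1", "GENE_A", [2.0, 2.2, 1.8]),
    ("p2", "GENE_A", [7.9, 8.1, 8.0]),
    ("p3", "GENE_B", [5.0, 5.5, 4.5]),
]


def keep_first(rows):
    """Keep the first probe encountered for each gene."""
    kept = {}
    for probe_id, gene, _values in rows:
        kept.setdefault(gene, probe_id)  # later probes for the same gene are ignored
    return kept


def keep_max_mean(rows):
    """Keep the probe with the highest mean expression for each gene."""
    best = {}
    for probe_id, gene, values in rows:
        probe_mean = mean(values)
        if gene not in best or probe_mean > best[gene][1]:
            best[gene] = (probe_id, probe_mean)
    return {gene: probe_id for gene, (probe_id, _mean) in best.items()}


print(keep_first(probes))
print(keep_max_mean(probes))
```

For GENE_A the two rules pick different probes, so every downstream value for that gene differs, which is how one line of code can change the plots.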

Week 7 [Aug 17th - Aug 23rd]

Overview of things I learned:

Technical area:

  • Create metadata from raw dataset
  • Quality control with affyQCReport
  • Normalization with gcrma package
  • Annotation with hgu133plus2.db
  • Differential gene expression analysis with limma
  • Using Gene Ontology, KEGG analysis, and STRING DB to study the function of a particular set of genes

Tools:

  • R, RStudio, GitHub, Slack, Stemaway forum, Biostars

Soft skills:

  • Public presentation of my analysis, combined with biological information

Achievement highlights:

  • Analyzed a new dataset from start to end on my own
  • Separated my dataset into two and analyzed each part to obtain meaningful results
  • Presented my project in public, covering my analysis and some biological background

List of meetings attended including team events:

  • Aug/17/2020: Office Hour
  • Aug/17/2020: Micro-Away Team Meeting
  • Aug/18/2020: Webinar: How to Make a Professional Presentation
  • Aug/21/2020: [BI] Final Presentations

Detailed statements of tasks done:

  1. Start the personal project and made metadata
    This week I started a new project on breast cancer. Before that, I had used metadata that the team lead provided. I looked at the ‘series matrix files’ in GEO and learned how to make metadata. This metadata includes the cancer batch (early recurrence cancer, intermediate recurrence cancer, late recurrence cancer).
  2. Read several papers related to estrogen positive breast cancer
    To understand what the data actually means, I needed to read several papers related to my dataset. I found out that there are several subtypes of breast cancer, and that the estrogen positive group is susceptible to cancer recurrence. The reference papers are included in the presentation file.
  3. Normalization & batch correction
    I normalized my data with the gcrma package. At first, I also batch corrected my data. I used a single dataset, so batch correction was not necessary, but I did it because I thought it might help the data cluster. Instead I got meaningless results (every p value for the differential genes was 1). I attended office hours, searched Biostars, and asked in the forum, but couldn’t find out why this happened at first. I learned that batch correction may overcorrect the data when the correction is not necessary. It would be helpful to figure out the mechanism of batch correction (the ComBat function) and how it differs from the normalization procedure.
  4. Limma analysis
    I used limma to get the list of differentially expressed genes. But my data was all cancer data from the same cancer group, so gene expression did not differ much, and I lowered the threshold. I plotted a heatmap and a volcano plot, and most of the genes were expressed at almost the same level. A few genes were differentially expressed, but the difference was not as significant as in the previous project. Still, I continued with the functional analysis to figure out what happens in recurrent cancer at distinct recurrence time points.
  5. Functional analysis
    I got two lists of differentially expressed genes: early recurrence breast cancer versus intermediate recurrence breast cancer, and intermediate recurrence breast cancer versus late recurrence breast cancer. I ran the same analysis on both (GO, KEGG, extracting hub genes from STRING DB). I also did a survival analysis of ESR1, which is differentially expressed in both lists.
  6. Final project presentation in public
    finalproj_EunahYang.pdf (2.3 MB)
    This is my final project file, and I presented for about 15 min.