Aman Burji - Bioinformatics - Self Assessment

Week: July 20

Overview of Things Learned:

Technical Area:

How GEO databases are made and the built-in analysis tools they offer

Tools:

affy, simpleaffy, arrayQualityMetrics, sva, ggplot2, pheatmap packages

Soft Skills:

Setting up group meetings

Discussing and troubleshooting difficulties

Achievement Highlights

1) Noted that the GSM215109.CEL file was of lower quality than the rest; however, it was not different enough that I had to remove it

2) Set up the group meeting and got everyone acquainted with one another

3) Resolved the errors I had with RStudio and the older version of R, so that I was set up for the rest of the internship

Meetings attended

July 20 Team Meeting

July 24 BI Happy Hour

Goals for the Upcoming Week

  1. Understand the purpose of quality control
  2. Learn how to use normalized data to find differentially expressed genes
  3. Understand the purpose of the new packages

Tasks Done

  1. Quality control and batch correction
  2. Analysis of our quality control

Week: July 27

Overview of Things Learned:

Technical Area:

The purpose of annotations and how genomic data can be analyzed using existing packages

Creating metadata from the GEO databases, since all data has to be submitted along with certain tags

Finding the important conclusions to present

Tools:

hgu133plus2.db, limma, EnhancedVolcano, pheatmap

Soft Skills:

Setting up group meetings

Troubleshooting

Achievement Highlights

1) Reviewed the Bioconductor package hgu133plus2.db and understood its use in annotating the chip’s identifiers and accessions in order to format the normalized gene list.

2) Through office hours, found several ways to remove duplicates based on specified columns of the data frame, which made filtering the extra genes out of my gene list more efficient

3) My team interpreted the data from the volcano plots and the heat maps to pick out four genes for us to analyze and relate to colorectal cancer
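
The deduplication described in item 2 can be sketched as follows. This is an illustrative Python sketch, not the R code used during the internship: the column names (`probe`, `symbol`, `adj_p`), the example rows, and the rule of keeping the probe with the smallest adjusted p-value per gene are all assumptions made for the example.

```python
def dedupe_by_gene(rows, key="symbol", score="adj_p"):
    """Keep one row per gene: the row with the smallest adjusted p-value.

    Rows whose gene symbol is missing (None) are dropped, since they
    cannot be grouped with anything.
    """
    best = {}
    for row in rows:
        gene = row[key]
        if gene is None:
            continue
        # Replace the stored row only if this probe has a smaller score.
        if gene not in best or row[score] < best[gene][score]:
            best[gene] = row
    return list(best.values())


# Hypothetical annotated results: two probes map to GENE_A, one is unannotated.
rows = [
    {"probe": "p1", "symbol": "GENE_A", "adj_p": 0.04},
    {"probe": "p2", "symbol": "GENE_A", "adj_p": 0.001},
    {"probe": "p3", "symbol": "GENE_B", "adj_p": 0.20},
    {"probe": "p4", "symbol": None, "adj_p": 0.01},
]
deduped = dedupe_by_gene(rows)
```

Other rules, such as keeping the probe with the highest mean expression, would work the same way with a different `score` column and comparison.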

Meetings attended

July 28 Team Presentation

July 29 GitHub Webinar

July 30 Office Hours

Goals for the Upcoming Week

  1. Learn about Gene Ontology and how genes can be classified and grouped based on function, location and the relationship between the genes
  2. Set up dates to discuss which DEGs we will be further analyzing
  3. Understand how genes relate to each other in a cascading pathway

Tasks Done

  1. Annotated the normalized and batch-corrected genes to make further analysis easier
  2. Ran a limma analysis of the batch-corrected and normalized data to find the differentially expressed genes, and saved the results as CSV and RData files
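
The filtering step behind a DEG list can be sketched as follows. limma's `topTable` reports adjusted p-values, with Benjamini-Hochberg as its default adjustment; the Python below is only a sketch of that adjustment plus a threshold filter, not the R code from this week. The p-values, log fold changes, and the cutoffs (adjusted p < 0.05, |logFC| >= 1) are made-up assumptions.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # capped by the adjusted values of all larger p-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene results.
pvalues = [0.01, 0.04, 0.03, 0.20]
logfc = [2.1, -1.5, 0.3, 3.0]

adj = bh_adjust(pvalues)
# Indices of genes passing both assumed cutoffs.
deg_index = [i for i in range(len(adj)) if adj[i] < 0.05 and abs(logfc[i]) >= 1]
```

Filtering on the raw p-value instead of the adjusted one would let more genes through, which is one way a DEG list can end up filtered incorrectly.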

Week: August 3

Overview of Things Learned:

Technical Area:

Gene ontology after functional enrichment of DEG data

KEGG analysis to find the high level functions of the genes and the biological system they belong to

Building a gene-concept network to see the multitude of genes affected and then designating hub genes through the STRING database

GSEA to see which hallmark gene sets are overexpressed or underexpressed and where they’re located

Tools:

org.Hs.eg.db, topGO, clusterProfiler, magrittr, tidyr, msigdbr, enrichplot

Soft Skills:

Setting up group meetings

Troubleshooting, and analyzing results as a team

Achievement Highlights

1) It was difficult to understand all the packages involved this week, but I was able to finish the code after office hours

2) Spent a lot of time on it, but came to properly understand what the KEGG, enrichGO, groupGO, and survival plots were showing. Had a decent understanding of GSEA and explained it to my team.

3) Helped host happy hour, and introduced a game that everyone seemed to enjoy

Meetings attended

August 3 Team Meeting

August 4 Team Presentation

August 6 Office Hours

August 7 BI Happy Hour

Goals for the Upcoming Week

  1. Understand how to analyze GSEA plots and their purpose
  2. Review the documentation of the packages learned during this internship
  3. Start reviewing the datasets for the final project

Tasks Done

  1. Functional enrichment and analysis of DEGs: enrichGO, groupGO, KEGG analysis, locating hub genes, gene-concept plots, GSEA plots using hallmark gene sets, and survival plots

QC_DV Presentation.pdf (926.3 KB)

This was a difficult week for me, as I had not been a part of the first two weeks of the bioinformatics pathway and R/RStudio hadn’t been properly installed on my laptop. I engaged with the team lead and asked Yves for help with installing packages, and was then able to get my code to run properly. I connected with my team through the StemAway site at first, and then we moved to Zoom calls and texting.
I presented the slides on quality control and the PCA plots.

Coding:
The difficulties I faced with coding were mainly with loading the packages. For some reason, many of the packages that were supposed to come with the installation of R and RStudio did not, so the Bioconductor packages I was attempting to load failed, and the team leads and I weren’t able to figure out why. Updating my OS and reinstalling R did not work at first, but with some help from Yves I was able to install the packages. Batch correction was the most difficult portion of the code because I had to read through the documentation and understand the arguments clearly to know what I needed to pass into the function.

Group 1 - DGE Analysis.pdf (429.0 KB)
This week was easier to code, and I was able to understand the purpose of the majority of the functions used to plot and label the differentially expressed genes.
I presented a portion of the hypothesis along with the explanation of how the limma package worked to separate out the differentially expressed genes.

Coding:
In the beginning, I was a little confused about how to do the annotations because I wasn’t familiar with the hgu133plus2.db package. But once I got used to the formatting, the remainder of the code was easy. Understanding the purpose behind contrastMatrix() was a little difficult, but I used office hours to clear up my confusion.

Functional Analysis.pdf (1.1 MB)
This was the final week and had the most analysis to be done. Functional analysis was as much about learning the documentation of the multiple packages as it was about writing the code. I had a few problems since I found that my differentially expressed genes hadn’t been filtered properly, so I had to redo some of the code from the previous weeks to ensure I had better quality DEGs.
In the presentation, I explained the enrichGO, GSEA, and survival plots. These took time to research and learn how to interpret.

Code:
I did not have any significant difficulties while completing this week’s code, but I used office hours to check that I was using the functions appropriately. There I noticed that my DEGs had been filtered incorrectly the previous week, and I had to correct that.

BioinformaticsFinal - ILD_Normal.pdf (571.2 KB)
For this final project, I analyzed candidate genes for interstitial lung diseases using data from the GEO database. To do this, I followed steps very similar to those in the colorectal cancer analysis with my group. I first ran quality control on the data, using affyQCReport to generate the QC report and PCA to look for possible outliers. I also created boxplots and histograms to view the data after normalization, which was done using the rma() function. I tried gcrma() as well, but through some research found that rma() would yield better results for my dataset.
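
One of the steps inside rma() is quantile normalization, which forces every array to share the same distribution of intensities. A minimal Python sketch of that one step is below; it is an illustration only, not the R implementation, it ignores ties and the background-correction and summarization steps, and the intensity values are made up.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length arrays (one list per array, no tie handling)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean intensity at each rank across all arrays.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices from lowest to highest
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]  # each value is replaced by the mean for its rank
        result.append(out)
    return result


# Two hypothetical arrays with three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After this step the boxplots of the arrays line up, which is what the post-normalization boxplots and histograms are meant to confirm.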

Then I used the limma package along with the hgu133plus2.db package to annotate the data and find the differentially expressed genes. Finally, I ran enrichGO and groupGO, located hub genes, ran a KEGG analysis, and made a GSEA plot. I had some difficulties because the small sample size of this dataset led to a low number of DEGs, so I had to relax my p-value cutoff: the enrichGO and KEGG analyses were not working with the 140 genes that passed the 0.05 p-value cutoff.
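
Over-representation tools such as enrichGO are built on a hypergeometric test, so a short gene list gives each term few chances to overlap and it is harder for any term to reach significance. The Python sketch below illustrates the test itself; it is not the clusterProfiler code, and the function name and all the numbers (background size, term size, list sizes, overlaps) are made-up assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n belong to the term."""
    total = comb(M, N)
    upper = min(n, N)
    # Sum the probability of every overlap size from k up to the maximum possible.
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


# Toy check that can be verified by hand: 10 genes, 4 in the term, 3 drawn.
p_toy = hypergeom_upper_tail(2, 10, 4, 3)

# Same background (1000 genes, 50 in the term), overlap at twice the expected count.
p_small_list = hypergeom_upper_tail(2, 1000, 50, 20)    # 20 DEGs, 2 in the term
p_large_list = hypergeom_upper_tail(10, 1000, 50, 100)  # 100 DEGs, 10 in the term
```

With the same twofold enrichment, the longer list gives the smaller p-value, which is consistent with a larger DEG list making the enrichment analysis workable.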

Code:
The code was similar to my previous group work. I had to create my own metadata, but that proved to be simple once I decided which factor I wanted to analyze with my data.