Sneha_Raj - Bioinformatics (Level 2) Pathway

Sneha_Raj · July 9, 2021, 12:16am

Module 1 Self-Assessment

Technical Area

Installed RStudio and additional packages
Learned R syntax, functions, and arguments
Reviewed reading academic literature
Learned to plot data using ggplot2

Tools

RStudio
ggplot2
R Console
Bioconductor

Soft Skills

Adaptive: Although I had no prior experience with R, I was able to use my existing understanding of Java to adapt and find commonalities while coding. I went through provided resources and worked through the step-by-step guides in order to grow confident with the R environment.
Resourcefulness: In addition to the resources linked in this unit, I consulted web resources to expand my understanding of RStudio and the R language in general. Wherever I was stuck, I spent time on multiple websites in order to find explanations that best suited my learning needs.

Achievement highlights

Learned syntax, functions, and arguments in R
Installed the ggplot2 package
Plotted data using ggplot2

Tasks completed

While I am familiar with programming in Java, I have little experience with R. After reading through several web sources, I downloaded R, RStudio, and some of the recommended packages. I also updated some of the packages.
I then used the syntax guide and consulted other websites to try out a variety of data structures in RStudio, including vectors, matrices, data frames, factors, and lists.
After studying the syntax and the uses of the different data structures, I tried a few mathematical functions by passing them arguments. I also familiarized myself with packages and with how to load libraries such as ggplot2.
Next, I worked through the data wrangling document. I located the mouse_exp_design csv and assigned it to a variable. I then learned how to manipulate vectors, save the data from the csv to a new file, create plots in different formats, and plot using the ggplot2 package.
Finally, I read the Guo paper and watched the videos regarding bioinformatics. This was an easier task because I am familiar with reading the academic language in such research papers.

Sneha_Raj · July 9, 2021, 12:19am

Module 2 Self-Assessment

Technical Area

Using GEO to search for & download data
Importing metadata and gene expression data to RStudio
Basics of R Shiny apps

Tools

RStudio
R Shiny package
https://mastering-shiny.org/textbook
GEO database
GEOquery package
Bioconductor documentation
Limma package

Soft Skills

Perseverance: Much of the material was new to me, and some of the resources were a little beyond my understanding. I did my best to go step by step, covering the basics first with videos on StemAway and online, and then working up to the more difficult instructions and resources.
Team collaboration: During team meetings I kept my camera on and provided visual and verbal cues to indicate that I was in agreement/on the same page as my teammates. I made sure to go over the material ahead of time and bring questions to our meeting with our mentor, Anya.
Resourcefulness: When in doubt, I used the pathway hubs, Google Drive, Slack, other topics, and even Google to try to find a solution on my own. If I was still confused, I wrote down the question and brought it to team meetings.

Achievement highlights

Understanding the components of an R Shiny app
Being able to import metadata and gene expression data into RStudio
Exploring other methods of importing data with alternate packages

Tasks completed

I read, understood, and summarized the Long et al. 2019 paper (the paper whose results we aim to replicate during this internship).
I watched and took notes on the “From Transcriptomics to Therapeutics” presentation by Ayush Noori where I learned some bioinformatics background on microarray vs. RNA-seq data.
In order to communicate with team leads and teammates, I joined the Slack group, introduced myself on the general channel, filled out the form to be placed in subgroups, and started attending subgroup and team-wide meetings.
I used a variety of web and StemAway-provided resources to learn the basics of creating an R Shiny app.
Now that I was more familiar with transcriptomics and the GEO database, I started working on my R code using the instructions provided by the module. Instead of the data in the module, I made sure to use the data from the new paper (GSE19804).
I was receiving an error that my files were not valid, so I reached out to our mentor, Anya, on the pathway hubs with screenshots of my code and she was able to help me.
Finally, I tried other packages used for data import, including Oligo and DESeq2. I also looked online for the scenarios in which a different package would be the better choice.

Sneha_Raj · July 22, 2021, 10:27pm

Module 3 Self-Assessment

Technical Area

Generate Quality Control reports using various Bioconductor packages
Export and import csv files into RStudio
Create various forms of visualizations for raw and normalized data

Tools

RStudio
Bioconductor package documentation
QC plot with simpleaffy https://bioconductor.statistik.tu-dortmund.de/packages/3.6/bioc/vignettes/simpleaffy/inst/doc/simpleAffy.pdf
report using arrayQualityMetrics https://www.bioconductor.org/packages/release/bioc/vignettes/arrayQualityMetrics/inst/doc/arrayQualityMetrics.pdf
RLE/NUSE boxplots https://www.bioconductor.org/packages/release/bioc/vignettes/affyPLM/inst/doc/QualityAssess.pdf
rma() background correction and normalization
boxplots
PCA
heat maps

Soft Skills

Problem-solving: When I encountered errors in my code, I made use of past contributions on the StemAway channels and Googled the error messages in order to figure out which line of my code had gone wrong and fix it.
Communication: I made sure to ask my questions as they arose. I brought to attention that our team did not yet have a designated GitHub repository and our leads were able to create one. During team meetings, I notified my leads and other teammates of my personal progress on individual tasks.

Achievement highlights

Take imported raw data and apply normalization methods
Create visualizations in order to compare raw and normalized data
Troubleshoot errors thrown by the code

Tasks completed

I did a more in-depth reading of Long et al. (2019) and watched session 1’s module 2 kickoff meeting where Mr. Ali Nehme explained some of the components of the paper.
I used the simpleaffy, arrayQualityMetrics, and affyQCReport packages to conduct quality control and used affyPLM to generate boxplots of the data.
I normalized the data using rma() and exported the result as a new csv file. I then imported this csv file so that I would not have to perform normalization every time I reran my code.
When accessing the normalized data, I was originally using the exprs() method, which caused memory errors when generating visualizations. I spoke with our mentor, Anya, who helped me recognize when to use exprs() and when not to in order to standardize the data.
I created boxplots, PCA plots, and heat maps in order to visualize the data and compare the raw versus normalized data. Using the PCA plots, I was able to recognize some outlier points.
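To illustrate why boxplots of raw and normalized data look so different, here is a minimal sketch of quantile normalization, one of the steps that RMA performs alongside background correction and probe summarization. It is written in plain Python purely to show the logic (the analysis above was done in R with rma()). The function name, the toy values, and the positional tie-breaking are assumptions for illustration and are not taken from the module.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize a list of equally long sample columns.

    Each value is replaced by the mean of the values that share its rank
    across all samples. Ties are broken by position (a simplification).
    """
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_rows)]

    normalized = []
    for col in columns:
        # Row indices ordered from the smallest to the largest value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, row in enumerate(order):
            out[row] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three samples (columns) measured on four probes (rows).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
for column in normalized:
    print([round(v, 3) for v in column])
```

After this step every sample shares the same set of values, which is why the boxes line up in a boxplot of normalized data.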

Sneha_Raj · July 30, 2021, 3:34am

Module 4 Self-Assessment

Technical Area

Identify and remove outliers
Annotate data and remove NAs and duplicate ProbeIDs and SYMBOLs
Filter genes which will provide good quality results
Use limma to find DEGs and p-values
Create visualizations of the DEGs

Tools

RStudio
Identify & remove outliers by calculating IAC: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/HumanBrainTranscriptome/Identification%20and%20Removal%20of%20Outlier%20Samples.pdf
limma package
volcano plots: 19.11 Volcano plots | Introduction to R
heat maps
hgu133plus2.db

Soft Skills

Time-management: While working on the pathway modules, I was also working on coding components for the R Shiny app as a part of Group A. With so many tasks and deadlines, it was important for me to allot a fixed amount of time to each task and finish within it. I organized these time slots using my Google Calendar.
Communication: During this module, I encountered several issues while trying to create the volcano plot and heat map visualizations. By providing my mentor with my code and screenshots of the plots, I was able to clearly communicate my issue and receive feedback.

Achievement highlights

Identified and removed 7 outliers from the microarray and metadata
Filtered genes for quality samples to be used in assessment
Created visualizations of DEGs using volcano plots and heat maps
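The outlier removal mentioned above relies on inter-array correlation (IAC). As a rough sketch of the idea, the Python below computes each sample's mean Pearson correlation with every other sample and flags samples whose mean IAC lies more than a chosen number of standard deviations below the average. It is an illustration only (the analysis itself was done in R); the function names, the toy data, and the cutoff of -2 standard deviations are assumptions, not the values used in this module.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def iac_outliers(samples, sd_cutoff=-2.0):
    """Return (outlier names, z-scores) based on each sample's mean IAC."""
    names = list(samples)
    mean_iac = {
        name: mean(pearson(samples[name], samples[other])
                   for other in names if other != name)
        for name in names
    }
    values = list(mean_iac.values())
    centre, spread = mean(values), stdev(values)
    if spread == 0:
        # All samples agree equally well; nothing can be flagged.
        return [], {name: 0.0 for name in names}
    z = {name: (mean_iac[name] - centre) / spread for name in names}
    return [name for name in names if z[name] < sd_cutoff], z


# Toy data: seven samples with the same expression pattern, one reversed.
base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
samples = {f"s{i}": [v + i for v in base] for i in range(1, 8)}
samples["s8"] = list(reversed(base))

outliers, z_scores = iac_outliers(samples)
print(outliers)
```

In practice the same sample names would also be dropped from the metadata so that the two tables stay aligned.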

Tasks completed

I researched different methods of outlier detection and felt that doing it by inter-array correlation (IAC) was the best method in this case. I found and removed a total of 7 outliers from my metadata and microarray data.
I generated another PCA plot to visualize the data after outlier removal.
I used the in-depth guide to annotate the data and removed NAs as well as duplicate ProbeIDs and SYMBOLs.
I filtered out the genes below the second percentile so that the remaining genes could be used in a quality analysis.
I used the limma package to find differentially expressed genes and generated a csv data table. I also created a report of the significant genes (p < 0.05) using the eBayes function.
Using ggplot, I created and styled a volcano plot so that the chosen thresholds and the names of the genes could be identified. I also tried to use EnhancedVolcano, but soon realized that it was unsupported on my computer.
I created a heat map of the top 50 differentially expressed genes and color-coded it such that cancer data and normal data were easily differentiable. At this point, I encountered some trouble because the row names of my group table did not match the column names of my filtered data, so the color-coding did not show up. I made the appropriate changes in my metadata file and in my code so that the column and row names matched, and the color-coding showed up.
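The thresholds drawn on a volcano plot amount to a simple rule applied to each gene. The sketch below, in Python for illustration only (the plot itself was made in R with ggplot), computes the plotting coordinates and labels each gene as up, down, or not significant. The gene names and numbers are made up, and the log fold-change cutoff of 1 is an assumption; only the p < 0.05 threshold comes from the write-up above.

```python
from math import log10


def classify(log_fc, p_value, p_cutoff=0.05, fc_cutoff=1.0):
    """Label a gene by its position relative to the volcano-plot thresholds."""
    if p_value < p_cutoff and log_fc >= fc_cutoff:
        return "up"
    if p_value < p_cutoff and log_fc <= -fc_cutoff:
        return "down"
    return "not significant"


# Hypothetical (gene, logFC, p-value) rows, as might be read from a results table.
results = [
    ("GENE_A", 2.3, 0.0004),
    ("GENE_B", -1.8, 0.003),
    ("GENE_C", 0.4, 0.01),
    ("GENE_D", 1.5, 0.2),
]

points = [
    {
        "gene": gene,
        "x": log_fc,            # x axis: log fold change
        "y": -log10(p_value),   # y axis: -log10(p), so small p-values sit high
        "status": classify(log_fc, p_value),
    }
    for gene, log_fc, p_value in results
]

for point in points:
    print(point["gene"], point["status"], round(point["y"], 2))
```

Genes labeled "up" or "down" are the ones that would typically be colored and named on the plot.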

Sneha_Raj · August 4, 2021, 5:23am

Module 5 Self-Assessment

Technical Area

Create and label vectors in R
Create and format barplot, dotplot, and cnetplot visualizations
Convert ENTREZID to Gene Symbols and vice versa
Identify genes and their relationships in the top KEGG pathways
Identify the relationships of genes in regulatory target sets

Tools

RStudio
clusterProfiler package
org.Hs.eg.db package
enrichplot package
GSEA-MSigDB
barplot, dotplot, & cnetplot

Soft Skills

Persistence: While completing this module, I finally felt that I was acquainted enough with coding in R that I did not constantly have to google the syntax or documentation of commonly used functions. This showed me that even starting from no experience, continued effort in learning a new subject pays off. I see that eventually I am able to self-learn a new skill—in this case a language.
Public Speaking: During this work I also prepared a PowerPoint presentation for my team giving an overview of the Long et al. 2019 paper. From my team manager's feedback, I learned that presentations on academic papers are more engaging when they have less text and more visuals. The job of the speaker is to point out parts of the images and offer more in-depth explanations of the authors’ methods and results. While I felt I did a good job of speaking clearly and explaining various parts of the paper, I feel that the visuals and text in my presentation could be more succinct.

Achievement highlights

Created, filtered, and annotated an R vector of DEGs for further analyses
Accessed external gene datasets and used them to tag data points annotated with ENTREZID or Gene Symbol
Identified and created visual networks of top genes in related pathways in GSEA and Transcription Factor Analysis
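Over-representation analysis of the kind that enrichGO() and enrichKEGG() perform is based on a hypergeometric test: given how many genes in the background belong to a term, how surprising is the overlap with the DEG list? The Python sketch below shows that calculation with the standard library only. It is an illustration of the statistic, not a reimplementation of clusterProfiler; the function name and the toy counts are assumptions.

```python
from math import comb


def enrichment_p_value(universe, in_set, selected, overlap):
    """One-sided hypergeometric p-value: P(overlap or more genes by chance).

    universe: number of background genes
    in_set:   background genes annotated to the term
    selected: number of DEGs tested
    overlap:  DEGs annotated to the term
    """
    total = comb(universe, selected)
    upper = min(in_set, selected)
    favourable = sum(
        comb(in_set, k) * comb(universe - in_set, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy counts: 20 background genes, 5 in the term, 4 DEGs, 3 of them in the term.
p = enrichment_p_value(universe=20, in_set=5, selected=4, overlap=3)
print(round(p, 4))
```

Real tools additionally correct these p-values for the many terms tested at once.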

Tasks completed

I created a sorted DEG vector after filtering the topTable data from module 4.
I made 3 barplot visualizations of gene ontology: biological processes, molecular functions, and cellular components by using the enrichGO() function on the ENTREZID-annotated DEG vector.
I used the enrichKEGG() and dotplot() functions to create a visualization of the levels of specific cellular processes of tumor samples in comparison to normal samples.
I made a gene-concept network depicting the linkage of the genes in the top 5 KEGG pathways using cnetplot().
I performed a gene set enrichment analysis (GSEA) in order to associate gene expression with different cellular/molecular processes in the groups. This helped determine whether a gene set is randomly distributed or correlated with a specific cell phenotype. I used the first method, reading the gmt files downloaded from the database, and then generated a global DEG vector. This vector was then used to create a GSEA() plot displaying selected gene sets.
I performed transcription factor analysis by getting the C3 regulatory target gene sets and identifying the related genes of interest. This network was also visualized using cnetplot().
I prepared my data for external tools by writing a text file to my laptop.
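The GSEA step above asks whether a gene set is spread randomly through the ranked gene list or concentrated at one end. A minimal Python sketch of the running-sum enrichment score behind that idea follows; it is for illustration only and omits the permutation testing and normalization that a real GSEA run performs. The function name, the toy ranking, and the gene sets are assumptions, and the sketch expects a gene set that overlaps the ranked list without covering all of it.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for a list of (gene, score) pairs.

    `ranked` must be sorted by score in descending order. Hits move the
    running sum up in proportion to |score| ** weight; misses move it down
    by a constant step. The score is the largest deviation from zero.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    hit_total = sum(abs(score) ** weight
                    for (_, score), hit in zip(ranked, hits) if hit)
    miss_step = 1.0 / (len(ranked) - sum(hits))

    running, best = 0.0, 0.0
    for (_, score), hit in zip(ranked, hits):
        if hit:
            running += abs(score) ** weight / hit_total
        else:
            running -= miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (for example, genes ordered by log fold change).
ranked = [("G1", 3.0), ("G2", 2.0), ("G3", 1.0),
          ("G4", -1.0), ("G5", -2.0), ("G6", -3.0)]

top_score = enrichment_score(ranked, {"G1", "G2"})      # set at the top
bottom_score = enrichment_score(ranked, {"G5", "G6"})   # set at the bottom
mixed_score = enrichment_score(ranked, {"G1", "G6"})    # set at both ends
print(top_score, bottom_score, mixed_score)
```

A set concentrated at the top of the list gives a strongly positive score, one at the bottom a strongly negative score, and a scattered set a score closer to zero.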

Deliverables & code

2021-BI-June4/Sneha_Raj/Module 5 at main · mentorchains/2021-BI-June4 · GitHub

Sneha_Raj · August 14, 2021, 9:36pm

Module 6 Self-Assessment

Technical Area

Export a text file of top upregulated DEGs by gene symbol
Use web tools to create and analyze visualizations

Tools

Metascape
STRING Database
GEPIA 2

Soft Skills

Analytical: Using web tools to generate visualizations in this module gave me the opportunity to compare them with my own plots and networks and observe similarities and differences. I also did some more biological investigation by searching for the disease pathology and expected gene enrichments.
Presentation: This week I also participated in the external R Shiny app presentation to Dr. Rola Dali. In preparation for this event, I worked on my public speaking and presentation skills so that I could succinctly convey my point and tie it into the overall group’s presentation.

Achievement highlights

Using Metascape to generate gene enrichment and PPI graphs
Using STRING DB to generate and analyze PPI networks
Using GEPIA to conduct survival analysis on specific genes

Tasks completed

I imported the list of gene symbols of the top upregulated DEGs that I generated in module 5 into Enrichr and Metascape.
Then, using the features on the site, I generated visualizations of the enriched genes, pathways, and condensed protein-protein interaction (PPI) networks. I observed some of the cellular processes which were distinctive in the visualizations.
I imported the same list of gene symbols into the STRING database to generate another PPI network. I moved the nodes around and observed the connections between different genes.
Finally, I performed a survival analysis on GEPIA looking at the NCKAP5 gene in lung adenocarcinoma. The survival analysis is plotted as a line graph with time on the x axis and percent survival on the y axis.
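Survival plots with time on the x axis and percent survival on the y axis are typically Kaplan-Meier curves. As a hedged illustration of how such a curve is computed, the Python sketch below estimates survival probabilities from follow-up times and event flags. The function name and the toy data are assumptions; they are not GEPIA's code or the NCKAP5 results.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an event.

    events[i] is True when the event was observed at times[i] and False
    when the subject was censored (lost to follow-up) at that time.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Toy follow-up data (months) for six hypothetical patients.
times = [2, 3, 3, 5, 8, 9]
events = [True, True, False, True, False, True]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Plotting one such curve for a high-expression group and one for a low-expression group gives the familiar pair of step-shaped lines.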

Deliverables

Sneha_Raj · August 23, 2021, 4:03am

Module 7 Self-Assessment

Technical Area

Import a new set of data, produce visualizations, and understand disease pathology
Use web tools and existing publications to corroborate evidence

Tools

RStudio
Oligo package
Limma
GSEA
Heatmap
Annotation database
IAC outlier detection

Soft Skills

Presentational: I created a Google Slides presentation to assist my speaking. In the slides I made sure to save most of the space for visualizations and used little text. I used transitions to add boxes and highlights to my figures while talking about a specific section of the visualization. This made my presentation more engaging as I wasn’t making my audience face a giant block of text.
Flexible: Initially I wanted to use GSE98979, which contains normal and cancer samples taken over a period of time, but I was still learning how to annotate such data. Given more time, I would like to explore more ways to annotate data so that I can learn the different kinds of comparisons that can be made. The dataset I chose, GSE66272, was more helpful to me in the moment for showcasing the skills I developed this summer.

Achievement highlights

Independently conduct a transcriptomics analysis on a new dataset.
Present results to mentors.

Tasks completed

Search GEO for a dataset with expression profiling by array. I ended up picking GSE66272, a clear cell renal cell carcinoma dataset.
Import metadata using the Oligo package.
Conduct RMA background correction and normalization. Generate PCA plots.
Identify and remove outliers using inter-array correlation.
Annotate data, conduct a Limma analysis, and identify differentially expressed genes using visualizations.
Conduct functional analysis using enrichGO and enrichment analysis with GSEA.
Use publications and Metascape as evidence to corroborate my findings.
Create a presentation giving an overview of my findings and my overall growth during this internship.

Deliverables