Module 1 Self-Assessment
- Installed RStudio and additional packages
- Learned R syntax, functions, and arguments
- Practiced reading academic literature
- Learned to plot data using ggplot
- R Console
- Adaptive: Although I had no prior experience with R, I was able to use my existing understanding of Java to adapt and find commonalities while coding. I went through provided resources and worked through the step-by-step guides in order to grow confident with the R environment.
- Resourcefulness: In addition to the resources linked in this unit, I consulted web resources to expand my understanding of RStudio and the R language in general. Wherever I was stuck, I spent time on multiple websites in order to find explanations that best suited my learning needs.
- Learning syntax, functions, and arguments in R
- Installed the ggplot2 package
- plotted data using ggplot2
- While I am familiar with programming in Java, I have little experience with R. After reading through several web sources, I downloaded R, RStudio, and some of the recommended packages. I also updated some of the packages.
- I then used the syntax guide and consulted other websites to test out a variety of commands, including vectors, matrices, factors, and lists, in RStudio.
- After studying syntax and the uses of different data structures, I tried using a few mathematical functions by inputting arguments. I also familiarized myself with the uses of packages and how to load libraries for use such as ggplot2.
- Next I worked through the data wrangling document. I located and assigned the mouse_exp_design csv. I then learned how to manipulate vectors, save the data from the csv to a new file, plot the data in different formats, and plot using the ggplot2 package.
- Finally, I read the Guo paper and watched the videos regarding bioinformatics. This was an easier task because I am familiar with reading the academic language in such research papers.
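Looking back, the basics I practiced in this module can be sketched in a few lines of R. The column names below are assumptions about what mouse_exp_design.csv contains, not its actual contents:

```r
# Load ggplot2 for plotting
library(ggplot2)

# Read the course csv (file must be in the working directory)
metadata <- read.csv("mouse_exp_design.csv", row.names = 1)

# Basic data structures practiced in this module
v <- c(1, 2, 3)                  # vector
m <- matrix(1:6, nrow = 2)       # matrix
f <- factor(c("ctrl", "treat"))  # factor
l <- list(v, m, f)               # list

# Save a copy of the data to a new file
write.csv(metadata, "mouse_exp_design_copy.csv")

# Plot with ggplot2 -- 'genotype' and 'celltype' are assumed column names
ggplot(metadata, aes(x = genotype, fill = celltype)) +
  geom_bar()
```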
Module 2 Self-Assessment
- Using GEO to search for & download data
- Importing metadata and gene expression data to RStudio
- Basics of R Shiny apps
- Perseverance: A lot of the material was very new to me, and some of the resources were a little beyond my understanding, so I did my best to go step-by-step, covering the basics with videos on StemAway and online before approaching the more difficult instructions and resources.
- Team collaboration: During team meetings I kept my camera on and provided visual and verbal cues to indicate that I was in agreement/on the same page as my teammates. I made sure to go over the material ahead of time and bring questions to our meeting with our mentor, Anya.
- Resourcefulness: When in doubt, I used pathway hubs, Google Drive, Slack, other forum topics, and even Google to try to find a solution on my own. If I was still confused, I wrote down the question and brought it to team meetings.
- Understanding the components of an R Shiny App
- Being able to import metadata and gene expression data into RStudio
- Explore other methods of data importation with alternate packages
- I read, understood, and summarized the Long et al. 2019 paper (the paper whose results we aim to replicate during this internship).
- I watched and took notes on the “From Transcriptomics to Therapeutics” presentation by Ayush Noori where I learned some bioinformatics background on microarray vs. RNA-seq data.
- In order to communicate with team leads and teammates, I joined the Slack group, introduced myself on the general channel, filled out the form to be placed in subgroups, and started attending subgroup and team-wide meetings.
- I used a variety of web and StemAway-provided resources to learn the basics of creating an R Shiny app.
- Now that I was more familiar with transcriptomics and the GEO database, I started working on my R code using the instructions provided by the module. Instead of using the data in the module, I made sure to use the data from the new paper (GSE19804).
- I was receiving an error that my files were not valid, so I reached out to our mentor, Anya, on the pathway hubs with screenshots of my code and she was able to help me.
- Finally, I tried using other packages for implementation, including oligo and DESeq2. I also looked online for the different scenarios where using a different package would be appropriate.
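A minimal sketch of the GEO import step, assuming the GEOquery route (the module could equally start from raw CEL files):

```r
# GEOquery downloads series matrices and metadata directly from GEO
library(GEOquery)

# Fetch the series used in the new paper; getGEO returns a list of
# ExpressionSets, one per platform
gse <- getGEO("GSE19804", GSEMatrix = TRUE)[[1]]

expr <- exprs(gse)  # probe-by-sample expression matrix
meta <- pData(gse)  # per-sample metadata (phenotype data)

dim(expr)  # sanity check: probes x samples
```

For raw CEL files, `getGEOSuppFiles("GSE19804")` downloads the supplementary archive instead.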
Module 3 Self-Assessment
- Generate Quality Control reports using various Bioconductor packages
- Export and import csv files into RStudio
- Create various forms of visualizations for raw and normalized data
- Problem-solving: When encountering errors in my code, I made sure to utilize past contributions on the StemAway channels and even Google the errors in order to figure out which line of my code had gone wrong and fix it.
- Communication: I made sure to ask my questions as they arose. I brought to attention that our team did not yet have a designated GitHub repository and our leads were able to create one. During team meetings, I notified my leads and other teammates of my personal progress on individual tasks.
- Take imported raw data and apply normalization methods
- Create visualizations in order to compare raw and normalized data
- Troubleshoot errors thrown out by the code
- I did a more in-depth reading of Long et al. (2019) and watched session 1’s module 2 kickoff meeting where Mr. Ali Nehme explained some of the components of the paper.
- I used the simpleAffy, arrayQualityMetrics, and affyQCReport packages to conduct quality control and used affyPLM to fit probe-level models and generate boxplots of the data.
- I normalized the data using rma() and exported the data as a new csv file. I then imported this csv file, so I would not have to perform normalization every time I reran my code.
- When accessing the normalized data, I was originally using the exprs() method, which caused memory errors when generating visualizations. I spoke with our mentor, Anya, who helped me recognize when to and when not to use exprs() in order to standardize data.
- I created boxplots, PCA plots, and heat maps in order to visualize the data and compare the raw versus normalized data. Using the PCA plots, I was able to recognize some outlier points.
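The QC-and-normalization workflow above can be sketched roughly like this; the file and directory names are illustrative, not my exact code:

```r
library(affy)
library(arrayQualityMetrics)

# Read raw CEL files from the working directory
raw <- ReadAffy()

# Generate a QC report on the raw arrays
arrayQualityMetrics(expressionset = raw, outdir = "QC_raw", force = TRUE)

# RMA background correction + normalization
eset <- rma(raw)
norm <- exprs(eset)

# Cache the normalized matrix so rma() need not be rerun on each session
write.csv(norm, "normalized_data.csv")

# Compare distributions before and after normalization
boxplot(raw, main = "Raw intensities")
boxplot(norm, main = "RMA-normalized")

# PCA on the normalized data (samples as rows)
pca <- prcomp(t(norm), scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```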
Module 4 Self-Assessment
- Identify and remove outliers
- Annotate data to remove duplicate ProbeIDs, NAs, and SYMBOLS
- Filter genes which will provide good quality results
- Use limma to find DEGs and p-values
- Create visualizations of the DEGs
- Time-management: While working on the pathway modules, I was also working on coding components for the R Shiny app as a part of Group A. Having so many tasks and deadlines, it was important for me to allot and finish my tasks within a certain amount of time. I organized these time slots using my Google Calendar.
- Communication: During this module, I encountered several issues while trying to create the volcano plot and heat map visualizations. By providing my mentor with my code and screenshots of the plots, I was able to clearly communicate my issue and receive feedback.
- Identified and removed 7 outliers from the microarray and metadata
- Filtered genes for quality samples to be used in assessment
- Created visualizations of DEGs using volcano plots and heat maps
- I researched different methods of outlier detection and felt that doing it by inter-array correlation (IAC) was the best method in this case. I found and removed a total of 7 outliers from my metadata and microarray data.
- I generated another PCA plot to visualize the data after outlier removal.
- I used the in-depth guide to annotate the data and removed duplicate ProbeIDs, NAs, and SYMBOLS.
- I filtered out the data below the second percentile of expression to make sure that the remaining genes could be used in a quality analysis.
- I used the limma package to find differentially expressed genes and generated a csv data table. I also created a report with significant genes (when p<0.05) using the eBayes function.
- Using ggplot, I created and stylistically formatted a volcano plot so that certain thresholds and the names of the genes could be identified. I alternatively tried to use EnhancedVolcano, but soon realized that it was unsupported on my computer.
- I created a heat map of the top 50 differentially expressed genes and color-coded it such that cancer data and normal data were easily differentiable. At this point, I encountered some trouble because the row names of my group table did not match the column names of my filtered data, so the color-coding did not show up. I made the appropriate changes in my metadata file and in my code so that the column and row names matched, and the color-coding showed up.
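The limma step can be sketched as follows. The objects here (small random matrix, `groups` factor) are placeholders standing in for my filtered, annotated expression data:

```r
library(limma)

# Placeholder sample labels and expression matrix (genes x samples)
groups <- factor(c(rep("cancer", 3), rep("normal", 3)))
exprs_filtered <- matrix(rnorm(60), nrow = 10,
                         dimnames = list(paste0("gene", 1:10), NULL))

# Design matrix with one column per group
design <- model.matrix(~0 + groups)
colnames(design) <- levels(design <- design, groups)
colnames(design) <- levels(groups)

# Fit the linear model and test the cancer-vs-normal contrast
fit <- lmFit(exprs_filtered, design)
contrast <- makeContrasts(cancer - normal, levels = design)
fit2 <- eBayes(contrasts.fit(fit, contrast))

# Full DEG table with BH-adjusted p-values, then the significant subset
deg <- topTable(fit2, number = Inf, adjust.method = "BH")
write.csv(deg, "DEGs.csv")
sig <- deg[deg$P.Value < 0.05, ]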
Module 5 Self-Assessment
- Create and label vectors in R
- Create and format barplot, dotplot, and cnetplot visualizations
- Convert ENTREZID to Gene Symbols and vice versa
- Identify genes and their relationships in the top KEGG pathways
- Identify the relationships of genes in regulatory target sets
- clusterProfiler package
- org.Hs.eg.db package
- enrichplot package
- barplot, dotplot, & cnetplot
- Persistence: While completing this module, I finally felt that I was acquainted enough with coding in R that I did not constantly have to google the syntax or documentation of commonly used functions. This showed me that, even starting from no experience, continued effort in learning a new subject pays off and that I can self-learn a new skill, in this case a language.
- Public Speaking: During this work I also prepared a PowerPoint presentation for my team giving an overview of the Long et al. 2019 paper. From my team manager's feedback, I learned that presentations on academic papers are more engaging when they have less text and more visuals. The job of the speaker is to point out parts of the images to offer more in-depth explanations of the authors' methods and results. While I felt I did a good job of speaking clearly and explaining various parts of the paper, I feel that the visuals and text in my presentation could be more succinct.
- Created, filtered, and annotated R vector of DEGs for further analyses
- Accessed external gene datasets and used them to tag data points annotated with ENTREZID or Gene Symbol
- Identified and created visual networks of top genes in related pathways in GSEA and Transcription Factor Analysis
- I created a sorted DEG vector after filtering the topTable data from module 4.
- I made 3 barplot visualizations of gene ontology: biological processes, molecular functions, and cellular components by using the enrichGO() function on the ENTREZID-annotated DEG vector.
- I used the enrichKEGG() and dotplot() functions to create a visualization of the levels of specific cellular processes of tumor samples in comparison to normal samples.
- I made a gene-concept network depicting the linkage of the genes in the top 5 KEGG pathways using cnetplot().
- I performed a gene set enrichment analysis (GSEA) in order to associate gene expression to different cellular/molecular processes in the groups. This helped determine whether the gene set is randomly distributed or correlated with a specific cell phenotype. I used the first method of reading the gmt files downloaded from the database and then generated a global DEG vector. This vector was then used to create a GSEA() plot displaying selected gene sets.
- I performed transcription factor analysis by getting the C3 regulatory target gene sets and identifying the related genes of interest. This network was also visualized using cnetplot().
- I prepared my data for external tools by writing a text file to my laptop.
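The enrichment steps above can be sketched with clusterProfiler. `deg_symbols` is a placeholder for my filtered DEG vector of gene symbols:

```r
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

# Placeholder for the sorted DEG vector from module 4
deg_symbols <- c("EGFR", "TP53", "KRAS")

# Convert Gene Symbols to ENTREZIDs (and back, by swapping the types)
ids <- bitr(deg_symbols, fromType = "SYMBOL", toType = "ENTREZID",
            OrgDb = org.Hs.eg.db)

# GO enrichment; "BP" shown here, repeated with "MF" and "CC"
ego <- enrichGO(gene = ids$ENTREZID, OrgDb = org.Hs.eg.db, ont = "BP")
barplot(ego)

# KEGG pathway enrichment, dotplot, and gene-concept network
ekegg <- enrichKEGG(gene = ids$ENTREZID, organism = "hsa")
dotplot(ekegg)
cnetplot(setReadable(ekegg, org.Hs.eg.db, keyType = "ENTREZID"),
         showCategory = 5)
```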
Module 6 Self-Assessment
- Export a text file of top upregulated DEGs by gene symbol
- Use web tools to create and analyze visualizations
- STRING Database
- GEPIA 2
- Analytical: Using web tools to generate visualizations in this module gave me the opportunity to compare my own plots and networks and observe similarities and differences. I also did some more biological investigation by searching for the disease pathology and expected gene enrichments.
- Presentation: This week I also participated in the external R Shiny app presentation to Dr. Rola Dali. In preparation for this event, I worked on my public speaking and presentation skills so that I could succinctly convey my point and tie into the overall group presentation.
- Using Metascape to generate gene enrichment and PPI graphs
- Using STRING DB to generate and analyze PPI networks
- Using GEPIA to conduct survival analysis on specific genes
- I uploaded the list of gene symbols of top upregulated DEGs I generated in module 5 to EnrichR and Metascape.
- Then, using the features on the site, I generated visualizations of the enriched genes, pathways, and condensed protein-protein interaction (PPI) networks. I observed some of the cellular processes which were distinctive in the visualizations.
- I imported the same list of gene symbols to the STRING database to generate another PPI network. I moved the nodes around and observed the connections across different genes.
- Finally, I performed a survival analysis on GEPIA looking at the NCKAP5 gene in lung adenocarcinoma. The survival analysis is plotted as a line graph with time on the x axis and percent survival on the y axis.
Module 7 Self-Assessment
- Import a new set of data, produce visualizations, and understand disease pathology
- Use web tools and existing publications to corroborate evidence
- RStudio
- Oligo package
- Annotation database
- IAC outlier detection
- Presentation: I created a Google Slides presentation to assist my speaking. In the slides I made sure to save most of the space for visualizations and used little text. I used transitions to add boxes and highlights to my figures while talking about a specific section of the visualization. This made my presentation more engaging as I wasn't making my audience face a giant block of text.
- Flexible: Initially I wanted to use GSE98979, which contains normal and cancer samples taken over a period of time; however, I was still learning to annotate them. Given more time I would like to explore more ways to annotate data so that I can learn the different kinds of comparisons that can be made. The dataset I chose, GSE66272, was more helpful to me in the moment to showcase the skills I developed this summer.
- Independently conduct a transcriptomics analysis on a new dataset.
- Present results to mentors.
- Search GEO for a dataset with expression profiling by array. I ended up picking GSE66272, a clear cell renal cell carcinoma dataset.
- Import metadata using the Oligo package.
- Conduct RMA background correction and normalization. Generate PCA plots.
- Identify and remove outliers using inter-array correlation.
- Annotate data, conduct a limma analysis, and identify differentially expressed genes using visualizations.
- Conduct functional analysis using enrichGO and enrichment analysis with GSEA.
- Use publications and Metascape as evidence to corroborate my findings.
- Create presentation giving an overview of my findings and my overall growth during this internship.
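The first steps of this independent analysis can be sketched with oligo and GEOquery; the directory layout below is illustrative rather than my exact code:

```r
library(oligo)
library(GEOquery)

# Fetch the raw CEL archive for the chosen dataset and unpack it
getGEOSuppFiles("GSE66272")
untar("GSE66272/GSE66272_RAW.tar", exdir = "GSE66272/CEL")

# Read the CEL files with oligo
cels <- list.celfiles("GSE66272/CEL", full.names = TRUE,
                      listGzipped = TRUE)
raw <- read.celfiles(cels)

# RMA background correction + normalization, then a quick PCA
eset <- rma(raw)
pca <- prcomp(t(exprs(eset)), scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```

From here the pipeline continues as in modules 3 through 5: IAC outlier removal, annotation, limma, and enrichment.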