RStudio: Understanding the fundamentals of R code and getting familiar with its interface
Learning how to read research papers efficiently
Understanding more about the bioinformatics field, such as methods and tools used
Tools:
RStudio and the R installer
ggplot2
Bioconductor (specifically EnhancedVolcano)
Resources:
Official ggplot2 website
YouTube
Stack Overflow
RStudio Community
Soft Skills:
Time management: pacing myself between reading the given material and applying it in practice
Utilizing resources: Researching and resolving my own problems with the resources at hand:
Official ggplot2 website
YouTube
Stack Overflow
RStudio Community
Navigation: Understanding how to use the Stem-Away website
Communication: Communicating with the leads/founder to clarify any doubts
Determination: At first I thought I wouldn’t be able to understand R, but after working through the provided material I understand it well enough to solve problems on my own, and I greatly enjoyed writing code in it
Three Achievement Highlights:
Read through the given materials and practiced in the application itself, which allowed me to grasp the fundamentals of R and understand its interface
Errors: Experimented with the list of error tasks provided, in addition to fixing errors I personally encountered when executing code
Understood the basics of how to read scientific papers and gained a better understanding of the bioinformatics field
Tasks Completed:
Originally I was confused about the platform, but I was able to navigate it and accomplish the given tasks for this module successfully. I navigated the STEM-Away website and made use of the resources given to us. I read the given material and understood the basic fundamentals of R: intro to R (downloading RStudio and the R installer and understanding the interface), syntax and data structures, functions and arguments, data wrangling, and visualization (creating basic scatterplots, barplots, histograms, box plots, and volcano plots, and changing their appearance, such as labeling, font, color, and themes). Whenever I encountered problems while executing code or wanted more clarification on certain commands, I was able to use resources available on the internet, such as Stack Overflow, RStudio Community, the official ggplot2 website, and YouTube. I also familiarized myself with the different types of errors (syntax, semantic, and logical). There were some errors and issues I personally encountered when executing code, for which I was able to find solutions based on what I learned. Outside of R, I learned efficient ways to read a research paper and how all the information in it can be condensed into something that encompasses the important points. In addition, I came to understand more about the bioinformatics field and the resources and methods used to accomplish many of its tasks (e.g. microarrays, RNA sequencing, FASTQC, GEO2R, and R).
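As a minimal sketch of the kind of ggplot2 plot covered in this module (a scatterplot with custom labels and theme), here is an example using the built-in mtcars dataset; the data and styling choices are illustrative, not from the course material.

```r
library(ggplot2)

# Basic scatterplot: point color mapped to a grouping variable
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  labs(title = "Fuel efficiency vs. weight",  # changing labels
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders") +
  theme_minimal()                             # changing the theme

# print(p) draws the plot; ggsave("scatter.png", p) would save it to a file
```

The same pattern (data + aes mapping + geom + labs + theme) extends to the barplots, histograms, and box plots mentioned above by swapping the geom.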
Understood the GEO Database platform and how it works; downloaded the necessary data
Developed a better understanding of what Bioconductor is and the packages it provides
Learned how to import data into R and create metadata
Tools:
GEO Database
GSE19804 samples
Express Zip
RStudio
Google Slides (presentation for my team)
Other resources:
Presentations and videos on Stem-Away Platform
Soft Skills:
Determination and Utilizing Resources: This module definitely felt more engaging than Module 1, which covered learning the basics and implementing them. Since I am still new to coding, I thought I wouldn’t be able to continue further, but in the end, thanks to the resources on the STEM-Away platform, I was able to push through and finish the tasks
Communication: Contacted leads whenever I had a problem
Teamwork + Presentation Skills: I worked with a group member on a presentation regarding transcriptomics and the two methods that are used (Microarray Analysis and RNA Sequencing)
Three Achievement Highlights:
Was able to download the .CEL files from the GSE19804 dataset
Imported .CEL files using ReadAffy() command
Obtained metadata by downloading the series matrix text file and creating a data frame containing the key information from it (sample names and tissue type)
Tasks Completed:
Using the resources given, I was able to understand what GEO is, including how its records are organized (original submitter-supplied records vs. curated records). I had an error when downloading simpleaffy and affyQCReport, so I contacted Anya, who helped resolve the problem by letting me know it was a result of having R 4.0+. I therefore downloaded R version 3.5.0 (since that was the minimum version needed to support Bioconductor and the packages I had been unable to download) and was able to install all the packages: affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCReport, sva, GEOquery, and pheatmap. I was originally confused about how to load the data into RStudio and obtain the metadata, but I checked a previous meeting video in which Anya guided us through the process step by step, and I was able to understand it. I successfully downloaded the .CEL files for the GSE19804 dataset and imported them into RStudio through the ReadAffy() command. Afterwards, I downloaded the series matrix text file and imported it into RStudio as well to obtain the metadata through the getGEO() command. Since the metadata has to be simple, I created a data frame that limited the columns to the important information, sample name and tissue type, since the rest of the information was the same across samples. I looked over the attached presentation, which gave a more detailed description of Bioconductor and the various tools it offers, and gained a stronger understanding of it. Separately from the module, this week I worked with my partner to create a presentation on transcriptomics and the methods of microarray analysis and RNA sequencing.
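The import and metadata steps above can be sketched as follows. The ReadAffy()/getGEO() calls assume the affy and GEOquery packages plus the downloaded GSE19804 files, so they are shown commented out; the sample names and tissue labels in the data frame are illustrative placeholders, not the real GSE19804 values.

```r
# Requires the affy and GEOquery Bioconductor packages and downloaded files:
# library(affy)
# library(GEOquery)
# raw_data <- ReadAffy(celfile.path = "GSE19804_CEL/")         # import .CEL files
# gse <- getGEO(filename = "GSE19804_series_matrix.txt.gz")    # full metadata

# Simplified metadata: keep only sample name and tissue type
# (illustrative values; in the module these came from the series matrix file)
metadata <- data.frame(
  sample = c("GSM494551", "GSM494552"),
  tissue = c("tumor", "normal"),
  row.names = c("GSM494551", "GSM494552")
)
```

Limiting the data frame to these two columns mirrors the step described above, since the remaining series-matrix fields were identical across samples.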
Ran various quality control methods on the GSE19804 data; developed a better understanding of the purpose of these analyses and how outliers are detected
Background corrected and normalized the GSE19804 data using gcrma()
Created several visualizations of the GSE19804 data, including: a boxplot, PCA plot, and a heatmap
Submitted deliverables through GitHub
Tools:
RStudio
GitHub
Stack Overflow
Other resources:
Guided PDF instructions and videos on Stem-Away Platform
Soft Skills:
Time management: Efficiently pacing myself so I can understand the tasks for each module while finishing it on time
Organization: I started to organize my code with comments and to organize my scripts by keeping one per module
Communication: Approached the leads whenever I had encountered a problem
Three Achievement Highlights:
Successfully generated QC reports of the GSE19804 data using all four methods: simpleaffy(), arrayQualityMetrics(), affyQCReport(), and affyPLM()
Ran background correction and normalization on the GSE19804 data; exported the results as a .csv file and assigned them to an object
Created a boxplot and heatmap of the normalized GSE19804 data, and PCA plots of both the raw and normalized GSE19804 data
Tasks Completed:
Through this module I learned about quality control, data normalization, and data visualization. I already had the affy, affyPLM, simpleaffy, arrayQualityMetrics, affyQCReport, and pheatmap packages from module 2, as well as the data from the GSE19804 dataset, so I simply loaded the libraries and the object into the current session. Using simpleaffy(), arrayQualityMetrics(), affyQCReport(), and affyPLM(), I ran quality control checks on the data and was able to see the analysis through various visuals, which gave me a faint idea of what the possible outliers were. They became more evident once I normalized the data using the gcrma() method and created data visualizations, including box plots, PCA plots (for both the raw and normalized data), and heatmaps. I originally had an error when making a PCA plot, so I contacted Anya, who helped me understand what it meant, and I was able to resolve it afterwards. I was also curious about the difference between using “@” vs. “$” when accessing sub-levels, and about the purpose of accessing the “rotation” component of the PCA results. Anya clarified that “@” accesses slots in S4 objects while “$” accesses sub-levels in lists and data frames, and that the “rotation” component is important because it is the matrix of eigenvectors. Later on, I faced a problem with the creation of the heatmap, so I referred back to Anya’s video on module 3; I saved my data as a .rds file and used the exprs() function when saving my normalized data as a .csv, which got rid of the error. Once I finished all the tasks, I submitted my code and outputs (the necessary deliverables) to GitHub.
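The normalization and PCA steps can be sketched as below. gcrma() needs the raw AffyBatch from module 2, so those lines are commented out; the boxplot and PCA run on a small synthetic expression matrix purely for illustration.

```r
# Requires the gcrma package and the imported raw data:
# library(gcrma)
# norm_data <- gcrma(raw_data)                        # background correct + normalize
# write.csv(exprs(norm_data), "GSE19804_norm.csv")    # exprs() extracts the matrix
# saveRDS(norm_data, "GSE19804_norm.rds")             # save for later sessions

# Synthetic stand-in: 20 genes x 10 samples
set.seed(1)
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))

boxplot(expr, main = "Expression per sample")   # per-sample distribution check

# PCA on samples: transpose so rows = samples, columns = genes
pca <- prcomp(t(expr), scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
# pca$rotation holds the eigenvectors mentioned above
```

Plotting PC1 against PC2 for both the raw and normalized matrices is what makes candidate outlier samples stand out visually.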
Guided PDF instructions and videos on Stem-Away Platform
Soft Skills:
Perseverance/Grit: I felt this was by far the hardest module, since it involved more self-learning than step-by-step instruction. That made it difficult for me, as this is my first time coding, but I was able to push through and finish in the end.
Self-research/answering: Whenever I was confused on how to execute code, what it meant, and/or received errors, I researched and resolved the problems myself. In doing so, I discovered various wonderful resources I might refer back to in the future.
Time management: Trying to efficiently execute tasks while learning at the same time
Three Achievement Highlights:
Successfully identified and removed 7 outliers using the IAC method
Created a data frame with probe IDs and their associated gene SYMBOLS and removed any duplicate probe IDs, NA values, and symbols. Afterwards merged the data frame with the normalized data with no outliers by row names
Filtered genes so that only those above the 2nd percentile remained, then analyzed them using limma and created a volcano plot and heatmap from the top 50 DEGs.
Tasks Completed:
This module felt like the hardest one by far, but I was able to push through. Using the IAC method, I identified and removed 7 outlier samples (GSM494571.CEL.gz, GSM494572.CEL.gz, GSM494582.CEL.gz, GSM494591.CEL.gz, GSM494596.CEL.gz, GSM494654.CEL.gz, and GSM494657.CEL.gz). Afterwards, I created a data frame annotating probe IDs to their correct gene SYMBOLs through the “hgu133plus2.db” database, and removed any duplicate probe IDs, NA values, and duplicate symbols. I set the row names of the data frame to the probe IDs, which allowed me to merge it with my normalized, outlier-free data by row names, and then set the row names of the merged data to the gene SYMBOLs. I then filtered the genes so that only those above the 2nd percentile remained, and performed limma analysis on them to fit a linear model. I extracted the top DEGs with an adjusted p-value of less than 0.05, from which I created a volcano plot, and then retrieved the top 50 DEGs to create a heat map.
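A base-R sketch of the IAC (inter-array correlation) outlier check described above is shown below, run on a small synthetic matrix. The z-score cutoff of -2 and the data are illustrative; the module's actual threshold and data may differ.

```r
# Synthetic data: 9 samples sharing a common signal, 1 uncorrelated outlier
set.seed(2)
signal <- rnorm(50)                                       # shared expression pattern
expr <- sapply(1:10, function(i) signal + rnorm(50, sd = 0.3))
colnames(expr) <- paste0("sample", 1:10)
expr[, "sample10"] <- rnorm(50)                           # break one sample's correlation

iac <- cor(expr)                  # pairwise correlations between samples
mean_iac <- colMeans(iac)         # each sample's mean correlation with the others
z <- (mean_iac - mean(mean_iac)) / sd(mean_iac)

outliers <- names(which(z < -2))  # flag samples far below the group mean IAC
clean <- expr[, setdiff(colnames(expr), outliers)]
```

Samples whose mean IAC sits well below the rest (here, a z-score under -2) are dropped before annotation and limma analysis, as in the 7-sample removal described above.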
Guided PDF instructions and videos on Stem-Away Platform
Soft Skills:
Time management: Knowing my upcoming deadline was tight, I paced myself while trying to execute my tasks and solve problems as quickly as I could
Self Problem-Solving: I faced several errors once again, and for some there wasn’t much advice on the internet, so I solved all my errors by myself before contacting the leads
Organization: Organized all my deliverables into proper folders and committed them to GitHub
Three Achievement Highlights:
Created gene ontology plots for CC, BP, and MF for up-regulated DEGs
Made gene concept networks for KEGG pathways and transcription factors, plus a dot plot for KEGG pathways
Successfully created the GSEA plot after resolving an error, which I solved by reformatting the data frame
Tasks Completed:
I started off by downloading the following necessary packages and loading their libraries: org.Hs.eg.db, clusterProfiler, enrichplot, msigdb, magrittr, tidyr, and ggnewscale. I created a DEG vector by extracting the log fold changes from my top table, and filtered for up-regulated genes by setting a threshold of 1.5. After organizing the vector in descending order, I named the values by their gene SYMBOL and kept their associated gene IDs in another column. Then, to find the enriched pathways associated with up-regulated genes, I used gene ontology, producing plots focusing on cellular components, biological processes, and molecular functions, as well as KEGG analysis, in which other pathways were identified through a dot plot. In addition, I made a gene concept network for the KEGG analysis, in which connections were drawn between genes and the pathways they are involved in. Afterwards, GSEA was performed (using a vector of all genes, not just the up-regulated ones as before) to identify overrepresented gene sets, and a graph was produced. Toward the end, transcription factor analysis was performed and another gene concept network was created, which allowed transcription factors to be assessed so the regulation of genes could be understood. Finally, I retrieved a list of the gene SYMBOLs and kept it in a .txt file for the next module.
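The vector preparation feeding those enrichment steps can be sketched as follows. The clusterProfiler/org.Hs.eg.db calls require those Bioconductor packages, so they are commented out, and the gene symbols and fold changes below are made up for illustration.

```r
# Illustrative log2 fold changes named by gene SYMBOL
# (in the module these came from the limma top table)
lfc <- c(TP53 = 2.1, EGFR = 1.8, MYC = 0.4, BRCA1 = -1.2, KRAS = 1.6)

up_genes  <- lfc[lfc > 1.5]                 # keep up-regulated genes (threshold 1.5)
gene_list <- sort(lfc, decreasing = TRUE)   # full ranked list, as GSEA expects

# Requires clusterProfiler and org.Hs.eg.db:
# library(clusterProfiler)
# library(org.Hs.eg.db)
# ego  <- enrichGO(gene = names(up_genes), OrgDb = org.Hs.eg.db,
#                  keyType = "SYMBOL", ont = "BP")   # also run with "CC" and "MF"
# gsea <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db, keyType = "SYMBOL")
```

Note the distinction described above: the GO/KEGG over-representation steps use only the thresholded up-regulated genes, while GSEA takes the full ranked vector.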
Self-solving: Noticed an issue with the data I inputted from the vector to the text file, so I fixed it myself
Organization: I created a folder for my outputs, and properly uploaded it (in addition to the other folders) on GitHub
Successfully regenerated the text file so it included gene SYMBOLs only (originally it had log fold change numbers instead)
Retrieved sets of enriched biological identifiers through Enrichr
Created map of protein-protein relationships for genes entered
Tasks Completed:
Of all the modules, this was the most straightforward. I had already performed a bit of functional analysis in module 5 using GSEA; in this module, I was introduced to web-based functional analysis tools. As per the requirement, I chose to explore one tool from each group: from Group A I chose Enrichr, and from Group B I chose STRING. To do that, I needed a text file with all the gene SYMBOLs. I accidentally put in the log fold change numbers at first, but I corrected my mistake and obtained the gene SYMBOLs. I then copied and pasted them into the text boxes on both sites and searched. On Enrichr, the results were many sets of enriched biological annotations, while on STRING, a network map was provided showing connections/interactions between the proteins associated with the genes provided from the GSE19804 dataset.
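Producing the gene-SYMBOL text file pasted into Enrichr and STRING can be sketched in base R. The symbols below are illustrative placeholders; in the module they came from the DEG table, and the filename is an assumption.

```r
# Write one gene SYMBOL per line for pasting into Enrichr / STRING.
# The earlier mistake was writing the numeric log fold change column;
# writing the character vector of SYMBOLs (the names, not the values) avoids that.
symbols <- c("TP53", "EGFR", "KRAS", "BRCA1")   # illustrative DEG symbols
writeLines(symbols, "gene_symbols.txt")
```

Both web tools accept this one-symbol-per-line format directly in their input text boxes.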
Self-solving: Fixed any errors myself through prior knowledge/experience I have developed from this internship
Organization: I created a folder for my outputs, and properly uploaded it (in addition to the other folders) on GitHub
Organized my code by commenting
Three Achievement Highlights:
Performed quality control on GSE66272 (arrayQualityMetrics); created two PCA plots
DGE analysis: Successfully identified and removed outliers, and annotated probe IDs to their designated gene SYMBOLS
Functional Analysis: performed GO and GSEA and created plots/graphs
Tasks Completed:
In this module I did nothing new, but rather applied my skills and knowledge from previous modules to a new dataset: GSE66272, which concerns renal cancer, specifically ccRCC, a very aggressive form of it. Just like the lung cancer dataset I previously analysed (GSE19804), the data is profiled by array. I downloaded the .CEL files and the series matrix text file, imported the data into RStudio, and created the metadata. From there I performed a quality control check (specifically arrayQualityMetrics) and then normalized the data using gcrma. Afterwards, I visualized the data (both raw and normalized) using PCA. Then I moved on to DGE analysis, where I identified and removed the outliers, annotated probe IDs to their respective gene SYMBOLs, and removed duplicate symbols, probe IDs, and NAs. Afterwards, I merged the annotated data with the normalized, outlier-free data by row names and filtered the data. On this data I performed limma analysis, from which I made a top table and a heat map. Then I moved on to functional analysis, where I performed gene ontology and GSEA and generated plots/graphs.
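The limma step reused in this module can be sketched as below. Building the design matrix from the tissue metadata is base R; the limma calls themselves require that package and the real filtered expression matrix, so they are commented out, and the tissue labels are illustrative.

```r
# Design matrix from the tissue column of the metadata (illustrative labels)
tissue <- factor(c("tumor", "normal", "tumor", "normal"))
design <- model.matrix(~ tissue)   # intercept + tumor-vs-normal coefficient

# Requires the limma package and the filtered expression matrix:
# library(limma)
# fit <- lmFit(filtered_expr, design)             # linear model per gene
# fit <- eBayes(fit)                               # moderated statistics
# top <- topTable(fit, coef = 2, p.value = 0.05)   # DEGs with adjusted p < 0.05
```

The same design-matrix pattern applies whether the contrast is tumor vs. normal lung tissue (GSE19804) or tumor vs. normal kidney tissue (GSE66272).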