Objective
The objective of this project is to introduce students to the field of bioinformatics with a focus on transcriptomic data analysis using the Gene Expression Omnibus (GEO) database. Through this project, students will gain a deep understanding of GEO and learn how to extract and preprocess gene expression data for further analysis.
Additionally, students will develop practical skills in R programming and Bioconductor to perform differential gene expression analysis. The programming tasks in R will follow along the lines of the steps performed in GEO2R, allowing students to replicate the analysis process and gain hands-on experience in implementing the analysis workflow using their own code.
By combining domain-specific knowledge via GEO2R with hands-on programming experience, this project aims to provide students with a comprehensive foundation in bioinformatics and empower them to independently explore and analyze transcriptomic data using R and Bioconductor.
If you are totally new to Bioinformatics, start with this STEMCastÂŽ: From Transcriptomics to Therapeutics. Host: Ayush Noori, Researcher at Mass General, Undergrad at Harvard University
Learning Outcomes
Upon completion of this project, you will have learned:
1. How to navigate and extract data from the GEO database
You will gain proficiency in searching and exploring the GEO database to locate relevant gene expression datasets for your research or analysis.
2. Usage of R and Bioconductor for transcriptomic data analysis
You will gain practical experience in utilizing R, a popular programming language for statistical computing, and Bioconductor, a widely-used collection of R packages for transcriptomic data analysis.
3. Understanding the analysis workflow using GEO2R as a guide
By exploring and understanding the RScript tab of GEO2R, you will gain insights into the analysis workflow for differential gene expression analysis. You will learn how to interpret and apply the different steps of the analysis workflow, such as loading data, defining groups, fitting a linear model, computing statistics, and visualizing the results.
4. Replicating GEO2R analysis workflow in R and Bioconductor
You will learn how to replicate the initial part of the analysis workflow performed in GEO2R using R and Bioconductor. By writing your own R code to perform the analysis, you will gain a deeper understanding of the underlying concepts and have more flexibility in customizing the analysis to your specific needs.
By the end of this project, you will have gained the necessary skills and confidence to independently navigate the GEO database and preprocess gene expression data using R and Bioconductor. This hands-on experience will enhance your understanding of transcriptomics and equip you with valuable tools for future research and analysis in the field.
Steps and Tasks
Note: Steps required to do the independent R analysis are marked with an asterisk (*). You have the flexibility to choose your learning path. You can start by exploring GEO2R on its own, and then progress to replicate the initial analysis using R and Bioconductor. Alternatively, you can seamlessly transition between the two paths based on your preferred learning approach.
1. Explore the GEO database
Access the GEO database and familiarize yourself with the kind of information it provides. We will focus on the datasets studied in this paper: Construction and Analysis of a ceRNA Network Reveals Potential Prognostic Markers in Colorectal Cancer - PMC
2. Download Expression Data from GEO*
Downloading expression data from the GEO database provides you with valuable transcriptomic data that you can analyze to gain insights into differential gene expression. Download the expression data (CEL
files) for the dataset GSE32323. [NOTE: For GSE32323, specifically, only download the files with names of the format *chip_array_C#*
. The other files are for cell lines, not actual human samples.]
Once you are comfortable with the flow, you can download all the datasets mentioned in the paper. The data is available as a TAR file containing multiple CEL files. Windows users may require software such as WinRAR to extract the files.
In the GEO2R R script, data is typically read using the getGEO function from the GEOquery package, which directly downloads data from the GEO database. However, running the getGEO function on local machines can sometimes be challenging due to potential internet connectivity issues and the complexities of dependencies and package installations.
Additionally, getGEO
only gets the processed data. If you want to start analysis from the raw data (which we do in the TranscriptomicsQC project), youâll need to load the data as shown in Step 5.
3. Acquire and Understand Metadata
- Metadata provides additional information about each sample, such as the age of the patient from whom the sample came, tissue or cell type of sample origin, disease status of the patient, etc. It is used in limma analysis, a statistical method for differential gene expression (DGE) analysis.
- To study the metadata, you can do any of these steps:
- Click on
Analyze with GEO2R
and study the samples columns. - Download the âSeries Matix Filesâ, unzip, extract the metadata by copying onto a spreadsheet, store as a .csv file*
- Click on
If you are having trouble with this step, you can find an appropriate GSE32323 metadata file here. Feel free to use this file in the next steps or use as reference to better understand how to create your own metadata file.
4. Install Required Packages*
- Install and familiarize yourself with the following R packages by reading their documentation to understand their functions and use cases:
- Use
BiocManager::install("package_name")
to install: - Use
install.packages("package_name")
- Use
5. Load Data into R*
- Once the necessary packages are installed, load the data into your R environment using the appropriate functions, such as
ReadAffy()
for â.CELâ files andread.csv()
for metadata files.
6. Use GEO2R for Analysis
You may find it beneficial to review recordings of presentations by previous interns at this point:
- Presentation by Mentor ChainsÂŽ Leads to Domain Experts
- Highlighting the Accomplishments of High School Students
Use the GEO2R tool to perform a preliminary differential gene expression analysis. This will serve as a guide for the subsequent steps in R and Bioconductor analysis.
-
Go to the GSE32323 dataset page.
-
Click on
Analyze with GEO2R
to access the GEO2R tool. -
Create groups of interest:
- Sort the samples based on the âTissueâ column to make it easier to create the groups.
- For example, you can create two groups: ânormalâ and âcancerâ. Select the relevant samples and assign them to their respective groups.
NOTE: For samples marked as <something> cell line
, do not include in analysis. These samples are from various cell lines, not human samples.
- Click on
Analyze
to start the analysis with the selected groups and default options. - The output of the analysis will provide you with results such as:
- A table of differentially expressed genes, including statistics such as log-fold change, p-values, and adjusted p-values.
- Visualizations, such as volcano plots, heatmaps, and cluster plots, to aid in the interpretation of the results.
By following the above steps and selecting appropriate groups for comparison based on the experimental design of your study, you will obtain insights into the differential gene expression between the selected groups.
Please note that for the purpose of easier ramp up, you can initially use GEO2R alone to perform the analysis. However, the ultimate goal is to replicate this analysis workflow in R and Bioconductor, allowing for more flexibility and customization in the analysis process.
-
An explanation of the R Script code used for differential gene expression analysis can be found in the appendix. Replicating the code on your local machine will be very challenging, but we encourage you to give it a shot! Itâs a valuable learning experience to understand and troubleshoot error messages. However, we understand that it can be overwhelming for beginners. To make it more accessible, we have provided alternate beginner-friendly R code that you can use as a starting point.
-
Below is the result obtained from GEO2R using the two groups ânormalâ and âcancerâ as described above, with all options set to default. Please use the image to evaluate your progress with mastering the GEO2R steps.
- During the Mentor Webinar, we will discuss and analyze the outputs in detail. In the meantime, feel free to experiment with different groups and options in GEO2R. If you have any questions or specific areas you would like us to address during the webinar, please share them beforehand.
7. Replicate initial steps of the GEO2R Analysis Workflow in R*
For this step, please proceed to the next starter project: TranscriptomicsQC
Self Evaluation
-
Familiarity with GEO Database Learners should start feeling comfortable with utilizing the GEO database. They should have gained proficiency in searching and exploring the database to locate relevant gene expression datasets for their research or analysis.
-
Introduction to R and Bioconductor Learners should have gained familiarity with R and Bioconductor for transcriptomic data analysis. They should have practical skills in utilizing basic R functionalities and Bioconductor packages and be ready to take on the next steps of processing, analyzing, and visualizing gene expression data.
-
Ability to Utilize GEO2R Workflows Learners should be able to use the GEO2R workflows as a resource to search for further learning material and gain insights into the analysis workflow for differential gene expression analysis. They should be able to explore and understand the RScript tab of GEO2R, helping them to expand their knowledge and skills in gene expression analysis.
Resources and Learning Materials
Here are some suggested resources that will aid you in this project:
Bioinformatics Tools and Databases
- Bioconductor: Bioconductor provides tools for the analysis and comprehension of high-throughput transcriptomic data. Explore the Getting Started guide for a better understanding of its use.
- GEOquery: This Bioconductor package will help you interact with the GEO database.
- Gene Expression Omnibus (GEO) Database: The main resource from which youâll extract data for this project.
- GEO2R tutorial: A tool that allows users to compare two or more groups of Samples in a GEO Series to identify differentially expressed genes.
R Programming and Data Science
- R for Data Science: This online book offers a comprehensive introduction to R, with a specific focus on data science.
R - More Resources
Textbook-style
Cookbook for R
Reccommended sections: 1-3, 5, 6 (Factors, Data Frames), 8, 9
Data Visualization with R (Intro. to ggplot2)
Recommended sections: 1, 2, 3.2.3, 9.5, 10 (for aesthetics), 13
ggplot2 Manual
Recommended sections: 1-4, 7, 9-11 (for aesthetics), 12-14, 17 (for aesthetics), 18-20 (for advanced concepts)
Online Courses
edX: R programming
Start with R Programming Fundamentals
or Data Science: R Basics
Web-based/ Blog-style
These resources are a bit harder to navigate since the posts are in a date-organized manner, rather than a topic one. But if you prefer short, straight-to-the-point learning, these websites are very valuable!
RStudio Education
Learning paths, cheat sheets, and books curated by the RStudio team.
R-bloggers
Contains links to other resources based on what you want to do with R (e.g., data manipulation, visualization, machine learning, etc.)
R-coder
Contains some statistics-oriented tutorials on how to use R
Stats Projects
Unlike java and python, programs are not built using only R. R is a statistical programming language; it is a means to an end. Therefore, the most effective way to learn R is by working on projects that use it. These projects are mainly statistical in nature. Almost any statistics-related project that you can do in excel or Google spreadsheet or by hand can also be done in R. If you are unsure of where to start, hereâs an outline of a good stats project with some ideas: https://www2.stat.duke.edu/~jerry/sta103/proj2_instr.htm
R - Debugging and Troubleshooting
Types of Errors
-
Syntax errors : These errors are like âspelling and grammarâ errors in English. They concern the syntax of the programming language youâre using. For instance, forgetting to close a parenthesis when using a function.
sum(1,2
These are the easiest to handle since they output error messages telling you whatâs wrong, and most times, where itâs is wrong.
-
Semantic errors : These errors have to do with the meaning and context. They are often caused by type mismatches. For example, the
sum()
function expects numeric input. If a character or string is given, the function will return an error.sum(1, "a")
-
Logical errors : These errors have to do with the program flow. A classic example is found in learning PEMDAS, the order of operations in math equations.
1+2+3/4 = 3.75 1+(2+3)/4 = 2.25 (1+2+3)/4 = 1.5
These errors are often the hardest to identify and debug because they typically do not return an error message. Logical errors are identified by the programmer or user when a program or function has an unexpected output.
Debugging Tips
- Check spellings and capitalization
- Ensure that your variables contain the information you expect
- Check that all brackets are closed (ie. they come in pairs and have a start
(
and end)
) - Try reloading data and libraries
- If you keep trying and trying and nothing seems to work, try clearing your environment and rerunning your code
Debugging Resources
Common Errors and Debugging Techniques
https://medium.com/analytics-vidhya/common-errors-in-r-and-debugging-techniques-f11af3f1c7d3
Appendix
GEO2R R Script: Differential Gene Expression Analysis
Differential Expression Analysis with limma
NOTE: #
indicates comments in R
Version Information
- R version: 4.2.2
- Biobase version: 2.58.0
- GEOquery version: 2.66.0
- limma version: 3.54.0
- DESeq2 version: 1.38.3
Importing the Required Libraries
In this step, the necessary R libraries for the analysis are loaded. These libraries include GEOquery, limma, umap, and car. These libraries provide functions and tools for accessing gene expression data from the GEO database, performing differential expression analysis, and creating dimensionality reduction plots.
library(GEOquery) #BiocManager::install()
library(limma) #BiocManager::install()
library(umap) #install.packages()
library(car) #install.packages()
Loading and Preparing the Data
In this step, the gene expression dataset âGSE32323â is loaded from the GEO database using the getGEO function. The GSEMatrix = TRUE argument ensures that the processed data is loaded, while AnnotGPL = TRUE downloads the associated platform GPL, which is necessary for mapping probe IDs to gene symbols. If multiple platforms are used in the study, the script selects the one named âGPL570â. The dataset is then prepared by assigning proper column names and group membership for all samples.
NOTE: getGEO
only gets the processed data. If you want to start analysis from the raw data (which we do in the TranscriptomicsQC project, youâll need to use an alternate data loading method described in the linked post.
gset <- getGEO("GSE32323", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
# alternative if, else statement that looks much nicer: ifelse(condition, if TRUE, if FALSE)
# idx <- ifelse(length(gset) > 1, grep('GPL570', attr(gset, 'names')), 1)
gset <- gset[[idx]]
fvarLabels(gset) <- make.names(fvarLabels(gset))
gsms <- '1111111111111111100000000000000000XXXXXXXXXX'
sml <- strsplit(gsms, split="")[[1]]
# removes samples not included in analysis, i.e., cell lines
sel <- which(sml != 'X') # NOTE: you can use double or single quotes as long as both start and end are the same type
sml <- sml[sel]
gset <- gset[ ,sel]
Log2 Transformation
In this step, the gene expression data is log2 transformed. This transformation helps normalize the data, ensuring that the expression values are on a similar scale and reducing the impact of extreme values. The transformation is performed based on certain criteria, such as the distribution of expression values.
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) || (qx[6]-qx[1] > 50 && qx[2] > 0)
if (LogC) {
ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex)
}
Group Assignment and Model Design
In this step, the samples are assigned to different groups, such as ânormalâ and âcancerâ. A design matrix is created, which specifies the experimental design and the groups to be compared in the differential expression analysis.
gs <- factor(sml)
groups <- make.names(c("cancer","normal"))
levels(gs) <- groups
gset$group <- gs
design <- model.matrix(~group + 0, gset)
colnames(design) <- levels(gs)
Cleaning the Dataset
In this step, the dataset is cleaned by removing rows with missing values in the gene expression matrix. This ensures that only complete cases are used in the subsequent analysis.
gset <- gset[complete.cases(exprs(gset)), ]
Fitting the Linear Model
In this step, a linear model is fitted to the gene expression data using the lmFit function from the limma library. This model will be used to assess the significance of the difference in gene expression between the groups defined in the design matrix. See page 37, section 8.1 in the limma user guide for more information about the model.
fit <- lmFit(gset, design)
Setting Up Contrasts
In this step, contrasts of interest are set up to compare the gene expression between different groups. This involves recalculation of model coefficients. Contrasts are most useful for multivariate analyses when comparing more than two groups (e.g., comparing different cancer stages). When only comparing 2 groups, contrasts are less important for the actual analysis, but some functions will not work without them.
cts <- paste(groups[1], groups[2], sep="-")
cont.matrix <- makeContrasts(contrasts=cts, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
NOTE: Make sure that âcancerâ is the group of interest. You can do this with:
cont.matrix
If the output shows cancer as 1, youâre good! If not, change the first line to:
cts <- paste(groups[2], groups[1], sep="-")
Computing Statistics and Generating the Results
In this step, empirical Bayes statistics are computed using the eBayes function. This improves the estimation of gene-wise variances and facilitates the generation of a table of the top significant genes. The results are then written to a tab-separated file. The number
argument in topTable
specifies the maximum number of genes to output to tT
. Using number=250
will output results of the top 250 differentially expressed genes. Using number=Inf
(as youâll see in the next step) outputs results for all genes, regardless of differential expression values.
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
tT <- subset(tT, select=c('ID', 'adj.P.Val', 'P.Value', 't', 'B', 'logFC', 'Gene.symbol', 'Gene.title'))
write.table(tT, file=stdout(), row.names=F, sep="\t")
NOTE: file=stdout()
in the write.table()
function will output the object (tT
) to the code console. If you want to save as a file, change stdout()
to a file path.
Visualization and Quality Control
In this step, various visualizations and quality control checks are performed. These include generating a histogram of adjusted P-values, classifying the test results as âupâ regulated (1), âdownâ regulated (-1), or ânot expressed/not significantâ (0), and creating a Venn diagram and volcano plot of the test results. Will also generate a QQ-plot to check the inflation and false positive rate.
There are 3 important thresholding numbers when deciding which genes are differentially expressed (done with the decideTests
function here), one of which is not included in the code:
p.value
- unadjusted p-value associated with testing
- lower values = more significant and more stringent
- typical cutoff depends on analysis
adj.P.Value
- FDR adjusted p-value; this is the gold standard most used in publishing
- more stringent that
p.value
- lower values = more significant
- typical cutoff is 0.01 or 0.05
lfc
- stands for log fold change in gene expression between your groups
- high values = more stringent
- typical cutoff depends on analysis, but should be greater than 0
tT2 <- topTable(fit2, adjust="fdr", sort.by="B", number=Inf)
# histogram
hist(tT2$adj.P.Val, col = "grey", border = "white", xlab = "P-adj", ylab = "Number of genes", main = "P-adj value distribution")
# Venn Diagram
dT <- decideTests(fit2, adjust.method="fdr", p.value=0.05, lfc=0)
vennDiagram(dT, circle.col=palette())
vennDiagram(dT, include=c('up', 'down'), circle.col=palette()) # count Up and Down genes separately
# QQ plot
t.good <- which(!is.na(fit2$F))
qqt(fit2$t[t.good], fit2$df.total[t.good], main="Moderated t statistic")
# volcano plot
colnames(fit2)
ct <- 1
volcanoplot(fit2, coef=ct, main=colnames(fit2)[ct], pch=20,
highlight=length(which(dT[,ct]!=0)), names=rep('+', nrow(fit2)))
# mean-difference plot
plotMD(fit2, column=ct, status=dT[,ct], legend=F, pch=20, cex=1)
abline(h=0)
NOTE: I could not get the volcanoplot
function to work properly. We would expect a different color for up and down regulated genes (i.e., the two sides of the âvolcanoâ should be different colors). However, all my points were blueâŚ
General Expression Data Analysis
In this step, general analysis tasks are performed on the expression data. This includes creating a box-and-whisker plot to visualize the distribution of gene expression values, generating density plots for different groups, and creating a UMAP plot for dimensionality reduction. Learn more about UMAPâs in StatQuests basic concepts video.
NOTE: These visualizations would normally be used in QC and preprocessing of raw data. The inclusion here is less for analysis and more for information purposes.
CHALLENGE: Most of the functions here (par
, legend
, etc.) is related to basic R plots. As a challenge, try recreating these plots using ggplot2
and the ggplot2 reference documentation.
ex <- exprs(gset)
# boxplot
# dev.new(width=3+ncol(gset)/6, height=5) : will open up a separate plot window; comment out to view plot in RStudio's plot panel
ord <- order(gs)
palette(c("#1B9E77", "#7570B3", "#E7298A", "#E6AB02", "#D95F02", "#66A61E", "#A6761D", "#B32424", "#B324B3", "#666666"))
par(mar=c(7,4,2,1))
title <- paste("GSE32323", "/", annotation(gset), sep="")
boxplot(ex[,ord], boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=gs[ord])
legend("topleft", groups, fill=palette(), bty="n")
# dev.off() : used to "turn off" plotting, useful for R scripting; comment out to keep plot visible
# density plot
par(mar=c(4,4,2,1))
title <- paste("GSE32323", "/", annotation(gset), " value distribution", sep="")
plotDensities(ex, group=gs, main=title, legend ="topright")
# UMAP
ex <- na.omit(ex)
ex <- ex[!duplicated(ex), ]
ump <- umap(t(ex), n_neighbors = 15, random_state = 123)
par(mar=c(3,3,2,6), xpd=TRUE)
plot(ump$layout, main="UMAP plot, nbrs=15", xlab="", ylab="", col=gs, pch=20, cex=1.5)
legend("topright", inset=c(-0.15,0), legend=levels(gs), pch=20, col=1:nlevels(gs), title="Group", pt.cex=1.5)
# library("maptools") : library is retiring in 2023; pointLabel function moved to car library mentioned in first chunk
pointLabel(ump$layout, labels = rownames(ump$layout), method="SANN", cex=0.6)
Mean-Variance Trend
In this step, a mean-variance trend plot is generated to evaluate the relationship between the mean and variance of the expression values. This helps to determine if precision weights are needed in the analysis.
plotSA(fit2, main="Mean variance trend, GSE32323")
The code provided demonstrates a basic workflow for differential expression analysis using the limma package in R. It covers steps such as data loading and transformation, group assignment and model design, fitting the linear model, computing statistics, generating results, and visualizing the findings. The code can be adapted and customized for different datasets and research questions in bioinformatics.