Code Along for Combo: GEOQuest, Bioconductor-Intro

stemaway · June 2, 2024, 5:38pm

Steps and Tasks

Note: Steps required to do the independent R analysis are marked with an asterisk (*). You have the flexibility to choose your learning path. You can start by exploring GEO2R on its own, and then progress to replicate the initial analysis using R and Bioconductor. Alternatively, you can seamlessly transition between the two paths based on your preferred learning approach.

1. Explore the GEO database

Access the GEO database and familiarize yourself with the kind of information it provides. We will focus on the datasets studied in this paper: "Construction and Analysis of a ceRNA Network Reveals Potential Prognostic Markers in Colorectal Cancer" (available on PMC).

2. Download Expression Data from GEO*

Downloading expression data from the GEO database provides you with valuable transcriptomic data that you can analyze to gain insights into differential gene expression. Download the expression data (CEL files) for the dataset GSE32323. [NOTE: For GSE32323, specifically, only download the files with names of the format *chip_array_C#*. The other files are for cell lines, not actual human samples.]

Once you are comfortable with the flow, you can download all the datasets mentioned in the paper. The data is available as a TAR file containing multiple CEL files. Windows users may require software such as WinRAR to extract the files.
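The extraction can also be done from R. Below is a minimal sketch, assuming the archive was saved as GSE32323_RAW.tar in your working directory; the file name, the cel_files output folder, and the helper keep_patient_cel() are assumptions for illustration. It uses base R's untar(), so a separate extraction tool may not be needed, and it keeps only files whose names contain chip_array_C followed by a number, as described in the note above.

```r
# Assumed names: adjust to where you saved the download.
tar_path <- "GSE32323_RAW.tar"
out_dir  <- "cel_files"

# Hypothetical helper: keep only the patient samples, i.e. file names
# containing "chip_array_C" followed by one or more digits.
keep_patient_cel <- function(paths) {
  paths[grepl("chip_array_C[0-9]+", basename(paths))]
}

if (file.exists(tar_path)) {
  untar(tar_path, exdir = out_dir)   # base R function for unpacking TAR archives
  cel_files <- keep_patient_cel(list.files(out_dir, full.names = TRUE))
  print(length(cel_files))           # number of CEL files that were kept
}
```

Filtering by file name here means the cell line files never reach the later steps.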

In the GEO2R R script, data is typically read using the getGEO function from the GEOquery package, which directly downloads data from the GEO database. However, running the getGEO function on local machines can sometimes be challenging due to potential internet connectivity issues and the complexities of dependencies and package installations.

Additionally, getGEO only gets the processed data. If you want to start analysis from the raw data (which we do in the TranscriptomicsQC project), you’ll need to load the data as shown in Step 5.
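For reference, here is a minimal sketch of how processed data is typically fetched with getGEO. It assumes the GEOquery and Biobase packages are installed and that your internet connection is working; the variable names are illustrative only.

```r
library(GEOquery)  # provides getGEO()
library(Biobase)   # provides exprs() and pData()

# Download the processed series matrix for GSE32323 from GEO.
# The result is a list with one ExpressionSet per platform.
gse_list <- getGEO("GSE32323", GSEMatrix = TRUE)
eset <- gse_list[[1]]

expr_processed <- exprs(eset)   # probes x samples matrix of processed values
sample_info    <- pData(eset)   # per-sample metadata as a data frame

dim(expr_processed)
head(colnames(sample_info))
```

If the download fails, loading the files you downloaded manually is the fallback.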

3. Acquire and Understand Metadata

Metadata provides additional information about each sample, such as the age of the patient from whom the sample came, the tissue or cell type of origin, and the patient's disease status. It is used in limma analysis, a statistical method for differential gene expression (DGE) analysis, to define the groups being compared.

To study the metadata, use either of these approaches:
- Click on Analyze with GEO2R and study the sample columns.
- Download the ‘Series Matrix Files’, unzip them, copy the metadata into a spreadsheet, and save it as a .csv file.*

If you are having trouble with this step, you can find an appropriate GSE32323 metadata file here. Feel free to use this file in the next steps or use as reference to better understand how to create your own metadata file.
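If you would rather build the metadata file in R than in a spreadsheet, the sketch below shows one possible approach. It assumes the series matrix file follows the usual GEO layout, where each per-sample attribute is a tab-separated row starting with "!Sample_". The function name read_series_matrix_metadata() and both file names are assumptions for illustration.

```r
# Hypothetical helper: collect the "!Sample_" header rows of a GEO series
# matrix file into a data frame with one row per sample.
read_series_matrix_metadata <- function(path) {
  con <- gzfile(path, "rt")   # gzfile() also reads files that are already unzipped
  on.exit(close(con))
  lines <- readLines(con)

  sample_lines <- grep("^!Sample_", lines, value = TRUE)
  fields <- strsplit(sample_lines, "\t", fixed = TRUE)

  # The first field is the attribute name; the rest are one value per sample.
  keys <- sub("^!Sample_", "", vapply(fields, function(f) f[1], character(1)))
  values <- lapply(fields, function(f) gsub('^"|"$', "", f[-1]))
  names(values) <- make.unique(keys)  # repeated rows (e.g. characteristics_ch1) get suffixes

  stopifnot(length(unique(lengths(values))) == 1)  # every row must cover all samples
  data.frame(values, stringsAsFactors = FALSE, check.names = FALSE)
}

# Assumed file names; adjust to your download.
matrix_path <- "GSE32323_series_matrix.txt.gz"
if (file.exists(matrix_path)) {
  metadata <- read_series_matrix_metadata(matrix_path)
  write.csv(metadata, "GSE32323_metadata.csv", row.names = FALSE)
}
```

The resulting columns keep the raw GEO attribute names, so you will likely want to rename or split them (for example, separating "tissue: ..." values) before using them in the analysis.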

4. Install Required Packages*

Install and familiarize yourself with the following R packages by reading their documentation to understand their functions and use cases:
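- Use BiocManager::install("package_name") to install Bioconductor packages. The ones named in this guide are GEOquery, limma, and affy (which provides ReadAffy()).
- Use install.packages("package_name") to install CRAN packages: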
  - ggplot2
  - pheatmap
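As a sketch, the installation could look like this. The Bioconductor package list is an assumption based on the packages referenced elsewhere in this guide (GEOquery for getGEO, affy for ReadAffy, and limma), so it may be incomplete for your analysis.

```r
# BiocManager is itself installed from CRAN.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Bioconductor packages referenced in this guide (assumed, possibly incomplete list).
BiocManager::install(c("GEOquery", "affy", "limma"))

# CRAN packages listed above.
install.packages(c("ggplot2", "pheatmap"))
```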

5. Load Data into R*

Once the necessary packages are installed, load the data into your R environment using the appropriate functions, such as
- ReadAffy() for ‘.CEL’ files and
- read.csv() for metadata files.
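A minimal sketch of this loading step, assuming the CEL files sit in a folder named cel_files and the metadata was saved as GSE32323_metadata.csv (both names are assumptions; use your own paths):

```r
library(affy)  # provides ReadAffy()

# Assumed locations; adjust to your own files.
cel_dir       <- "cel_files"
metadata_path <- "GSE32323_metadata.csv"

# For GSE32323, keep only the patient samples (names containing chip_array_C<number>).
# If your files end in .gz and reading fails, try unzipping them first.
cel_files <- list.files(cel_dir, pattern = "chip_array_C[0-9]+", full.names = TRUE)

raw_data <- ReadAffy(filenames = cel_files)   # AffyBatch holding the raw probe intensities
metadata <- read.csv(metadata_path, stringsAsFactors = FALSE)

sampleNames(raw_data)   # one name per CEL file that was read
str(metadata)           # check column names and types
```

Before moving on, it is worth checking that the sample names in raw_data can be matched to rows of metadata, since the later analysis relies on that pairing.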

6. Use GEO2R for Analysis

You may find it beneficial to review recordings of presentations by previous interns at this point.

Use the GEO2R tool to perform a preliminary differential gene expression analysis. This will serve as a guide for the subsequent steps in R and Bioconductor analysis.

Go to the GSE32323 dataset page.
Click on Analyze with GEO2R to access the GEO2R tool.
Create groups of interest:
- Sort the samples based on the “Tissue” column to make it easier to create the groups.
- For example, you can create two groups: “normal” and “cancer”. Select the relevant samples and assign them to their respective groups.

NOTE: Do not include samples marked as <something> cell line in the analysis. These samples come from various cell lines, not actual human samples.

Click on Analyze to start the analysis with the selected groups and default options.
The output of the analysis will provide you with results such as:
- A table of differentially expressed genes, including statistics such as log-fold change, p-values, and adjusted p-values.
- Visualizations, such as volcano plots, heatmaps, and cluster plots, to aid in the interpretation of the results.

By following the above steps and selecting appropriate groups for comparison based on the experimental design of your study, you will obtain insights into the differential gene expression between the selected groups.
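To connect the GEO2R output to the R path, here is a minimal sketch of a two-group comparison with limma. It is not the exact script GEO2R generates, and it runs on a small simulated matrix rather than GSE32323, so the numbers it produces are meaningless. The purpose is only to show where the log-fold change, p-value, and adjusted p-value columns come from. All object names are illustrative.

```r
library(limma)

# Simulated stand-in for a normalized expression matrix: 100 probes x 6 samples.
set.seed(1)
expr <- matrix(rnorm(100 * 6), nrow = 100,
               dimnames = list(paste0("probe", 1:100), paste0("s", 1:6)))
expr[1:5, 4:6] <- expr[1:5, 4:6] + 3   # make the first 5 probes differ between groups

# Group labels, normally taken from the metadata file.
group <- factor(c("normal", "normal", "normal", "cancer", "cancer", "cancer"),
                levels = c("normal", "cancer"))

# One column per group, then a contrast for cancer versus normal.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
contrast <- makeContrasts(cancer - normal, levels = design)

fit  <- lmFit(expr, design)
fit2 <- eBayes(contrasts.fit(fit, contrast))

# Table with logFC, P.Value, adj.P.Val (Benjamini-Hochberg) and other statistics.
top <- topTable(fit2, coef = 1, number = 10, adjust.method = "BH")
print(top)
```

With real data, expr would be the normalized expression values and group would come from the metadata, after excluding the cell line samples.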

Recording of Bioinformatics Mentor Webinar (06/20)

This recording covers:

- Combo Starter Projects: GEOQuest, Bioconductor-Intro
- Starter Project: TranscriptomicsQC