Code Along for TranscriptomicsQC

Steps and Tasks


Note: If you are joining from the GEOQuest Combo Starter Project, please skip to Step 5.

1. Download Expression Data from GEO

  • Download the expression data (CEL files) for the dataset GSE32323. [NOTE: For GSE32323, specifically, only download the files with names of the format *chip_array_C#*. The other files are for cell lines, not actual human samples.]

    Once you are comfortable with the flow, you can download all the datasets mentioned in the paper. The data is available as a TAR file containing multiple CEL files. Windows users may require software such as WinRAR to extract the files.
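If you prefer to script the download, the GEOquery Bioconductor package can fetch and extract the raw archive for you. A minimal sketch, assuming GEOquery is installed; the TAR file name follows GEO's standard `GSEnnnn_RAW.tar` convention, and the output paths are my own choices:

```r
library(GEOquery)

# Downloads the supplementary archive into a folder named after the accession
getGEOSuppFiles("GSE32323")

# Extract the CEL files from the TAR archive
untar("GSE32323/GSE32323_RAW.tar", exdir = "GSE32323/CEL")

# Keep only the patient samples (names containing "chip_array_C"),
# since the remaining files come from cell lines
cel_files <- list.files("GSE32323/CEL",
                        pattern = "chip_array_C",
                        full.names = TRUE)
```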

2. Acquire and Understand Metadata

  • Metadata provides additional information about each sample, such as the age of the patient from whom the sample came, the tissue or cell type of sample origin, the disease status of the patient, etc. This metadata is used in limma analysis, a statistical method for differential gene expression (DGE) analysis.
  • To study the metadata, download the ‘Series Matrix Files’, unzip them, extract the metadata by copying it into a spreadsheet, and save it as a .csv file

If you are having trouble with this step, you can find an appropriate GSE32323 metadata file here. Feel free to use this file in the next steps or use as reference to better understand how to create your own metadata file.
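As an alternative to copying from the series matrix by hand, GEOquery can pull the sample metadata directly. A sketch, assuming GEOquery is installed; the output file name is my own choice, and the metadata columns you see will vary by dataset:

```r
library(GEOquery)

# getGEO() returns a list of ExpressionSets; GSE32323 has a single platform
gse <- getGEO("GSE32323", GSEMatrix = TRUE)[[1]]

# pData() extracts the phenotype (metadata) table, one row per sample (GSM)
metadata <- pData(gse)

write.csv(metadata, "GSE32323_metadata.csv", row.names = TRUE)
```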

3. Install Required Packages

  • Install and familiarize yourself with the following R packages by reading their documentation to understand their functions and use cases: affy, affyPLM, arrayQualityMetrics, limma, pheatmap, and ggplot2 (all of which appear in the later steps).
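Bioconductor packages are installed with BiocManager rather than install.packages(). One way to install the packages referenced in this project:

```r
# Install BiocManager itself (from CRAN) if it isn't already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Bioconductor packages used in the later steps
BiocManager::install(c("affy", "affyPLM", "arrayQualityMetrics", "limma"))

# pheatmap and ggplot2 live on CRAN, not Bioconductor
install.packages(c("pheatmap", "ggplot2"))
```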

4. Load Data into R

  • Once the necessary packages are installed, load the data into your R environment using the appropriate functions, such as
    • ReadAffy() for ‘.CEL’ files and
    • read.csv() for metadata files.
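Putting the two together, a minimal loading sketch; the directory and file names are assumptions carried over from the earlier steps, so adjust them to wherever you saved your files:

```r
library(affy)   # provides ReadAffy() and the AffyBatch class

# ReadAffy() reads every .CEL file in the given directory into one AffyBatch
raw_data <- ReadAffy(celfile.path = "GSE32323/CEL")

# Metadata file created in Step 2 (hypothetical path)
metadata <- read.csv("GSE32323_metadata.csv", row.names = 1)

raw_data   # printing the object summarizes the arrays that were read
```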

5. Quality Control and Data Pre-processing


Replicate the analysis workflow performed in GEO2R using R and Bioconductor. This involves writing R code to perform the following steps:
  • Conduct quality control (QC) using Bioconductor packages and data visualization
    • Inter-array: quality control between arrays
      • arrayQualityMetrics – distance between arrays (correlation heatmap), PCA, boxplots, density plots, RLE, NUSE
      • “by hand”
        • PCA
        • Boxplots
        • Relative Log Expression (RLE)
        • Normalized Unscaled Standard Error (NUSE)
    • Intra-array: quality control within arrays
      • arrayQualityMetrics – variance mean dependence, RNA degradation plot, MA plots, perfect matches and mismatches (PM, MM)
  • Background correction and normalization
  • Create data visualizations of quality control, background correction, and normalization.
    • You should create at least 1 visualization for each of the methods below:
      • PCA
      • Boxplot (affy)
      • RLE (affyPLM)
      • NUSE (affyPLM)
    • Create PCA plot before and after implementing background correction and normalization
  • Identify outlying samples using analysis from previous tasks
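The QC steps above can be sketched in a few calls. arrayQualityMetrics() writes a self-contained HTML report covering most of the inter- and intra-array diagnostics in one shot, while fitPLM() from affyPLM supports the "by hand" RLE and NUSE plots; `raw_data` is assumed to be the AffyBatch from Step 4, and the output directory name is my own choice:

```r
library(arrayQualityMetrics)
library(affyPLM)

# One-call QC report: boxplots, density plots, MA plots, distance heatmap,
# PCA, RLE, NUSE, etc., written as HTML to the output directory
arrayQualityMetrics(expressionset = raw_data,
                    outdir = "QC_raw",
                    force = TRUE,
                    do.logtransform = TRUE)

# "By hand" RLE and NUSE from a probe-level model fit
plm_fit <- fitPLM(raw_data)
RLE(plm_fit, main = "Relative Log Expression")
NUSE(plm_fit, main = "Normalized Unscaled Standard Error")
```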
Background Information for Step 5

What is QC and why is it done?

The purpose of QC is to check whether the data and any conclusions drawn from it can be considered reliable. It also helps us detect outliers and improve the signal-to-noise ratio. If our QC shows that there are many outliers in our data (i.e., a low signal-to-noise ratio), the data may be too variable to find any meaningful results.


What is background correction and normalization?

Background correction removes signal from nonspecific hybridization (i.e., signal emitted by things other than a sample hybridizing to a probe). Normalization corrects for systematic biases due to environmental variations such as different absorption rates, spatial heterogeneity in a microarray chip, etc. These two processes are sometimes performed separately, but the packages we’re using perform both in one function. There are three different background-correction methods, and therefore three possible functions you can use.
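For Affymetrix data, the methods most commonly offered are RMA, GCRMA, and MAS5, each with a single function that background-corrects, normalizes, and summarizes probes into probe-set values; whether these are the three your project intends is an assumption on my part. A sketch, with `raw_data` assumed to be the AffyBatch from Step 4:

```r
library(affy)    # provides rma() and mas5()
# library(gcrma) # gcrma() lives in the separate gcrma package

# Each call returns an ExpressionSet; pick ONE method.
eset <- rma(raw_data)        # RMA: values are already on the log2 scale
# eset <- gcrma(raw_data)    # GCRMA: also log2 scale

# MAS5 returns values on the raw scale, so log2-transform them yourself:
# eset_mas5 <- mas5(raw_data)
# exprs(eset_mas5) <- log2(exprs(eset_mas5))
```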


Visualizations

Boxplots

Boxplots are used to visualize the distribution of data. For transcriptomics, we want to visualize the distribution of probe intensities for each sample. So, this figure will contain multiple boxplots, one for each sample. Using this visualization, we can see whether any of the samples have a drastically different distribution than others. If a sample’s probe intensity distribution is substantially different from others, it may be an outlier.

  • You can use the boxplot function to create a boxplot in R
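A minimal sketch; affy provides a boxplot method for AffyBatch objects, which plots probe intensities on the log2 scale by default, and `raw_data`/`eset` are assumed from the earlier steps:

```r
library(affy)

# One box per array; las = 2 rotates the sample labels for readability
boxplot(raw_data, las = 2, main = "Raw probe intensities (log2)")

# After background correction and normalization, plot the ExpressionSet:
# boxplot(exprs(eset), las = 2, main = "Normalized intensities")
```

Compare the two figures: after normalization the boxes should line up at roughly the same median.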

Principal Component Analysis (PCA)

PCA is often used as a dimensionality reduction calculation. I’ve always found visualizing the process of PCA much more helpful than reading a written explanation. I highly recommend you check out the StatQuest videos listed in Resources.

Here, “dimension” refers to the number of attributes in our data (i.e., the probes). There are thousands of probes in our data, and it would be difficult (mostly impossible) to visualize thousands of axes on a single plot. But we want to visualize the variation between samples stratified by probe. By using PCA, we can reduce the number of dimensions necessary to explain the variation and reasonably visualize the data. The rationale behind PCA has its roots in matrix algebra. If you are new to matrix algebra but would like to learn more, check out Introduction to Matrix Algebra and Reduced Row Echelon Form.

  • Use the prcomp() function with center = FALSE and scale. = FALSE (note that prcomp()’s scaling argument is spelled scale., with a trailing dot)
  • Plot PC1 vs. PC2
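The two bullets above can be sketched as follows; `eset` is assumed to be the ExpressionSet from the normalization step. Note that prcomp() treats rows as observations, so the expression matrix is transposed to get one point per sample:

```r
expr <- exprs(eset)                       # probes x samples matrix
pca  <- prcomp(t(expr), center = FALSE, scale. = FALSE)

# Percent of total variance explained by each principal component
var_explained <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

plot(pca$x[, 1], pca$x[, 2],
     xlab = paste0("PC1 (", var_explained[1], "%)"),
     ylab = paste0("PC2 (", var_explained[2], "%)"),
     main = "PCA of samples")
text(pca$x[, 1], pca$x[, 2], labels = colnames(expr), pos = 3, cex = 0.7)
```

Repeat the plot on the raw data and on the normalized data, as the task asks, and look for samples that sit far from the rest.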

Correlation Heatmaps

Heatmaps add meaningful colors to data visualizations based on data values. Correlation measures how similarly two sets of values vary, and it ranges from -1 to 1.

  • A value closer to 0 indicates that the data are uncorrelated (dissimilar)
  • A value closer to 1 indicates that the data are very similar and vary in the same direction (i.e., positive correlation)
  • A value closer to -1 indicates that the data are very similar but vary in opposite directions (i.e., negative correlation)

Hierarchical Clustering (i.e., what is the branching structure in the margins of pheatmap() output?)

If you’re curious about the branching structure on the top and left side of your heatmap, I suggest you look into hierarchical clustering. Hierarchical clustering is an unsupervised learning technique that tries to cluster similar things together, with no prior knowledge, using dis/similarity metrics such as correlation. The branching structure can be interpreted much like a family tree (or a phylogenetic tree). Samples connected by “lower” branches are more similar than those joined by higher branches. You only need a basic understanding of algebra to understand most of the dis/similarity metric calculations!

  • Use cor() to calculate correlation between samples
  • Use factor() to group samples into Normal and Cancer
  • Use pheatmap or ggplot2 to plot a heatmap
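The three bullets above can be sketched as follows; `eset` and `metadata` are assumed from the earlier steps, and the way group labels are derived here (matching "cancer" in a hypothetical title column) is an assumption — inspect your own metadata file to find the column that distinguishes Normal from Cancer samples:

```r
library(pheatmap)

expr <- exprs(eset)               # normalized probes x samples matrix
sample_cor <- cor(expr)           # sample-by-sample Pearson correlation

# Hypothetical grouping: adapt the column name and pattern to your metadata
groups <- factor(ifelse(grepl("cancer", metadata$title, ignore.case = TRUE),
                        "Cancer", "Normal"))
annotation <- data.frame(Group = groups, row.names = colnames(expr))

# Heatmap with hierarchical-clustering dendrograms in the margins
pheatmap(sample_cor,
         annotation_col = annotation,
         main = "Sample-sample correlation")
```

A sample whose row and column are noticeably paler than the rest (low correlation with everything) is a candidate outlier.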


How to identify outliers

Once we have visualizations, we have to figure out what they mean and determine which samples, if any, will be considered outliers. These outliers should be removed before continuing the analysis. There may not be an initial consensus on which samples are outliers; this determination can be subjective, especially for edge cases.


Recording of Bioinformatics Mentor Webinar (06/20)


This recording covers:

  • Combo Starter Projects: GEOQuest, Bioconductor-Intro
  • Starter Project: TranscriptomicsQC