🔸 TranscriptomicsQC

:dna: Bioinformatics Code-Along

Thursday July 27 at 4:00 PM PST/7:00 PM EST

Zoom link: Launch Meeting - Zoom


Objective

The objective of the TranscriptomicsQC starter project is to provide learners with a robust hands-on experience in performing quality control (QC) of transcriptomics data. This project focuses on microarray data quality control for beginners, with the option to include RNA-seq quality control for advanced users.

Through the step-by-step project activities, learners will develop a strong proficiency in evaluating, correcting errors, and filtering probes/reads, ensuring the quality of the data. This project aims to provide students with a comprehensive foundation in bioinformatics, emphasizing the critical aspect of quality control. By the end of the project, learners will be empowered to start independently exploring and analyzing genomic data using the powerful tools of R and Bioconductor.



:bulb: If you are totally new to Bioinformatics, please start with the GEOQuest Combo Starter Project



Learning Outcomes

Upon completion of the TranscriptomicsQC starter project, you will have learned:

1. Importance of quality control in transcriptomics data analysis

Understand the significance of quality control in ensuring the reliability and accuracy of transcriptomics data. And recognize the impact of data quality on downstream analyses and interpretation.

2. Usage of R and Bioconductor for genomic data analysis

You will gain practical experience in utilizing R, a popular programming language for statistical computing, and Bioconductor, a widely-used collection of R packages for genomic data analysis.

3. Proficiency in data pre-processing and QC

  • Develop hands-on experience in using Bioconductor packages and data visualization tools for data pre-processing and quality control.
  • Acquire skills in performing data pre-processing steps such as background correction and normalization.
  • Learn to generate informative visualizations, including PCA plots, boxplots, and RLE plots, to assess data quality.

By completing the TranscriptomicsQC project, you will develop a strong foundation in transcriptomics data quality control, enabling you to ensure the reliability and integrity of your data, make informed decisions in subsequent analyses, and replicate the essential workflow using R and Bioconductor.


Steps and Tasks


Note: If you are joining from the GEOQuest Combo Starter Project, please skip to Step 5.

1. Download Expression Data from GEO

  • Download the expression data (CEL files) for the dataset GSE32323. [NOTE: For GSE32323, specifically, only download the files with names of the format *chip_array_C#*. The other files are for cell lines, not actual human samples.]

    Once you are comfortable with the flow, you can download all the datasets mentioned in the paper. The data is available as a TAR file containing multiple CEL files. Windows users may require software such as WinRAR to extract the files.

2. Acquire and Understand Metadata

  • Metadata provides additional information about each sample, such as the age of the patient from whom the sample came, tissue or cell type of sample origin, disease status of the patient, etc. It is used in limma analysis, a statistical method for differential gene expression (DGE) analysis.
  • To study the metadata, download the ‘Series Matix Files’, unzip, extract the metadata by copying onto a spreadsheet, store as a .csv file*

If you are having trouble with this step, you can find an appropriate GSE32323 metadata file here. Feel free to use this file in the next steps or use as reference to better understand how to create your own metadata file.

3. Install Required Packages

  • Install and familiarize yourself with the following R packages by reading their documentation to understand their functions and use cases:

4. Load Data into R

  • Once the necessary packages are installed, load the data into your R environment using the appropriate functions, such as
    • ReadAffy() for ‘.CEL’ files and
    • read.csv() for metadata files.

5. Quality Control and Data Pre-processing


Replicate the analysis workflow performed in GEO2R using R and Bioconductor. This involves writing R code to perform the following steps:
  • Conduct quality control (QC) using Bioconductor packages and data visualization
    • Inter-array: quality control between arrays
      • arrayQualityMetrics – distance between arrays (correlation heatmap), PCA, boxplots, density plots, RLE, NUSE
      • “by hand”
        • PCA
        • Boxplots
        • Relative Log Expression (RLE)
        • Normalized Unscaled Standard Error (NUSE)
    • Intra-array: quality control within arrays
      • arrayQualityMetrics – variance mean dependence, RNA degradation plot, MA plots, perfect matches and mismatches (PM, MM)
  • Background correction and normalization
  • Create data visualizations of quality control, background correction, and normalization.
    • You should create at least 1 visualization for each of the methods below:
      • PCA
      • Boxplot (affy)
      • RLE (affyPLM)
      • NUSE (affyPLM)
    • Create PCA plot before and after implementing background correction and normalization
  • Identify outlying samples using analysis from previous tasks
Background Information for Step 5

What is QC and why is it done?

The purpose of QC is to check whether the data and any conclusions drawn from it can be considered reliable. It also helps us detect outliers and improve the signal-to-noise ratio. If our QC shows that there are many outliers in our data (i.e., a low signal-to-noise ratio), the data may be too variable to find any meaningful results.


What is background correction and normalization?

Background correction removes signal from nonspecific hybridization (i.e., signal emitted by things other than a sample hybridizing to a probe). Normalization corrects for systematic biases due to environment variations such as different absorption rates, spatial heterogeneity in a microarray chip, etc. These two processes are sometimes performed separately, but the packages we’re using perform these two methods in one function. There are 3 different methods of background correction and therefore 3 possible functions you can use:


Visualizations

Boxplots

Boxplots are used to visualize the distribution of data. For transcriptomics, we want to visualize the distribution of probe intensities for each sample. So, this figure will contain multiple boxplots, one for each sample. Using this visualization, we can see whether any of the samples have a drastically different distribution than others. If a sample’s probe intensity distribution is substantially different from others, it may be an outlier.

  • You can use the boxplot function to create a boxplot in R

Principle Component Analysis (PCA)

PCA is often used as a dimensionality reduction calculation. I’ve always found visualizing the process of PCA much more helpful than reading a written explanation. I highly recommend you check out the StatQuest videos listed in Resources.

Here, “dimension” refers to the number of attributes in our data (i.e., the probes). There are thousands of probes in our data and it would be difficult (mostly impossible) to visualize thousands of axes on a single plot. But we want to visualize the variation between samples stratified by probe. By using PCA, we can reduce the dimensions necessary to explain the variation and can reasonably visualize the data. The rationale behind PCA has its roots in matrix algebra. If you are new to matrix algebra, but would like to learn more, check out Introduction to Matrix Algebra and Reduced Row Echelon Form

  • Use the prcomp() function with scale=FALSE and center=FALSE
  • Plot PC1 vs. PC2

Correlation Heatmaps

Heatmaps add meaningful colors to data visualizations based on data values. Correlation is a measure of how similar 2 data points are and values range from -1 to 1.

  • A value closer to 0 indicates that data are dissimilar
  • A value closer to 1 indicates that data are very similar and vary in the same direction (i.e., positive correlation)
  • A value closer to -1 indicates that data are very similar and vary in the opposite direction (i.e., negative correlation)

Hierarchical Clustering (i.e., what is the branching structure in the margins of pheatmap() output?)

If you’re curious about the branching structure on the top and left side of your heatmap, I suggest you look into hierarchical clustering. Hierarchical clustering is an unsupervised learning technique that tries to cluster similar things together with no prior knowledge using dis/similarity metrics, such as correlation. The branching structure can be interpreted much like a family tree (or a phylogeny tree). Samples connected by “lower” branches are more similar than those with further branches. You only need a basic understanding of algebra to understand most of the dis/similarity metric calculations!

  • Use cor() to calculate correlation between samples
  • Use factor() to group samples into Normal and Cancer
  • Use pheatmap or ggplot2 to plot a heatmap


How to identify outliers

Once we have visualizations, we have to figure out what they mean and determine which samples will be considered outliers, if any. These outliers should be removed before continuing analysis. There may not be an initial consensus on which samples are outliers. This determination can be subjective especially for edge cases.


Resources and Learning Materials

Here are some suggested resources that will aid you in this project:

Quality Control

Data Visualization

Boxplots

PCA (Principal Component Analysis)

Heatmaps


Appendix


Feeling R stuck? Check out these helpful tips!
  • Visualization Tips

    • By default pheatmap implements a blue-to-red palette, but you can change this if you want! Since we are looking at correlation and the values on the two extremes are meaningful, we want a “diverging” color palette (blue-to-red is also a diverging palette).
    • For outlier detection, we are interested in samples that are dissimilar. But samples that are dissimilar have correlation close to 0, between -1 and 1. Most diverging palettes color the middle values as white, grey, or something less visually striking. If you want to make the outliers more noticeable, use a dissimilarity metric, such as 1-correlation.
    • For a visual explanation, check out StatQuest’s video on Drawing and Interpreting Heatmaps
  • Some methods require the data in different formats, so pay attention to the documentation. For example, arrayQualityMetrics() function is looking for an AffyBatch object; the prcomp() function is looking for a data frame. You can determine an objects type by looking for the variable in the “Environment” tab in RStudio or by using the function class(object_name)

  • If you have a plot in the plot window that is squished or doesn’t look right, you can adjust the window size by dragging the edges. Or you can export the plot and set the dimensions of the output to larger values. This will create a PNG or PDF in your files that you can then open on your computer.

  • arrayQualityMetrics() typically takes a long time to run and uses a lot of memory. If you are on windows and get an error message that mentions “can’t allocate vector of size…”, try increasing the memory you’re allocating to Rsudio using the function memory.limit(). Note: you cannot allocate more memory than the amount of RAM on your computer.

  • If you’re stuck on interpreting you QC visualizations

    • Try googling the name of the plot or package.
    • The arrayQualityMetrics() HTML output includes a little explanation of each plot and its interpretation.
    • Feel free to ask questions!
  • RMA correction runs the fastest with the lease memory requirements, so if you have a slow computer or not a lot of RAM, I recommend you use this method.

Access the Code-Along for this Skill-Builder Project to join discussions, utilize the t3 AI Mentor, and more.