Level 1: Module 3 - Quality Control

  1. Overview of Module 3
    a. Before Starting with the analysis
  2. Instructions
  3. Deliverables
  4. Tips
  5. Resources

Overview of Module 3

  • Data visualization, quality control, and outlier detection using Bioconductor packages
  • Data pre-processing: background correction and normalization of data
  • Data visualization and final identification of outliers

Before starting with the analysis

Please refer to previous modules for more information.

Technical Background
Biology Background
  • Read the Guo paper (Module 1)
  • Familiarize yourself with the concept of the microarray sequencing (Module 1)
  • Get a better understanding of the data from these two datasets, e.g. (Module 2)
    • What type of experiment/protocol was used to prepare the samples?
    • What instrumentation was used?
    • For which genome was the data generated?


1. Make sure your data is loaded in your R environment. If it’s not, follow steps in Module 2

2. Conduct quality control (QC) using Bioconductor packages and data visualization

The purpose of QC is to check whether the data and any conclusions drawn from it can be considered reliable. It also helps us detect outliers and improve the signal-to-noise ratio. If our QC shows that there are many outliers in our data (or a low signal-to-noise ratio), the data may be too variable to find meaningful results.

The various QC methods you should try with Bioconductor packages are listed below:
a. Produce a QC plot using simpleaffy
b. Create an HTML report using arrayQualityMetrics
c. Generate a 6-page report using affyQCReport
d. Plot boxplots of the RLE (Relative Log Expression) and NUSE (Normalized Unscaled Standard Error) scores using affyPLM

3. Background correction and normalization

Background correction removes signal from nonspecific hybridization (i.e., signal emitted by things other than a sample hybridizing to a probe). Normalization corrects for systematic biases due to environment variations such as different absorption rates, spatial heterogeneity in a microarray array chip, and others. These two processes are sometimes performed separately, but the packages we’re using perform these two methods in one function. There are 3 different methods of background correction and therefore 3 possible functions you can use.

You only need to use one of the following methods:

a. mas5()
b. rma()
c. gcrma()

You should spend some time researching these background correction methods to see how they’re different or similar and if there’s certain cases where one might be preferred over another.

4. (Batch Correction). Use the ComBat() function, your background corrected/normalized data, metadata, and the object from model.matrix() to correct for batch effects while preserving features of interest (which include cancer and normal labels)

Batch effects occur when you combine data from two different sources. Batch correction makes the data comparable between the different “batches” (i.e., data sets) so that we can use them in the same analysis.

5. Create data visualizations comparing your data at different steps. You should create 1 visualization for each of the 3 methods below using batch corrected (or background corrected/normalized data). For at least one of the methods (I recommend PCA) create 3 (or 2) visualizations. These 3 plots should represent your raw data, data after background correction/normalization, and data after batch correction. For plotting data, you can use standard R plots or ggplot2

a. Boxplots
b. PCA (Principal Component Analysis)

1. Use `prcomp()` with `scale=F` and `center=F`
2. Plot PC1 vs. PC2.
3. Desired output should look similar to the figure below.

c. Correlation Heatmaps

1. Use `1-cor()` to calculate the negative correlation between samples
2. Use `factor()` to group samples into Batch 1 Normal, Batch 1 Cancer, Batch 2 Normal, and Batch 2 Cancer
3. Plot a heatmap using `pheatmap()`
4. Desired output should look similar to the figure below.

6. Using your QC plots from steps 2-3 and your visualizations from step 6, identify your potential outliers.

Once we have visualizations, we have to figure out what they mean. Determine which samples you will consider outliers, if any, and post your decisions below. This information will be used in Module 4. There will not be a consensus on which samples are outliers since everyone has different criteria and allowances. This is okay!

I’m going to leave the majority of this step up to you and your research skills, but feel free to ask questions below and we can discuss further below and in our wrap-up meeting. As a jumping off point, I’ve included some resources I’ve found helpful below.

Module Deliverables:

Always submit your R code, too!

  1. QC results.

  2. Data visualization I. 1 visualization for each method using your batch corrected (or background corrected/normalized)
    a. Boxplots
    b. PCA (Principal Component Analysis)
    c. Correlation Heatmaps

  3. Data visualization II. 3 (or 2) visualizations for at least one of the methods above. These 3 (or 2) visualizations should represent your raw data, data after background correction and normalization, and data after batch correction.

  4. Identification of outliers as a reply to this post.

All deliverables and code should be submitted on GitHub unless stated otherwise. For further instructions on how to submit deliverables, go here.


  • If you want more guidance at any step in this process, a document with more guided instructions is linked below. Feel free to ask questions, too!
  • Some of the methods require the data in different formats so pay attention to the documentation. For example, the arrayQualityMetrics() function is looking for an AffyBatch object; the prcomp() function is looking for a data frame
  • The simpleaffy QC plot is also generated in the affyQCReport, so if you want to save time, you can skip the simpleaffy package. However, the QC plot in the affyQCReport is often squished and less readable since it is confined to one page
  • If you have a plot in the plot window (lower right panel in RStudio by default) that is squished or doesn’t look “correct,” try Exporting the plot and setting the dimensions to 1000x1000 or something larger if that doesn’t work. Then view the saved png
  • arrayQualityMetrics() typically takes a long time to run and uses a lot of memory. If you get an error message that mentions “can’t allocate vector of size…”, try increasing the memory you’re allocating to RStudio using the function: memory.limit()
  • If you’re stuck on interpreting your QC visualizations, try googling the name of the plot or the package and feel free to ask questions below!
  • RMA correction runs the fastest, so if you have a slow computer, I recommend you try this method first
  • If you’re interested in how PCA actually works, I recommend you look it up on YouTube. StatQuest has some really nice videos explaining the process. PCA is a very popular method in biology research and you may see it pop up in papers (or something similar called t-SNE). A linear algebra background will help in understanding the math behind the methods
  • If you’re curious about the branching structure on the top and left side of your heatmap, I suggest you look into hierarchical clustering. StatQuest has some good videos on this topic, too. Hierarchical clustering is an unsupervised learning technique that tries to cluster similar things together with no prior knowledge using (dis)similarity metrics, like (negative) correlation. The branching structure can be interpreted much like a family tree. Samples connected by “lower” branches have a “closer relationship” meaning they are more similar than those with further branches. You only need a basic understanding of math to understand most of the (dis)similarity metric calculations


I’ve been trying to load the data file using simpleaffy function, but it keeps giving me a warning that “Package ‘simpleaffy’ is deprecated and will be removed from Bioconductor version 3.13.” In addition, it says that “In file(file, “rt”) : cannot open file ‘./GSE32323_RAW’: Permission denied”

I am not sure why the simpleaffy package is not working, so if anyone has any idea on how to resolve this issue, advice and instruction would be greatly appreciated.

Hi Aditi,

The warning is just a warning. It won’t impact your results. In terms of the permission denied error, did you unzip your download from GEO? If so, check the output of getwd(). Is GSE32323_RAW in that folder?

Yes the file is in the getwd() folder. I’ve also unzipped all the files in the folder.

Oh there is another message that pops up when I load the simpleaffy library, “Package ‘simpleaffy’ is deprecated and will be removed from Bioconductor version 3.13.”

You can ignore the message

Package ‘simpleaffy’ is deprecated and will be removed from Bioconductor version 3.13.

It’s just a warning.

I just realized - have you done step 6 in Module 2? The input for simpleaffy() should be the result from ReadAffy(). Not the GSE files.

I think I see the problem. I’ve imported the data using read.excel() and not with ReadAffy(). I’ll give it a go and update you on if it succeeds.

Update: It seems to be working now. I will continue to test it out and see if any other problems arise. Thank you Anya! This made me realize that I’ve made too many mistakes in Module 2.

Hi again, I am having difficulty using the model.matrix function and I was wondering what exactly would the arguments be. I’ve gone through all the resources and nothing seems to be working. My feature column in metadata is called CN, so I tried to run

mod <- model.matrix(CN, metadata)

but it seems that the first argument needs to be an object?
So then I tried using raw which is the output of ReadAffy, but the same error pops up, “Error in terms.default(object, data = data) :
no terms component nor attribute.”

Hi Aditi,

You were right the first time! For the model.matrix(), the first argument’s data should be the CN column - identifying which samples are cancer and which are normal. However, the function is looking for a factor, not a vector.

You can factorize a vector by adding ~ in front. So the command will look like:

mod <- model.matrix(~CN, metadata)

A little bit more explanation: Columns in data frames are vectors and a data frame is a collection of vectors of the same length. A factor is a special kind of vector where the data are strings and they are grouped in what are called “levels.” Levels are the possible values of the strings in the factor. The factorized CN column should have 2 levels, “normal” and “cancer” (or something similar).

To access the data from the AffyBatch object you can use exprs(gse) where gse is the AffyBatch object. However, when plotting a boxplot of the raw data, you can use boxplot(gse). Using boxplot(exprs(gse)) will take a very long time and a lot of memory. So I recommend using boxplot(gse).

For the PCA and the heatmap, you will need to use exprs(gse).

@Jayni_Ram @ShannonSaad @bcab

1 Like

Hi @anya,

I hope your well! Do we only need to use the ComBat() method if we are using two datasets?

Hi Shannon,

I’m doing well, hope you are too!
Yes, you only need to do ComBat() if you’re using 2 datasets. Otherwise, you can skip it.

1 Like

Do you know how to create an affybatch object? In the image below, there seems to be no documentation to write when creating the affybatch object.

The function to create an affybatch object is part of the affy package (not the affyQCReport package). Here’s a link to the affy bioconductor page: Bioconductor - affy

The function is ReadAffy. You can find more information about its parameters on page 52 of the manual: https://www.bioconductor.org/packages/release/bioc/manuals/affy/man/affy.pdf

Let me know if you have any other questions!

Is there a specific file we input for the filenames paramenter in the ReadAffy function? Capture908

Nope. The only parameters you need are compress and celfile.path.


Do you know how to convert the gz files into cel files? The image with all those cel.gz files are the best I can find to input in the celfile.path parameter.

To unzip a .gz file (i.e., convert .gz into CEL files), you can use gzip -d filename.gz.

However, you shouldn’t need to! A .gz file (or gzipped file) indicates that the file is compressed meaning it takes up less space. This is especially useful for larger files and large genomic datasets! If you change the compress parameter to be equal to TRUE, you don’t need to unzip the files!

Also, check that the celfile.path is the path to the folder containing the CEL files, not the actual CEL files. It looks like it might be C/Users/matth/Desktop/Intro-to-R in your case.


The only time I remember downloading a cel file is when downloading data from the GSE32323 dataset. I wonder how to convert these cel.gz files into cel files.

Hi matt, you can use gzip -d filename.CEL.gz to unzip the files.

I’ve already used ReadAffy to create an AffyBatch object, and I created a QC analysis report using the function qc. Do you know how to store the resulting data into a QC Stats object?