Overview of Module 3
- Data visualization, quality control, and outlier detection using Bioconductor packages
- Data pre-processing: background correction and normalization of data
- Data visualization and final identification of outliers
Before starting with the analysis
Please refer to previous modules for more information.
- Familiarize yourself with R programming and the RStudio environment (Module 1)
- Familiarize yourself with the data and the Gene Expression Omnibus (GEO) database (Module 2)
- Read the Guo paper (Module 1)
- Familiarize yourself with the concept of microarray technology (Module 1)
- Get a better understanding of the two datasets (Module 2), e.g.:
- What type of experiment/protocol was used to prepare the samples?
- What instrumentation was used?
- For which genome was the data generated?
1. Make sure your data is loaded in your R environment. If it’s not, follow the steps in Module 2.
2. Conduct quality control (QC) using Bioconductor packages and data visualization
The purpose of QC is to check whether the data and any conclusions drawn from it can be considered reliable. It also helps us detect outliers and improve the signal-to-noise ratio. If our QC shows that there are many outliers in our data (or a low signal-to-noise ratio), the data may be too variable to find meaningful results.
The various QC methods you should try with Bioconductor packages are listed below:
a. Produce a QC plot using simpleaffy
b. Create an HTML report using arrayQualityMetrics
c. Generate a 6-page report using affyQCReport
d. Plot boxplots of the RLE (Relative Log Expression) and NUSE (Normalized Unscaled Standard Error) scores using affyPLM
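To make the RLE idea in (d) more concrete, here is a minimal base-R sketch on simulated data. It is not how affyPLM produces its scores (affyPLM works from a fitted probe-level model); it only illustrates the definition of RLE as each probe set's log expression minus its median across arrays. The matrix `log_expr` and the artificially shifted `array6` are made up for illustration.

```r
set.seed(1)

# Hypothetical log2 expression matrix: 500 probe sets (rows) x 6 arrays (columns)
log_expr <- matrix(rnorm(500 * 6, mean = 8, sd = 1), nrow = 500,
                   dimnames = list(NULL, paste0("array", 1:6)))

# Simulate one problematic array by adding a noisy upward shift
log_expr[, "array6"] <- log_expr[, "array6"] + rnorm(500, mean = 1, sd = 1)

# RLE: subtract each probe set's median (across arrays) from its values
probe_medians <- apply(log_expr, 1, median)
rle_scores <- sweep(log_expr, 1, probe_medians)

# A well-behaved array has a box centered near 0 with a small spread
boxplot(rle_scores, main = "RLE (simulated data)", ylab = "Relative log expression")
abline(h = 0, lty = 2)
```

In this toy example the box for `array6` sits above zero and is wider than the others, which is the kind of pattern that would flag an array for a closer look.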
3. Background correction and normalization
Background correction removes signal from nonspecific hybridization (i.e., signal emitted by anything other than a sample hybridizing to its probe). Normalization corrects for systematic biases caused by environmental variations, such as different absorption rates and spatial heterogeneity across a microarray chip. These two processes are sometimes performed separately, but the packages we’re using perform both in one function. There are 3 different methods of background correction and therefore 3 possible functions you can use.
You only need to use one of the following methods:
You should spend some time researching these background correction methods to see how they differ, how they are similar, and whether there are cases where one might be preferred over another.
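As one concrete example of what normalization does, the sketch below implements quantile normalization in base R on simulated data. Quantile normalization is the normalization step used inside RMA; this is an illustration only, not a replacement for the package functions, and it leaves out background correction and probe summarization. The function name `quantile_normalize` and the simulated intensities are hypothetical.

```r
# Minimal quantile normalization sketch (illustrative, not a package function)
quantile_normalize <- function(x) {
  # x: numeric matrix with probes in rows and arrays in columns
  sorted <- apply(x, 2, sort)                      # sort each array's values
  target <- rowMeans(sorted)                       # mean of each quantile across arrays
  ranks  <- apply(x, 2, rank, ties.method = "first")
  normalized <- apply(ranks, 2, function(r) target[r])  # map ranks to target values
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(2)

# Hypothetical raw intensities for 3 arrays with different overall brightness
raw <- cbind(array1 = rlnorm(1000, meanlog = 6.0, sdlog = 1.0),
             array2 = rlnorm(1000, meanlog = 6.5, sdlog = 1.2),
             array3 = rlnorm(1000, meanlog = 5.8, sdlog = 0.9))

normalized <- quantile_normalize(log2(raw))

# Compare distributions before and after normalization
par(mfrow = c(1, 2))
boxplot(log2(raw), main = "Before")
boxplot(normalized, main = "After")
par(mfrow = c(1, 1))
```

After this step every array has the same distribution of values, so the "After" boxes line up; only the ordering of probes within each array is kept.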
4. (Batch Correction). Use the `ComBat()` function, your background corrected/normalized data, metadata, and the object from `model.matrix()` to correct for batch effects while preserving features of interest (which include cancer and normal labels)
Batch effects occur when you combine data from two different sources. Batch correction makes the data comparable between the different “batches” (i.e., data sets) so that we can use them in the same analysis.
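The pieces named in step 4 fit together roughly as in the sketch below. It runs on simulated data, so the object names (`meta`, `expr`, `mod`, `corrected`), the column names, and the size of the batch shift are all assumptions; substitute your own normalized expression matrix and metadata. `ComBat()` comes from the sva package, so that call is only run if sva is installed.

```r
set.seed(3)

# Hypothetical metadata: 12 samples, 2 batches, normal/cancer in each batch
meta <- data.frame(
  sample = paste0("s", 1:12),
  batch  = factor(rep(c("batch1", "batch2"), each = 6)),
  status = factor(rep(c("normal", "cancer"), times = 6),
                  levels = c("normal", "cancer"))
)

# Hypothetical normalized expression matrix: 200 genes (rows) x 12 samples (columns)
n_genes <- 200
expr <- matrix(rnorm(n_genes * 12, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), meta$sample))

# Simulate a batch effect: every batch2 sample is shifted upward
expr[, meta$batch == "batch2"] <- expr[, meta$batch == "batch2"] + 1.5

# Design matrix describing the biology ComBat should preserve (cancer vs. normal)
mod <- model.matrix(~ status, data = meta)

# Batch correction, only attempted when the sva package is available
if (requireNamespace("sva", quietly = TRUE)) {
  corrected <- sva::ComBat(dat = expr, batch = meta$batch, mod = mod)
  # Average expression per batch should now be much closer together
  print(tapply(colMeans(corrected), meta$batch, mean))
}
```

Note that the columns of the expression matrix and the rows of the metadata must be in the same sample order, otherwise the batch and status labels are applied to the wrong samples.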
5. Create data visualizations comparing your data at different steps. You should create 1 visualization for each of the 3 methods below using batch corrected (or background corrected/normalized) data. For at least one of the methods (I recommend PCA), create 3 (or 2) visualizations. These 3 plots should represent your raw data, data after background correction/normalization, and data after batch correction. For plotting data, you can use standard R plots or ggplot2.
b. PCA (Principal Component Analysis)
1. Use `prcomp()` with `scale=F` and `center=F`
2. Plot PC1 vs. PC2.
3. Desired output should look similar to the figure below.
c. Correlation Heatmaps
1. Use `1-cor()` to calculate the correlation-based dissimilarity (1 minus the correlation) between samples
2. Use `factor()` to group samples into Batch 1 Normal, Batch 1 Cancer, Batch 2 Normal, and Batch 2 Cancer
3. Plot a heatmap using `pheatmap()`
4. Desired output should look similar to the figure below.
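Items b and c can be sketched as follows. This is a minimal example on simulated data, not the expected solution: the expression matrix, sample names, and group assignment are all made up, and the plots use base R for the PCA. The `pheatmap()` call is only run if the pheatmap package is installed.

```r
set.seed(4)

# Hypothetical grouping of 12 samples into the four batch/status groups
group_levels <- c("Batch 1 Normal", "Batch 1 Cancer",
                  "Batch 2 Normal", "Batch 2 Cancer")
group <- factor(rep(group_levels, each = 3), levels = group_levels)

# Hypothetical expression matrix: 300 genes (rows) x 12 samples (columns)
expr <- matrix(rnorm(300 * 12, mean = 8), nrow = 300,
               dimnames = list(paste0("gene", 1:300), paste0("s", 1:12)))

# Simulate a gene-specific batch effect in the Batch 2 samples (columns 7-12)
expr[, 7:12] <- expr[, 7:12] + rnorm(300, sd = 1)

# b. PCA: prcomp() treats rows as observations, so transpose to samples x genes
pca <- prcomp(t(expr), center = FALSE, scale. = FALSE)
plot(pca$x[, "PC1"], pca$x[, "PC2"], col = as.integer(group), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA (simulated data)")
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)

# c. Correlation heatmap: cor() works on columns, i.e. sample vs. sample
dissim <- 1 - cor(expr)

# Annotation data frame: row names must match the sample names in dissim
annotation <- data.frame(Group = group, row.names = colnames(expr))

if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(dissim, annotation_col = annotation)
}
```

To compare processing steps, the same code can be run once per version of the data (raw, normalized, batch corrected) and the plots placed side by side.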
6. Using your QC plots from steps 2-3 and your visualizations from step 5, identify your potential outliers.
Once we have visualizations, we have to figure out what they mean. Determine which samples you will consider outliers, if any, and post your decisions below. This information will be used in Module 4. There will not be a consensus on which samples are outliers since everyone has different criteria and allowances. This is okay!
I’m going to leave the majority of this step up to you and your research skills, but feel free to ask questions below, and we can discuss further here and in our wrap-up meeting. As a jumping-off point, I’ve included some resources I’ve found helpful below.
Always submit your R code, too!
Data visualization I. 1 visualization for each method using your batch corrected (or background corrected/normalized) data
b. PCA (Principal Component Analysis)
c. Correlation Heatmaps
Data visualization II. 3 (or 2) visualizations for at least one of the methods above. These 3 (or 2) visualizations should represent your raw data, data after background correction and normalization, and data after batch correction.
Identification of outliers as a reply to this post.
All deliverables and code should be submitted on GitHub unless stated otherwise. For further instructions on how to submit deliverables, go here.
- If you want more guidance at any step in this process, a document with more guided instructions is linked below. Feel free to ask questions, too!
- Some of the methods require the data in different formats, so pay attention to the documentation. For example, the `arrayQualityMetrics()` function is looking for an AffyBatch object; the `prcomp()` function is looking for a numeric matrix or data frame
- The simpleaffy QC plot is also generated in the affyQCReport, so if you want to save time, you can skip the simpleaffy package. However, the QC plot in the affyQCReport is often squished and less readable since it is confined to one page
- If you have a plot in the plot window (lower right panel in RStudio by default) that is squished or doesn’t look “correct,” try exporting the plot with the dimensions set to 1000x1000, or something larger if that doesn’t work. Then view the saved PNG
- `arrayQualityMetrics()` typically takes a long time to run and uses a lot of memory. If you get an error message that mentions “can’t allocate vector of size…”, try increasing the memory you’re allocating to RStudio using the function:
- If you’re stuck on interpreting your QC visualizations, try googling the name of the plot or the package and feel free to ask questions below!
- RMA correction runs the fastest, so if you have a slow computer, I recommend you try this method first
- If you’re interested in how PCA actually works, I recommend you look it up on YouTube. StatQuest has some really nice videos explaining the process. PCA is a very popular method in biology research and you may see it pop up in papers (or something similar called t-SNE). A linear algebra background will help in understanding the math behind the methods
- If you’re curious about the branching structure on the top and left side of your heatmap, I suggest you look into hierarchical clustering. StatQuest has some good videos on this topic, too. Hierarchical clustering is an unsupervised learning technique that tries to cluster similar things together with no prior knowledge using (dis)similarity metrics, like (negative) correlation. The branching structure can be interpreted much like a family tree. Samples connected by “lower” branches have a “closer relationship” meaning they are more similar than those with further branches. You only need a basic understanding of math to understand most of the (dis)similarity metric calculations
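For the hierarchical clustering tip above, here is a small base-R sketch of how a tree is built from a correlation-based dissimilarity. The data are simulated, the sample names are made up, and the linkage method is an arbitrary choice for illustration; the clustering settings used by your heatmap function may differ, so check its documentation.

```r
set.seed(5)

# Hypothetical data: two underlying expression patterns across 100 genes
pattern_a <- rnorm(100)
pattern_b <- rnorm(100)

# Four noisy samples per pattern (columns A1-A4 and B1-B4)
expr <- cbind(sapply(1:4, function(i) pattern_a + rnorm(100, sd = 0.3)),
              sapply(1:4, function(i) pattern_b + rnorm(100, sd = 0.3)))
colnames(expr) <- c(paste0("A", 1:4), paste0("B", 1:4))

# Dissimilarity between samples: 1 minus the correlation
d <- as.dist(1 - cor(expr))

# Build the tree; samples joined by lower branches are more similar
tree <- hclust(d, method = "complete")
plot(tree, main = "Hierarchical clustering (simulated data)", xlab = "", sub = "")

# Cut the tree into two clusters and show which sample landed where
clusters <- cutree(tree, k = 2)
print(clusters)
```

In the resulting dendrogram the A samples join each other at a low height, the B samples do the same, and the two groups only merge near the top of the tree.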
- Guided instructions and tips for using packages – Level 1: Module 3 -- Guided Instructions
- Quality Control
- Batch Correction
- SVA Package Tutorial – Section 7