Level 1: Module 2 - Data Resources

  1. Overview of Module 2
    a. Instructions
    b. Tips
  2. Resources

Now that you’ve gotten some experience in R and programming, it’s time to introduce you to the data.

Overview of Module 2

  • GEO database
    • explore the data repository
    • search for studies relevant to hands-on projects
    • download data relevant to our project
  • Exploring and acquiring metadata
  • Prepare and explore packages for the next module


Instructions

  1. Explore the GEO database and what kind of information it provides

  2. Download expression data from GEO using the following guidelines:

    • Which dataset? – You have 2 options:

      • Download 2 datasets: GSE32323 and GSE8671
      • Download only one of the datasets linked above

      Using 2 datasets will be a bit more challenging during data pre-processing, but will introduce you to batch correction, which is an important step when working with data from multiple sources.

    • Which files?

      • GSE32323 – (http)(custom): non-cell line samples only, i.e., samples containing C#T-H in the name (the first 34 samples)

      • GSE8671 – (http)(custom): .CEL files only (64 samples)

  3. Unzip the .tar file(s)

  4. Acquire metadata. This data will be used in batch correction and in limma analysis (a statistical method for assessing the significance of differential gene expression (DGE))

  5. Install and load the following packages, and explore their documentation for more information on their functions and use cases

  6. Load data into R environment

    • ReadAffy()
    • read.csv()
  7. (Batch Correction) Merge the affy objects using the merge() function

  8. If you’ve completed the steps above and want to get experience in applying what you’ve learned in R during Module 1, take a look at the resources below, under “Intro to DGE Analysis Using DESeq2 (a Bioconductor package).” In the following modules, we’ll be implementing a similar pipeline using different packages and algorithms.

  9. Reply to the General Introduction and GitHub Account post with your information in preparation for Module 3.

  10. Reply to your self-assessment here following the same format


Tips

  • GEO database entries (aka Accession Displays) contain information about where the data came from ([Summary]), how data was collected ([Overall design] and [Platforms]), and sometimes link to papers in which the data was used ([Citation(s)]). This information can be helpful when trying to understand what the data is showing or being used for.

  • The GEO database provides a handy tool called GEO2R, which does DGE analysis right in your browser. If you scroll to the bottom of an Accession Display, you’ll see a button that says “Analyze with GEO2R”. The DGE analysis we’ll be doing in this project is slightly different, but if you want to get a feel for the kind of results you’ll see, try out this tool. You can also view the R code they use. Instructions on how to use GEO2R can be found here

  • If you’re having trouble unzipping the .tar files, see below:

    • For Windows users, you may need to download WinRAR (preferably x64)
    • For Mac users, double-click the .tar file
  • Your metadata should contain at least 2 columns (3 if doing batch correction):

    • sample name
    • feature: represents the tissue type of the sample - normal or cancer
    • (batch correction) BATCH number - representing which batch (i.e., dataset) the sample is from
  • If you’re having trouble with ReadAffy(), look at the documentation and pay attention to the possible arguments for the function. Do you need to specify anything related to .CEL files?

  • Bioconductor packages often have many resources for documentation which can be helpful in troubleshooting or learning how to use a package. These resources can be found under the “Documentation” heading on a package’s bioconductor page:

    • primers, introductions, vignettes, and most pdfs usually contain a guided walkthrough for a specific function of the package
    • reference manuals usually contain a list of functions along with a description, parameters, usages, etc. This information can also be found using the ?function_name command in R
  • Some packages also have their own website which offers a load of information on use cases, examples, niche functions, etc.
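As a rough illustration of the metadata layout described in the tips above, the sketch below builds a small table in R and round-trips it through a CSV file with read.csv(). The sample IDs, column names, and file name are placeholders chosen for illustration, not values taken from the datasets.

```r
# Hypothetical metadata table: one row per sample.
# The GSM-style IDs below are placeholders, not real accession numbers.
metadata <- data.frame(
  sample  = c("GSM_A1", "GSM_A2", "GSM_B1", "GSM_B2"),
  feature = c("normal", "cancer", "normal", "cancer"),  # tissue type
  batch   = c(1, 1, 2, 2),                              # which dataset the sample came from
  stringsAsFactors = FALSE
)

# Save to a CSV file (written to a temporary folder here) and read it back.
csv_path <- file.path(tempdir(), "metadata.csv")
write.csv(metadata, csv_path, row.names = FALSE)
metadata_in <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick sanity check: how many normal/cancer samples are in each batch?
print(table(metadata_in$feature, metadata_in$batch))
```

If you are using only one dataset, the batch column can be left out.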


When I open the GSE32323 tar file, a new coding file opens up in RStudio. Is this what’s supposed to happen after I complete step 3?

Hi @ttmath12,

Can you send a screenshot of what’s happening? I don’t think a new coding file should open up.

You may also find it useful to review the first part of the video linked below, where I help another student through the steps:

In the bottom left, there’s an RStudio icon. That’s the GSE32323.tar dataset file. On the top right is what pops up when I try opening the GSE dataset.

Hi @ttmath12,

You should be opening RStudio separately and creating a new R script, then reading in the data using the ReadAffy() function.

In the first part of the video I linked above, I am walking another student through how to do this. Could you please review that video and let me know if you have any questions? If you are running into errors, I will be holding office hours tomorrow, Saturday, at 10 a.m. EST and Sunday at 9 a.m. EST, and will be able to walk you through the steps in real time. The Zoom links can be found in the calendar event posts on this pathway hub.

I was able to access the GEO database link and download the first 34 supplementary .CEL.gz files that belong to the GSE32323 dataset. Then I went to RStudio and found the GSE32323 tar file. I clicked it and a popup appeared. Popup.pdf (255.0 KB) Which app should I choose?

Hi @ttmath12,

Before opening R, you should unzip the .tar file in your file browser. It looks like you’re on Windows, so you will need to use WinRAR to do this, which it looks like you have installed. After the files are unzipped, open R, load the affy package, and use the command ReadAffy(compress=TRUE, celfile.path=[path-to-files]), where [path-to-files] is the path to the folder that results from unzipping the .tar. I’m using a Mac, so for me this path was ~/Downloads/GSE32323_RAW.
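To make that step concrete, here is a minimal sketch that first checks the folder really contains .CEL or .CEL.gz files and only then calls ReadAffy(). Note that the argument is spelled celfile.path. The helper find_cel_files() is a hypothetical name introduced for this example, and the folder path is the Mac example from above, so adjust it to your own machine.

```r
# Hypothetical helper: list the .CEL / .CEL.gz files in a folder.
find_cel_files <- function(cel_dir) {
  list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
             ignore.case = TRUE, full.names = TRUE)
}

# Example folder from the post above; change this to your own unzipped folder.
cel_dir <- path.expand("~/Downloads/GSE32323_RAW")
cel_files <- find_cel_files(cel_dir)

if (length(cel_files) == 0) {
  # Nothing to read: the path is wrong or the archive was not unzipped here.
  message("No .CEL or .CEL.gz files found in: ", cel_dir)
} else if (requireNamespace("affy", quietly = TRUE)) {
  # Read every CEL file in the folder into one AffyBatch object.
  raw_data <- affy::ReadAffy(celfile.path = cel_dir, compress = TRUE)
} else {
  message("The affy package is not installed.")
}
```

Checking the file list first tends to turn a confusing ReadAffy() error into a plain "wrong folder" message.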

Does the Module 1-3 video talk about how to unzip the tar files?


Hi @ttmath12,

In the video linked above, we briefly talk about unzipping tar files, but there is no demo, as the person sharing her screen was only showing the RStudio window. However, there are plenty of videos on YouTube if you search “unzipping tar files windows.” On a Mac, you should be able to just double-click the tar file to unzip it.
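If you would rather not install a separate tool, R itself can unpack a .tar archive with the untar() function from the built-in utils package. A minimal sketch, assuming the archive is named GSE32323_RAW.tar and sits in your working directory (both are assumptions, and extract_tar() is a hypothetical helper name):

```r
# Hypothetical helper: unpack a .tar archive into a folder
# and return the paths of the extracted files.
extract_tar <- function(tar_path, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  untar(tar_path, exdir = out_dir)
  list.files(out_dir, full.names = TRUE)
}

tar_path <- "GSE32323_RAW.tar"  # assumed file name and location
out_dir  <- "GSE32323_RAW"      # assumed output folder

if (file.exists(tar_path)) {
  extracted <- extract_tar(tar_path, out_dir)
  print(head(basename(extracted)))
} else {
  message("Archive not found: ", tar_path)
}
```

The files inside the archive will usually still be .CEL.gz, which is fine for the later loading step.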

Hi Anya. I hope all is well with you. In step 7 (batch correction), I don’t understand what we should merge with what. Should we enter this code for GSE32323

dataset<-ReadAffy(celfile.path = "/maryam/Internship/Internship project/Data/GSE32323/")

and another similar function for GSE8671

dataset<-ReadAffy(celfile.path = "/maryam/Internship/Internship project/Data/GSE8671/")

and after reading these two files should we merge them together? I tried to watch the video you put above about these modules but I got this error: Sorry, the file you have requested does not exist.

Hi Maryam!

You only need to do batch correction (step 7) if you’re using multiple datasets in your analysis.

GSE32323 and GSE8671 are two separate datasets, both available on the GEO database. If you want to use both, go for it! You’ll need to download them separately, though.

If you only want to use 1 dataset, you can skip step 7.

Based on your error message, I’d guess you either didn’t download one of the datasets or your file path is incorrect – check for spelling mistakes and that the .CEL files actually exist in the specified folder.

I have downloaded the two datasets, and now I want to remove the batch effect. For this, I need to tell R which batch each sample belongs to, so I guess I should read the metadata file into R too. But I don’t know how to link the dataset files to the metadata file after that. I mean, I don’t know which functions I should use for this.

And the error that I mentioned was not an error in R. You sent a video in answer to ttmath12’s question below this module (Module 2), and you said you were walking another student through how to do this module. I tried to watch the video, but I got that error.

Ah, okay, I understand.

I updated the video link so it should be working now. Sorry about that!

You should create 1 metadata file for both datasets. This file should contain at least 3 columns - sample name, tissue (normal or cancer), and the batch (i.e., which dataset the sample is from). You can also add a column for combining the tissue and batch (i.e., batch1_normal, batch2_normal, etc.).

You don’t need to combine the GSE datasets and the metadata. In Module 3, you’ll normalize the GSE datasets, then use the ComBat function to perform batch correction. ComBat takes the GSE dataset and the metadata as separate arguments, so you don’t need to merge them. Just ensure that they have the same sample names.


Thanks for your quick reply. I got it. So what is the meaning of step 7, “(Batch Correction) Merge the affy objects using the merge() function”?

In step 7, you’re merging GSE32323 with GSE8671 so that you have a single GSE dataset moving forward.
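A rough sketch of what step 7 could look like, assuming both archives have been unzipped into the two folders named below (the folder paths and variable names are placeholders) and that the affy package is installed. The code only runs when those assumptions hold.

```r
# Placeholder folders: change these to wherever you unzipped each dataset.
dir_a <- "Data/GSE32323"
dir_b <- "Data/GSE8671"

if (requireNamespace("affy", quietly = TRUE) &&
    dir.exists(dir_a) && dir.exists(dir_b)) {
  library(affy)

  # Read each dataset into its own AffyBatch object
  # (use two different variable names so one does not overwrite the other).
  gse32323 <- ReadAffy(celfile.path = dir_a)
  gse8671  <- ReadAffy(celfile.path = dir_b)

  # Merge the two AffyBatch objects into a single one for later steps.
  combined <- merge(gse32323, gse8671)
} else {
  message("Skipping: affy is not installed or the data folders were not found.")
}
```

Merging this way only makes sense when both datasets were run on the same array type, so it is worth checking the [Platforms] entry of each Accession Display first.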


Sorry, what do you mean by your last sentence, “just ensure that they have the same sample names”? Do you mean the same group names? For example, should each dataset have both normal and cancer samples?

The samples are named GSM#. When creating an AffyBatch object using ReadAffy(), I think the function takes the file name as the “sample name.” So the “sample name” in your GSE dataset might be GSM#.CEL.gz, but in your metadata file it may be just GSM#. You should make sure it’s the same (either GSM# or GSM#.CEL.gz) for both datasets. The “sample names” in the GSE dataset are the column names.
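One possible way to line the names up is to strip the file extension from the dataset’s sample names and then compare them with the metadata. This is only a sketch: the GSM-style names below are made-up placeholders, and in practice the first vector would come from your own dataset’s column names.

```r
# Placeholder sample names as they might appear after reading CEL files.
cel_names <- c("GSM0001.CEL.gz", "GSM0002.CEL.gz", "GSM0003.cel")

# Placeholder sample names as they might appear in the metadata file.
metadata_samples <- c("GSM0003", "GSM0001", "GSM0002")

# Remove a trailing .CEL or .CEL.gz (case-insensitive) from each name.
clean_names <- sub("\\.cel(\\.gz)?$", "", cel_names, ignore.case = TRUE)

# Check that both sources describe exactly the same set of samples.
same_samples <- setequal(clean_names, metadata_samples)

# Row order to apply to the metadata so it follows the dataset's column order.
metadata_order <- match(clean_names, metadata_samples)

print(clean_names)
print(same_samples)
print(metadata_order)
```

Reordering the metadata rows with metadata_order keeps each sample paired with its own tissue and batch labels.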

I don’t understand what R means by “hex digits.”

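I think you might need to alter your path string so that all the \ backslashes are / forward slashes. In an R string, a backslash starts an escape sequence, so in a Windows path such as C:\Users\... R reads \U as the start of a Unicode escape and complains that it was used without hex digits.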
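A small sketch of the fix, using a made-up Windows path (the folder names are placeholders). Inside an R string each backslash has to be written as \\, so it is often simpler to convert the path to forward slashes, which R accepts on Windows.

```r
# A Windows-style path written with escaped backslashes (placeholder folders).
win_path <- "C:\\Users\\name\\Desktop\\GSE32323_RAW"

# Replace every backslash with a forward slash.
# fixed = TRUE treats the pattern as a literal backslash, not a regex.
fixed_path <- gsub("\\", "/", win_path, fixed = TRUE)

print(fixed_path)

# On R 4.0 or newer, a raw string is another option, e.g.
#   r"(C:\Users\name\Desktop\GSE32323_RAW)"
# which keeps the backslashes without any escaping.
```

Typing the path with forward slashes from the start, as in "C:/Users/name/Desktop/GSE32323_RAW", avoids the problem entirely.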

Hello Anya,

I am having trouble on this step, as I keep getting an error

Error: the following are not valid files:
    C/Users/Siya/Desktop/StemAway/Unit 2/GSE32323_RAW

Do you think you could help me? I have added screenshots of R Studio and the folder where the data is located.

Thank you, Sneha