DGE Analysis - Sona Popat

Progress Summary - Differential Gene Expression Analysis - Sona Popat

Link: https://docs.google.com/presentation/d/1xYxYrk5cLbaehOvUmVIgxFrn2G73fQN5sMxWzeH5TOc/edit?usp=sharing

Contribution: Wrote the code to generate the plots

Challenges Faced:

  • I struggled with the collapseRows() function, as I was unfamiliar with the arguments it needs, and varying the inputs only led to more errors! After trying to troubleshoot this myself using the function's help documentation and online forums, I resolved it by asking for advice in the team troubleshooting channel. Understanding the inputs, and the error I was getting each time, helped me work out two fixes: first I needed to convert the expression data columns from character to numeric, then I needed to exclude the columns of the data frame that contained character data (e.g. the SYMBOL and ProbeID columns). The working code is shown below.

      library(WGCNA)  # Provides collapseRows()

      ## Collapse the rows where multiple different ProbeIDs map to the same gene, keeping the ProbeID with the highest mean
      datET <- expression_clean[, -c(1, 2, 3)]  # Remove character columns (first 3 of the data frame) so only numeric data is left and the mean of each row can be calculated
      rowID <- expression_clean[, 1]     # Unique identifier for each row (ProbeID); these should agree with rownames(datET)
      rowGroup <- expression_clean[, 2]  # Gene each probe maps to (SYMBOL); rows that share a SYMBOL are collapsed into one
      expression_clean_2 <- collapseRows(datET,
                                         rowGroup = rowGroup,
                                         rowID = rowID,
                                         method = "MaxMean")
      # collapseRows() returns a list; the collapsed expression data is in expression_clean_2$datETcollapsed
  • My heatmap also did not initially appear as expected: every box was dark blue (indicating a value of 0) instead of showing a range of colours for varying gene expression levels. Troubleshooting this in the sub-group chat and office hours led us to an error in the model matrix set-up. There were 2 possible types of model matrix (one with 2 columns, one each for the cancer and normal samples, or one with just 1 column showing both), and each required different corresponding code. Once this underlying issue was identified, the fix was very simple: it was just a question of making sure the type of model matrix and the code matched!
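
The character-to-numeric conversion described in the first challenge can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the project: expression_toy, its column names, and all values are made up to stand in for expression_clean, and only base R is used.

```r
# Hypothetical toy data frame standing in for expression_clean:
# two identifier columns plus expression values that were read in as text.
expression_toy <- data.frame(
  ProbeID = c("1007_s_at", "1053_at", "117_at"),
  SYMBOL  = c("DDR1", "RFC2", "HSPA6"),
  GSM1    = c("7.1", "5.3", "4.8"),
  GSM2    = c("7.4", "5.0", "4.9"),
  stringsAsFactors = FALSE
)

# Identify the expression columns (everything except the identifier columns).
expr_cols <- setdiff(names(expression_toy), c("ProbeID", "SYMBOL"))

# Convert each expression column from character to numeric in place.
expression_toy[expr_cols] <- lapply(expression_toy[expr_cols], as.numeric)

# Keep only the numeric columns; this is the part that would be passed on as datET.
datET_toy <- expression_toy[, expr_cols]
rownames(datET_toy) <- expression_toy$ProbeID

# Check that every remaining column is numeric before calculating row means.
stopifnot(all(vapply(datET_toy, is.numeric, logical(1))))
row_means <- rowMeans(datET_toy)
```

Running the numeric check before calling a function such as collapseRows() makes it easier to tell a data type problem apart from a problem with the function arguments.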

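The model matrix mismatch behind the heatmap problem can be illustrated with a small sketch. It assumes the two set-ups were a no-intercept design (one column per group) and an intercept design (a baseline plus one cancer-versus-normal column); this is one reading of the description above, not a record of the project's actual design. The sample labels and expression values are made up, and base R's lm.fit() stands in for limma's lmFit() so that the example runs without Bioconductor.

```r
# Hypothetical sample labels: three normal and three cancer samples.
group <- factor(c("normal", "normal", "normal", "cancer", "cancer", "cancer"),
                levels = c("normal", "cancer"))

# Made-up expression values for a single gene (log2 scale).
y <- c(5.0, 5.2, 4.8, 7.1, 6.9, 7.0)

# Set-up 1: no intercept, one column per group (each coefficient is a group mean).
design_means <- model.matrix(~ 0 + group)
colnames(design_means) <- levels(group)  # "normal", "cancer"

# Set-up 2: intercept plus one indicator column (coefficient 2 is cancer minus normal).
design_ref <- model.matrix(~ group)

fit_means <- lm.fit(design_means, y)$coefficients
fit_ref   <- lm.fit(design_ref, y)$coefficients

# With set-up 1 the comparison has to be built explicitly as a contrast...
diff_from_means <- unname(fit_means["cancer"] - fit_means["normal"])
# ...whereas with set-up 2 it is already the second coefficient.
diff_from_ref <- unname(fit_ref["groupcancer"])

# Both routes give the same cancer-versus-normal difference.
stopifnot(isTRUE(all.equal(diff_from_means, diff_from_ref)))
```

In limma terms, set-up 1 would typically be paired with makeContrasts() and contrasts.fit() before eBayes(), while set-up 2 would be paired with topTable(fit, coef = 2). Mixing the two means the coefficient being tested is no longer the cancer-versus-normal difference, which fits with the fix described above of making the matrix type and the code match.
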
Summary of Work:

  • Annotation using hgu133plus2.db

  • Gene filtering using collapseRows() and !duplicated()

  • DE analysis using limma (lmFit, eBayes, and topTable)

  • Visualisation using heatmaps and volcano plots

  • Used GitHub to submit deliverables by creating a branch and opening a pull request to merge it into the master branch

  • Learned to use GEO2R and use the R code generated there to help guide the steps in our own bioinformatics pipeline
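
The duplicate-based gene filtering listed above can be sketched in base R as follows. The table, its column names, and the choice to sort by mean expression first are all assumptions made for illustration; the report does not say exactly how this filter was applied.

```r
# Hypothetical annotated table with two probes mapping to the same gene.
annotated <- data.frame(
  ProbeID   = c("p1", "p2", "p3", "p4"),
  SYMBOL    = c("TP53", "TP53", "EGFR", "MYC"),
  mean_expr = c(8.2, 6.1, 7.5, 9.0),
  stringsAsFactors = FALSE
)

# Order by descending mean so that duplicated() marks the lower-expressed probes.
annotated <- annotated[order(annotated$mean_expr, decreasing = TRUE), ]

# duplicated() is TRUE for the second and later occurrences of a SYMBOL,
# so !duplicated() keeps only the first row seen for each gene.
unique_genes <- annotated[!duplicated(annotated$SYMBOL), ]
```

Because duplicated() keeps whichever row comes first, the sort order decides which probe survives, so it is worth making that choice explicit.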

Further Notes:
All of the team members I spoke to had different top DEGs in their results, which initially worried me. The task leads reassured us that this was normal and likely due to slight differences in the steps, or the order of steps, taken during normalisation or filtering. I found this interesting, as it suggests that choices made in earlier parts of the pipeline can affect the final conclusions drawn. I would be interested to cross-reference the genes common to the different results to see whether the same top functions are identified, as this would give reassurance that the conclusions are still reliable despite slight differences earlier in the pipeline.
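
One way the cross-referencing suggested above could be started is sketched below in base R. The member names and gene symbols are placeholders rather than real results, and the Jaccard index is just one possible overlap measure chosen for illustration.

```r
# Hypothetical top-DEG lists from three team members (gene symbols are placeholders).
top_degs <- list(
  member_a = c("GENE1", "GENE2", "GENE3", "GENE4"),
  member_b = c("GENE2", "GENE3", "GENE5", "GENE1"),
  member_c = c("GENE3", "GENE1", "GENE6", "GENE7")
)

# Genes present in every list.
common_genes <- Reduce(intersect, top_degs)

# How many lists each gene appears in, most widely shared first.
gene_counts <- sort(table(unlist(top_degs)), decreasing = TRUE)

# Pairwise Jaccard overlap (shared genes / all genes) between two lists.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
overlap_ab <- jaccard(top_degs$member_a, top_degs$member_b)
```

The shared genes in common_genes could then be taken forward to a functional enrichment step to check whether the same top functions appear.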