DGE Analysis - Daniel Drucker

Progress Summary - Differential Gene Expression Analysis - Daniel Drucker

Presentation
Link:
https://docs.google.com/presentation/d/1xYxYrk5cLbaehOvUmVIgxFrn2G73fQN5sMxWzeH5TOc/edit?usp=sharing

Contribution: Wrote the code that annotates the data, calculates differential expression values for each gene, and computes statistics on the significance of those values.

Challenges Faced

  • The annotation includes steps that eliminate redundant gene symbols. I worried that I was corrupting the data set or leaving out important information when discarding the redundant rows. My group-mates agreed with our methods, which made me more comfortable that we were doing it correctly.
  • The syntax and semantics of a data frame’s “row.names”, and how to preserve the row names through certain steps, didn’t come to me intuitively.
  • The specific data type of an argument required by the function we used to collapse the redundant rows of our data frame was unclear. The documentation didn’t make things much clearer for me or my group-mate. Eventually we figured it out through some trial and error.
  • I still had trouble working independently on the data visualization steps.
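
The row-collapsing step described above can be illustrated with a minimal Python sketch. This is not the function the group used. The probe IDs, gene symbols, and the rule of keeping the probe with the highest mean expression are all assumptions made for the example. The point is that each surviving row stays tied to both its gene symbol and its original probe ID, so no identifier is lost during collapsing.

```python
from statistics import mean

# Hypothetical toy data: probe ID -> (gene symbol, expression values per sample).
probes = {
    "probe_a": ("GENE1", [5.0, 6.0, 7.0]),
    "probe_b": ("GENE1", [8.0, 9.0, 10.0]),
    "probe_c": ("GENE2", [3.0, 3.5, 4.0]),
}


def collapse_by_symbol(probes):
    """For each gene symbol, keep the probe with the highest mean expression.

    The selection rule is an assumption for illustration; other rules
    (for example, highest variance) are equally possible.
    """
    best = {}  # gene symbol -> (probe ID, expression values)
    for probe_id, (symbol, values) in probes.items():
        if symbol not in best or mean(values) > mean(best[symbol][1]):
            best[symbol] = (probe_id, values)
    return best


collapsed = collapse_by_symbol(probes)
for symbol, (probe_id, values) in collapsed.items():
    print(symbol, probe_id, values)
```

Keying the result by gene symbol plays the role that unique row names play in a data frame: every row can be looked up by exactly one identifier.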

Summary of Work:

  • Mapped probe IDs from our raw data to gene symbols using the hgu133plus2 database
  • Filtered out redundancies and “N/A” mappings, and removed the genes with the lowest expression, based on the percentile of their mean expression
  • Used lmFit to estimate differential expression values between our cancerous and healthy samples
  • Used eBayes to compute statistics on the significance of the differential expression values
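
The filtering step above can be sketched in a few lines of Python. This is a hedged illustration, not the code from the project. The probe names, the mapping, the expression values, and the 25% cutoff are hypothetical. The sketch drops probes with no gene symbol, ranks the rest by mean expression, and discards the lowest fraction.

```python
from statistics import mean

# Hypothetical probe-to-symbol mapping; None stands in for an "N/A" mapping.
symbol_of = {"p1": "GENE1", "p2": None, "p3": "GENE2", "p4": "GENE3", "p5": "GENE4"}

# Hypothetical expression values (one list of samples per probe).
expression = {
    "p1": [2.0, 2.2, 1.8],
    "p2": [9.0, 9.1, 8.9],
    "p3": [7.0, 7.5, 6.5],
    "p4": [5.0, 5.5, 4.5],
    "p5": [8.0, 8.4, 7.6],
}


def filter_rows(expression, symbol_of, drop_fraction):
    """Drop unmapped probes, then drop the lowest-expressed fraction of the rest."""
    # Keep only probes that map to a gene symbol.
    mapped = {p: v for p, v in expression.items() if symbol_of.get(p) is not None}
    # Rank the remaining probes from lowest to highest mean expression.
    ranked = sorted(mapped, key=lambda p: mean(mapped[p]))
    # Discard the bottom drop_fraction of the ranked probes.
    n_drop = int(len(ranked) * drop_fraction)
    return {p: mapped[p] for p in ranked[n_drop:]}


kept = filter_rows(expression, symbol_of, drop_fraction=0.25)
print(sorted(kept))
```

Here "p2" is removed because it has no symbol, and "p1" is removed because it has the lowest mean among the four mapped probes.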

Further Notes:

  • I had covered least squares problems in a numerical analysis course the semester before this program started, so I was excited to see an application of a linear model fitting function and to already have some knowledge of how it works.
  • At the presentation, we noticed that other subgroups listed different genes as their most differentially expressed. The discussion that followed suggested that the differences arose from the different functions used for the quality control step.
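
To make the least squares connection concrete, here is a minimal Python sketch for a single gene. It fits y = intercept + slope * x, where x is 0 for a healthy sample and 1 for a cancerous one. With that coding, the least squares slope equals the difference between the two group means, which is the differential expression value. The sample values are hypothetical. The t-statistic shown is the ordinary one computed from this gene alone; eBayes instead reports a moderated statistic that borrows variance information across genes, which this sketch does not attempt.

```python
from math import sqrt


def two_group_fit(healthy, cancer):
    """Least squares fit of y = intercept + slope * x, with x = 0 (healthy) or 1 (cancer).

    Returns (intercept, slope, ordinary t-statistic for the slope).
    """
    y = list(healthy) + list(cancer)
    x = [0.0] * len(healthy) + [1.0] * len(cancer)
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Closed-form solution of the normal equations for a single predictor.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom, then the slope's standard error.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = sqrt(rss / (n - 2) / sxx)
    return intercept, slope, slope / std_err


# Hypothetical log-expression values for one gene.
healthy = [5.0, 5.2, 4.8]
cancer = [7.0, 7.4, 6.6]

intercept, slope, t_stat = two_group_fit(healthy, cancer)
group_mean_difference = sum(cancer) / len(cancer) - sum(healthy) / len(healthy)
print(intercept, slope, group_mean_difference, t_stat)
```

The intercept is the healthy group mean and the slope matches the difference in group means, up to floating-point rounding.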