Progress Summary - Breast Cancer Analysis Project - Sona Popat
- I would have liked to analyse a dataset that required a base package other than affy, such as oligo, but when I tried this I ran into errors straight away. Given the time constraints, I decided to reuse the affy-based bioinformatics pipeline previously used to analyse the colorectal cancer dataset, as I was more familiar with these packages.
- After that, I didn’t encounter any major challenges in completing the bioinformatics pipeline for the second time, as the same functions were used. I managed to troubleshoot most of the technical issues that arose independently, using previously asked questions in the team’s troubleshooting channels or my earlier code as a guide.
- I had some concerns about the dataset I selected: unlike the datasets analysed previously, it had a very unequal balance of normal and cancer samples (5 normal samples and 45 cancer samples). I was unsure how representative the normal samples were, as gene expression levels varied widely between them, but the team leads reassured me that this was acceptable and that the conclusions drawn would still be valid.
- Deciding what to include in my final presentation to tell a story! Each step of the pipeline has several visualisations and points that could be discussed, and there were many potential routes I wanted to explore from the functional analysis outputs! I found it challenging to cut my presentation down to fit within the allotted time, as many of the pathways and conclusions were exciting to read about in a biological context. In the end, I chose to focus on the most significant genes and pathways that could tell a good story within the presentation, going from how the cancer develops to using the analysis outputs to identify drug targets and treat the cancer.
Key slides from my final presentation, focusing on the functional analysis and the interpretations that can be made from a single gene, HER2:
Summary of Work:
- Downloaded the breast cancer dataset
- Created a metadata CSV file from the series matrix file
- Completed the quality control and data visualisation steps of the pipeline
  - QC using affyPLM - produced RLE and NUSE boxplots
  - Normalisation using rma
  - Batch correction
  - Visualisation using PCA plots
- Differential gene expression analysis
  - Annotation using hgu133plus2.db
  - Gene filtering using collapseRows and !duplicated (compared the methods and selected one, with justification based on how each would affect the results)
  - DE analysis using limma (lmFit, eBayes, and topTable)
  - Visualisation using heatmaps and volcano plots
- Functional analysis
  - Gene Ontology Analysis - using enrichGO(), setReadable(), and barplot()
  - KEGG Analysis - using enrichKEGG() and dotplot()
  - Gene-Concept Network - using enrichDGN(), setReadable(), and cnetplot()
  - STRING analysis
  - Transcription Factor Analysis - using gene sets downloaded from MSigDB (GSEA) and cnetplot()
  - Survival Analysis - using GEPIA to produce survival plots, beginning to interpret survival plots
- Focussing on the biological implications of the results of the functional analysis
- Created a timetable of small, achievable tasks to help with my time management, which kept me on track to complete my final presentation deliverables
- Created a presentation to showcase my final deliverables, explaining the steps in the bioinformatics pipeline whilst maintaining a focus on the biological importance of the data analysis
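The choice of gene filtering method mentioned in the list above matters because several probes on the array can map to the same gene, and each method keeps a different one. The sketch below is only an illustration, written in plain Python rather than R, of two common strategies: keeping the first probe seen per gene (what filtering on R's !duplicated() does) and keeping the probe with the highest mean expression (the idea behind the "MaxMean" method of collapseRows). The probe names and expression values are hypothetical toy data, the function names are made up for this example, and the summary does not state which collapseRows method was actually used in the project.

```python
from statistics import mean


def keep_first_probe(probe_to_gene, expression):
    """Keep the first probe encountered for each gene.

    Mirrors the effect of filtering on !duplicated(gene_symbol) in R:
    the result depends entirely on the row order of the table.
    """
    kept = {}
    for probe in expression:  # dicts preserve insertion order
        gene = probe_to_gene.get(probe)
        if gene is not None and gene not in kept:
            kept[gene] = probe
    return kept


def keep_max_mean_probe(probe_to_gene, expression):
    """Keep, for each gene, the probe with the highest mean expression.

    This is the idea behind a "MaxMean"-style collapse; it does not
    depend on row order.
    """
    kept = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:  # unannotated probes are dropped
            continue
        if gene not in kept or mean(values) > mean(expression[kept[gene]]):
            kept[gene] = probe
    return kept


# Hypothetical toy data: two probes map to ERBB2 (the gene encoding HER2),
# one maps to TP53, and one has no annotation.
probe_to_gene = {
    "probe_a": "ERBB2",
    "probe_b": "ERBB2",
    "probe_c": "TP53",
    "probe_d": None,
}
expression = {
    "probe_a": [5.0, 5.2, 5.1],
    "probe_b": [9.0, 9.4, 9.2],
    "probe_c": [7.0, 7.1, 6.9],
    "probe_d": [3.0, 3.1, 2.9],
}

first = keep_first_probe(probe_to_gene, expression)
max_mean = keep_max_mean_probe(probe_to_gene, expression)

print("first probe per gene:   ", first)
print("max-mean probe per gene:", max_mean)
```

In this toy example the two strategies pick different probes for ERBB2, so the values passed on to the differential expression step would differ depending on the filtering method chosen.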
Further Notes: I really enjoyed learning how to create the metadata for a selected dataset, as this was not a step we had completed previously - learning how to do this means I am now able to complete every step of the pipeline independently, so I can carry forward these skills to analyse almost any dataset!
In that vein, I would love to develop my skills further so I can analyse any dataset I am interested in. This could include learning how to process data with different base packages, such as oligo, or adapting the skills I have now developed to different pipeline designs. Another thing I noticed when selecting a dataset was that not all datasets compared normal and cancer samples: some compared gene expression before and after treatment with a candidate drug, for example. I would be interested to see how the pipeline or its interpretations would be adapted to investigate a slightly different hypothesis, though the analysis would still be based on gene expression.
Overall, I am really proud of how far I have come throughout the project, as I am now much more confident approaching coding and computational biology techniques, which I previously had very little experience with. I hope to apply these new skills to my undergraduate research project!