Final Self-Assessment
Things Learned:
- Technical Skills: creating functions in Python, data analysis and plotting in R, strategies for analyzing scientific papers, the role of genetics and the environment in CRC and other cancers, data visualization packages in Python, Bioconductor software for high-throughput genomic data analysis in R, transposing matrices and data frames in R, working with large data sets in R, changing object types in R, adapting existing code for new purposes, installing new packages in RStudio, reading R documentation, uploading and organizing files in GitHub, troubleshooting in R, project management
- Tools: Python, R, STEM-Away forum, Google Suite, Bioconductor, Slack, hgu133plus2.db package in R, GitHub, limma package in R, GEO2R, Asana, enrichR, org.Hs.eg.db, topGO, clusterProfiler, pathview, magrittr, tidyr, STRING database, GSE21510 dataset
- Soft Skills: online collaboration and team building, communication with new people, etiquette for online meetings, leading small groups, organizing online meetings, flexibility with last-minute meetings, communicating with small groups when not everyone can meet at the same time, the importance of creativity in STEM and creativity-building exercises, communication, leadership, planning meetings, resume building and formatting, elevator pitches, divergent thinking, presentation strategies, leading meetings, networking
Meetings/Trainings Attended:
6/2 R Training, 6/3 Technical Training Webinar, 6/4 Team 3 Meeting, 6/9 R Training, 6/9 Python Training, 6/10 Logistics Webinar, 6/10 Technical Training, 6/10 Welcome Session, 6/11 Gene Team Meeting, 6/12 Welcome Meeting, 6/12 R Training, 6/12 Gene Team Happy Hour, 6/15 Gene Team Meeting, 6/15 Python Office Hours, 6/16 Asana Training, 6/16 Python and Pandas Webinar, 6/17 Technical Training, 6/18 Gene Team Meeting, 6/18 Gene Team Office Hours, 6/19 Gene Team Happy Hour, 6/22 Gene Team Meeting, 6/23 Python Training, 6/24 Bioinformatics Office Hours, 6/24 Fireside Chat Webinar, 6/25 GitHub Training Webinar, 6/25 Bioinformatics Webinar, 6/25 Gene Team Meeting, 6/26 R Training, 6/26 Gene Team Happy Hour, 6/29 Gene Team Meeting, 6/29 Python Training, 6/30 Python Office Hours, 7/1 GitHub Webinar, 7/2 Gene Team Meeting, 7/2 Gene Team Happy Hour, 7/6 Gene Team Meeting, 7/08 Gene Team Meeting, 7/08 New and Old Leads Meeting, 7/09 Python Training, 7/10 Gene Team Happy Hour, 7/13 Gene Team Meeting, 7/15 Gene Team Meeting, 7/15 BI Intro Webinar, 7/16 Leads Meeting, 7/17 July Leads Meeting, 7/20 July BI Team 3 Meeting, 7/21 BI Leads Technical Training, 7/21 Gene Team Meeting, 7/24 BI Final Presentations
Tasks Completed:
- Attended or watched the R and Python training sessions and completed related exercises. I found some of the exercises to be a little challenging, but after going back and watching the training for a second time, I was able to understand how to adapt the code to accomplish the different tasks.
- Read and annotated the scientific paper about prognostic markers for CRC. The paper was lengthy and difficult to digest, but after discussing with my team and receiving training on a recommended order to read the paper and the areas to focus on, the task seemed much more manageable and I was able to extract more meaning from the paper.
- Completed assigned Python and R exercises
- Normalized the GSE8671 microarray data with the mas5() function for use in quality control analysis
- Performed quality control analysis on both the raw and normalized GSE8671 microarray data using the arrayQualityMetrics package in Bioconductor
- Used the ggplot function in R to create PCA plots and heatmaps to compare clustering of normal and tumor samples before and after normalization
- Used the hgu133plus2.db package to annotate the GSE8671 data set by mapping probe IDs to gene symbols and eliminating duplicate values and NAs
- Used the quantile() function to identify and filter out genes expressed below the 4th quantile from the GSE8671 data set
- Created a new matrix of normalized and filtered GSE8671 differential expression data
- Used the limma package in R to calculate statistics for GSE8671 (gene symbol, log2 fold change, p-value, adjusted p-value)
- Investigated different thresholds for determining significance of differentially expressed genes and determined appropriate cutoffs
- Illustrated results from the GSE8671 differential analysis and significance cutoffs in a volcano plot
- Completed the second set of Python exercises
- Cleaned the phenotypic data for GSE8671 and transferred it into a GSE8671 ExpressionSet object containing the gene expression data for Gene Team group 3
- Created a vector containing the top differentially expressed genes for GSE8671 and mapped them to their Entrez IDs
- Performed GO analysis to identify the cellular components and molecular functions associated with the most differentially expressed genes
- Analyzed the KEGG pathways of the most differentially expressed genes
- Used the enrichR package to further analyze the involved pathways and gene ontologies for the top differentially expressed genes in GSE8671
- Used the STRING database to identify protein interactions associated with the top differentially expressed genes and to create a PPI network map
- Performed gene set enrichment analysis (GSEA) on the GSE8671 data
- Hosted two sessions of office hours (7/08 and 7/13)
- Identified an LFC threshold for differential expression and separated upregulated and downregulated genes in the GSE8671 dataset into two vectors
- Ran GO, KEGG, WikiPathways, and STRING analyses on both gene vectors to identify and compare trends between upregulated and downregulated genes
- Analyzed differential expression of the GSE21510 data set: quality control and outlier removal, normalization, gene annotation and gene filtering, limma analysis and data visualization
- Compared results from GSE21510 to results from GSE8671 to identify common dysregulated genes and associated pathways
- Compared results from my combined data to the results published in the Guo paper and confirmed differential expression of hub genes
- Created a PowerPoint presentation showcasing what I have learned and accomplished during my internship at STEM-Away
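The thresholding step in the task list above (choosing an LFC cutoff and separating upregulated from downregulated genes) can be sketched in Python roughly as follows. This is only an illustration: the statistics in my project were calculated with limma in R, and the gene names, values, and cutoffs below are made-up placeholders, not results from GSE8671.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    symbol: str
    log_fc: float  # log2 fold change (tumor vs. normal)
    adj_p: float   # adjusted p-value


def split_by_regulation(stats, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (upregulated, downregulated) lists of gene symbols.

    A gene is kept only if its adjusted p-value is below p_cutoff and
    its absolute log2 fold change is at least lfc_cutoff.
    Both default cutoffs are assumptions chosen for this sketch.
    """
    up, down = [], []
    for gene in stats:
        if gene.adj_p >= p_cutoff or abs(gene.log_fc) < lfc_cutoff:
            continue  # not significant or change too small
        (up if gene.log_fc > 0 else down).append(gene.symbol)
    return up, down


# Hypothetical example values, for illustration only.
example_stats = [
    GeneStat("GENE_A", 2.4, 0.001),
    GeneStat("GENE_B", -3.1, 0.0004),
    GeneStat("GENE_C", 0.3, 0.01),   # fold change too small
    GeneStat("GENE_D", -1.8, 0.20),  # not significant
]

up_genes, down_genes = split_by_regulation(example_stats)
print("Upregulated:", up_genes)
print("Downregulated:", down_genes)
```

Each resulting list can then be passed separately to the downstream GO, KEGG, and STRING analyses, which is what makes it possible to compare trends between the two groups.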
Achievement Highlights:
- Learned how to edit plots in R to analyze specific data sets
- Learned how to create functions in Python
- Identified and taught peers a simple and efficient method to read in microarray data for analysis project
- Troubleshot in R to develop a method to fix the probe ID labels in the normalized GSE8671 data
- Successfully and independently completed my code not only to finish all of this week’s deliverables on time, but more importantly, to gain a better understanding of how the different code elements work.
- Took the lead in cleaning the phenotypic data for GSE8671 and transferring it into a GSE8671 ExpressionSet object containing the gene expression data for Gene Team group 3
- Selected as a lead for the July 2020 BI pathway
- Complimented on the helpfulness of my office hours sessions (7/08 and 7/13)
- Identified and solved a discrepancy problem in my team’s GO and KEGG analysis data
- Complimented on my presentation skills when hosting a webinar about the first set of deliverables for incoming July interns
- Compared the dataset I analyzed for my final project to the dataset we analyzed during the June session of BI and was complimented on the originality of my presentation
Link to presentation slides: