Aliciarepka - Bioinformatics Pathway

Things Learned:

  • Technical Skills: Creating functions in Python, Data analysis and plotting in R, Strategies for analyzing scientific papers, Role of genetics and the environment in CRC and other cancers
  • Tools: Python, R, Stem-Away Forum, Google Suites
  • Soft Skills: online collaboration and team building, communication with new people, etiquette for online meetings.

Achievement Highlights

  • Learned how to edit plots in R to analyze specific data sets
  • Learned how to create functions in Python
  • Analyzed Figures in the CRC paper to better understand the implications of the data

Meetings/Trainings Attended

6/2 R Training, 6/3 Technical Training Webinar, 6/4 Team 3 Meeting, 6/9 R Training, 6/9 Python Training, 6/10 Logistics Webinar, 6/10 Technical Training, 6/10 Welcome Session, 6/11 Gene Team Meeting, 6/12 Welcome Meeting, 6/12 R Training, 6/12 Gene Team Happy Hour, 6/15 Gene Team Meeting

Goals for the Upcoming Week

  • Continue to work with and become more comfortable with R and Python so that I can better understand how to accomplish a given task
  • Attend all meetings and events
  • Discuss the CRC paper and data with my team to gain new insight on the findings

Tasks Completed:

  • Attended or watched the R and Python training sessions and completed related exercises. I found some of the exercises to be a little challenging, but after going back and watching the training for a second time, I was able to understand how to adapt the code to accomplish the different tasks.
  • Read and annotated the scientific paper about prognostic markers for CRC. The paper was lengthy and difficult to digest, but after discussing with my team and receiving training on a recommended order to read the paper and the areas to focus on, the task seemed much more manageable and I was able to extract more meaning from the paper.

Current Role: Observer, but I would like to request to change to a participant now that our project duration has been extended to 8 weeks.

Self Assessment 6/22:
Things Learned:

  • Technical Skills: Creating functions in Python, Data analysis and plotting in R, Strategies for analyzing scientific papers, Role of genetics and the environment in CRC and other cancers, Data visualization packages in Python, Bioconductor software for high-throughput genomic data analysis in R
  • Tools: Python, R, Stem-Away Forum, Google Suites, Bioconductor, Slack
  • Soft Skills: online collaboration and team building, communication with new people, etiquette for online meetings, leading small groups, organizing online meetings

Achievement Highlights:

  • Identified and taught peers a simple and efficient method to read in microarray data for analysis project
  • Collaborated with peers to analyze data on GSE8671 expression and coordinate a presentation
  • Troubleshot in R to develop a method to fix the probe ID labels in the normalized GSE8671 data

Meetings/Trainings Attended:

6/2 R Training, 6/3 Technical Training Webinar, 6/4 Team 3 Meeting, 6/9 R Training, 6/9 Python Training, 6/10 Logistics Webinar, 6/10 Technical Training, 6/10 Welcome Session, 6/11 Gene Team Meeting, 6/12 Welcome Meeting, 6/12 R Training, 6/12 Gene Team Happy Hour, 6/15 Gene Team Meeting, 6/15 Python Office Hours, 6/16 Asana Training, 6/16 Python and Pandas Webinar, 6/17 Technical Training, 6/18 Gene Team Meeting, 6/18 Gene Team Office Hours, 6/19 Gene Team Happy Hour, 6/22 Gene Team Meeting, 6/23 Python Training, 6/24 Bioinformatics Office Hours, 6/24 Fireside Chat Webinar

Goals for the Upcoming Week:

  • Ask clarifying questions using the STEM Away Forum
  • Finish tasks and deliverables ahead of time
  • Collaborate with my small group to better understand the deliverables and what the code is accomplishing

Tasks Completed:

  • Attended or watched the R and Python training sessions and completed related exercises. I found some of the exercises to be a little challenging, but after going back and watching the training for a second time, I was able to understand how to adapt the code to accomplish the different tasks.
  • Read and annotated the scientific paper about prognostic markers for CRC. The paper was lengthy and difficult to digest, but after discussing with my team and receiving training on a recommended order to read the paper and the areas to focus on, the task seemed much more manageable and I was able to extract more meaning from the paper.
  • Completed assigned Python and R exercises
  • Normalized GSE8671 microarray data to be used for quality control analysis (mas5 function)
  • Performed quality control analysis on GSE8671 microarray data using the arrayQualityMetrics package in Bioconductor for both raw and normalized data
  • Used the ggplot function in R to create PCA plots and heatmaps to compare clustering of normal and tumor samples before and after normalization
  • Investigated and used the hgu133plus2.db package to annotate the GSE8671 data set by mapping probe ID to gene symbol
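
As a rough illustration of the probe-to-symbol annotation step in the list above, here is a small plain-Python sketch. It is not the R code used in the project: the probe IDs, gene symbols, and expression values are hypothetical placeholders, and the real lookup comes from the hgu133plus2.db package.

```python
# Hypothetical probe-to-symbol lookup (placeholder IDs, not real annotations).
probe_to_symbol = {
    "probe_a": "GENE1",
    "probe_b": "GENE2",
    "probe_c": None,  # probe with no gene symbol (NA)
}

# Hypothetical expression values keyed by probe ID, one value per sample.
expression = {
    "probe_a": [7.1, 8.3],
    "probe_b": [5.0, 5.2],
    "probe_c": [6.4, 6.1],
}


def annotate(expression, probe_to_symbol):
    """Label each probe row with its gene symbol (None when no symbol is known)."""
    return [
        (probe_to_symbol.get(probe), probe, values)
        for probe, values in expression.items()
    ]


annotated = annotate(expression, probe_to_symbol)
for symbol, probe, values in annotated:
    print(symbol, probe, values)
```

Rows that come back with no symbol are the ones that later need to be dropped before gene-level analysis.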

Self Assessment 6/29
Things Learned:

  • Technical Skills: transposing matrices and data frames in R, working with large data sets in R, changing object types in R, adapting existing code for new purposes
  • Tools: hgu133plus2.db package in R, STEM-Away Forums, GitHub, limma package in R, GEO2R database, Asana
  • Soft Skills: flexibility with last-minute meetings, communicating with small groups when not everyone can meet at the same time, the importance of creativity in STEM and creativity-building exercises

Achievement Highlights:

  • Successfully and independently completed my code not only to finish all of this week’s deliverables on time, but more importantly, to gain a better understanding of how the different code elements work.
  • Discovered a method to help prevent my computer from crashing when performing calculations of large data sets in R
  • Mentored my peers in The Gene Team subteam 3 to help them complete their code to accomplish and understand this week’s deliverables as well

Meetings/Trainings Attended:

6/22 Gene Team Meeting, 6/23 Python Training, 6/24 Bioinformatics Office Hours, 6/24 Fireside Chat Webinar, 6/25 GitHub Training Webinar, 6/25 Bioinformatics Webinar, 6/25 Gene Team Meeting, 6/26 R Training, 6/26 Gene Team Happy Hour, 6/29 Gene Team meeting

Goals for the Upcoming Week:

  • Communicate more with small groups about deliverables
  • Set expectations and duties with small group before beginning on tasks
  • Clearer delegation of tasks for subteam 3

Tasks Completed:

  • Used the hgu133plus2.db package to annotate the GSE8671 data set by mapping probe ID to gene symbol and eliminating duplicate values and NAs
  • Used quantile() function to identify and filter out genes expressed below the 4th quantile from the GSE8671 data set
  • Created a new matrix of normalized and filtered GSE8671 differential expression data
  • Used the limma package in R to calculate statistics for GSE8671 (gene symbol, log2 fold change, p-value, adjusted p-value)
  • Investigated different thresholds for determining significance of differentially expressed genes and determined appropriate cutoffs
  • Illustrated results from GSE8671 differential analysis and significance cutoffs in a Volcano Plot
  • Completed the second set of Python exercises
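
The duplicate/NA removal and quantile filtering steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the R code from the project: the rows are hypothetical, keeping the highest-mean row per symbol is one possible way to resolve duplicates, and the lowest-quartile cutoff is a placeholder rather than the cutoff actually used.

```python
from statistics import mean, quantiles

# Hypothetical annotated rows: (gene symbol or None, expression values per sample).
rows = [
    ("GENE1", [7.0, 9.0]),
    ("GENE1", [5.0, 5.0]),   # duplicate symbol with lower mean expression
    (None, [6.0, 6.0]),      # unannotated probe (NA)
    ("GENE2", [2.0, 2.0]),
    ("GENE3", [4.0, 6.0]),
    ("GENE4", [10.0, 12.0]),
]


def collapse_to_genes(rows):
    """Drop NA symbols and keep, for each symbol, the row with the highest mean."""
    best = {}
    for symbol, values in rows:
        if symbol is None:
            continue
        if symbol not in best or mean(values) > mean(best[symbol]):
            best[symbol] = values
    return best


def filter_low_expression(genes, n_groups=4):
    """Keep genes whose mean expression is at or above the lowest cut point."""
    means = {symbol: mean(values) for symbol, values in genes.items()}
    cutoff = quantiles(means.values(), n=n_groups)[0]
    return {s: v for s, v in genes.items() if means[s] >= cutoff}


genes = collapse_to_genes(rows)
filtered = filter_low_expression(genes)
print(sorted(filtered))
```

Changing `n_groups` or the index of the cut point moves the threshold, which is the knob to adjust when comparing different filtering cutoffs.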

Self Assessment 7/6
Things Learned:

  • Technical Skills: installing new packages in RStudio, reading R documentation
  • Tools: EnrichR, org.Hs.eg.db, topGO, clusterProfiler, pathview, magrittr, tidyr, STRING database
  • Soft Skills: communication, leadership, planning meetings

Meetings/Trainings Attended:

  • 6/29 Python Training, 6/30 Python Office Hours, 7/1 GitHub Webinar, 7/2 Gene Team Meeting, 7/2 Gene Team Happy Hour, 7/6 Gene Team Meeting

Achievement Highlights:

  • Took the lead in cleaning the phenotypic data for GSE8671 and transferring it into a GSE8671 ExpressionSet object containing the gene expression data for Gene Team group 3
  • Asked to host office hours for interns in the July Bioinformatics pathway
  • Prepared, finalized, and submitted the week 4 results for group 3

Goals for the Upcoming Week:

  • Thoroughly prepare for and lead a helpful office hours session
  • Work more closely with team 3 to complete week 5 deliverables

Tasks Completed:

  • Cleaned the phenotypic data for GSE8671 and transferred it into a GSE8671 ExpressionSet object containing the gene expression data for Gene Team group 3
  • Created a vector containing the top differentially expressed genes for GSE8671 and mapped them to their Entrez IDs
  • Performed GO analysis to identify the cellular components and molecular functions associated with the most differentially expressed genes
  • Analyzed the KEGG pathways of the most differentially expressed genes
  • Used the enrichR package to further analyze the involved pathways and gene ontologies for the top differentially expressed genes in GSE8671
  • Used the STRING database to identify protein interactions associated with the top differentially expressed genes and to create a PPI network map
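
For readers unfamiliar with what GO and KEGG enrichment tools compute, the sketch below shows a basic over-representation test based on the hypergeometric distribution, which is a common basis for this kind of analysis. It is a plain-Python illustration with hypothetical gene names and set sizes, not the output or internals of the R packages named above.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K belong to the term."""
    upper = min(n, K)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def enrichment(selected, term_genes, background):
    """Return the overlap size and over-representation p-value for one term."""
    background = set(background)
    selected = set(selected) & background
    term = set(term_genes) & background
    overlap = len(selected & term)
    p_value = hypergeom_pvalue(overlap, len(selected), len(term), len(background))
    return overlap, p_value


# Hypothetical data: 20 background genes, 5 annotated to a term, 4 selected.
background = {f"G{i}" for i in range(20)}
term_genes = {"G0", "G1", "G2", "G3", "G4"}
selected = {"G0", "G1", "G2", "G10"}

overlap, p_value = enrichment(selected, term_genes, background)
print(overlap, p_value)
```

In practice a test like this is repeated for every term and the p-values are then corrected for multiple testing.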

Self Assessment 7/13
Things Learned:

  • Technical Skills: troubleshooting in R, uploading and organizing files in GitHub
  • Tools: EnrichR, STRING database
  • Soft Skills: resume building and formatting, elevator pitches, divergent thinking

Meetings Attended:

  • 7/08 Gene Team meeting, 7/08 New and Old Leads Meeting, 7/09 Python Training, 7/10 Gene Team Happy Hour, 7/13 Gene Team Meeting

Achievement Highlights:

  • Selected as a lead for the July 2020 BI pathway
  • Complimented on the helpfulness of my office hours sessions
  • Identified and solved a discrepancy problem in my team’s GO and KEGG analysis data

Goals for the Upcoming Week:

  • Stay on top of tasks while away on vacation
  • Get ready to serve as a lead for the July session

Tasks Completed:

  • Performed GSEA on the GSE8671 data
  • Hosted two sessions of office hours (7/08 and 7/13)
  • Identified an LFC threshold for differential expression and separated upregulated and downregulated genes in the GSE8671 dataset into two vectors
  • Ran GO, KEGG, Wiki Pathway, and STRING analysis on both gene vectors to identify and compare trends between upregulated and downregulated genes
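
Splitting genes into upregulated and downregulated vectors by an LFC threshold, as described above, can be sketched like this in plain Python. The gene names, statistics, and both cutoffs are hypothetical placeholders, not the thresholds chosen in the project.

```python
# Hypothetical results: gene -> (log2 fold change, adjusted p-value).
results = {
    "GENE1": (2.5, 0.001),
    "GENE2": (-1.8, 0.01),
    "GENE3": (0.3, 0.0005),  # significant but below the LFC threshold
    "GENE4": (3.0, 0.2),     # large change but not significant
}


def split_by_direction(results, lfc_threshold=1.0, padj_cutoff=0.05):
    """Return (upregulated, downregulated) gene lists that pass both cutoffs."""
    up, down = [], []
    for gene, (lfc, padj) in results.items():
        if padj >= padj_cutoff:
            continue
        if lfc >= lfc_threshold:
            up.append(gene)
        elif lfc <= -lfc_threshold:
            down.append(gene)
    return up, down


up_genes, down_genes = split_by_direction(results)
print(up_genes, down_genes)
```

Each list can then be passed separately to the enrichment tools to compare trends between the two directions.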

Self Assessment 7/20
Things Learned:

  • Technical Skills: troubleshooting in R, installing R packages, project management
  • Tools: GSE21510 dataset, STEM-Away forum
  • Soft Skills: Presentation strategies, leading meetings, networking

Meetings Attended:

7/15 Gene Team Meeting, 7/15 BI Intro Webinar, 7/16 Leads Meeting, 7/17 July Leads Meeting, 7/20 July BI Team 3 Meeting, 7/21 BI Leads Technical Training, 7/21 Gene Team Meeting

Achievement Highlights:

  • Identified and solved a complex error loading packages in RStudio
  • Complimented on my presentation skills hosting a webinar about the first set of deliverables for incoming July interns
  • Hosted a successful meeting for July BI team 3 after finding out on the same day that our team lacked a PM lead.

Goals for the Upcoming Week:

  • Stay on top of tasks while away on vacation
  • Stay organized with July project lead role as I finish my final project for June
  • Prepare a great final presentation

Tasks Completed:

  • Hosted a webinar covering the deliverables for week 3 of the July session
  • Organized my first team meeting as a lead for the July session
  • Completed all deliverables for July week 3
    • Merged GSE32323 and GSE8671
    • Quality control
    • Normalization
    • Batch correction
  • Analyzed differential expression of the GSE21510 data set
    • Quality control and outlier removal
    • Normalization
    • Gene annotation and gene filtering
    • Limma analysis and data visualization
  • Compared results from GSE21510 to results from GSE8671 to identify common dysregulated genes and associated pathways
  • Compared results from my combined data to the results published in the Guo Paper and confirmed differential expression of hub genes
  • Created a PowerPoint presentation showcasing what I have learned and accomplished during my internship at STEM-Away
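
The cross-dataset comparison mentioned in the list above comes down to finding genes that are dysregulated in the same direction in both result sets. A minimal plain-Python sketch, assuming each input already holds only significant genes with their log2 fold changes (all names and values here are hypothetical):

```python
# Hypothetical significant genes per dataset: gene -> log2 fold change.
dataset_a = {"GENE1": 2.1, "GENE2": -1.5, "GENE3": 1.2}
dataset_b = {"GENE1": 1.7, "GENE2": 1.1, "GENE4": -2.0}


def direction(lfc):
    """Classify a log2 fold change as up- or downregulated."""
    return "up" if lfc > 0 else "down"


def common_dysregulated(first, second):
    """Return genes present in both datasets with the same direction of change."""
    shared = first.keys() & second.keys()
    return {
        gene: direction(first[gene])
        for gene in sorted(shared)
        if direction(first[gene]) == direction(second[gene])
    }


common = common_dysregulated(dataset_a, dataset_b)
print(common)
```

Genes that appear in both datasets but change in opposite directions are left out here, though they may be worth a separate look.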

Final Self-Assessment

Things Learned:

  • Technical Skills: Creating functions in Python, Data analysis and plotting in R, Strategies for analyzing scientific papers, Role of genetics and the environment in CRC and other cancers, Data visualization packages in Python, Bioconductor software for high-throughput genomic data analysis in R, transposing matrices and data frames in R, working with large data sets in R, changing object types in R, adapting existing code for new purposes, installing new packages in RStudio, reading R documentation, uploading and organizing files in GitHub, troubleshooting in R, project management
  • Tools: Python, R, STEM-Away Forum, Google Suites, Bioconductor, Slack, hgu133plus2.db package in R, GitHub, limma package in R, GEO2R database, Asana, EnrichR, org.Hs.eg.db, topGO, clusterProfiler, pathview, magrittr, tidyr, STRING database, GSE21510 dataset
  • Soft Skills: online collaboration and team building, communication with new people, etiquette for online meetings, leading small groups, organizing online meetings, flexibility with last-minute meetings, communicating with small groups when not everyone can meet at the same time, the importance of creativity in STEM and creativity-building exercises, communication, leadership, planning meetings, resume building and formatting, elevator pitches, divergent thinking, presentation strategies, leading meetings, networking

Meetings/Trainings Attended:

6/2 R Training, 6/3 Technical Training Webinar, 6/4 Team 3 Meeting, 6/9 R Training, 6/9 Python Training, 6/10 Logistics Webinar, 6/10 Technical Training, 6/10 Welcome Session, 6/11 Gene Team Meeting, 6/12 Welcome Meeting, 6/12 R Training, 6/12 Gene Team Happy Hour, 6/15 Gene Team Meeting, 6/15 Python Office Hours, 6/16 Asana Training, 6/16 Python and Pandas Webinar, 6/17 Technical Training, 6/18 Gene Team Meeting, 6/18 Gene Team Office Hours, 6/19 Gene Team Happy Hour, 6/22 Gene Team Meeting, 6/23 Python Training, 6/24 Bioinformatics Office Hours, 6/24 Fireside Chat Webinar, 6/25 GitHub Training Webinar, 6/25 Bioinformatics Webinar, 6/25 Gene Team Meeting, 6/26 R Training, 6/26 Gene Team Happy Hour, 6/29 Gene Team Meeting, 6/29 Python Training, 6/30 Python Office Hours, 7/1 GitHub Webinar, 7/2 Gene Team Meeting, 7/2 Gene Team Happy Hour, 7/6 Gene Team Meeting, 7/08 Gene Team Meeting, 7/08 New and Old Leads Meeting, 7/09 Python Training, 7/10 Gene Team Happy Hour, 7/13 Gene Team Meeting, 7/15 Gene Team Meeting, 7/15 BI Intro Webinar, 7/16 Leads Meeting, 7/17 July Leads Meeting, 7/20 July BI Team 3 Meeting, 7/21 BI Leads Technical Training, 7/21 Gene Team Meeting, 7/24 BI final presentations

Tasks Completed:

  • Attended or watched the R and Python training sessions and completed related exercises. I found some of the exercises to be a little challenging, but after going back and watching the training for a second time, I was able to understand how to adapt the code to accomplish the different tasks.
  • Read and annotated the scientific paper about prognostic markers for CRC. The paper was lengthy and difficult to digest, but after discussing with my team and receiving training on a recommended order to read the paper and the areas to focus on, the task seemed much more manageable and I was able to extract more meaning from the paper.
  • Completed assigned Python and R exercises
  • Normalized GSE8671 microarray data to be used for quality control analysis (mas5 function)
  • Performed quality control analysis on GSE8671 microarray data using the arrayQualityMetrics package in Bioconductor for both raw and normalized data
  • Used the ggplot function in R to create PCA plots and heatmaps to compare clustering of normal and tumor samples before and after normalization
  • Used the hgu133plus2.db package to annotate the GSE8671 data set by mapping probe ID to gene symbol and eliminating duplicate values and NAs
  • Used quantile() function to identify and filter out genes expressed below the 4th quantile from the GSE8671 data set
  • Created a new matrix of normalized and filtered GSE8671 differential expression data
  • Used the limma package in R to calculate statistics for GSE8671 (gene symbol, log2 fold change, p-value, adjusted p-value)
  • Investigated different thresholds for determining significance of differentially expressed genes and determined appropriate cutoffs
  • Illustrated results from GSE8671 differential analysis and significance cutoffs in a Volcano Plot
  • Completed the second set of Python exercises
  • Cleaned the phenotypic data for GSE8671 and transferred it into a GSE8671 ExpressionSet object containing the gene expression data for Gene Team group 3
  • Created a vector containing the top differentially expressed genes for GSE8671 and mapped them to their Entrez IDs
  • Performed GO analysis to identify the cellular components and molecular functions associated with the most differentially expressed genes
  • Analyzed the KEGG pathways of the most differentially expressed genes
  • Used the enrichR package to further analyze the involved pathways and gene ontologies for the top differentially expressed genes in GSE8671
  • Used the STRING database to identify protein interactions associated with the top differentially expressed genes and to create a PPI network map
  • Performed GSEA on the GSE8671 data
  • Hosted two sessions of office hours (7/08 and 7/13)
  • Identified an LFC threshold for differential expression and separated upregulated and downregulated genes in the GSE8671 dataset into two vectors
  • Ran GO, KEGG, Wiki Pathway, and STRING analysis on both gene vectors to identify and compare trends between upregulated and downregulated genes
  • Analyzed differential expression of the GSE21510 data set
    • Quality control and outlier removal
    • Normalization
    • Gene annotation and gene filtering
    • Limma analysis and data visualization
  • Compared results from GSE21510 to results from GSE8671 to identify common dysregulated genes and associated pathways
  • Compared results from my combined data to the results published in the Guo Paper and confirmed differential expression of hub genes
  • Created a PowerPoint presentation showcasing what I have learned and accomplished during my internship at STEM-Away

Achievement Highlights:

  • Learned how to edit plots in R to analyze specific data sets
  • Learned how to create functions in Python
  • Identified and taught peers a simple and efficient method to read in microarray data for analysis project
  • Troubleshot in R to develop a method to fix the probe ID labels in the normalized GSE8671 data
  • Successfully and independently completed my code not only to finish all of this week’s deliverables on time, but more importantly, to gain a better understanding of how the different code elements work.
  • Took the lead in cleaning the phenotypic data for GSE8671 and transferring it into a GSE8671 ExpressionSet object containing the gene expression data for Gene Team group 3
  • Selected as a lead for the July 2020 BI pathway
  • Complimented on the helpfulness of my office hours sessions (7/08 and 7/13)
  • Identified and solved a discrepancy problem in my team’s GO and KEGG analysis data
  • Complimented on my presentation skills hosting a webinar about the first set of deliverables for incoming July interns
  • Compared the dataset I analyzed for my final project to the dataset we analyzed during the June session of BI and was complimented on the originality of my presentation

Link to presentation slides: