Vanshikabidhan - Bioinformatics Pathway


  • I gained a better understanding of quality control (QC) analysis using the simpleaffy and arrayQualityMetrics packages, and I interpreted the QC results to detect outliers. I learned to normalize the data with the RMA method and to remove variability due to batch effects using the sva package. I was able to make PCA plots and heatmaps for both the normalized and the batch-corrected data. This week also taught me the importance of patience, since the arrayQualityMetrics package took a long time to run. At the beginning of the week, I was so impatient to get the QC results and proceed to batch correction and plotting that I closed the program without waiting for its output, and I wasted 2-3 days this way. With time running out, I had no option but to wait patiently for the program to finish, which ultimately gave me my ‘index.html’ (although it took a few hours). I also learned to coordinate work with other team members. Communicating with my colleagues was a bit challenging due to the large time difference (12.5 h), but I managed it well. I’m glad that we were able to put together all the deliverables and a decent presentation in time, thanks to effective teamwork. Another challenge I faced was allocating large vectors, and I figured out a way around that too (by increasing the memory limit)!

  • My three achievements for this week are:
    One, I learned to make and interpret different plots, such as QC plots, boxplots of array intensities, NUSE and RLE plots, and heatmaps based on Pearson correlation, to detect outliers and check data quality. Two, I was able to perform principal component analysis. Three, I communicated effectively with my fellow group members (using Slack and G Suite), and cleared my doubts by talking to other team members and leads.

  • I attended all team meetings and office hours, except the BI happy hour, as it was late at night here. For week 4, I have volunteered as the task lead, and my approach will be to make sure everything runs smoothly and tasks are completed on time. My goal for this week is also to work more systematically on my individual tasks and to manage my time efficiently.


Things learned

  • Technical area: Learned how to map probeset IDs to gene symbols, use the limma package to fit a linear model to the dataset, and visualize the differentially expressed genes using a heatmap and an EnhancedVolcano plot.
  • Tools: R, Slack, G Suite, GitHub.
  • Soft skills: I volunteered to be the task lead for this week, created the online documents for the group members to contribute to, and compiled the work. I communicated and cross-checked the results with everybody.

Achievement highlights

  • As task lead, I communicated with my team members to coordinate the work, and also made sure that everybody contributed towards completing the tasks.
  • Performed differential gene expression analysis and generated multiple plots with different arguments and cutoff values for better understanding.
  • Completed the deliverables and presentation well before the deadline.

List of training and meetings attended
7/27 Team meeting, 7/28 Team Presentation, 7/29 GitHub Webinar, 7/30 Office Hours, 7/31 Discussion

Goals for the upcoming week

  • For the upcoming week, functional analysis is entirely new to me and will be time-consuming. I will try to organize my work in order to complete the deliverables smoothly.
  • Will try to get to know my group members better.

Tasks done

  • For the technical part, I mapped gene symbols, filtered genes, and fitted a linear model to create a table of differentially expressed genes (sorted by their adj.P.Val). I visualized the results using a clustered heatmap and a volcano plot. I interpreted the volcano plot by comparing the log fold changes and p-values of different genes.
  • For me, the coding part was easier this week than last week. I encountered fewer problems and challenges; the major one was related to filtering the genes, which I solved by getting help during office hours.
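
The filtering and volcano-plot interpretation described above can be illustrated with a minimal sketch. It is written in plain Python rather than with the R packages used in the project, the gene names and values are made up, and the two cutoffs are assumed for illustration only, not the thresholds used in the actual analysis.

```python
# Toy differential-expression table: (gene, logFC, adjusted p-value).
# All gene names and numbers are hypothetical.
rows = [
    ("GENE_A", 2.4, 0.001),
    ("GENE_B", -1.8, 0.020),
    ("GENE_C", 0.3, 0.004),
    ("GENE_D", 1.5, 0.300),
    ("GENE_E", -0.2, 0.900),
]

LOGFC_CUTOFF = 1.0   # assumed threshold on |log fold change|
PADJ_CUTOFF = 0.05   # assumed threshold on the adjusted p-value


def classify(log_fc, p_adj):
    """Label a gene the way a volcano plot colours it."""
    if p_adj < PADJ_CUTOFF and log_fc >= LOGFC_CUTOFF:
        return "up"
    if p_adj < PADJ_CUTOFF and log_fc <= -LOGFC_CUTOFF:
        return "down"
    return "not significant"


# Sort by adjusted p-value, smallest (most significant) first.
ranked = sorted(rows, key=lambda r: r[2])
labels = {gene: classify(log_fc, p_adj) for gene, log_fc, p_adj in ranked}

for gene, log_fc, p_adj in ranked:
    print(f"{gene}\tlogFC={log_fc:+.1f}\tadj.p={p_adj:.3f}\t{labels[gene]}")
```

A gene counts as differentially expressed here only when it passes both cutoffs, which is why GENE_C (significant but with a small fold change) and GENE_D (large fold change but not significant) are both left out.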


Things learned

  • Technical area: This week, I made an attempt to describe the functions and interactions of the top DEGs for our colorectal cancer dataset through a computational approach, i.e., I made use of packages like clusterProfiler and enrichplot.
  • Tools: R, Slack, G Suite, GitHub, GEPIA, StringDB
  • Soft skills: Made an extra effort to get to know my team members better, and communicated this week’s results to them in detail.

Achievement highlights

  • Before I conducted KEGG analysis, the “Kyoto Encyclopedia of Genes and Genomes” was just another database to me. I interpreted the results using dotplots, and also used a gene-concept network to understand complex associations among the genes. This way, I learned more about what the database contains and how to use it.
  • I made several attempts to get the cnetplot with the desired arguments. I finally managed it on my own after specifying the arguments repeatedly and redefining the vectors multiple times.
  • Read more about the different databases available for biological pathways, and how plots are interpreted, for example, the enrichment score and FDR from a GSEA plot.

List of training and meetings attended
8/4 Presentation and Diversity Discussion, 8/5 Deliverables Webinar, 8/6 Office Hours

Goals for the upcoming week

  • Compile the results of functional analysis and prepare a write-up to draw conclusions from the whole process.
  • Start working on the final task and presentation.

Tasks and Challenges

  • Filtered the genes from the limma analysis to create a gene vector, with logFC values in decreasing order. Performed GO analysis using enrichGO and KEGG analysis using enrichKEGG, and visualized the results with barplots and dotplots. I also used the STRING database to locate the hub genes.
  • Also performed GSE analysis and TF analysis using GSEA(), and visualized the results with two plot types that were new to me: gseaplot and cnetplot.
  • Faced problems assigning arguments to some of my plots, but solved this after a few attempts.
  • Defining the gene vector was the most time-consuming step for me, as I wasn’t sure how many genes to consider while making the plots, and separating the genes into up-regulated and down-regulated sets made it more confusing. I got help from office hours and my teammates to figure this out.
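
The gene vector described above (logFC values in decreasing order, then separated into up- and down-regulated genes) can be sketched as follows. This is an illustration in plain Python with made-up gene names and values, not the R code used in the project, and the cutoff of 1.0 is an assumption.

```python
# Hypothetical logFC values keyed by gene symbol.
log_fc = {
    "GENE_A": 2.4,
    "GENE_B": -1.8,
    "GENE_C": 0.3,
    "GENE_D": 1.5,
    "GENE_E": -0.2,
}

# Ranked gene list: (gene, logFC) pairs sorted by logFC, largest first.
ranked_genes = sorted(log_fc.items(), key=lambda kv: kv[1], reverse=True)

CUTOFF = 1.0  # assumed |logFC| threshold for calling a gene up- or down-regulated

# Separate the ranked list into up- and down-regulated gene sets.
up_regulated = [gene for gene, value in ranked_genes if value >= CUTOFF]
down_regulated = [gene for gene, value in ranked_genes if value <= -CUTOFF]

print("ranked:", [gene for gene, _ in ranked_genes])
print("up:", up_regulated)
print("down:", down_regulated)
```

In this sketch the full ranked list keeps every gene, while the up- and down-regulated subsets keep only those passing the cutoff. The number of genes in each subset therefore depends directly on the threshold chosen.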


Things learned

  • Technical area: Performed principal component analysis and quality control using the arrayQualityMetrics package; DGE analysis using limma, EnhancedVolcano and pheatmap; functional analysis: GO, KEGG pathway, gene network, StringDB, TFA, GSEA.
  • Tools: R, Slack, G Suite, GEPIA, StringDB, PowerPoint, GEO
  • Soft skills: Improved my presentation skills while working on the final project.

Achievement highlights

  • Completed my final presentation on ccRCC and learned about the GEO dataset in detail. Also learned how to organize and build the metadata for different groups.
  • Spent a lot of time interpreting all the analyses performed.
  • Besides renal cell carcinoma, I also tried analyzing datasets for Parkinson’s disease, breast cancer and AML.

List of training and meetings attended
8/10 Team meeting, 8/11 Office Hours, 8/12 Functional Analysis Webinar, 8/13 Office Hours, 8/14 Group presentation, 8/17 Team Meeting and Office Hours, 8/18 Webinar on professional presentation, 8/21 Final Presentation, 8/24 Final Team Meeting

Goals for the future
Keep working to strengthen the skills acquired through this internship, both technical and soft.

Tasks and Challenges

  • I tried to merge different datasets for my final presentation but did not get good results, and I was confused about why the errors were arising. I then sought help through office hours and managed to work it out.
  • Interpreting each and every graph was a tedious task, but I made an effort to understand every plot in detail.
  • Faced problems with particular datasets, since different packages were required to load them. As a result, I learned about the different array types and how the data is organized differently in each type of microarray.