Sahilbolar - Machine Learning (Level 3) Pathway

Module 1

Technical Area: This week, I read the paper that we are trying to implement. I learned more about the EBC algorithm, how the matrices are set up, and how biclustering gives us unique insights into different drug-gene interactions and dependency paths.

Soft Skills: Learning to coordinate with my team was crucial! Our team has members from around the world, so it was important to find a common meeting time that would work for everyone. We also presented different aspects of the paper to our group during Journal Club. This improved my presentation skills, as I had to distill and simplify the information so that it was easy to understand; furthermore, learning directly from my peers gave me a deeper understanding of the ideas in the paper.

Upcoming goals: To obtain the data from Medline and preprocess it (identify drug-gene pairs co-occurring in Medline sentences, extract all dependency paths connecting these drug-gene pairs, and arrange the data into a matrix). I’m super excited to get my hands on the data and begin coding!
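
As a rough illustration of the last preprocessing step, the sketch below arranges co-occurrence counts into a matrix with one row per drug-gene pair and one column per dependency path. The function name, the toy observations, and the single-word path labels are all assumptions made for illustration; real dependency paths from the parser would be longer strings.

```python
from collections import Counter

def build_pair_path_matrix(observations):
    """Count how often each (drug, gene) pair occurs with each dependency path.

    observations: iterable of (drug, gene, path) tuples, one per sentence hit.
    Returns (pairs, paths, matrix), where matrix[i][j] is the count for
    pairs[i] and paths[j].
    """
    counts = Counter(((drug, gene), path) for drug, gene, path in observations)
    pairs = sorted({pair for pair, _ in counts})
    paths = sorted({path for _, path in counts})
    row = {pair: i for i, pair in enumerate(pairs)}
    col = {path: j for j, path in enumerate(paths)}
    matrix = [[0] * len(paths) for _ in pairs]
    for (pair, path), n in counts.items():
        matrix[row[pair]][col[path]] = n
    return pairs, paths, matrix

# Hypothetical toy input; the path labels are placeholders, not parser output.
observations = [
    ("aspirin", "PTGS1", "inhibits"),
    ("aspirin", "PTGS1", "inhibits"),
    ("aspirin", "PTGS1", "binds"),
    ("warfarin", "VKORC1", "inhibits"),
]
pairs, paths, matrix = build_pair_path_matrix(observations)
```

A nested list keeps the sketch dependency-free; with real Medline data the matrix would be large and mostly zeros, so a sparse representation would likely be a better fit.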

Module 2

Technical Area: This week, I gained a deeper understanding of the Stanford parser and how it works. I learned that the neural network's task is to choose the best transition (push to stack, left arc, or right arc) given the current state of the system (the stack, the buffer, and the set of dependency arcs). It was very exciting to start coding and see the Stanford parser run on the input sentences that I provided. I also learned a little about how Dask works and how we can use it in our project.
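
To make the three transitions concrete, here is a minimal sketch of an arc-standard style transition system. It is not the Stanford parser's code: the action sequence is written by hand, whereas the real parser uses its neural network to pick each action, and the function names and example sentence are assumptions for illustration.

```python
def apply_transition(stack, buffer, arcs, action):
    """Apply one transition in place to the parser state."""
    if action == "SHIFT":
        # Push the next word from the buffer onto the stack.
        stack.append(buffer.pop(0))
    elif action == "LEFT-ARC":
        # The top of the stack becomes the head of the word beneath it.
        dependent = stack.pop(-2)
        arcs.append((stack[-1], dependent))
    elif action == "RIGHT-ARC":
        # The word beneath the top becomes the head of the top word.
        dependent = stack.pop()
        arcs.append((stack[-1], dependent))
    else:
        raise ValueError(f"unknown action: {action}")

def run_transitions(words, actions):
    """Run a fixed action sequence and return the final state."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for action in actions:
        apply_transition(stack, buffer, arcs, action)
    return stack, buffer, arcs

# Hand-written actions for a toy sentence; arcs are (head, dependent) pairs.
stack, buffer, arcs = run_transitions(
    ["She", "takes", "aspirin"],
    ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"],
)
```

The parse is complete when the buffer is empty and only ROOT remains on the stack. Arc labels (such as subject or object) are omitted here to keep the state small.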

Soft skills: This week I furthered my communication skills by preparing a brief presentation to my group on how the Stanford parser operates. It was admittedly confusing for me to learn at first, but I was able to distill the idea down when explaining it to my teammates. I was also open about what I understood and what still confused me, and my teammates helped fill in some of the gaps in my knowledge.

Upcoming goals: To start parsing Medline abstracts and extract dependency paths that we can use as input to the EBC algorithm.

Module 3

Technical Area: This week, I learned more about the Ensemble Biclustering for Classification (EBC) algorithm. This algorithm is used to determine which drug-gene pairs and which dependency paths follow the same biological mechanisms. I read more about what biclustering is and how it works, and read a little about Information-Theoretic Co-Clustering (ITCC), which is the backbone of EBC. I implemented the EBC algorithm in Python this week. The code outputs a cluster assignment for each drug-gene pair and for each dependency path.
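
ITCC treats the normalized matrix as a joint distribution over rows and columns and looks for row and column clusterings that lose as little mutual information as possible when the matrix is collapsed into cluster blocks. The sketch below only evaluates that loss for two hand-picked clusterings of a toy matrix; it does not perform the iterative optimization, and the function names and data are assumptions for illustration.

```python
import math

def mutual_information(matrix):
    """Mutual information (in nats) of the joint distribution given by counts."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    mi = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value > 0:
                p = value / total
                mi += p * math.log(p * total * total / (row_sums[i] * col_sums[j]))
    return mi

def aggregate(matrix, row_clusters, col_clusters):
    """Sum the counts of the matrix into (row cluster, column cluster) blocks."""
    k, l = max(row_clusters) + 1, max(col_clusters) + 1
    blocks = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            blocks[row_clusters[i]][col_clusters[j]] += value
    return blocks

# Toy matrix with two obvious blocks (rows: pairs, columns: paths).
toy = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]

original = mutual_information(toy)
good_loss = original - mutual_information(aggregate(toy, [0, 0, 1, 1], [0, 0, 1, 1]))
bad_loss = original - mutual_information(aggregate(toy, [0, 1, 0, 1], [0, 1, 0, 1]))
```

The clustering that follows the block structure loses no information, while the one that mixes the blocks loses all of it, which is the signal the co-clustering objective rewards.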

Soft skills: This week I presented the EBC code I implemented to my group and walked them through what I understood and what I didn’t. I also learned to research on my own, because the EBC code in the paper authors' GitHub repository was written for Python 2; I went through and edited their code to make it work in Python 3.

Upcoming goals: Understand what the next steps are with this output (cluster assignments for drug-gene pairs and dependency paths), whether it is to produce the dendrogram or something else.

Module 4

Technical Area: This week, I continued to work on implementing the EBC algorithm. Each biclustering run assigns every drug-gene pair to a cluster. By repeating the biclustering many times, we can track how often any two drug-gene pairs land in the same cluster and record these counts in a co-occurrence matrix.
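
The counting step can be sketched as follows. The cluster assignments here are hard-coded toy values standing in for the output of repeated biclustering runs, and the function name is an assumption for illustration.

```python
def cooccurrence_matrix(runs):
    """Count, for every two items, the number of runs that put them together.

    runs: list of cluster assignments, one list per biclustering run, where
    runs[r][i] is the cluster label of drug-gene pair i in run r.
    """
    n = len(runs[0])
    matrix = [[0] * n for _ in range(n)]
    for assignment in runs:
        for i in range(n):
            for j in range(n):
                if assignment[i] == assignment[j]:
                    matrix[i][j] += 1
    return matrix

# Three hypothetical runs over three drug-gene pairs.
runs = [
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
]
cooc = cooccurrence_matrix(runs)
```

Cluster labels are only compared within a run, so it does not matter that label 0 in one run and label 0 in another refer to unrelated clusters. The diagonal always equals the number of runs.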

Soft skills: Not only did I gain a deeper intuitive understanding of EBC from Colin’s meeting, but I also learned how to phrase my questions so that I could truly understand what the original authors did at each step and why. This helped many things click for me, so if a recruiter asks me in the future how this project works, I can guide them through the general workflow.

Upcoming goals: Use this co-occurrence matrix along with DrugBank to check whether our clustering matches up with ground-truth relationships.

Module 5

Technical Area: This week, I was able to implement the supervised portion of the EBC algorithm. Using DrugBank data of known drug-gene pairs, we were able to compare our biclustering with the ground truth to see if EBC was working as intended.
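
One way to sketch this comparison is to score each held-out drug-gene pair by how often it co-clusters with a seed set of known pairs, then measure how well those scores separate known from unknown pairs using the area under the ROC curve (AUC). The scoring rule (a plain sum over seed pairs), the function names, and the toy numbers below are assumptions for illustration and may differ from the paper's exact procedure.

```python
def score_against_seed(cooc, seed, candidates):
    """Score each candidate by its total co-occurrence with the seed items."""
    return [sum(cooc[c][s] for s in seed) for c in candidates]

def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half. Requires at least one positive and one negative.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical co-occurrence counts for five drug-gene pairs over five runs.
cooc = [
    [5, 4, 1, 3, 0],
    [4, 5, 0, 2, 1],
    [1, 0, 5, 1, 4],
    [3, 2, 1, 5, 0],
    [0, 1, 4, 0, 5],
]
seed = [0, 1]          # pairs assumed known from the ground truth
candidates = [2, 3, 4] # held-out pairs to rank
labels = [0, 1, 0]     # 1 = candidate is also a known pair (toy labels)

scores = score_against_seed(cooc, seed, candidates)
test_auc = auc(scores, labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one. For real data, a library routine such as scikit-learn's `roc_auc_score` would be the more usual choice than this hand-rolled version.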

Soft skills: This week, I have been preparing for our team’s final presentation. It’s been helpful for me to look at our work from a bird’s-eye view to understand what each section is doing and how it all fits together. Additionally, since I didn’t request access to DrugBank data until relatively late, I sketched out an outline for the supervised EBC pipeline before I had access to the data. I found that this made my process smoother, as I had to be more intentional about how I designed my code. I know that software engineers in industry often design code by discussing the necessary variables, data structures, etc. before they start coding, so this was good practice for my future career.

Upcoming goals: See how many of our test sets had an AUC > 0.7, and create a dendrogram from the co-occurrence matrix rankings.