Mtaruno - Machine Learning (Level 3) Pathway

mtaruno · July 4, 2021, 2:23pm

Overview:

Technical Area: Understood Stanford Parser at a deeper level by reading the research paper
Tools: Github VSCode Integration, Use of Utilities file setup
Soft Skills: Leadership: Making sure the team’s project management is not lagging behind

Three Achievement Highlights:

Applied correct Stanford Parser implementation under correct configurations
Became well acquainted and very comfortable with VS Code’s Github integration, and learned the value of keeping a Utilities file
Organized and planned meetings for the group and facilitated clear deliverables for the entire group

Goals for Upcoming Week:

Complete the EBC algorithm
Kickoff learning docker
Significant progress with AWS cloud certification

Statement of Tasks Done

Here is my group’s current pipeline so far:

In the pipeline, I supervised and kept the group organized and worked on the Stanford Parser.

One big hurdle I faced was the size of the biomedical data to be parsed. I feared that the parser would not run efficiently on the entire dataset, so I took a sample of the sentences and fed that output to the next member of my team. From the sample output, my teammate was able to continue the pipeline and get the desired results. Only after the pipeline was validated did I run it on the entire dataset. I think this was a good approach, as iterative unit testing keeps the workflow lean and manageable. Moving forward, we need AWS because some sentences are too long to parse with the PCFG grammar.
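A minimal sketch of the sample-first approach, assuming the corpus is already a list of sentence strings. The function name `sample_sentences`, the sample size, and the toy corpus are placeholders for illustration, not the code we actually used:

```python
import random


def sample_sentences(sentences, k, seed=0):
    """Draw k sentences without replacement, reproducibly via a fixed seed."""
    rng = random.Random(seed)
    k = min(k, len(sentences))  # never ask for more than the corpus holds
    return rng.sample(sentences, k)


# Toy stand-in for the biomedical corpus (hypothetical data).
corpus = [f"Sentence number {i} about a drug and a gene." for i in range(1000)]

# Small pilot batch to push through the parser and downstream steps first.
pilot = sample_sentences(corpus, k=50)
```

Fixing the seed means teammates downstream can regenerate exactly the same pilot batch while the pipeline is being validated.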

Because it did not run on some sentences that are too long, I thought it would be helpful to visualize the token lengths of the sentences in this biomedical corpus - hopefully when we use AWS services the Stanford parser will work on the entire dataset:

Here we can see that most of the sentences have a token length of less than 50, but ultimately there is a right skew in the distribution and the sentences that are lengthier are what the parser has trouble with.
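A rough sketch of how the token lengths could be computed and the long sentences set aside. It assumes whitespace tokenization (the Stanford tokenizer will give somewhat different counts), and the cutoff of 50 is only an illustrative threshold taken from the plot, not a limit of the parser:

```python
from collections import Counter

MAX_TOKENS = 50  # assumed cutoff for illustration only


def token_length(sentence):
    """Approximate token count using whitespace splitting."""
    return len(sentence.split())


def split_by_length(sentences, max_tokens=MAX_TOKENS):
    """Separate sentences into those within the cutoff and those above it."""
    short, too_long = [], []
    for s in sentences:
        (short if token_length(s) <= max_tokens else too_long).append(s)
    return short, too_long


def length_histogram(sentences, bin_width=10):
    """Count sentences per token-length bin, keyed by the bin's lower edge."""
    return Counter((token_length(s) // bin_width) * bin_width for s in sentences)


# Toy sentences (hypothetical data); the second one has 75 tokens.
sentences = [
    "aspirin inhibits COX1",
    "word " * 75,
    "warfarin interacts with CYP2C9 in patients",
]
short, too_long = split_by_length(sentences)
hist = length_histogram(sentences)
```

The `too_long` list is the part of the corpus that would be deferred to a larger machine, while `short` can be parsed locally.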

Another hurdle I faced was getting familiar with the Github integration to keep good track of the entire group’s repository and versions. Learning the direct integration with VS Code gave me a much more efficient and team-integrated workflow. It also made reusing code much easier, as I could make a Utilities repo for my personal use, an idea Collin mentioned that I find brilliant.

mtaruno · July 21, 2021, 8:31am

1. Overview:

Technical Area: Unittesting, coverage, Docker, EBC Algorithm
Tools: Better Object Oriented Design, Being able to load Docker Containers, Unit Testing with Unittest library
Soft Skills: Leadership, Initiating Pair Programming Sessions, Explaining and Distilling Highly Technical Concepts efficiently

2. Three Achievement Highlights:

Jython implementation running and integrated into the local system. Refactored the working repository with test code as well.
I have learned to apply better software development principles as a machine learning engineer, such as lower coupling, higher cohesion, SOLID principles, debugging, unit testing, and composition over inheritance in object-oriented design
Stronger intuition of the EBC algorithm and the entire pipeline that occurs in the paper

3. Goals for Upcoming Week:

Finish implementation of ITCC Biclustering algorithm to get co-occurrence statistics
For the supervised step, use information from the DrugBank database to create seed sets and test sets - ultimately to iterate to a minimum viable product and baseline model that represents the end-to-end pipeline
Find out where to directly find the Dependency Matrix implementation for a comparison and to further build pipeline from there (also will need to check with Collin to see if my current implementation is correct).

4. Statement of Tasks Done:

I refactored our code structure in our repository. A hurdle is to lead the team to be consistent in using the conventions set for our project so that we are all able to implement it well. Although it is still not perfect, initially creating a “template” of the best practices from my personal code is helpful in guiding the rest of the team to use the same or similarly effective conventions.
I presented my team’s progress to Collin and created a presentation. Created a pipeline diagram to represent our achievements through time. I also thought of good questions to ask and tried to drive a productive conversation about what it takes for all MLBI teams to produce a better replication and extension of the research paper. I attach the short presentation I used to this Self Assessment post.
I used test-driven development to create a class that preprocesses the data that is fed into both the Java and Jython implementations of the Stanford parser. The hurdle was gaining a conceptual understanding of the implementation details, including using a debugger in tandem with unit testing. I addressed this by actively studying well-developed repositories such as the EBC implementation and by creating dummy examples from scratch.
Project management to keep my team on track. Keeping members active has been a struggle, and I addressed this by reaching out to them personally.
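To illustrate the test-driven workflow with the Unittest library, here is a small sketch. The class `SentencePreprocessor` and its methods are hypothetical stand-ins, not the actual class in our repository; the point is only the shape of writing the tests alongside a small, focused class:

```python
import unittest


class SentencePreprocessor:
    """Hypothetical preprocessor: normalizes whitespace and drops empty lines."""

    def clean(self, text):
        # Collapse tabs, newlines, and repeated spaces into single spaces.
        return " ".join(text.split())

    def prepare(self, lines):
        # Clean every line and keep only the non-empty results.
        cleaned = (self.clean(line) for line in lines)
        return [line for line in cleaned if line]


class TestSentencePreprocessor(unittest.TestCase):
    def setUp(self):
        self.pre = SentencePreprocessor()

    def test_clean_collapses_whitespace(self):
        self.assertEqual(
            self.pre.clean("  aspirin \t inhibits  COX1\n"),
            "aspirin inhibits COX1",
        )

    def test_prepare_drops_empty_lines(self):
        self.assertEqual(self.pre.prepare(["a b", "   ", ""]), ["a b"])
```

Saved in a test file, this can be run with `python -m unittest`, and a failing test can be stepped through in the VS Code debugger.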

STEM Away Progress Update and Questions July 12.pdf (1.9 MB)

mtaruno · July 28, 2021, 9:21am

1. Overview:

Technical Area: Distributional semantics deeper dive, Biclustering, Got AWS Cloud Certified
Tools: VSCode, Dask, AWS Instance Usages and Sagemaker, VSCode Efficiency
Soft Skills: Importance of showcasing, motivating team members as a team lead, better task delegation methods

2. Three Achievement Highlights:

Built the EBC Pipelines into the repository
Improving as a coder to deliver high quality code in a more modular way and doing so faster
Started thinking and planning for showcasing

3. Goals for Upcoming Week:

Finish replication implementations - including ITCC Biclustering algorithm to get co-occurrence statistics and seed set ground truth signals from Drugbank and agglomerative clustering
Refine Github and create high quality showcasing materials - including encouraging team to create their own Jupyter Notebooks for explaining each of their pipeline works, my blog post, and code quality
Do more research on applications of distributional semantics and biclustering that appeal to a wide audience, so that I can write an engaging blog post on the topic and finish in time for Collin to provide feedback (likely focused on the financial stakes where these ML techniques can provide a lot of value)

4. Statement of Tasks Done:

Refactored and implemented the EBC Pipelines at a deeper level
Upskilled in Dask, AWS, VSCode familiarity/efficiency (use of linters, making keyboard shortcuts feel more natural, and automatic best practice formatting), and writing production ready code
Project management for the team (scheduling meetings, guidance, relaying updates from Collin’s meetings)
Did deeper dive into the intuitive general purpose power of distributional semantics and biclustering in the context of financial stakes
In team meetings, making sure to engage with Collin to ask questions and provide updates

mtaruno · August 4, 2021, 1:38am

1. Overview:

Technical Area: Python packaging, module imports, co-occurrence matrix, ITCC implementation, upskilling my reading skills
Tools: Github, MarginNotes, Python
Soft Skills: Project management, team motivation and leadership

2. Three Achievement Highlights:

Finished EBC Algorithm artifact creation by consolidating DrugBank, all the dependency paths, and a hash table mapping paths and pairs to their corresponding indices
Created a Notion document with a Kanban board, tasks, pipeline steps, and a log for everyone to have clear deliverables and facilitate more clear team contributions from members.
Understanding the co-occurrence matrix pipeline at a deeper level by further reading (and actively annotating - creating my own mind map representations of the complex concepts) in the research paper and catching up with previous meeting recordings
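As a sketch of what the hash table of paths and pairs can look like, the snippet below maps each drug-gene pair and each dependency path to an integer index and counts how often they co-occur. The function names, the path strings, and the toy observations are made up for illustration and do not reflect the exact format of our artifacts:

```python
from collections import defaultdict


def build_index(items):
    """Map each distinct item to an integer index in first-seen order."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))
    return index


def cooccurrence_counts(observations):
    """observations: iterable of ((drug, gene), dependency_path) tuples."""
    observations = list(observations)
    pair_index = build_index(pair for pair, _ in observations)
    path_index = build_index(path for _, path in observations)
    counts = defaultdict(int)
    for pair, path in observations:
        # Sparse matrix entry: (row = pair index, column = path index).
        counts[(pair_index[pair], path_index[path])] += 1
    return pair_index, path_index, dict(counts)


# Toy observations with hypothetical path strings.
observations = [
    (("warfarin", "CYP2C9"), "nsubj>metabolizes<dobj"),
    (("warfarin", "CYP2C9"), "nsubj>metabolizes<dobj"),
    (("aspirin", "PTGS1"), "nsubj>inhibits<dobj"),
    (("aspirin", "PTGS1"), "nsubj>metabolizes<dobj"),
]
pair_index, path_index, counts = cooccurrence_counts(observations)
```

Keeping the counts as a dictionary keyed by index pairs stays compact when most pair-path combinations never occur, which is the usual case for this kind of matrix.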

3. Goals for Upcoming Week:

Get the MVP showcase presentation - learn and download appropriate templates to make it more engaging
Finish ITCC algorithm implementation to get out the co-occurrence matrix
Get the MVP output dendrogram

4. Statement of Tasks Done:

Led the team in establishing goals and project management - John finished the drug-gene pair parsing, and Sourav contributed by running his own pipeline to get the data artifact required for the drug-gene path mappings. The Notion document will be a big help in keeping everyone on the same page.
Further packaged up the main repository and finished implementing the EBC artifact creation (including the seed set integration)
Learned how best to combine the Python interactive shell with scripts containing useful classes and functions for optimal productivity in my workflow
Debugged Github and script/module import problems - trying to merge the EBC implementations from the initial repo seamlessly into our repository flow

mtaruno · August 10, 2021, 8:28am

1. Overview:

Technical Area: ITCC Biclustering, EBC Scoring, AUC Evaluation Metrics
Tools: NumPy, Pandas
Soft Skills: Project management, leadership

2. Three Achievement Highlights:

Finished ITCC biclustering and implementation details of co-occurrence matrix
Major progress in presentation materials and pipeline detailing
Deeper comprehension of the AUC Evaluation Metrics and EBC Scoring

3. Goals for Upcoming Week:

Finish replication and extension of EBC Scoring, Final Dendrogram, and AUC Evaluation Metrics
Finish research on the research motivations, effectiveness, framing, and extensions to the research paper
Finish presentation preparation, practice, and materials

4. Statement of Tasks Done:

Finished implementation of ITCC biclustering to get the co-occurrence matrix
Gained comprehension of the scoring at a deeper level and AUC evaluation
Created a pipeline diagram (shown above) that outlines the specific processes and the entire pipeline - continuously trying to find the most effective way to represent the complex processes to an audience
Created the design and content template for the presentation
Thought of the most salient points to distill on the presentation and research to create an engaging showcase of our work to a wide audience
Project management for the team - hosted individual meetings with members and kept the team on the same page
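For intuition on the ITCC step: information-theoretic co-clustering looks for row and column clusters that lose as little mutual information as possible when the co-occurrence matrix is collapsed to cluster level. The sketch below only evaluates that quantity for a given clustering; it does not perform the iterative optimization, and the toy matrix and cluster assignments are invented for illustration:

```python
import math


def mutual_information(counts):
    """Mutual information (in bits) of the joint distribution given by a count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:
                p = c / total
                mi += p * math.log2(p / ((row_sums[i] / total) * (col_sums[j] / total)))
    return mi


def collapse(counts, row_clusters, col_clusters):
    """Sum counts into a cluster-level matrix using the given assignments."""
    n_rows = max(row_clusters) + 1
    n_cols = max(col_clusters) + 1
    out = [[0] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            out[row_clusters[i]][col_clusters[j]] += c
    return out


# Toy pair-by-path counts with an obvious two-block structure.
counts = [
    [5, 5, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 7, 3],
    [0, 0, 2, 8],
]
clustered = collapse(counts, row_clusters=[0, 0, 1, 1], col_clusters=[0, 0, 1, 1])
loss = mutual_information(counts) - mutual_information(clustered)
```

A small `loss` means the clustering preserves most of the structure in the original matrix; collapsing can never increase the mutual information, so the loss is non-negative.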

mtaruno · August 19, 2021, 2:53am

1. Overview:

Technical Area: ITCC Biclustering, EBC Scoring, AUC Evaluation, Graphic Design
Tools: NumPy, Pandas, Google Slides, VS Code, Scikitlearn, Plotly, LucidChart
Soft Skills: Presentation, effectively distilling complex concepts, understanding motivations of wide audiences, project management, leadership, framing information

2. Three Achievement Highlights:

Final Presentation.pdf (4.5 MB)

Finished designing and creating the presentation slides (I created slides 1 to 12, 19 to 30, and 32 to 36; Sourav made 13 to 18; Uyen made 30 to 32; John helped with slides 4 and 7)
Finished implementation of the EBC which is up to date in the Github
Presented to Collin alongside Sourav and Uyen

3. Next Steps

Navigate final finishing touches with the implementation (the AUC results look inverted: our distribution is dense around 0.3, whereas in the research paper it is dense around 0.7)
Implement the feedback given by Collin and Debaleena to improve the slides
- Make sure the sources of the work are crystal clear, so it is obvious where each component comes from (reduce ambiguity)
- Add Drugbank labels to some of the test set member visualizations
- And more, as discussed in the actual feedback session
More next steps are in slide 34 (refer to the actual slides above)
Write a blog post about an interesting application and part of the work we have done in the past 2 months
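On the inverted AUC: one thing worth checking is whether the scores are ranked in the opposite direction from what the evaluation expects, because reversing the ranking turns an AUC of a into 1 - a (so 0.3 would become 0.7). This is only a hypothesis about our pipeline, not a confirmed cause. The sketch below uses a plain pairwise definition of AUC with made-up scores and labels:

```python
def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): positives tend to receive LOW scores here.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [0, 0, 1, 0, 1]

original = auc(scores, labels)
flipped = auc([-s for s in scores], labels)  # same data, ranking reversed
```

Here `original` and `flipped` add up to 1, so if flipping the sign of our scores moves the distribution to around 0.7, the issue is likely a sign or sort-order convention rather than the scoring itself.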

4. Statement of Tasks Done:

Created pipeline diagrams for EBC Scoring and AUC Scoring
Thought of how to effectively present and frame the work we have done so far
Read and actively thought about the dissertation and research paper from Percha
Researched about the merits of biclustering, ontologies, bioinformatic impacts, and the landscape of high volume unstructured data
Organized the rehearsal of the presentation and helped give some feedback to the team
Met with Collin during the week to ask questions about our current pipeline and about how scoring works