- Technical Area: Understood the Stanford Parser at a deeper level by reading the research paper
- Tools: GitHub integration in VS Code, setup of a shared Utilities file
- Soft Skills: Leadership: keeping the team's project management on track
Three Achievement Highlights:
- Ran the Stanford Parser with the correct implementation and configuration
- Became well acquainted and comfortable with VS Code's GitHub integration, and with the value of a shared Utilities file
- Organized and planned meetings for the group and facilitated clear deliverables for the entire group
Goals for Upcoming Week:
- Complete the EBC algorithm
- Kick off learning Docker
- Make significant progress toward the AWS cloud certification
Statement of Tasks Done
Here is my group’s current pipeline so far:
Within this pipeline, I supervised and organized the group and worked on the Stanford Parser.
One big hurdle I faced was the size of the biomedical dataset being parsed. I feared the parser would not run efficiently on the entire dataset, so I took a sample of the sentences and fed that output to the next member of my team. From the sample output, my teammate was able to continue the pipeline and get the desired results; only after the pipeline was validated did I run it on the entire dataset. I think this was a good approach, as iterative unit testing keeps the workflow lean and manageable. Moving forward, we need AWS because some sentences are too long for the PCFG grammar to parse.
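The sampling step can be sketched as follows; the function name, sample size, and token cutoff here are illustrative, not our actual code:

```python
import random

def sample_sentences(sentences, k=200, max_tokens=50, seed=42):
    """Draw a reproducible sample of short sentences for a pipeline dry run.

    Sentences longer than max_tokens are skipped, since the PCFG parser
    struggles with very long inputs.
    """
    rng = random.Random(seed)
    short = [s for s in sentences if len(s.split()) <= max_tokens]
    return rng.sample(short, min(k, len(short)))

# Toy corpus standing in for the biomedical sentences.
corpus = [
    "Aspirin inhibits platelet aggregation .",
    "The long sentence " + "token " * 80,
    "Metformin lowers blood glucose .",
]
sample = sample_sentences(corpus, k=2)
```

Fixing the seed keeps the dry run reproducible, so the downstream teammate always receives the same sample.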
Because the parser failed on some sentences that are too long, I thought it would be helpful to visualize the token lengths of the sentences in this biomedical corpus. Hopefully, once we use AWS services, the Stanford Parser will work on the entire dataset:
Here we can see that most sentences have a token length of less than 50, but the distribution is right-skewed, and it is the lengthier sentences that the parser has trouble with.
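The plot itself was made with Plotly; a minimal sketch of computing the underlying token-length buckets (using whitespace tokenization as a rough stand-in for the parser's tokenizer) might look like:

```python
from collections import Counter

def token_length_buckets(sentences, bin_width=10):
    """Bucket sentences by whitespace-token count to inspect the right skew."""
    counts = Counter()
    for s in sentences:
        n = len(s.split())
        counts[(n // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Toy corpus: two short sentences and one 72-token sentence.
corpus = [
    "Aspirin inhibits platelet aggregation in vitro .",
    "Metformin lowers blood glucose .",
    "word " * 72,
]
buckets = token_length_buckets(corpus)
# buckets maps the lower edge of each 10-token bin to a sentence count.
```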
Another hurdle I faced was getting familiar with the GitHub integration to keep good track of the entire group's repository and versions. Learning the direct integration with VS Code gave me a far more efficient and team-integrated workflow. It also made my code much easier to reproduce, as I created a Utilities repo for my personal use, an idea Collin mentioned that I find brilliant.
- Technical Area: Unit testing, coverage, Docker, EBC algorithm
- Tools: Better object-oriented design, loading Docker containers, unit testing with the unittest library
- Soft Skills: Leadership, initiating pair programming sessions, explaining and distilling highly technical concepts efficiently
2. Three Achievement Highlights:
- Got the Jython implementation running and integrated into the local system, and refactored the working repository to include test code as well
- Learned to apply better software development principles as a machine learning engineer: lower coupling, higher cohesion, SOLID principles, debugging, unit testing, and favoring composition over inheritance in object-oriented design
- Stronger intuition of the EBC algorithm and the entire pipeline that occurs in the paper
3. Goals for Upcoming Week:
- Finish implementation of the ITCC biclustering algorithm to get co-occurrence statistics
- For the supervised step, use information from the DrugBank database to create seed sets and test sets - ultimately to iterate to a minimum viable product and baseline model that represents the end-to-end pipeline
- Find out where to directly find the Dependency Matrix implementation, both for comparison and to build the pipeline further from there (I will also need to check with Collin that my current implementation is correct)
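For the co-occurrence goal above, here is a toy sketch of the kind of (entity-pair x dependency-path) count matrix that ITCC biclusters; the pairs and paths are invented for illustration, not drawn from our corpus:

```python
import numpy as np

# Hypothetical drug|gene pairs and dependency paths.
pairs = ["aspirin|COX1", "metformin|AMPK", "warfarin|VKORC1"]
paths = ["inhibits", "activates", "targets"]
occurrences = [
    ("aspirin|COX1", "inhibits"),
    ("aspirin|COX1", "targets"),
    ("metformin|AMPK", "activates"),
    ("warfarin|VKORC1", "inhibits"),
]

# Build the co-occurrence count matrix C, rows = pairs, columns = paths.
row = {p: i for i, p in enumerate(pairs)}
col = {q: j for j, q in enumerate(paths)}
C = np.zeros((len(pairs), len(paths)))
for pair, path in occurrences:
    C[row[pair], col[path]] += 1

# ITCC treats the counts as a joint distribution, so normalize to sum to 1.
P = C / C.sum()
```

In practice the real matrix is large and sparse, but the construction is the same: one increment per extracted (pair, path) occurrence.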
4. Statement of Tasks Done:
- I refactored our code structure in the repository. One hurdle is leading the team to use the conventions set for our project consistently so that we all implement them well. Although it is still not perfect, creating an initial “template” of best practices from my personal code helps guide the rest of the team toward the same or similarly effective conventions.
- I created a presentation and presented my team's progress to Collin, including a pipeline diagram that represents our achievements over time. I also prepared good questions and tried to drive a productive conversation about what it takes for all MLBI teams to produce a better replication and extension of the research paper. I have attached the short presentation to this self-assessment post.
- I used test-driven development to create a class that preprocesses the data fed into both the Java and Jython implementations of the Stanford Parser. Building a conceptual understanding of the implementation details, including using a debugger in tandem with unit testing, was a challenge; I addressed it by actively studying well-developed repositories such as the EBC implementation and creating dummy examples from scratch.
- I did project management to keep my team on track. Keeping members active has been a struggle, which I addressed through personal reach-outs.
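A minimal illustration of the test-driven style described above, with a hypothetical SentencePreprocessor (not our actual class):

```python
import unittest

class SentencePreprocessor:
    """Hypothetical preprocessor: normalize whitespace and reject
    sentences the parser cannot handle."""

    def __init__(self, max_tokens=50):
        self.max_tokens = max_tokens

    def clean(self, sentence):
        # Collapse all runs of whitespace to single spaces.
        return " ".join(sentence.split())

    def accepts(self, sentence):
        n = len(self.clean(sentence).split())
        return 0 < n <= self.max_tokens

class TestSentencePreprocessor(unittest.TestCase):
    def setUp(self):
        self.pre = SentencePreprocessor(max_tokens=5)

    def test_clean_collapses_whitespace(self):
        self.assertEqual(self.pre.clean("a \t b\n c"), "a b c")

    def test_rejects_overlong_sentence(self):
        self.assertFalse(self.pre.accepts("one two three four five six"))

    def test_accepts_short_sentence(self):
        self.assertTrue(self.pre.accepts("one two three"))
```

Run with `python -m unittest` to execute the tests; writing the test cases first is what forced the class's interface to stay small.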
STEM Away Progress Update and Questions July 12.pdf (1.9 MB)
- Technical Area: ITCC Biclustering, EBC Scoring, AUC Evaluation, Graphic Design
- Tools: NumPy, pandas, Google Slides, VS Code, scikit-learn, Plotly, Lucidchart
- Soft Skills: Presentation, effectively distilling complex concepts, understanding motivations of wide audiences, project management, leadership, framing information
2. Three Achievement Highlights:
Final Presentation.pdf (4.5 MB)
- Finished designing and creating the presentation slides (I created slides 1 to 12, 19 to 30, and 32 to 36; Sourav made 13 to 18; Uyen made 30 to 32; John helped with slides 4 and 7)
- Finished the implementation of EBC, which is up to date on GitHub
- Presented to Collin alongside Sourav and Uyen
3. Next Steps
- Navigate the final finishing touches on the implementation (the AUC curve looks inverted: the distribution is dense around 0.3, whereas the research paper reports around 0.7)
- Implement the feedback given by Collin and Debaleena to improve the slides
- Make crystal clear where each component of the work comes from (reduce ambiguity about sources)
- Add Drugbank labels to some of the test set member visualizations
- And more, as discussed in the actual feedback session
- More next steps are in slide 34 (refer to the actual slides above)
- Write a blog post about an interesting application and part of the work we have done in the past 2 months
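On the inverted AUC mentioned above: when a score measures distance rather than similarity, the ROC ranking reverses and the AUC mirrors around 0.5, which would turn a 0.7 into a 0.3. A toy sanity check with scikit-learn (the labels and scores are illustrative, not our DrugBank test set):

```python
from sklearn.metrics import roc_auc_score

# Toy binary labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]

auc = roc_auc_score(y_true, scores)

# Negating the scores reverses the ranking, so the AUC flips around 0.5:
flipped = roc_auc_score(y_true, [-s for s in scores])
assert abs(auc + flipped - 1.0) < 1e-9
```

If our pipeline's score is really a distance, negating it (or swapping the label convention) should recover the orientation the paper reports.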
4. Statement of Tasks Done:
- Created pipeline diagrams for EBC Scoring and AUC Scoring
- Thought of how to effectively present and frame the work we have done so far
- Read and actively thought about the dissertation and research paper by Percha
- Researched the merits of biclustering, ontologies, bioinformatics impacts, and the landscape of high-volume unstructured data
- Organized the rehearsal of the presentation and helped give some feedback to the team
- Met with Collin during the week to ask questions about our current pipeline and about how scoring works