Machine Learning Level 2 Self-Assessment: Ruijie Ma

Module 1

Concise overview of things learned:

Went through the 2015 paper in depth to understand the request and deliverables.

Technical area:

  • Learned that text from a specialized field requires a field-specific parser to gather relevant information. In this case, we used pubmed_parser to parse the abstracts from MEDLINE data.

Tools:

requests library, pubmed_parser, pandas

Soft skills:

  • Broke the ice and got familiar with the team; learned each member's capacity and strengths so we could split tasks efficiently.
  • Learned the importance of breaking a task down into smaller parts.
  • Learned to use tools such as journal club to build a comprehensive understanding of the task.

Achievement highlights:

  • Successfully split the “10 prompts” across our team and got everyone on the same page after journal club.
  • Presented and explained my understanding of part of the 2015 paper.
  • Installed and used pubmed_parser to extract abstracts from MEDLINE data.

Detailed Statement of Tasks Completed:

  • To work toward the final goal, we started by exploring a single document from the MEDLINE database. We used the requests package to download one file from MEDLINE, used pubmed_parser to extract the abstracts from the original file, and converted the result into a pandas DataFrame for cleaning. In the cleaning stage, we simply dropped empty rows and eliminated other irrelevant information.
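The fetch-parse-clean flow above can be sketched as follows. This is a minimal illustration, not the team's actual script: the download and parse calls are shown as comments (pubmed_parser's `parse_medline_xml` returns a list of dicts), and the record field names (`pmid`, `abstract`) and values here are assumptions mocked for the example.

```python
# Sketch of the fetch -> parse -> clean flow (field names are assumptions).
# In the real pipeline the records would come from something like:
#   resp = requests.get(medline_url)                   # download one file
#   records = pubmed_parser.parse_medline_xml(path)    # list of dicts
# Here we mock the parser output as a list of dicts so the sketch runs.
records = [
    {"pmid": "1", "abstract": "Gefitinib inhibits EGFR signaling."},
    {"pmid": "2", "abstract": ""},  # empty abstract: should be dropped
    {"pmid": "3", "abstract": "Unrelated text."},
]

# Cleaning stage: drop rows with empty abstracts, keep only relevant fields.
cleaned = [
    {"pmid": r["pmid"], "abstract": r["abstract"].strip()}
    for r in records
    if r.get("abstract", "").strip()
]
```

In the real pipeline the `cleaned` list would then be loaded into a pandas DataFrame for further cleaning.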

Goals for the upcoming week:

  • Filter the data with known drug and gene names, then parse the dependency paths using the Stanford Parser.

Module 2

Concise overview of things learned:

Filtered the dataset into a format usable as input to the Stanford Parser, then parsed the dependency paths of the sentences.

Technical area:

  • Explored the most reasonable logic for filtering the sentences in the abstracts.
  • Figured out how to run the Stanford Parser from Python.
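The sentence-filtering logic above might look like the sketch below: keep only sentences that mention at least one known drug name and at least one known gene name. The name sets and the sentence splitting here are invented placeholders, not the real PharmGKB/DrugBank lists or the pipeline's tokenizer.

```python
# Hypothetical sketch: keep sentences containing at least one known drug
# AND at least one known gene (names below are examples, not the real lists).
drugs = {"gefitinib", "imatinib"}
genes = {"egfr", "abl1"}

def has_drug_and_gene(sentence):
    # Crude tokenization: lowercase and strip common punctuation.
    tokens = {tok.strip(".,;()").lower() for tok in sentence.split()}
    return bool(tokens & drugs) and bool(tokens & genes)

abstract = ("Gefitinib inhibits EGFR signaling. "
            "The weather was mild. "
            "Imatinib targets ABL1 kinase.")
hits = [s for s in abstract.split(". ") if has_drug_and_gene(s)]
```

Only the two drug-gene sentences survive the filter; the real pipeline would use a proper sentence tokenizer (e.g. from NLTK) instead of splitting on ". ".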

Tools:

pandas, NumPy, NLTK, Stanford Parser

Achievement highlights:

  • Found the most reasonable logic for filtering sentences against different sets of drug and gene names.
  • Learned how to run the Stanford Parser from both Java and Jython.
  • Found a way to run the Stanford Parser from Python, which improves efficiency and makes it easier to build the pipeline at the end of the project.

Detailed Statement of Tasks Completed:

  • To filter the sentences in the abstracts that contain the drug and gene names we want, we went through the PharmGKB, UniProt, and DrugBank datasets and ultimately decided to use the gene names from PharmGKB and the drug names from DrugBank, as these best fit the goals of this project. We were stuck for a while on how to run the Stanford Parser from either Jython or Java and connect it to our Python pipeline. The problem was solved once I found that we can run the parser through the NLTK package after setting the local path parameters.
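The NLTK route mentioned above can be configured roughly as sketched below. The jar paths are placeholders, and NLTK's `nltk.parse.stanford` interface has since been deprecated in favor of `nltk.parse.corenlp`, so treat this as an illustrative sketch rather than the pipeline's exact code.

```python
import os

# Point NLTK at a local Stanford Parser install (paths are placeholders).
os.environ["STANFORD_PARSER"] = "/opt/stanford-parser/stanford-parser.jar"
os.environ["STANFORD_MODELS"] = "/opt/stanford-parser/stanford-parser-models.jar"

# With the environment set, the parser can be driven from Python, e.g.:
#   from nltk.parse.stanford import StanfordDependencyParser
#   parser = StanfordDependencyParser()
#   graph = next(parser.raw_parse("Gefitinib inhibits EGFR."))
#   triples = list(graph.triples())  # dependency triples for path extraction
# (Kept as comments here because it requires the Stanford jars on disk.)
```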

Goals for the upcoming week:

  • Figure out how ITCC (information-theoretic co-clustering) works and perform biclustering through EBC.

Module 3

Concise overview of things learned:

Worked on ensemble biclustering (EBC) for the unsupervised stage of the project and figured out the scoring function for the supervised stage.

Technical area:

  • Learned how to run EBC on our inputs and gather the desired results.
  • Performed a grid search to find the optimal numbers of row and column clusters.

Tools:

EBC package, pandas

Soft skills:

  • Learned to weigh the trade-off between time and opportunity cost when deciding whether to compute and use different heuristic values on different datasets.

Achievement highlights:

  • Successfully converted our data matrix into a sparse format and ran EBC on it multiple times.
  • Found the optimal numbers of row and column clusters.
  • Understood how the scoring function works.
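The "sparse matrix" step above can be sketched as turning raw co-occurrence observations into (row, column, value) triples, with drug-gene pairs as rows and dependency paths as columns. The pair and path strings below are invented examples, and the exact input format EBC expects (e.g. a tab-separated triples file) should be checked against the package's documentation.

```python
from collections import Counter

# Hypothetical sketch: build a sparse co-occurrence matrix as
# (row, column, count) triples, where rows are drug-gene pairs and
# columns are dependency paths (all values here are invented examples).
observations = [
    ("gefitinib-EGFR", "nsubj|inhibits|dobj"),
    ("gefitinib-EGFR", "nsubj|inhibits|dobj"),
    ("imatinib-ABL1",  "nsubj|targets|dobj"),
]

counts = Counter(observations)
sparse = [(pair, path, n) for (pair, path), n in counts.items()]
```

Storing only the nonzero triples keeps the matrix tractable, since most drug-gene pairs never co-occur with most dependency paths.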

Detailed Statement of Tasks Completed:

  • We went through the EBC package and figured out how the ITCC algorithm works. To get the most accurate results, I followed the steps the researchers used and performed a grid search to find the optimal numbers of row and column clusters, writing a Python script to find the (k, l) for our datasets.
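The grid search described above can be sketched as follows. `run_ebc` is a hypothetical stand-in for running EBC at a given (k, l) and returning a quality score; it is mocked here so the sketch runs, and the search ranges are illustrative only.

```python
import itertools

# Sketch of the grid search over cluster counts: k row clusters, l column
# clusters. `run_ebc` is a mock stand-in for running EBC and returning a
# quality score to maximize (higher is better in this sketch).
def run_ebc(k, l):
    return -((k - 3) ** 2 + (l - 4) ** 2)  # mock: peaks at k=3, l=4

# Try every (k, l) combination in the search range and keep the best.
best = max(itertools.product(range(2, 6), range(2, 6)),
           key=lambda kl: run_ebc(*kl))
```

In practice each `run_ebc` call is expensive, so the real script would cache or parallelize the runs and likely average over multiple random initializations per (k, l).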

Goals for the upcoming week:

  • Finish the scoring function and create a dendrogram.