Concise overview of things learned:
Understood how to extract raw Medline sentences with pubmed_parser.
Learned about drug-target relationships, how drugs work, vector space models, and literature mining by reading the prerequisite and supplementary materials provided.
Read the main research paper in depth to understand what needs to be done.
Learned to use the requests library, pubmed_parser, nltk, the gzip module, and Trello.
Learned to work in a team: we divided the whole task into subtasks and then combined them to complete the task on time.
Learned how to use Trello to structure and manage the project in an easy way.
Improved my communication skills by interacting with other team members during team meetings and, for the very first time, worked on presentation slides for the team-wide and pathway-wide journal clubs.
Presented the research paper “Literature Mining for Biologists” and some parts of the main research paper in the team-wide journal clubs.
Presented the results section of the main research paper to the pathway-wide journal club.
Extracted the raw Medline sentences using pubmed_parser.
Detailed Statement of Tasks Completed:
- The main goal of the raw Medline text extraction task was to collect all the drug-gene pairs and extract only those sentences that contain at least one drug-gene pair. I downloaded the drug-gene pairs from PharmGKB and the drug list from DrugBank (getting approval from DrugBank took a long time). One of our team members, John, explained to us how to extract the .xml files. With that help, I used the requests module and pubmed_parser to extract the raw data, and the gzip library to handle the .xml files. I then built a list of drugs and a list of genes from the PharmGKB and DrugBank data; they contain 17K unique drugs and 20K unique genes. After extracting the raw sentences, I tokenized them with the nltk library and checked whether each token is present in the drug or gene list. If it was, I added the sentence to the usable_sentences list, and finally stored the data in .tsv format. I was able to extract more than 2000 raw Medline sentences.
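The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the actual script used: it assumes the abstracts have already been pulled out of the Medline .xml files (for example with pubmed_parser) as (pmid, text) pairs, and it swaps in simple regex-based stand-ins for nltk's sentence and word tokenizers so that it runs with the standard library only. The function names, the TSV column names, and the rule that a sentence must mention both a drug and a gene are assumptions made for the example.

```python
import csv
import io
import re


def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Naive stand-in for nltk.word_tokenize: lowercase alphanumeric tokens,
    # keeping internal hyphens so names like "5-fluorouracil" stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def find_usable_sentences(abstracts, drugs, genes):
    # abstracts: iterable of (pmid, abstract_text) pairs (hypothetical input shape).
    # Assumption: a sentence is usable only if it has at least one drug token
    # AND at least one gene token, i.e. it could contain a drug-gene pair.
    drug_set = {d.lower() for d in drugs}
    gene_set = {g.lower() for g in genes}
    usable_sentences = []
    for pmid, abstract in abstracts:
        for sentence in split_sentences(abstract):
            tokens = set(tokenize(sentence))
            found_drugs = sorted(tokens & drug_set)
            found_genes = sorted(tokens & gene_set)
            if found_drugs and found_genes:
                usable_sentences.append(
                    (pmid, ",".join(found_drugs), ",".join(found_genes), sentence)
                )
    return usable_sentences


def to_tsv(rows):
    # Serialize the usable sentences as tab-separated text with a header row.
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["pmid", "drugs", "genes", "sentence"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Because the lookup is per token, multi-word drug names would not be matched by this sketch; handling them would need phrase matching on top of the tokenization.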
Goals for the upcoming week:
- Dependency parsing of the raw data and creating the dependency matrix for the EBC algorithm.