CharlesIm - Machine Learning (Level 3) Pathway

Week 1

Technical Area

  1. Refamiliarized with the pandas library, mainly DataFrame and associated utilities

  2. Refamiliarized with the Regular Expression library

  3. Learned Beautiful Soup, Pubmed_parser, and Requests libraries

Tools

  1. Jupyter Notebook (via Google Colab)

  2. Magic commands (like %time and %pip)

  3. Beautiful Soup library

  4. Pubmed_parser library

  5. Regular expression library

  6. Requests library

Soft Skills

  1. Facilitated international team meetings and answered general questions

Achievements Highlights

  1. Successfully finished the web crawler and scraped all data

  2. Reviewed the research paper and attached code

  3. Scraped all Medline data, though not time-efficiently (~7 hours to parse and process)

Upcoming Goals

  1. Data cleaning

  2. Feature Engineering

  3. Stanford parser

  4. Dependency matrix

Tasks

Used the web crawler to scrape the required data from the Medline database website and created a CSV file to hold the abstract data.
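
The CSV step of this task can be sketched roughly as follows. This is a minimal illustration using only Python's standard csv module, not the code actually used; the column names (pmid, title, abstract), the write_abstracts helper, and the sample records are all hypothetical, and the scraping step itself is left out.

```python
import csv
import io

# Hypothetical column names; the real CSV layout is not given in this log.
FIELDS = ["pmid", "title", "abstract"]


def write_abstracts(records, handle):
    """Write scraped records (dicts) as CSV rows to an open text handle.

    Missing fields are written as empty strings so every row has the
    same columns. Returns the number of rows written.
    """
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow({field: record.get(field, "") for field in FIELDS})
        count += 1
    return count


# Made-up sample records standing in for scraped Medline entries.
sample = [
    {"pmid": "0000001", "title": "Example title", "abstract": "Text with, a comma."},
    {"pmid": "0000002", "title": "No abstract"},
]

buffer = io.StringIO()  # in-memory stand-in for a real file
rows_written = write_abstracts(sample, buffer)
```

When writing to a real file instead of an in-memory buffer, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.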

Week 2

Technical Area

  1. Learned Dask library, mainly dataframe and associated utilities

  2. Refamiliarized with the Regular Expression library again

Tools

  1. Jupyter Notebook (via Google Colab)

  2. Magic commands (for example, to run terminal commands)

  3. Regular expression library

  4. Dask library

Soft Skills

  1. Facilitated international team meetings and answered general questions

  2. Asked questions about gene targets list

Achievements Highlights

  1. Successfully reviewed the Dask Training Video

  2. Reviewed parser videos

  3. Loaded the complete (though not time-efficiently scraped) Medline data into a Dask dataframe

Upcoming Goals

  1. Data cleaning

  2. Feature Engineering

  3. Stanford parser

  4. Dependency matrix

Tasks

Loaded the CSV file into a Dask dataframe and attempted to process it; however, the gene list needed to complete the processing was not yet available, so regex was used to try to clean the data instead. Now that the team has access to the gene list, the data will be reprocessed and the code base refactored.
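
One way the reprocessing might combine the gene list with regex is sketched below. It is a standard-library illustration only, under several assumptions: the gene symbols, the helper names build_gene_pattern and find_genes, and the sample abstract are hypothetical, and matching is assumed to be case-sensitive on whole words. Applying the pattern to the Dask dataframe is not shown.

```python
import re


def build_gene_pattern(genes):
    """Compile one regex that matches any gene symbol as a whole word."""
    if not genes:
        raise ValueError("gene list must not be empty")
    # Try longer symbols first so a short symbol never shadows a longer one.
    ordered = sorted(set(genes), key=len, reverse=True)
    alternation = "|".join(re.escape(gene) for gene in ordered)
    # \b keeps "BRCA1" from matching inside a longer token such as "BRCA10".
    return re.compile(rf"\b(?:{alternation})\b")


def find_genes(text, pattern):
    """Return the sorted, de-duplicated gene symbols mentioned in text."""
    return sorted(set(pattern.findall(text)))


# Hypothetical gene list and abstract, for illustration only.
genes = ["TP53", "BRCA1", "EGFR"]
pattern = build_gene_pattern(genes)
abstract = "Mutations in TP53 and BRCA1 were observed; EGFR was unchanged. TP53 again."
found = find_genes(abstract, pattern)
```

Compiling a single alternation once and reusing it for every abstract avoids looping over the gene list per row, which may help with the processing time noted in Week 1.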