AliJilani - Machine Learning Pathway

Overview of things learned:

  • Technical Area: Practiced more with various data cleaning and pre-processing techniques, and worked further with pandas DataFrames and CSV files in our assignments (a minimal sketch follows this list). Through the BERT webinars, learned the significance of the BERT model and how to apply it. Through the NLP webinars and team meetings, also learned why NLP models matter and how BERT relates to them.
  • Tools: Kept communication channels open with my team through Slack, tracked our tasks using Asana, and shared work through GitHub.
  • Soft skills: Applied communication skills with my sub-team to plan out our tasks and achieve our goals. Used time-management techniques to give myself and the team adequate time to work on our respective tasks.
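A minimal sketch of the kind of pandas/CSV work described above; the file name (posts.csv) and the "text" column are hypothetical stand-ins for our actual files:

```python
import pandas as pd

# Load the scraped posts; "posts.csv" and the "text" column are
# assumed names, not necessarily our actual schema.
df = pd.read_csv("posts.csv")

# Quick sanity checks before cleaning.
print(df.shape)              # rows (posts) x columns
print(df.columns.tolist())   # available fields
print(df.head())

# Drop rows with empty post text and save a working copy.
df = df.dropna(subset=["text"])
df.to_csv("posts_working.csv", index=False)
```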

Achievements I have had so far:

  • Our team successfully scraped more than 5,000 posts from a Discourse forum.
  • Cleaned and pre-processed the data by removing HTML tags and stopwords, tokenizing, and lemmatizing (see the sketch after this list).
  • Learned more about the BERT model and NLP, and how they may be applied to our project.
  • Applied BERT to a smaller sample of our overall collected data.
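A minimal sketch of the cleaning steps listed above (tag removal, stopword removal, tokenization, lemmatization) using nltk; the regex-based tag stripping and the example sentence are illustrative, not our exact code:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads for the nltk resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_post(text):
    """Strip tags, lowercase, tokenize, drop stopwords, lemmatize."""
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    tokens = word_tokenize(text.lower())         # lowercase + tokenize
    tokens = [t for t in tokens if t.isalpha()]  # keep alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(clean_post("<p>The cats were sitting on the mats!</p>"))
# -> ['cat', 'sitting', 'mat']
```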

List of meetings I have joined so far:

  • Weekly team meetings, including the general team meetings, web scraping workshops, and data processing meetings.
  • BERT Webinar(s) by Industry Lead
  • Git Webinars
  • Intro to Python Webinar(s)
  • STEMCast Webinar(s)

Goals of the upcoming week:

  • Successfully apply the BERT model to our full data set (5,000+ posts).

Tasks Done:

  • Web Scraping: Scraped 5,000+ posts from the forum using Selenium, stored the data in a pandas DataFrame, and then converted it to a CSV file.
  • Data Cleaning: Cleaned the data with several methods using the nltk module.
  • Learned more about how BERT works by re-watching the webinars and doing online research.
  • Applied the BERT model to a smaller sample of the data collected from our selected forum (a sketch follows this list).
  • Communicated with the team along each step of the way.
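A minimal sketch of applying BERT to a small sample, using the Hugging Face transformers library as one common way to load a pretrained BERT (our exact approach from the webinars may differ); the file and column names are assumptions:

```python
import pandas as pd
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical file/column names for the cleaned posts.
sample = pd.read_csv("posts_working.csv")["text"].head(50).tolist()

# Pad/truncate so the whole sample batches into one tensor.
inputs = tokenizer(sample, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use each post's [CLS] hidden state as a fixed-size embedding.
embeddings = outputs.last_hidden_state[:, 0, :]
print(embeddings.shape)  # (50, 768) for bert-base
```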

Problems Faced and how they were solved:

  • Was still a bit unclear on how BERT should be applied to our data, but re-watching the webinars gave me a better view of how to go about it.

Self Assessment Week 3:

Overview of things learned:

  • Technical Area: Practiced more with various data cleaning and pre-processing techniques, and worked further with pandas DataFrames and CSV files in our assignments. Through the BERT webinars, learned the significance of the BERT model and how to apply it. Through the NLP webinars and team meetings, also learned why NLP models matter and how BERT relates to them.
  • Tools: Set up communication channels with my team through Slack, kept up with our tasks using Asana, and shared our work after scraping and pre-processing through GitHub.
  • Soft skills: Applied communication skills with my sub-team to plan out our tasks and achieve the goal. Used time-management techniques to give myself and the team adequate time to work on our respective tasks.

Three achievements I have had so far:

  • Our team successfully scraped more than 5,000 posts from a Discourse forum.
  • Cleaned and pre-processed the data by removing HTML tags and stopwords, tokenizing, and lemmatizing.
  • Learned more about the BERT model and how it may be applied to our project.

List of meetings I have joined so far:

  • Weekly team meetings, including the general team meetings, web scraping workshops, and data processing meetings.
  • BERT Webinar(s) by Industry Lead
  • Git Webinars
  • Intro to Python Webinar(s)
  • STEMCast Webinar(s)

Goals of the upcoming week:

  • Successfully apply the BERT model to our full data set (5,000+ posts).

Tasks Done:

  • Web Scraping: Scraped 5,000+ posts from the forum using Selenium, stored the data in a pandas DataFrame, and then converted it to a CSV file (see the sketch after this list).
  • Data Cleaning: Cleaned the data with several methods using the nltk module.
  • Communicated with the team along each step of the way.
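A minimal sketch of the Selenium flow described above (scrape posts, collect into a DataFrame, write a CSV); the URL and CSS selectors are placeholders, since every Discourse forum's markup differs slightly:

```python
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://forum.example.com/latest")  # placeholder URL

posts = []
for _ in range(50):  # scroll to trigger the forum's infinite loading
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # brief pause so newly loaded posts can render
    for el in driver.find_elements(By.CSS_SELECTOR, ".topic-post .cooked"):
        text = el.text.strip()
        if text and text not in posts:
            posts.append(text)

driver.quit()

# Store the posts in a DataFrame and convert to a CSV file.
df = pd.DataFrame({"text": posts})
df.to_csv("posts.csv", index=False)
```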

Problems Faced and how they were solved:

  • At first, when we scraped our data, we used a forum that had fewer than 200 posts. After learning that 5,000 posts would work much better for our future tasks, I found a new forum with 5,000+ posts that we then used instead.
  • Scraping that much data took quite a bit of time, so I reduced the fixed delay I had initially used to let the posts load (an alternative approach is sketched below).
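One alternative to shrinking a fixed delay is Selenium's explicit waits, which block only until the content actually appears; a minimal sketch with a placeholder URL and selector:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://forum.example.com/latest")  # placeholder URL

# Wait up to 10 seconds, but return as soon as posts are present,
# instead of always sleeping for a fixed amount of time.
wait = WebDriverWait(driver, 10)
posts = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".topic-post"))
)
print(len(posts), "posts loaded")
driver.quit()
```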

Self Assessment Week 1-2:

Overview of things I learned:

  • Technical Area: I learned how to collect data from online community forums through web scraping in Python using Beautiful Soup and Selenium. I was then able to save that data and perform preprocessing on it, such as lowercasing, lemmatization, and tokenization, using nltk. I also learned of other forms of preprocessing.
  • Tools: I learned how to use Python in Jupyter notebooks and to use Asana, Slack, and the StemAway website for communication and team management. I also learned how to collaborate on a project through GitHub.
  • Soft skills: I learned how to manage my time so that I could follow along with the material being presented and still have adequate time to practice between meetings. I also communicated with our team through meetings and other channels.

Three Highlights:

  1. Was able to scrape data from multiple forums and similar sites in order to gather different types and sets of data.
  2. Was able to perform preprocessing on these sets of data, including lowercasing, lemmatization, and tokenization.
  3. Followed along with the team to learn more about the overall project and skills required.

List of meetings I have joined so far:

  • STEMCast Webinar(s) on Web Scraping
  • Our team meetings dealing with team management, plus workshops covering web scraping and data cleaning/preprocessing.
  • Git Webinar from our industry lead
  • Intro to Python webinars (3)

Goals of the upcoming week:

  • Become more acquainted with data cleaning and be able to use the knowledge we’ve learned so far to initiate the project.

Tasks Done:

  • Web Scraping: Used Beautiful Soup and Selenium to gather data from various forums on Discourse (see the sketch after this list).
  • Set up accounts and communication avenues on Slack, Asana, Google Suite, and GitHub.
  • Data Cleaning/Preprocessing: Used nltk to lowercase sets of strings and perform tokenization and lemmatization on my collected data.
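A minimal sketch of the Beautiful Soup side of the scraping, assuming a static page; the URL and the "cooked" class (Discourse's post-body class) are placeholders for whatever the target forum actually uses:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; dynamic pages needed Selenium instead.
resp = requests.get("https://forum.example.com/t/some-topic")
soup = BeautifulSoup(resp.text, "html.parser")

# Collect the text of each post body on the page.
posts = [div.get_text(strip=True)
         for div in soup.find_all("div", class_="cooked")]

for post in posts[:3]:
    print(post)
```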

Issues and How I solved them:

  • For the most part, the webinars/workshops were great at demonstrating how to perform the required web scraping and data processing. Any other issues were mainly due to the slight learning curve of Python, which I wasn't entirely used to, but the intro webinars, as well as my own research, helped greatly.