Overview of Things Learned:
Technical Area: I learned to scrape data from a Discourse forum using BeautifulSoup and Selenium. I used pandas and the csv module to store the scraped data, then preprocessed it with the string and re modules for string manipulation. I also learned that the string module already provides punctuation and whitespace as predefined string constants.
Tools: I used Jupyter notebooks to write my code. I am also using the STEM-Away forum and Slack channel for regular updates from the mentors.
Soft Skills: Soft skills that came into play this week include communicating effectively with the project leads and technical leads, attending meetings regularly, and submitting work on time by the given deadlines.
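To illustrate the predefined constants mentioned under Technical Area, here is a minimal sketch of how string.punctuation and string.whitespace can be combined with re to clean text. The function name clean_text and the choice to replace punctuation with spaces (rather than delete it) are my assumptions for this example, not necessarily what the actual notebook did.

```python
import re
import string

# The string module ships these as ready-made constants.
print(repr(string.punctuation))  # every ASCII punctuation character
print(repr(string.whitespace))   # space, tab, newline and other whitespace characters

# Translation table mapping each punctuation character to a space,
# so that "Hello,world" does not collapse into "Helloworld".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Replace punctuation with spaces, then collapse whitespace runs."""
    without_punct = text.translate(_PUNCT_TO_SPACE)
    # \s+ matches any run of whitespace (spaces, tabs, newlines).
    return re.sub(r"\s+", " ", without_punct).strip()


print(clean_text("Hello,\n\tworld!!  "))
```

One side effect of this choice is that contractions are split at the apostrophe; deleting punctuation instead would avoid that but can glue neighbouring words together.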
- I had previously used only BeautifulSoup4 to scrape data. While scraping the Codecademy Discourse forum I ran into an issue that BeautifulSoup4 alone could not handle, and Selenium solved it, so I learned a new module for web scraping.
- I completed the pre-processing task, removing the unwanted characters that come with data scraped from web pages.
- I already had some experience working with dataframes in pandas, but this week’s scraping and pre-processing tasks helped me revise some concepts and gave me better exposure to using dataframes to store real-world data.
- Kick-off meeting for Team 4 (7/20)
- Web Scraping check-in (7/29)
- Web-scraping presentations and preprocessing (7/31)
Goals for the Upcoming Week
We were introduced to TF-IDF and count vectorizers towards the end of Week 1, so I hope to learn more about them this week. I also plan to help fellow team members with any issues they may still be facing in the Week 1 tasks.
- Web Scraping: Used BeautifulSoup4 and Selenium to scrape 3 categories (Help, Community and Projects) of the Codecademy Discourse forum and saved each category's data in a separate CSV file. Initially bs4 alone could not load the complete page, which I solved by using Selenium.
- Pre-processing: Cleaned the data scraped from the forum threads to remove unwanted punctuation and whitespace using the string and pandas modules.
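The two tasks above can be sketched roughly as follows: clean each scraped post, group the posts by category, and write one CSV per category. This is only an illustration using the standard-library csv module. The sample posts, the column names title and body, and the helper names are made up for the example, and the sketch writes to in-memory buffers rather than real files.

```python
import csv
import io
import re
import string
from collections import defaultdict

FIELDS = ["title", "body"]  # hypothetical column names, not from the real dataset

_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(text):
    """Replace punctuation with spaces and collapse whitespace runs."""
    return re.sub(r"\s+", " ", text.translate(_PUNCT_TO_SPACE)).strip()


def group_by_category(posts):
    """Return {category: [cleaned rows]} for a list of scraped post dicts."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["category"]].append({f: clean(post[f]) for f in FIELDS})
    return dict(grouped)


def write_csv(rows, handle):
    """Write cleaned rows with a header line to any file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample standing in for the scraped forum posts.
posts = [
    {"category": "Help", "title": "Loop error?!", "body": "My loop\n\tnever ends..."},
    {"category": "Projects", "title": "Portfolio: feedback", "body": "Please   review!"},
]

buffers = {}
for category, rows in group_by_category(posts).items():
    # For real files, replace the buffer with
    # open(f"{category}.csv", "w", newline="", encoding="utf-8").
    buffer = io.StringIO()
    write_csv(rows, buffer)
    buffers[category] = buffer.getvalue()

print(buffers["Help"])
```

With pandas, the same cleaning step could be applied to a whole dataframe column at once, which is closer to how the data was actually stored.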