McCarthy Nolan - Machine Learning Self Assessment

July 20 - August 2

On the technical side, I learned how to scrape a website, which I had never done before. I had to learn what matters when scraping: which fields to grab and which to skip. I didn’t use too many tools this week, but I started learning how to use Asana. I was also involved in researching StackExchange, and collaborated with a teammate on the report summarizing our findings.

Three achievement highlights

  • Met up for the first meetings, and got settled into the teams
  • Learned how to scrape a website, and prepared a report discussing how to do it for StackExchange
  • Started learning and using Asana

List of meetings/training attended including social team events

I attended all the meetings, twice a week.

Goals for the upcoming week

I’d like to successfully scrape the website I’m assigned.

Detailed statement of tasks done. State each task, hurdles faced if any and how you solved the hurdle.

I researched the viability of scraping StackExchange, and whether or not it would be worthwhile to train our model on the StackExchange data. This required relating parts of the Stem-Away forums to their counterparts on the StackExchange forums. There weren’t many hurdles to this, but one was the differences between the forums. I had to determine whether these differences could be overcome by rethinking the problem, were too minor to matter, or were severe enough to make StackExchange not worthwhile.

August 3 - August 9

This week I finished my program for scraping the Codecademy forums, and opted to use Selenium to do so. I had to learn how to use this library, and how to use some of the more in-depth features of Python. I also led our team in this endeavor, and helped some of the members with the technical details.

Three achievement highlights

  • I wrote a program which successfully scrapes the entire Codecademy forums (or any subsection)
  • I led a team to this goal, helping members when needed and relying on those who were farther along to finish the task
  • I cleaned the data which both teams scraped

List of meetings/training attended including social team events

I attended all the meetings, three times this week.

Goals for the upcoming week

I’d like to develop a model based on the cleaned data to recommend posts. I don’t expect it to be perfect, but I’d like a working prototype to build off of.

Detailed statement of tasks done. State each task, hurdles faced if any and how you solved the hurdle.

I wrote a program which successfully scrapes the entire Codecademy forums. It also takes command line arguments and can be called as a library, enabling a lot of end-user customization. I fully commented everything, from function definitions to the inner workings of the functions and classes, both for anyone who uses the program to scrape and for my teammates who need some help understanding how it works. The main hurdles in this were related to Selenium. I had to write code to scroll down an infinite-scrolling page, which took some research. I also had to deal with some posts on the forums which were abnormally formatted, which mostly just took time. In general, I got significantly better at using Selenium, which resolved most of my issues.
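As an illustration of the infinite-scrolling hurdle, here is a minimal sketch of one common approach: keep scrolling to the bottom until the page height stops growing. It is not the actual program. The function name `scroll_to_bottom`, the `pause` and `max_rounds` parameters, and the placeholder URL in the comments are assumptions made for this example; the only Selenium call it relies on is `driver.execute_script`.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll an infinite-scrolling page until no new content loads.

    `driver` is any object with an `execute_script` method, such as a
    Selenium WebDriver. Returns the number of scrolls that grew the page.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom, which triggers the next batch to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and render the new posts.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height did not change, so assume the end of the page was reached.
            break
        last_height = new_height
        rounds += 1
    return rounds


# Hypothetical usage with Selenium (URL is a placeholder, not the real target):
#
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       driver.get("https://example.com/forum")
#       scroll_to_bottom(driver, pause=2.0)
#       html = driver.page_source
#   finally:
#       driver.quit()
```

The `max_rounds` cap is a safety choice for very long forums, and a fixed `pause` is the simplest option; a slow connection may need a longer pause so that the height check does not stop early.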

I also finished cleaning the data. This was simpler, as it was just removing data points which weren’t useful, but the difficulty lay in dealing with a file that I didn’t create. It took time to figure out how to read in the file with its abnormalities, but from there I was just implementing a few rules to trim down the dataset.
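To give a sense of what this kind of rule-based trimming can look like, here is a small sketch using Python's standard `csv` module. The column names `title` and `body`, the minimum body length, and the specific rules (drop malformed rows, drop empty or very short posts, drop exact duplicates) are all hypothetical stand-ins, since the real file layout and rules are not described here.

```python
import csv


def clean_rows(lines, required=("title", "body"), min_body_chars=20):
    """Return cleaned rows from CSV text lines, skipping unusable ones.

    `lines` is any iterable of strings (an open file or a list of lines).
    The field names and thresholds are illustrative assumptions.
    """
    reader = csv.DictReader(lines)
    seen = set()
    kept = []
    for row in reader:
        # Extra columns are stored under the key None; missing columns
        # get the value None. Either case means the row is malformed.
        if None in row or any(row.get(field) is None for field in required):
            continue
        title = row["title"].strip()
        body = row["body"].strip()
        # Rule: drop posts with no title or a body too short to be useful.
        if not title or len(body) < min_body_chars:
            continue
        # Rule: drop exact duplicates, keeping the first occurrence.
        key = (title, body)
        if key in seen:
            continue
        seen.add(key)
        kept.append({"title": title, "body": body})
    return kept


# Hypothetical usage on a scraped file:
#
#   with open("posts.csv", newline="", encoding="utf-8") as f:
#       cleaned = clean_rows(f)
```

Keeping each rule as its own short check makes it easy to add or remove rules once the oddities of a particular file are known.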