Week 1 Friday Report Submission:
I love Shreya’s point about adding more information. I do require that we stick to our guideline and post something by tomorrow morning. However, this is not ‘homework’ or a ‘test’. We are doing a project together, so if you feel like you can supplement your work after your initial submission, you should.
My pick for a forum is the Bank of New Zealand’s forum. You may find the link to their website here:
I like the idea that we could find a business use case for this forum quite easily. I know that we will only be doing simple classification with this data set, but I feel like with more advanced models and more metadata collection there could be real insights we could bring to the bank. For example, we could scrape to find out the most common problems their clients are having.
Another positive with the website is that it looks easy to scrape. I know the JS for most Discourse forums is similar, but this one seems especially straightforward. I know we were thinking of using a web driver to retrieve the HTML and work from that. But it feels like if we send more specific HTTP requests, we may be able to get JSON responses that would help us crawl through the website better and give us only the useful information.
I did some work scraping this website, and I can see from inspecting the network activity that when I hit the bottom of the page, the next set of forum topics is loaded by an HTTP GET request that specifies a ‘page_count’ parameter. Therefore, we can send requests to this URL and iterate through the page_count parameter to crawl through the website.
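To make the idea concrete, here is a rough sketch of what iterating over page_count could look like. This is not the script on the DevB branch, just an illustration. The BASE_URL below is a made-up placeholder (the real endpoint is whatever shows up in the browser's network tab), the function names are my own, and I am assuming the response body is JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder only (assumption): swap in the real endpoint from the network tab.
BASE_URL = "https://forum.example.com/latest.json"


def build_page_url(base_url, page_count):
    """Return the URL for one page of topics, e.g. ...?page_count=3."""
    return f"{base_url}?{urlencode({'page_count': page_count})}"


def fetch_json(url):
    """Send a GET request and decode the body, assuming the server returns JSON."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl(base_url, first_page, last_page, fetch=fetch_json):
    """Collect the decoded response for every page from first_page to last_page."""
    pages = {}
    for page_count in range(first_page, last_page + 1):
        pages[page_count] = fetch(build_page_url(base_url, page_count))
    return pages
```

Calling crawl(BASE_URL, 1, 3) would request pages 1 through 3 and return a dictionary keyed by page number. The fetch argument is only there so we can swap in a fake fetcher while testing and avoid hammering the bank's site.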
I wrote a very simple script that demonstrates this and pushed it to the DevB branch on our GitHub repo. You should go check it out. Since I saw that some people are testing out GitHub and I don’t yet want to restrict this testing, I fear that someone might overwrite the script. To prepare for that, I’m also sending a Colab notebook with the script, which prints out the dictionary with all the forum submissions on a given page of the Bank’s forum. All you have to do is give a number when the script asks for it (the input handling isn’t fool-proof, since I’m just trying to test the concept, so simple numbers like 1, 3, 7 are good). The script will then pretty-print a Python dictionary with all the forum submissions on that page, alongside a bunch of metadata that we don’t need.
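If anyone wants to make the number prompt a bit sturdier later, something along these lines might work. Again, this is only a sketch and not what the notebook currently does. The function names are hypothetical, and I am assuming that page numbers are non-negative whole numbers.

```python
from pprint import pformat


def parse_page_number(text):
    """Turn user input such as ' 3 ' into an int, or raise ValueError."""
    number = int(text.strip())  # int() raises ValueError on input like 'abc'
    if number < 0:
        # Assumption: the forum has no negative page numbers.
        raise ValueError("page number must not be negative")
    return number


def format_page(page_data):
    """Return the page dictionary as a pretty-printed string."""
    return pformat(page_data)
```

In the notebook this would be used as parse_page_number(input("Page number: ")), followed by print(format_page(...)) on whatever dictionary the request returns.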
The link to the Colab notebook is here:
My pick for a team name is Machine Submarine.