Team 5 Progress Reports [Action Required for Team 5 Participants]

Hi ML Team 5,
As discussed in my post from yesterday, your report submissions are due tonight. Since the concept of ‘tonight’ differs for each one of us, I want to give you guys the maximum amount of time I can, which will be 9am EST on Thursday. Please submit your Wednesday report by that time. In case you missed any of the previous communications, @Ishan is going to send a direct message to all of you soon with this thread and the previous thread, which has the detailed information on what the report should entail. Once you’ve got the necessary information and you’re confirmed for email and Slack, please write your report on this thread as a response to this message.

Below is my post detailing the report and the tools you have to log on:

We are facing issues with Asana, and I recently sent you guys an email with a new link. If this problem persists, you’re not required to confirm your position on the Asana project in your report. However, I really hope that you will confirm your email and presence on the Slack channels in your report. We can fix all other problems and troubleshoot anything as long as we have a strong method of communication and you are all responsive to my messages.

Best,
Eren

  1. One area that I am not too sure about is the workflow for designing a machine learning algorithm. I don’t have a very clear picture of how machine learning algorithms are designed step by step, and I want to gain understanding in this area.

  2. I think my strengths are in Python and the computational side of things because I have a good understanding of coding. Another strength of mine is the mathematical portion as I have a good amount of experience in Linear Algebra.

  3. I began using the tools and have made accounts for them. My GitHub account name is: DhruvKumarSTEM

  1. My biggest concern with this project is using NLP and BERT, primarily because I have not worked with either before.

  2. My strength for this project is using Python, as I have intermediate experience in this language and I am interested in working on the coding side of this project (especially learning how to do web scraping/crawling).

  3. I have started using Git (my account name is shreya1chandra), Slack, and Google Calendar, as well as looked through the deliverables. Asana works for me now too!

  1. Apart from the very abstract overview of the project given in the first call, I am not clear about many specifics of the project. Good project documentation providing the details, final goals and evaluation metrics for the project would be helpful.

  2. I have about 2 years of deep learning research experience using PyTorch. I would be most efficient implementing NLP learning algorithms in PyTorch and working on improving the performance of the networks once trained.

  3. Accounts set up in Asana, Slack and GitHub (Username: badhri-stem)
    Gmail Id: Badhri_Narayanan_Sur@mentorchains.com

  1. The area I am a little concerned about is the data collection part of the project as I have little experience in web scraping.

  2. My prior background with different deep learning frameworks and overall experience in building machine learning and deep learning models will help me handle the classification and NLP (text-embedding) part of the project efficiently.

  3. I’ve set up accounts on Git (user: saiparsaSTEM), Asana, and Slack.

  1. My biggest concern for the project is that I have a theoretical understanding of web scraping and classification but lack experience in applying them. I hope I can adapt quickly and contribute on that portion.
  2. I have experience in Linear Algebra as well as Python and look forward to putting it into practical use in PyTorch.
  3. I am connected on Slack, Asana and GitHub. My GitHub username is alicia-weaver.

1) My concern is that I am not familiar with some of the tools and technologies here, but since the project has not started yet, I don’t have any specific project-related concerns at this time.

2) My strength is that I am a very hard-working person and a quick learner. I’ll learn everything quickly. I also have some experience in Python and a good background in the cybersecurity/IT Essentials field, using systems such as Linux, Ubuntu and Kali. I hope those come in handy!

3) I have started using all the tools and created my accounts on Slack, GitHub and Asana. My GitHub username is Karandeeptheboss.

  1. My main concern is that I haven’t done anything with machine learning or web scraping before, so I’m not really familiar with the concepts involved. I’ve also never had to do a coding project with a team of more than two people before.

  2. I have an excellent math background (non-stats unfortunately) and I’ve done research-level projects in both Python and C++.

  3. I am on Slack, Asana, and GitHub. My GitHub name is KylePacheco.


Hi guys! I started figuring out the web crawling with the help of Maleeha’s webinar (It was an excellent introduction). I get some errors in my code right now but I’m sure I’ll figure it out soon.

1- Right now I am constantly worried that I may miss an announcement, post, webinar, etc. Hopefully, this gets better as I get used to the StemAway platform.
2- I think my strengths are problem solving and Python. I have intermediate skills in Python, but since I love programming I’m always more than willing to work on the coding part.
3- I was able to log into my email, Asana, Slack, and GitHub. My username for GitHub is “shedaya1”.

I’m really tired of working on the code right now but I’ll continue working on it tomorrow and I’ll update you guys once I get some results :blush:

  1. My biggest concern with the project is that I have some theoretical grounding in web scraping but not much hands-on experience. I hope I can build up more experience in it.
  2. I think my strengths are Python and mathematics, and I also have a lot of applied experience with image processing algorithms. I am familiar with PyTorch and TensorFlow.
  3. I have started using all the tools and created accounts on Asana and GitHub. My GitHub username is Ricccccky.
  1. My biggest concerns with the project are the forum selection, using web scraping efficiently to generate data, and the distribution of tasks to the team.

  2. My strength is in Python and data manipulation. I have worked with classification models before, but this is the first time I will be using BERT.

  3. I have started using all the tools, including Asana, Slack and GitHub. My GitHub username is ishanstemaway.

You guys are amazing,
all but one of the participants have responded. This is what we need moving forward. It’s an impossible task to manage a project with such a big team with such different skill sets if we don’t communicate effectively. Delivering on time and communicating effectively is what we need. I will try and help bring everybody on board during this first week, but I hope that by the second week everyone knows everything that’s going on in the forum and the Slack channel.

My Wednesday Report:

  1. My main concern is the effective communication and presence of all participants. This will be a challenge working from different parts of the world with different skill sets. It’s going to be an amazing learning experience. I’m also not an NLP expert like the super talented industry mentors, so this will be a great opportunity to get advanced knowledge.

  2. I’m a learn-a-holic. I love learning new things and throwing myself into new problems. I’m best with using APIs, web scraping and data cleaning. I have experience from college and the industry, and I understand how to solve problems in the real world.

  3. I guess it’s easy to guess that I’m on all the platforms :grinning:. My GitHub account name is: ‘ErnclbMentor’ . Expect an email soon from me that adds you as a participant on our project repository.

I will add all of you to the GitHub repository very soon. Your branches won’t be set up though, so let’s wait until tomorrow’s webinar so that everyone knows how to use Git before we start test-committing things. I highly recommend that everyone attend the webinar tomorrow. It was initially going to be on Wednesday but now it’s on Friday. We will have our scheduled Google Meet at 11am EST tomorrow, and we can attend the webinar as a group afterwards (there will definitely be a break, we won’t take 3 hours for the Google Meet).

Expect a message from me soon that explains the Friday report and the Friday meeting. I will continue on this thread until I can arrange a more organized forum usage for our ‘reports’ with management. So the Friday report will also be due on this thread by Saturday morning. Wait until Friday morning’s meeting to write it though, as it might be good to discuss in person before we write.

Best,
Eren

Wednesday Submission

  1. My biggest concern with this project is the complexity with BERT.
  2. I have intermediate experience with Python and machine learning and no experience with PyTorch. I have worked on NLP projects at an intermediate level. My strengths lie in technical understanding and business sense.
  3. I have started using Google Calendar, Gmail and Slack. GitHub account name: bhavish729

Having problems setting up Asana

Before I lay out the details of this week’s Friday report, I want to remind you of the video conference we will have at 11am EST tomorrow. All the due dates for the reports and the links for the Friday meetings are on the StemAway Calendar that I shared with you. Since this is our first week, I’ll be sending the link to make sure everyone can attend. Moving forward, I expect the participants to already know that there is a report submission on Wednesday and Friday and a video conference on Friday mornings, whether I send a reminder or not.

The link to tomorrow’s video conference is below:

For Friday’s report, which is due Saturday 9am EST, I want to hear about two main things:

  1. Pick a forum that you would like to scrape. Give us a link and describe it a bit. If you have the skills to do so, try to scrape it and tell us what kind of HTTP requests, headers and POST methods we can use to iterate through it (or whether it would be easier to iterate with a web driver like Selenium). I expect everyone who listed computer science skills as a strength and has some experience to provide some technical information.

  2. Browse through the submissions to our welcome message thread and pick your favorite submission for a team name. (You can’t pick your own submission). I’ll provide the link to that thread below for your convenience.
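
For anyone unsure what "HTTP requests and headers" look like in practice for item 1, here is a minimal sketch using only Python's standard library. The URL, the `page` query parameter and the user-agent string are all hypothetical placeholders, not values from any real forum; the request is built but never sent, so it is safe to run.

```python
import urllib.parse
import urllib.request

# Hypothetical forum URL, used only for illustration.
BASE_URL = "https://forum.example.com/latest"


def build_get_request(base_url, params, user_agent="team5-scraper-test"):
    """Build (but do not send) a GET request with query parameters and headers."""
    query = urllib.parse.urlencode(params)
    url = f"{base_url}?{query}" if query else base_url
    headers = {
        "User-Agent": user_agent,      # identifies the client to the server
        "Accept": "application/json",  # asks for JSON instead of HTML
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# 'page' is an assumed parameter name; check the real one in the browser's network tab.
req = build_get_request(BASE_URL, {"page": 1})
print(req.get_method(), req.full_url)
print(req.header_items())
```

To actually send the request you would pass `req` to `urllib.request.urlopen`, but only against a real forum URL and within that site's terms of use.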

See you lads tomorrow,
Eren

Hey guys, as mentioned in today’s meeting, the forum you choose must be a Discourse forum.
Here is a link of some discourse forum examples:

Best
Ishan

Hello everyone, here is the link to the webinar that I mentioned today. In this webinar Maleeha did an excellent job introducing web scraping and she went through some examples step-by-step.

  1. The forum I would like to scrape is Codecademy (https://discuss.codecademy.com/). I haven’t web scraped before, but I’ll tinker around with this site over the weekend and add more information to this post.
    Update: I tried web scraping the site, but I keep running into errors in my code. I will need to keep researching how to scrape successfully. Looking at the site layout itself, the Codecademy forum appears to be simple to follow, with the three main sections being categories, topics, and latest. We could scrape how many new topics are posted each week (such as in the “Get Help” category), which could be used by the company to determine how many employees should moderate the forums each week, as well as how many topics each employee should take on. Also, we could scrape the dates of the topics, as that could be used to determine whether an outdated topic should be archived or brought to the front of the forum (if multiple people continuously experience the same issue).

  2. My favorite team name is “Data Elephant.”
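
The weekly topic count idea from item 1 could be sketched roughly as below. This assumes each scraped topic is a dictionary with a `created_at` timestamp in ISO format without a trailing `Z`; that field name and the sample records are made up for illustration and are not real forum data.

```python
from collections import Counter
from datetime import datetime


def topics_per_week(topics):
    """Count topics per ISO (year, week), based on each topic's creation time."""
    counts = Counter()
    for topic in topics:
        # 'created_at' is an assumed field name.
        created = datetime.fromisoformat(topic["created_at"])
        iso = created.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


# Made-up sample records, not real forum data.
sample = [
    {"title": "Loop help", "created_at": "2020-06-01T10:00:00"},
    {"title": "CSS question", "created_at": "2020-06-03T15:30:00"},
    {"title": "Git error", "created_at": "2020-06-08T09:00:00"},
]
print(topics_per_week(sample))
```

The same creation dates could also be compared against today's date to flag outdated topics.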

6/5/20 Friday Report Submission
  1. Website: https://www.freecodecamp.org/forum/
    Tool: Scrapy or Selectorlib
    Tutorials: https://blog.datahut.co/scraping-amazon-reviews-python-scrapy/
    https://www.scrapehero.com/how-to-scrape-amazon-product-reviews/

    To enrich our dataset based on topics or hashtags, we can scrape Facebook, Twitter and YouTube just by using API keys (https://github.com/ScriptSmith/socialreaper).

    I will explore more on web scraping Discourse forums.

  2. My favorite team name: Team Enigma

Week 1 Friday Report Submission:
I love Shreya’s point about adding more information. I do require that we stick to our guideline and post something by tomorrow morning. However this is not ‘homework’ or a ‘test’. We are doing a project together, so if you feel like you can supplement your work after your initial submission, you should.

My pick for a forum is the Bank of New Zealand’s forum. You may find the link to their website here:

I like the idea that we could find a business use case for this forum quite easily. I know that we will only be doing simple classification with this data set. But I feel like with more advanced models and more meta-data collection there could be real insights we could bring to the bank. For example we could scrape to find out the most common problems their clients are having.

Another positive with the website is that it looks easy to scrape. I know the JavaScript for most Discourse forums is similar, but this one seems especially straightforward. I know we were thinking of using a web driver to retrieve the HTML data and work that way. But it feels like if we make more specific HTTP requests, we may be able to get JSON responses that could help us crawl through the website better and provide us with only the useful information.

I did some work scraping this website, and I can see from inspecting the network activity that when I hit the bottom of the page, the next set of forum topics is loaded by an HTTP GET request that specifies a ‘page_count’ parameter. Therefore we can send specific requests to this URL and iterate through the page_count parameter to crawl through the website.
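
A minimal sketch of that crawling idea, assuming a JSON endpoint that accepts `page_count` and returns its topics under a `topics` key. The endpoint URL, the `topics` key, the starting page number and the stop-on-empty-page rule are all assumptions for illustration; this is not the script mentioned in this post.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; the real URL comes from the browser's network tab.
FORUM_URL = "https://forum.example.com/latest.json"


def page_url(base_url, page_count):
    """Return the URL for one page, using the 'page_count' query parameter."""
    query = urllib.parse.urlencode({"page_count": page_count})
    return f"{base_url}?{query}"


def fetch_json(url):
    """Send an HTTP GET request and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl(base_url, fetch=fetch_json, max_pages=50):
    """Collect topics page by page until a page comes back empty."""
    topics = []
    for page_count in range(max_pages):  # starting at 0 is an assumption
        data = fetch(page_url(base_url, page_count))
        page_topics = data.get("topics", [])  # 'topics' key is an assumption
        if not page_topics:
            break
        topics.extend(page_topics)
    return topics
```

Passing the fetch function in as an argument means the loop can be tried with a fake fetcher before it is pointed at a live site, and `max_pages` keeps a mistake from hammering the server.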

I wrote a very simple script that demonstrates this and pushed it to the DevB branch on our GitHub repo. You should go check it out. Since I saw that some people are testing out GitHub and I don’t yet want to restrict this testing, I fear that someone might overwrite the script. To prepare for that, I’m also sharing a Colab notebook with my simple script, which prints out the dictionary with all the forum submissions on a given page of the Bank’s forum. All you have to do is give a number when my script asks for it (this isn’t a real fool-proof method, I’m just trying to test this concept, so simple numbers like 1, 3, 7 are good). The script will then pretty-print a Python dictionary with all the forum submissions on that page, alongside a bunch of metadata that we don’t need.

The link to the Colab notebook is here:
https://colab.research.google.com/drive/16CLjQM1uN9CyA503KiFKb05kMXGqtTpx?usp=sharing
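
As a rough illustration of trimming away the metadata we don't need before pretty-printing (this is a sketch, not the notebook's actual code), something like the following could work. The field names and the sample page are invented for the example.

```python
from pprint import pprint

# Field names are assumptions for illustration.
KEEP_FIELDS = ("id", "title", "created_at")


def slim_topics(page_data, keep=KEEP_FIELDS):
    """Keep only the chosen fields from each topic in one page of results."""
    return [
        {key: topic[key] for key in keep if key in topic}
        for topic in page_data.get("topics", [])
    ]


# Made-up sample page, not real forum data.
sample_page = {
    "topics": [
        {
            "id": 1,
            "title": "Card declined",
            "created_at": "2020-06-01",
            "views": 12,
            "pinned": False,
        },
    ],
    "more_topics_url": "/latest?page_count=2",
}
pprint(slim_topics(sample_page))
```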

My pick for a team name is Machine Submarine.

Best,
Eren

6/5/20 Friday Report:

  1. My pick for a Discourse forum is Gearbox Software’s forum: https://forums.gearboxsoftware.com/categories
    I chose this forum mainly because I play a lot of their games and it would be interesting for me. Unfortunately, I don’t really know anything about web scraping to get into the technical details. If I can, I’ll try to learn more about it and update this post.

  2. I like the team name Data Elephant.