Week 1 & 2
Overview of things learned:
Technical Area: I learned how to scrape data from Discourse forums with the help of BeautifulSoup and Selenium, and stored the scraped data using the Pandas and csv modules. I preprocessed the collected data by applying lowercasing, tokenization, and lemmatization with the help of the nltk module, and used the string and re modules for string manipulation. I also explored the TfidfVectorizer available in the sklearn module.
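To make the TfidfVectorizer exploration concrete, here is a minimal sketch of applying it to a few toy forum posts. The post strings are made up for illustration; they are not actual scraped data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for scraped forum posts (illustrative only)
posts = [
    "how to install the package on windows",
    "package install fails with a permission error",
    "best practices for data preprocessing",
]

vectorizer = TfidfVectorizer()           # default lowercasing and tokenization
tfidf = vectorizer.fit_transform(posts)  # sparse matrix: one row per post

print(tfidf.shape)                       # (rows = posts, columns = unique terms)
print(sorted(vectorizer.vocabulary_)[:5])
```

Each row of the resulting matrix is a TF-IDF weighted vector for one post, which is what downstream tasks like similarity search or clustering consume.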
Tools: During the last two weeks I explored how to use Jupyter notebooks and project management tools like Asana. I also got familiar with efficiently using the various sections on the StemAway website.
Soft skills: I learned how to follow up on what was discussed during meetings and devoted adequate time to practicing what I learned. I communicated effectively during team meetings and through other channels, and worked through the references shared by the Team and Project Leads.
Three Achievement Highlights:
- I was able to scrape/collect data from multiple Discourse forums, explored how different elements of a webpage can be accessed, and learned to use the browser's inspect-element feature effectively.
- I was able to store the collected data as a CSV file and to remove whitespace and special characters from it using the re and string modules.
- I was able to perform preprocessing and data cleaning on the collected dataset using lowercasing, lemmatization, and tokenization techniques.
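The whitespace and special-character removal mentioned above can be sketched with the re and string modules alone. This is a minimal stand-in, not the exact cleaning function used in the task:

```python
import re
import string

def clean_text(text: str) -> str:
    """Remove punctuation and collapse extra whitespace."""
    # Strip all punctuation characters using the string module's table
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Hello,   world!! This -- is    messy...  "))
# -> "Hello world This is messy"
```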
List of meetings I have joined so far:
I have attended all the general meetings conducted to date except the first introductory meeting on June 1, which I missed due to a timezone mix-up.
1. Make-Up Meeting
2. Web Scraping and Data Mining Technical Workshop
3. Git webinar by Industry Mentor
4. Web Scraping & Data Mining Workshop Pt 2
5. Data Preprocessing & Word Embeddings Basics Workshop
Goals of the upcoming week:
Become more comfortable with data preprocessing, experiment with the TF-IDF vectorizer, and contribute to the project effectively alongside fellow participants and observers.
Task Done:
1. Web Scraping: Successfully scraped data from the Discourse forums using BeautifulSoup and Selenium. The webinars gave me a high-level overview of how things work, and the learning references provided at the end of the workshop helped me complete the task.
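As a sketch of the parsing side of this task: Selenium is only needed to render the page (e.g. via webdriver.Chrome() and driver.page_source); once the HTML is in hand, BeautifulSoup does the extraction. The HTML snippet and class names below are made up for illustration and are not real Discourse markup:

```python
from bs4 import BeautifulSoup

# Stand-in for HTML obtained from requests or Selenium's driver.page_source
# (markup and class names here are illustrative, not real Discourse markup)
html = """
<div class="topic-list">
  <a class="title" href="/t/first-post">First post</a>
  <a class="title" href="/t/second-post">Second post</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = [a.get_text(strip=True) for a in soup.select("a.title")]
links = [a["href"] for a in soup.select("a.title")]

print(titles)  # ['First post', 'Second post']
print(links)   # ['/t/first-post', '/t/second-post']
```

The same select-then-extract pattern applies whether the HTML comes from a static request or from a Selenium-rendered page.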
2. Storage: Successfully stored the scraped data in a Pandas DataFrame and exported it in .csv format. I was able to do this quite easily, as I had worked with it previously.
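The DataFrame-to-CSV step can be sketched in a few lines. The rows and the filename are invented for illustration; an in-memory buffer is used here so the sketch is self-contained:

```python
import io
import pandas as pd

# Toy stand-ins for scraped rows (illustrative only)
rows = [
    {"title": "First post", "replies": 3},
    {"title": "Second post", "replies": 0},
]

df = pd.DataFrame(rows)

# In the actual task this would be df.to_csv("forum_posts.csv", index=False);
# writing to a StringIO buffer keeps the example filesystem-free
buf = io.StringIO()
df.to_csv(buf, index=False)

print(buf.getvalue())
```

Passing index=False keeps the pandas row index out of the CSV, which makes the file cleaner to reload later.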
3. Data Cleaning/Preprocessing: Successfully cleaned the collected data with various techniques using the nltk module. The webinar gave me an idea of how to implement the techniques, and a little additional research on my side helped me complete the task.
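The lowercase, tokenize, lemmatize flow can be sketched as below. The real pipeline used nltk (word_tokenize plus WordNetLemmatizer, which require nltk.download("punkt") and nltk.download("wordnet")); this stdlib-only stand-in with a toy suffix stripper is used here so the sketch runs without those downloads:

```python
import re

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, and crudely normalize word endings.

    A stdlib stand-in for the nltk pipeline (word_tokenize +
    WordNetLemmatizer); the suffix stripping below is a toy, while
    WordNetLemmatizer is dictionary-based and far more accurate.
    """
    text = text.lower()                       # lowercasing
    tokens = re.findall(r"[a-z]+", text)      # naive tokenization
    return [re.sub(r"(ing|ed|s)$", "", tok) for tok in tokens]

print(preprocess("The scraped forums contained many Repeated postings."))
```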
4. Utility Tools: Successfully set up my mentorchains.com account along with Slack, Asana, and GitHub. Initially, I ran into some issues with the setup, but @shreyas helped me resolve them.
Other Issues and How I solved them:
The webinars/workshops gave me good insight into how to approach web scraping and data preprocessing. Beyond that, I faced some minor problems in setting up my ChromeDriver path and accessing web page elements by XPath, but a little research helped me resolve them. I am still not fully confident in data cleaning and vectorization, but I am working on it.