Project Title:
AI Code Assistants
Subtitle:
A learning experience focused on machine learning, web scraping, and collaborative peer work.
Pathway:
Web-scraping + EDA, Machine Learning Models
Mentor Chains® Role:
Web-scraping + EDA, Machine Learning Models Participant
Goals for the Internship:
- To build a working understanding of machine learning and gain hands-on experience implementing ML models.
- To collaborate effectively within a team, working toward a shared goal and thereby expanding my reach and learning opportunities.
Achievement Highlights:
- Successfully scraped over 2,000 Reddit entries.
- Cleaned and processed the collected data to prepare it for sentiment analysis.
- Implemented text classification using TF-IDF vectorization and K-Means clustering to group the large dataset and assign tags.
Challenges Faced:
- Learning how to web scrape with the Reddit API.
- Repeatedly exceeding the Reddit API's rate limit, which took considerable trial and error to resolve.
- Getting started with text classification with no prior knowledge or experience.
Detailed Statement:
I started the project by scraping data from the Reddit API. Learning to web scrape other websites first and then transferring that knowledge to Reddit was a significant learning experience; there were many details specific to the Reddit API to become familiar with, but I learned far more as a result. This process improved my data acquisition skills and taught me how to handle large volumes of unstructured data efficiently. In the end, I scraped over 2,000 Reddit posts along with the top comments of each post.
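One of the challenges noted above was exceeding the Reddit API's rate limit. A minimal sketch of the kind of retry logic that resolves this, using a hypothetical `fetch` callable standing in for an actual API request, might look like:

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call `fetch()` and retry with exponential backoff whenever the
    API signals a rate limit (HTTP 429). `fetch` is a hypothetical
    callable returning (status_code, payload)."""
    for attempt in range(max_retries):
        status, payload = fetch()
        if status == 429:  # rate limited: wait, then retry with a longer delay
            time.sleep(base_delay * (2 ** attempt))
            continue
        return payload
    raise RuntimeError("rate limit not cleared after retries")

# Simulated responses: rate-limited twice, then a successful fetch.
responses = iter([(429, None), (429, None), (200, {"title": "post"})])
result = fetch_with_backoff(lambda: next(responses), base_delay=0.01)
# result == {"title": "post"}
```

Exponential backoff doubles the wait after each rejected call, which keeps a scraper under the limit without hard-coding the API's exact quota.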
I also became proficient in preprocessing textual data, including tokenization, stemming, and stop-word removal, and used these steps to prepare my Reddit data for clustering and analysis. I then ran sentiment analysis on the processed data using VADER.
Finally, a teammate and I worked together to complete text classification and tag clustering for the data. We preprocessed the data by filling NaN values with empty strings, created a TF-IDF vectorizer, performed K-Means clustering, and obtained cluster labels for our dataset's questions and answers. Python, particularly the Pandas, NumPy, and scikit-learn libraries, was central to all of this.
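The pipeline described here (fill NaNs, TF-IDF vectorize, K-Means cluster, attach labels) can be condensed into a short scikit-learn sketch; the five-row DataFrame is a toy stand-in for the roughly 2,000 scraped rows, and the column name and cluster count are illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-in for the scraped Reddit data (the real dataset had ~2,000 rows).
df = pd.DataFrame({"text": [
    "how do I train a neural network",
    "best way to train deep learning models",
    "my cat knocked over the plant",
    "cats and plants do not mix",
    None,  # scraped data contained missing values
]})

df["text"] = df["text"].fillna("")        # fill NaN values with empty strings
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["text"])  # sparse TF-IDF matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X)     # one cluster label (tag) per row
```

Fixing `random_state` makes the cluster assignments reproducible between runs; the resulting `cluster` column is what gets interpreted as a tag for each question or answer.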
Beyond technical skills, this project let me strengthen several soft skills. Communication was a major part of the project since it was a group effort; my team and I communicated well throughout the process and asked each other for help when needed. During the text clustering work in particular, my partner and I had to operate as a unit, and we worked efficiently with constant communication. Altogether, this project was a valuable learning experience that allowed me to apply a range of technical skills in NLP and data analysis, develop essential soft skills, and improve my abilities in web scraping and large-scale data acquisition.