Project Title:
AI Code Assistants
Subtitle:
An experience that taught me new things about coding and about working with other people.
Pathway:
Web Scraping/EDA
Mentor Chains® Role:
Mentor Chains® Participant
Goals for the Internship:
- Learn more Python
- Learn more about machine learning and artificial intelligence
- Work with other people
Achievement Highlights:
- Successfully scraped transcripts from more than 50 YouTube videos
- Performed VADER sentiment analysis on the transcripts
- Learned far more Python than I knew coming in
Challenges Faced:
- I started with very little Python knowledge
- The other types of EDA were hard to implement
Detailed Statement:
Project Report - EDA/Web Scraping
I completed my main tasks by scraping more than 50 YouTube videos and performing VADER sentiment analysis on most of them. However, due to school I wasn’t able to try the other types of unsupervised sentiment analysis, such as word2vec, or complete the sentiment analysis on the remaining videos.
Phase 1:
In phase one we started by scraping data from our respective websites. I scraped YouTube transcripts from videos related to AI code assistants.
There are many ways to scrape YouTube, but I mainly relied on Dilshaan’s web scraping videos to get an idea of how I should scrape the transcripts.
- Importing the csv module - I needed this to organize my transcripts.
- Importing youtube_transcript_api - This library provides several helpers for fetching transcripts. I was originally going to use BeautifulSoup, which worked, but this library was much easier to use.
- Building a list of video IDs - I put all of the YouTube video IDs into a list; youtube-transcript-api takes a video ID and fetches the transcript from there.
- Defining my functions - I defined several functions, each with its own purpose, and composed them for convenience. split_into_sentences splits the transcript into sentences at every new line. process_transcript splits the URL so that only the video ID is passed along, since the library can’t read a full YouTube URL. The last function scrapes the data and stores it in a list, which keeps the data organized; inside this scraped_data function I also print the video ID and transcript so I can easily copy and paste them into the Google Sheet where all of my data is stored. (A sketch of this pipeline appears after this list.)
- Scraping views - I also scraped view counts using an HTMLSession, which captured the views as of that moment, and I managed to do this for most of the videos. (I wasn’t able to finish the VADER and views work because school had started by then.) A sketch of this step also follows below.
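To make the pipeline above concrete, here is a minimal sketch of how the pieces fit together. The function names mirror the ones described above, but their bodies are my reconstruction; it assumes the classic get_transcript() interface of youtube-transcript-api (newer releases use a different API), and the video URLs are placeholders.

import csv

from youtube_transcript_api import YouTubeTranscriptApi

# Placeholder URLs; swap in the real videos about AI code assistants.
video_urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
]

def process_transcript(url):
    # The API needs the bare video ID, not the full URL,
    # so keep only what follows "v=".
    return url.split("v=")[-1]

def split_into_sentences(entries):
    # Each transcript entry is a dict with a "text" snippet; join the
    # snippets with newlines, then treat every line as one "sentence".
    text = "\n".join(entry["text"] for entry in entries)
    return [line.strip() for line in text.splitlines() if line.strip()]

def scraped_data(urls):
    # Fetch each transcript, print it for easy copy-paste into the
    # Google Sheet, and collect (video_id, sentence) rows for the CSV.
    rows = []
    for url in urls:
        video_id = process_transcript(url)
        entries = YouTubeTranscriptApi.get_transcript(video_id)
        sentences = split_into_sentences(entries)
        print(video_id, sentences)
        rows.extend((video_id, s) for s in sentences)
    return rows

with open("transcripts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_id", "sentence"])
    writer.writerows(scraped_data(video_urls))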
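The view-count step could look something like the sketch below. It uses requests-html’s HTMLSession as described above, but the regex over YouTube’s embedded page JSON is purely my assumption; YouTube’s markup changes often, so treat this as illustrative rather than a stable recipe.

import re

from requests_html import HTMLSession

def get_view_count(video_id):
    # Fetch the watch page and look for the view count inside the
    # embedded ytInitialData JSON. The pattern below is a guess at the
    # current page structure and may break when YouTube changes it.
    session = HTMLSession()
    response = session.get(f"https://www.youtube.com/watch?v={video_id}")
    match = re.search(r'"viewCount":\{"simpleText":"([\d,]+) views"', response.text)
    return match.group(1) if match else None

print(get_view_count("VIDEO_ID_1"))  # e.g. "1,234,567"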
My next steps are to complete the views and VADER analysis for the project. I also hope to implement different kinds of unsupervised sentiment analysis, so I can figure out which one yields better results and compare the data that way.
Phase 2:
In this phase I completed most of my transcripts and data, and I moved on to the unsupervised sentiment analysis.
- Planning out which unsupervised sentiment analysis to use
Figuring out which one to use was difficult, since each method works differently and yields different results. I eventually settled on VADER after watching YouTube videos and reading a couple of websites: it was easier to use, and it analyzes the words it is given and produces a positive, neutral, and negative score for each passage or set of words.
I wasn’t sure how to code it, but after a few YouTube videos it all made more sense. I had to download a few NLTK resources, namely vader_lexicon, punkt, and stopwords. The next step was implementing this within my code: I ran every transcript through preprocessing to filter down to the useful words, then performed the analysis on the result. This worked, and I got scores for most of the transcripts. (A sketch of this step follows below.)
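Here is a minimal sketch of that step, assuming NLTK’s standard SentimentIntensityAnalyzer. The preprocessing mirrors the description above (tokenize, drop stopwords), but the exact details, and the sample transcript text, are illustrative.

import nltk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

# One-time downloads of the resources mentioned above.
nltk.download("vader_lexicon")
nltk.download("punkt")
nltk.download("stopwords")

def preprocess(text):
    # Lowercase, tokenize, and keep alphabetic tokens
    # that are not stopwords.
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return " ".join(t for t in tokens if t.isalpha() and t not in stop_words)

analyzer = SentimentIntensityAnalyzer()
transcript = "This AI code assistant saved me hours, though it sometimes fails."
scores = analyzer.polarity_scores(preprocess(transcript))
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

One caveat worth knowing: VADER also uses punctuation and capitalization as signals, so stripping them during preprocessing can mute its scores; running polarity_scores on the raw text is a reasonable variant to compare against.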
As school was starting, I didn’t have time to complete the rest of the transcripts, but I hope to finish them soon when I have the time.
Conclusion:
This internship, which ends September 30th, has taught me many skills, both technical and soft. The technical skills I learned included scraping data with tools that were unfamiliar to me at first, and my Python improved well beyond where it was when I joined. The soft skills I learned were collaboration and communication. Collaboration mattered because I was working with people I had never met and needed to work with them smoothly. Communication mattered too: we asked each other questions when we needed help and checked in on each other’s progress to make sure we were collecting the data and staying on task. In conclusion, this internship strengthened my Python, which wasn’t as strong before, and taught me to collaborate with people I didn’t know.