What I learned includes web scraping large datasets from public Discourse forums to collect post text and its associated metadata, cleaning the data afterwards, and transforming it into meaningful vectors using BERT. Through this process, I learned how to use Selenium WebDriver and the Beautiful Soup library for scraping HTML. I gained knowledge of the Transformer neural network architecture and used the sentence_transformers Python library to create BERT-based sentence embeddings. I also familiarized myself with Google Colab, GitHub and VS Code. The soft skills I developed were communication and the importance of being flexible and adapting to the project requirements.
- I wrote a Python script that takes control of a browser instance, scrolls all the way to the bottom of a page, extracts all topic links in the history of a forum and saves the dataset to a CSV file.
- I wrote a Python function that preprocesses the dataset using NLTK components such as stopwords, WordNetLemmatizer and PorterStemmer.
- I implemented TF-IDF (term frequency-inverse document frequency), which uses word counts and the relative rarity of words across all documents to determine a similarity score. I also used the sentence_transformers library, which provides pretrained BERT models, to obtain different types of sentence embeddings, and used SciPy to find the embeddings most similar to each query.
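The TF-IDF scoring described in the last bullet can be sketched as follows. This is a minimal illustration and not the actual project script: the smoothed idf formula, the function names `build_tfidf` and `matching_scores`, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero on empty documents
        weights.append({
            # Assumed smoothed variant: idf = log((N + 1) / (df + 1)).
            term: (count / total) * math.log((n_docs + 1) / (df[term] + 1))
            for term, count in counts.items()
        })
    return weights


def matching_scores(query, weights):
    """Score each document by summing the weights of the query terms it contains."""
    return [sum(w.get(term, 0.0) for term in query) for w in weights]


# Hypothetical, already preprocessed forum posts.
docs = [
    ["atom", "crash", "startup"],
    ["install", "package", "atom"],
    ["theme", "color", "syntax"],
]
weights = build_tfidf(docs)
scores = matching_scores(["atom", "crash"], weights)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Rare terms such as "crash" get a higher weight than "atom", which appears in two of the three documents, so the first post ranks highest for this query.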
I attended every team meeting as well as the GitHub training session given by the industry leaders.
Goals for the upcoming week.
I will try other models on the collected data and explore other similarity metrics (so far I am scoring based on CosineSimilarityLoss).
Detailed statement of tasks done.
- Implemented web scraping of data from the Atom discussion forum.
Difficulties faced: I was initially confused about how to use Scrapy to extract forum contents. I did not know how to scroll a page and extract all the available information.
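The scrolling step can be sketched roughly as below. It is an illustration under assumptions, not the script that was actually written: the function name `scroll_to_bottom`, the pause length and the round limit are hypothetical, and the `driver` argument is assumed to be a Selenium WebDriver (only its `execute_script` method is used).

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=500):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the forum time to load the next batch of topics
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return max_rounds  # safety limit reached before the height stabilized


# Possible usage with Selenium and Beautiful Soup (the URL is a placeholder):
#
#   from selenium import webdriver
#   from bs4 import BeautifulSoup
#
#   driver = webdriver.Chrome()
#   driver.get("https://forum.example.com/latest")
#   scroll_to_bottom(driver)
#   soup = BeautifulSoup(driver.page_source, "html.parser")
#   links = [a["href"] for a in soup.find_all("a", href=True)]
#   driver.quit()
```

Comparing the page height before and after each scroll is one simple way to detect the end of an infinitely scrolling page. A fixed pause is a crude choice and may need tuning for slow connections.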
- Preprocessed the data with a function written in Python.
Difficulties faced: I initially had trouble working out how to remove characters that are not needed for learning concepts from the text.
Solved: With the help of the tutorial shared by the project lead, I was able to preprocess the data step by step: lowercasing characters, removing stop words, and stemming/lemmatizing words of the same type.
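The preprocessing sequence above could look roughly like this. It is a simplified sketch and not the actual function: the name `preprocess`, the tiny stop-word list and the `normalize` hook are assumptions, and the regular expression assumes English-only text.

```python
import re

# Tiny illustrative stop-word list; NLTK's stopwords.words("english") is much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "on", "in", "of", "to", "it"}


def preprocess(text, stop_words=STOP_WORDS, normalize=None):
    """Lowercase, strip non-letters, drop stop words, then normalize each token."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits, punctuation and symbols
    tokens = [tok for tok in text.split() if tok not in stop_words]
    if normalize is not None:
        # normalize is any str -> str callable, for example a stemmer.
        tokens = [normalize(tok) for tok in tokens]
    return tokens


print(preprocess("The editor is CRASHING on startup!!! #42"))
# ['editor', 'crashing', 'startup']

# With NLTK installed, a stemmer or lemmatizer can be plugged in:
#
#   from nltk.stem import PorterStemmer
#   preprocess(text, normalize=PorterStemmer().stem)
```

Passing the stemmer or lemmatizer in as a callable keeps the order of the steps explicit and makes it easy to compare PorterStemmer with WordNetLemmatizer on the same data.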
- Implemented a document retrieval script using TF-IDF matching scores.
- Used BERT, which captures a deeper sense of language context and flow than single-direction language models.
Difficulties faced: I initially had trouble understanding how the algorithm works.
Solved: Resolved with the shared materials and online help.
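Once sentences have been turned into embeddings, retrieval reduces to ranking vectors by similarity. The sketch below illustrates that idea in plain Python with toy vectors. It is not the project code: the function names are hypothetical, and in practice the vectors would come from a sentence_transformers model (`SentenceTransformer(...).encode(...)`) and the distances from SciPy (for example `scipy.spatial.distance.cdist` with the `"cosine"` metric).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, corpus_vecs, top_k=3):
    """Return (index, similarity) pairs for the top_k closest corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for real sentence embeddings.
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
result = most_similar([2.0, 0.0, 0.0], corpus, top_k=2)
print(result)
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0, 0.0]` matches the first corpus vector exactly even though the magnitudes differ.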
- I am still not sure how the word2vec model works and need to learn more.