Classifiers and Recommenders

Building an NLP Pipeline: Text Classification and Beyond


The principal aim of this project is to construct a Natural Language Processing (NLP) pipeline with a primary focus on text classification. Students will learn how to convert textual data into numerical representations (using two distinct methods: OpenAI’s embedding APIs and Python libraries), and develop models that perform accurate classifications based on these representations. As a secondary task, the project also includes building a basic recommendation system.

Learning Outcomes

By the completion of this project, you will:

  • Comprehend the fundamentals and practical applications of an NLP pipeline, with an emphasis on text classification.
  • Understand the concept and application of embeddings, including popular types like word2vec, GloVe, and BERT.
  • Gain experience using OpenAI’s Embedding APIs.
  • Acquire hands-on skills building text classification models and recommendation systems.
  • Become proficient in using Accuracy, Precision, Recall, and F1-Score for model evaluation.
  • Build a competitive model for a STEM-Away competition using your newly acquired knowledge and skills.

Steps and Tasks

1. Conceptual Understanding

This curated list of resources has been designed to guide you through the journey of building a text classification system. Starting from the very basics of Machine Learning and NLP, you’ll progressively move to more specialized topics like text embeddings, transformer models, and finally, practical applications with PyTorch.

These resources are ordered to ensure a logical progression and comprehensive understanding. Take your time to fully comprehend each topic before moving to the next. Remember, the goal is to build a solid foundation and then apply this knowledge to real-world problems. Let’s get started!

Take your first steps with:

Get familiar with:

Start mastering:

:bulb: You are now ready to begin coding! If you require assistance in setting up the coding environment, please refer to the Appendix. The Appendix provides detailed guidance on establishing your development environment and configuring OpenAI access.

2. Embedding the Data

In this step, you’ll use Python and OpenAI’s APIs to convert textual data into meaningful numerical vectors, a process known as embedding. The data comes from the Web Scraping starter project, which scraped job postings from Startup Jobs. If you did not participate in that project, you can request a cleaned version of the dataset.

There are two routes to this:

Route 1: Using OpenAI’s Embedding APIs

OpenAI provides a simple way to generate embeddings via their API. A basic example is shared below. To better understand the API usage, please check the official guide for detailed instructions and examples for a range of applications, including recommendation and classification systems.

Click to view basic example of how to use OpenAI API to embed your data
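from openai import OpenAI

# Written for openai>=1.0. The model name is an example; check the official guide for current options.
client = OpenAI(api_key='your-api-key')

def embed_text(text):
    # Request one embedding and return it as a list of floats
    response = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return response.data[0].embedding

# Assume 'data' is a list of your texts
embeddings = [embed_text(text) for text in data]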

Route 2: Embeddings using Python Libraries

For the Python route, several models and libraries are available, such as Word2Vec (available in Gensim), GloVe, and BERT (available through Hugging Face Transformers). You can refer to the following resources for detailed guides:

Click to view a simple code snippet demonstrating how to use BERT for a text classification task

Here’s a simple code snippet demonstrating how to use BERT for a text classification task. This example covers tokenizing sentences, preparing the labels, and performing a basic training step. Remember, this is a very basic example, and real-world applications would require additional steps, like data splitting, multi-epoch training, model performance evaluation, and model saving.

# First, we install the necessary libraries
# !pip install transformers
# !pip install torch

# Import the required libraries
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the pre-trained BERT model and the BERT tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Assume we have some training data in two lists: sentences (the text data) and labels (the associated labels for each text)
sentences = ['This is the first sentence.', 'This is another sentence.']
labels = [0, 1]  # Suppose it's a binary classification task

# Tokenize the sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

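# Specify the labels: one label per sentence, shape (batch_size,)
inputs['labels'] = torch.tensor(labels)

# Create the optimizer before running the training step
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Now, we're ready to train the model. This is a simple example, so we'll just do one pass over the data
model.train()
outputs = model(**inputs)
loss = outputs.loss

# Backpropagate the loss and update the model weights
optim.zero_grad()
loss.backward()
optim.step()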

# Remember, this is a very simplified example. A real-world application would involve more steps, 
# including splitting your data into a training and validation set, running the training process for more than one epoch, 
# periodically evaluating performance on the validation set, and saving the best model

These embeddings can be used locally on your machine without needing to make requests to an external API. Do note that using these libraries may require a good understanding of NLP and may also require significant computational resources, especially for large datasets.

By the end of this step, you will have transformed your text data into a form that can be input into various machine learning algorithms, setting the stage for building your recommendation and classification systems.
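Whichever route you take, the classifiers and similarity functions used later expect the embeddings as a single 2D array with one row per text. The following is a minimal sketch of that conversion; the three short vectors are made-up stand-ins for real embeddings, and the normalization step is optional.

```python
import numpy as np

# Hypothetical stand-in for the output of either route: one vector per text.
embeddings = [
    [0.1, 0.3, -0.2],
    [0.0, 0.5, 0.4],
    [-0.7, 0.2, 0.1],
]

# scikit-learn expects a 2D array of shape (n_samples, n_features).
X = np.asarray(embeddings, dtype=np.float32)
print(X.shape)

# Optional: scale every row to unit length so that dot products
# between rows equal their cosine similarities.
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / norms
```

Real embeddings have hundreds or thousands of dimensions, but the shape convention is the same.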

:bulb: Choosing between Python-based libraries and an API like OpenAI’s depends on your specific use case, and each approach provides a unique learning experience.

Learn more about the 2 approaches

Python-based libraries such as Word2Vec, GloVe, and BERT are excellent tools for gaining a deeper understanding of how embeddings work under the hood. Implementing these models from scratch offers invaluable experience and understanding in natural language processing (NLP) and machine learning. You’ll also get to handle challenges in computational resource management and efficiency, as these models can be resource-intensive. This approach can be advantageous if you’re looking for a lot of customization, as these libraries are highly flexible and adaptable.

On the other hand, OpenAI’s API provides a more user-friendly and less computationally intense approach. It’s a great option if you’re new to the field, as it simplifies the process and allows you to quickly move onto the stages of building recommendation and classification systems. The API’s encapsulated nature means that you can access powerful embedding models without needing to understand their inner workings in depth or manage computational resources.

Both paths provide the essential skill of transforming raw text data into numerical form that can be processed by algorithms. Using both approaches in this project allows you to compare their results and gain a balanced and well-rounded understanding of the domain. Having experience with both methods prepares you for a variety of scenarios in your future data science projects, as you can choose the best tool for your specific needs. The hybrid approach also makes you versatile and adaptable in the fast-evolving field of machine learning and data science.

3. Building a Text Classification System

Once your data is embedded, you can start building a text classification system. This system will categorize the data based on certain criteria (for instance, you could classify job postings by industry or role based on the job description).

  • Preprocessing the Data

You’ll need to preprocess your data to ensure your classification labels are well-defined and uniformly formatted. For example, if you’re classifying job postings by industry, make sure the industry labels are consistent.

  • Choosing and Implementing the Models

    For your first classification models, you’ll be using Support Vector Machines (SVM) and Logistic Regression. Both of these are common models used in text classification, and they work well with text data represented by embeddings.

  • Training the Models

    After implementing the models, you’ll need to train them using your preprocessed data. This involves splitting your data into a training set and a testing set, then fitting your models on the training set.

    Here is a code snippet on how to train a Logistic Regression model:
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    # Assume 'embeddings' is a 2D numpy array containing your embeddings and 'labels' is a 1D numpy array containing your labels (0 and 1)
    X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2, random_state=42)
    # Create a Logistic Regression classifier and fit it on the training data
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X_train, y_train)
    # Evaluate the classifier on the test data
    print(f"Accuracy: {classifier.score(X_test, y_test)}")
  • Evaluating the Models

    Once your models are trained, you need to evaluate them to see how well they’re performing. Use metrics such as accuracy, precision, recall, and F1 score to evaluate your models. You can refer to this guide to understand these metrics better.

  • Exploring Further

    After working with SVM and Logistic Regression, do some research to implement a third classification model of your choice. Remember not to use pre-trained models for this task. This step will help you explore different options and understand the strengths and weaknesses of various models.
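The preprocessing step in the list above asks for consistent, uniformly formatted labels. Here is a minimal sketch of one way to get there, assuming hypothetical industry labels that differ only in casing and whitespace. Real data may need more work, such as merging synonyms by hand.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels with inconsistent casing and whitespace.
raw_labels = ["Engineering", " engineering", "MARKETING", "Marketing ", "design"]

# Normalize formatting so that variants of the same label collapse together.
clean_labels = [label.strip().lower() for label in raw_labels]

# Map each distinct label to an integer class id (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(clean_labels)

print(list(encoder.classes_))  # ['design', 'engineering', 'marketing']
print(y.tolist())              # [1, 1, 2, 2, 0]
```

`encoder.inverse_transform` turns predicted ids back into readable label names when you report results.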

By following these steps, you’ll gain a solid understanding of how to create a text classification system, from data preprocessing to model evaluation. It will also give you a good foundation to explore more complex models and techniques in the future.
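To tie the training and evaluation steps together, here is a small sketch that fits an SVM and reports accuracy, precision, recall, and F1. The data is synthetic (two well-separated clusters standing in for real embeddings), and the linear kernel and macro averaging are illustrative choices, not requirements of the project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for real embeddings: two clusters of 8-dimensional vectors.
rng = np.random.default_rng(42)
embeddings = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(50, 8)),
    rng.normal(loc=2.0, scale=0.5, size=(50, 8)),
])
labels = np.array([0] * 50 + [1] * 50)

# Hold out 20% of the data, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit a linear SVM and predict on the held-out set.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

# Macro averaging computes each metric per class, then takes the unweighted mean.
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  "
      f"Recall: {recall:.3f}  F1: {f1:.3f}")
```

With real job data the classes will overlap far more than these clusters do, so expect lower scores and compare several models on the same split.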

4. Building a Text Recommendation System

You will now build a recommendation system using the same embeddings as before. This system should be able to recommend jobs based on their description.

Here are some of the most beginner-friendly techniques for building a text recommendation system, along with brief explanations and resources for further learning. Remember that it’s important to understand your specific problem and dataset to choose the best method.

  • Cosine Similarity

    It measures the cosine of the angle between two vectors. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (for example, because one document is much longer), they can still have a small angle between them.
    Cosine Similarity - Understanding the math and how it works (with python codes)

  • Euclidean Distance

    It’s the “ordinary” straight-line distance between two points in Euclidean space. With text data, documents are converted into vectors, and the Euclidean distance between them indicates how similar those documents are: the smaller the distance, the more similar the documents. You can also use Manhattan Distance.
    Euclidean Distance in Python
    Manhattan Distance in Python

  • Jaccard Similarity

    It’s used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets.
    Understanding the Jaccard Similarity Coefficient

  • TF-IDF (Term Frequency-Inverse Document Frequency)

    This reflects how important a word is to a document in a collection or corpus. It’s often used as a weighting factor in information retrieval, text mining, and user modeling.
    TF-IDF from Scratch in Python on Real World Dataset
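The vector and set measures in the list above can each be written in a few lines. This is a minimal sketch with made-up inputs: the two vectors point in the same direction but differ in length, which shows why cosine similarity can rate two documents as identical while the distance measures rate them as far apart. The helper names are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between a and b: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    # Straight-line distance: smaller means more similar.
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return float(np.sum(np.abs(a - b)))

def jaccard(text_a, text_b):
    # Size of the shared word set divided by the size of the combined word set.
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

short_doc = np.array([1.0, 2.0, 0.0])
long_doc = np.array([10.0, 20.0, 0.0])  # same direction, ten times longer

print(cosine_sim(short_doc, long_doc))  # approximately 1.0
print(euclidean(short_doc, long_doc))   # approximately 20.12
print(manhattan(short_doc, long_doc))   # 27.0
print(jaccard("senior python developer", "junior python developer"))  # 0.5
```

Note that Jaccard similarity works on the raw words, not on embeddings, so it is a useful baseline to compare against the embedding-based measures.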

Click to view a basic example using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Assume 'embeddings' is a 2D numpy array containing your embeddings
# and 'df' is a pandas DataFrame with a 'Job Title' column, in the same row order
# Compute the cosine similarity matrix
similarity_matrix = cosine_similarity(embeddings)

# Function to recommend a job based on a given job description
def recommend_job(job_index, similarity_matrix=similarity_matrix):
    similarity_scores = list(enumerate(similarity_matrix[job_index]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar jobs, skipping the first entry (the job itself)
    similarity_scores = similarity_scores[1:11]
    # Get the job indices
    job_indices = [i[0] for i in similarity_scores]
    # Return the top 10 most similar jobs
    return df['Job Title'].iloc[job_indices]

Remember, this is just a basic example. There are many ways to build recommendation systems, and exploring these different methods is part of the fun and challenge of machine learning and data science!
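As one alternative, the same ranking idea works with TF-IDF vectors in place of embeddings. The sketch below uses four invented job descriptions and scikit-learn's `TfidfVectorizer`; the `most_similar` helper is a hypothetical name introduced for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented job descriptions standing in for the real dataset.
descriptions = [
    "Build machine learning models in Python",
    "Design marketing campaigns for social media",
    "Train and deploy machine learning models",
    "Manage social media marketing content",
]

# Turn each description into a TF-IDF weighted word vector.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(descriptions)

# Pairwise cosine similarity between all descriptions.
similarity = cosine_similarity(tfidf_matrix)

def most_similar(index, top_n=1):
    # Rank all jobs by similarity to the given one, excluding the job itself.
    ranked = similarity[index].argsort()[::-1]
    return [int(i) for i in ranked if i != index][:top_n]

print(most_similar(0))  # [2]: the other machine learning job
print(most_similar(1))  # [3]: the other marketing job
```

TF-IDF only matches on shared words, so it misses synonyms that embeddings can capture. Comparing the two on the same queries is a quick way to see the difference.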

For the scope of this project, we’ll only evaluate the text classification system, as it provides a more straightforward problem for beginners. While evaluating recommender systems is a critical part of the process, it requires user interaction data and poses a more complex challenge. Mastering text classification evaluation first will set a solid foundation for exploring more complex evaluation scenarios in the future.

STEM-Away Competition/ Evaluation

Now is the time to put all the knowledge you’ve gained to the ultimate test. We invite you to participate in the STEM-Away Machine Learning Competition.

In this competition, your task is to evaluate the performance of your job classification model on a separate test dataset that we will provide.

Your model will be evaluated primarily on the F1 score, a metric that considers both precision (the percentage of your model’s positive predictions that are correct) and recall (the percentage of actual positive instances that your model correctly identified).

In the context of this challenge, a higher F1 score translates into a model that is more proficient at accurately classifying job descriptions.
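To make the metric concrete, here is a small worked example that computes precision, recall, and F1 (the harmonic mean of the two) from raw counts. The counts are made up, and the sketch covers a single positive class only. How scores are averaged across multiple job classes in the competition is not specified here.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: share of positive predictions that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall: {recall:.3f}")        # 0.667
print(f"F1: {f1:.3f}")                # 0.727
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a model cannot score well by maximizing precision or recall alone.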

Please note, for this challenge, the use of pre-trained models is not allowed. You are required to train your model from scratch using the techniques and methodologies you’ve learned.

Are you ready to showcase your skills? Best of luck, and happy coding!

STEM-Away Evaluation for Microcredentials and Virtual-Internship Eligibility

For consideration for STEM-Away microcredentials or potential inclusion in our virtual internship program, your evaluation will encompass either a live demonstration or a video recording of your work. This will involve:

  • Job Classification Model: If you have completed the model and entered the STEM-Away Competition, you are eligible for consideration.

  • Progress Update: If you have not been able to complete your model, demonstrate the progress you’ve made so far. Show the functionality you’ve implemented and explain any challenges you faced. Even if the project isn’t fully completed, showing how far you got is enough to demonstrate your learning and problem-solving skills.

Note: This section has been updated as of 07/01 to include Progress Update for those who are unable to complete the full model


Environment Setup


Before you can begin working on the project, you’ll need to set up your programming environment. If you’re new to Python or need a refresher, here’s a step-by-step guide to get you started:

1. Install Python

Download and install Python from the official Python website: Download Python. Python 3.6 is the minimum for this project, but a recent release is recommended, since current versions of libraries such as NumPy and scikit-learn no longer support the oldest Python 3 versions.

2. Install pip

pip is a package manager for Python that we’ll use to install the necessary libraries for this project. If you installed Python from the official website, you should already have pip installed. You can check by typing pip --version in your command line.

If you don’t have pip installed, you can follow the instructions here: Installation - pip documentation v23.2.1

3. Set up a Virtual Environment

It’s a good practice to create a virtual environment for each of your Python projects. This isolates your project and its dependencies from other projects, which can prevent version conflicts.

Here’s how you can set up a virtual environment:

  • On Windows:
python -m venv myenv
  • On macOS and Linux:
python3 -m venv myenv

Replace myenv with the name you want to give your virtual environment.

To activate the virtual environment, navigate to the directory where you created the environment and run:

  • On Windows:
myenv\Scripts\activate
  • On macOS and Linux:
source myenv/bin/activate

4. Install Necessary Libraries

Once your virtual environment is activated, you can install the necessary Python libraries for this project using pip. For this project, you’ll likely need the following libraries:

  • numpy
  • pandas
  • scikit-learn (imported in code as sklearn)
  • nltk
  • tensorflow or pytorch (for BERT embeddings)
  • gensim (for word2vec and GloVe embeddings)

You can install them using the following command:

pip install numpy pandas scikit-learn nltk tensorflow gensim

For PyTorch installation, follow the guide on the official PyTorch website, as the installation command varies depending on your system and whether you have CUDA available.

With these steps, you should have your programming environment set up and ready to go for this project. Happy coding!

OpenAI Setup

1. Create an OpenAI Account

Go to OpenAI Platform and sign up for an account. You will need to provide some information, including your email address. After you sign up, you should receive a confirmation email. Once your email is confirmed, you can log in to your account.

2. Get Your API Key

Once you’ve logged into your OpenAI account, navigate to the API section of the dashboard (OpenAI Platform). From here, you can generate a new API key.

3. Install OpenAI Python Client

You can install the OpenAI Python client library using pip:

pip install openai

4. Use the API Key in Your Code

Now that you have your API key, you can use it in your Python code to make requests to the OpenAI API. Here’s an example of how you might use it:

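from openai import OpenAI

# Written for openai>=1.0
client = OpenAI(api_key='your-api-key')

# Example usage of the API (the model name is an example; see the official guide):
response = client.embeddings.create(
  model="text-embedding-3-small",
  input=["Once upon a time..."]
)
embedding = response.data[0].embedding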

Replace 'your-api-key' with the API key you obtained from the OpenAI dashboard. Keep your API key confidential: do not publish or share it with anyone, and do not commit it to version control.
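One common way to keep the key out of your code and out of version control is to read it from an environment variable. This is a minimal sketch: `load_api_key` is a hypothetical helper introduced here, and `OPENAI_API_KEY` is the variable name the OpenAI client looks for by default.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    # Read the key from the environment instead of hard-coding it.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this script."
        )
    return key

# Usage (after setting the variable in your shell):
#   from openai import OpenAI
#   client = OpenAI(api_key=load_api_key())
```

Set the variable in your terminal session or in a local file that is listed in `.gitignore`, never in a file you commit.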

That’s it! You should now be able to use OpenAI’s Embedding API in your project. Remember to follow OpenAI’s usage policies and guidelines while using their APIs.

Feeling stuck? Click to see a Python code snippet showing how to train a simple text classification model using embeddings
# Import necessary libraries
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec

# Assume X is your list of sentences (list of lists of words)
# and y is the list of corresponding labels.

# First, let's create word2vec embeddings for our text data.
# Passing the sentences to the constructor builds the vocabulary and trains the model,
# so no separate call to model.train() is needed.
model = Word2Vec(sentences=X, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

# Now, we'll create sentence vectors by averaging word2vec vectors of all words in a sentence.
X_vectors = []
for sentence in X:
    sentence_vector = np.mean([model.wv[word] for word in sentence], axis=0)
    X_vectors.append(sentence_vector)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size=0.2, random_state=42)

# Define the model
lr_model = LogisticRegression()

# Train the model
lr_model.fit(X_train, y_train)

# Evaluate the model
print('Training accuracy:', lr_model.score(X_train, y_train))
print('Testing accuracy:', lr_model.score(X_test, y_test))

Recording of Machine Learning Mentor Webinar (06/21)

I would like to request a clean version of the startup jobs dataset

Tagging @Dilshaan_Sandhu.

Dilshaan, can you provide the dataset? Thanks!


@Fay_Elhassan and anyone else who will be using these datasets

The folder I have provided below has multiple CSV files that are relatively clean. The Job Description, Job Title, Job Tag, and Job Link columns have no NA or null values. The Job Type and Job Location columns do, because many of the scraped jobs did not provide information on those attributes.

If you want to create a model that predicts those two attributes, please use the all_jobs.csv file and filter out all the jobs that have any NA values. This will provide you with data from different types of jobs with no NA values.
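A minimal pandas sketch of that filtering step is shown here. It uses a toy DataFrame in place of all_jobs.csv and assumes the column headers match the names used above exactly, so check the CSV header before relying on them.

```python
import pandas as pd

# Toy frame standing in for the real file; in practice you would load it with
# jobs = pd.read_csv("all_jobs.csv")
jobs = pd.DataFrame({
    "Job Title": ["Data Analyst", "Backend Engineer", "Designer"],
    "Job Type": ["Full-time", None, "Contract"],
    "Job Location": ["Remote", "Berlin", None],
})

# Keep only rows with no missing values in any column.
complete_jobs = jobs.dropna().reset_index(drop=True)
print(len(jobs), "->", len(complete_jobs))
```

To drop rows based on the two sparse columns only, `jobs.dropna(subset=["Job Type", "Job Location"])` does the same thing restricted to those columns.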

Finally, feel free to request more data if needed! I am able to scrape jobs from various locations, fields, and time commitments and am willing to provide that data if needed.

If you have any further questions, message me on the STEMAway site at @Dilshaan_Sandhu.

Data: STEMAway StartUp Job’s Datasets

Good luck with the Starter Projects!
Dilshaan Sandhu
