The following code snippets offer a glimpse into the basic operations involved in web scraping, data cleaning, and EDA. Remember, real-world data requires more rigorous analysis and might necessitate advanced data cleaning steps such as removing duplicate entries or outliers, normalizing text data, etc. As you progress, make it a point to explore multiple aspects of your data and identify relationships between different variables.
1. Web Scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup
# URL of the page you want to scrape
url = "https://startup.jobs/"
# Send a GET request to the webpage
response = requests.get(url)
response.raise_for_status()  # fail fast if the request did not succeed
# Parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find job titles and locations. Note: these class names reflect the
# site's markup at the time of writing; inspect the page in your
# browser's developer tools to confirm the current selectors.
titles = soup.find_all('h5', class_='job-listing-title')
locations = soup.find_all('span', class_='job-listing-location')
# Print each job title and location
for title, location in zip(titles, locations):
    print(f'Title: {title.text.strip()}')
    print(f'Location: {location.text.strip()}')
    print()
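Rather than only printing the results, the scraped lists can be collected into a pandas DataFrame and saved, which feeds directly into the cleaning step below. A minimal sketch, assuming the `titles` and `locations` result sets from above (stubbed here with sample strings in place of live scrape results):

```python
import pandas as pd

# Stand-ins for the scraped results; in practice these would be the
# stripped .text values pulled from the BeautifulSoup result sets above
titles = ["Backend Engineer", "Data Analyst"]
locations = ["Remote", "Berlin"]

# Pair each title with its location in a tabular structure
df = pd.DataFrame({"Title": titles, "Location": locations})

# Persist the raw scrape so the cleaning step can start from a file
df.to_csv("your_data.csv", index=False)
```

Saving to a file between steps keeps the scrape and the cleaning decoupled, so you can re-run the analysis without hitting the site again.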
2. Data Cleaning with Pandas
import pandas as pd
# Load your data
df = pd.read_csv('your_data.csv')
# Remove rows with missing values
df_cleaned = df.dropna()
# Save the cleaned data back into a CSV file
df_cleaned.to_csv('cleaned_data.csv', index=False)
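Beyond dropping missing rows, the duplicate removal and text normalization mentioned in the introduction can be sketched as follows (the column names and values here are illustrative):

```python
import pandas as pd

# Illustrative frame with inconsistent text and a near-duplicate row
df = pd.DataFrame({
    "Title": ["Data Analyst ", "data analyst", "ML Engineer"],
    "Location": ["Berlin", "Berlin", "Remote"],
})

# Normalize text: trim whitespace and standardize case so that
# near-identical entries compare as equal
df["Title"] = df["Title"].str.strip().str.lower()
df["Location"] = df["Location"].str.strip().str.lower()

# Remove exact duplicate rows that remain after normalization
df_cleaned = df.drop_duplicates()
print(df_cleaned)
```

Note the order matters: normalizing first lets `drop_duplicates()` catch entries that differ only in case or stray whitespace.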
3. Data Visualization with Pandas and Matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# Load your cleaned data
df = pd.read_csv('cleaned_data.csv')
# Get a basic statistical summary of the data
print(df.describe())
# Visualize the distribution of job locations
# (assumes the cleaned data has a 'Location' column)
df['Location'].value_counts().plot(kind='bar')
plt.title('Distribution of Job Locations')
plt.ylabel('Count')
plt.xlabel('Location')
plt.show()
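The statistical summary from `describe()` is also where outliers tend to show up. The outlier removal mentioned in the introduction is commonly done with the interquartile-range (IQR) rule; a sketch on a hypothetical numeric `Salary` column (not part of the scraped data above):

```python
import pandas as pd

# Hypothetical salary column with one extreme value
df = pd.DataFrame({"Salary": [50000, 55000, 60000, 62000, 1000000]})

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences
df_no_outliers = df[df["Salary"].between(lower, upper)]
print(df_no_outliers)
```

The 1.5 multiplier is a convention, not a law; widen or narrow it depending on how aggressive you want the filtering to be.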
A video from alumnus turned mentor: @Dilshaan_Sandhu.