The following code snippets offer a glimpse into the basic operations involved in web scraping, data cleaning, and EDA. Remember, real-world data requires more rigorous analysis and might necessitate advanced data cleaning steps such as removing duplicate entries or outliers, normalizing text data, etc. As you progress, make it a point to explore multiple aspects of your data and identify relationships between different variables.
1. Web Scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup
# URL of the page you want to scrape
url = "https://startup.jobs/"
# Send a GET request to the webpage
response = requests.get(url)
response.raise_for_status()  # fail fast if the request did not succeed
# Parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find job titles and locations. Note: these class names reflect the
# site's markup at the time of writing; inspect the page in your
# browser's developer tools to confirm the current selectors.
titles = soup.find_all('h5', class_='job-listing-title')
locations = soup.find_all('span', class_='job-listing-location')
# Print each job title and location
for title, location in zip(titles, locations):
    print(f'Title: {title.text.strip()}')
    print(f'Location: {location.text.strip()}')
    print()
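Rather than only printing the results, the scraped lists can be collected into a pandas DataFrame and saved, which feeds directly into the cleaning step below. A minimal sketch, assuming the `titles` and `locations` result sets from above (stubbed here with sample strings in place of live scrape results):

```python
import pandas as pd

# Stand-ins for the scraped results; in practice these would be the
# stripped .text values pulled from the BeautifulSoup result sets above
titles = ["Backend Engineer", "Data Analyst"]
locations = ["Remote", "Berlin"]

# Pair each title with its location in a tabular structure
df = pd.DataFrame({"Title": titles, "Location": locations})

# Persist the raw scrape so the cleaning step can start from a file
df.to_csv("your_data.csv", index=False)
```

Saving to a file between steps keeps the scrape and the cleaning decoupled, so you can re-run the analysis without hitting the site again.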
2. Data Cleaning with Pandas
import pandas as pd
# Load your data
df = pd.read_csv('your_data.csv')
# Remove rows with missing values
df_cleaned = df.dropna()
# Save the cleaned data back into a CSV file
df_cleaned.to_csv('cleaned_data.csv', index=False)
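Beyond dropping missing rows, the duplicate removal and text normalization mentioned in the introduction can be sketched as follows (the column names and values here are illustrative):

```python
import pandas as pd

# Illustrative frame with inconsistent text and a near-duplicate row
df = pd.DataFrame({
    "Title": ["Data Analyst ", "data analyst", "ML Engineer"],
    "Location": ["Berlin", "Berlin", "Remote"],
})

# Normalize text: trim whitespace and standardize case so that
# near-identical entries compare as equal
df["Title"] = df["Title"].str.strip().str.lower()
df["Location"] = df["Location"].str.strip().str.lower()

# Remove exact duplicate rows that remain after normalization
df_cleaned = df.drop_duplicates()
print(df_cleaned)
```

Note the order matters: normalizing first lets `drop_duplicates()` catch entries that differ only in case or stray whitespace.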
3. Data Visualization with Pandas and Matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# Load your cleaned data
df = pd.read_csv('cleaned_data.csv')
# Get a basic statistical summary of the data
print(df.describe())
# Visualize the distribution of job locations
# (assumes the cleaned data has a 'Location' column)
df['Location'].value_counts().plot(kind='bar')
plt.title('Distribution of Job Locations')
plt.ylabel('Count')
plt.xlabel('Location')
plt.show()
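The statistical summary from `describe()` is also where outliers tend to show up. The outlier removal mentioned in the introduction is commonly done with the interquartile-range (IQR) rule; a sketch on a hypothetical numeric `Salary` column (not part of the scraped data above):

```python
import pandas as pd

# Hypothetical salary column with one extreme value
df = pd.DataFrame({"Salary": [50000, 55000, 60000, 62000, 1000000]})

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences
df_no_outliers = df[df["Salary"].between(lower, upper)]
print(df_no_outliers)
```

The 1.5 multiplier is a convention, not a law; widen or narrow it depending on how aggressive you want the filtering to be.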
A video from alumnus turned mentor: @Dilshaan_Sandhu.