STEMCasts - Technical Deep Dive - Data Mining

Thank you @maleeha_imran for an amazing webinar! We will have the recording up shortly. Meanwhile, here are a few screenshots for those of you who are super keen to try out the steps demonstrated in the webinar!

  • What is Web Crawling? Gathering data from a website
  • Python Basics: installation, variables, understanding HTML structure
  • Intro to web requests/responses: urllib (a Python library for making web requests)
  • Building a basic crawler: Scrapy (a free, open-source Python web crawling framework) and BeautifulSoup (a library for pulling data out of HTML/XML files)
  • Crawling Tutorial


First pass editing done. Will be uploaded to STEMCasts library shortly.

Overview:



The primary ways to get data:



What are you allowed to crawl? Honor the rules in Robots.txt.
Start with sites created specifically to help you learn. Example: quotes.toscrape.com
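You can also check robots.txt rules programmatically. Here is a minimal sketch using Python's standard urllib.robotparser (the rules below are hypothetical, for illustration only, not the site's actual robots.txt):

```python
import urllib.robotparser

# Hypothetical robots.txt rules, parsed offline for illustration.
# Against a real site you would instead do:
#   rp.set_url("http://quotes.toscrape.com/robots.txt"); rp.read()
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://quotes.toscrape.com/page/1"))     # allowed
print(rp.can_fetch("*", "http://quotes.toscrape.com/private/x"))  # disallowed
```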






BeautifulSoup Code to sort and print tags from quotes.toscrape.com:
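In case the screenshot does not load for you, a minimal sketch of such code might look like this (the inline HTML snippet is a stand-in so the example runs offline; it is not the exact webinar code):

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a quotes.toscrape.com page,
# so the example runs without a network connection.
html = """
<div class="quote">
  <span class="text">"A day without sunshine is like, you know, night."</span>
  <small class="author">Steve Martin</small>
  <a class="tag" href="/tag/humor/">humor</a>
  <a class="tag" href="/tag/obvious/">obvious</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the tag links, sort them alphabetically, and print them.
tags = sorted(a.get_text() for a in soup.find_all("a", class_="tag"))
for tag in tags:
    print(tag)
```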



Tool built specifically for scraping:




Hey all!
Just wanted to pop in and mention that if you've had a chance to experiment with web scraping since the webinar and have any questions, I'm here to help!


Thank you @maleeha_imran!

BeautifulSoup will be used in the first step.

ML Participants & Observers: Get familiar with the tool with this excellent webinar!

Thank you very much Maleeha!
I was trying to use the method you showed in the video to open a JSON file from a STEM-Away webpage, but I got an error that I am confused about. It says that json does not have the load attribute. Could you help? Thank you.

You might have called your own file json.py. Make sure that you do not have any file named json.py in your folder.
Also, if your Python version is below 2.6, try "import simplejson as json" instead. You might need to run "pip install simplejson" in the terminal before importing it in Python.
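To illustrate what the standard library provides (and why shadowing it with a local json.py breaks things), here is a small sketch using only the standard library; the JSON string is made up:

```python
import io
import json

# If a local file named json.py shadows the standard library, the
# import above picks it up instead and json.load goes missing.
buf = io.StringIO('{"name": "STEM-Away", "pages": 3}')
data = json.load(buf)              # load: read from a file-like object
same = json.loads(buf.getvalue())  # loads: read from a string
print(data == same)
```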

#stemaway-star

Thank you for the response, Xianbo! I am using Python 3, so does that mean I don't need "import simplejson as json"? I also renamed my file to demo.py so it does not overlap with the written code. But a new weird error popped up… Do you know how I can fix it? Thank you.

Can you try setting the errors parameter in open() to 'ignore', and the encoding to 'utf-8'? That means changing line 3 to:
with open('/Users/kittyguz/Desktop/2032.webarchive',
          encoding='utf-8', errors='ignore') as json_file:

After that, compare the data variable with the actual contents of that file you are trying to open. You might be missing a few characters that could be causing the UnicodeDecodeError.

Thanks for the reply, Maleeha! I am quite new to web crawling, so I am still pretty confused by your last statement about "comparing the data variable with the actual contents of the file." Do you mean that I might need to change "data" into something else based on the content of the file that I am trying to open? I have included a screenshot of the file (white background), my changed code, and the new error message. I googled this new error a bit, but I did not seem to understand it. Would you explain a bit more? Thank you!

Python 3 has the json package built in, so you don't need simplejson. The original error was just because you had a file called json.py.

The error means that the webarchive file is not in UTF-8; it's a file type created by Apple. I don't know how to decode it directly, so you could convert the file instead. Install the pywebarchive package with "pip install pywebarchive", then try the following code to convert the webarchive file to an HTML file (or XML if you want):

import webarchive
archive = webarchive.open("your_file_name.webarchive")
archive.extract("new_file_name.html")

Then you can read the content of the HTML file with a couple of lines:

data = open("new_file_name.html")
data.readline()  # read one line of the file

Thanks again for the detailed reply, Xianbo. The webarchive file I used is just a random one that I downloaded from a STEM-Away page by typing ".json" after the URL; it said I could only download it as ".webarchive". Would you explain what the point of installing pywebarchive is? So the error is caused because the file is not in the right format / not in UTF-8, so it cannot be read?
Sorry about the question bombing, and thank you very much!

A webarchive file is created by Safari. The error is caused because at least some of the content is not UTF-8. I'm not familiar with .webarchive; the format is not commonly used, unlike HTML, so you might need some additional work to open it in Python.

One way to open a .webarchive file is to use the pywebarchive package, as I previously mentioned; I just converted one to an .html file with that package. An alternative solution is to use another web browser, such as Google Chrome, to download the page in HTML format and avoid the webarchive format altogether.

Besides, there are many ways to get the content of a webpage, and it's not necessary to download the page to your computer first. For example, try the following code to see if it works:

import urllib
link = "https://stemaway.com/"
file = urllib.request.urlopen(link)
data = file.read()
print(data)


I tried the code and terminal shows “module ‘urllib’ has no attribute ‘request’” :thinking:

Edit: Following your other suggestion, I got it by downloading from Google Chrome! Thank you very much Xianbo!

If you are using Python 2, urllib.urlopen should work instead.
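On Python 3, the submodule has to be imported explicitly; "import urllib" alone does not pull in urllib.request, which is what triggers the "no attribute 'request'" error above. A small sketch (the fetch wrapper name is just for illustration, and the actual download is commented out since it needs a network connection):

```python
import urllib.request

def fetch(url):
    """Download a page on Python 3, where urlopen lives in
    urllib.request (on Python 2 it was urllib.urlopen)."""
    with urllib.request.urlopen(url) as response:
        return response.read()

# Example (requires a network connection):
#   data = fetch("https://stemaway.com/")
#   print(data[:100])
```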

Hi Maleeha! If possible, can you share the slides that you used for this presentation. Thanks!

Sure thing. https://docs.google.com/presentation/d/1kY1ITkC1ZH_370HB2fVyQU3Px3xsDQ3Pg-zs--uHfSc/edit?usp=sharing


Hi Maleeha

Hope you are doing well. Thank you for the webinar; I was just able to go through it. It has been easy to follow, but I need help in two places.

  1. For your demonstration of BeautifulSoup, I wrote the script:

This is the output I get when I print the ‘soup’ and ‘tags’

For some reason, the script is not picking up the tags from the website.

  2. I failed to understand how to set up the folders to get scrapy and the spider working. I tried looking online but could not find anything relevant to what I thought you were explaining. Can you please suggest some literature I can go through to catch up on this?

Thank you so much!

Saad

Hi Saad,
Would you mind printing out the base + page link that you are trying to access? The HTML that popped up in your terminal output looks like it might be a redirect page, not the quotes page. Also, if we need to continue debugging, I'd suggest messaging me on Slack so we don't crowd up this particular topic/post.

Also, for your 2nd question, here’s a link to a tutorial:
https://docs.scrapy.org/en/latest/intro/tutorial.html

Hi Maleeha

Thank you for the reply! You found the exact problem. The base + page was "quotes.toscrape.com/1" instead of "quotes.toscrape.com/page/1". Adding a slash to the front of the page entries in the frontier list fixed the problem, and now I am getting the output I want. Thank you.
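For anyone hitting similar joining bugs: the standard library's urllib.parse.urljoin handles base + page joining and avoids missing-slash mistakes. A quick sketch (URLs are illustrative):

```python
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/page/1"

# A bare relative path replaces the last segment of the base path;
# an absolute path replaces the whole path. Both give the same result here.
print(urljoin(base, "2"))        # http://quotes.toscrape.com/page/2
print(urljoin(base, "/page/2"))  # http://quotes.toscrape.com/page/2
```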

I am going through the scrapy tutorial right now as well; I will leave a message on Slack if I need any support.

It would be super helpful if you could reply with the email you use for Slack so I can add you.

Saad

Sure, I’m using my mentor chains email, which is maleeha_imran@mentorchains.com