Writing your scrape script
This page is also available as an interactive Jupyter notebook. Download it from here and run it in your cloudstor.
Now that we know where our data is, we can start coding our web scraper. You can follow this tutorial either in Jupyter or by executing the code on your own computer.
First, we need to import all the libraries that we are going to use.
from bs4 import BeautifulSoup
import urllib.request
If you get an error, make sure you have:
- Correctly installed BeautifulSoup: see here
- Restarted the kernel after completing the installation with “Kernel” -> “Restart”
Next, declare a variable for the URL of the Apocalypse Now page on the Internet Movie Database (IMDb).
# specify the url
imdb_page = 'https://www.imdb.com/title/tt0078788/'
Then, use Python's urllib to fetch the HTML page at the URL we declared.
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(imdb_page)
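Some sites may refuse requests that carry urllib's default user agent, and a network call can fail for other reasons too. The following is a minimal sketch of a more defensive fetch, not part of the original tutorial. The helper names `build_request` and `fetch_html` and the user-agent string are hypothetical choices made for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical user-agent string; an assumption, not something the tutorial prescribes.
USER_AGENT = "Mozilla/5.0 (compatible; scraping-tutorial)"


def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        print(f"HTTP error {err.code} for {url}")
    except urllib.error.URLError as err:
        # The request never got a valid answer (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

With this sketch, `fetch_html(imdb_page)` would return bytes that can be passed to `BeautifulSoup()` in the same way as the response object, or `None` if the download failed.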
Finally, parse the page with BeautifulSoup() so that we can work on it.
# parse the HTML using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(page, 'html.parser')
Now we have a variable soup containing the HTML of the page. Here’s where we can start coding the part that extracts the data.
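To experiment with parsing without downloading anything, the same call also accepts a plain string of HTML. This is a small sketch under that assumption; `sample_html` and `sample_soup` are hypothetical names, and the markup is invented for illustration rather than copied from IMDb.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not IMDb's actual markup.
sample_html = """
<html>
  <head><title>Sample movie page</title></head>
  <body>
    <h1> Apocalypse Now </h1>
    <p>A short description.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The parsed tree can be inspected before writing any extraction code.
page_title = sample_soup.title.string
print(page_title)
print(sample_soup.prettify()[:80])  # first 80 characters of the re-indented HTML
```

Printing part of `prettify()` is a quick way to check that the page was parsed as expected before searching it.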
Remember the unique layers of our data? BeautifulSoup can help us get into these layers and extract the content easily by using find(). In this case, since the HTML tag containing the title of the movie (h1) is unique on this page, we can simply query <h1>:
# Find the <h1> tag that holds the movie title
movie_title = soup.find('h1')
After we have the tag, we can get the data by getting its text.
movie_title = movie_title.text.strip() # strip() removes leading and trailing whitespace
print(movie_title)
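Note that find() returns None when no tag matches, in which case asking for .text raises an AttributeError. The sketch below shows one way to guard against that; the helper `text_of_first` and the sample markup are hypothetical and only serve as an illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
sample_html = "<html><body><h1>\n  Apocalypse Now \n</h1></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")


def text_of_first(soup, tag_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return tag.text.strip()


found_title = text_of_first(sample_soup, "h1")
missing = text_of_first(sample_soup, "h2")
print(found_title)
print(missing)
```

Returning None for a missing tag lets the calling code decide what to do, which can be useful when a page layout changes.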
Similarly, we can get the movie score too. In this case, however, we want to be more precise: instead of just searching for the tag span, we also want to specify that the attribute itemprop has the value ratingValue.
# get the score
score_box = soup.find('span', attrs={'itemprop':'ratingValue'})
score = score_box.text
print(score)
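The effect of the attrs filter is easier to see on a page with several span tags. The sketch below uses invented markup and a made-up rating value, so it illustrates the technique rather than IMDb's real page; the names `sample_soup`, `sample_score_box` and `sample_score` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a made-up rating; the real page layout may differ.
sample_html = """
<div>
  <span class="label">Rating</span>
  <span itemprop="ratingValue"> 7.5 </span>
  <span itemprop="ratingCount">1000</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only the first <span> whose itemprop attribute equals "ratingValue" is returned.
sample_score_box = sample_soup.find("span", attrs={"itemprop": "ratingValue"})

# .text is always a string; convert it before doing arithmetic or comparisons.
sample_score = float(sample_score_box.text.strip())
print(sample_score)
```

Without the attrs argument, find('span') would return the first span on the page, here the one containing the label rather than the score.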
Exercise
How would you get the duration of the movie?
duration_box = soup.find('_____')
duration = duration_box._____.strip()
print(duration)