In this project, you'll write a program to fetch and analyze top movies of your favorite genre (mine is Crime). In particular, we will scrape data from IMDb and then analyze the data to spot top directors and actors, related genres, and so on. In the process, you'll see how to look for and use a new Python library (Python has hundreds of them). Let's get started!
Note: You need to have Python installed on your computer to be able to do this project.
Overview
Overall, the project consists of the following steps.
- Step 1: Scrape IMDb, i.e. download data from IMDb.com using the urllib.request Python library (you learnt about it in the Python3 utilities tutorial). After this step, you will have the raw HTML of some webpages from IMDb.com.
- Step 2: Parse HTML to extract data: The raw HTML we get will be extremely dirty. We'll see how to parse the HTML and extract the information we care about, such as movie titles, rating, genre, director and cast names, etc. For this, we'll be using the BeautifulSoup Python library. (more on this later)
- Step 3: Analyze the data, i.e. calculate related genres, spot top directors and actors / actresses, etc.
Step 1: Scrape IMDb
Here's the IMDb link to Highest Rated Crime Feature Films With At Least 25000 Votes. Go to the page and see the information available on the page, as well as the structure of the URL.
Since IMDb is a database, you can query its data by changing the parameters in the URL. In particular, feel free to pick any genre you like (I've chosen Crime). Notice also that you can go to the next page by changing the page parameter.
For step one, use the urllib.request library to download the HTML of the page. Some tips:
- You might find the Examples section of the urllib.request library documentation useful. Make sure the result you have is a Python string, and not bytes.
- It will probably be good to have a variable called NPAGES that controls how many pages to scrape. (By convention, names in all capital letters mark constants, i.e. values that never change while the code runs.) Set it to 1 initially so that things run quickly while you are figuring them out. Once all parts are working, you can change it to 10, which gives you data on about 500 movies.
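Here is a minimal sketch of how Step 1 could look. The query parameters in build_url are my guess at what the IMDb search URL looks like, so treat them as assumptions and copy the real ones from your browser's address bar. The helper names (build_url, fetch_html, scrape) are made up for this example.

```python
import urllib.parse
import urllib.request

NPAGES = 1  # constant: pages to scrape; raise it to 10 once everything works

# Assumption: the search page lives at this address.
BASE_URL = "https://www.imdb.com/search/title"


def build_url(genre, page):
    """Return the URL of one results page (parameter names are assumptions)."""
    params = {
        "genres": genre,
        "title_type": "feature",
        "num_votes": "25000,",
        "sort": "user_rating,desc",
        "page": page,
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_html(url):
    """Download one page and return it as a str (not bytes)."""
    with urllib.request.urlopen(url) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def scrape(genre, npages=NPAGES):
    """Return a list holding the HTML of the first `npages` results pages."""
    return [fetch_html(build_url(genre, page)) for page in range(1, npages + 1)]
```

Calling scrape("crime") would then give you a list with one HTML string per page. Nothing is downloaded until you call it, so you can experiment with build_url on its own first.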
macOS issue: SSLCertVerificationError
If you are on macOS and get the following error when opening the URL:
SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED]
then go to Applications > Python 3.x and double-click Install Certificates.command. In general, when you encounter an error you don't understand, copy-pasting the last line of the error into Google is the best option (the last line because it contains the error code and a one-line summary).
Step 1 Checkpoint
This step is complete if you have the HTML for a few pages in a Python list or something similar.
The HTML should look similar to the following.
<!DOCTYPE html><html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">......
In particular, if you search for the movie name on top of the IMDb page ("The Godfather" for me) in the HTML, you should see a section similar to the following:
<div class="lister-item mode-advanced">
  <div class="lister-top-right">
    <div class="ribbonize" data-caller="filmosearch" data-tconst="tt0068646"></div>
  </div>
  <div class="lister-item-image float-left">
    <a href="/title/tt0068646/?ref_=adv_li_i"> <img alt="The Godfather" class="loadlate" data-tconst="tt0068646" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY98_CR1,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB499613450_.png" width="67"/></a>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <span class="lister-item-index unbold text-primary">1.</span>
      <a href="/title/tt0068646/?ref_=adv_li_tt">The Godfather</a>
      <span class="lister-item-year text-muted unbold">(1972)</span>
    </h3>
    <p class="text-muted ">
      <span class="certificate">R</span>
      <span class="ghost">|</span>
      <span class="runtime">175 min</span>
      <span class="ghost">|</span>
      <span class="genre">Crime, Drama </span>
    </p>
    <div class="ratings-bar">
      <div class="inline-block ratings-imdb-rating" data-value="9.2" name="ir">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>9.2</strong>
      </div>
      ......
    <p class="text-muted">The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.</p>
    <p class="">
      Director:
      <a href="/name/nm0000338/?ref_=adv_li_dr_0">Francis Ford Coppola</a>
      <span class="ghost">|</span>
      Stars:
      <a href="/name/nm0000008/?ref_=adv_li_st_0">Marlon Brando</a>,
      <a href="/name/nm0000199/?ref_=adv_li_st_1">Al Pacino</a>,
      <a href="/name/nm0001001/?ref_=adv_li_st_2">James Caan</a>,
      <a href="/name/nm0000473/?ref_=adv_li_st_3">Diane Keaton</a>
    </p>
    <p class="sort-num_votes-visible">
      <span class="text-muted">Votes:</span>
      <span data-value="1354603" name="nv">1,354,603</span>
      <span class="ghost">|</span>
      <span class="text-muted">Gross:</span>
      <span data-value="134,966,411" name="nv">$134.97M</span>
    </p>
  </div>
</div>
Make sure you can spot the title, rating, year of release, etc. in the snippet above.
Pretty messy huh? Let's clean it up!
Step 2: Parse HTML to extract data
Now, we need to clean the HTML and extract the data. Starting from the messy HTML, we want a list of dictionaries that look like the following:
{
    'title': 'The Godfather',
    'year': 1972,
    'duration': 175,
    'genres': ['Crime', 'Drama'],
    'rating': 9.2,
    'director': 'Francis Ford Coppola',
    'stars': ['Marlon Brando', 'Al Pacino', 'James Caan', 'Diane Keaton']
}
Here are some tips:
- Use the BeautifulSoup library. You can install it using pip3 install bs4. How would you find the library if I didn't tell you? Just Google "python library to parse html". Python has hundreds of useful libraries, and in general, for anything that's involved, it's good to see if someone has already made a library instead of implementing it yourself from scratch.
- See the Quick Start section of Beautiful Soup Documentation. It will contain examples of everything we need.
- In particular, use soup.prettify() to get a prettier version of the HTML, and understand the structure of the HTML page.
- Then use the soup.find_all() function to find relevant parts of the HTML. For example, soup.find_all(attrs={'class':'lister-item'}) gives all elements in the page with class 'lister-item'. Assuming IMDb hasn't changed the structure of its website, that should give you a list of soups, where each one contains all the details for one movie. If the HTML has changed since I wrote the code, you'll have to update the above based on what you see in soup.prettify().
- Then, for each movie_soup, see how you can extract the title, genres, duration, etc. For example, the following code extracts the year information.
year = int(movie_soup.find(class_='lister-item-year').get_text()[-5:-1])
- Don't worry if the code gets messy. Web scraping always is.
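To make the tips above concrete, here is a rough sketch that parses an abridged copy of the Godfather markup from the Step 1 checkpoint. The function name parse_movie is made up, and telling directors from stars by the "_li_dr_" and "_li_st_" fragments in the link URLs is an assumption based only on that one sample, so check it against your own soup.prettify() output.

```python
from bs4 import BeautifulSoup

# Abridged version of the sample markup from the Step 1 checkpoint.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <span class="lister-item-index unbold text-primary">1.</span>
      <a href="/title/tt0068646/?ref_=adv_li_tt">The Godfather</a>
      <span class="lister-item-year text-muted unbold">(1972)</span>
    </h3>
    <p class="text-muted ">
      <span class="runtime">175 min</span>
      <span class="genre">Crime, Drama </span>
    </p>
    <div class="ratings-bar">
      <div class="inline-block ratings-imdb-rating" data-value="9.2" name="ir">
        <strong>9.2</strong>
      </div>
    </div>
    <p class="">
      Director:
      <a href="/name/nm0000338/?ref_=adv_li_dr_0">Francis Ford Coppola</a>
      Stars:
      <a href="/name/nm0000008/?ref_=adv_li_st_0">Marlon Brando</a>,
      <a href="/name/nm0000199/?ref_=adv_li_st_1">Al Pacino</a>,
      <a href="/name/nm0001001/?ref_=adv_li_st_2">James Caan</a>,
      <a href="/name/nm0000473/?ref_=adv_li_st_3">Diane Keaton</a>
    </p>
  </div>
</div>
"""


def parse_movie(movie_soup):
    """Turn the soup for one movie into a dictionary."""
    header = movie_soup.find(class_="lister-item-header")
    links = movie_soup.find_all("a", href=True)
    # Assumption: director links contain "_li_dr_", star links "_li_st_".
    directors = [a.get_text(strip=True) for a in links if "_li_dr_" in a["href"]]
    stars = [a.get_text(strip=True) for a in links if "_li_st_" in a["href"]]
    genre_text = movie_soup.find(class_="genre").get_text()
    return {
        "title": header.find("a").get_text(strip=True),
        "year": int(movie_soup.find(class_="lister-item-year").get_text()[-5:-1]),
        "duration": int(movie_soup.find(class_="runtime").get_text().split()[0]),
        "genres": [genre.strip() for genre in genre_text.split(",")],
        "rating": float(movie_soup.find(class_="ratings-imdb-rating")["data-value"]),
        "director": directors[0] if directors else None,
        "stars": stars,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movies = [parse_movie(m) for m in soup.find_all(attrs={"class": "lister-item"})]
print(movies[0])
```

On real pages some movies may be missing a field (no runtime, say), in which case find() returns None and the code above raises an error. Handling those cases is part of the messiness mentioned in the last tip.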
This is the most time-consuming and frustrating step in the entire project. Expect to run into lots of issues. Be patient and keep going.
Step 3: Analyze the data
All the hard work is done! Now comes the easy part: analyzing clean data. Here are some example analyses you could do.
The first step in analyzing data is to look at it. Let's print the list of movies.
Here's my sample output (rightmost column is a visualization of the rating):
The Godfather (1972) - Rating 9.2 *********************
The Dark Knight (2008) - Rating 9.0 ********************
The Godfather: Part II (1974) - Rating 9.0 ********************
Pulp Fiction (1994) - Rating 8.9 *******************
12 Angry Men (1957) - Rating 8.9 *******************
Goodfellas (1990) - Rating 8.7 ******************
Se7en (1995) - Rating 8.6 ******************
The Silence of the Lambs (1991) - Rating 8.6 ******************
Léon: The Professional (1994) - Rating 8.6 ******************
The Usual Suspects (1995) - Rating 8.6 ******************
City of God (2002) - Rating 8.6 ******************
The Departed (2006) - Rating 8.5 *****************
The Green Mile (1999) - Rating 8.5 *****************
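A listing like this could be produced roughly as follows, assuming your movies are dictionaries shaped like the one in Step 2. The bar length (two stars per rating point) is an arbitrary choice for this sketch and not necessarily the scale used for the sample output, and format_movie is a made-up name.

```python
# A few hand-typed entries standing in for the scraped list.
movies = [
    {"title": "Goodfellas", "year": 1990, "rating": 8.7},
    {"title": "The Godfather", "year": 1972, "rating": 9.2},
    {"title": "The Dark Knight", "year": 2008, "rating": 9.0},
]


def format_movie(movie):
    """Return one output line: title, year, rating and a bar of stars."""
    bar = "*" * round(movie["rating"] * 2)  # arbitrary scale: 2 stars per point
    return f"{movie['title']} ({movie['year']}) - Rating {movie['rating']:.1f} {bar}"


# Print the highest-rated movies first.
for movie in sorted(movies, key=lambda m: m["rating"], reverse=True):
    print(format_movie(movie))
```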
How many movies in the top 500 were released in each decade?
Here's my sample output:
================================================================================
Number of movies by decade
--------------------------------------------------------------------------------
1930s    1
1940s    7
1950s    12
1960s    15
1970s    28
1980s    41
1990s    107
2000s    175
2010s    114
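One way to get these counts is to round each year down to its decade and tally the results with collections.Counter. This is only a sketch: the function names are invented, and the four movies are a hand-typed stand-in for your scraped list.

```python
from collections import Counter

# Hand-typed stand-in for the scraped list.
movies = [
    {"title": "The Godfather", "year": 1972},
    {"title": "The Godfather: Part II", "year": 1974},
    {"title": "Pulp Fiction", "year": 1994},
    {"title": "The Dark Knight", "year": 2008},
]


def movies_by_decade(movies):
    """Return (decade, count) pairs sorted by decade, e.g. (1970, 2)."""
    counts = Counter((movie["year"] // 10) * 10 for movie in movies)
    return sorted(counts.items())


def print_table(title, rows):
    """Print a titled two-column table in the style of the sample output."""
    print("=" * 80)
    print(title)
    print("-" * 80)
    for label, count in rows:
        print(f"{label:<20}{count}")


print_table(
    "Number of movies by decade",
    [(f"{decade}s", count) for decade, count in movies_by_decade(movies)],
)
```

Integer division by 10 drops the last digit of the year, so 1972 and 1974 both land in the 1970 bucket.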
Which genres are most related to your favorite genre?
Calculate this by seeing the most common genres for the top 500 movies. Here's my sample output:
================================================================================
Related genres
--------------------------------------------------------------------------------
Drama        344
Action       139
Thriller     132
Comedy       95
Mystery      66
Biography    47
Adventure    25
Romance      17
Animation    10
Horror       9
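Since each movie carries a list of genres, the counting has to flatten those lists first. Here is a small sketch that does so and leaves out the favorite genre itself, which would otherwise top the list trivially. FAVORITE_GENRE and related_genres are names invented for this example, and the movies are hand-typed.

```python
from collections import Counter

FAVORITE_GENRE = "Crime"

# Hand-typed stand-in for the scraped list.
movies = [
    {"title": "The Godfather", "genres": ["Crime", "Drama"]},
    {"title": "The Dark Knight", "genres": ["Action", "Crime", "Drama"]},
    {"title": "Pulp Fiction", "genres": ["Crime", "Drama"]},
    {"title": "Se7en", "genres": ["Crime", "Drama", "Mystery"]},
    {"title": "Léon: The Professional", "genres": ["Action", "Crime", "Drama"]},
]


def related_genres(movies, favorite=FAVORITE_GENRE, top=10):
    """Return the `top` most common genres, excluding the favorite one."""
    counts = Counter(
        genre
        for movie in movies
        for genre in movie["genres"]
        if genre != favorite
    )
    return counts.most_common(top)


for genre, count in related_genres(movies):
    print(f"{genre:<12} {count}")
```

The same pattern should work for any other list-valued field, such as 'stars'.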
Which directors have the most movies in the top 500?
Here's my sample output:
================================================================================
Top directors
--------------------------------------------------------------------------------
Martin Scorsese         10
Steven Soderbergh       7
Quentin Tarantino       6
Sidney Lumet            6
Tony Scott              6
Francis Ford Coppola    5
David Fincher           5
Guy Ritchie             5
Ridley Scott            5
Brian De Palma          4
Which stars (actors and actresses) have performed in the most movies among the top 500?
Here's my sample output:
================================================================================
Top stars
--------------------------------------------------------------------------------
Robert De Niro       16
Al Pacino            11
George Clooney       9
Denzel Washington    9
Brad Pitt            8
Christian Bale       7
Samuel L. Jackson    7
Kevin Spacey         7
Mark Wahlberg        7
Jason Statham        7
Solution
I hope you were able to complete the project. If you get completely stuck and need help, you can take a peek at my implementation: imdb.py. Note that if IMDb changes its website structure, that code might stop working. As of 2nd August, 2018, the code produces the output you saw in the tutorial.
Next steps
You can very easily change the code to perform the above analysis for all genres instead of just one genre. It might take some time to scrape that data, so it might be a good idea to write your scraper in a way such that (a) it saves the data that it scrapes to your computer, and (b) it checks if something has already been scraped before scraping again (this ensures that if you stop it in the middle or encounter an error, the program can continue from where it left off).
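Points (a) and (b) could be sketched like this. The folder name, the file naming scheme and the function cached_fetch are all made up for illustration. The fetch argument stands for whatever download function you wrote in Step 1.

```python
from pathlib import Path

CACHE_DIR = Path("imdb_cache")  # hypothetical folder for saved pages


def cached_fetch(genre, page, fetch, cache_dir=CACHE_DIR):
    """Return the HTML for (genre, page), downloading only if it isn't saved yet.

    `fetch` is any function taking (genre, page) and returning the HTML as a str.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{genre}_{page}.html"
    if path.exists():
        # (b) Already scraped: read it from disk instead of hitting the site.
        return path.read_text(encoding="utf-8")
    html = fetch(genre, page)
    # (a) Save what we scraped so an interrupted run can pick up from here.
    path.write_text(html, encoding="utf-8")
    return html
```

Because every page is written to disk as soon as it is downloaded, restarting the program after an error only fetches the pages that are still missing.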
Conclusion
Congratulations on learning how to scrape the web with Python! Just as there is urllib for making HTTP requests and BeautifulSoup for parsing HTML, Python has hundreds of libraries for all sorts of stuff.
This includes sending emails and SMS, drawing visualizations and graphs, working with images, and so on. Since you can't know all the libraries, the best you can do is be able to use a new one you come across. And if you completed this project, then you were able to do just that. Kudos!