If you are using Microsoft Windows, note that the virtual environment activation command above is different; you should use venv\Scripts\activate instead.
Basic Scraping Technique
The first thing to do when writing a scraping script is to manually inspect the page or pages you want to scrape and determine where the data is located.

To begin with, we are going to look at the list of PyCon videos at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the HTML source of this page, we find that the structure of the video list is more or less as follows:
<div id="video-summary-content">
    <div class="video-summary">    <!-- first video -->
        <div class="thumbnail-data">...</div>
        <div class="video-summary-data">
            <div>
                <strong><a href="#link to video page#">#title#</a></strong>
            </div>
        </div>
    </div>
    <div class="video-summary">    <!-- second video -->
        ...
So the first task is to load this page and extract the links to the individual video pages, since that is where the links to the YouTube videos are.
Loading a web page using requests is extremely simple:
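A minimal sketch of this step, assuming the index page URL shown above is stored in a variable named index_url:

import requests

index_url = 'http://pyvideo.org/category/50/pycon-us-2014'

# Fetch the index page over HTTP.
response = requests.get(index_url)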
That's it! After this function returns, the HTML of the page is available in response.text.
The next task is to extract the links to the individual video pages. With BeautifulSoup this can be done using CSS selector syntax, which you may be familiar with if you work on the client side.
To obtain the links we will use a selector that captures the <a> elements inside each <div> with class video-summary-data. Since there are several <a> elements for each video, we will filter them to include only those that point to a URL that begins with /video, which is unique to the individual video pages. The CSS selector that implements these criteria is div.video-summary-data a[href^=/video]. The following snippet of code uses this selector with BeautifulSoup to obtain the <a> elements that point to video pages:
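A sketch of such a snippet, assuming BeautifulSoup 4 imported as bs4; the function name get_video_page_urls matches the one invoked later in the article:

import requests
import bs4

root_url = 'http://pyvideo.org'
index_url = root_url + '/category/50/pycon-us-2014'

def get_video_page_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # The attribute value is quoted here, which some selector engines require.
    return [a.attrs.get('href')
            for a in soup.select('div.video-summary-data a[href^="/video"]')]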
From these pages we can scrape the session title, which appears at the top. We can also obtain the names of the speakers and the YouTube link from the sidebar that appears on the right side below the embedded video. The code that gets these elements is shown below:
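A sketch consistent with the description that follows, continuing from the previous snippet. The sidebar selectors (div#sidebar, the /speaker href prefix, and taking the YouTube URL from the link text) are assumptions based on the prose, not confirmed markup:

def get_video_data(video_page_url):
    video_data = {}
    # The scraped URLs are relative, so prepend root_url (see below).
    response = requests.get(root_url + video_page_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    video_data['title'] = soup.select('div#videobox h3')[0].get_text()
    video_data['speakers'] = [a.get_text()
                              for a in soup.select('div#sidebar a[href^="/speaker"]')]
    video_data['youtube_url'] = soup.select(
        'div#sidebar a[href^="http://www.youtube.com"]')[0].get_text()
    return video_data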
The URLs returned from the scraping of the index page are relative, so the root_url needs to be prepended.
The session title is obtained from the <h3> element inside the <div> with id videobox. Note that the [0] index is needed because the select() call returns a list, even if there is only one match.
The speaker names and YouTube links are obtained in a similar way to the links in the index page.
Now all that remains is to scrape the views count from the YouTube page for each video. This is actually very simple to write as a continuation of the above function. In fact, it is so simple that while we are at it, we can also scrape the likes and dislikes counts:
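Here is the same function extended with that continuation. The .watch-view-count, .likes-count and .dislikes-count class names are assumptions about YouTube's markup at the time; YouTube's pages change frequently, so these selectors may need adjusting:

import re

def get_video_data(video_page_url):
    video_data = {}
    response = requests.get(root_url + video_page_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    video_data['title'] = soup.select('div#videobox h3')[0].get_text()
    video_data['speakers'] = [a.get_text()
                              for a in soup.select('div#sidebar a[href^="/speaker"]')]
    video_data['youtube_url'] = soup.select(
        'div#sidebar a[href^="http://www.youtube.com"]')[0].get_text()
    # Follow the YouTube link and scrape the stats from that page too.
    response = requests.get(video_data['youtube_url'])
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    video_data['views'] = int(re.sub('[^0-9]', '',
                                     soup.select('.watch-view-count')[0].get_text().split()[0]))
    video_data['likes'] = int(re.sub('[^0-9]', '',
                                     soup.select('.likes-count')[0].get_text().split()[0]))
    video_data['dislikes'] = int(re.sub('[^0-9]', '',
                                        soup.select('.dislikes-count')[0].get_text().split()[0]))
    return video_data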
The soup.select() calls above capture the stats for the video using selectors tied to the specific element names used in the YouTube page. But the text of these elements needs to be processed a bit before it can be converted to a number. Consider an example views count, which YouTube would show as "1,344 views". To remove the text after the number, the contents are split at whitespace and only the first part is used. This first part is then filtered with a regular expression that removes any characters that are not digits, since the number can have commas in it. The resulting string is finally converted to an integer and stored.
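As a quick illustration of that pipeline on the example string:

import re

text = '1,344 views'                   # example element text
first = text.split()[0]                # '1,344'
digits = re.sub('[^0-9]', '', first)   # '1344'
views = int(digits)                    # 1344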
To complete the scraping, the following function invokes all the previously shown code:
def show_video_stats():
    video_page_urls = get_video_page_urls()
    for video_page_url in video_page_urls:
        print(get_video_data(video_page_url))
The script up to this point works great, but with over a hundred videos it can take a while to run. We aren't really doing that much work; what takes most of the time is downloading all those pages, and during that time the script is blocked. It would be much more efficient if the script could run several of these download operations simultaneously, right?
from multiprocessing import Pool

def show_video_stats(options):
    pool = Pool(8)
    video_page_urls = get_video_page_urls()
    results = pool.map(get_video_data, video_page_urls)
The multiprocessing.Pool class starts eight worker processes that wait to be given jobs to run. Why eight? It's twice the number of cores on my computer. While experimenting with different sizes for the pool, I found this to be the sweet spot: fewer than eight workers make the script run slower, and more than eight do not make it go faster.
The pool.map() call is similar to the regular map() call in that it invokes the function given as the first argument once for each of the elements in the iterable given as the second argument. The big difference is that it sends these calls to the processes owned by the pool, so in this example eight tasks run concurrently.
The time savings are considerable. On my computer the first version of the script completes in 75 seconds, while the pool version does the same work in 16 seconds!
The Complete Scraping Script
The final version of my scraping script does a few more things after the data has been obtained.
I've added a --sort command line option to specify a sorting criterion, which can be views, likes or dislikes. The script sorts the list of results in descending order by the specified field. Another option, --max, takes the number of results to show, in case you just want to see a few entries from the top. Finally, I have added a --csv option that prints the data in CSV format instead of an aligned table, to make it easy to export the data to a spreadsheet.
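A sketch of how these options could be declared and applied with argparse; the option names match the prose, but everything else here (defaults, help strings, the shape of the results list) is an assumption rather than the actual script:

import argparse

parser = argparse.ArgumentParser(description='Show PyCon 2014 video statistics.')
parser.add_argument('--sort', metavar='FIELD', choices=['views', 'likes', 'dislikes'],
                    default='views', help='field to sort by, in descending order')
parser.add_argument('--max', metavar='MAX', type=int, help='show at most MAX entries')
parser.add_argument('--csv', action='store_true', default=False,
                    help='output in CSV format instead of an aligned table')
options = parser.parse_args()

# Apply the options to a results list such as the one built by pool.map() above
# (a single hypothetical entry is used here so the sketch runs on its own).
results = [{'title': 'Example session', 'views': 1344, 'likes': 10, 'dislikes': 1}]
results.sort(key=lambda video: video[options.sort], reverse=True)
if options.max:
    results = results[:options.max]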
The complete script is available for download at this location: https://gist.github.com/renjithsraj/9fc25b13ec875d128973
Below is an example output with the 25 most viewed sessions at the time I'm writing this: