Coursera Scraper

A programmer will always aim to eliminate human repetition. Usually, this means a small investment of time leads to a much larger windfall later, when the magic happens with the click of a button (or, more likely, a command sent to the shell). Sometimes, however, the opposite happens.

My goal was simple: write some code to automatically download all the videos and lecture notes for an online course. Before you say it, I will: yes, this already exists, and yes, I could have just used a browser extension or scraper code from someone who's better at this than me. But then I wouldn't be learning anything! I actually had this working perfectly a few months ago: I wrote the scraper in Python using Mechanize and Beautiful Soup.

The basic structure was simple. First, I needed to emulate a browser, which Mechanize did: I could log in with my credentials and open the page listing all the video lectures. I could then load the page source and parse it with Beautiful Soup, picking out the links and downloading the files. But then Coursera changed their website, which meant I needed to change my code.
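That original version looked roughly like the sketch below. The URLs, form field names, and link filtering are stand-ins from memory, not the exact code:

```python
import mechanize
from bs4 import BeautifulSoup

# Emulate a browser; ignore robots.txt so mechanize will fetch the pages.
br = mechanize.Browser()
br.set_handle_robots(False)

# Log in (the URL and form field names here are hypothetical).
br.open("https://www.coursera.org/account/signin")
br.select_form(nr=0)
br["email"] = "me@example.com"
br["password"] = "correct horse battery staple"
br.submit()

# Load the lecture listing and pick out the downloadable links.
html = br.open("https://class.coursera.org/SOMECOURSE/lecture/index").read()
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):
    if a["href"].endswith((".mp4", ".pdf")):
        filename = a["href"].rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(br.open(a["href"]).read())
```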

I visited their updated site and, thanks to help from a coworker, discovered they were now using Backbone.js and everything was built in JS. This is a problem because Mechanize cannot run JS: it only sees the source of the initial page, not the page the JS builds afterwards. Furthermore, there was no easy workaround, as the login form sent POST data without a CSRF token, meaning they were likely checking session cookies. I decided to go with Selenium as the browser emulator, since it can handle JS.
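The switch looks something like this. Because Selenium drives a real Firefox instance, the Backbone-rendered form and session cookies just work (the element selectors below are assumptions; the real ones come from inspecting the live form):

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.coursera.org/account/signin")

# These selectors are hypothetical; inspect the live page for the real ones.
driver.find_element_by_id("signin-email").send_keys("me@example.com")
driver.find_element_by_id("signin-password").send_keys("correct horse battery staple")
driver.find_element_by_css_selector("button[type=submit]").click()
```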

To see the above, use the developer tools in Chrome, or Firebug in Firefox, and check the Network and Sources tabs. In this screenshot of the Sources tab, it's apparent the code is written with Backbone.js:

And here, in the Network tab, we can examine the POST request made when a "Sign In" is attempted:

Selenium is an awesome tool for browser automation, and watching it in action makes it feel like there is a ghost controlling your computer. However, even Selenium had its shortcomings. When it came to downloading from the page, Selenium couldn't interact with the browser's download popup window (the one that asks whether to save to disk or open the file). However, Firefox has many, many configuration parameters that can be tweaked to auto-download; to see them, type "about:config" into the address bar. After spending too long "reading the docs", I got the PDFs working, but the MP4s were still a problem. I added the MP4 MIME type (essentially an internet file-type identifier) to the browser.helperApps.neverAsk.saveToDisk preference (which should work!), but it wasn't enough. I think it had something to do with the way the links were structured on the Coursera page. After many frustrating changes of parameters, I decided to try a different approach.
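For reference, this is roughly what that blank-profile, auto-download setup looks like through Selenium's (older) FirefoxProfile API; the download directory is illustrative, and video/mp4 was the entry that refused to cooperate:

```python
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)         # 2 = use a custom directory
profile.set_preference("browser.download.dir", "/tmp/coursera")  # illustrative path
profile.set_preference("pdfjs.disabled", True)                   # bypass the built-in PDF viewer
# Skip the save/open dialog entirely for these MIME types.
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/pdf,video/mp4")

driver = webdriver.Firefox(firefox_profile=profile)
```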

Instead of creating a blank profile, I could tweak my current profile and have Selenium load that. In my regular profile, I made sure that PDF and MP4 files download automatically. Then I logged in to Coursera, which saved the session cookies in that profile. Now Selenium just had to load the profile and it was ready to go: it could head directly to the course website and download. I let it loose and, great success!
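Loading an existing profile is a one-liner in the same API; the path below is just an example, yours lives under ~/.mozilla/firefox:

```python
from selenium import webdriver

# The path is an example; find yours under ~/.mozilla/firefox (or via about:profiles).
profile = webdriver.FirefoxProfile("/home/me/.mozilla/firefox/abcd1234.default")
driver = webdriver.Firefox(firefox_profile=profile)

# The saved session cookies mean we're already logged in.
driver.get("https://class.coursera.org/SOMECOURSE/lecture/index")
```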

So how long did I spend banging my head on the keyboard, tweaking Firefox parameters to download files the way I wanted? Not important. I also put some code on GitHub for the first time, so feel free to take a look at my amateur attempt. I still need to update the README to spell out the directions, but that's for another day.
