18 August, 2014 by Denise << java, play-framework, api >>

Screen Scraping Secure Pages

Recently I needed an API for some data I wanted to use in an app. The data was behind a secure login with no API exposed, so I decided to screen scrape it. Unfortunately, screen scraping pages behind a login is a little trickier. To get around it, I wrote some code that requests the login page, parses the cookies and authenticity token from the response, and then posts them to the form URL along with the user-provided login credentials.

The code looks like this:
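As a rough sketch of the flow described above (not necessarily the exact implementation), the following uses the `java.net.http.HttpClient` that ships with Java 11 and later. It assumes a hidden `authenticity_token` input whose `name` attribute comes before `value`, and form fields named `username` and `password`. Those field names, the class name `SecureScraper`, and its method names are all hypothetical; the URLs are supplied by the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SecureScraper {

    // Assumed markup: <input ... name="authenticity_token" ... value="...">.
    // A real login page may order or name its attributes differently.
    private static final Pattern TOKEN = Pattern.compile(
            "<input[^>]*name=\"authenticity_token\"[^>]*value=\"([^\"]*)\"");

    // Pull the token out of the login page, failing loudly if the markup changed.
    static String extractToken(String html) {
        Matcher m = TOKEN.matcher(html);
        if (!m.find()) {
            throw new IllegalStateException(
                    "authenticity_token not found; has the login page changed?");
        }
        return m.group(1);
    }

    // Reduce Set-Cookie values to one Cookie header, keeping only name=value pairs.
    static String cookieHeader(List<String> setCookieValues) {
        return setCookieValues.stream()
                .map(v -> v.split(";", 2)[0].trim())
                .collect(Collectors.joining("; "));
    }

    // Build the URL-encoded form body (field names are assumptions).
    static String formBody(String token, String username, String password) {
        return "authenticity_token=" + enc(token)
                + "&username=" + enc(username)
                + "&password=" + enc(password);
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Fetch the login page, then post token, cookies and credentials to the form URL.
    static HttpResponse<String> login(String loginPageUrl, String formUrl,
                                      String username, String password)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> page = client.send(
                HttpRequest.newBuilder(URI.create(loginPageUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        String token = extractToken(page.body());
        String cookies = cookieHeader(page.headers().allValues("set-cookie"));

        HttpRequest.Builder post = HttpRequest.newBuilder(URI.create(formUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(
                        formBody(token, username, password)));
        if (!cookies.isEmpty()) {
            post.header("Cookie", cookies);
        }
        return client.send(post.build(), HttpResponse.BodyHandlers.ofString());
    }
}
```

A successful login often answers with a redirect and a fresh session cookie, so the caller would typically read the `set-cookie` headers from the returned response and send them with later requests for the pages to scrape. Throwing when the token is missing at least makes a changed login page fail with a clear message instead of a confusing rejected login.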

Note that you should only do this kind of scraping if you're the owner of the site and either can't implement a proper API or don't have time to. Additionally, this code is really brittle: any structural change to the DOM will probably cause your screen scraping to fail.

One (reasonable) use case for something like this is an old website that, for some reason, you can't update with a real API, even though you need one to access the information it provides.

Another thing to consider: because the username and password are passed straight through to this code, if you decide to expose it as a public API, it should be served over HTTPS only.