
All articles, tagged with “python”

Syncing Safari Downloads — an intro to screen scraping

In honor of the upcoming PyCon (which I’ll be attending on behalf of The Open Planning Project) I decided to write about Python today.

Some time back I wrote myself a simple utility for synchronizing Safari downloads (the book service, not the web browser), and I decided to polish it up, release it, and write about the process.  This is the first of two parts where I will talk about my first time handling the start to finish of publishing an open source python package.  The next part will be a tutorial on how to screen scrape the web, from inspecting the HTTP headers to using CSS selectors with lxml to parse out the interesting data.

Anyway, back to the topic at hand.  If you’ve never heard of Safari, and you’re a tech professional, then I hope it’s because you have personal access to the Library of Alexandria.  If not, then let me be your personal cluestick.  For about the price of five tech books (per year), you can maintain an online bookshelf that gives you access to up to 120 books in that year.  In practice, I think I average about 30, but this also gives you the ability to search through their entire library to find the answers you need.  When you find a book, you can add it to your bookshelf with two clicks (Thanks Amazon!) and then start reading.  What’s more, the service includes 5 downloads per month (usually one chapter or section of a book) that give you a personalized PDF for offline reading.

My only problem with the service is managing the downloads.  Once you’ve downloaded a chapter, it will always be available to you (at least as long as you have an account), but the PDFs are auto-generated on demand, and when you save them, you end up with files named something like 0EITGkillY6ALIkill3kHfWkillC4RwjkillwKb69kill736MGkillY4UuykillEJTsC.pdf.  I tried to give them sensible names, and organize them, but it was always a pain, and I always had the weirdest urges just afterward.  To top it all off, the last time I changed computers, I decided not to copy the files (knowing that I could re-download them), so I was left with a lot of manual work to do.

Well, I’ve been telling myself for some time that I wanted to play with lxml (it’s the fast python library for working with XML and HTML).  Also, I’ve been working entirely in javascript lately, so I felt that it was time to stretch some mental muscles and get something done in python.  For the impatient, you can get a copy of the script by typing the following at a terminal:

export STATIC_DEPS=true # Only necessary on a Mac
easy_install safarisync

If the output you get looks something like this:

[Screenshot: This is Easy Install on Windows]

Fear not, poor Windows user, I intend to release a simple executable to coincide with the second part of this article.  If you don’t feel like waiting, you can download and install python, then download and install setuptools, and finally fix up your PATH environment variable.

For everyone else, you can start playing along.  Just type safarisync to start the process, or safarisync --help to get a list of options.

Since I’ve only worked with lxml peripherally before (as it was embedded into other projects I was working on), I ended up writing three completely different versions.  The first version was fully functional, using the cookie handling that I learned from this well written tutorial.  It also iterated through all of the elements in the tree to find the ones we were interested in.  Just after finishing it up, I stumbled across this quick intro to lxml, written by a colleague of mine (Ian Bicking).  If you haven’t heard him speak somewhere already, then chances are high that either you’ve used something he’s written, or used something based on something he’s written.

His article introduced me to the CSS selector engine and form handling now built into lxml.  Thus was born the second version of safarisync.  The only problem was that it usually didn’t work.  In the debug shell, I could usually get the code to run, after some tinkering, but never standalone.

The first problem I always had was unnecessarily hard to diagnose.  I was consistently receiving a UnicodeDecodeError from lxml.  I was confused by this because the string I passed in had the proper encoding specified within:

<?xml version="1.0" encoding="utf-8"?>

I received the help I needed from my colleague Luke Tucker (of Melkjug fame, which by the way, you should check out, they just released a new version).  As it turns out, there was a problem in the error handling of lxml such that if you had a bug AND you had unicode data, instead of getting the correct bug reported, you got a UnicodeDecodeError.  He suggested I strip any unicode data and try the same operations to get to the real error.  Thankfully, I’ve been told that this is fixed in the latest version.

Solving the last problem took me outside of the debugging shell, and into the bowels of lxml.  It’s partially written in Cython, which is a python-like language that compiles down to C.  This means (in theory) that you get the speed of C with the beauty of Python.  In practice, this is only half true.  You get the speed of C.  Beauty, however, is in the eye of the beholder.  In any case, peering through the code showed me that while the new form handling code uses python for network access, the rest of lxml uses the built-in downloading facilities of libxml, the C library it wraps.  This means that you have to avoid lxml’s network helpers almost entirely if you need to handle cookies.
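
A minimal sketch of that split, not the actual safarisync code: the download goes through a cookie-aware opener from the standard library, and lxml only ever sees the bytes that come back.  The Python 3 module names (http.cookiejar, urllib.request) and the helper names make_opener and fetch_tree are assumptions made for illustration.

```python
import http.cookiejar
import urllib.request


def make_opener():
    """Build an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_tree(opener, url):
    """Download url with the cookie-aware opener, then parse it with lxml."""
    import lxml.html  # third-party; imported here so the rest runs without it

    with opener.open(url) as response:
        body = response.read()
    # lxml parses the markup we hand it and never touches the network itself.
    return lxml.html.fromstring(body)
```

Every request made through the same opener shares the one cookie jar, so a login request followed by a page request behaves like a single browser session.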

The third version of the code can be found at my public source repository.  The interesting code is found in safarisync.py.  I’ve tried to comment it well enough that you can follow through, even without my help.  I’ve had it reviewed by Ian and Robert Marianski, another colleague of mine and talented python programmer.  He helped me with the details necessary to publish the package on PyPI.  (For example, if you want your package to have an executable shortcut, you need to create a specially named entry point in setup.py.)
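
For the curious, a console script entry point looks roughly like the following.  This is a sketch rather than the real safarisync setup.py: the version number is a placeholder, and it assumes the module exposes a function called main.

```python
from setuptools import setup

setup(
    name="safarisync",
    version="0.1",  # placeholder version, not the real release number
    py_modules=["safarisync"],
    entry_points={
        # "command name = module:function"; setuptools generates the
        # safarisync executable and has it call safarisync.main().
        "console_scripts": ["safarisync = safarisync:main"],
    },
)
```

The "console_scripts" group name is the specially named part; the text on the left of the equals sign becomes the command that ends up on the PATH.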

Well, thanks for tuning in.  Come back next week for a detailed tutorial teaching you how to write your own screen scraping tools.

Finding the location of the current bash script

In my work for TOPP, I'm in the middle of some changes to our build system.  We're using an in-house build tool called fassembler.  Considering that it's completely specific to our needs, and was written mostly from scratch, it's got some pretty great features (e.g. color coded output, database initialization).  Our config files are stored in subversion, checked out, and then compared against when there's an update.  If they differ, you're prompted to either replace, discard, view the diff, or merge the files.  This is great for when you're running a build.

As the Deployment Manager for openplans.org, however, I'm running tens or hundreds of builds.  My goal is to make building and maintaining a deployment easier, and so I need to be able to run the build unattended, and not in a way that blindly discards or overwrites those changes.

Enter Gentoo Linux.  Gentoo is a distribution of linux where all of the packages are built from source.  On a system-wide level, or for each individual package, build options can be set before installing a piece of software.  A fully installed Gentoo system, whether a server or desktop, can contain hundreds of packages, and users don't have the time to sit interactively through the building and updating of each package.

Gentoo uses a script called etc-update to handle the merging of configuration files separately from the building of software.  It works by saving the new configurations with a mangled name (e.g. httpd.conf would become ._cfg0000_httpd.conf), building the list of these files, and then allowing the user to diff, overwrite, discard, or merge any of the new configurations.  It allows you to configure which tools to use, defaulting to diff, smerge, and nano.  I'm a vi user, but I have that set at a system level, so that's picked up by the script.  smerge is just fine for me, but I prefer colordiff (some screenshots), because of its nicely readable output, and so I have that overridden in a configuration file.
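
The naming scheme is easy to mimic.  The sketch below is not taken from etc-update (which is a shell script); it only follows the ._cfg0000_ example above, and the helper names, along with the idea of bumping the counter for each additional pending copy, are assumptions.

```python
import glob
import os


def mangled_name(path, counter=0):
    """Return the name a pending copy of path would be saved under."""
    directory, filename = os.path.split(path)
    return os.path.join(directory, "._cfg%04d_%s" % (counter, filename))


def original_name(mangled_path):
    """Map ._cfgNNNN_name back to name."""
    directory, filename = os.path.split(mangled_path)
    # The prefix is "._cfg", four digits, and an underscore: 10 characters.
    return os.path.join(directory, filename[10:])


def pending_updates(directory):
    """List the mangled files waiting to be merged in directory."""
    return sorted(glob.glob(os.path.join(directory, "._cfg????_*")))
```

Because the new copy sits right next to the file it would replace, finding the work to do is just a matter of listing each configuration directory.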

etc-update is licensed under version 2 of the GPL, and so we will be redistributing it bundled with the rest of our build software.  Where our situation is different, however, is that we can build in a myriad of locations, and the configuration files are specific to each build.  In Gentoo's version of the script, portage (their packaging system) is queried for the location of configuration files, but we don't have the luxury of a system level tool to perform that work for us. I looked at a couple of possible solutions to the problem:

The command line
etc-update already includes a way to pass directories on the command line, but this requires too much typing by the user.
Building a custom script
Easy to type, but it means installing modified versions of the script all over the place, which is just harder to maintain.
Reading from the environment
It requires the user to set the environment somehow, which means extra steps, and is very hacky.
Look in a path relative to the current script
Some magic involved, but if we at least use a configuration file relative to the script, it's relatively straightforward, and the only magic involved is in expecting where the list of directories is saved.

Based on these options, I decided on the last one.  But this all hinges on knowing during execution where the script is located.  Well, I know how the script has been called: that's available as Arg0 ( $0 ) in the shell.  I figured it would be pretty easy to go from there to the actual location of the script.

Being a python programmer, my first instinct was to code the logic in python.  This wasn't too tough.  I took advantage of the fact that you can pipe a script to the python interpreter, and used bash string interpolation to hardcode the argument into the script.  Since it was a multiline program, I used a bash here document to make it readable.  Here's an example script (that just returns Arg0).

#!/bin/bash

RESULT=`python << EOF

print('$0')

EOF`

It took me about five minutes to put together a final script.  It first checked to see if the script was called with any path information (e.g. relative: ../script.sh or absolute: /home/script.sh).  If not, it looked for the script file in the directories listed in the $PATH environment variable.  Failing that, it tried to join the current directory to Arg0 to find the actual location.  (Python's os.path.join function discards the base path if the second path is absolute.)
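
The way an absolute Arg0 overrides the base path is easy to check in an interpreter.  A small sketch with made-up POSIX paths (the locate helper is hypothetical, not part of the script):

```python
import os.path


def locate(working_dir, arg0):
    """Join the working directory and Arg0, then tidy up the result."""
    return os.path.normpath(os.path.join(working_dir, arg0))


# A relative Arg0 is resolved against the working directory.
print(locate("/home/user/build", "../script.sh"))

# An absolute Arg0 makes os.path.join drop the working directory entirely.
print(locate("/home/user/build", "/home/script.sh"))
```

It is os.path.join that discards the working directory; os.path.normpath only collapses the ".." and duplicate slashes afterwards.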

This script worked, and was easy to read for python programmers.  It bothered me a bit, however, because: 1) I was embedding a python script into a bash script, which could be rather confusing, and 2) it was 32 lines long, not exactly the shortest of solutions.  This is that script:

#!/bin/bash

# Python script to figure out where this file is located.
HERE=`python << EOF

import os 
import sys 

# The path environment variable as a list.
path='$PATH'.split(':') 

# How the script was called    
arg0='$0'

# The current working directory.
working_dir='$PWD' 

# If the script was called in any way that includes path information
# (relative or absolute), we will not look in the system path.
search_in_path=(arg0==os.path.basename(arg0))

if search_in_path:
    for dir in path:
        if os.path.exists(os.path.join(dir, arg0)):
            print(os.path.join(dir, arg0))
            sys.exit(0)

fullpath=os.path.normpath(os.path.join(working_dir, arg0))
if os.path.exists(fullpath):
    print(fullpath)
    sys.exit(0)

sys.exit(1)
EOF`

My next thought was to re-implement the script algorithm natively in bash.  Unfortunately, bash doesn't have the python standard library at its disposal.  Thankfully, however, there are a number of commands that allow us to achieve more or less what I wrote above.  I use "readlink -f /basepath/../somepath" to convert two joined paths into a normalized path.  The only problem with this is that when we execute a symlink to a shell program, it returns the location of the actual file and not the symlink.  I'm not really sure if this is a problem that merits any worrying, but I could imagine having a single "source" script, and symlinking it into different environments.  The second command I needed to replicate was os.path.basename (used to strip the directory from the script's full path); luckily the basename program handles this identically.
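
The symlink caveat has a direct counterpart on the python side, sketched here with temporary files (all names are invented for the demonstration): os.path.realpath follows symlinks much like readlink -f does, while os.path.abspath only normalizes the path and leaves the symlink alone.

```python
import os
import tempfile


def demo_symlink_resolution():
    """Return (abspath, realpath) for a symlink pointing at a script."""
    base = os.path.realpath(tempfile.mkdtemp())
    source = os.path.join(base, "source.sh")
    link = os.path.join(base, "env", "script.sh")

    with open(source, "w") as handle:
        handle.write("#!/bin/bash\n")
    os.mkdir(os.path.join(base, "env"))
    os.symlink(source, link)

    # abspath keeps the symlink's own location; realpath follows it.
    return os.path.abspath(link), os.path.realpath(link)


link_location, file_location = demo_symlink_resolution()
print(link_location)  # ends in env/script.sh
print(file_location)  # ends in source.sh
```

If the configuration should live next to each symlinked copy rather than next to the single source script, the first of the two results is the one to build on.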

I ran into one final problem in translating this algorithm into bash, and that was splitting the $PATH variable.  Normally the for..in control structure in bash splits a string on whitespace.  We could use sed or tr to convert the colon separated path into a space separated path, but that's going to run into problems when you have spaces in your directory names.  Here's where the $IFS variable saves us.  The $IFS variable tells bash what characters to use to split up a string into a list of words.  For our purposes, we temporarily save $IFS and set it to a single colon.  This allows you to perform a simple "for DIR in $PATH".  If you've got colons in your directories, well hey, you could have used python... ;-) Here's that script:

 
# The same algorithm implemented almost purely in bash
if [ "$0" == "$(basename "$0")" ]; then
    # The IFS internal variable tells bash how to split a string into
    # words for a list.  Since the PATH variable is colon separated, we
    # will temporarily change this variable in order to interpret the path.
    SAVED_IFS="${IFS}"
    IFS=":"

    for DIR in $PATH; do
        if [ -f "${DIR}/$0" ] || [ -L "${DIR}/$0" ]; then
            THERE="${DIR}/$0"
            break
        fi
    done

    # We restore the saved IFS variable to return string handling to normal.
    IFS="${SAVED_IFS}"
else
    THERE=$(readlink -f "$0")
fi

The same script is 20 lines in bash, which is an improvement.  At this point I was happy enough with the result that I started to embed it into our local copy of etc-update.  In doing so, however, I ran across a usage of the type built-in command that piqued my interest.  It was being used to test for the existence of egrep on the system.  It turns out that "type -p path" looks for a file-based command and prints it if it exists.  I figured that this could be used in an even shorter bash only script, and wrote a test script to do so.  In checking out the various permutations (in a symlink, from the path, etc.) I found out something interesting: when you invoke a script through the path directly, bash sets Arg0 to the full path.  "Great!" I thought, combine that with readlink from above, and I have a one-liner.

And then it hit me.

which

From the which man page:

Which takes one or more arguments. For each of its arguments it prints to stdout the full path of the executables that would have been executed when this argument had been entered at the shell prompt. It does this by searching for an executable or script in the directories listed in the environment variable PATH using the same algorithm as bash(1).

The Captain Obvious award of the day goes to me.  "which $0" will always return the full path, as bash sees it, of the script file.