All articles, tagged with “planetdev”

The wild west of javascript.

Just last week, I was working on the new version of Xinha.  If you don't know, Xinha's a web-based document editor.  Embed it in your blog, your web software, so that you and your users can create web documents. Xinha is WYSIWYG, so there's no need to know HTML.  The Open Planning Project, my employer, uses Xinha to power OpenPlans, which is why I get to work on it.  Xinha is Open Source Software, so we use it, and contribute fixes and enhancements back to the original project.

I was working with Nicholas Bergson-Shilcock, my colleague, on his new plugin for Xinha.  With this plugin, you can finally make great footnotes in your documents.  We were testing his code on Internet Explorer, and we noticed IE acting strange.  Now I don't mean normal IE strange.  IE is the bane of all web developers, so I'm used to strange.  (If you use IE, then please don't.  I don't care whether you use Mozilla Firefox, Google Chrome, Opera, Apple Safari, or if you connect to web servers directly with telnet.  Just do all web developers a favor and stop using IE.)

When I say strange, I mean screwy.  Certain places in the document just didn't seem to exist.  His code used Xinha in different ways than the rest of the plugins, so we were expecting edge cases.  But black holes?  Nobody expects black holes!

Editable documents are still the wild west of web development, so I shouldn't be surprised.  JavaScript and the DOM have their Wyatt Earp and Doc Holliday, but document editing is too new to have seen the same kind of law enforcement.  When it comes to selection, manipulation, and document processing, the browser differences aren't well defined, and there are no libraries to abstract the problems away.  Even Peter-Paul Koch (of QuirksMode) told me that "IE's TextRange is a disaster" when I asked for help.

After exploring the problem a bit, we figured out exactly what happens.  In Internet Explorer, you can't select the end of a text node (in JavaScript) if it's followed by a block node.  That means that for the valid HTML snippet:

<div>
  This is my first line
  <p>This is my second line</p>
</div>

You can't touch the end of the first line.  Let me say that again: you can't touch the end of the first line. What does that mean?  All of you DOM jockeys know how to get a reference to the node, and could manipulate the elements, but that's no help for the user.

Your user pushes that cursor beyond the event horizon.  They click on your footnote button to bring up a dialog.  You insert the text they type, and BAM!  The cursor's not where the user left it; you've just crapped markup at some other place in the document.  When you do things like that, users start to fear pressing buttons, and we can't have that.

Why haven't we seen it before?  Xinha was using pop-ups for dialogs, and they don't change the original selection.  Now that we've moved to a lightbox-style dialog system, we're moving the cursor about on the page, and we don't have a way to move it back.

How do we fix it?  Our first step was to test in IE8 beta to see if it was fixed.  No such luck; sometimes I wonder why I'm an optimist. ;-)  My next step was to try out StackOverflow, the new Jeff Atwood / Joel Spolsky software development community.  It's pretty hot right now, so I thought it would be a good place to get help, but again, no go.  The only answer I got was from someone who seemed to remember some comments related to this bug in Javascript.  I tried to find the software he was referring to, but no bugfix there.  FCKeditor doesn't have a fix.  Neither does TinyMCE. Wikipedia offered up this link to a list of 5000 web-based editors.  I tried them all, and all of the software not using pop-ups had the exact same bug.

So, what can we do?  I tried to see if there was a way to trick IE into moving the selection to where we want.  I tried moving the selection left, or right, and then back again.  I tried inserting content, then deleting it.  Unfortunately, there was no direct way to solve the problem.  We ended up with three different workarounds, all of which have drawbacks, but are better than no solution at all:

Change the justification
If you change the justification on the current selection, IE modifies the document so that the selection continues to work.  Set it to no justification, and you even get valid HTML! Unfortunately, it re-parents the following element, moving it one node closer to the root of the document.
Insert an empty span
This works by making sure that you are attempting to select the span element, rather than a text node, and element selection actually works in IE.  It craps spans all over the document, though, and even though we try to clean these up, you never know.
Insert a visual cue
The final method works by inserting a visual cue for the user in the form of a little block (□), then selecting it.  If we're about to modify the document, or the user begins to type, the block will be removed automatically.  In any other case, the user will see the block and naturally want to delete it from the text.

All three are written into the code, but we decided to default to the visual cue, because it's the safest in terms of damaging the markup.  Otherwise, we've done everything we could to avoid triggering the error, so we hope it won't affect too many users; it's always a trade-off.

I wrote this to get some visibility for this problem.  This is probably just some sort of off-by-one error, and IE8 is still in beta, so maybe it can still get fixed.  If not, at least you'll have a way to work around the problem when you run into it.

Finding the location of the current bash script

In my work for TOPP, I'm in the middle of some changes to our build system.  We're using an in-house build tool called fassembler.  Considering that it's completely specific to our needs, and was written mostly from scratch, it's got some pretty great features (e.g. color coded output, database initialization).  Our config files are stored in subversion, checked out, and then compared against when there's an update.  If they differ, you're prompted to either replace, discard, view the diff, or merge the files.  This is great for when you're running a build.

As the Deployment Manager for openplans.org, however, I'm running tens or hundreds of builds.  My goal is to make building and maintaining a deployment easier, and so I need to be able to run the build unattended, and not in a way that blindly discards or overwrites those changes.

Enter Gentoo Linux.  Gentoo is a distribution of Linux where all of the packages are built from source.  On a system-wide level, or for each individual package, build options can be set before installing a piece of software.  A fully installed Gentoo system, whether a server or desktop, can contain hundreds of packages, and users don't have the time to sit interactively through the building and updating of each package.

Gentoo uses a script called etc-update to handle the merging of configuration files separately from the building of software.  It works by saving the new configurations with a mangled name (e.g. httpd.conf would become ._cfg0000_httpd.conf), building the list of these files, and then allowing the user to diff, overwrite, discard, or merge any of the new configurations.  It allows you to configure which tools to use, defaulting to diff, smerge, and nano.  I'm a vi user, but I have that set at a system level, so that's picked up by the script.  smerge is just fine for me, but I prefer colordiff (some screenshots), because of its nicely readable output, and so I have that overridden in a configuration file.
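The renaming scheme can be sketched in a few lines of Python.  This is only an illustration based on the httpd.conf example above; the helper name `mangled_name` is made up here, and the four-digit counter is an assumption read off that one example rather than taken from the etc-update source.

```python
import os


def mangled_name(path, counter=0):
    """Return the name under which a new copy of `path` would be staged.

    Assumption: the pattern is "._cfg" + a four-digit counter + "_" + the
    original file name, as in the httpd.conf example.
    """
    directory, filename = os.path.split(path)
    return os.path.join(directory, "._cfg%04d_%s" % (counter, filename))


print(mangled_name("httpd.conf"))
print(mangled_name("/etc/apache2/httpd.conf", 1))
```

Keeping the staged file next to the original is what lets a later pass collect every pending configuration just by scanning for the prefix.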

etc-update is licensed under version 2 of the GPL, and so we will be redistributing it bundled with the rest of our build software.  Where our situation is different, however, is that we can build in a myriad of locations, and the configuration files are specific to each build.  In Gentoo's version of the script, portage (their packaging system) is queried for the location of configuration files, but we don't have the luxury of a system level tool to perform that work for us. I looked at a couple of possible solutions to the problem:

The command line
etc-update already includes a way to pass directories on the command line, but this requires too much typing by the user.
Building a custom script
Easy to type, but it means installing modified versions of the script all over the place, which is just harder to maintain.
Reading from the environment
It requires the user to set the environment somehow, which means extra steps, and it is very hacky.
Look in a path relative to the current script
Some magic involved, but if we at least use a configuration file relative to the script, it's relatively straightforward, and the only magic involved is in expecting where the list of directories is saved.
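A rough sketch of that last option, written in Python for brevity.  The file name `config_dirs.list` and the helper `read_config_dirs` are placeholders invented for this example, not names used by fassembler or etc-update.

```python
import os


def read_config_dirs(script_path, list_name="config_dirs.list"):
    """Read one directory per line from a file stored next to the script.

    `list_name` is a hypothetical file name chosen for this example.
    Blank lines and lines starting with "#" are skipped.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    list_path = os.path.join(script_dir, list_name)
    dirs = []
    with open(list_path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                dirs.append(line)
    return dirs
```

The only thing the script has to know in advance is the name of the list file; everything else follows from where the script itself lives.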

Of these options, I decided on the last one.  But this all hinges on knowing during execution where the script is located.  Well, I know how the script has been called.  That's available as Arg0 ( $0 ) in the shell, so I figured it would be pretty easy to go from there to the actual location of the script.

Being a Python programmer, my first instinct was to code the logic in Python.  This wasn't too tough.  I took advantage of the fact that you can pipe a script to the Python interpreter, and used bash string interpolation to hardcode the argument into the script.  Since it was a multiline program, I used a bash here document to make it readable. Here's an example script (that just returns Arg0).

#!/bin/bash

RESULT=`python << EOF

print('$0')

EOF`

It took me about five minutes to put together a final script. It first checked to see if the script was called with any path information (e.g. relative: ../script.sh or absolute: /home/script.sh).  If not, it looked for the script file in the directories of the $PATH environment variable.  Failing that, it tried to join the current directory to Arg0 to find the actual location.  (Python's os.path.join function discards the base path if the second path is absolute, so absolute paths pass through unchanged.)
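The path-joining behaviour is easy to check in isolation.  The paths below are invented for the example, and `posixpath` is used (it is what `os.path` resolves to on Linux) so the result does not depend on the platform running the snippet.

```python
import posixpath

working_dir = "/home/me/builds"  # made-up working directory

# A relative Arg0 is joined onto the working directory, then normalized.
relative = posixpath.normpath(posixpath.join(working_dir, "../script.sh"))

# An absolute Arg0 makes join() drop the working directory entirely.
absolute = posixpath.normpath(posixpath.join(working_dir, "/home/script.sh"))

print(relative)
print(absolute)
```

This is why a single join-then-normalize step can serve both the relative and the absolute case.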

This script worked, and was easy to read for Python programmers.  It bothered me a bit, however, because: 1) I was embedding a Python script into a bash script, which could be rather confusing, and 2) it was 32 lines long, not exactly the shortest of solutions.  This is that script:

#!/bin/bash

# Python script to figure out where this file is located.
HERE=`python << EOF

import os 
import sys 

# The path environment variable as a list.
path='$PATH'.split(':') 

# How the script was called    
arg0='$0'

# The current working directory.
working_dir='$PWD' 

# If the script was called in any way that includes path information
# (relative or absolute), we will not look in the system path.
search_in_path=(arg0==os.path.basename(arg0))

if search_in_path:
    for directory in path:
        if os.path.exists(os.path.join(directory, arg0)):
            print(os.path.join(directory, arg0))
            sys.exit(0)

fullpath=os.path.normpath(os.path.join(working_dir, arg0))
if os.path.exists(fullpath):
    print(fullpath)
    sys.exit(0)

sys.exit(1)
EOF`

My next thought was to re-implement the algorithm natively in bash.  Unfortunately, bash doesn't have the Python standard library at its disposal.  Thankfully, however, there are a number of commands that allow us to achieve more or less what I wrote above.  I use "readlink -f /basepath/../somepath" to convert two joined paths into a normalized path.  The only problem with this is that when we execute a symlink to a shell program, it returns the location of the actual file and not the symlink.  I'm not really sure if this is a problem that merits any worrying, but I could imagine having a single "source" script, and symlinking it into different environments.  The second function I needed to replicate was os.path.basename (used to strip the directory from the script's path, which tells us whether Arg0 carried any path information); luckily the basename program handles this identically.
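The symlink caveat has a direct counterpart in Python, where `os.path.realpath` follows symlinks (much like `readlink -f`) while `os.path.normpath` only tidies up the text of the path.  A small sketch using a throwaway temporary directory; the file and directory names are invented, and it assumes a system where symlinks can be created.

```python
import os
import tempfile

workdir = os.path.realpath(tempfile.mkdtemp())

# A "source" script and a symlink to it in another directory.
source = os.path.join(workdir, "source.sh")
open(source, "w").close()
os.mkdir(os.path.join(workdir, "env"))
link = os.path.join(workdir, "env", "script.sh")
os.symlink(source, link)

# normpath only cleans up the text of the path; the symlink survives.
cleaned = os.path.normpath(os.path.join(workdir, "env", "..", "env", "script.sh"))

# realpath resolves the symlink, so we end up at the original file.
resolved = os.path.realpath(link)

print(cleaned == link)
print(resolved == source)
```

If the symlinked location matters (for example, to find a configuration file sitting next to the link), the resolving variant gives the wrong directory.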

I ran into one final problem in translating this algorithm into bash, and that was splitting the $PATH variable.  Normally the for..in control structure in bash splits a string on whitespace.  We could use sed or tr to convert the colon separated path into a space separated path, but that's going to run into problems when you have spaces in your directory names.  Here's where the $IFS variable saves us.  $IFS tells bash which characters to use to split a string into a list of words.  For our purposes, we temporarily save $IFS and set it to a single colon.  This allows you to perform a simple "for DIR in $PATH".  If you've got colons in your directories, well hey, you could have used python... ;-) Here's that script:

 
# The same algorithm implemented almost purely in bash
if [ "$0" == "$(basename "$0")" ]; then
    # The IFS internal variable tells bash how to split a string into
    # variables for a list.  Since the PATH variable is colon separated, we
    # will temporarily change this variable in order to interpret the path.
    SAVED_IFS="${IFS}"
    IFS=":"

    for DIR in $PATH; do
        if [ -f "${DIR}/$0" ] || [ -L "${DIR}/$0" ]; then
            THERE="${DIR}/$0"
            break
        fi
    done

    # We restore the saved IFS variable to return string handling to normal.
    IFS="${SAVED_IFS}"
else
    THERE=$(readlink -f "$0")
fi

The same script is 20 lines in bash, which is an improvement.  At this point I was happy enough with the result that I started to embed it into our local copy of etc-update.  In doing so, however, I ran across a usage of the type built-in command that piqued my interest.  It was being used to test for the existence of egrep on the system.  It turns out that "type -p name" looks for a file-based command and prints its path if it exists.  I figured that this could be used in an even shorter bash-only script, and wrote a test script to do so.  In checking out the various permutations (in a symlink, from the path, etc.) I found out something interesting: when you invoke a script through the path, bash sets Arg0 to the full path.  "Great!" I thought, combine that with readlink from above, and I have a one-liner.

And then it hit me.

which

From the which man page:

Which takes one or more arguments. For each of its arguments it prints to stdout the full path of the executables that would have been executed when this argument had been entered at the shell prompt. It does this by searching for an executable or script in the directories listed in the environment variable PATH using the same algorithm as bash(1).

The Captain Obvious award of the day goes to me.  "which $0" will always return the full path, as bash sees it, of the script file.
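For the Python-minded, the standard library has a comparable lookup in `shutil.which` (Python 3.3 and later).  The sketch below is an analogue, not the which program itself: it builds a throwaway directory with one executable script and searches a hand-made path instead of the real $PATH.  The script name and directories are invented, and a POSIX system is assumed.

```python
import os
import shutil
import stat
import tempfile

# Build a throwaway directory holding one executable script.
bin_dir = tempfile.mkdtemp()
script = os.path.join(bin_dir, "myscript.sh")
with open(script, "w") as handle:
    handle.write("#!/bin/bash\necho hello\n")
os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)

# Search a path list joined with os.pathsep (a colon on Linux),
# in the same spirit as `which` searching $PATH.
search_path = os.pathsep.join(["/nonexistent-dir", bin_dir])
found = shutil.which("myscript.sh", path=search_path)
missing = shutil.which("no-such-command-here", path=search_path)

print(found)
print(missing)
```

A missing command comes back as `None` rather than an error, so the caller still has to decide what to do when the lookup fails.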

Greylisting for comments

Greylisting is an interesting idea that comes from the world of mail servers.  It’s a system used to combat SPAM that’s quite ingenious, and at least on my mail server, is 99% effective.  It’s very effective at blocking SPAM for three reasons:

  1. The internet protocol used for sending mail (SMTP) is quite complex.  Most spammers don’t have the time to write complete mail servers, so they instead take shortcuts to cover the majority of cases.
  2. Spam is about turning computer time into money.  Spammers send out millions of mails per day, so if you increase the cost (in time) of sending mail, then you make spamming less attractive.
  3. While both whitelisting and blacklisting require humans to maintain lists of good and bad servers, greylisting is completely automated.  Since it’s automated, it’s easy to use.

The way greylisting works is by keeping a database of people sending mail to your server.  For each mail it receives, it looks at three things:

  1. The person sending the mail
  2. The person receiving the mail
  3. The computer performing the delivery.

If the server doesn’t already recognize all three of these properties, it responds with an error that tells the sender to try back a little bit later.  Real email servers will try again shortly, usually in less than 15 minutes.  A good number of spammers are stopped right here because their spam tools don’t handle this case.  When the real server tries again, this time the mail will just pass right through and be delivered.
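The decision described above fits in a few lines of Python.  This is a minimal in-memory sketch, not the code of any real mail server or of the Django app mentioned later: the names `check`, `first_seen` and `accepted` are made up, and the minimum retry delay `RETRY_DELAY` is an assumption added so that an instant retry does not slip through.

```python
import time

RETRY_DELAY = 60  # assumed minimum wait in seconds before a retry is accepted

first_seen = {}   # triplet -> time of the first delivery attempt
accepted = set()  # triplets that have already passed greylisting


def check(sender, recipient, client_ip, now=None):
    """Return "accept" or "tempfail" for one delivery attempt."""
    now = time.time() if now is None else now
    triplet = (sender, recipient, client_ip)
    if triplet in accepted:
        return "accept"
    if triplet not in first_seen:
        first_seen[triplet] = now
        return "tempfail"  # unknown triplet: ask the sender to retry later
    if now - first_seen[triplet] >= RETRY_DELAY:
        accepted.add(triplet)
        return "accept"
    return "tempfail"  # retried too quickly
```

A real deployment would keep the two tables in a database and expire old entries, but the accept-or-retry decision stays the same.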

That’s it!  That’s the magic of it all.  For any mail coming from people that your users already know, there’s no wait; they don’t see any difference, and mail just keeps coming in.  The first time someone sends a mail to your users, there will be a short wait, normally less than 15 minutes, and since mail isn’t guaranteed to be immediate, most people don’t notice the difference.

Now on top of greylisting, people often throw in Tarpitting.  A tarpit in computers is something that slows down the server, so the server responds more slowly, as if it were under a heavy load.  When combined with greylisting, this means that each mail coming from a new source costs the sender a whole lot more in computer time.  In the case of someone who will be sending you mails regularly, this one-time cost is quickly amortized, costing the sender nothing in the long run.  Spammers, however, who depend on sending millions of unique mails, see this cost with each email they send, and so your server becomes an unattractive target.
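The amortization argument can be made concrete with a bit of arithmetic in Python.  The 30 second figure and the addresses are invented for the example; the point is only the difference between paying once and paying on every mail.

```python
TARPIT_SECONDS = 30  # assumed extra delay imposed on each unknown source


def total_delay(sources):
    """Sum the tarpit delay over a sequence of mail sources.

    Only the first mail from each source is slowed down; repeat sources
    pay nothing extra.
    """
    seen = set()
    delay = 0
    for source in sources:
        if source not in seen:
            seen.add(source)
            delay += TARPIT_SECONDS
    return delay


regular = ["friend@example.org"] * 100                    # one source, 100 mails
spammer = ["bot%d@example.net" % i for i in range(100)]   # 100 unique sources

print(total_delay(regular))   # cost paid once
print(total_delay(spammer))   # cost paid on every mail
```

With these made-up numbers the regular correspondent waits 30 seconds in total, while the spammer waits 3000 seconds for the same number of mails.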

How does this relate to comments, you may ask?  Well, I’ve written a greylisting/tarpitting Django app for this and patched the code for this blog to use it.  For now, you can download it here: http://douglas.mayle.org/files/greylist.tgz

If you’d like the patch to enable this for your byteflow blog, it’s available at Byteflow Trac Ticket #93