Jacob Kaplan-Moss

How the news breaks

I wrote this post in 2006, more than 17 years ago. It may be very out of date, partially or totally incorrect. I may even no longer agree with this, or might approach things differently if I wrote this post today. I rarely edit posts after writing them, but if I have there'll be a note at the bottom about what I changed and why. If something in this post is actively harmful or dangerous please get in touch and I'll fix it.

I swear, sometimes this programming thing is really just the digital equivalent of baling twine and duct tape.

If you happened to be watching 6News in Lawrence last night, you’d have seen the election results crawling across the bottom of the screen:


Pretty much par for the course in terms of local TV coverage… but do you have any idea how that information gets there?

Let me break it down:

  1. Votes are collected by the precincts, who report the totals to the Secretary of State’s office. The Secretary of State publishes those results on a web page, but their IT department is paranoid, so only a single IP – that of the Journal-World’s corporate firewall – is given access to that page of results.

    (It almost goes without saying that the HTML of this result page is grossly invalid.)

    Our web servers, however, sit outside of the corporate firewall on a separate network, and so are unable to see that page.

  2. So, a small script on a Linux box under Matt’s desk downloads the page of results every time it changes, and then turns around and uploads it to a production server.

  3. Another small script (on our server this time) scrapes this shoddy HTML (using BeautifulSoup, of course) and inserts the data into our database. At this point the data shows up online, but the journey to the airwaves is far from complete.

  4. Next, a third script fetches the data back out of the database and writes an Excel spreadsheet (using pyExcelerator).

  5. This spreadsheet is moved to a publicly-accessible URL.

  6. Over at 6News, a Windows box sits and runs a batch file which, using a Windows binary of wget, downloads the Excel file.

  7. Finally, the on-air graphics system reads this Excel file, and the data appears in the crawl.
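
The post doesn’t show the script from step 2, but the “every time it changes” part boils down to change detection. Here’s a minimal Python sketch of that piece (the real script was shell, and the actual fetch/upload plumbing is omitted — only the “has the page changed since we last pushed it?” check is shown):

```python
import hashlib

def content_digest(page_bytes):
    """Fingerprint a page of results so we can tell when it has changed."""
    return hashlib.md5(page_bytes).hexdigest()

def should_upload(page_bytes, last_digest):
    """Return (changed, digest): push to the production server only when
    the downloaded page differs from the last copy we uploaded."""
    digest = content_digest(page_bytes)
    return digest != last_digest, digest
```

A cron job or loop would download the page, call `should_upload()`, and scp the file onward only when it reports a change.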
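
Step 3’s scraper isn’t shown either — the post names BeautifulSoup — but a stdlib-only sketch of the same idea, pulling rows out of sloppy markup with unclosed tags, might look like this. The result page’s actual layout isn’t shown in the post, so the bare `<tr><td>` table shape here is invented:

```python
from html.parser import HTMLParser

class ResultsScraper(HTMLParser):
    """Collect (candidate, votes) rows from grossly invalid result-page HTML.
    Tolerates unclosed <td> tags, which a strict parser would choke on."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []       # finished (candidate, votes) tuples
        self._row = None     # cells of the row being built
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a cell of an open row.
        if self._in_td and self._row is not None and data.strip():
            self._row.append(data.strip())

scraper = ResultsScraper()
# Note the missing </td> tags -- the rows still come through.
scraper.feed("<tr><td>Smith<td>1,234</tr><tr><td>Jones<td>987</tr>")
```

After `feed()`, `scraper.rows` holds the parsed `(candidate, votes)` tuples, ready to insert into the database.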
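
Steps 4–5 amount to a round trip: query the database, serialize a spreadsheet, drop it somewhere public. pyExcelerator is long unmaintained, so this sketch writes CSV in its place (same shape of step, different file format), and the table and column names are invented:

```python
import csv
import io
import sqlite3

def export_results(conn):
    """Fetch the latest totals and serialize them as a spreadsheet.
    The post used pyExcelerator to write real .xls files; as a stand-in
    this returns CSV text. 'results' and its columns are hypothetical."""
    rows = conn.execute(
        "SELECT race, candidate, votes FROM results ORDER BY race, votes DESC"
    ).fetchall()
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["race", "candidate", "votes"])
    writer.writerows(rows)
    return buf.getvalue()
```

The returned text would then be written to a file at the publicly-accessible URL for the Windows box at 6News to fetch.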

If you’ve been keeping track, this process involves eight different machines:

  1. the Secretary of State’s vote machine,
  2. the Secretary of State’s web server,
  3. the Linux box under Matt’s desk,
  4. two of our web servers,
  5. our database server,
  6. the Windows box over at 6News, and
  7. the on-air graphics machine,

and four glue scripts, in three different languages:

  1. the script that copies the results from the Secretary of State to our public server (shell),
  2. the data scraper (python),
  3. the Excel sheet writer (python),
  4. the Windows downloader (batch)

Here’s the kicker, though:

Despite – or because of – all of this, all night we had fresher data – often by 30 minutes or more – than any of our competition.

Baling twine and duct tape, man…