Freesteel Blog » How does work?

Tuesday, September 11th, 2007 at 10:38 pm Written by:

Updates will be made to this article

I had already done a lot of work on the UK Parliamentary parser in the service of my webpage (with Francis) and mySociety’s far more popular one, so this project seemed both obvious and familiar once I saw that the parliamentary documents could be downloaded from the United Nations website.

Unlike the UK Parliament, the source documents for the UN are in PDF form rather than HTML. This makes the information far more difficult to extract.

Almost all pre-1994 United Nations documents are bitmap scans embedded in a PDF file, while later ones contain character information which can be extracted using a piece of free software called PDFTOHTML.

The result is a file containing the pixel position of each fragment of text (sometimes a whole line, sometimes a single word) as it appears on the page, for example:

<text top="902" left="122" width="327" height="14" font="6"><b>The President </b><i>(interpretation from French)</i>: None</text>
<text top="920" left="90" width="358" height="14" font="1">the less, I should like to endorse strongly the practical</text>
<text top="938" left="90" width="358" height="14" font="1">suggestion made at previous sessions that each delegation</text>
<text top="956" left="90" width="358" height="14" font="1">designate someone to be present at the scheduled time. I</text>
<text top="974" left="90" width="358" height="14" font="1">sincerely hope that all delegations will cooperate in this</text>
<text top="992" left="90" width="42" height="14" font="1">regard.</text>
<text top="109" left="512" width="12" height="14" font="1">In</text>
<text top="109" left="543" width="60" height="14" font="1">paragraph</text>
<text top="109" left="621" width="19" height="14" font="1">11,</text>
<text top="109" left="659" width="18" height="14" font="1">the</text>
<text top="109" left="695" width="47" height="14" font="1">General</text>
<text top="109" left="761" width="67" height="14" font="1">Committee</text>
<text top="127" left="481" width="347" height="14" font="1">recommends to the General Assembly that delegations</text>
<text top="145" left="481" width="347" height="14" font="1">should be reminded of the utmost importance of</text>
<text top="163" left="481" width="71" height="14" font="1">punctuality.</text>
<text top="163" left="566" width="262" height="14" font="1">May I take it that the General Assembly</text>
<text top="181" left="481" width="194" height="14" font="1">approves that recommendation?</text>
<text top="218" left="512" width="111" height="14" font="5"><i>It was so decided.</i></text>
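
To give a feel for how such a file can be read back in, here is a minimal Python sketch. It is not the parser used for the site; the function names `parse_fragments` and `join_lines` and the `<page>` wrapper element are assumptions made for this illustration. It loads the `<text>` elements with the standard library and merges consecutive fragments that share the same `top` value, which is how the word-by-word line in the sample above can be stitched back together.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# A few fragments copied from the sample output above.
SAMPLE = """\
<text top="992" left="90" width="42" height="14" font="1">regard.</text>
<text top="109" left="512" width="12" height="14" font="1">In</text>
<text top="109" left="543" width="60" height="14" font="1">paragraph</text>
<text top="109" left="621" width="19" height="14" font="1">11,</text>
<text top="218" left="512" width="111" height="14" font="5"><i>It was so decided.</i></text>
"""


def parse_fragments(xml_text):
    """Return one dict per <text> element, in file order."""
    # Wrap the fragments in a dummy root (an assumption of this sketch)
    # so that the list parses as a single XML document.
    root = ET.fromstring("<page>" + xml_text + "</page>")
    fragments = []
    for el in root.iter("text"):
        fragments.append({
            "top": int(el.get("top")),
            "left": int(el.get("left")),
            "font": el.get("font"),
            # itertext() flattens nested <b>/<i> markup into plain text.
            "text": "".join(el.itertext()).strip(),
            "italic": el.find("i") is not None,
        })
    return fragments


def join_lines(fragments):
    """Merge consecutive fragments with the same 'top' into one line."""
    lines = []
    for top, group in groupby(fragments, key=lambda f: f["top"]):
        group = sorted(group, key=lambda f: f["left"])
        lines.append({
            "top": top,
            "left": group[0]["left"],
            "text": " ".join(f["text"] for f in group),
        })
    return lines


for line in join_lines(parse_fragments(SAMPLE)):
    print(line["top"], line["left"], line["text"])
```

Grouping only consecutive fragments, rather than sorting the whole page by `top`, keeps the two columns apart: the second column restarts at a small `top` value, so it never gets interleaved with the first.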

Once this has been done, it takes a good many days of hacking, using the very effective text processing capabilities of Python and fixing many of the invisible typos, to develop what is known as a scraper and parser. This can process the file into a structured form more like:

<div class="spoken" id="pg002-bk08">
<h3 class="speaker"> <span class="name">The President</span> <span class="language">French</span> </h3>
	<p id="pg002-bk08-pa01">None the less, I should like to endorse strongly the practical suggestion....</p>
	<p id="pg002-bk08-pa02">In paragraph 11, the General Committee recommends to the General Assembly that...</p>
</div>

<div class="italicline" id="pg002-bk09">
	<p id="pg002-bk09-pa01">It was so decided.</p>
</div>
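
The step from positioned lines to this structured form might be sketched in Python as follows. This is an illustration under stated assumptions, not the real parser: the speaker pattern `SPEAKER_RE`, the column margins 90 and 481 (read off the sample above), the tolerance, and all the function names are hypothetical. The idea is that a bold name followed by a colon marks a new speaker, and a line indented beyond its column margin starts a new paragraph.

```python
import re
from html import escape

# Hypothetical pattern for a speaker line: a bold name, an optional italic
# "(interpretation from X)" note, then a colon and the start of the speech.
SPEAKER_RE = re.compile(
    r"<b>(?P<name>[^<]+?)\s*</b>"
    r"(?:<i>\(interpretation from (?P<language>[^)]+)\)</i>)?"
    r":\s*(?P<rest>.*)"
)


def split_speaker(markup):
    """Return (name, language, rest) for a speaker line, else None."""
    m = SPEAKER_RE.match(markup)
    if m is None:
        return None
    return m.group("name"), m.group("language"), m.group("rest")


def group_paragraphs(lines, margins=(90, 481), tolerance=5):
    """Join (left, text) lines into paragraphs.

    A line whose left edge is not near any column margin is treated as
    indented and starts a new paragraph. The margins are assumptions
    taken from the sample page, not general constants.
    """
    paragraphs = []
    for left, text in lines:
        indented = all(abs(left - m) > tolerance for m in margins)
        if indented or not paragraphs:
            paragraphs.append(text)
        else:
            paragraphs[-1] += " " + text
    return paragraphs


def render_spoken(page, block, name, language, paragraphs):
    """Emit a 'spoken' block using the id scheme shown above."""
    block_id = "pg%03d-bk%02d" % (page, block)
    heading = '<span class="name">%s</span>' % escape(name)
    if language:
        heading += ' <span class="language">%s</span>' % escape(language)
    out = ['<div class="spoken" id="%s">' % block_id,
           '<h3 class="speaker"> %s </h3>' % heading]
    for n, para in enumerate(paragraphs, 1):
        out.append('\t<p id="%s-pa%02d">%s</p>' % (block_id, n, escape(para)))
    out.append("</div>")
    return "\n".join(out)


markup = "<b>The President </b><i>(interpretation from French)</i>: None"
name, language, rest = split_speaker(markup)
lines = [
    (122, rest),
    (90, "the less, I should like to endorse strongly the practical"),
    (512, "In paragraph 11, the General Committee"),
    (481, "recommends to the General Assembly that delegations"),
]
print(render_spoken(2, 8, name, language, group_paragraphs(lines)))
```

Most of the real effort goes into the cases a sketch like this ignores: speeches that run across pages, headings, vote tables, and the typos mentioned above.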

This form is then amenable to website-generating tools and can be used to provide many of the capabilities that we should expect from online information, such as searchability and automatic hyperlinks.
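
As a rough illustration of why the structured form helps with searching, the Python sketch below builds a tiny word index keyed by paragraph id. It is an assumption-laden toy, not the software behind the site: `build_index`, `search` and the regular expression are invented for this example. Because every paragraph carries an `id`, each hit can be turned into a link to the exact paragraph by appending `#` and the id to the page address.

```python
import re
from collections import defaultdict

# Matches the <p id="..."> paragraphs of the structured form shown above.
PARA_RE = re.compile(r'<p id="(?P<id>[^"]+)">(?P<text>.*?)</p>', re.S)

STRUCTURED = """\
<p id="pg002-bk08-pa01">None the less, I should like to endorse strongly the practical suggestion....</p>
<p id="pg002-bk08-pa02">In paragraph 11, the General Committee recommends to the General Assembly that...</p>
<p id="pg002-bk09-pa01">It was so decided.</p>
"""


def build_index(html):
    """Map each lower-cased word to the set of paragraph ids containing it."""
    index = defaultdict(set)
    for m in PARA_RE.finditer(html):
        for word in re.findall(r"[a-z0-9]+", m.group("text").lower()):
            index[word].add(m.group("id"))
    return index


def search(index, query):
    """Return the sorted ids of paragraphs containing every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


index = build_index(STRUCTURED)
for para_id in search(index, "General Committee"):
    print("#" + para_id)
```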

Unfortunately, the software behind the webpage is quite primitive. You can see the files which run it here

Ideally, a bureaucratic organization such as the United Nations would migrate away from paper and conduct its business using structured data without reference to the printed documents. Unfortunately, this is unlikely to happen any time soon.


  • 1. Alexandre Rafalovitch replies at 12th September 2007, 1:06 pm :

    Many of the UN documents for the last 10 years are available as Microsoft Word. You cannot link to them directly, but you can search for them at

  • 2. Freesteel » Blog Ar… replies at 31st December 2007, 5:36 pm :

    […] n English, and parses the speeches into plain text from the PDF files, as described in the How section of the FAQ. It would be a beautiful idea if the whole proces […]
