French undemocracy proposal

The United Nations publishes everything in six languages (English, French, Spanish, Arabic, Russian, and Chinese). Currently, my webpage only downloads the documents in English, and parses the speeches into plain text from the PDF files, as described in the How section of the FAQ.

It would be a beautiful idea if the whole process could be done in French. I don’t know much French, but I could work with someone who does, re-use most of the parser, and perhaps get it running in a couple of weeks of sustained work. I have already solved all the technical problems for English, and I don’t expect French to throw up anything worse. (I would not say the same for Chinese.)

After the French parser is working, someone in France, where they might take the United Nations a bit more seriously than here in England, could build a really good front end webpage for it, and make it completely bilingual so I can throw away my version.

An extra advantage is that many of the UN interventions are in French-speaking nations in Africa and the Caribbean. When people who really know what is going on get to see what is said about them, things can change.

It might be possible to get funding for this project on the back of the work done on statistical machine translation. This requires a “bilingual text corpus”: a large body of text in two languages, where each passage is lined up with its translation. It has been reported that this has been done using the Canadian Hansard, which is produced in French and English. But more could be better. It should be possible to get some university or company to very modestly fund the development of a parser for the French UN General Assembly and Security Council speeches, with its output aligned with the English text, to be used as input…
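To make “aligned” concrete, here is a minimal sketch in Python. It is not the actual parser. It assumes each speech has already been parsed to plain text in both languages and carries an identifier shared between the two versions; the `speech_id` keys, the `AlignedPair` class, the `align_by_speech_id` function and the sample sentences are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlignedPair:
    speech_id: str  # hypothetical identifier shared by both language versions
    english: str
    french: str


def align_by_speech_id(english, french):
    """Pair up speeches that exist in both languages.

    Both arguments are dicts mapping a speech identifier to its text.
    Speeches missing from either side are dropped, because an unpaired
    text is of no use as translation training data.
    """
    shared_ids = sorted(english.keys() & french.keys())
    return [AlignedPair(i, english[i], french[i]) for i in shared_ids]


# Invented sample data, not taken from any UN document.
english_speeches = {
    "sp01": "The meeting is called to order.",
    "sp02": "I give the floor to the representative of France.",
    "sp03": "The meeting is adjourned.",
}
french_speeches = {
    "sp01": "La séance est ouverte.",
    "sp02": "Je donne la parole au représentant de la France.",
}

for pair in align_by_speech_id(english_speeches, french_speeches):
    # One tab-separated line per aligned pair.
    print(f"{pair.speech_id}\t{pair.english}\t{pair.french}")
```

Real alignment work usually goes down to sentence level, which is harder because translators split and merge sentences. Matching whole speeches by identifier is only the first step.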

<pauses to surf the web a bit>

Bugger it. It looks like some people are well ahead of us:

We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.

Also, there’s this claim:

Google used the United Nations Documents to train their machine, and all in all fed 200 billion words. This is brute force AI, if you want – it works on statistical learning theory only and has not much real “understanding” of anything but patterns.

I don’t believe it for the speeches, because a website would be an obvious byproduct. There are Word files for some of the documents, but they are incomplete and sometimes not of very high quality.


  • 1. Dion replies at 1st January 2008, 9:39 am :

    You might also want to research some of the other statistical machine translation efforts under way. Philipp Koehn of the University of Edinburgh, who created Europarl (referenced in your blog entry), is also a shareholder in Asia Online, which is building even larger corpora in Asian languages, with a full matrix like Europarl's between 11 Asian languages, and eventually the Europarl set as well. By getting large amounts of data outside the domain of Europarl, many different domains of speech can be addressed.

