Freesteel Blog » Text mining for the brain-dead generation
Text mining for the brain-dead generation
Tuesday, November 11th, 2008 at 1:37 pm
Background rant (definitely skip this)
The great British newsmedia, on which so much of the nation’s self-evaluation depends, is notable for its unimagination and lack of systematic coverage. It is outrageous that there has never been any notable coverage of leading projects such as Open Street Map, TheyWorkForYou, or FarmSubsidy, let alone in enough depth for people to hear about the underlying politics which they represent.
The lowest point in this never-ending news black-out came when Jeremy Paxman found time to interview a very bad political blogger on his own ridiculous terms. But to this day Newsnight has never spared a minute for a representative of mySociety to come on and explain why we exist, in spite of 5 years of sustained technical work to create the most user friendly and serious political website in the land without any cooperation from the inside.
One of the (many) reasons it is in the public interest for these projects to get some publicity is so that professors who need to know about them find out about them and value the data.
The Parliamentary transcripts (Hansard), which have been processed into usable form for the TheyWorkForYou project back to 2001, are interesting because they contain a near-verbatim record of the spoken words for a formal group of people covering long periods of their lives and the life of the nation.
Several years ago I wondered if a certain local university professor would be interested in using this data for running it through their fancy-schmancy grammatical analyzers and doing some research. Perhaps he’d be able to help fund us to parse and prepare the rest of the on-line data that goes back to 1988 and covers all kinds of legislation that was put in place then and affects us profoundly to this day. (Train privatization anyone?) It wouldn’t take much — just a sign that somebody out there actually cared.
So I visited Prof. Paul Watry (of the National Centre for Text Mining) across the road, spoke to him for an hour, met the programmers in the basement, gave them the links to download the Hansard data, and nothing came of it. I tried to download and run their software, but I wasn’t clever enough and didn’t get any help. Being an unpaid volunteer who has to earn money elsewhere, my time in this field is considerably more precious than that of those who are employed full time with all the resources, even though our culture rates things the other way round: I am supposed to do all the running, and it’s entirely my fault if there’s no follow-up.
Had this project been shown on TV it would have probably been more interesting to the Professor. Money and celebrity are all that attracts attention these days, not technical results, even to people who should know better.
OCR-ing for Alzheimer’s (skip to here)
I was woken at 8:50am this morning by an interview on the Today Program with Dr Peter Garrard who was reporting some text mining from Hansard in an attempt to diagnose the early onset of Alzheimer’s disease in Harold Wilson and explain his sudden resignation in 1976.
It’s very early research, little more than a conjecture with no control studies — just the sort of half-baked science that the BBC likes to make prime-time.
The companion piece on the BBC website is here:
The study, published online in the Journal of Neurolinguistics, converted Hansard transcripts to digital format using optical character recognition software.
This allowed Dr Garrard to use markers to compare Wilson’s speech patterns, and the number of times he used certain words, with those of his parliamentary colleagues.
The analysis was based on techniques developed by literary scholars for quantifying the stylistic similarities and differences between authors, genres, and literary eras.
The findings showed that the content of Wilson’s speeches was identifiably different from those of other members of the House throughout his career as prime minister.
However, the difference was smaller during the months leading up to his resignation – a sign that he was losing his distinctive voice.
Dr Garrard said that could be a sign of the earliest stages of Alzheimer’s.
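For anyone wondering what "comparing speech patterns" might look like in practice, here is a rough sketch of my own. It is not Dr Garrard's method (the paper is behind the pay-wall, so I can't check), and the names `word_freqs` and `profile_distance`, along with the choice of the 50 commonest words as markers, are my assumptions. It scores how far one speaker's word usage sits from everyone else's; run it per month and a shrinking score would be the "losing his distinctive voice" effect described above.

```python
import re
from collections import Counter

def word_freqs(text):
    """Relative frequency of each lower-cased word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).items()}

def profile_distance(speaker_text, reference_text, top_n=50):
    """Mean absolute difference in relative frequency between a speaker
    and a reference corpus, over the reference's top_n commonest words.
    (Hypothetical measure, chosen for simplicity.)"""
    ref = word_freqs(reference_text)
    spk = word_freqs(speaker_text)
    # Marker words: commonest in the reference, ties broken alphabetically.
    markers = sorted(ref, key=lambda w: (-ref[w], w))[:top_n]
    if not markers:
        return 0.0
    return sum(abs(spk.get(w, 0.0) - ref[w]) for w in markers) / len(markers)
```

Feed it the speaker's text for a given month as `speaker_text` and everybody else's for the same month as `reference_text`, then plot the score over time. Whether a dip means anything without control studies is another matter.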
The paper is not available on the public internet. It’s been deposited here behind the eternally evil Elsevier’s pay-wall, so only people who work for institutions that pay the steep license fee for their dreadful web-interface have the right to access it.
This has been a long-running issue. In this country we privatize the dissemination of science just like we privatize the law. Public money pays for the universities, the research grants, and the salaries of the professors who then write up their work in scientific papers and sign over their entire copyright for free to a no-value-added publisher like Elsevier following the necessary peer-reviewing by other professors (who don’t get paid), so it can be kept archived and inaccessible, and the universities have to buy it back at great expense.
Professors, just like those powerful lawyers in the legal system, don’t notice the problem, because the secretaries and lowly librarians deal with the issue of buying the materials back, and they don’t notice that no one else has access to this paid-for information (and anyway they don’t care because only Professors need to see these documents, and they all can).
Senior professors tend to have been brought up in an era when publishers actually did something expensive and skillful, like typesetting. They don’t notice that they and their new PCs now do this entire job themselves, and all the publisher does is run a web-server with an on-demand printing system while cashing in obscene profits and recycling some of the money back into political influence where it fights against reform of the system, for example by filling the Science Minister’s brain with the most bogus argument against Open Access publishing I have ever seen:
Lord Sainsbury of Turville: What the government does not think is right to do is to promote one model, open-access publishing, in the marketplace. It is not clear that on a like-for-like basis open-access publishing will have a lower cost base, and as it will transfer some of the payments from industry users to the authors, it is likely to lead to higher costs for universities and research institutes. Also, because Britain produces 5.3 per cent of articles in the world’s science journals while accounting for only 3.5 per cent of subscriptions, we would also lose out as a country. (2004-11-10)
Dr Garrard needs to pop over to his colleague Stevan Harnad who has been banging away on this since about 1991, and has established the precedent that academics should upload their pre-press (no so-called publisher’s “added value” yet included) articles onto their public websites without any legal consequences.
Garrard’s research on evidence for Alzheimer’s disease in written text was first reported all over the place back in 2004 when he published a paper analysing the language of Iris Murdoch’s later novels. The old dear had just famously died, movies were being made about her illness, and this kind of thing piqued the literary interest of the old British media.
So it’s no surprise they picked up his similar research done four years later on Harold Wilson for the same deal.
Senile authors, senile parliamentarians: lots of text to use. Maybe when we invent life-blogging with microphones pinned to our throats containing reliable speech recognition systems that transcribe our entire spoken corpus from birth to death on a 100Gig memory card the size of a pin-head, it’ll be applicable to the rest of us.
Meanwhile, there are a whole lot more interesting questions to be asking of this amazing corpus of text data than looking for boring fragments of senility that don’t give us any clues about where we are going as a people or a nation. Everyone knew that Ronald Reagan was senile. Everyone knew that Tony Blair was a raving liar. The way that the political system is able to ignore and/or use these facts to its advantage group-dynamic-wise needs some serious looking into by thoughtful scholars. But who’s going to go there when our imagination is capped by narrow uninteresting questions that assume honourability and people trying to do their best? It’s as irrelevant as picking on George Bush’s speech impediments.
From Dr Garrard’s Journal of Neurolinguistics paper:
Transcripts of Prime Ministers Questions that were held while Harold Wilson was Prime Minister (ie firstly between October 1964 and June 1970 and secondly between March 1974 and April 1976) were obtained and converted to ASCII format using optical character recognition software. Markers were added to identify the date at each change of year and month, while the identity and party affiliation of every speaker was recorded at the beginning of any speech or contribution to debate.
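The paper extract doesn't say what those markers look like, so the following is only a guess at the shape of the thing. I've invented a format where `@DATE YYYY-MM` opens each month and `@SPEAKER name|party` opens each contribution; both markers and the function name `group_speeches` are hypothetical. The sketch collects the text by speaker, party and month, which is the grouping you would need before counting anyone's words.

```python
from collections import defaultdict

def group_speeches(lines):
    """Join speech text keyed by (speaker, party, 'YYYY-MM').
    Assumes the made-up '@DATE' and '@SPEAKER name|party' markers."""
    grouped = defaultdict(list)
    month, who = None, None
    for raw in lines:
        line = raw.strip()
        if line.startswith("@DATE "):
            month = line[len("@DATE "):].strip()
            who = None  # a new month starts with no current speaker
        elif line.startswith("@SPEAKER "):
            name, _, party = line[len("@SPEAKER "):].partition("|")
            who = (name.strip(), party.strip())
        elif line and who and month:
            grouped[(who[0], who[1], month)].append(line)
    return {key: " ".join(parts) for key, parts in grouped.items()}
```

Text that turns up before any marker is silently dropped, which is probably what you want for OCR junk at the top of a page.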
The troubling thing, before I got into this whole rant, was that Parliament has already paid to have all the old transcripts scanned and OCRed, and people I know have been putting it online here. And here’s their development blog. Goddammit, if we can’t get these things together, what hope is there?
The Telegraph has the story from yesterday too:
Last week the Medical Research Council awarded a research grant to Dr Garrard and Dr Celeste de Jager, a neuropsychologist at the University of Oxford.
The research team will collect and examine a large database of spoken and written language samples collected over the past twenty years as part of the Oxford Project to Investigate Memory and Aging.
Here’s the fancy website of the Medical Research Council. All kinds of stuff there to distract you from the fact that their one and only mission is the delivery of scientific grants to researchers.
I leave it as a difficult exercise for the reader to navigate through the website to the list of grants given.
What would make me really happy would be if the fruits of the vast work done into text mining software by academics were installed onto the TheyWorkForYou servers and ran live so that the results showed up in the Numerology section where it belongs.
Connect scholars to activists not via Elsevier’s dead tree vaults. Progress might then occur.