Did you know that you help digitize old books just by visiting websites?
I wanted to write this article to show you that even everyday, seemingly insignificant computing operations are in fact not insignificant at all… You have doubtless already heard of “CAPTCHA”. It stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. It is a way to prove to the website you want to visit that you are a human being and not a robot (this protects websites from being spammed, or from having their data downloaded in bulk, which would also slow them down for human visitors).
The idea of reCAPTCHA [1] is to use the human authentication performed on websites through CAPTCHA technology to meet a need which, at first sight, has nothing to do with human authentication, and which is… to OCR old books! (OCR, for “Optical Character Recognition”, means converting characters printed on paper or scanned as images into “digital” characters, for example text encoded in UTF-8. In plain language, it means turning an old and damaged, perhaps even handwritten, piece of paper into a digital file.)
Digitizing old books has numerous cultural advantages (among them: everybody can access the same document simultaneously, even if it exists only as a single paper copy; consulting the document does not damage it; etc.), and so massive digitization operations have been undertaken since the early 21st century.
However, for old documents, automatic OCR is difficult: machines struggle to recognize the printed characters because they are irregular (the paper and the ink have deteriorated over time)… just like in a CAPTCHA…
From the visitor's point of view, a CAPTCHA usually appears as an image containing a series of slightly distorted alphanumeric characters, a text field in which the visitor has to type the characters shown in the image, and a button to submit the entry. The characters are distorted little enough for a human being to recognize them, but enough that a machine cannot. From the website administrator's point of view, the image is bound to its textual version, which is recorded in a database. When the CAPTCHA is presented to the visitor, the website compares the visitor's answer with the stored textual version: if both match, the website lets the human being in; otherwise, a new CAPTCHA is presented to the visitor.
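The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CaptchaStore` class and its `issue` and `verify` methods are hypothetical names, a dictionary stands in for the database, the image itself is not modelled, and the choice to ignore letter case and to let each challenge be answered only once is mine, not something the article specifies.

```python
import secrets


class CaptchaStore:
    """In-memory stand-in for the database that binds each image to its text."""

    def __init__(self):
        # challenge id -> textual version of the distorted image
        self._answers = {}

    def issue(self, expected_text):
        """Register a new challenge and return the id sent along with the image."""
        challenge_id = secrets.token_hex(8)
        self._answers[challenge_id] = expected_text
        return challenge_id

    def verify(self, challenge_id, typed_text):
        """Compare the visitor's answer with the stored text.

        The challenge is removed whether or not the answer is right, so a
        failed visitor has to be given a new CAPTCHA (assumption).
        """
        expected = self._answers.pop(challenge_id, None)
        if expected is None:
            return False
        # Assumption: surrounding spaces and letter case are ignored.
        return typed_text.strip().lower() == expected.lower()
```

A website would call `issue` when it renders the form and `verify` when the visitor presses the button; on `False` it would simply issue a fresh challenge.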
The idea behind reCAPTCHA is, all in all, childishly simple: have old documents transcribed… by human beings (!) who are proving that they are human in order to visit websites!
The idea is made possible by the very large number of websites using CAPTCHAs, and thus by the very large number of human authentications performed every day, which makes it possible to transcribe old books literally word by word. The mechanism is as follows: a reCAPTCHA consists of two words. The first one is a classical CAPTCHA, used to decide whether the entity taking the test is human or not (it is considered human if it manages to solve the CAPTCHA); the second one is the image of a word taken from an old book. If the entity correctly transcribes the first word, it is authorized to visit the website, and the transcription it gives of the second word is stored. If several “human beings” transcribe the same image from the old book in the same way, their transcription is “validated” and the digitization of the book progresses [2].
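To make the two-word mechanism concrete, here is a minimal Python sketch. It is not the real reCAPTCHA implementation: the class `ReCaptchaSketch`, the method `submit` and the constant `AGREEMENT_THRESHOLD` are hypothetical, and the threshold of three matching answers is an arbitrary assumption, since the description above only says “several”. Ignoring letter case is also an assumption.

```python
from collections import Counter, defaultdict

# Assumption: how many identical answers count as "several".
AGREEMENT_THRESHOLD = 3


class ReCaptchaSketch:
    """Toy model: one known control word plus one unknown word from a book."""

    def __init__(self):
        # unknown word id -> tally of the transcriptions proposed so far
        self.votes = defaultdict(Counter)
        # unknown word id -> transcription accepted by consensus
        self.validated = {}

    def submit(self, control_expected, control_typed, unknown_id, unknown_typed):
        """Return True if the visitor is let in (control word solved)."""
        if control_typed.strip().lower() != control_expected.lower():
            # Failed the human test: the answer for the unknown word is ignored.
            return False
        if unknown_id not in self.validated:
            answer = unknown_typed.strip().lower()
            self.votes[unknown_id][answer] += 1
            if self.votes[unknown_id][answer] >= AGREEMENT_THRESHOLD:
                self.validated[unknown_id] = answer
        return True
```

Only answers from visitors who solved the control word are counted, which is what lets the system trust transcriptions of a word it cannot check by itself.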
WHAT CAN BE DONE WITH IT
In this way, in 2009 alone, 20 years of New York Times archives were digitized [3] with an accuracy better than 99% [1].
Here is a beautiful example of human and machine collaboration!
[1] Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum, “reCAPTCHA: Human-Based Character Recognition via Web Security Measures”, Science, vol. 321, no. 5895, 12 September 2008, pp. 1465-1468.