The difficulty in handling rare historical documents and manuscripts for analysis of their texts led a group of researchers from the State University of Southwestern Bahia (UESB) to develop a photography method that facilitates transcription and understanding of linguistic phenomena from bygone eras. “There are old documents and books for which the traditional methods of obtaining an image through scanning can damage or even destroy the original, because they often require folding it or removing it from its bindings in order to place it on a scanner,” says Professor Jorge Viana Santos, of the UESB Corpus Linguistics Research Laboratory (Lapelinc). The researchers are studying 19th-century official registry books and documents that have already been handled often and are very fragile. “Unlike with photography, in scanning the document must adapt to the device, and not the opposite,” he says. Software already exists that converts typed or printed text into a text file using optical character recognition (OCR), which takes a scanned image of a document as its input. That approach does not work for documents written by hand.
The method created by Professor Santos, together with Professor Cristiane Namiuti Tempon, also at UESB, begins with a photograph of the text. Before taking the photo, the document is placed on a flat sheet of gray plastic with millimeter markings that serve to inform the computer of the exact measurements of the document. Color tone scales, cataloging, pagination and sequence information are also placed on this Cartesian table. The document page can be shown on the computer with all of this information, or with just the handwritten part.
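As a rough illustration of how millimeter markings can tell a computer the real size of a photographed page, the sketch below derives a scale factor from two ruler marks a known distance apart. It is not the Lapelinc software: the function names, pixel coordinates and page dimensions are all assumptions chosen for the example.

```python
import math

def mm_per_pixel(mark_a, mark_b, known_mm):
    """Scale factor from two ruler marks (pixel coordinates) a known distance apart."""
    pixel_distance = math.dist(mark_a, mark_b)
    if pixel_distance == 0:
        raise ValueError("the two reference marks must be distinct")
    return known_mm / pixel_distance

def size_in_mm(width_px, height_px, scale):
    """Convert a measured pixel width and height into millimeters."""
    return width_px * scale, height_px * scale

# Hypothetical values: two marks 100 mm apart appear 1000 pixels apart in the photo.
scale = mm_per_pixel((100, 50), (1100, 50), 100.0)

# Hypothetical page outline measured in the same photo, in pixels.
page_width_mm, page_height_mm = size_in_mm(2100, 2970, scale)
```

A single scale factor like this only holds if the camera is square to the table and lens distortion is negligible; a real setup would likely need more reference points.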
Details on the screen
The transposition of the document from the physical world to a digital format, through photography, is performed by software also developed at Lapelinc. It interprets the reference data captured in the photograph, such as the millimeter markings and color tone scales, and recovers the colors and tones of the original document on the computer screen. Thus, the method transposes historical handwritten documents into sets of electronic texts that are suitable for scientific research.
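The article does not describe how the software recovers colors, so the following is only a generic sketch of one common idea: compare how a reference patch of known color appears in the photo with its true value, then rescale every pixel accordingly. The function names and all RGB values are assumptions made up for the example.

```python
def channel_gains(measured_patch, reference_patch):
    """Per-channel correction factors from a reference patch of known color."""
    if any(value == 0 for value in measured_patch):
        raise ValueError("measured patch channels must be non-zero")
    return tuple(ref / meas for meas, ref in zip(measured_patch, reference_patch))

def correct_pixel(pixel, gains):
    """Apply the gains to one RGB pixel, clamping to the 0-255 range."""
    return tuple(min(255, max(0, round(v * g))) for v, g in zip(pixel, gains))

# Hypothetical values: a neutral gray patch known to be (180, 180, 180)
# was photographed with a warm cast as (200, 180, 160).
gains = channel_gains((200, 180, 160), (180, 180, 180))

# A pixel from the document with the same cast is pulled back to neutral.
corrected = correct_pixel((100, 90, 80), gains)
```

A full tone scale would provide several patches, allowing a correction curve rather than a single gain per channel.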
The advantages of the Lapelinc Method also include the ease of magnifying the original text on the computer screen in order to check details or answer questions about the writing. Because the copy is digital, the document can be consulted repeatedly without damaging the historical material. According to Santos, the new method contributes to analyses performed by paleographers, specialists who read the text for language studies, transcribe it and, if necessary, adapt it to modern Portuguese. Corpus linguistics (the study of texts in electronic format) requires that the documents studied be in text format in order to compile corpora (the plural of corpus) for automatic linguistic analyses. “Our method allows compilation of an electronic corpus, forming a database in which each word can be identified and labeled, facilitating the linguist’s work when searching for the object of study; for example, nouns and verbs can be tagged,” says Santos. “The historian can read the text in modern Portuguese, but the linguist wants to know how the text was written in the original language, to analyze the patterns and evolution of the language.”
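To make the idea of a labeled corpus concrete, here is a minimal sketch of a database in which each word keeps its original spelling, a modernized form and a grammatical tag, so that a linguist can search by label. The structure, tag names and sample words are assumptions for illustration and are not taken from the Lapelinc corpus.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    original: str  # spelling as transcribed from the manuscript
    modern: str    # form adapted to modern Portuguese
    tag: str       # grammatical label, e.g. "NOUN" or "VERB"

def find_by_tag(corpus, tag):
    """Return every token in the corpus carrying the given label."""
    return [token for token in corpus if token.tag == tag]

def modern_reading(corpus):
    """Join the modernized forms, as a historian might prefer to read them."""
    return " ".join(token.modern for token in corpus)

# Illustrative tokens only, using older Portuguese spellings.
corpus = [
    Token("pharmacia", "farmácia", "NOUN"),
    Token("escripto", "escrito", "VERB"),
    Token("anno", "ano", "NOUN"),
]

nouns = find_by_tag(corpus, "NOUN")
```

Keeping both forms side by side reflects the distinction Santos draws: the historian reads the modern text, while the linguist queries the original spelling.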
Development of the Lapelinc Method began in 2008 and is ongoing. Transcription and editing of the text must still be incorporated into the software. The system created at UESB could also be useful at other academic institutions and even in business. “We do research, and external or commercial support would not change our work, but a prototype could lead to a product, since the method could result in a patent. We are currently wrapping up development,” explains Santos. The project was financed by the Bahia Research Foundation (FAPESB), the National Council for Scientific and Technological Development (CNPq) and the university itself.
Santos, J. V. and Brito, G. S. Fotografia técnica de documentos para formação de corpora digitais eletrônicos: o método desenvolvido no Lapelinc. Letras & Letras. V. 30, No. 2, p. 421-30. July/Dec. 2014.