Text Digitization and Ideas for a Personal Corpus

On Wednesday, September 21st, my classmates and I in the Digital Humanities class tested out text digitization using Abbyy FineReader. It was enriching to see and learn how historical documents, administrative papers, or any other text in physical format can be transformed into digital text, using only a scanner and Optical Character Recognition (OCR) software.

The process is very easy and takes only a few steps:

  1. Scan the paper and save it as an image (I think that both .jpg and .png work) or as a PDF
  2. Open the saved file using Abbyy FineReader
  3. Select the language of the text
  4. Command Abbyy FineReader to “read” the paper
  5. Adjust, select, delete as you prefer
  6. Export the text as an .RTF document (for more efficiency when switching between operating systems)

During class time I had the opportunity both to digitize a text and to analyze it, in order to observe the capabilities, as well as the shortcomings, of OCR.

First of all, I was impressed by how easily it reconstructs the text in digital format, and by the many options for selecting which parts of the text to export. For example, Abbyy recognizes page numbers and annotations, and it even detects where the spine of a book was scanned (if that is the case) and lightens that area. Also, at the export stage, the user can choose to preserve the formatting of the original page. However, it does not interpret handwriting correctly; one piece of scanned handwriting was interpreted as Arabic script.

Once the text is exported and opened as either a .doc or an .RTF file, the even more interesting part of the digitization process begins. In class, I analyzed a short fragment of the Bible published in Romanian and a short piece of Arabic text. Screenshots of both are attached below:

screen-shot-2016-09-21-at-12-56-42-pm
Short fragment from the Bible (Genesis), published in 2001 in Bucharest, Romania. The language is Romanian.
screen-shot-2016-09-21-at-12-57-24-pm
A short piece of Arabic text, both in vowelled and non-vowelled script.

For the text in Romanian, a few things I noticed while looking over both the original text and the digital version of it are:

  • the export does not preserve the symbol of the cross, changing it (depending on the context) into “t” or “f”;
  • it preserved, in some cases, the cursive “D” in “Domnul,” whereas in other cases it replaced it with the copyright symbol ©;
  • it misread some of the superscript characters (e.g. “1” instead of “i”, or “!” instead of “1”);
  • it didn’t preserve all the whitespace between words, running them together into unreadable strings.
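Errors like these can be searched for automatically before proofreading by hand. The following is only a minimal sketch in Python, not something we did in class: the function name `flag_suspect_tokens`, the set of suspect characters, and the length threshold of 20 are all my own assumptions, chosen to match the errors listed above.

```python
# Characters that should rarely appear inside a word. This set is an
# assumption based on the errors listed above, not an exhaustive list.
SUSPECT_CHARS = "©!0123456789"

# Punctuation stripped from both ends of a token before checking it.
EDGE_PUNCTUATION = ".,;:?!\"'()"


def flag_suspect_tokens(text, max_len=20):
    """Return (token, reason) pairs for tokens that look like OCR errors.

    max_len is a hypothetical threshold: longer tokens are reported as
    words that may have lost the whitespace between them.
    """
    flagged = []
    for token in text.split():
        word = token.strip(EDGE_PUNCTUATION)
        has_letter = any(ch.isalpha() for ch in word)
        if has_letter and any(ch in SUSPECT_CHARS for ch in word):
            flagged.append((word, "letters mixed with digits or symbols"))
        elif len(word) > max_len:
            flagged.append((word, "possibly missing whitespace"))
    return flagged


# Invented sample line, for illustration only.
for word, reason in flag_suspect_tokens("©omnul a zis cerur1le Amin!"):
    print(word, "->", reason)
```

A check like this only points at places worth a second look; a cross misread as “t” or “f” still produces a token made entirely of letters, so it would pass unnoticed.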

For the Arabic text, the OCR interpreted the two scripts (vowelled and non-vowelled) as different ones (in the picture, one is highlighted in green and the other in red). By far the most interesting comparison is in the image below, where exporting the scanned image produced a new page layout: it realigned the entire front page of a book to the right, following Arabic writing conventions, emphasized some words over others, and did not preserve the artistic aspect of the calligraphy (where it was present).

screen-shot-2016-09-21-at-12-59-32-pm
On the left: Original scanned page of a text in Arabic. The text is centered and stylized. On the right: the Abbyy FineReader processed and exported version of the image on the left. The style is not preserved, some words are in bold, and the alignment has shifted.

After this fascinating exercise, and after discovering a few of the things computing can do with text, I thought about some project ideas and a personal corpus to work with. Two ideas come to mind:

  1. An anthology of poems by Lucian Blaga (Romanian poet and philosopher), in which to find the most recurrent words or series of words that are also associated with concepts in his philosophy; or
  2. A comparison between lines in screenplays and the actual dialogue that is used in a film (for those films for which I can find both data sets).
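Both ideas come down to fairly simple text processing once the corpus is digitized. Below is a rough sketch in Python of how each might start, using only the standard library. The function names (`tokenize`, `top_ngrams`, `line_similarity`) are my own hypothetical helpers, and the sample strings are invented rather than taken from any real poem or screenplay.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and return its runs of letters (diacritics included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_ngrams(text, n=1, k=10):
    """Idea 1: return the k most frequent series of n consecutive words."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(k)


def line_similarity(script_line, spoken_line):
    """Idea 2: word-level similarity between two lines, from 0.0 to 1.0."""
    a, b = tokenize(script_line), tokenize(spoken_line)
    return SequenceMatcher(None, a, b).ratio()


# Invented examples, for illustration only.
print(top_ngrams("the light and the light", n=2, k=1))
print(line_similarity("I will be back.", "I'll be back"))
```

For the poems, counting would only be a first step, since deciding which frequent words relate to a philosophical concept still requires reading. For the films, a similarity score per line could help find where the actors departed most from the screenplay.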