Corpus Presentations on Text Analysis

On Monday, October 17th, my class held its presentations on corpus analysis. After working on them over the past couple of weeks, my classmates and I presented the results of our work to each other, to the professor, and to our guests. Not only did each of us learn from every presentation, but we also learned from the feedback we received from the audience.

The five text collections the analyses were based on came from different categories, which shows the diversity of material a digital humanist can work with. We had a really fun and interesting analysis of the top 10 Billboard songs of the past decades, which looked at the patterns behind writing a hit and how they varied over the years. Then, I presented my corpus analysis and quantification, which will serve as the basis for a poem generator in a Bacovian style. Third, another classmate provided us with an extremely detailed overview and comparison of two dictionaries of Costa Rican slang. Another classmate followed with a presentation on the portrayal of Islam and Arab culture in Western media, an attempt to create a tool for dismantling stereotypes (which I found very interesting, inspiring, useful, and applicable in other fields). The last presentation was an analysis of the 2005 Paris race riots and their portrayal in the media.

As I already said, these presentations showed the wide range of topics one can choose from when approaching text analysis as a digital humanist.

Looking back on my work, I think that the process of quantifying the chosen poems using AntConc was incredibly useful and gave me a starting point for the development of the program I am currently working on. Using Voyant was also a good opportunity to see the similarities and differences between what I knew about the poems and what the tool allows one to discover about them (even without having read them before). Now that I think of it, I could have done a more detailed analysis of the connections between the words used in the poems. This would not only help me when programming the generator, but also offer me and others a better understanding of how, when, and why Bacovia chooses his words.
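
The kind of word-connection analysis I have in mind could be prototyped with a few lines of code outside AntConc or Voyant. Below is a minimal sketch (my own illustration, not either tool's actual method) that counts pairs of distinct words appearing in the same verse line; the sample verses only echo Bacovia's imagery and are not his actual text:

```python
from collections import Counter
from itertools import combinations
import re

def cooccurrences(lines):
    """Count pairs of distinct words that appear in the same verse line."""
    counts = Counter()
    for line in lines:
        # Deduplicate and sort so each pair has a canonical order.
        words = sorted(set(re.findall(r"\w+", line.lower())))
        counts.update(combinations(words, 2))
    return counts

# Illustrative lines echoing Bacovian imagery; not actual Bacovia text.
verses = [
    "plumb si toamna grea",
    "toamna violet plumb",
]
for pair, n in cooccurrences(verses).most_common(2):
    print(pair, n)
```

Counting co-occurrences within a line, rather than across whole poems, keeps the pairs close enough to hint at how the poet connects his words.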

As a closing thought, I am eagerly looking forward to working on other digital humanities projects and applying what I learned through this experience and by looking at others’ work.



The Citizen Scholars in Context

The class on the 26th of September had a very inspiring topic (which led me to daydream about implementing a crowdsourcing system in my hometown sometime in the future).

Citizen Science is, according to Wikipedia, “scientific research conducted, in whole or in part, by amateur or nonprofessional scientists,” or, simply put, “public participation in scientific research.” A citizen scholar is, thus, any person who takes part in such research and contributes to its progress. An example of such a project is Zooniverse, where people are invited to help recognize and classify animal faces, contributions that further the development of AI features computers will use to recognize those faces automatically.

In class, not only did we discuss the benefits of such a mechanism, but we even tried it ourselves! Crowdtranscription is a subcategory of crowdsourcing in which users help recognize and transcribe text in scanned images. My classmates and I, together with our professor, went to 18thConnect and edited the Memoir of a chart of the east coast of Arabia from Dofar to the Island Maziera. The document had previously been digitized by an OCR program, but as we learned last time, the digitization of a text comes with occasional errors which, so far, only a human brain can correct. It was an amazing activity for me, as I could take responsibility and contribute to other people’s attempts to create great online resources for the general public. At the same time, I was able to notice, as last time, other errors that appear in the process of text digitization, as well as the decisions an editor needs to make when transcribing and/or editing a text. For example, he or she needs to decide whether to preserve the italics, size, indentations, or superscripts that appear in a text, or simply to replace them and justify their decisions in a note.

Since the text documented the journey of a sailor around the Arabian coasts, a thought popped into my mind. I realized I know very little about the older history of the geographical area I am currently living in (Abu Dhabi, United Arab Emirates). Then I realized there is an incredible amount of research that could be conducted using citizen science. The UAE and the Arab world in general are still so little known to those outside the region, especially when it comes to fields such as history, language, literature, culture, and even (old or traditional) cuisine (if you ask me). Research on almost anything in these categories would contribute to the dissemination of information beyond Arab borders, out into the curious and intrigued world. After a quick Google search I found that there are some projects (ongoing or already finished) on the topic. For example, the team behind the Arabic language collection claims that it comprises more than 100,000 books and more than 15,000 manuscripts. Still, very little of this is available online to the general public, and, what is more, even less of it is likely to have been translated into English. However, there is good news, as some of the manuscripts are going through the process of digitization.


Text Digitization and Ideas for Personal Corpus

On Wednesday, September 21st, my classmates and I in the Digital Humanities class tested out text digitization using Abbyy FineReader. It was enriching to see and learn how historical documents, administrative papers, or any other sort of text in physical format can be transformed into digital text using only a scanner and Optical Character Recognition (OCR) software.

The process is very easy and can be done in a few steps:

  1. Scan the paper and save it as an image (I think both .jpg and .png work) or as a PDF
  2. Open the saved file using Abbyy FineReader
  3. Select the language of the text
  4. Command Abbyy FineReader to “read” the paper
  5. Adjust, select, delete as you prefer
  6. Export the text as a .RTF (for more efficiency when switching between operating systems) document

During class time I had the opportunity both to digitize a text and to analyze it, in order to observe the capabilities, as well as the shortcomings, of OCR.

First of all, I was impressed by how easily it can reconstruct the text in digital format, and by the multitude of options for selecting which parts of the text you need for export. For example, Abbyy recognizes page numbers, any annotations made, and even where the spine of a book (if such is the case) was scanned, and lightens it. Also, at export time, the user can choose to preserve the format of the initial page. However, it does not interpret handwriting correctly; one piece of scanned handwriting was interpreted as being written in Arabic.

Once the text is exported and opened as either a .doc or an .RTF, the even more interesting part of the digitization process takes place. In class, I analyzed a short fragment of the Bible published in Romanian, and a short piece of Arabic text. Screenshots of both are attached below:

Short fragment from the Bible (Genesis), published in 2001 in Bucharest, Romania. The language is Romanian.
A short piece of Arabic text, both in vowelled and non-vowelled script.

For the text in Romanian, a few things I noticed while looking over both the original text and the digital version of it are:

  • the export does not preserve the symbol of the cross, changing it (depending on the context) into “t” or “f”;
  • it preserved, in some cases, the cursive ‘D’ in “Domnul,” whereas in other cases it replaced it with the copyright symbol ©;
  • it replaced some of the superscript letters (e.g. “1” instead of “i” or “!” instead of “1”);
  • it didn’t preserve all the whitespace between words, joining them into unreadable strings
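
Errors like the copyright-symbol misreading are systematic enough that a simple substitution pass could correct many of them automatically after export. Here is a minimal sketch (the substitution table is my own illustration, built from the errors noted above):

```python
# Illustrative table of OCR misreadings observed in the Romanian fragment.
FIXES = {
    "©": "D",  # the copyright symbol stood in for the cursive 'D' in "Domnul"
}

def clean_ocr(text, fixes=FIXES):
    """Apply a table of known OCR misreadings to a digitized text."""
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text

print(clean_ocr("©omnul"))  # -> Domnul
```

Ambiguous cases, like the cross symbol becoming “t” or “f” depending on context, cannot be reversed this way and still need a human eye.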

For the Arabic text, the OCR interpreted the two scripts (vowelled and non-vowelled) as different ones (in the picture we can see that one is highlighted in green and the other in red). By far the most interesting comparison is in the image below, where the export of the scanned image produced a new page layout (realigning the entire front page of a book to the right, according to Arabic writing conventions), emphasized some words over others, and did not preserve the artistic aspect of the calligraphy (where it was present).

On the left: Original scanned page of a text in Arabic. The text is centered and stylized. On the right: the Abbyy FineReader processed and exported version of the image on the left. The style is not preserved, some words are in bold, and the alignment has shifted.

After this extremely fascinating exercise my classmates and I did, and after discovering some of the ways computing helps in dealing with text, I have thought about some project ideas and a personal corpus to work with. Two ideas come to mind:

  1. An anthology of poems by Lucian Blaga (Romanian poet and philosopher), for which I would find the most recurrent words/series of words that are also associated with concepts in his philosophy; or
  2. A comparison between lines in screenplays and the actual dialogue that is used in a film (for those films for which I can find both data sets).
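
For the first idea, the recurrent series of words could be extracted with a simple n-gram counter. A minimal sketch (my own illustration, using the opening line of Blaga’s “Eu nu strivesc corola de minuni a lumii” as a tiny sample):

```python
from collections import Counter

def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-word sequences in a text."""
    words = text.lower().split()
    # Slide n parallel views over the word list to form n-grams.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams).most_common(k)

# A single famous line as a toy corpus; the real anthology would be much larger.
print(top_ngrams("eu nu strivesc corola de minuni a lumii", n=2, k=3))
```

Run over the whole anthology, the top n-grams could then be matched against a hand-made list of the concepts in his philosophy.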