The Citizen Scholars in Context

The class on the 26th of September had a very inspiring topic (which led me to daydream about implementing a crowdsourcing system in my hometown sometime in the future).

Citizen science is, according to Wikipedia, “scientific research conducted, in whole or in part, by amateur or nonprofessional scientists,” or, simply put, “public participation in scientific research.” A citizen scholar is thus any person who takes part in the research and contributes to its progress. An example of such a project is Zooniverse, where people are invited to help recognize and classify the faces of animals; their work then contributes to the development of an AI feature that computers will use to recognize those faces automatically.

In class, we not only discussed the benefits of such a mechanism, but we even tried it ourselves! Crowdtranscription is a subcategory of crowdsourcing in which users help recognize and transcribe text in scanned images. My classmates and I, together with our professor, went to 18thConnect and edited the Memoir of a chart of the east coast of Arabia from Dofar to the Island Maziera. The document had previously been digitized by an OCR program, but, as we learned last time, the digitization of a text comes with occasional errors which, so far, only a human brain can correct. It was an amazing activity for me, as I could take responsibility and contribute to other people’s attempts to create great online resources for the general public. At the same time, I was able to notice, as last time, other errors that appear in the process of text digitization, and also the decisions an editor needs to make when transcribing and/or editing a text. For example, he or she needs to decide whether to preserve the italics, sizes, indentations, or superscripts that appear in a text, or simply to replace them and explain the decision in a note.

Since the text documented the journey of a sailor around the Arabian coasts, a thought popped into my mind. I realized I know very little about the early history of the geographical area I am currently living in (Abu Dhabi, United Arab Emirates). Then I realized that an incredible amount of research could be conducted using citizen science. The UAE, and the Arab World in general, is still so little known to those outside of it, especially when it comes to fields such as history, language, literature, culture, and even (old or traditional) cuisine (if you ask me). Research on almost anything in these categories would contribute to the dissemination of information beyond the Arab borders, out into the curious and intrigued world. After a quick search on Google I found that there are some projects (ongoing or already finished) on the topic. For example, the team behind the Arabic language collection claims that their collection comprises more than 100,000 books and more than 15,000 manuscripts. Still, very little of this is available online to the general public and, what is more, even less must have been translated into English. However, there is good news, as some of the manuscripts are going through the process of digitization.

Text Digitization and Ideas for Personal Corpus

On Wednesday, September 21st, my classmates and I in the Digital Humanities class tried out text digitization using Abbyy FineReader. It was enriching to see and learn how historical documents, administrative papers, or any other sort of text in physical format can be transformed into a piece of digital text, using only a scanner and Optical Character Recognition (OCR) software.

The process is easy and takes only a few steps:

  1. Scan the page and save it as an image (I think that both .jpg and .png work) or as a PDF
  2. Open the saved file using Abbyy FineReader
  3. Select the language of the text
  4. Have Abbyy FineReader “read” the page
  5. Adjust, select, delete as you prefer
  6. Export the text as an .RTF document (for easier switching between operating systems)

During class time I had the opportunity both to digitize a text and to analyze it, in order to observe the capabilities, as well as the shortcomings, of OCR.

First of all, I was impressed by how easily it can reconstruct the text in a digital format, and by the many ways of selecting which parts of the text you need for export. For example, Abbyy recognizes the page numbers, any annotations made, and even the place where the spine of a book (if there is one) was scanned, which it lightens. Also, just before export, the user can choose to preserve the format of the original page. However, it does not interpret handwriting correctly; one piece of scanned handwriting was interpreted as being written in Arabic.

Once the text is exported and opened as either a .doc or an .RTF, the even more interesting part of the digitization process begins. In class, I analyzed a short fragment of the Bible published in Romanian, and a short piece of Arabic text. Screenshots of both are attached below:

screen-shot-2016-09-21-at-12-56-42-pm
Short fragment from the Bible (Genesis), published in 2001 in Bucharest, Romania. The language is Romanian.
screen-shot-2016-09-21-at-12-57-24-pm
A short piece of Arabic text, both in vowelled and non-vowelled script.

For the text in Romanian, here are a few things I noticed while comparing the original text with its digital version:

  • the export does not preserve the symbol of the cross, changing it (depending on the context) into “t” or “f”;
  • it preserved, in some cases, the cursive “D” in “Domnul,” whereas in other cases it replaced it with the copyright symbol ©;
  • it misread some of the superscript characters (e.g. “1” instead of “i,” or “!” instead of “1”);
  • it didn’t preserve all the whitespace between words, joining them into unreadable strings.
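
The kinds of errors listed above could, in principle, be flagged automatically so that a human editor knows where to look first. Below is a minimal sketch in Python, not anything Abbyy FineReader itself does: the rule names, the patterns, and the 20-character threshold for “joined words” are all my own assumptions, and the sample text is made up for illustration.

```python
import re

# Hypothetical rules based on the errors observed in the Romanian sample.
# The labels and the 20-character threshold are illustrative assumptions.
SUSPECT_PATTERNS = {
    "copyright_sign": re.compile(r"©"),                 # may be a misread cursive D
    "embedded_exclamation": re.compile(r"(?<=\w)!(?=\w)"),  # "!" between letters/digits
    "long_token": re.compile(r"\S{20,}"),               # possibly several joined words
}


def flag_suspects(text, patterns=SUSPECT_PATTERNS):
    """Return (line_number, rule_name, matched_text) for each suspicious spot."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in patterns.items():
            for match in pattern.finditer(line):
                findings.append((line_no, rule_name, match.group()))
    return findings


if __name__ == "__main__":
    # Made-up sample imitating the OCR errors described above.
    sample = "©omnul a zis\nlacinceputafacutDumnezeucerurile\nversetul 2!3"
    for finding in flag_suspects(sample):
        print(finding)
```

A script like this only points at likely trouble spots; deciding what the correct reading is still falls to the human editor.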

For the Arabic text, the OCR interpreted the two scripts (vowelled and non-vowelled) as different ones (in the picture we can see that one is highlighted in green and the other in red). By far the most interesting comparison is in the image below, where the export of the scanned image produced a new page layout (realigning the entire front page of a book to the right, according to Arabic writing standards), emphasized some words over others, and did not preserve the artistic aspect of the calligraphy (where it was present).

screen-shot-2016-09-21-at-12-59-32-pm
On the left: Original scanned page of a text in Arabic. The text is centered and stylized. On the right: the Abbyy FineReader processed and exported version of the image on the left. The style is not preserved, some words are in bold, and the alignment has shifted.

After this extremely fascinating exercise, and after discovering a few of the things computing can do with text, I thought about some project ideas and a personal corpus to work with. Two ideas came to mind:

  1. An anthology of poems by Lucian Blaga (Romanian poet and philosopher), in which I would look for the most frequent words or word sequences that are also associated with concepts in his philosophy; or
  2. A comparison between lines in screenplays and the actual dialogue that is used in a film (for those films for which I can find both data sets).
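
To get a feel for what either idea would involve, here is a rough sketch in Python using only the standard library. It is an assumption about how such an analysis might start, not a finished method: the function names (`tokenize`, `top_ngrams`, `line_similarity`) are hypothetical, and the sample strings are placeholders rather than real lines from Blaga or from any screenplay.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def top_ngrams(text, n=1, k=5):
    """Return the k most frequent n-word sequences together with their counts."""
    tokens = tokenize(text)
    # Zip the token list against shifted copies of itself to form n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(k)


def line_similarity(script_line, spoken_line):
    """Word-level similarity between two lines, from 0.0 (no overlap) to 1.0."""
    matcher = SequenceMatcher(None, tokenize(script_line), tokenize(spoken_line))
    return matcher.ratio()


if __name__ == "__main__":
    # Placeholder text, not an actual poem.
    poem = "the light the light the light dark"
    print(top_ngrams(poem, n=2, k=3))

    # Placeholder lines, not taken from a real screenplay.
    print(line_similarity("We leave at dawn.", "We are leaving at dawn"))
```

For the first idea, the frequent words would still have to be matched by hand against the concepts in the philosophy; for the second, a low similarity score would simply mark the lines where the film departs from the script.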

 

 

What I discovered about digital projects

I was very pleased to learn about the different functionalities of a digital project, the forms it can take, and the purposes it serves.

During one of the seminars for my digital humanities course, I discovered that digital projects all come with a set of general steps “to follow” before and after their implementation. This means that goals need to be set, methodologies must be determined, and resources must be committed. There is usually a team working on one project, a process for tracking changes over time, and media (often visual) employed in it. The two main characteristics of such a project are its interdisciplinarity and its generative aspect. The first means that it draws on more than one discipline or field of study in order to achieve its goal (for example, computing and literature, or history and data science), while the latter suggests that the project proceeds by trial before success, and that learning by doing and failure are recurrent parts of the process.

The projects, then, differ in their form and purpose. Some of them use the online platform as their main means of interaction with the user, whereas others mostly use the online medium to disseminate information about offline events and for networking purposes. The projects can be person-based, they can follow various historical periods in different locations (the routes travelled by the letters of great philosophers of the Renaissance, http://republicofletters.stanford.edu/), or they can simply focus on one moment in time and space to recreate it online (e.g. the World’s Fair in Italy that took place in 1911, http://www.italyworldsfairs.org/). Last, but not least, the projects’ content may be front-ended or back-ended, which requires a different type of engagement from the user. In the first case, the user is not concerned with studying all the data and drawing conclusions (as in the second case); rather, he or she is given the results (sometimes displayed in an interactive form) of long-term research conducted by the digital project team.