What happens in the DH world

I am at a point in the semester where, before I start working on my own digital projects, I need to look back and reflect on what I have discovered so far about the world of digital humanities. Compared with what I knew before, I have learnt a lot. I will reflect in two ways: by updating my digital narrative and by writing down some remarks I have made or thought about in the almost two months since the class began. The latter comes first.

From what I have seen so far, digital humanities covers, as the humanities do, a wide range of possibilities for what (research) project to build with the knowledge it provides. One could map out the places where the first historical sources were found (I would love to see such a project on Romania’s early history); another could simply digitize the collected letters of a famous historical figure; or, if someone feels more ambitious, they could gather data on Neo-modernist literature in ex-communist countries and see how the regime influenced the authors’ themes, ways of expression, and purpose in writing.

Among all the possibilities for developing a digital humanities project, I have noticed some themes and ideas that project initiators and researchers lean towards the most. Three examples are: online collections, visual representations, and research and process (user-input based research projects).

The first one, online collections, is one of the earliest forms of digital project, which started when contemporary humanists learnt the benefits of having text in digital form. Online collections presuppose the existence of a physical collection that is photographed or scanned and then either transcribed by the project team or digitized using OCR software. Examples of online collections are: Arabic Collections Online, Early English Books Online, Al-Maktaba Al-Shamela, Eighteenth Century Collections Online, Blue Mountain Project.

How should one read such collections? (And why can they be considered digital humanities projects rather than simple collections of authors’ works?) Let’s take the example of EEBO (link above). After going through the search process and reaching the desired book, the fun begins! There are two ways in which the text is presented on EEBO:

  • First, the photocopies. If the text hasn’t been digitized, the viewer is presented with photocopies of a printed edition of the book. What is extremely valuable in having the book displayed in such a way is the preservation of the forms, spelling, and grammar of those works. Many of the books available online today are “adjusted” (edited) so that the contemporary casual reader can understand them without further research. Moreover, any element that was intentionally preserved, or any old form of a word that had to be kept (e.g. to preserve the verse length), is more often than not explained in a footnote. This is not the case, however, with online collections of old books, where the creators of the collections simply reproduce the works in their original form.
  • Second, the photocopies might be accompanied by a digitized text of their contents. For example, in the image below (a screenshot from EEBO), we are given the digitized form of a work (a randomly selected discourse by Pierre Ayrault):


First, we must notice that the text formatting was preserved (the writing in italics or bold). However, if we open the link above the title, which sends the user to the original photocopy of the text, we are faced with a completely different representation of it (seen below):


Being offered such representations is invaluable for humanist researchers who have little to no access to the original forms of the works they are studying. Not only are they given a photocopy of the work they need (with all the bonus annotations that could help guide their research), but they are also given a “translated” form of the text, which preserves the word forms and (as much as possible) the text formatting. This makes the reading process easier for the researcher, without taking away the incredibly interesting features of the original form of the work on which they can continue their research.

An extremely interesting feature of having such collections in digital format is the different ways of accessing the contents of a book, manuscript, or article in order to analyze the data further. For example, Austen Said contains some of Jane Austen’s most popular novels and allows the user to “explore Austen’s pattern of diction”, for instance through word frequencies or other visualizations of the novels. Which brings us to…

… Visual representations (maps, graphs, charts etc.), which are also extremely popular among digital humanities projects. Maps are an interesting and useful tool for visualizing (and, consequently, better grasping) how data is distributed. For example, this map from the Linguistic Landscapes of Beirut (project by David J. Wrisley) beautifully shows what one would otherwise take hours to learn: where Arabic, Latin, or mixed scripts occur in a delimited area of Beirut. By using colors to represent each type of script, the author(s) have significantly reduced the reader’s work. Readers no longer have to picture in their minds, while reading a text, where each of these scripts would be found. They are already given the visualization, making it possible for them to immediately start analyzing the no-longer-raw data (e.g. to determine in which region the occurrence of Arabic script is higher than that of Latin script, and to attempt to explain why). Other such projects, which output a map, a chart, or even an interactive graphic, are: Mapping the Republic of Letters, Digital Karnak: Timemap, Ibn Jubayr.

The third theme I have noticed occupying a large space in the digital humanities world is the user-input based research project. The primary purpose of this kind of project, before diving into data analysis, is gathering data from its users (the general public). For example, Zooniverse asks its visitors to help recognize the faces of wild animals, which could eventually lead to the development of an AI tool that would do that for us, but which currently lacks the database to operate in such a way. This type of project is valuable in the sense that it familiarizes users with the problems and topics digital humanists are studying and involves them in the process. This could easily mean that, once a person offers their input, they would also be interested in checking the progress and completion of the project, and in supporting it all the way through, something that happens less often or not at all when one randomly comes across a digital project online. We tend to look over a finished piece of research, read a little about it, and later almost forget it existed, unless we need it for other purposes.

This novelty is recognized as the so-called social turn in scholarship because the impact of the new research methods is, well, social. By engaging the wider public in their research projects and problems, scholars benefit in two ways. First, they gather the data necessary for developing the research; second, they disseminate its process and results in real time. This dissemination happens because the users are the creators of their small piece of data (so they already know that much), but also because, more often than not, users will be curious to find out about the outcome of a project they took part in. At the same time, the user benefits from the status of ‘collaborator’ and from a feeling of fulfilled social responsibility.

As we have seen, there is a lot happening in the field of digital humanities, yet the processes are not always visible to the general public, or engaging for longer than people happen to have a particular interest in the topic explored. Let’s hope that, as the subject gains wider academic recognition, people will also become better acquainted with it.


Text Digitization and Ideas for Personal Corpus

On Wednesday, September 21st, my classmates and I in the Digital Humanities class tested out text digitization using Abbyy FineReader. It was enriching to see and learn how historical documents, administrative papers, or any other sort of text in physical format can be transformed into a piece of digital text, using only a scanner and Optical Character Recognition (OCR) software.

The process is very easy, and can be done in very few steps:

  1. Scan the paper and save it as an image (I think that both .jpg and .png work) or as a PDF
  2. Open the saved file using Abbyy FineReader
  3. Select the language of the text
  4. Command Abbyy FineReader to “read” the paper
  5. Adjust, select, delete as you prefer
  6. Export the text as an .RTF document (for more efficiency when switching between operating systems)

During class time I had the opportunity both to digitize a text and to analyze it, in order to observe the functionalities, as well as the shortcomings, of OCR.

First of all, I was impressed by how easily it can reconstruct the text in a digital format, and by the many ways of selecting which parts of the text you need for export. For example, Abbyy recognizes the page numbers, any annotations made, and even the area where the spine of a book (if such is the case) was scanned, which it lightens. Also, close to the export step, the user can choose to preserve the format of the original page. However, it does not interpret handwriting correctly; one piece of handwriting we scanned was interpreted as being written in Arabic.

Once the text is exported and opened as either a .doc or an .RTF, the even more interesting part of the digitization process takes place. In class, I analyzed a short fragment of the Bible published in Romanian, and a short piece of Arabic text. Screenshots of both are attached below:

Short fragment from the Bible (Genesis), published in 2001 in Bucharest, Romania. The language is Romanian.
A short piece of Arabic text, both in vowelled and non-vowelled script.

For the text in Romanian, a few things I noticed while looking over both the original text and the digital version of it are:

  • the export does not preserve the symbol of the cross, changing it (depending on the context) into “t” or “f”;
  • it preserved, in some cases, the cursive ‘D’ in “Domnul,” whereas in other cases it replaced it with the copyright symbol ©;
  • it misread some of the superscript characters (e.g. “1” instead of “i”, or “!” instead of “1”);
  • it didn’t preserve all the whitespace between words, joining them into unreadable strings.
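The kinds of errors listed above could, in principle, be flagged automatically for manual review. What follows is only a minimal sketch in Python and not anything Abbyy FineReader provides: the function name `flag_suspicious_tokens`, the length threshold, and the sample sentence are all assumptions made for illustration, and the rules cover just three of the symptoms (a stray ©, a digit or “!” between letters, and very long strings that may be joined words).

```python
import re

# Assumed threshold: tokens longer than this may be several words joined together.
MAX_WORD_LENGTH = 20


def flag_suspicious_tokens(text):
    """Return (token, reason) pairs for tokens that look like OCR errors.

    The rules are hypothetical heuristics derived from the errors noted above;
    they only flag tokens for a human to check and do not correct anything.
    """
    flagged = []
    for token in text.split():
        # Ignore punctuation at the edges of the token.
        word = token.strip('.,;:!?"()')
        if "©" in word:
            flagged.append((word, "copyright symbol, possibly a cursive D"))
        elif re.search(r"[^\W\d_][\d!]+[^\W\d_]", word):
            flagged.append((word, "digit or '!' between letters"))
        elif len(word) > MAX_WORD_LENGTH:
            flagged.append((word, "very long, possibly joined words"))
    return flagged


# Invented sample line imitating the errors described above (not real OCR output).
sample = "©omnul a zis: Să f1e lumină! ȘiDumnezeuavăzutcălumina"
for word, reason in flag_suspicious_tokens(sample):
    print(word, "->", reason)
```

Flagging rather than auto-correcting seems the safer choice here, since the same symbol was misread differently depending on the context.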

For the Arabic text, the OCR interpreted the two scripts (vowelled and non-vowelled) as different ones (in the picture we can see that one is highlighted in green and the other in red). By far the most interesting comparison is in the image below, where the export of the scanned image produced a new page layout (realigning the entire front page of a book to the right, according to Arabic writing standards), emphasized some words over others, and did not preserve the artistic aspect of the calligraphy (where it was present).

On the left: Original scanned page of a text in Arabic. The text is centered and stylized. On the right: the Abbyy FineReader processed and exported version of the image on the left. The style is not preserved, some words are in bold, and the alignment has shifted.

After this extremely fascinating exercise with my classmates, and after discovering a few of the things computing can help with when dealing with text, I have thought about some project ideas and a personal corpus to work with. Two ideas come to mind:

  1. An anthology of poems by Lucian Blaga (Romanian poet and philosopher), in which to find the most recurrent words or series of words that are also associated with concepts in his philosophy; or
  2. A comparison between lines in screenplays and the actual dialogue that is used in a film (for those films for which I can find both data sets).
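Both ideas can be prototyped with Python's standard library alone. The sketch below is a rough starting point under my own assumptions, not a finished method: `most_common_ngrams` counts recurrent words or series of words (for the first idea), and `line_similarity` scores how close a screenplay line is to the spoken one (for the second). The function names and the sample strings are invented placeholders, not lines from Blaga or from any real film.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def most_common_ngrams(text, n=1, top=5):
    """Return the `top` most frequent series of `n` consecutive words."""
    words = tokenize(text)
    # zip over n shifted copies of the word list to build consecutive n-grams.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(top)


def line_similarity(script_line, film_line):
    """Return a word-level similarity ratio between 0.0 and 1.0."""
    matcher = SequenceMatcher(None, tokenize(script_line), tokenize(film_line))
    return matcher.ratio()


# Invented placeholder text, standing in for a digitized poem.
print(most_common_ngrams("light and shadow, light and silence", n=2, top=1))

# Invented placeholder lines, standing in for a screenplay and a film transcript.
print(line_similarity("We should leave before morning.", "We leave before morning."))
```

For a real corpus, very common words (articles, prepositions) would probably need to be filtered out before the counts say anything about recurring concepts.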