Corpus Presentations on Text Analysis

On Monday, October 17th, my class held its presentations on corpus analysis. After working on them over the past couple of weeks of classes, my classmates and I presented the results of our work to each other, to the professor, and to our guests. Not only did each of us learn from every presentation, but we also learnt from the feedback we received from the audience.

The five collections of text on which the analyses were based came from different categories, which shows the diversity of material a digital humanist can work with. We had a really fun and interesting analysis of the top 10 Billboard songs of the past decades, which looked at the patterns of writing a hit and how they varied over the years. Then, I presented my corpus analysis and quantification, which will serve the development of a poem generator in a Bacovian style. Third, another classmate provided us with an extremely detailed overview and comparison of two dictionaries of Costa Rican slang. Another classmate followed with a presentation on the portrayal of Islam and Arab culture in the Western media, part of an attempt to create a tool to destroy stereotypes (which I found very interesting, inspiring, useful, and applicable in other fields). The last presentation was an analysis of the 2005 race riots in Paris and their portrayal in the media.

As I already said, these presentations showed the wide range of topics one can choose from when approaching text analysis as a digital humanist.

Looking back on my work, I think that the process of quantifying the chosen poems using AntConc was incredibly useful and gave me a starting point for the development of the program I am currently working on. Using Voyant was also a good opportunity to see the similarities and differences between what I knew about the poems and what the tool allows one to find out about them (even if one has never read them before). Now that I think of it, I could have done a more detailed analysis of the connections between the words used inside the poems. This would not only help me when programming the generator, but also offer me and others a better understanding of how, when, and why Bacovia chooses his words.
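
As a rough illustration of what such an analysis of word connections could look like, here is a minimal Python sketch. It is not what AntConc or Voyant do internally, and it is not the generator itself. The function names, the window size, and the sample lines (placeholders, not Bacovia's verses) are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (runs of Unicode letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(poems):
    """Count how often each word appears across all poems."""
    counts = Counter()
    for poem in poems:
        counts.update(tokenize(poem))
    return counts


def cooccurrences(poems, window=3):
    """Count unordered pairs of different words that appear
    within `window` tokens of each other inside the same poem."""
    pairs = Counter()
    for poem in poems:
        tokens = tokenize(poem)
        for i, word in enumerate(tokens):
            # Look only ahead, so each pair of positions is counted once.
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    pairs[tuple(sorted((word, other)))] += 1
    return pairs


# Hypothetical placeholder lines, NOT actual verses from the corpus.
sample_poems = [
    "grey rain over the grey town",
    "rain and silence in the town",
]

frequencies = word_frequencies(sample_poems)
pairs = cooccurrences(sample_poems, window=3)

print(frequencies.most_common(3))
print(pairs.most_common(3))
```

Pairs that score high here are words that tend to sit close to each other, which is the kind of information a generator could use when choosing the next word. A larger window captures looser associations at the cost of more noise.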

As a closing thought, I am eagerly looking forward to working on other digital humanities projects and applying what I learnt through this experience and by looking at others’ work.

What happens in the DH world

I am at a point in the semester where, before I start working on my own digital projects, I need to look back and reflect on what I have discovered so far about the world of digital humanities. And I have learnt a lot in comparison with what I previously knew. I will do this in two different ways: I will update my digital narrative, and I will write down some remarks I have made or thought about during the almost two months since the beginning of the class. The latter first.

As I’ve seen so far, digital humanities covers, as the humanities do, a whole lot of possibilities in terms of what (research) project to pursue with the knowledge it provides. One can choose to map out the places where the first historical sources were found (I would love to see such a project on Romania’s early history); another could simply digitize the collection of letters of a famous historical figure; or, if someone feels more ambitious, they could gather data on Neo-modernist literature in ex-communist countries and see how the regime influenced the authors’ themes, ways of expression, and purpose of writing.

Among all the possibilities there are in developing a digital humanities project, I have noticed some themes and ideas that project initiators and researchers lean towards the most. Three such examples are: online collections, visual representations, and research and process.

The first one, online collections, is one of the earliest forms of digital projects, which started when contemporary humanists learnt the benefits of having text in a digital form. Online collections presuppose the existence of a physical collection that is photographed or scanned and then either transcribed by the project team or digitized using OCR software. Examples of online collections are: Arabic Collections Online, Early English Books Online, Al-Maktaba Al-Shamela, Eighteenth Century Collections Online, Blue Mountain Project.

How should one read such collections, and why can they be considered digital humanist projects rather than simple collections of authors’ works? Let’s take the example of EEBO (link above). After going through the search process and reaching the desired book, the fun begins! There are two ways in which the text is presented on EEBO:

  • First, the photocopies. If the text hasn’t been digitized, the viewer is confronted with photocopies of a printed edition of the book. What is extremely valuable in having the book displayed in such a way is the preservation of the forms, spelling, and grammar of those works. Many of the books available online today are “adjusted” (edited) so that the contemporary casual reader can understand them without further research. Moreover, any element that was intentionally preserved, or any old form of a word that had to be kept (e.g. to preserve the verse length), is more often than not explained in a footnote. This is not the case, however, with online collections of old books, where the creators of the collections only reproduce the works in their initial form.
  • Second, the photocopies might be accompanied by a digitized text of their contents. For example, in the image below (a screenshot from EEBO), we are given the digitized form of a work (a randomly selected discourse by Pierre Ayrault):

[Image: EEBO screenshot of the digitized text]

First, we must notice that the text formatting was preserved (the writing in italics or bold). However, if we open the link above the title, which sends the user to the original photocopy of the text, we are faced with a completely different representation of it (seen below):

[Image: EEBO screenshot of the original photocopy]

Being offered such representations is invaluable for humanist researchers who have little to no access to the original forms of the works they are studying. Not only are they given a photocopy of the work they need (with all the bonus annotations that could help guide their research), but they are also given a “translated” form of the text, which preserves the word forms and (as much as possible) the text formatting. This makes the reading process easier for researchers, without taking away from them the incredibly interesting details of the original form of the work on which to continue their research.

An extremely interesting feature of having such collections in digital formats is the different ways to access the contents of a book, manuscript, or article in order to further analyze the data. For example, Austen Said contains some of Jane Austen’s most popular novels and allows the user to “explore Austen’s pattern of diction” through word frequencies and other visualizations of the novels. Which brings us to…

… Visual representations (maps, graphs, charts, etc.), which are also extremely popular among digital humanities projects. Maps are an interesting and useful tool for visualizing (and, consequently, better grasping) the different distributions of data out there. For example, this map from the Linguistic Landscapes of Beirut (a project by David J. Wrisley) beautifully shows what one would otherwise take hours to learn: where the different occurrences of Arabic, Latin, or mixed scripts appear in a delimited area of Beirut. By using colors to represent each type of script, the author(s) have significantly reduced the reader’s work. Readers no longer have to picture in their minds, while reading a text, where each of these scripts would be found. They are already given the visualization, making it possible for them to immediately start analyzing the no-longer-raw data (e.g. to determine in which region the occurrence of Arabic script is higher than that of Latin script, and to attempt to explain why). Other such projects, which output a map, a chart, or even an interactive graphic, are: Mapping the Republic of Letters, Digital Karnak: Timemap, Ibn Jubayr.

The third theme I have noticed occupying a large space in the digital humanist world is the research project based on user input. The primary purpose of this kind of project, before diving into data analysis, is gathering data from its users (the general public). For example, Zooniverse asks its visitors to help recognize faces of wild animals, which would probably lead to the development of an AI tool that would do that for us, but which currently lacks the database to operate in such a way. This type of project is valuable in the sense that it familiarizes users with the problems and topics digital humanists are studying and involves them in the process. This could easily mean that, once people offer their input, they will also be interested in checking the progress and completion of the project, and in supporting it all the way through, something that happens less often or not at all when one randomly comes across a digital project online. We tend to look over a finished piece of research, read a little about it, and later almost forget it existed, unless we need it for other purposes.

This novelty is recognized as the so-called social turn in scholarship because the impact of the new research methods is, well, social. By engaging the general public in their research projects and problems, scholars benefit in two ways. First, they gather the data necessary for the development of the research; second, they disseminate its process and results in real time. This dissemination happens because the users are the creators of their small piece of data (thus, they already know that much), but also because, more often than not, the user will be curious to find out about the outcome of a project they took part in. At the same time, the user benefits from the status of ‘collaborator’ and from a feeling of accomplished social responsibility.

As we have seen, there is a lot happening in the field of digital humanities, yet the processes are not always visible or engaging for the general public beyond the period in which they have a particular interest in the topic explored. Let’s hope that, as the subject gains wider academic recognition, people will also become better acquainted with it.