In order to follow along with this post, you will need:
- Basic knowledge of the Text Encoding Initiative guidelines for marking up texts.
- Understanding of the structure of XML and the basics of XPath.
- Some experience with Regular Expressions is helpful, but not necessary.
- A willingness to learn R!
A few months ago, I started working through Matt Jockers’ Text Analysis with R for Students of Literature. I wanted to improve my text analysis skills, especially since I knew we would be acquiring the EEBO-TCP phase II texts, which contain text data for thousands of early modern English texts (if you are an FSU student or faculty member and you want access to these files, email me). To start, I decided to do some analysis on Holinshed’s Chronicles, which are famous for their impact on Shakespeare’s history plays. While I have been able to create a few basic analyses and visualizations with this data, I’m still learning and expanding my understanding of R. If you ever want to work through some of the ins-and-outs (or would prefer an in-person consultation on R), you should attend the Percolator from 3-5 on Wednesdays in Strozier or email me to schedule a consultation. We will also be holding a text analysis workshop from 10-11 on April 14.
I am going to be working from two of the EEBO-TCP phase I texts, since these are currently open access. You can download the entire phase I corpus in SGML format here: https://umich.app.box.com/s/nfdp6hz228qtbl2hwhhb. I’ve used a stylesheet generated by the TEI council to transform the files into TEI P5-compliant XML files. You can get the example files on my GitHub page (along with the finalized code). Alternately, you can get all of the P5-compliant TEI files directly from the Text Creation Partnership GitHub.
If you want to follow along with this blog post, do the following:
Step 1. Get your texts. Go to my GitHub page and download holinshed-v1.xml and holinshed-v2.xml. Put them in a directory that you can easily find (I have mine on my desktop in a directory called “holinshed” within another directory called “eebo_r”).
Step 2. Download R and R Studio, as outlined in our Text Analysis libguide.
Step 3. Set Working Directory. Open RStudio, and type setwd(""), where the path to the folder you created goes inside the quotes. On a Mac, your path will likely look something like this:
setwd("~/Desktop/eebo_r")
And on Windows it will look something like:
setwd("C:/Users/scstanley/Desktop/eebo_r")
(Note that you shouldn’t use single “\” characters in Windows filepaths, even though that is the Windows standard. The backslash is an escape character in R, so use forward slashes instead.)
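If you would rather keep Windows-style backslashes, you can double them instead, since R reads "\\" as a single literal backslash:

setwd("C:\\Users\\scstanley\\Desktop\\eebo_r")

Either form points to the same directory.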
You can type this either into the script pane or into the console. My script pane is on the top left, but yours may be somewhere else within your RStudio environment. To run the current line from the script pane, hit ctrl+enter (cmd+enter also works on a Mac). Note: I am using the script pane to edit my code, and hitting ctrl+enter to have it run in the console. If you just want to run your code in the console without saving it as a script, you can type directly into the console.
Step 4. Install the XML and text mining packages. Go to Tools > Install Packages and type “XML” (all uppercase) into the Packages text field. Click “Install.” Do the same with “tm” (all lowercase). You could also enter install.packages("XML") and install.packages("tm") into your console with the same effect.
Step 5. Now that you have the XML and text mining package installed, you should call them into the session:
library(XML)
library(tm)
Again, hit ctrl+enter.
Now you’re ready to get started working with R!
Remember from the beginning of this post that I created a directory within my working directory (“~/Desktop/eebo_r”) to store the files I want to analyze. I called this directory “holinshed”. I am going to create an object called `directory` that references that filepath. To do this, I’m going to use the assignment operator (`<-`), which is used constantly in R to give a complex or verbose value a shorter name. In this case, we will say:
directory <- "holinshed"
Now, we want to get all of the files within that directory:
files <- dir(path=directory, pattern=".*xml")
This line of code creates another object called “files,” which looks inside the directory we stored in the “directory” object and collects the names of all the files whose names match the pattern “.*xml” (that is, all of the XML files).
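If you want to confirm what was matched, print the object in the console. Assuming the directory contains only the two example files, you should see something like:

> files
[1] "holinshed-v1.xml" "holinshed-v2.xml"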
This is where things can get a little confusing if you don’t understand XML and XPath. For a basic overview, you can take a detour to my presentation on TEI from the Discover DH workshop series, which contains an overview of XML.
What you will need to know for this exercise is that XML structures are perfectly nested and hierarchical, and you can navigate up and down that hierarchy using XPath. If XML is like a tree, XPath is your way of moving up and down branches to twigs, jumping to other branches, or going back to the trunk.
For the purposes of this assignment, I am interested in specific divisions within Holinshed’s Chronicles—specifically, the ones that are labelled “chapter” and “section” by the encoders of the EEBO-TCP texts. The way that I would navigate from the root of the document to these two types of divisions is with the following XPath:
/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']
(find me all the divisions with a value of “chapter” on the type attribute AND find me all the divisions with the value of “section” on the type attribute.)
Out of the box, R cannot parse XPath, but the XML package that you installed at the beginning will allow you to select only those pieces from your documents.
Now we need to get the XML content out of the two files in our “holinshed” directory. To do this, we will need to create a for loop. To start, create an empty list.
documents.list <- list()
This gives us a place to store the results each time the for loop finishes a pass and goes back to the beginning. Without the empty list, the content would keep overwriting itself, so at the end you would only have the last object. For example, I once made the mistake of not creating an empty list for my for loop, and I kept getting only the divisions from the second volume of Holinshed’s Chronicles, because the second volume was overwriting the first.
Our for loop is now going to take every file in the “holinshed” directory and do the same thing to it. We begin a for loop like this:
for(i in 1:length(files)){
  # the rest of the code goes here
}
This basically says: for every value of i from 1 to the length of the “files” object (in this case, 2), do the following. Also, note that the pound sign marks a comment; anything after it on the same line will not be processed as R code.
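You can also evaluate the loop’s sequence directly in the console to see the values i will take (this assumes our two example files):

> 1:length(files)
[1] 1 2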
Now, within this for loop, we are going to specify what should be done to each file. We are going to create a document object using `xmlTreeParse` for each object within the “holinshed” directory.
document <- xmlTreeParse(file.path(directory, files[i]), useInternalNodes = TRUE)
(If you find it hard to read long code on one line, you can add line breaks. Just make sure each break happens at a point where the statement is obviously unfinished (like after a comma), so that R knows the expression continues on the next line, and indent the continuation for readability. Unfortunately, WordPress isn’t allowing me to provide an example, but you can see how that would look in practice in the example R file provided in my eebo_r GitHub repository.)
The [i] in “files[i]” is where the loop counter is substituted on each pass. So the first loop will use files[1] and the second will use files[2] (which correspond to “holinshed-v1.xml” and “holinshed-v2.xml”). If we had more than two XML files in this directory, the for loop would apply to all of those as well.
Next, you will use the empty list that you created. Define the entry of documents.list that corresponds to files[1] or files[2] (holinshed-v1.xml and holinshed-v2.xml, respectively) as the node set matching the XPath we created above. In other words, for each document, store a list of all of the divisions with a value on @type of “chapter” or “section”:
documents.list[[files[i]]] <- getNodeSet(document, "/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']", namespaces = c(tei="http://www.tei-c.org/ns/1.0"))
Ignore namespaces for now. They are important to understanding XML, but as long as you don’t have documents that contain multiple XML languages, you won’t need to worry as much about it. I can discuss the function and importance of namespaces in another post.
So, in the end, your full for loop will look like this:
for(i in 1:length(files)){
  document <- xmlTreeParse(file.path(directory, files[i]), useInternalNodes = TRUE)
  documents.list[[files[i]]] <- getNodeSet(document, "/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']", namespaces = c(tei="http://www.tei-c.org/ns/1.0"))
}
If you want to run multiple lines of code, you can highlight the entire for loop and hit “ctrl+enter.” Alternately, you can put your cursor at the beginning of the for loop in the script pane and press “option+command+E” on a Mac, or go to the menu and click Code > Run Region > Run From Line to End, to run from that line to the end of the script. This is also useful if you ever save an R script, come back to it later, and want to start from where you left off. This way you don’t need to go back and run each line individually.
Now you should have a list with two items. Each item on this list is a node set (which is a specialized type of list). Rather than keeping documents.list as two nested lists, I want to give each document its own list. I did it with the following code. See if you can figure out what exactly is happening here:
holinshed1.l <- documents.list[[1]]
holinshed2.l <- documents.list[[2]]
Now that I have two separate lists for each document, I want to concatenate them into a single list of divisions. In R, you use `c` to concatenate objects:
both.documents <- c(holinshed1.l, holinshed2.l)
Now, if you check `length(both.documents)`, you should get 359. Your console will look like this:
> length(both.documents)
[1] 359
Basically, what this means is that there are a total of 359 divisions in both documents that have a value on type of either “chapter” or “section.”
Now, you are going to want to return all of the paragraphs that are children of these two divisions.* To do this, we are going to need to create another for loop. This time, instead of creating an empty list, we will create an empty vector. I’m going to call this vector paras.lower.
paras.lower <- vector()
I’m going to give you the full code for selecting the contents (text, basically) of all of the paragraphs, and then explain it point-by-point after.
for(i in 1:length(both.documents)){
  paras <- xmlElementsByTagName(both.documents[[i]], "p")
  paras.words.v <- paste(sapply(paras, xmlValue), collapse = " ")
  paras.lower[[i]] <- tolower(paras.words.v)
}
This says: for every value of i from 1 to the length of “both.documents” (which we determined was 359 divisions), do the following:
Create an object called “paras” which will select all of the children of the node set “both.documents” with the tag name of “p.” On each loop, do this for one division within both.documents.
Now create another object (this time a character vector) that takes the content of paras (the text within all the <p> elements, with any nested tags stripped) and collapses it into a single string.
Now take the vector you’ve created (all of the words from each paragraph within each division) and make the characters all lowercase.
This process may seem slightly confusing at first, especially if you are unfamiliar with what each piece is doing. If you are ever confused, you can type ?term into the console, and you will find the documentation for that specific aspect of R. So, for example, if you typed ?sapply, you’d see that sapply applies a given function over a list or vector (so essentially the same thing happens to multiple objects within a vector or list, without you needing to explicitly state what happens to each item).
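For instance, here is a toy sapply call, with made-up values, just to illustrate the pattern:

sapply(list("READ", "EARTH", "HATH"), tolower)

This applies tolower to each item of the list and simplifies the results into the character vector "read" "earth" "hath", which is the same shape of operation as sapply(paras, xmlValue) above.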
Now that you have your character vector with the content of all of the paragraphs, you can start cleaning the text. The one problem is that paras.lower contains 359 separate strings that need to be combined into one. You can do this by using the paste() function we used in the last few lines.
holinshed.all <- paste(paras.lower, collapse=" ")
Now, if we ask for the length of holinshed.all, we see that it returns 1, instead of 359.
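In the console:

> length(holinshed.all)
[1] 1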
Now, we are going to use the tm package that we installed at the beginning. This package can facilitate a lot of types of analysis that we won’t cover in this post. We are going to simply use it to easily remove stopwords from our texts. Stopwords are commonly-occurring words that we may not want to include in our analysis, such as “the”, “a”, “when”, etc.
To do this, you are first going to create a corpus from your holinshed.all vector:
holinshed.corpus <- Corpus(VectorSource(holinshed.all))
Now you will remove stopwords from this corpus. You can use the following code to remove all English stopwords:
holinshed.corpus <- tm_map(holinshed.corpus, removeWords, stopwords("english"))
However, with a corpus this big, R will run very slowly (it will likely take upwards of 10 minutes to remove all the stopwords from your corpus). If you want to let it run and take a break here, feel free to do so. However, if you are impatient and would prefer to continue on right now, I have a premade text corpus in my R GitHub repository, which you can use instead of following the next step.
If you do want to remove the stopwords by yourself, run the above code, grab yourself a cup of coffee, work on some other writing projects for a bit, take a nap—whatever suits you best. Once the stopwords are removed, you will see a “>” once again in your console, and you can then type in
writeCorpus(holinshed.corpus, filenames ="holinshed.txt")
This will create a file that has all of the content of the paragraphs within the <div>s with the type value of “chapter” or “section” minus the stopwords.
**Impatient people who didn’t want to wait for the stopwords to get removed can start up again here**
Now that you have a text file with all of the relevant words from Holinshed’s Chronicles (holinshed.txt), we are going to analyze the frequencies of words within the corpus.
We are going to use the scan() function to get all of the characters in the Holinshed corpus.
holinshed <- scan("holinshed.txt", what="character", sep="\n")
This line of R will create an object called “holinshed” which contains all of the character data within holinshed.txt (the corpus you just created).
You will once again need to use the “paste” function to collapse all of the lines into one (since the scan() call above split the document at each newline).
holinshed <- paste(holinshed, collapse=" ")
Now you will split this very long line of characters at the word level:
holinshed.words <- strsplit(holinshed, "\\W")
This splits the holinshed string at every non-word character (“\\W” is a regular expression matching anything other than a letter, digit, or underscore). If you attempt to show the first 10 items within holinshed.words (`holinshed.words[1:10]`), you will notice that it gives you a truncated version of the whole document, and then 9 NULLs. This is because strsplit returns a list, and the whole document becomes the first (and only) item on that list. Using unlist(), we can create another character vector:
holinshed.words <- unlist(holinshed.words)
Now, if you enter `holinshed.words[1:10]`, you will see that it returns the first 10 items… but they’re not quite all words. You will notice a number of blank entries, represented by quote marks with nothing between them. In order to remove these, we can say:
holinshed.words <- holinshed.words[which(holinshed.words!="")]
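If you are curious where those empty strings came from, a toy example (with made-up input) shows the cause:

unlist(strsplit("read, earth hath", "\\W"))

The comma and the space after it are two adjacent non-word characters, so the split leaves an empty string between them, returning "read" "" "earth" "hath".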
Now, if you enter holinshed.words[1:10], it will display the first 10 words:
[1] "read" "earth" "hath" "beene" "diuided" "thrée" [7] "parts" "euen" "sithens" "generall"
In order to get the frequencies of the words within our corpus, we will need to create a table of holinshed.words. In R, this is incredibly simple:
holinshed.frequencies <- table(holinshed.words)
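To see what table() does on a small scale, try a toy example (made-up input):

table(c("deer", "abbey", "deer"))  # counts each unique string: abbey 1, deer 2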
Now, if you enter length(holinshed.frequencies), R will return 37086. This means that there are 37,086 unique strings (words) within Holinshed’s Chronicles. However, if you look at the first ten entries in this table (`holinshed.frequencies[1:10]`), you will see that they are not words at all! Instead, the table has also counted numbers. Since I don’t care about numbers (you might, but you aren’t writing this exercise, are you?), I’m going to remove all of the numbers from my table. I determined that the actual alphabetic words start at position 895. So all you need to do is redefine holinshed.frequencies as running from position 895 to the end of the table.
holinshed.frequencies <- holinshed.frequencies[895:37086]
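If you would rather not rely on a hardcoded position (which will differ from corpus to corpus), one alternative sketch keeps only the entries whose names contain at least one letter:

holinshed.frequencies <- holinshed.frequencies[grepl("[a-z]", names(holinshed.frequencies))]

Since we lowercased everything earlier, every real word matches "[a-z]", and the purely numeric strings drop out.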
Now you can sort this frequency table so that the first values of the table are the most frequent words in the corpus:
holinshed.frequencies.sort <- sort(holinshed.frequencies, decreasing = TRUE)
Now you can enter `holinshed.frequencies.sort[1:10]` to return the ten most frequently used words in our Holinshed corpus.
If you want a graphic representation of this list, you can plot the top twenty words (or 15 or 10):
plot(holinshed.frequencies.sort[1:20])
This graph should show up in the right pane of your RStudio environment (unless you have it configured in a different way), and will show you a visual representation of the raw frequencies of words within our corpus.
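If the labels on the default plot are hard to read, you might try a barplot with rotated axis labels instead (an optional variation; the rest of this post doesn’t depend on it):

barplot(holinshed.frequencies.sort[1:20], las = 2)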
Try it on your own!
- We analyzed the top 20 words for the two combined volumes of Holinshed’s Chronicles, but what would our top 20 words look like if we analyzed each text individually?
- If you look closely at the XML, you will notice that our original XPath (/tei:TEI//tei:div[@type='chapter'] | /tei:TEI//tei:div[@type='section']) excludes a lot of content from the Chronicles. Specifically, it ignores any division without those type attributes. Further, using `xmlElementsByTagName` only selects the direct children of the node set, which excludes paragraphs that occur within divisions nested within chapters or sections (see, for example, `<div type="part">`, which occurs occasionally within `<div type="chapter">` in volume I). Write code that selects the contents of all paragraphs.
- Words in the top 20 list like “doo,” “haue,” and “hir” would presumably be picked up by a stopwords list, if they had been spelled like their modern English equivalents. How could you get rid of a few of these nonstandard stopwords? (One possible approach is sketched after this list.)
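As a hint for that last exercise, here is one possible sketch. The vector name and the word list are placeholders; you would want a fuller list of early modern spellings:

early.modern.stops <- c("doo", "haue", "hir")  # hypothetical starter list
holinshed.corpus <- tm_map(holinshed.corpus, removeWords, early.modern.stops)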
Check back to my eebo_r GitHub page for additional R exercises and tutorials using the EEBO-TCP corpus! And if you have any questions about this post or want to learn more about R, schedule a consultation with me.
Notes
* I specifically don’t say that you are looking for all the paragraphs within these divisions, because the code we are about to use only selects children, not descendants. Understanding the difference between these requires some knowledge of XPath and the structure of XML documents.