A Deep Dive into the Dubious Claims of Online Test Proctoring
By Adam Beauchamp
When universities across the United States reacted to the coronavirus pandemic by shifting to remote instruction last spring, many of us quickly adopted new technologies to keep our courses running. Now, as we prepare for another semester of remote instruction, we have an opportunity to reassess these tools and ask ourselves if they still meet our educational needs and comport with our values. In this time of heightened stress and trauma, I suggest that we abandon technologies or practices that create an adversarial relationship between teachers and students. These include plagiarism detection software, technologies that track students’ movements, and classroom policies that privilege compliance over learning, what Jeffrey Moro refers to colorfully and astutely as “cop shit.”
Among the most egregious examples of “cop shit” we’ve allowed to enter our classrooms is online test proctoring. FSU has contracted with Honorlock, one of several companies enjoying increased profits from the emergency pivot to remote instruction. As reported in Inside Higher Ed, an April 2020 poll showed 77% of institutions were either already using or considering an online proctoring service. In that article, Examity, one of Honorlock’s competitors, reported growth 35% above expectations for the fiscal quarter and was struggling to meet demand. An Honorlock press release in March announced that the company raised $11.5 million in venture capital funding amid expectations that “online proctoring [is] growing to a $19 billion market.”
FSU announced the availability of Honorlock on March 19, 2020 as a solution for remote assessment. Online proctoring services were available before the pandemic, but were mostly limited to online degree programs. FSU students encountering Honorlock for the first time this spring decried the invasion of privacy required to use this technology. A petition calling on FSU to stop using Honorlock garnered over 5,000 signatures. The university administration responded with an FAQ website meant to reassure students, but it focused on technical questions of data security and FERPA compliance rather than addressing their privacy concerns.
The privacy and data security issues inherent in online proctoring have been well-documented by digital pedagogy scholars (see Watters and Swauger) and in higher ed journalism (see Flaherty and Kafka). These authors outline very real concerns about how to keep student data secure, who has access, and how long the data are kept. They also question the ethics of requiring students to record their bodies, official IDs, and homes onto platforms that allow these videos to be shared and downloaded by faculty, staff, and third-party contractors. These issues should be reason enough to banish proctoring software from the classroom. But on top of all of this, does online proctoring even work? Can the technology accurately detect cheating? Taking a deep dive into how Honorlock works reveals methodological deficiencies that should convince everyone to abandon online proctoring services altogether.
Questionable Methods
Online proctoring tools function by capturing audio and video of students and their computer screens as they take exams. These data are recorded, stored, analyzed by the companies’ algorithms or staff, and shared with university instructors or staff to adjudicate questions of academic dishonesty. Honorlock claims to use a combination of artificial intelligence (AI) and live proctors to identify and flag suspicious behavior, but how does it transform audio and video into measures of academic dishonesty? Its algorithms are proprietary, so we cannot be certain how they work. In April I reached out to Honorlock requesting more information about how the company trained its AI, but did not receive a response. However, in the training video available for viewing on the FSU Testing Center’s guide to online proctoring, Honorlock associate Leo Bentovim provides some clues.
Based on Bentovim’s presentation, Honorlock’s AI infers dishonesty from body movements, clothing choices, speech, background noises, and certain computer interactions. In the training video, Bentovim identifies moving one’s head away from the camera’s center or out of frame, looking left or right [37:34], and even a yawn [39:41] as possible flags. Hats and long hair are also suspicious [16:30], as are hoodies according to the Honorlock website. The audio algorithm listens for programmed buzzwords, but is also triggered by students reading questions aloud or simply by dogs barking in the background [46:04]. A single right-click of the mouse will also generate a flag [39:41].
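We cannot see Honorlock’s code, but the behaviors Bentovim describes suggest a fairly blunt, rule-based layer of flagging. The sketch below is purely illustrative, with event names, values, and thresholds I have invented for the purpose; it is emphatically not Honorlock’s actual logic. What it shows is how little nuance a system needs in order to label ordinary behavior as suspicious.

```python
# Hypothetical illustration only: a rule-based flagger of the sort the
# training video implies. Event names, values, and thresholds are invented.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    kind: str        # e.g. "head_turn", "audio", "right_click"
    value: float     # e.g. degrees off-center, relative loudness
    timestamp: str   # position in the exam recording

def flag_event(event: Event) -> Optional[str]:
    """Return a flag label if the event trips a rule, otherwise None."""
    if event.kind == "head_turn" and event.value > 30:   # looking away from center
        return "head movement away from screen"
    if event.kind == "audio" and event.value > 0.2:       # any loud-enough sound
        return "background noise or speech"               # a barking dog qualifies
    if event.kind == "right_click":
        return "suspicious mouse activity"                # a single click suffices
    return None

session = [
    Event("head_turn", 45, "37:34"),   # glancing at notes? or stretching a stiff neck?
    Event("audio", 0.6, "46:04"),      # a dog barking in the next room
    Event("right_click", 1, "39:41"),  # copying the question? or an accidental click?
]

flags = [(e.timestamp, label) for e in session if (label := flag_event(e))]
print(flags)  # every one of these ordinary moments becomes a recorded "incident"
```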
These behaviors are not just classified as suspicious, but are measured in real time on a 5-point scale from “very bad” to “very good,” as shown during Bentovim’s demonstration. While Honorlock is active, test-takers can see a live camera view of themselves. In the lower left corner of that live image is a set of bars, resembling cell phone reception, that goes up and down along with text that scores students’ movements as Very Good, Good, Average, Bad, or Very Bad. These accumulated ratings are then consolidated into a single rating for the entire testing session. In the final report sent to course instructors, every student receives a cumulative “Incident Level” rating of High, Medium or Low.
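Honorlock does not publish how those per-moment ratings become a single verdict, so the following is only a sketch of how such a roll-up might work, assuming weights and cutoffs of my own invention. The five movement labels and the three incident levels are the ones shown in the training video; everything numerical is a placeholder. The structural point stands regardless of the exact numbers: a scheme like this has no category for an unremarkable session, because every student must land in High, Medium, or Low.

```python
# Hypothetical consolidation of per-moment ratings into one "Incident Level".
# The labels mirror those shown in the training video; the numeric weights
# and cutoffs below are invented placeholders, not Honorlock's actual values.

MOVEMENT_SCORES = {"Very Good": 0, "Good": 1, "Average": 2, "Bad": 3, "Very Bad": 4}

def incident_level(movement_ratings, flag_count):
    """Collapse a session's ratings and flags into High, Medium, or Low."""
    avg_movement = sum(MOVEMENT_SCORES[r] for r in movement_ratings) / len(movement_ratings)
    score = avg_movement + flag_count      # arbitrary way of combining the two
    if score >= 5:
        return "High"
    if score >= 2:
        return "Medium"
    return "Low"                           # note that there is no "None of the above"

# A student who mostly sat still but whose dog barked twice:
print(incident_level(["Good", "Very Good", "Average", "Good"], flag_count=2))  # Medium
```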
How the algorithms score any given movement on this purported scale of virtue is unclear, but it’s hard to imagine how degrees of movement, changes in facial expression, or sound waves picked up by a computer microphone can be so easily translated into culturally specific, moral judgments of behavior and intent. Anthropologist Clifford Geertz made this very point in his famous discussion of a wink. As he described in The Interpretation of Cultures (1973), a wink is only visible as a muscle contraction of the eyelid, but can be interpreted as an involuntary twitch, a blink, a friendly communication, a flirtation, or a conspiracy. Seeming to anticipate our current technological predicament, Geertz noted, “from an I-am-a-camera, ‘phenomenalistic’ observation of [eye movements] alone, one could not tell which was twitch and which was wink, or indeed whether both or either was twitch or wink. Yet the difference, however unphotographable, between a twitch and a wink is vast” (p. 6). Knowing which of these many possible interpretations is intended requires context and cultural understanding.
Suspect Algorithms
So do algorithms and artificial intelligence, like those used in online proctoring systems, apply cultural understandings? The answer is yes. They are neither neutral nor objective. Rather, they are designed by human programmers and trained on datasets that reflect the categories, assumptions, and values of the societies that created them. In Algorithms of Oppression (2018), Safiya Noble revealed numerous instances when Google’s search algorithm reproduced racism and sexism in its search results and autofill text. Ruha Benjamin, in her 2019 book Race After Technology, described the failures of facial recognition software and even digital soap dispensers to see black faces and skin because the AI was trained on datasets that classify white faces and bodies as the norm.
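To see why the training data matter, consider a deliberately oversimplified sketch, with made-up numbers standing in for whatever features a real system measures. A detector calibrated only on the sample it was trained with will reliably recognize people who resemble that sample and fail on everyone else, without ever reporting that anything went wrong.

```python
# A deliberately simplified illustration (not any vendor's actual model):
# a detector whose acceptance range is calibrated from its training sample.
# If the sample under-represents some users, the range never covers them.

import statistics

def calibrate(training_values):
    """Accept anything within two standard deviations of the training mean."""
    mean = statistics.mean(training_values)
    spread = statistics.stdev(training_values)
    return (mean - 2 * spread, mean + 2 * spread)

def detected(value, acceptance_range):
    low, high = acceptance_range
    return low <= value <= high

# Made-up feature values; the skew in the sample is the point, not the numbers.
skewed_training_sample = [0.78, 0.82, 0.75, 0.80, 0.79, 0.81, 0.77, 0.35]
acceptance = calibrate(skewed_training_sample)

print(detected(0.80, acceptance))  # True: resembles the training majority
print(detected(0.30, acceptance))  # False: under-represented, so effectively invisible
```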
Why is this a concern for online proctoring software? The AI that companies like Honorlock use to identify and flag suspicious behavior may encode similar biases against test-takers. Again, we don’t know the specifics of how the Honorlock AI evaluates data. However, in the training video discussed above, Honorlock associate Leo Bentovim reveals an underlying distrust of students that may be emblematic of similar suspicions built into the software. For example, Bentovim suggests that students with long hair or wearing a hat may be hiding illicit headphones [16:30]. Even while acknowledging the unprecedented demands on our digital infrastructure during the pandemic, he brushes off students’ claims to have Internet connectivity problems during exams, conceding that “some of them may be real” [33:05] but suggesting they could also be excuses from students seeking additional time or expecting a bad grade. He suggests requiring students to use Honorlock’s on-screen calculator instead of a separate device because “obviously the more devices a student has the more likely they’re storing notes; they’re putting things they shouldn’t put in those” [12:06]. The reporting system itself defaults to assigning students varying degrees of dishonesty; there are no innocent test takers in Honorlock if every student receives an “Incident Level” rating of High, Medium, or Low.
If we accept that online proctoring systems may be overzealous in their classifications of academic dishonesty (the more cheating they identify, the more their own product is justified), we can perhaps be reassured that Honorlock and its competitors do not determine which students are guilty of academic dishonesty. That judgment is left to instructors. However, the quantification of student behavior into high, medium, or low incident levels will influence how instructors and testing center staff view student behavior. In her book Automating Inequality (2018), Virginia Eubanks showed just such an effect in the case of social workers in Pennsylvania. There, the Allegheny Family Screening Tool (AFST) generates a score to identify “at-risk” children in need of state intervention. However, while the AFST is meant to support professional decision making, Eubanks discovered that “in practice, the algorithm seems to be training the intake workers” (p. 142). These professionals were found to override their initial judgment and make decisions more closely aligned with the numerical results generated by the AFST.
Finally, while online proctoring is a glaring example of surveillance technology, I am not convinced that Honorlock and its competitors deserve to be included with Google and Facebook as examples of surveillance capitalism. As defined by Shoshana Zuboff, surveillance capitalism is the transformation of human experience into proprietary behavioral data, which are then monetized or used to create new predictive products. Instead, online proctoring technology is perhaps more akin to the lie-detector test, or polygraph. Not unlike online proctoring’s reliance on observable body movements and sounds to identify dishonesty, the polygraph relies on physiological observations (heart rate, blood pressure, respiration, and skin conductivity) to distinguish between true and false statements. However, the American Psychological Association and the National Academy of Sciences have found that polygraph testing lacks validity, and in United States v. Scheffer (1998), the U.S. Supreme Court upheld a rule barring polygraph evidence from judicial proceedings. Despite this, Mark Harris found that lie-detector tests are still used by law enforcement and other government agencies to screen job applicants, fueling a $2 billion industry. The American Polygraph Association, a lobbying group and accrediting body for polygraph training programs, cites a survey of police executives who stated that a polygraph test “reveals information that cannot be obtained by other means…it deters undesirable applicants, and it is faster than other methods of selection.” The polygraph industry is able to profit despite its practices lacking scientific validity because the users of lie-detector tests believe they produce positive results.
Online proctoring companies create a market for their products by convincing educators that dishonesty is widespread and cannot be stopped without an expensive, technological solution. They succeed when they supply the proof that confirms educators’ fears, proof defined and interpreted by these companies’ own proprietary algorithms.
But there is another way.
Instead of adopting technologies of distrust, we can choose what Catherine Denial calls a “pedagogy of kindness,” an approach to teaching that not only believes students, but also believes in students. We should trust students when they tell us that their power went out or a relative is sick. We should also trust in students to be collaborators in the learning process and in how we assess what they have learned.
The Fall 2020 semester is going to be stressful and challenging for all of us, making it more important than ever that we care for one another. This is what I believe FSU Professor Vanessa Dennen has in mind when she advises: “people first, content second, technology third.” Our learning outcomes are important, but we can achieve them without prejudging our students as potential cheaters and adversaries. Both the FSU Testing Center and the Center for the Advancement of Teaching (CAT) acknowledge the limitations of online proctoring and recommend instructors seek alternative means of assessing student learning. Consider exams or projects that ask students to apply the knowledge they are learning rather than simply repeat what they’ve committed to their short-term memories. Let students be creative in how they demonstrate the knowledge and skills you’ve taught them rather than confine their movements to the center of a computer camera lens.
Let us choose to leave surveillance and distrust out of our virtual classrooms and embrace a pedagogy of kindness. And don’t forget your friendly librarians. We’re happy to partner with you.
Adam Beauchamp is Humanities Librarian at FSU Libraries. Adam’s current research interests include critical pedagogy, ethical assessment and research in libraries, and the historical relationship of archives and libraries to colonialism.