
Penn Libraries News

Reading manuscripts in the digital space: How the Penn Libraries is advancing Handwritten Text Recognition

Staff and fellows are using the platform eScriptorium to build machine-learning models that can transcribe handwritten manuscripts from across the world.

A scanned page from a handwritten manuscript featuring ornate calligraphy at the top, several geometric diagrams on the left, including circles and labeled lines, and dense cursive text in Italian or Latin on the right.

When a scholar visits an archive or a rare book library, they might come into contact with hundreds of pages of manuscripts, letters, and ephemera that were written by hand. Depending on the time and place the historical documents come from, the handwriting may be highly stylized, difficult to read, and challenging to translate. Additionally, making connections among these disparate documents can be tricky and time-consuming.

What if, instead of parsing the handwriting themselves, these scholars could read a typed transcript of the manuscript? Better yet, what if they could quickly search the manuscript for particular words or phrases? With such a capability, a historian could quickly find every mention of Benjamin Franklin in a set of letters, a religious studies scholar could more easily determine the popularity of certain rites and rituals, or a linguist could track the use of specific vocabulary over decades and hundreds of documents.

While a future in which a written document is as easy to keyword search as this webpage is a long way off, staff and research fellows at the Penn Libraries are working towards that future. With the help of eScriptorium, a platform for the transcription of historical materials, they are using manuscripts from the Libraries' collections to build machine learning models — a type of AI — that can transcribe handwritten text. This process is called Handwritten Text Recognition, or HTR. By doing so, they are helping the Libraries expand access to Penn's cultural heritage and transform physical archives into rich data resources that can be explored and analyzed by the global research community.

Exploring Handwritten Text Recognition at Penn

Leading the project is Digitization Project Coordinator Jessie Dummer, who began exploring eScriptorium after learning how Princeton University was using the platform. She brought the possibility of doing something similar at Penn to the Libraries' Manuscript Collections as Data Research Group, run by Jajwalya Karajgikar, Applied Data Science Librarian, and Dot Porter, Curator of Digital Humanities.

"The Manuscript Collections as Data Research Group has been talking about lots of different technologies that could be applied to collections," said Dummer. "We started thinking about how HTR might be applied to our work in the library, our collections, and the research that our patrons wanted to do."

Last fall, after setting up eScriptorium with the help of computing power from the School of Arts and Sciences General Purpose Cluster, they brought on Eleanor Webb and Priyamvada Nambrath, both historians deeply familiar with handwritten manuscripts, as research fellows through the Schoenberg Institute for Manuscript Studies. Over the past eight months, the two fellows have been working on the complex task of "teaching" a computer to read two very different historical manuscripts: a 17th-century Italian mathematics manual and an 18th-century philosophical work from India, written in Sanskrit.

Ground truth, masks, and more

Developed at the Paris Sciences et Lettres University in 2018, eScriptorium, which is built on top of the OCR software Kraken, allows people to manually transcribe documents, automatically create transcripts of documents using HTR and OCR models, and create their own models to help them transcribe unique handwritten documents. It is one of a handful of similar tools that librarians, archivists, and scholars have been experimenting with as they try to improve handwritten text recognition capabilities. For example, PaddleOCR is a similar program that happens to be more commonly used by scholars working with Chinese manuscripts.

A zoomed-in interface view labeled “Line #11” shows a narrow strip of a historical handwritten manuscript in Italian. The handwritten line is highlighted against a faded background of adjacent text. Below the image, a transcription field displays the typed text: “è dalla massima declinazzione dl sole a quella dl Eauinazzio, cioè tanta.” Navigation arrows, a keyboard icon, and a close button appear at the top of the interface, indicating a text transcription or annotation tool.
Transcribing a 17th-century Italian manuscript, Oversize Ms. Codex 1663.

As former Digital Projects Intern Evan Ditter described in a 2024 blog post, "With a platform like eScriptorium, a user essentially 'teaches' the computer to read a certain type of handwriting by presenting images of a manuscript along with transcriptions of the text formatted in a way that the computer can interpret. From this data, referred to as training data, the computer performs calculations, and identifies visual patterns, in order to produce a 'model,' a program designed to predict similar patterns in other, previously unseen texts."

Often referred to as creating "ground truth," this process of "teaching" a computer model to "read" a document became the focus of Webb and Nambrath's work.

To begin, they had to create segmentations for their manuscripts: a sort of map that explains how each page is laid out. Segmentation answers questions like: Where are the margins? How is the text oriented? How many columns or regions of text appear on the page? How can you tell when text was written in the margins versus appearing in the main body of the page? Where does one line of text end and another begin?
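
As a rough illustration of what a segmentation records, the sketch below models a page region and its lines, then sorts the lines into reading order. The class names, the region labels, and the top-to-bottom ordering rule are assumptions made for this example; they are not eScriptorium's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    # Points (x, y) in pixels tracing the bottom of the writing, left to right.
    baseline: list[tuple[int, int]]
    text: str = ""


@dataclass
class Region:
    kind: str  # hypothetical labels such as "main" or "margin"
    box: tuple[int, int, int, int]  # left, top, right, bottom in pixels
    lines: list[Line] = field(default_factory=list)


def reading_order(region: Region) -> list[Line]:
    """Order lines top to bottom by the y-coordinate where each baseline starts."""
    return sorted(region.lines, key=lambda line: line.baseline[0][1])


# Two lines entered out of order, as might happen during manual segmentation.
page_body = Region(
    kind="main",
    box=(0, 0, 1000, 1400),
    lines=[
        Line(baseline=[(50, 300), (950, 305)], text="second line"),
        Line(baseline=[(50, 120), (950, 118)], text="first line"),
    ],
)
ordered = [line.text for line in reading_order(page_body)]
```

A simple top-to-bottom sort only works for a single block of horizontal text. Pages with diagrams, marginal notes, or multiple columns need the line order reviewed by hand, which is part of why this step takes time.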

A scanned page from a historical manuscript showing handwritten Italian text arranged in horizontal lines, with colored overlays highlighting text regions and many yellow numbered circles marking specific points. In the center are two illustrated hands, palms facing up, each covered with numbered markers indicating areas of the fingers and palm. On the right is a separate panel with a typed transcription of the manuscript text in Italian, shown line by line with numbered lines.
Webb's segmentation and line ordering of Oversize Ms. Codex 1663.

Once they created their segmentations, Webb and Nambrath began the slow trial-and-error process of running an HTR model, reviewing the resulting transcript, correcting errors, and running the model again, over and over. Their goal is to eventually create models that can transcribe their manuscripts with at least 90 to 95 percent accuracy.
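
The article does not say which accuracy measure the team uses. One common choice for transcription work is character accuracy: count the minimum number of single-character edits needed to turn the model's output into the corrected transcript, divide by the length of the corrected transcript, and subtract from one. The sketch below assumes that definition, and the sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(predicted: str, ground_truth: str) -> float:
    """One minus the character error rate, floored at zero."""
    if not ground_truth:
        return 1.0 if not predicted else 0.0
    error_rate = levenshtein(predicted, ground_truth) / len(ground_truth)
    return max(0.0, 1.0 - error_rate)


# Hypothetical example: the model output has one extra letter.
corrected = "dalla massima declinazione"
model_output = "dalla massima declinazzione"
score = character_accuracy(model_output, corrected)
```

Under this measure, a 90 to 95 percent target still leaves roughly one wrong character in every 10 to 20, which is why the fellows review each transcript closely.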

"It's a slow process," said Nambrath. "Even after you define the regions and the masks and the lines, the model doesn't read the manuscript perfectly every time. You have to really go through the transcript with a fine-tooth comb to make sure that it caught everything."

Limits and challenges

The project team is quick to point out that eScriptorium's ability to transcribe a historical document, especially a complex one, is only as good as the model. Like any platform built on machine learning or artificial intelligence, it is reliant on the data it has available. In practice, this means that if a lot of scholars have used it to transcribe 17th-century Italian handwriting — and have taken the time to make corrections and ensure that the transcription is accurate — the models produced by the platform will have an easier time transcribing similar manuscripts.

If, on the other hand, eScriptorium encounters a language, handwriting style, or even manuscript layout that it's less familiar with, the resulting transcript won't be very useful. This has made Nambrath's effort to transcribe a Sanskrit manuscript particularly tricky.

"We call them digitally-disadvantaged languages," explained Nambrath. Many South Asian languages don't even have very good Unicode standards, which means that computers can't always recognize them on a character-by-character level.

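To give a sense of why character-by-character handling is harder for Indic scripts than for the Latin alphabet, the sketch below uses Python's standard unicodedata module to list the code points in the Devanagari word for "Sanskrit." It does not reproduce the specific gaps Nambrath describes; it only shows that what a reader sees as three syllable clusters is stored as seven code points, three of which are combining marks that never appear on their own.

```python
import unicodedata

# The word "Sanskrit" in Devanagari, written as explicit code points.
WORD = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924"


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


def count_marks(text: str) -> int:
    """Count combining marks, whose general category starts with 'M'."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))


for code_point, name, category in describe(WORD):
    print(code_point, category, name)
```

A model that outputs one code point at a time has to get both the base letters and the attached marks right, in the right order, for the transcript to display and search correctly.
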
"There are a lot of projects out there in Sanskrit, but there wasn't really a good model we could find to import into eScriptorium," added Dummer. "So we've been working to create a model from [Nambrath's] ground truth that people could use there."

Karajgikar emphasized that creating such a model could have a massive impact on the South Asia Studies scholarly community. "Learning that these tools were not inclusive of Unicode characterizations for languages in the South Asian context really lit a fire in my belly," she said. "There are more than 3,000 South Asian manuscripts here at Penn, which is one of the largest collections in North America. That in and of itself means that there is an access-oriented need for these manuscripts." These challenges inspired her to co-organize a digital humanities workshop for South Asia Studies scholars in 2024 that explored the opportunities HTR technology offered the field.

A scanned page from a historical illustrated manuscript is shown with digital annotation overlays. On the left side there is a hand-drawn bird with a curved beak, patterned wing feathers, and a long, flowing tail. Below and around it are small areas of handwritten text. On the right, multiple lines of handwritten text in an Indic script are arranged vertically. Blue lines underline each line of text, and purple rectangular boxes outline text blocks and margins.
Nambrath's region and line segmentation of an 18th-century South Asian manuscript, Ms. Coll 390 Item 1914.

eScriptorium has another limitation: its user interface isn't designed to work with languages that are written vertically. Karajgikar and her research assistant, Yifang Xia, discovered this when attempting to transcribe Yōso zusetsu (廱疽圖説), an illustrated treatise on diagnosing and treating abscesses and tumors written in Chinese.

Regardless of the type of manuscript, using a program like eScriptorium still requires significant commitment from scholars. "This process did reveal to me that we're a long way off from making these tools accessible to the typical researcher," said Webb. "Even after taking part in this fellowship, I still feel far away from integrating this tool into my everyday research life."

The future of HTR

Nambrath and Webb will be finishing their fellowships at the end of the semester, but the Penn Libraries plans to continue to invest in eScriptorium and improve its usability. Next fall, Dummer hopes to welcome two more fellows who will select two new manuscripts for transcription. She and the Schoenberg Institute for Manuscript Studies are accepting applications for these fellows through May 1.

Keyword-searchable handwritten manuscripts that are easy to make and easy to use remain a long way off, but the team is excited for that day to come. "Eventually, we could put these transcripts into our digital collections platforms, and we could provide some semblance of searchable PDFs for all our manuscripts," Dummer suggested. "That would be really cool."

They are also imagining the long-term impacts of machine learning on librarians and scholars.

"I joined this project because I was interested in what this technology means for my research," said Webb. "But the real takeaway for me is that humanists and librarians have a lot to contribute to the conversation about AI and machine learning: what it can do, what it can't do, and how we, as a society, should be approaching it. How do we tackle these new technologies in a way that is productive? We can't just say no to being part of those conversations."
