Manuscript Collections As Data Research Group

A team from the Schoenberg Institute for Manuscript Studies (SIMS), Cultural Heritage Computing, and Research Data and Digital Scholarship (RDDS) is working to facilitate the use of Handwritten Text Recognition (HTR), an AI technology, to transcribe research materials from Penn Libraries’ digitized manuscript collections.

Much of the world’s cultural heritage is handwritten and legible by humans but, until recently, not by machines. We are working to build a sustainable, scalable, and diverse community of practice in Handwritten Text Recognition (HTR) by bringing together interested people from across Penn and the world. HTR uses AI to transcribe handwritten texts and can make corpora of manuscript materials searchable, help us discover unique texts, and turn handwritten texts into data for linguistic, historical, and quantitative analyses.

Vision

Machine learning-driven HTR and image recognition: The RDDS and SIMS teams will advance conversations and projects that use Handwritten Text Recognition (HTR), with a particular focus on less-represented languages and scripts and on leveraging Penn's collections both on campus and in the broader manuscript studies community. We will publish our models where we see an opportunity for others to reuse our work, and we will offer opportunities for scholars to learn these tools and use them with their own corpora, thus empowering cutting-edge research in line with the Penn Libraries Strategic Priorities.

Projects

Two SIMS graduate fellows will be trained to use HTR models in the eScriptorium platform to produce transcriptions of texts from manuscripts housed in the Kislak Center for Special Collections, Rare Books and Manuscripts. These transcriptions will appear in the Global Medieval & Early Modern Digital Library. The Manuscript Collections As Data Research Group will support this work by developing technical infrastructure and workflows, with the aim of establishing a reproducible approach for students to use HTR.

We are working and consulting with the Linguistic Data Consortium on multilingual HTR workflows, shared student training materials, and scalable model pipelines that empower global cultural heritage research communities to transcribe and analyze handwritten texts across diverse languages and scripts.

A man pointing to a projection with two windows. The window at the back shows an image of a Devanagari script manuscript. The window in front is black, with green machine-readable text of the Devanagari manuscript.

Dr. Andrew Ollett at the SASDHW workshop showcasing Sanskrit HTR using Google Cloud Vision

In October 2024, the Penn Libraries and South Asia Studies presented a space for technology orientation, where participants shared a nuanced and informed understanding of the possibilities and limitations of critical digital humanities tools, particularly Computational Text Analysis (CTA) of content found in manuscripts, inscriptions, maps, and other historical documents.

Workshop page Watch video recording Day 1

Watch video recording Day 2 Blog Post

Conferences Attended

Syriac Transcribathon

Platform with 2 images of a manuscript page with 3 columns of text and the extracted text

Screenshot from eScriptorium showing a page from Vatican Syriac manuscript Vat.sir.111, displaying side-by-side page, segmentation, and transcription views.

Dates: March 25-28, 2025

Attendees: Jajwalya Karajgikar, Dot Porter, Jessie Dummer, Doug Emery, Evan Ditter, Nikitas Tampakis.

Goal: Produce data to be used for improving automated handwritten text recognition (HTR) for manuscripts written in the Syriac language.

During the event, participants corrected line and page segmentation for 1,008 page images from 106 manuscripts and completed transcriptions for 100 images from 38 manuscripts. This work contributed to a preliminary model achieving 96% accuracy (statistics courtesy of Christine Roughan, Princeton Center for Digital Humanities).
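For readers unfamiliar with how HTR accuracy is measured: one common convention is character accuracy, defined as 1 minus the character error rate (CER), where CER is the edit distance between a model's output and a human-corrected reference, divided by the reference length. The sketch below illustrates that convention only. It is an assumption that a figure like the one above could be computed this way, not a description of how the event organizers calculated it, and the function names and sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, where CER = edit distance / reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a four-character line.
print(character_accuracy("abcd", "abxd"))
```

Under this convention, a score of 0.96 would mean roughly four character edits are needed per hundred reference characters.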

Workshop Link

TranscriboQuest

Dates: September 11-13, 2024

Attendees: Jessie Dummer

Goal: Hands-on experience with historical manuscripts in Latin, Arabic, Greek, and potentially Hebrew scripts.

Workshop Link

Source Codes of the Past: Launching an international ATR/HTR Network for Manuscript Analysis

Medieval Literary Documentation

person standing behind podium with mic next to a screen with colorful circles and text

Dates: June 12-13, 2025

Attendees: Jajwalya Karajgikar, Jessie Dummer, Doug Emery

Goal: Share knowledge among humanities and social science scholars, software engineers, and machine learning researchers so that technological and humanistic expertise might mutually inform one another.

Workshop Link

About Us

The Manuscript Collections as Data Research Group is a collaboration among the Schoenberg Institute for Manuscript Studies, Research Data and Digital Scholarship, and Cultural Heritage Computing.

Our Resources

Penn Libraries Presents: Handwritten Text Recognition Projects in the Library

A blog post by Yifang Xia, 2025 RDDS Data Science & Society Research Assistant. It follows her experience segmenting LJS 433, an illustrated treatise from ca. 1600 on the diagnosis of abscesses and tumors and their treatment, mostly through acupuncture or burning substances near the skin. The manuscript was copied in Japan in Chinese for Japanese practitioners. Yifang picked it for its irregular handwriting, which makes text recognition more challenging than it is for neater materials.

Penn Libraries Presents: Handwritten Text Recognition Projects in the Library

Presentation of the results of our work so far, demonstrating the research value of using HTR at Penn Libraries.

Watch the recording of the talk (PennKey required)

Reflections on South Asia Studies Digital Humanities Workshop

On October 10–11, 2024, the South Asia Studies Digital Humanities Workshop (SASDHW) convened scholars, librarians, and technologists for two days of collaborative learning on multilingual and computational text analysis of South Asian sources.

Computational Analysis of Visual Features from Digitized Manuscripts

Hussein Adnan Mohammed of the Visual Manuscript Analysis Lab, Centre for the Study of Manuscript Cultures, University of Hamburg, discusses how applications of computer vision offer opportunities beyond handwritten text recognition to address complex challenges in manuscript studies. (February 28, 2025) Recording TBA.

A Thousand Scripts, One Model: Transcribing 19th-Century Penn Medical Dissertations using Handwritten Text Recognition

With so much talk around artificial intelligence—both the challenges and the immense potential it holds for higher education—libraries and library staff are increasingly asking how to harness this technology to support research and promote access to collections. A recent project at Penn Libraries explores how AI tools can help answer new questions about centuries-old manuscripts.

Running eScriptorium on a Mac

Former Digital Scholarship Programmer Andy Janco shares a short tutorial showing how to run eScriptorium locally on a Mac.

How to Transcribe a Million Manuscripts with eScriptorium

Peter Stokes of the École Pratique des Hautes Études, Université Paris Sciences et Lettres, discusses the possibilities and challenges of applying machine-learning technologies to the transcription of potentially millions of manuscript images using the eScriptorium platform. (December 1, 2023)

Watch a recording of the talk

A screenshot of a manuscript page being segmented with Peter Stokes in the corner
