Two SIMS graduate fellows will be trained to use HTR models in the eScriptorium platform to transcribe texts from manuscripts housed in the Kislak Center for Special Collections, Rare Books and Manuscripts. These transcriptions will appear in the Global Medieval & Early Modern Digital Library. The Manuscript Collections As Data Research Group will support this work by developing technical infrastructure and workflows, with the aim of establishing a reproducible approach for students to use HTR.
Manuscript Collections As Data Research Group
A team from the Schoenberg Institute for Manuscript Studies (SIMS), Cultural Heritage Computing, and Research Data and Digital Scholarship (RDDS) is working to facilitate Handwritten Text Recognition (HTR) to transcribe research materials from Penn Libraries’ digitized manuscript collections using AI technologies.

Much of the world’s cultural heritage is handwritten and legible by humans but, until recently, not by machines. We are working to build a sustainable, scalable, and diverse community of practice in Handwritten Text Recognition (HTR) by bringing together interested people from across Penn and the world. HTR uses AI to transcribe handwritten texts and can make corpora of manuscript materials searchable, help us discover unique texts, and turn handwritten texts into data for linguistic, historical, and quantitative analyses.
Vision
Machine learning-driven HTR and image recognition: The RDDS and SIMS teams will advance conversations and projects that use Handwritten Text Recognition (HTR), with a particular focus on less-represented languages/scripts and on leveraging Penn's collections both on campus and in the broader manuscript studies community. We will publish our models if we see an opportunity for others to reuse our work and will offer opportunities for scholars to learn and use these tools with their own corpora, thus empowering cutting-edge research in line with the Penn Libraries Strategic Priorities.
Projects
We are working and consulting with the Linguistic Data Consortium on multilingual HTR process workflows, shared student training material, and scalable model pipelines that empower global cultural heritage research communities to transcribe and analyze handwritten texts across diverse languages and scripts.

In October 2024, the Penn Libraries and South Asia Studies presented a space for technology orientation, where participants shared a nuanced and informed understanding of the possibilities and limitations of critical digital humanities tools, particularly Computational Text Analysis (CTA) of content found in manuscripts, inscriptions, maps, and other historical documents.
Workshop page Watch video recording Day 1
Watch video recording Day 2 Blog Post
About Us
The Manuscript Collections as Data Research Group is a collaboration among the Schoenberg Institute for Manuscript Studies, Research Data and Digital Scholarship, and Cultural Heritage Computing.
Our Resources
Reflections on South Asia Studies Digital Humanities Workshop
On October 10–11, 2024, the South Asia Studies Digital Humanities Workshop (SASDHW) convened scholars, librarians, and technologists for two days of collaborative learning on multilingual and computational text analysis of South Asian sources.

Computational Analysis of Visual Features from Digitized Manuscripts
Hussein Adnan Mohammed of the Visual Manuscript Analysis Lab, Centre for the Study of Manuscript Cultures, University of Hamburg, discusses how computer vision offers opportunities beyond handwritten text recognition to address complex challenges in manuscript studies. (February 28, 2025) Recording TBA.
A Thousand Scripts, One Model: Transcribing 19th-Century Penn Medical Dissertations using Handwritten Text Recognition
With so much talk around artificial intelligence—both the challenges and the immense potential it holds for higher education—libraries and library staff are increasingly asking how to harness this technology to support research and promote access to collections. A recent project at Penn Libraries explores how AI tools can help answer new questions about centuries-old manuscripts.

Running eScriptorium on a Mac
Former Digital Scholarship Programmer Andy Janco shares a short tutorial showing how to run eScriptorium locally on a Mac.

How to Transcribe a Million Manuscripts with eScriptorium
Peter Stokes of the École Pratique des Hautes Études, Université Paris Sciences et Lettres, discusses the possibilities and challenges of applying machine-learning technologies to the transcription of potentially millions of manuscript images using the eScriptorium platform. (December 1, 2023)
