Segmenting Ancient Chinese-Japanese Texts for HTR
The following blog post by Yifang Xia, RDDS Data Science & Society Research Assistant 2025, is a log of her work with the Manuscript Collections as Data Research Group. It follows her experience segmenting an illustrated treatise (ca. 1600) on the diagnosis of abscesses and tumors and their treatment, mostly through acupuncture or burning substances near the skin. The manuscript, LJS 433, was copied in Japan in Chinese for Japanese practitioners; Yifang picked it for its irregular handwriting style, which makes text recognition even more challenging than with neater materials.
Ancient Chinese HTR Event Log
My project began with Yōso zusetsu (廱疽圖説), an illustrated treatise on diagnosing and treating abscesses and tumors, copied in Japan in Chinese for Japanese practitioners. Unlike other Chinese manuscripts in the Penn Libraries, this one features irregular handwriting, which makes it especially hard to read.
Phase 1: Ground truth setup
Period 1: Aug 27 — Sept 25
With guidance from Jessie Dummer, the Digitization Project Coordinator, and Jaj, the Applied Data Science Librarian, I first tried eScriptorium. Even with the settings on Han (traditional), centered baseline, and a vertical layout that reads top to bottom and right to left, the platform performed poorly on this manuscript. It failed to recognize vertical text and instead treated the page like horizontal text stacked vertically. So I turned to an open-source OCR model trained for ancient Chinese and chose PaddleOCR from the PaddlePaddle team. It offers many pre-trained models for more than 80 languages and supports both printed and handwritten text.
Two days of setup adventures
It took me two full days to get it running. The early problems included:
Images too large: This was not a model issue. My MacBook’s memory said no.
Version mismatches: Older tutorials and snippets target PaddleOCR 2.5–2.7, while the current release is 3.0.0. Some parameters and APIs changed names or behavior, so the same task sometimes requires different code. See Figure 1 for an example, and the short sketch after this list.
Figure 1_Difference
Vertical detection: Detecting and framing vertical text depends on using the right dictionaries for traditional versus simplified Chinese.
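To give a concrete sense of the kind of change Figure 1 illustrates, here is a sketch reconstructed from what I ran into later in this log (the predict() method and the renamed orientation flags): the same recognition call looks different in the two API generations, so only the stanza matching your installed version applies, and the file name is a placeholder.

from paddleocr import PaddleOCR

# PaddleOCR 2.x style, as seen in older tutorials:
# ocr = PaddleOCR(lang="chinese_cht", use_angle_cls=True)
# result = ocr.ocr("page.jpg", cls=True)

# PaddleOCR 3.x style:
ocr = PaddleOCR(lang="chinese_cht", use_textline_orientation=True)
result = ocr.predict("page.jpg")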
This is the basic architecture of the Python code I kept refining (a simplified sketch of the resize and sorting helpers follows the outline):
Global OCR instance
• We keep a single global OCR instance so we don’t repeatedly initialize PaddleOCR. This saves memory and improves efficiency.
Timeout handling
• A custom timeout mechanism ensures that if OCR or initialization takes too long, the process stops instead of hanging.
• This prevents wasting time on oversized or corrupted inputs.
Safe resize
• Before OCR, oversized images are scaled down to a maximum side length.
• This reduces memory usage, speeds up processing, and prevents crashes.
Detection (core logic)
• detect_boxes() runs PaddleOCR.
• It can work in two modes: with recognition, it detects boxes and also outputs text; without recognition, it detects boxes only.
• Each detected item has polygon coordinates, text content, and a confidence score.
• The results are scaled back to the original image size, which is necessary for drawing bounding boxes on the original image.
Visualization
• Bounding boxes are drawn directly on the image.
• If the image is very large, it’s resized first, and boxes are scaled accordingly.
• Saves a visualization file for quick inspection. Figuring out a way to easily inspect whether we get the expected outcome is crucial.
Vertical right-to-left sorting
• Ancient Chinese manuscript texts are written vertically, right to left. This function sorts the boxes in the correct reading order.
Convert to PAGE-XML
• Generates a PAGE-XML file, which we can import into eScriptorium for manual adjustment.
• Each text line includes:
• Polygon coordinates
• (Optional) recognized text
Main function
• Orchestrates the entire workflow:
• Parse command-line arguments
• Detect boxes
• Save JSON results
• Save visualization
• Export PAGE-XML
• Includes error handling and always clears timeouts at the end.
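To make the outline more concrete, here is a simplified sketch of two of these helpers, safe resize and vertical right-to-left sorting. It reconstructs the idea rather than reproducing my exact code; the maximum side length and the column-gap threshold are illustrative values.

import numpy as np
from PIL import Image

def safe_resize(image, max_side=2500):
    """Scale oversized images down so the longest side stays within max_side."""
    scale = max_side / max(image.size)
    if scale >= 1.0:
        return image, 1.0                      # already small enough
    new_size = (int(image.width * scale), int(image.height * scale))
    return image.resize(new_size, Image.LANCZOS), scale

def sort_vertical_rtl(boxes, column_gap=40):
    """Sort 4-point polygons into vertical, right-to-left reading order."""
    def center(poly):
        pts = np.asarray(poly, dtype=float)
        return pts[:, 0].mean(), pts[:, 1].mean()

    boxes = sorted(boxes, key=lambda b: -center(b)[0])   # rightmost column first
    columns, current = [], []
    for box in boxes:
        if current and abs(center(box)[0] - center(current[-1])[0]) > column_gap:
            columns.append(current)            # a new column begins
            current = []
        current.append(box)
    if current:
        columns.append(current)

    ordered = []
    for col in columns:                        # within a column, read top to bottom
        ordered.extend(sorted(col, key=lambda b: center(b)[1]))
    return ordered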
Day one results
I got no usable output at first. The code looked fine, but the model did not behave. I instrumented the pipeline to log outputs at each stage. PaddleOCR’s docs say detection should return something like [[[x1,y1], [x2,y2], ...], (text, confidence)], but what I saw looked like token-like single characters without proper bounding boxes. My guess is that this came from an unstable function in the new release. The mismatch meant my parser could not proceed.
Rolling back to 2.6 or 2.7 did not help because those builds conflicted with my current Python environment. They wanted an older NumPy that did not play nicely with my other packages. End of day one scorecard: PaddleOCR not yet reliable.
Day two plot twist
I tested Tesseract and got poor results. Then I retried PaddleOCR 3.0.0 with the same code as before. It worked. No changes on my end. The model simply decided to cooperate. Figures 2 and 3 show the contrast.
Figure_2 The Result of Tesseract
Figure_3 The First Result of Paddle
At first, detection still missed many lines. I lowered the text threshold, increased the unclip ratio, and enabled dilation to widen the detection window. The second run was excellent. All lines were detected correctly. See Figures 4 and 5.
Figure_4 Adding Parameters
Figure_5 The Second Result of Paddle
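For reference, this is roughly how those detection knobs are passed to the model. I am writing them with the PaddleOCR 2.x-style parameter names (det_db_thresh, det_db_unclip_ratio, use_dilation) because newer releases keep renaming them, and the values below are illustrative rather than my exact settings; check your installed version's signature before reusing them.

from paddleocr import PaddleOCR

ocr = PaddleOCR(
    lang="chinese_cht",        # traditional Chinese dictionary
    det_db_thresh=0.2,         # lower text threshold: keep fainter strokes
    det_db_unclip_ratio=1.8,   # expand each detected box outward
    use_dilation=True,         # widen the detection window
)
result = ocr.ocr("page.jpg")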
Pages with illustrations and show-through
Illustrations interrupted text detection, and some scans suffered from show-through where content from the next page bled into the current one.
For some pages, increasing the unclip ratio helped (see Figures 6 and 7).
Figure_6 Page4 with unclip = 1.6
Figure_7 Page4 with unclip = 1.8
For others, unclip no longer helped, so I used CLAHE for contrast enhancement (Figures 8, 9, and 10).
Figure_8 Conduct the preprocessing
One gotcha: PaddleOCR expects RGB images. If you do any preprocessing that changes the mode, convert back to RGB before detection.
Figure_9 Page5 before CLAHE
Figure_10 Page5 after CLAHE
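For anyone who wants to try the same enhancement, this is a minimal CLAHE pass of the kind I used, sketched with OpenCV; the clip limit and tile size are generic defaults rather than my exact settings, and the function hands back an RGB image because of the gotcha above.

import cv2

def clahe_rgb(image_rgb, clip_limit=2.0, tile=(8, 8)):
    """Enhance local contrast on the luminance channel, then return RGB."""
    lab = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile)
    enhanced = cv2.merge((clahe.apply(l), a, b))
    # PaddleOCR expects RGB, so convert back before detection.
    return cv2.cvtColor(enhanced, cv2.COLOR_LAB2RGB)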
On another page, boosting contrast amplified the background noise along with the text, and it even distorted strokes that had previously been clear enough to recognize into unrecognizable shapes (Figures 11 and 12).
Figure_11 Page6 before CLAHE
Figure_12 Page6 after CLAHE
To handle this, I added a routine that measures global contrast and boosts only low-contrast regions (Figures 13 and 14). This selective approach reduces noise, although it comes with a heavy time cost. Manual adjustment in eScriptorium may be a better trade for production work.
Figure_13 Selective preprocessing_1
Figure_14 Selective preprocessing_2
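The region-selective idea looks roughly like this: split the page into tiles, measure each tile's contrast (its standard deviation), and enhance only the flat ones. The tile size and contrast threshold here are placeholders rather than my tuned values, which is part of why the real routine is so slow.

import cv2

def selective_enhance(gray, tile=256, std_threshold=30.0, clip_limit=2.0):
    """Apply CLAHE only to low-contrast tiles of a grayscale page image."""
    out = gray.copy()
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    h, w = gray.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = gray[y:y + tile, x:x + tile]
            if patch.std() < std_threshold:   # boost low-contrast regions only
                out[y:y + tile, x:x + tile] = clahe.apply(patch)
    return out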
When nothing else works
I hit a page that looked normal to the human eye but confused PaddleOCR. Only a small area was detected and most text was ignored (Figure 15).
Figure_15 Page9
To reduce background interference, I tried Gaussian blur and several binarization methods. Only Sauvola gave a measurable improvement (Figures 16 and 17). Even so, the result was still not good enough, so this page will need manual correction in eScriptorium (Figure 18).
Figure_16 Binarization code
Figure_17 Debug_sauvola_binary
Figure_18 Page9 after binarization
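For completeness, this is how a Sauvola pass can be wired in with scikit-image; the window size and k are the library's generic defaults, not the exact settings from my experiments, and the result is stacked back to three channels so PaddleOCR still receives an RGB image.

import cv2
import numpy as np
from skimage.filters import threshold_sauvola

def sauvola_binarize(image_rgb, window_size=25, k=0.2):
    """Binarize with Sauvola's local threshold and return a 3-channel image."""
    gray = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2GRAY)
    thresh = threshold_sauvola(gray, window_size=window_size, k=k)
    binary = (gray > thresh).astype(np.uint8) * 255
    return cv2.cvtColor(binary, cv2.COLOR_GRAY2RGB)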
Importing into eScriptorium
I then imported my transcription XML.
Schema issues
My PAGE-XML did not match eScriptorium’s schema location (Figures 19 and 20). After fixing the PcGts root and the schema reference, I ran into file naming trouble.
Figure_19 Original to PAGE_XML code
Figure_20 The error
Figure_21 Default name
Filename mismatch
The imageFilename in PAGE-XML must match eScriptorium’s default name for the page. I had renamed files for readability when downloading, so I switched back to the platform’s default names (Figure 21).
Baselines required
eScriptorium expects a baseline for every line. Since ancient Chinese is vertical, the baseline should connect the center of the top and bottom edges of the polygon, not the left and right. I added this to the exporter (Figures 22-1, 22-2, 22-3).
Figure_22-1, 22-2, 22-3 New code for XML output
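The baseline rule itself is simple enough to show in a few lines. The sketch below assumes the usual 4-point polygon order (top-left, top-right, bottom-right, bottom-left); the coordinates in the sample output are made up.

def vertical_baseline(poly):
    """Connect the midpoint of the top edge to the midpoint of the bottom edge."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = poly
    top_mid = ((x1 + x2) / 2, (y1 + y2) / 2)
    bottom_mid = ((x3 + x4) / 2, (y3 + y4) / 2)
    return [top_mid, bottom_mid]

# In PAGE-XML this becomes something like:
# <Baseline points="612,148 604,1395"/>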
The mystery of misalignment
After all that, my imported boxes and baselines did not align with the image (Figure 23). I had carefully rescaled coordinates, so this was puzzling.
Two suspects emerged. Either the image I downloaded from eScriptorium to my end was not identical to eScriptorium’s internal copy, or my format conversion from JPG to PNG introduced a tiny scale change. If it was the first, I would need to process images straight from eScriptorium via its API. If it was the second, I could fix it by redownloading and keeping the original name and format.
I tried the simpler path first. I redownloaded the original JPGs, kept the default names, and skipped any conversion. Success (Figure 24).
Figure_23 Misalignment
Figure_24 Final success in importing XML
Takeaways
So far, nearly all of my problems have been solved by patiently modifying my code. Here are the lessons I’ve learned from this process.
Tools are only half the story
The same model can perform very differently depending on how you drive it. Learn the options and tune them.
Mind your formats
Always check the current format of the data you are working with, and confirm what the next step expects.
Every method has a price
Enhancements can introduce noise, slow processing, or both. Choose your trade-offs with care.
Keep going
It’s always easier said than done. Every step of progress might take hours, yet with constant trial and error we get there in the end. Most problems yield to careful checks and targeted fixes. Persistence pays off.
Period 2: Sept 26 — Oct 22
During the process of uploading XML files, something rather interesting happened. Although all the XML files could be successfully uploaded, a few of the base segmentation results refused to display properly—their result areas were completely blank, while others appeared just fine. After a round of puzzled staring, I decided to check the content of one XML file (and, thanks to Jaj, discovered that dragging it into a browser magically reveals its content, something I really should have known earlier).
That’s when I noticed the culprit: the filename recorded inside the XML file was off. I have no idea why this happened to only some images, but it sparked a thought—could I bypass this naming issue with a bit of code? If you’ve taken a close look at my earlier XML conversion code, you might have noticed that I was hardcoding filenames (yes, I admit it). eScriptorium tends to assign each image a strange default name, and since I couldn’t find a consistent pattern, I had been running the segmentation code separately for each image. To keep everything consistent, I downloaded one image at a time, edited the filename inside the “convert to XML” section, and manually adjusted the function call.
Then an old memory suddenly resurfaced from my early Python-learning courses: you can actually abstract filenames and let your program name the results automatically. A small revelation, but a happy one. So, despite having once sworn I would never touch that part of the code again, I rewrote it (grumbling mildly in the process). The new version looks like this (Figure 25).
Figure_25 Using the abstracted filename
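In simplified form, the idea behind Figure 25 is just to derive every output name from the input image’s own stem instead of hardcoding it; the directory name and suffixes below are placeholders, not necessarily the ones in my script.

from pathlib import Path

def output_paths(image_path, out_dir="output"):
    """Derive JSON, visualization, and XML paths from the image's own name."""
    stem = Path(image_path).stem          # keeps eScriptorium's default name
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    return out / f"{stem}.json", out / f"{stem}_vis.jpg", out / f"{stem}.xml"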
With this change, I reran the images whose XML files had previously failed to display, and it worked!
Next, I upgraded the workflow from single-image processing to batch processing. Previously, because the image name had to be given explicitly in the command, I had no choice but to handle segmentation one image at a time. But now that the program could abstract filenames automatically, I thought, why not just let it handle a whole folder at once? My main code was already mature enough; I just needed to adjust how the main program was called (see Figure 26). Of course, this is only where the major modification happens; a few other places needed minor updates to cope with a folder of multiple images rather than a single image. Those are no big deal, just a matter of patience.
And, success again, I could finally process images in batches!
Figure_26 The main modification
Adding multiple supported image formats was simply a precaution; the ones from eScriptorium are usually JPG, but you can never be too careful about formatting. As for all the little if not … print(f"[]") lines—those are my safety net. After enduring countless small errors and mysterious bugs, I’ve learned that it’s an excellent habit: it keeps me aware of what the program is doing, and if something goes wrong, it tells me exactly where to look.
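The batch-driving logic is essentially a loop of this shape: walk the folder, accept a handful of image extensions as a precaution, and print a note whenever a file is skipped or fails so that errors point at the exact image. Here process_image stands in for my existing single-image pipeline, and the extension list is illustrative.

from pathlib import Path

EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def process_folder(folder, process_image):
    """Run the single-image pipeline over every supported image in a folder."""
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in EXTENSIONS:
            print(f"[skip] {path.name}: unsupported format")
            continue
        try:
            process_image(path)
        except Exception as err:   # keep going, but say exactly where it broke
            print(f"[error] {path.name}: {err}")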
After successfully uploading the base-segmentation XML files, the vital next step is to enter the manual transcription.
Period 3: Oct 23 —
A major challenge in transcribing ancient Chinese manuscripts lies in the presence of calligraphic variants: simplified or handwritten forms derived from formal characters. In our case, the most prominent example is , a variant of 癰, in which the right component 隹 is omitted for writing convenience. On the Colenda Digital Repository, this work is displayed under the title 癰疽圖說, although the actual written form differs. After consulting several variant dictionaries, I identified this character as corresponding most closely to the entry zihai-111520 on GlyphWiki. However, since this variant has not been assigned an official Unicode code point, it cannot be directly input or rendered in standard text editors and must be represented as an image.
To input it into the eScriptorium platform, I chose to represent it with the ideographic description sequence (IDS) ⿸疒邕, where 疒 occupies the upper left as the radical indicating illness, and 邕 forms the right component enclosed by it. Similarly, ⿰身本 represents a character in which 身 constitutes the left-hand part and 本 the right-hand part; it is a variant of 體. A slightly more complicated form is ⿰氵⿱天水, meaning that 氵 functions as the radical on the left, 天 sits at the top right, and 水 at the bottom right; this is a variant of 添. Moreover, I used ⿰亻き to represent
, a variant of 傳. This is likely a kanji form; kanji is the system of Japanese writing that uses Chinese characters.
This is the outcome of manual transcription of the first page with content.
Figure_27 Manual Transcription of Page 3
After a few pages of transcription, a thought struck me. The reason I chose IDS was to keep to the diplomatic rule essential for accurate manuscript transcription, which I learned as an undergrad: write down exactly what you see on the manuscript, even if you think an earlier scribe made an error. Now a hard question surfaced: the IDS method keeps to that rule, but is it really effective for HTR? Manual transcription aims to reproduce the source as faithfully as possible. HTR, by contrast, aims to extract machine-readable, searchable, analyzable semantic information from images, not to make another visual copy. Since one character can appear in multiple variants within the same manuscript, will the model successfully identify them as the same character?
With that in mind, I asked myself: would it be better to normalize calligraphic variants to their corresponding Unicode characters? To find an answer, I went back to the TEI P5 guidelines and the Unicode Standard for alternative ways of representing unencoded Chinese characters, and finally decided to transcribe all the variants into their encoded standard forms while keeping an auxiliary spreadsheet to record their locations and IDS.
This is an updated version of the first page with content:
Figure_28 Updated Manual Transcription of Page 3
I am grateful to my friend Jingyuan Liu for generously sharing comparative charts of cursive (草書) and standard radical forms, which helped me a lot in identifying variant components. Jingyuan also offered great help in recognizing and ascertaining some elusive scripts. He suggested that some forms might be scribal errors or products of unfamiliarity with Chinese characters rather than “correct” cursive versions.
Figure_29 Comparative chart for common radicals_1
Figure_30 Comparative chart for common radicals_2
New manuscript: Xing li da quan shu = 性理大全書. Vol. 1 of 32.
Phase 1: Ground truth setup
Period 1: Refining the program (Nov 11 — Nov 18)
My work with this manuscript began, once again, by fixing the Python script. Even though I now have a reliable program that can call PaddleOCR for base segmentation and recognition of right-to-left, vertical Chinese manuscripts, a new issue has appeared. More accurately, it is an old issue in a new form: version conflicts. PaddleOCR depends on NumPy and OpenCV, and both can be silently updated on my laptop. At the same time, PaddleOCR itself is always introducing new API parameters, which can suddenly invalidate earlier code. The overall structure and logic remain usable, yet many commands must be adjusted to match each update. Since this is an open-source project, I feel justified in sighing a little at how quickly it evolves. I cannot stop wondering how many engineers at Baidu are working behind the scenes to generate all these new parameters.
Based on the tracebacks on my end (yeah, the lengthy, tortuous tracebacks in the Terminal can contain key information), I decided to set up a virtual environment in the Terminal, explicitly install a NumPy 1.x version, reinstall all the packages I need to call PaddleOCR, install an updated PaddleOCR, and run the updated program in this virtual environment.
This is what I did to build the appropriate environment (these versions are not the only compatible combination; they were simply picked from a range that works):
python3 -m venv ~/venvs/paddleocr-env
source ~/venvs/paddleocr-env/bin/activate
pip install numpy==1.26.4
pip install opencv-python==4.5.5.64
pip install Pillow==9.5.0
pip install paddlepaddle==3.0.0
pip install paddleocr==3.0.0
pip install lxml pyyaml
After updating the parameters, I ran into the abnormal return again. The detector actually found 13 text boxes, yet I got 0 back, since every box contained only one point when four were expected. “Unstable updated model again?” I thought to myself, and decided to sleep on it. This time, unfortunately, time did not fix anything.
After checking and rechecking the parameters to rule out problems in initializing the model, after writing multiple “if” statements to catch every possible format (the shifts in formats and the sudden appearance and disappearance of parameters with each iteration really turn everything into chaos, and I was so desperate that I included every format I could think of), after writing endless checks and printouts to see each outcome, and after staring at tracebacks again and again, I could only confirm that the major issue lay in the detect_boxes function. But I still made no real progress. So I made a big fuss and printed out the full structure of the results, including the entire first item.
After several rounds of print-outs and testing, I finally identified the exact structure of the output produced by PaddleOCR 3.2.2’s predict() method.
Here is what I observed:
1. predict() always returns a list, where each element corresponds to one page of OCR results. Even if we process only a single image, the output is still a one-element list, and the first element is the result for page 1.
2. Each page’s result is an OCRResult object. This object contains a .json attribute, which is a dictionary holding the full OCR output.
3. Inside .json, there is a key called 'res', which stores the actual detection and recognition results.
4. The structure looks like this:
res_list = ocr_instance.predict(arr) # list of results for each page
ocr_result = res_list[0] # OCRResult object for the first page
json_data = ocr_result.json # dictionary
res_data = json_data['res'] # actual OCR result dictionary
5. Within res_data, the following keys are most relevant for our workflow:
dt_polys — detected polygons (4 points per text line)
rec_texts — recognized texts
rec_scores — recognition confidence scores
These three lists correspond to each other index-by-index.
6. Although res_data also contains other keys such as rec_polys or rec_boxes, the combination of dt_polys + rec_texts + rec_scores is sufficient for constructing our custom box structure and for generating the PAGE-XML output.
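Putting that structure to use, the three parallel lists can be zipped into the simple box records the rest of my pipeline expects. The field names in the output dictionaries are my own, not PaddleOCR’s.

def extract_boxes(res_data):
    """Turn dt_polys / rec_texts / rec_scores into one record per text line."""
    boxes = []
    for poly, text, score in zip(res_data["dt_polys"],
                                 res_data["rec_texts"],
                                 res_data["rec_scores"]):
        boxes.append({
            "polygon": [(float(x), float(y)) for x, y in poly],
            "text": text,
            "confidence": float(score),
        })
    return boxes

boxes = extract_boxes(res_data)   # res_data as built in the snippet above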
I think that instead of relentlessly updating parameters to “patch known holes,” which only makes an already well-performing model more and more convoluted, what Baidu really needs to do is release a detailed and user-friendly manual for each new version (despite all the updates, the PaddleOCR manual is still the one written in 2022). It should explain how each version of the model functions, how to deploy different versions correctly, and, most ideally, come with complete sample programs showing how to call each version.
With a Python program that could stably drive the model, I began the base segmentation for this new manuscript. To my surprise, the first batch of results was very poor. The model could not even consistently recognize the vertical frames or draw boxes in the right places; there were many misalignments and dislocations. This genuinely caught me off guard: since this manuscript has a neater layout and much clearer characters than Yōso zusetsu, I assumed the model would handle it with ease.
Figure_1 Misinterpreting the orientation
Figure_2 Totally misaligned
After reviewing the results, I started to suspect that the existing text frames on the images might be interfering with recognition. On some pages where the lines of characters sit very close together, the text even looks like horizontal writing. Another factor is that, in order to avoid triggering removed parameters and crashing the program, I deleted all the parameters I had previously used to fine-tune the functions. It is possible that adding back adjusted parameters tailored to this manuscript is necessary. So I began exploring the new parameter options.
Figure_3 Updated parameters
“use_doc_orientation_classify” and “use_textline_orientation” replace the earlier “cls” option. Turning cls on used to help ensure that base segmentation ran in the vertical direction; this time, however, it made the results worse.
With adjusted parameters, the output improved for pages with larger fonts and looser layouts, while pages with smaller fonts and denser layouts still produced unsatisfactory results.
Figure_4 Improvement
Figure_5 Still not ideal
No matter how I adjusted the parameters, the pages with dense text simply refused to be processed accurately. I even wrote a small filter to check whether the detected boxes were taller than they were wide and to remove the ones that were not. Guess what: no boxes were returned at all. So for this first stage I only segmented the 44 pages with relatively sparse text and uploaded their XML files to eScriptorium.
Reference
Yōso zusetsu; 廱疽圖説 - Colenda Digital Repository (no date). Available at: https://colenda.library.upenn.edu/catalog/81431-p3806r (Accessed: October 1, 2025).
PaddlePaddle/PaddleOCR (2025) GitHub. Available at: https://github.com/PaddlePaddle/PaddleOCR (Accessed: October 1, 2025).
CLAHE histogram equalization - OpenCV (2020) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/python/clahe-histogram-eqalization-opencv/ (Accessed: October 1, 2025).
u7670 - GlyphWiki (no date). Available at: https://glyphwiki.org/wiki/u7670 (Accessed: November 3, 2025).
Unicode Standard (no date). Available at: https://www.unicode.org/standard/standard.html (Accessed: November 10, 2025).
TEI Consortium (2025) “TEI P5: Guidelines for Electronic Text Encoding and Interchange.” Available at: https://doi.org/10.5281/ZENODO.3413524.