Segmenting Ancient Chinese-Japanese Texts for HTR
The following blog post by Yifang Xia, RDDS Data Science & Society Research Assistant 2025, is a log of her work with the Manuscript Collections as Data Research Group. It follows her experience segmenting an illustrated treatise, ca. 1600, on the diagnosis of abscesses and tumors and their treatment, mostly through acupuncture or burning substances near the skin. Copied in Japan in Chinese for Japanese practitioners, the manuscript, LJS 433, was chosen by Yifang for its irregular handwriting style, which makes text recognition even more challenging than with neater materials.
Ancient Chinese HTR Event Log
My project began with Yōso zusetsu (廱疽圖説), an illustrated treatise on diagnosing and treating abscesses and tumors, copied in Japan in Chinese for Japanese practitioners. Unlike other Chinese manuscripts in the Penn Libraries, this one features irregular handwriting, which makes it especially hard to read.
Phase 1: Ground truth setup
Period 1: Aug 27 — Sept 25
With guidance from Jessie Dummer, the Digitization Project Coordinator, and Jaj, Applied Data Science Librarian, I first tried eScriptorium. Even with the script set to Han (traditional), a centered baseline, and a vertical layout that reads top to bottom and right to left, the platform performed poorly on this manuscript. It failed to recognize vertical text and instead treated the page like horizontal text stacked vertically. So I turned to an open-source OCR model trained for ancient Chinese and chose PaddleOCR from the PaddlePaddle team. It offers many pre-trained models for more than 80 languages and supports both printed and handwritten text.
Two days of setup adventures
It took me two full days to get it running. The early problems included:
Images too large: This was not a model issue. My MacBook’s memory said no.
Version mismatches: Older tutorials and snippets target PaddleOCR 2.5–2.7, while the current release is 3.0.0. Some parameters and APIs changed names or behavior, so the same function sometimes requires different code. See Figure 1 for an example.
Figure 1_Difference
Vertical detection: Detecting and framing vertical text depends on using the right dictionaries for traditional versus simplified Chinese.
Here is the basic architecture of the Python code I kept refining:
Global OCR instance
• We keep a single global OCR instance so we don’t repeatedly initialize PaddleOCR. This saves memory and improves efficiency.
Timeout handling
• A custom timeout mechanism ensures that if OCR or initialization takes too long, the process stops instead of hanging.
• This prevents wasting time on oversized or corrupted inputs.
Safe resize
• Before OCR, oversized images are scaled down to a maximum side length.
• This reduces memory usage, speeds up processing, and prevents crashes.
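The safe-resize step can be sketched like this with Pillow (the 2,000 px cap and the function name are illustrative choices, not exact values from my pipeline; returning the scale factor lets us map detected coordinates back later):

```python
from PIL import Image

MAX_SIDE = 2000  # assumed cap; tune to your machine's memory

def safe_resize(image: Image.Image, max_side: int = MAX_SIDE):
    """Scale the image down so its longest side is at most max_side.

    Returns the (possibly resized) image and the scale factor applied,
    so detected coordinates can later be mapped back to original size.
    """
    factor = max(image.width, image.height) / max_side
    if factor <= 1:
        return image, 1.0
    new_size = (round(image.width / factor), round(image.height / factor))
    return image.resize(new_size, Image.LANCZOS), 1.0 / factor
```

A 4000x1000 page would come back at 2000x500 with a scale of 0.5, which the later scale-back step divides out again.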
Detection (core logic)
• detect_boxes() runs PaddleOCR.
• It can work in two modes: with recognition, it detects boxes and also outputs text; without recognition, it detects boxes only.
• Each detected item has polygon coordinates, text content, and a confidence score.
The results are scaled back to the original image size. This step is necessary for drawing bounding boxes on the original image.
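The scale-back step can be sketched as plain Python (detect_boxes() itself wraps the PaddleOCR call; the helper name and the point-tuple format here are illustrative):

```python
def rescale_polygons(polygons, scale):
    """Map polygons detected on a resized image back to original coordinates.

    `scale` is the factor the image was shrunk by (e.g. 0.5 means detection
    ran at half size), so dividing restores original-pixel positions.
    """
    return [[(round(x / scale), round(y / scale)) for x, y in poly]
            for poly in polygons]
```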
Visualization
• Bounding boxes are drawn directly on the image.
• If the image is very large, it’s resized first, and boxes are scaled accordingly.
• Saves a visualization file for quick inspection. Figuring out a way to easily inspect whether we get the expected outcome is crucial.
Vertical right-to-left sorting
• Ancient Chinese manuscript texts are written vertically, right to left. This function sorts the boxes in the correct reading order.
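A minimal version of that sorting, assuming each box is a list of (x, y) points and using a fixed column tolerance (both choices are illustrative, not my exact code):

```python
def sort_vertical_rtl(boxes, col_tol=30):
    """Sort boxes for vertical right-to-left text:
    columns right to left, then top to bottom within each column."""
    def center(box):
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        return sum(xs) / len(xs), sum(ys) / len(ys)

    # Order by x-centre, rightmost first, then group nearby centres
    # into the same column.
    boxes = sorted(boxes, key=lambda b: -center(b)[0])
    columns, current, last_x = [], [], None
    for b in boxes:
        cx = center(b)[0]
        if last_x is not None and last_x - cx > col_tol:
            columns.append(current)
            current = []
        current.append(b)
        last_x = cx
    if current:
        columns.append(current)
    # Top to bottom inside each column.
    return [b for col in columns
            for b in sorted(col, key=lambda b: center(b)[1])]
```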
Convert to PAGE-XML
• Generates a PAGE-XML file, which we can import into eScriptorium for manual adjustment.
• Each text line includes:
• Polygon coordinates
• (Optional) recognized text
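A stripped-down sketch of such an exporter using only the standard library (the 2019-07-15 PAGE namespace is a common choice, but a file eScriptorium fully accepts needs more, such as a Metadata block and baselines; function name and structure here are illustrative):

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

def to_page_xml(image_filename, width, height, lines):
    """Build a minimal PAGE-XML document: one TextRegion, one TextLine
    per detected box.

    `lines` is a list of (polygon, text) pairs; text may be None when
    detection ran without recognition.
    """
    ET.register_namespace("", PAGE_NS)
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page = ET.SubElement(root, f"{{{PAGE_NS}}}Page",
                         imageFilename=image_filename,
                         imageWidth=str(width), imageHeight=str(height))
    region = ET.SubElement(page, f"{{{PAGE_NS}}}TextRegion", id="r1")
    for i, (poly, text) in enumerate(lines, start=1):
        line = ET.SubElement(region, f"{{{PAGE_NS}}}TextLine", id=f"l{i}")
        points = " ".join(f"{x},{y}" for x, y in poly)
        ET.SubElement(line, f"{{{PAGE_NS}}}Coords", points=points)
        if text:
            te = ET.SubElement(line, f"{{{PAGE_NS}}}TextEquiv")
            ET.SubElement(te, f"{{{PAGE_NS}}}Unicode").text = text
    return ET.tostring(root, encoding="unicode")
```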
Main function
• Orchestrates the entire workflow:
• Parse command-line arguments
• Detect boxes
• Save JSON results
• Save visualization
• Export PAGE-XML
• Includes error handling and always clears timeouts at the end.
Day one results
I got no usable output at first. The code looked fine, but the model did not behave. I instrumented the pipeline to log outputs at each stage. PaddleOCR’s docs say detection should return something like [[[x1,y1], [x2,y2], ...], (text, confidence)], but what I saw looked like token-like single characters without proper bounding boxes. My guess is that this came from an unstable function in the new release. The mismatch meant my parser could not proceed.
Rolling back to 2.6 or 2.7 did not help because those builds conflicted with my current Python environment. They wanted an older NumPy that did not play nicely with my other packages. End of day one scorecard: PaddleOCR not yet reliable.
Day two plot twist
I tested Tesseract and got poor results. Then I retried PaddleOCR 3.0.0 with the same code as before. It worked. No changes on my end. The model simply decided to cooperate. Figures 2 and 3 show the contrast.
Figure_2 The Result of Tesseract
Figure_3 The First Result of Paddle
At first, detection still missed many lines. I lowered the text threshold, increased the unclip ratio, and enabled dilation to widen the detection window. The second run was excellent. All lines were detected correctly. See Figures 4 and 5.
Figure_4 Adding Parameters
Figure_5 The Second Result of Paddle
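For reference, the knobs involved are the DB detector's pixel threshold, unclip ratio, and dilation switch. Parameter names changed between releases: the sketch below uses the 2.x-style names, while 3.x renamed several of them (e.g. text_det_thresh, text_det_unclip_ratio), so check the docs for your installed version; the values shown are examples, not my exact settings.

```python
from paddleocr import PaddleOCR

# 2.x-style DB detector knobs (names differ in 3.x; values are examples):
ocr = PaddleOCR(
    lang="chinese_cht",       # traditional Chinese dictionary
    det_db_thresh=0.2,        # lower pixel threshold -> catch fainter strokes
    det_db_unclip_ratio=1.8,  # expand detected boxes outward
    use_dilation=True,        # dilate the detection map to widen coverage
)
```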
Pages with illustrations and show-through
Illustrations interrupted text detection, and some scans suffered from show-through where content from the next page bled into the current one.
For some pages, increasing the unclip ratio helped (see Figures 6 and 7).
Figure_6 Page4 with unclip = 1.6
Figure_7 Page4 with unclip = 1.8
For others, unclip no longer helped, so I used CLAHE for contrast enhancement (Figures 8, 9, and 10).
Figure_8 Conduct the preprocessing
One gotcha: PaddleOCR expects RGB images. If you do any preprocessing that changes the mode, convert back to RGB before detection.
Figure_9 Page5 before CLAHE
Figure_10 Page5 after CLAHE
On another page, boosting contrast amplified both the text and the background noise, and even distorted strokes that had been clear enough to recognize into unrecognizable shapes (Figures 11 and 12).
Figure_11 Page6 before CLAHE
Figure_12 Page6 after CLAHE
To handle this, I added a routine that measures global contrast and boosts only low-contrast regions (Figures 13 and 14). This selective approach reduces noise, although it comes with a heavy time cost. Manual adjustment in eScriptorium may be a better trade for production work.
Figure_13 Selective preprocessing_1
Figure_14 Selective preprocessing_2
When nothing else works
I hit a page that looked normal to the human eye but confused PaddleOCR. Only a small area was detected and most text was ignored (Figure 15).
Figure_15 Page9
To reduce background interference, I tried Gaussian blur and several binarization methods. Only Sauvola gave a measurable improvement (Figures 16 and 17). Even so, the result was still not good enough, so this page will need manual correction in eScriptorium (Figure 18).
Figure_16 Binarization code
Figure_17 Debug_sauvola_binary
Figure_18 Page9 after binarization
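My actual binarization code is shown in Figure 16, but the method itself is compact: Sauvola's threshold is T = m * (1 + k * (s / R - 1)), with local mean m and local standard deviation s over a sliding window. A pure-NumPy sketch of the idea, using the usual default window, k, and R:

```python
import numpy as np

def sauvola_binarize(gray, window=25, k=0.2, R=128.0):
    """Sauvola thresholding: T = m * (1 + k * (s / R - 1)), computed
    over a sliding window via integral images (pure NumPy)."""
    assert window % 2 == 1, "window must be odd"
    g = gray.astype(np.float64)
    h, w = g.shape
    pad = window // 2
    gp = np.pad(g, pad, mode="reflect")

    def integral(a):
        # Integral image with a zero first row/column.
        S = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
        S[1:, 1:] = a.cumsum(0).cumsum(1)
        return S

    def window_sum(S):
        return (S[window:window + h, window:window + w]
                - S[:h, window:window + w]
                - S[window:window + h, :w]
                + S[:h, :w])

    n = window * window
    m = window_sum(integral(gp)) / n                  # local mean
    var = window_sum(integral(gp ** 2)) / n - m ** 2  # local variance
    s = np.sqrt(np.clip(var, 0, None))                # local std dev
    T = m * (1 + k * (s / R - 1))
    return np.where(g > T, 255, 0).astype(np.uint8)
```

Because the threshold adapts to each neighborhood, faint show-through on a clean background stays white while genuine strokes go black.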
Importing into eScriptorium
I then imported my transcription XML.
Schema issues
My PAGE-XML did not match eScriptorium’s schema location (Figures 19 and 20). After fixing the PcGts root and the schema reference, I ran into file naming trouble.
Figure_19 Original to PAGE_XML code
Figure_20 The error
Figure_21 Default name
Filename mismatch
The imageFilename in PAGE-XML must match eScriptorium’s default name for the page. I had renamed files for readability when downloading, so I switched back to the platform’s default names (Figure 21).
Baselines required
eScriptorium expects a baseline for every line. Since ancient Chinese is vertical, the baseline should connect the center of the top and bottom edges of the polygon, not the left and right. I added this to the exporter (Figures 22-1, 22-2, 22-3).
Figure_22-1, 22-2, 22-3 New code for XML output
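The baseline computation itself is tiny. Assuming an axis-aligned line polygon, the exporter can derive it like this (the function name is illustrative; skewed polygons would need a proper pairing of top and bottom edges):

```python
def vertical_baseline(polygon):
    """Baseline for a vertical text line: a segment from the midpoint of
    the top edge to the midpoint of the bottom edge of the polygon.

    Assumes an axis-aligned quadrilateral given as (x, y) points.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    cx = (min(xs) + max(xs)) / 2
    return [(round(cx), min(ys)), (round(cx), max(ys))]
```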
The mystery of misalignment
After all that, my imported boxes and baselines did not align with the image (Figure 23). I had carefully rescaled coordinates, so this was puzzling.
Two suspects emerged. Either the image I downloaded from eScriptorium was not identical to its internal copy, or my format conversion from JPG to PNG introduced a tiny scale change. If it was the first, I would need to process images straight from eScriptorium via its API. If it was the second, I could fix it by redownloading and keeping the original name and format.
I tried the simpler path first. I redownloaded the original JPGs, kept the default names, and skipped any conversion. Success (Figure 24).
Figure_23 Misalignment
Figure_24 Final success in importing XML
Takeaways
So far, nearly all of my problems have been solved by patiently revising my code. Here are the lessons I learned in the process and want to share with you.
Tools are only half the story
The same model can perform very differently depending on how you drive it. Learn the options and tune them.
Mind your formats
Always check the current format of the data you are working with, and confirm what the next step expects.
Every method has a price
Enhancements can introduce noise, slow processing, or both. Choose your trade with care.
Keep going
It’s always easier said than done: each step of progress can take hours, but through constant trial and error, it finally comes together. Most problems yield to careful checks and targeted fixes. Persistence pays off.
Period 2: Sept 26 — Oct 1
After successfully uploading the XML files with the base segmentation, the next vital step is to add manual transcription.
Reference
Yōso zusetsu; 廱疽圖説 (no date) Colenda Digital Repository. Available at: https://colenda.library.upenn.edu/catalog/81431-p3806r (Accessed: October 1, 2025).
PaddlePaddle/PaddleOCR: Turn any PDF or image document into structured data for your AI (2025) GitHub. Available at: https://github.com/PaddlePaddle/PaddleOCR (Accessed: October 1, 2025).
CLAHE Histogram Equalization - OpenCV (2020) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/python/clahe-histogram-eqalization-opencv/ (Accessed: October 1, 2025).