Description
Current Behavior
tesseract -l eng --user-patterns patterns.txt in.png out.txt hocr txt
causes an assertion failure on one specific document page only, regardless of the contents of patterns.txt.
The image is OCR'd successfully without --user-patterns, even when --user-words is used.
I cannot share the image.
The failure is reproducible, and I have core dumps that load in GDB; details are below.
Expected Behavior
tesseract should work on any valid PNG image, whether or not a patterns file is used.
Suggested Fix
No response
tesseract -v
tesseract 5.5.1
leptonica-1.85.0
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.48 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.5.0 : libopenjp2 2.5.3
Found AVX
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.8.0 zlib/1.3.1 liblzma/5.8.1 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.7 openssl/3.5.0 libb2/bundled libacl/2.3.2 libattr/2.3.2
Found libcurl/8.13.0 OpenSSL/3.5.0 zlib/1.3.1 brotli/1.1.0 zstd/1.5.7 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.1 nghttp2/1.65.0 nghttp3/1.9.0
Operating System
No response
Other Operating System
Arch Linux with tesseract system package 5.5.1-1
uname -a
6.14.7-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 22 May 2025 05:37:49 +0000 x86_64 GNU/Linux
Compiler
No response
CPU
Intel Core i7-3520M CPU @ 2.90GHz
Virtualization / Containers
No response
Other Information
tesseract -l eng --user-patterns ocrpat /tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png /tmp/ocrmypdf.io.orgi4dfg/000007_ocr_hocr hocr txt
contains_unichar_id(unichar_id):Error:Assert failed:in file ./src/ccutil/unicharset.h, line 501
zsh: IOT instruction (core dumped) tesseract -l eng --user-patterns ocrpat hocr txt
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007f03e0faf813 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89
#2 0x00007f03e0f55dc0 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f03e0f3d57a in __GI_abort () at abort.c:73
#4 0x00007f03e1a8cfe2 in tesseract::ERRCODE::error (this=<optimized out>, caller=<optimized out>, action=tesseract::ABORT, format=<optimized out>)
at src/ccutil/errcode.cpp:83
#5 0x00007f03e1baf003 in tesseract::UNICHARSET::get_isalpha (this=0x55e2617109c0, unichar_id=216) at ./src/ccutil/unicharset.h:501
#6 tesseract::Trie::unichar_id_to_patterns (this=0x55e2616ea8b0, unichar_id=216, unicharset=..., vec=0x7fffbbc5d9c0) at src/dict/trie.cpp:351
#7 0x00007f03e1ba2b3e in tesseract::Dict::ProcessPatternEdges (this=this@entry=0x55e261b5b120, dawg=dawg@entry=0x55e2616ea8b0, pos=...,
unichar_id=unichar_id@entry=216, word_end=word_end@entry=true, dawg_args=dawg_args@entry=0x7fffbbc5db50, curr_perm=0x7fffbbc5da9c) at src/dict/dict.cpp:579
#8 0x00007f03e1ba471c in tesseract::Dict::def_letter_is_okay (this=0x55e261b5b120, void_dawg_args=<optimized out>, unicharset=..., unichar_id=216, word_end=true)
at src/dict/dict.cpp:519
#9 0x00007f03e1ba8dca in tesseract::Dict::valid_word (this=0x55e261b5b120, word=..., numbers_ok=false) at ./src/ccstruct/ratngs.h:282
#10 0x00007f03e1b32b8f in tesseract::Tesseract::recog_word (this=0x7f03e1dc2010, word=0x55e262943e00) at src/ccmain/tfacepp.cpp:63
#11 0x00007f03e1b32e77 in tesseract::Tesseract::tess_segment_pass_n (this=0x7f03e1dc2010, pass_n=<optimized out>, word=0x55e262943e00) at src/ccmain/tessbox.cpp:47
#12 0x00007f03e1adcb42 in tesseract::Tesseract::match_word_pass_n (this=0x7f03e1dc2010, pass_n=1, word=0x55e262943e00, row=0x55e2628c71e0, block=<optimized out>)
at src/ccmain/control.cpp:1600
#13 0x00007f03e1adccf2 in tesseract::Tesseract::classify_word_pass1 (this=0x7f03e1dc2010, word_data=..., in_word=0x55e262943fe0, out_words=<optimized out>)
at src/ccmain/control.cpp:1420
#14 0x00007f03e1add21f in tesseract::Tesseract::RetryWithLanguage (this=0x7f03e1dc2010, word_data=...,
recognizer=(void (tesseract::Tesseract::*)(tesseract::Tesseract * const, const tesseract::WordData &, tesseract::WERD_RES **, tesseract::PointerVector<tesseract::WERD_RES> *)) 0x7f03e1adcc80 <tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*)>,
debug=debug@entry=false, in_word=0x55e262943fe0, best_words=0x7fffbbc5df60) at src/ccmain/control.cpp:883
#15 0x00007f03e1ade0a5 in tesseract::Tesseract::classify_word_and_language (this=0x7f03e1dc2010, pass_n=<optimized out>, pr_it=0x7fffbbc5e0e0, word_data=0x55e262997870)
at ./src/ccutil/genericvector.h:510
#16 0x00007f03e1ad8b2d in tesseract::Tesseract::RecogAllWordsPassN (this=0x7f03e1dc2010, pass_n=1, monitor=0x0, pr_it=0x7fffbbc5e0e0, words=0x7fffbbc5e0c0)
at src/ccmain/control.cpp:255
#17 0x00007f03e1ae439f in tesseract::Tesseract::recog_all_words (this=0x7f03e1dc2010, page_res=0x55e2628d6ba0, monitor=0x0, target_word_box=0x0, word_config=0x0,
dopasses=0) at src/ccmain/control.cpp:345
#18 0x00007f03e1a9e45a in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffbbc5e980, monitor=monitor@entry=0x0) at src/api/baseapi.cpp:832
#19 0x00007f03e1aa1b73 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffbbc5e980, pix=0x55e2628952c0, page_index=0, filename=<optimized out>, retry_config=0x0,
timeout_millisec=<optimized out>, renderer=0x55e262895250) at src/api/baseapi.cpp:1217
#20 0x00007f03e1aa2fcc in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffbbc5e980,
filename=filename@entry=0x7fffbbc5f75b "/tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
renderer=0x55e262895250) at src/api/baseapi.cpp:1180
#21 0x00007f03e1aa3236 in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffbbc5e980,
filename=filename@entry=0x7fffbbc5f75b "/tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
renderer=<optimized out>) at src/api/baseapi.cpp:997
#22 0x000055e242442939 in main1 (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/15.1.1/bits/unique_ptr.h:193
#23 0x000055e24243f582 in main (argc=<optimized out>, argv=<optimized out>) at src/tesseract.cpp:858
(gdb) f 5
#5 0x00007f03e1baf003 in tesseract::UNICHARSET::get_isalpha (this=0x55e2617109c0, unichar_id=216) at ./src/ccutil/unicharset.h:501
501 ASSERT_HOST(contains_unichar_id(unichar_id));
(gdb) l
496 // Return the isalpha property of the given unichar.
497 bool get_isalpha(UNICHAR_ID unichar_id) const {
498 if (INVALID_UNICHAR_ID == unichar_id) {
499 return false;
500 }
501 ASSERT_HOST(contains_unichar_id(unichar_id));
502 return unichars[unichar_id].properties.isalpha;
503 }
504
505 // Return the islower property of the given unichar.
(gdb) p unichar_id
$2 = 216
(gdb)
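As a rough illustration of what the assertion in frame #5 checks, here is a minimal C++ sketch. It is a simplified model, not Tesseract's code: ToyUnicharset, UnicharId, kInvalidUnicharId and the set sizes are hypothetical names and values chosen for the example, and it throws instead of aborting so the failure can be observed. It also assumes, without confirmation from the core dump, that ID 216 lies outside the range of the set being queried.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for tesseract::UNICHAR_ID and INVALID_UNICHAR_ID.
using UnicharId = int;
constexpr UnicharId kInvalidUnicharId = -1;

// Toy model of a character set: one isalpha flag per known unichar.
// This is NOT tesseract::UNICHARSET, only a sketch of the same guard.
struct ToyUnicharset {
  std::vector<bool> is_alpha;

  // True only when the ID indexes an existing entry.
  bool contains_unichar_id(UnicharId id) const {
    return id != kInvalidUnicharId && id >= 0 &&
           static_cast<std::size_t>(id) < is_alpha.size();
  }

  // Same shape as get_isalpha in the listing: the invalid ID is tolerated,
  // any other ID must be contained in the set. The real code aborts via
  // ASSERT_HOST; this sketch throws so a caller can observe the failure.
  bool get_isalpha(UnicharId id) const {
    if (id == kInvalidUnicharId) {
      return false;
    }
    if (!contains_unichar_id(id)) {
      throw std::out_of_range("unichar_id is not in this set");
    }
    return is_alpha[static_cast<std::size_t>(id)];
  }
};
```

In this model, a set with 112 entries (an arbitrary size) rejects ID 216, while a set with 300 entries accepts it. If the same mechanism is at work here, comparing 216 against the number of entries in unichars in frame 5 would confirm or rule it out.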