--user-patterns can cause assertion failure in UNICHARSET::get_isalpha #4425

Open
@krumelmonster

Current Behavior

Running tesseract -l eng --user-patterns patterns.txt in.png out.txt hocr txt causes an assertion failure, but only on one specific document page, and regardless of the contents of patterns.txt.

The image is OCRed successfully without --user-patterns, even when --user-words is used.

I cannot share the image.

The failure is reproducible, and I have core dumps working in GDB; details are below.

Expected Behavior

tesseract should work on any valid PNG image, regardless of whether a patterns file is used.

Suggested Fix

No response

tesseract -v

tesseract 5.5.1
 leptonica-1.85.0
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.48 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.5.0 : libopenjp2 2.5.3
 Found AVX
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.8.0 zlib/1.3.1 liblzma/5.8.1 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.7 openssl/3.5.0 libb2/bundled libacl/2.3.2 libattr/2.3.2
 Found libcurl/8.13.0 OpenSSL/3.5.0 zlib/1.3.1 brotli/1.1.0 zstd/1.5.7 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.1 nghttp2/1.65.0 nghttp3/1.9.0

Operating System

No response

Other Operating System

Arch Linux with tesseract system package 5.5.1-1

uname -a

6.14.7-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 22 May 2025 05:37:49 +0000 x86_64 GNU/Linux

Compiler

No response

CPU

Intel Core i7-3520M CPU @ 2.90GHz

Virtualization / Containers

No response

Other Information

tesseract -l eng --user-patterns ocrpat /tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png /tmp/ocrmypdf.io.orgi4dfg/000007_ocr_hocr hocr txt

contains_unichar_id(unichar_id):Error:Assert failed:in file ./src/ccutil/unicharset.h, line 501
zsh: IOT instruction (core dumped)  tesseract -l eng --user-patterns ocrpat   hocr txt
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f03e0faf813 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89
#2  0x00007f03e0f55dc0 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f03e0f3d57a in __GI_abort () at abort.c:73
#4  0x00007f03e1a8cfe2 in tesseract::ERRCODE::error (this=<optimized out>, caller=<optimized out>, action=tesseract::ABORT, format=<optimized out>)
    at src/ccutil/errcode.cpp:83
#5  0x00007f03e1baf003 in tesseract::UNICHARSET::get_isalpha (this=0x55e2617109c0, unichar_id=216) at ./src/ccutil/unicharset.h:501
#6  tesseract::Trie::unichar_id_to_patterns (this=0x55e2616ea8b0, unichar_id=216, unicharset=..., vec=0x7fffbbc5d9c0) at src/dict/trie.cpp:351
#7  0x00007f03e1ba2b3e in tesseract::Dict::ProcessPatternEdges (this=this@entry=0x55e261b5b120, dawg=dawg@entry=0x55e2616ea8b0, pos=...,
    unichar_id=unichar_id@entry=216, word_end=word_end@entry=true, dawg_args=dawg_args@entry=0x7fffbbc5db50, curr_perm=0x7fffbbc5da9c) at src/dict/dict.cpp:579
#8  0x00007f03e1ba471c in tesseract::Dict::def_letter_is_okay (this=0x55e261b5b120, void_dawg_args=<optimized out>, unicharset=..., unichar_id=216, word_end=true)
    at src/dict/dict.cpp:519
#9  0x00007f03e1ba8dca in tesseract::Dict::valid_word (this=0x55e261b5b120, word=..., numbers_ok=false) at ./src/ccstruct/ratngs.h:282
#10 0x00007f03e1b32b8f in tesseract::Tesseract::recog_word (this=0x7f03e1dc2010, word=0x55e262943e00) at src/ccmain/tfacepp.cpp:63
#11 0x00007f03e1b32e77 in tesseract::Tesseract::tess_segment_pass_n (this=0x7f03e1dc2010, pass_n=<optimized out>, word=0x55e262943e00) at src/ccmain/tessbox.cpp:47
#12 0x00007f03e1adcb42 in tesseract::Tesseract::match_word_pass_n (this=0x7f03e1dc2010, pass_n=1, word=0x55e262943e00, row=0x55e2628c71e0, block=<optimized out>)
    at src/ccmain/control.cpp:1600
#13 0x00007f03e1adccf2 in tesseract::Tesseract::classify_word_pass1 (this=0x7f03e1dc2010, word_data=..., in_word=0x55e262943fe0, out_words=<optimized out>)
    at src/ccmain/control.cpp:1420
#14 0x00007f03e1add21f in tesseract::Tesseract::RetryWithLanguage (this=0x7f03e1dc2010, word_data=...,
    recognizer=(void (tesseract::Tesseract::*)(tesseract::Tesseract * const, const tesseract::WordData &, tesseract::WERD_RES **, tesseract::PointerVector<tesseract::WERD_RES> *)) 0x7f03e1adcc80 <tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*)>,
    debug=debug@entry=false, in_word=0x55e262943fe0, best_words=0x7fffbbc5df60) at src/ccmain/control.cpp:883
#15 0x00007f03e1ade0a5 in tesseract::Tesseract::classify_word_and_language (this=0x7f03e1dc2010, pass_n=<optimized out>, pr_it=0x7fffbbc5e0e0, word_data=0x55e262997870)
    at ./src/ccutil/genericvector.h:510
#16 0x00007f03e1ad8b2d in tesseract::Tesseract::RecogAllWordsPassN (this=0x7f03e1dc2010, pass_n=1, monitor=0x0, pr_it=0x7fffbbc5e0e0, words=0x7fffbbc5e0c0)
    at src/ccmain/control.cpp:255
#17 0x00007f03e1ae439f in tesseract::Tesseract::recog_all_words (this=0x7f03e1dc2010, page_res=0x55e2628d6ba0, monitor=0x0, target_word_box=0x0, word_config=0x0,
    dopasses=0) at src/ccmain/control.cpp:345
#18 0x00007f03e1a9e45a in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffbbc5e980, monitor=monitor@entry=0x0) at src/api/baseapi.cpp:832
#19 0x00007f03e1aa1b73 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffbbc5e980, pix=0x55e2628952c0, page_index=0, filename=<optimized out>, retry_config=0x0,
    timeout_millisec=<optimized out>, renderer=0x55e262895250) at src/api/baseapi.cpp:1217
#20 0x00007f03e1aa2fcc in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffbbc5e980,
    filename=filename@entry=0x7fffbbc5f75b "/tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=0x55e262895250) at src/api/baseapi.cpp:1180
#21 0x00007f03e1aa3236 in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffbbc5e980,
    filename=filename@entry=0x7fffbbc5f75b "/tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=<optimized out>) at src/api/baseapi.cpp:997
#22 0x000055e242442939 in main1 (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/15.1.1/bits/unique_ptr.h:193
#23 0x000055e24243f582 in main (argc=<optimized out>, argv=<optimized out>) at src/tesseract.cpp:858
(gdb) f 5
#5  0x00007f03e1baf003 in tesseract::UNICHARSET::get_isalpha (this=0x55e2617109c0, unichar_id=216) at ./src/ccutil/unicharset.h:501
501	    ASSERT_HOST(contains_unichar_id(unichar_id));
(gdb) l
496	  // Return the isalpha property of the given unichar.
497	  bool get_isalpha(UNICHAR_ID unichar_id) const {
498	    if (INVALID_UNICHAR_ID == unichar_id) {
499	      return false;
500	    }
501	    ASSERT_HOST(contains_unichar_id(unichar_id));
502	    return unichars[unichar_id].properties.isalpha;
503	  }
504	
505	  // Return the islower property of the given unichar.
(gdb) p unichar_id
$2 = 216
(gdb)
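
To make the failure mode in frame #5 easier to discuss, here is a minimal sketch of it in isolation. It is not tesseract code: ToyCharset, its member names, and the table size of 112 are hypothetical stand-ins. It assumes that contains_unichar_id fails because id 216 lies outside the table of the UNICHARSET that was passed in, which the backtrace suggests but does not prove. The strict lookup mirrors the ASSERT_HOST behaviour shown above, and the checked lookup shows one possible lenient alternative, not a proposed patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a character set.
// This is NOT tesseract's UNICHARSET; names and layout are assumptions.
using UnicharId = int;
constexpr UnicharId kInvalidId = -1;

struct ToyCharset {
  std::vector<bool> isalpha;  // one isalpha flag per known id

  // True only for ids that index into the table.
  bool contains(UnicharId id) const {
    return id != kInvalidId && id >= 0 &&
           static_cast<std::size_t>(id) < isalpha.size();
  }

  // Strict lookup: same shape as the listing above. An invalid id returns
  // false, any other id outside the table trips the assertion and aborts.
  bool get_isalpha_strict(UnicharId id) const {
    if (id == kInvalidId) {
      return false;
    }
    assert(contains(id));
    return isalpha[static_cast<std::size_t>(id)];
  }

  // Lenient lookup: ids outside the table are reported as non-alphabetic
  // instead of aborting the process.
  bool get_isalpha_checked(UnicharId id) const {
    return contains(id) && isalpha[static_cast<std::size_t>(id)];
  }
};
```

With a table of 112 entries (an arbitrary size chosen for illustration), get_isalpha_checked(216) returns false, while get_isalpha_strict(216) would abort in a build with assertions enabled, much like the crash reported here. Whether the right fix is a lenient lookup or making sure the caller passes an id from the matching character set is a question for the maintainers.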
