Unicode characters in headers are not stripped from identifier when +auto_identifiers #10816

Open
cderv opened this issue May 2, 2025 · 5 comments

cderv (Contributor) commented May 2, 2025

❯ pandoc -t native -f markdown
# Test①
^Z
[ Header 1 ( "test\9312" , [] , [] ) [ Str "Test\9312" ] ]

This header uses the U+2460 character (①), which is kept in the identifier created for the header.

This makes Typst rendering fail, for example, because it produces the following .typ content:

=== Test①
<test①>
xxx
❯ pandoc -t typst -f markdown -s -o test.pdf index.md
error: unclosed label
    ┌─ \\?\C:\Users\chris\Documents\DEV_OTHER\00-TESTS\test-quarto\toPB3E1.typ:122:0
    │
122 │ <test①>
    │ ^^^^^

Error producing PDF.

For typst

A label's name can contain letters, numbers, _, -, :, and .

I would have expected the +auto_identifiers extension (https://pandoc.org/MANUAL.html#extension-auto_identifiers) to likewise allow only ASCII letters, since its documentation says:

The default algorithm used to derive the identifier from the heading text is:
(...)
Remove all non-alphanumeric characters, except underscores, hyphens, and periods.
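
A minimal sketch of the quoted step in JavaScript, assuming "alphanumeric" is read as any Unicode letter or number (the `\p{L}` and `\p{N}` classes). `keepIdentifierChars` is a hypothetical helper, not pandoc's implementation; it also lowercases the text, since the identifier in the native output above is lowercase. Under that reading, ① survives the filter:

```javascript
// Hypothetical helper: approximates only the quoted step (plus lowercasing).
// It is NOT pandoc's actual algorithm, which has further steps elided above.
function keepIdentifierChars(text) {
  return Array.from(text.toLowerCase())
    // \p{L} = letters, \p{N} = numbers (this includes "Other Number" such as ①)
    .filter((ch) => /[\p{L}\p{N}_.-]/u.test(ch))
    .join("");
}

console.log(keepIdentifierChars("Test①"));   // "test①"
console.log(keepIdentifierChars("Test! 1")); // "test1"
```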

I also tried LaTeX. There it does not prevent rendering, as the .tex content is

\section{Test①}\label{testux2460}

However, the character is not rendered in the document, only in the TOC link.

Should Unicode characters be dropped when creating identifiers? Or at least for Typst, where this is not supported, unlike with LaTeX?

Current workaround to this

Manually setting the id is the solution here.

# Test① {#test-1}

Originally posted in quarto-dev/quarto-cli#12660

cderv added the bug label May 2, 2025

jgm (Owner) commented May 2, 2025

One workaround is to use -f markdown+ascii_identifiers.
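
To illustrate why an ASCII-only identifier avoids the problem, here is a rough sketch. `asciiOnly` is a hypothetical helper that only approximates the effect for this header: pandoc's ascii_identifiers extension also strips accents from accented Latin letters rather than dropping those letters, which this sketch does not attempt.

```javascript
// Hypothetical helper: keep ASCII letters and digits plus _ . - only.
// Assumption: good enough to show the effect on "test①"; not pandoc's code.
function asciiOnly(identifier) {
  return Array.from(identifier)
    .filter((ch) => /[A-Za-z0-9_.-]/.test(ch))
    .join("");
}

console.log(asciiOnly("test①")); // "test"
```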

jgm (Owner) commented May 2, 2025

Note that the character ① does satisfy Data.Char.isAlphaNum. (It's a digit according to Unicode.)

Perhaps the analogous function typst uses does not classify it as alphanumeric? I'd have to know the details. Maybe someone who is familiar with the typst code base could point me to the right place in the code.
In any case, once we figure out what the differences are, we could implement a fix in the typst writer.

jgm (Owner) commented May 2, 2025

I note that the Rust documentation actually has this example:

assert!('①'.is_alphanumeric());

So the typst code must not be using is_alphanumeric.

@laurmaedje can you advise us on the correct restrictions for labels? (Summary of the above: <test①> doesn't work as a label despite being entirely alphanumeric, according to Unicode.)

cderv (Contributor, Author) commented May 2, 2025

My search led me to
https://github.com/typst/typst/blob/14241ec1aae43ce3bff96411f62af76a01c7f709/crates/typst-syntax/src/lexer.rs#L1091-L1095

https://github.com/typst/typst/blob/14241ec1aae43ce3bff96411f62af76a01c7f709/crates/typst-syntax/src/lexer.rs#L1075-L1077

It seems they use

use unicode_ident::{is_xid_continue, is_xid_start};

From this crate: https://docs.rs/unicode-ident/latest/unicode_ident

So only XID_Continue characters, from the Unicode specification's Default Identifier Syntax, are accepted:
https://unicode.org/reports/tr31/#Default_Identifier_Syntax

And '①' is not one of them, judging by this JavaScript check:

> console.log(/\p{XID_Continue}/u.test("①"))
false

I also found a reference to this in the Rust Reference: https://doc.rust-lang.org/reference/identifiers.html
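
Extending the check above, the following sketch compares the two classifications side by side and shows one possible way to make an identifier safe for a Typst label. It assumes a label character is either XID_Continue or one of `_`, `-`, `:`, `.` (inferred from the lexer links and the Typst documentation quoted earlier, not confirmed against the Typst source here). `isLabelChar` and `toTypstLabel` are hypothetical names.

```javascript
// Assumption: a Typst label character is XID_Continue or one of _ - : .
const isLabelChar = (ch) => /[\p{XID_Continue}_\-:.]/u.test(ch);

// Hypothetical sanitizer: drop every character the assumed rule rejects.
function toTypstLabel(identifier) {
  return Array.from(identifier).filter(isLabelChar).join("");
}

// Print, per character: Unicode letter-or-number? XID_Continue?
for (const ch of ["a", "1", "①"]) {
  console.log(ch, /[\p{L}\p{N}]/u.test(ch), /\p{XID_Continue}/u.test(ch));
}

console.log(toTypstLabel("test①")); // "test"
```

Simply dropping characters can make two different headers collide on the same label, or leave an empty label, so a real fix would need to handle those cases as well.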

jgm (Owner) commented May 2, 2025
