Investigate replacement for syntect (syntax highlighting) #2758

Open
Keats opened this issue Jan 2, 2025 · 17 comments

@Keats
Collaborator

Keats commented Jan 2, 2025

Previous thread (#1787) was focused on tree-sitter but it might turn out to not be the best choice due to size (some syntaxes are 90MB+) and slowness to load (which is planned to be fixed eventually).

An alternative would be to write a textmate parser so we can re-use all of VSCode syntaxes/themes since it has a big momentum and should not be too hard to write. I have a branch with just the serde structs in https://github.com/getzola/giallo/tree/tm
We can use https://github.com/rust-onig/rust-onig for the regex library, like syntect is doing.

@si14

si14 commented Jan 2, 2025

Fwiw I had a look at about a dozen grammars, and all of them were using very simple regexes. I believe the regex crate should be able to handle them, maybe with a translation layer if there are any exceptions?

@Keats
Collaborator Author

Keats commented Jan 2, 2025

Most of the tm grammars are written with onig in mind; the regex crate doesn't have look-around or backtracking, which are definitely used in some places.

@si14

si14 commented Jan 2, 2025

Fair, I double checked and realised I didn't pay enough attention, there are plenty of ?<= and ?=. Sorry for the noise, I was wrong.

@nrdxp

nrdxp commented Jan 2, 2025

re: #1787 (comment)

Personally I haven't seen a compelling difference between tree-sitter and textmate for code snippets

This is probably more true with popular languages with well maintained grammars. A good counter-example would be the Nix language, which I haven't seen a decent highlighter for, except in the tree-sitter realm.

@Keavon

Keavon commented Jan 2, 2025

Should this thread be given a Help Wanted label to attract contributors?

@Popolon

Popolon commented Jan 12, 2025

Why not offer a choice between the proposed highlighters, instead of only one? Aren't there license problems with VS Code syntaxes, due to MS politics, that would force people to create new ones, with possible legal issues?

This blog post covers Pygments support in Zola. Pygments is not the fastest, but it supports a large number of languages and would avoid any legal issues in FLOSS usage: https://c.pgdm.ch/notes/zola-pygments/

There is also pygments for rust: https://github.com/Alignof/pygments-rs

@Keats
Collaborator Author

Keats commented Jan 13, 2025

Why not offer a choice between the proposed highlighters, instead of only one?

Too much work and inconsistencies.

Aren't there license problems with VS Code syntaxes, due to MS politics, that would force people to create new ones, with possible legal issues?

Can you expand on that? Most of the syntaxes I've seen are MIT licensed.

This blog post covers Pygments support in Zola
There is also pygments for rust: https://github.com/Alignof/pygments-rs

If you look at that repo, you'll see there isn't a single Rust file. It's probably just a fork of the original pygments codebase. I actually started implementing pygments in Rust years ago for Zola but stopped because the output is nowhere near as nice as syntect/vscode/tree-sitter.

IMO the right play is to port https://github.com/microsoft/vscode-textmate to Rust just to benefit from up to date syntaxes and the super active community there. It's not a crazy amount of work except for the fact they seem to really not like to use comments to explain what's going on. Honestly if no one does it before me, I will likely do it but it's not going to be in any reasonable time frame x)

@si14

si14 commented Apr 26, 2025

I started a vscode-textmate rewrite here: https://github.com/si14/rust-textmate

It's very, very early days; so far I have only succeeded in loading all grammars (after backtracking from several dead ends). Assorted, unordered notes in case anyone gets to it before I do, and to remember the context if I don't get to it for a while:

  • The way textmate grammars work is cursed.
  • This page lists a bunch of useful links (none of them are "reference") https://markdown-all-in-one.github.io/docs/contributing/textmate-language-grammar.html
  • Trying to distinguish between end/while matches while parsing json with serde does not work well. I tried, but it's just too inflexible to cope with how flexible the "spec" is and how many mistakes there are in real world grammars. I intend to do this in two stages, ingest messy JSON (using same flexible structure vscode-textmate does), then clean up into better data structures, intern strings, etc.
  • end/while regexps aren't real regexps, because textmate refers to match groups from the begin regex. vscode-textmate copes with that using string substitution on the regexes. Since we want compiled regexes, they'll probably need some sort of runtime cache. Details 1, details 2.
  • Shiki has an exceptional collection of cleaned-up and optimised grammars (they use oniguruma-parser to lint/optimise regexes).
  • It was quite eye opening to run the parser on all grammars.
  • Oniguruma itself is abandoned as of a few days ago. I suppose a fork will emerge at some point.
  • Last rust-onig release is three years old and is currently broken under new gcc.
  • I looked long and hard at using fancy-regex instead of Oniguruma, but it doesn't seem feasible. Lookaheads and lookbehinds would work, and even some missing escape sequences can be worked around, but Subexp calls ("Tanaka Akira special") appear to be a hard blocker.
  • Includes are fun. This writeup might leave an impression that they are just IDs; they aren't, they are complicated
  • I haven't found a good centralised reference for how injections are supposed to work yet. There are bits and pieces like this one
  • One complication for a straight rewrite is that while vscode-textmate is written in typescript, the underlying language tolerates shenanigans that won't work in rust. For example, while most grammars use capture groups with names corresponding to indices (eg captures: {"0": {...}, "1": {...}}), some grammars use arrays (captures: [{}, {}]) and it still seems to work fine in JS. Not so much in Rust.
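
To make the end/while point above concrete, here is a minimal, std-only sketch of the string-substitution idea: `\1`, `\2`, ... in an end pattern are replaced with the (escaped) text captured by the begin match, and the result would then be handed to the regex engine. The function names `escape_regex` and `resolve_backrefs` and the exact set of escaped characters are assumptions made for illustration; this is not the vscode-textmate code.

```rust
/// Escape characters that are special in a regex so that captured text is
/// matched literally. The exact character set is an assumption of this sketch.
fn escape_regex(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if "\\^$.|?*+()[]{}-#".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Replace `\1`, `\2`, ... in `end_pattern` with the escaped text captured by
/// the begin match. `captures[0]` is the whole match; groups that are missing
/// or did not participate are replaced with the empty string.
fn resolve_backrefs(end_pattern: &str, captures: &[Option<&str>]) -> String {
    let chars: Vec<char> = end_pattern.chars().collect();
    let mut out = String::with_capacity(end_pattern.len());
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '\\' && i + 1 < chars.len() {
            if chars[i + 1].is_ascii_digit() {
                // Read the full group number after the backslash.
                let mut j = i + 1;
                let mut n: usize = 0;
                while j < chars.len() && chars[j].is_ascii_digit() {
                    let digit = chars[j].to_digit(10).unwrap() as usize;
                    n = n.saturating_mul(10).saturating_add(digit);
                    j += 1;
                }
                let text = captures.get(n).copied().flatten().unwrap_or("");
                out.push_str(&escape_regex(text));
                i = j;
            } else {
                // Copy any other escape pair verbatim, so `\\1` (an escaped
                // backslash followed by "1") is not treated as a backreference.
                out.push(chars[i]);
                out.push(chars[i + 1]);
                i += 2;
            }
        } else {
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

For a heredoc-style rule whose begin match captured `EOF` in group 1, `resolve_backrefs(r"^\1$", &[Some("<<EOF"), Some("EOF")])` yields `^EOF$`. The resolved string would also be a natural key for the runtime cache of compiled regexes mentioned above.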

I'll probably continue working on this very slowly.

@si14

si14 commented Apr 27, 2025

More notes!

@Keats
Collaborator Author

Keats commented Apr 27, 2025

Nice! I got as far as deserializing grammars a few months ago but realised I wouldn't have the time to do anything with it. I didn't know about oniguruma-to-es though, that's nice.

The way textmate grammars work is cursed.

I remember spending lots of time on it maybe a year ago or so, I can confirm.

@slevithan

That's solid research, @si14. Also CC @RedCMD who is the author of TmLanguage-Syntax-Highlighter which you highlighted above.

As the author of oniguruma-to-es and oniguruma-parser (which Shiki uses to optimize Oniguruma regexes and translate them to JS), I'd be happy to answer questions related to those libraries, if helpful.

And yeah, you can't simply use a different regex implementation like Rust's regex crate, fancy-regex, PCRE2, Onigmo, or anything else if you want to support the long tail of TextMate grammars supported by VS Code, Shiki, etc. Oniguruma has lots of features and lots of edge case differences (in both syntax and behavior) compared to other engines. Simply looking at its syntax doc won't give the whole story. Getting oniguruma-to-es to its current ~99.99% support of real-world Oniguruma regexes (resulting in all but one regex out of ~55k being supported in Shiki's 221 provided grammars) required building by far the most sophisticated regex translator in the world. GitHub Linguist made the (IMO terrible) decision to use PCRE instead of Oniguruma (with a too-basic translation layer), and that has resulted in more than a decade of incompatibility bugs and pain for some TextMate grammar authors.

@si14

si14 commented Apr 29, 2025

@slevithan thank you for your kind words and for the massive amount of work you've done! I might take you up on the offer in the future, ha.

@RedCMD

RedCMD commented Apr 29, 2025

Hello allo

TextMate 2.0 doesn't support captures: [{}, {}] syntax
the same also happens with "injectionSelector": JS converts an array to a string with comma separators, which just happens to work

https://github.com/microsoft/vscode
https://github.com/microsoft/vscode-textmate/tree/v9.2.0
https://github.com/microsoft/vscode-oniguruma/tree/v1.7.0
https://github.com/kkos/oniguruma/tree/v6.9.8

https://github.com/textmate/textmate
https://github.com/textmate/Onigmo/tree/Onigmo-5.13.5
fork of https://github.com/k-takata/Onigmo
fork of https://github.com/kkos/oniguruma

@si14

si14 commented May 6, 2025

Progress report: I managed to compile most of shiki grammars into a format loosely similar to what vscode does 🎉

@RedCMD @slevithan I think I broadly understand how grammar execution works, but there are still some bits I'm struggling with. I'd appreciate it if you could share your thoughts:

  • what's the deal with \G and \A? \A seems trivial, but then I can see it being special cased together with \G here. Also, what does \G mean precisely?
  • why does vscode have a special case for \z?

Thanks in advance!

@RedCMD

RedCMD commented May 6, 2025

VSCode needs a way to tell oniguruma that the current line is the first, middle or last line of the document
oniguruma added options to do so kkos/oniguruma#198
microsoft/vscode-textmate@1aec087
but it caused too big a performance problem
so it had to be reverted microsoft/vscode-textmate@eeff31f

so VSCode instead replaces \A with unicode character \uFFFF
when the string is NOT the first line of the document
(causing other problems microsoft/vscode-textmate#126)

VSCode attempts something similar with replacing \z with $(?!\n)(?<!\n)
but that regex never matches
because VSCode always places a newline at the end of the string
EDIT: it does match at the end of a (recaptured) string

the \G anchor should only match at the position directly after a begin or while
or the beginning of the next line if begin/while captures the ending newline \n

@slevithan

slevithan commented May 6, 2025

what's the deal with \G and \A? \A seems trivial, but then I can see it being special cased together with \G here.

That code you referenced is pretty bad, and hard to understand. It seems to be replacing \A and \G, under certain conditions, with an escaped literal U+FFFF ('\\\uFFFF', which matches the same thing as just '\uFFFF' or the literal U+FFFF character). My understanding of this is that it's trying to prevent these particular \A and \G tokens from ever matching, by replacing them with a code point that will rarely (but not never!) be in the target string. This is implied in the names allowA and allowG. There are perfectly good ways to prevent a regex from matching, such as (?!), that are not such a terrible hack. But this is just one of many things that points to the authors of vscode-textmate not being particularly knowledgeable about regexes.
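
As a rough illustration of the rewrite described above, here is a std-only sketch that scans a pattern source and swaps unescaped `\A` / `\G` for a caller-chosen "never matches" string. The function name `disable_anchors` and its `never_match` parameter are hypothetical; the sketch emits the replacement verbatim (with no leading backslash) and does not treat character classes specially, so it is an approximation of the idea rather than a port of vscode-textmate.

```rust
/// Rewrite unescaped `\A` / `\G` anchors so that they can never match.
///
/// `allow_a` / `allow_g` mirror the `allowA` / `allowG` names mentioned above.
/// `never_match` is a hypothetical parameter: "\u{FFFF}" imitates the hack,
/// while "(?!)" is an always-failing lookahead.
fn disable_anchors(pattern: &str, allow_a: bool, allow_g: bool, never_match: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // `c` is a backslash: inspect the escaped character that follows it.
        match chars.next() {
            Some('A') if !allow_a => out.push_str(never_match),
            Some('G') if !allow_g => out.push_str(never_match),
            // Any other escape (including `\\`) is copied through untouched,
            // so `\\A` (literal backslash, then "A") is not seen as an anchor.
            Some(next) => {
                out.push('\\');
                out.push(next);
            }
            // A trailing lone backslash is kept as-is.
            None => out.push('\\'),
        }
    }
    out
}
```

With `allow_a = false`, `disable_anchors(r"\A\s*foo", false, true, "(?!)")` gives `(?!)\s*foo`, which cannot match regardless of what the target string contains.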

If you look at tokenizeString.ts you'll see this seems to be related to using Oniguruma options ONIG_OPTION_NOT_BEGIN_STRING and ONIG_OPTION_NOT_BEGIN_POSITION (which make \A and \G fail, respectively). But if you're already using these Oniguruma options, I don't know why you need to mess with the patterns. It seems all of this might be related to debugging (see UseOnigurumaFindOptions).

I would skip reproducing any of this. I don't think it's needed, since TextMate grammars aren't able to set these Oniguruma options anyway.

Also, what does \G mean precisely?

Like ^ or \b, \G is an assertion that either matches an empty string or fails to match (forcing backtracking and possibly overall match failure), based on the match attempt position. It will match at position 0 during the first match attempt for a target string. In subsequent match attempts, it will match if the current position is the start of the match attempt (Oniguruma, Onigmo) or the end of the previous match (.NET, PCRE, Perl, Java, Boost.Regex).

The distinction in my last sentence is subtle but important. It's relevant after zero-length matches, where the read-head advance will make the "end of the previous match" one character prior to the start of the match attempt.
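
The distinction can be modelled without a regex engine. The sketch below only tracks the two positions involved, using hypothetical names (`GAnchor`, `ScanState`, `after_match`, `g_matches_at`), and assumes the usual global-match loop that bumps the read head by one position after a zero-length match.

```rust
/// Which position `\G` is tied to on match attempts after the first one.
#[derive(Clone, Copy, Debug, PartialEq)]
enum GAnchor {
    /// Oniguruma / Onigmo, per the description above.
    AttemptStart,
    /// .NET, PCRE, Perl, Java, Boost.Regex, per the description above.
    PreviousMatchEnd,
}

/// Positions a (hypothetical) global-match loop carries between attempts.
#[derive(Clone, Copy, Debug)]
struct ScanState {
    prev_match_end: usize,
    attempt_start: usize,
}

/// State for the next attempt after a match spanning `start..end`.
/// After a zero-length match the read head advances by one position so the
/// loop keeps making progress.
fn after_match(start: usize, end: usize) -> ScanState {
    let attempt_start = if start == end { end + 1 } else { end };
    ScanState { prev_match_end: end, attempt_start }
}

/// Would `\G` succeed at `pos` under the given flavor?
fn g_matches_at(pos: usize, state: ScanState, flavor: GAnchor) -> bool {
    match flavor {
        GAnchor::AttemptStart => pos == state.attempt_start,
        GAnchor::PreviousMatchEnd => pos == state.prev_match_end,
    }
}
```

After a zero-length match at position 3, `after_match(3, 3)` starts the next attempt at position 4: `\G` succeeds there under `AttemptStart` but fails under `PreviousMatchEnd`. After a non-empty match the two flavors agree.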

JavaScript regexes have flag /y which is similar but far less flexible than assertion \G, since \G can appear anywhere within a pattern (including things like (?!\G)). Oniguruma-To-ES uses several tricks to make \G work with native JavaScript regexes. Sometimes it adds flag y, sometimes it makes other pattern changes, and sometimes it pairs pattern changes with special handling in a RegExp subclass. I can tell you from my work on emulating \G that a lot of TextMate grammars use \G in a lot of different ways.

why does vscode have a special case for \z?

I can't make sense of it, after a quick look. If that actually applies in the standard case, it's inappropriately changing the definition of \z to not allow matching the position at the end of the string if it's preceded by a line feed. Maybe this was again some kind of debugging thing, or maybe the author thought they were making it equivalent to their understanding of uppercase \Z from other regex flavors (without realizing they could just use \Z in Oniguruma). But really, it's inventing a unique definition that doesn't make sense and is different than the meaning of $, \Z, or \z in any regex flavor. Like I said, lots of the code seems to indicate a poor understanding of regex subtleties and edge cases.

I'm not sure if that's actually being used. But again, I think you can just ignore it. I say that because Oniguruma-To-ES uses a correct definition when it translates \z to JS, yet it manages to match exactly the same strings for ~55,000 regexes when tested against Shiki's 220+ language samples (each grammar is tested against its language sample using both Oniguruma via WASM and JS regexes generated by Oniguruma-To-ES, to generate this report).

@slevithan

slevithan commented May 6, 2025

Whoa, @RedCMD dropping knowledge of the actual history of these changes! 😁 (I started replying before he posted, and I'm not super knowledgeable like he is about TextMate grammars and vscode-textmate.)

I still think you can skip reproducing these hacks if you're planning to use Oniguruma-To-ES with a JS RegExp engine. I'm not deeply knowledgeable of how Shiki interacts with vscode-textmate when it uses Oniguruma-To-ES, but I do know that Oniguruma-To-ES's translations of \A, \G, and \z don't leave these original metasequences behind (since they aren't supported in native JS regexes), so vscode-textmate won't find any of them in the patterns. But it works without being able to do so.

Aside: If you're interested in understanding the precise meaning of particular Oniguruma features (and you have a good understanding of JS regexes), you can try them on the Oniguruma-To-ES demo page. Of course, if it's a complex feature you might only be seeing the translation in a particular context, but playing around with different patterns there can give you a good idea.
