Investigate replacement for syntect (syntax highlighting) #2758

Keats · 2025-01-02T09:42:53Z

Previous thread (#1787) was focused on tree-sitter but it might turn out to not be the best choice due to size (some syntaxes are 90MB+) and slowness to load (which is planned to be fixed eventually).

An alternative would be to write a TextMate parser so we can reuse all the VS Code syntaxes/themes, since that ecosystem has a lot of momentum and a parser should not be too hard to write. I have a branch with just the serde structs at https://github.com/getzola/giallo/tree/tm
We can use https://github.com/rust-onig/rust-onig for the regex library, like syntect is doing.

si14 · 2025-01-02T14:12:35Z

Fwiw I had a look at about a dozen grammars, and all of them were using very simple regexes. I believe the regex crate should be able to handle them, maybe with a translation layer if there are any exceptions?

Keats · 2025-01-02T14:19:01Z

Most of the tm grammars are written with onig in mind; regex doesn't have lookaround or backtracking, which are definitely used in some places.

si14 · 2025-01-02T14:33:01Z

Fair, I double-checked and realised I hadn't paid enough attention: there are plenty of ?<= and ?=. Sorry for the noise, I was wrong.
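
A quick way to repeat that check across many grammars is to scan each pattern for lookaround groups. The following is a minimal sketch, not a real Oniguruma parser: `uses_lookaround` is a hypothetical helper, and it only skips escaped characters and simple `[...]` classes, so nested or unusual character classes may be misjudged.

```rust
/// Returns true if `pattern` appears to contain a lookahead or lookbehind
/// group: (?=...), (?!...), (?<=...) or (?<!...).
/// Heuristic only: escaped characters and simple [...] classes are skipped.
pub fn uses_lookaround(pattern: &str) -> bool {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    let mut in_class = false;
    while i < bytes.len() {
        match bytes[i] {
            // Skip the backslash and the character it escapes.
            b'\\' => {
                i += 2;
                continue;
            }
            b'[' if !in_class => in_class = true,
            b']' if in_class => in_class = false,
            b'(' if !in_class => {
                let rest = &bytes[i + 1..];
                if rest.starts_with(b"?=")
                    || rest.starts_with(b"?!")
                    || rest.starts_with(b"?<=")
                    || rest.starts_with(b"?<!")
                {
                    return true;
                }
            }
            _ => {}
        }
        i += 1;
    }
    false
}
```

Named groups such as `(?<name>...)` and non-capturing groups such as `(?:...)` are deliberately not counted, since the regex crate can handle those.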

nrdxp · 2025-01-02T18:04:19Z

re: #1787 (comment)

> Personally I haven't seen a compelling difference between tree-sitter and textmate for code snippets

This is probably more true with popular languages with well maintained grammars. A good counter-example would be the Nix language, which I haven't seen a decent highlighter for, except in the tree-sitter realm.

Keavon · 2025-01-02T18:35:05Z

Should this thread be given a Help Wanted label to attract contributors?

Popolon · 2025-01-12T14:03:33Z

Why not offer a choice between the proposed highlighters instead of only one? Aren't there license problems with the VS Code syntaxes, due to MS politics, that would force people to create new ones, with possible legal issues?

This blog post talks about Pygments support in Zola. Pygments is not the fastest, but it supports a large number of languages and would avoid any legal issues in FLOSS usage: https://c.pgdm.ch/notes/zola-pygments/

There is also pygments for rust: https://github.com/Alignof/pygments-rs

Keats · 2025-01-13T18:47:19Z

> Why not offer a choice between the proposed highlighters instead of only one?

Too much work and inconsistencies.

> Aren't there license problems with the VS Code syntaxes, due to MS politics, that would force people to create new ones, with possible legal issues?

Can you expand on that? Most of the syntaxes I've seen are MIT licensed.

> This blog post talks about Pygments support in Zola
> There is also pygments for rust: https://github.com/Alignof/pygments-rs

If you look at that repo, you'll see there isn't a single Rust file. It's probably just a fork of the original pygments codebase. I actually started implementing pygments in Rust years ago for Zola but stopped because the output is nowhere near as nice as syntect/vscode/tree-sitter.

IMO the right play is to port https://github.com/microsoft/vscode-textmate to Rust just to benefit from up to date syntaxes and the super active community there. It's not a crazy amount of work except for the fact they seem to really not like to use comments to explain what's going on. Honestly if no one does it before me, I will likely do it but it's not going to be in any reasonable time frame x)

si14 · 2025-04-26T23:31:16Z

I started a vscode-textmate rewrite here: https://github.com/si14/rust-textmate

It's very, very early days; so far I've only managed to load all the grammars (after backtracking from several dead ends). Assorted, unordered notes in case anyone gets to it before I do, and to remember the context if I don't get to it for a while:

The way textmate grammars work is cursed.
This page lists a bunch of useful links (none of them are "reference") https://markdown-all-in-one.github.io/docs/contributing/textmate-language-grammar.html
Trying to distinguish between end/while matches while parsing JSON with serde does not work well. I tried, but it's just too inflexible to cope with how flexible the "spec" is and how many mistakes there are in real-world grammars. I intend to do this in two stages: ingest the messy JSON (using the same flexible structure vscode-textmate does), then clean it up into better data structures, intern strings, etc.
end/while regexps aren't real regexps, because textmate refers to match groups from the begin regex. vscode-textmate copes with that using string substitution on the regexes. Since we want compiled regexes, they'll probably need some sort of runtime cache. Details 1, details 2.
Shiki has an exceptional collection of cleaned-up and optimised grammars (they use oniguruma-parser to lint/optimise regexes).
It was quite eye opening to run the parser on all grammars.
Oniguruma itself is abandoned as of a few days ago. I suppose a fork will emerge at some point.
Last rust-onig release is three years old and is currently broken under new gcc.
I looked long and hard at using fancy-regex instead of Oniguruma, but it doesn't seem feasible. Lookaheads and lookbehinds would work, and even some missing escape sequences can be worked around, but subexp calls ("Tanaka Akira special") appear to be a hard blocker.
Includes are fun. This writeup might leave an impression that they are just IDs; they aren't, they are complicated
I haven't found a good centralised reference for how injections are supposed to work yet. There are bits and pieces like this one
One complication for a straight rewrite is that while vscode-textmate is written in typescript, the underlying language tolerates shenanigans that won't work in rust. For example, while most grammars use capture groups with names corresponding to indices (eg captures: {"0": {...}, "1": {...}}), some grammars use arrays (captures: [{}, {}]) and it still seems to work fine in JS. Not so much in Rust.
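
The two-stage idea from the notes above (ingest leniently, then clean up) can be illustrated with the captures case from the last note. This is a minimal sketch under assumptions: `RawCaptures`, `RawCapture`, `Captures` and `normalize_captures` are hypothetical names, not types from vscode-textmate or any existing crate, and the deserialization step that would produce `RawCaptures` is left out.

```rust
use std::collections::BTreeMap;

/// Stage one: what a lenient loader might produce. Real grammars use both
/// shapes: {"0": {...}, "1": {...}} and [{...}, {...}].
#[derive(Debug, Clone, PartialEq)]
pub enum RawCaptures {
    Map(BTreeMap<String, RawCapture>),
    List(Vec<RawCapture>),
}

/// Only the scope name is modelled here; real captures can carry more.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct RawCapture {
    pub name: Option<String>,
}

/// Stage two: a strict shape keyed by capture group index.
pub type Captures = BTreeMap<usize, RawCapture>;

pub fn normalize_captures(raw: RawCaptures) -> Captures {
    match raw {
        // Keys that are not plain indices are dropped in this sketch; a real
        // loader would probably want to report them instead.
        RawCaptures::Map(map) => map
            .into_iter()
            .filter_map(|(key, cap)| key.parse::<usize>().ok().map(|idx| (idx, cap)))
            .collect(),
        // In JS an array behaves like an object keyed by position, which is
        // why the array form "still seems to work" there.
        RawCaptures::List(list) => list.into_iter().enumerate().collect(),
    }
}
```

After this step the rest of the code only ever sees `Captures`, so the tolerance for odd input stays confined to the loader.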

I'll probably continue working on this very slowly.
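
For the end/while note above, here is a rough sketch of the string-substitution approach plus a cache keyed by the resolved pattern. Everything in it is an assumption for illustration: `resolve_backrefs`, `escape_regex`, `RegexCache` and `Compiled` are made-up names, only single-digit references `\0`..`\9` are handled, and `Compiled` is a stand-in for whatever compiled regex type the chosen engine provides.

```rust
use std::collections::HashMap;

/// Escapes regex metacharacters so captured text is matched literally.
fn escape_regex(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if r"\.^$|?*+()[]{}-#".contains(ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Replaces \0..\9 in an `end`/`while` pattern with the escaped text that the
/// corresponding group of the `begin` match captured.
pub fn resolve_backrefs(pattern: &str, begin_captures: &[&str]) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(ch) = chars.next() {
        if ch != '\\' {
            out.push(ch);
            continue;
        }
        match chars.next() {
            Some(d) if d.is_ascii_digit() => {
                let idx = d.to_digit(10).unwrap() as usize;
                // A missing group resolves to the empty string in this sketch.
                let text = begin_captures.get(idx).copied().unwrap_or("");
                out.push_str(&escape_regex(text));
            }
            // Any other escape is copied through unchanged.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            None => out.push('\\'),
        }
    }
    out
}

/// Stand-in for a compiled regex; a real version would hold the engine's type.
#[derive(Debug, Clone, PartialEq)]
pub struct Compiled(pub String);

/// Runtime cache so each distinct resolved pattern is compiled only once.
#[derive(Default)]
pub struct RegexCache {
    compiled: HashMap<String, Compiled>,
    pub compilations: usize,
}

impl RegexCache {
    pub fn get(&mut self, resolved: &str) -> &Compiled {
        if !self.compiled.contains_key(resolved) {
            self.compilations += 1;
            self.compiled
                .insert(resolved.to_string(), Compiled(resolved.to_string()));
        }
        &self.compiled[resolved]
    }
}
```

With a heredoc-style rule, `resolve_backrefs(r"^\1$", &["<<EOF", "EOF"])` yields `^EOF$`, and looking that string up twice in the cache compiles it once.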

si14 · 2025-04-27T13:18:51Z

More notes!

It seems feasible to use regress (which seems to be a well-supported pure-Rust ES regex implementation) together with Shiki's grammars and oniguruma-to-es. Shiki folks managed to make their grammars 99.5% compatible with ES regexes, so this should work.
LOTS of very useful information on TextMate grammars https://github.com/RedCMD/TmLanguage-Syntax-Highlighter , especially this JSON Schema.

Keats · 2025-04-27T18:40:39Z

Nice! I got as far as loading and deserializing grammars a few months ago but realised I wouldn't have the time to do anything more with it. I didn't know about oniguruma-to-es though, that's nice.

> The way textmate grammars work is cursed.

I remember spending lots of time on it maybe a year ago or so, I can confirm.

slevithan · 2025-04-29T13:19:38Z

That's solid research, @si14. Also CC @RedCMD who is the author of TmLanguage-Syntax-Highlighter which you highlighted above.

As the author of oniguruma-to-es and oniguruma-parser (which Shiki uses to optimize Oniguruma regexes and translate them to JS), I'd be happy to answer questions related to those libraries, if helpful.

And yeah, you can't simply use a different regex implementation like Rust's regex crate, fancy-regex, PCRE2, Onigmo, or anything else if you want to support the long tail of TextMate grammars supported by VS Code, Shiki, etc. Oniguruma has lots of features and lots of edge case differences (in both syntax and behavior) compared to other engines. Simply looking at its syntax doc won't give the whole story. Getting oniguruma-to-es to its current ~99.99% support of real-world Oniguruma regexes (resulting in all but one regex out of ~55k being supported in Shiki's 221 provided grammars) required building by far the most sophisticated regex translator in the world. GitHub Linguist made the (IMO terrible) decision to use PCRE instead of Oniguruma (with a too-basic translation layer), and that has resulted in more than a decade of incompatibility bugs and pain for some TextMate grammar authors.

si14 · 2025-04-29T13:37:01Z

@slevithan thank you for your kind words and for the massive amount of work you've done! I might take you up on the offer in the future, ha.

RedCMD · 2025-04-29T22:43:35Z

Hello allo

TextMate 2.0 doesn't support the captures: [{}, {}] syntax
the same also happens with "injectionSelector": JS converts an array to a string with comma separators, which just happens to work
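
A strict Rust loader would have to reproduce that coercion explicitly. A small sketch, with `RawSelector` and `coerce_selector` as hypothetical names, and assuming the joined string is then parsed like any other selector (where a comma reads as a separator between alternatives):

```rust
/// What a lenient loader might see for "injectionSelector".
#[derive(Debug, Clone, PartialEq)]
pub enum RawSelector {
    Single(String),
    Many(Vec<String>),
}

/// Mimics JavaScript's array-to-string coercion, which joins items with ",".
pub fn coerce_selector(raw: RawSelector) -> String {
    match raw {
        RawSelector::Single(s) => s,
        RawSelector::Many(parts) => parts.join(","),
    }
}
```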

https://github.com/microsoft/vscode
https://github.com/microsoft/vscode-textmate/tree/v9.2.0
https://github.com/microsoft/vscode-oniguruma/tree/v1.7.0
https://github.com/kkos/oniguruma/tree/v6.9.8

https://github.com/textmate/textmate
https://github.com/textmate/Onigmo/tree/Onigmo-5.13.5
fork of https://github.com/k-takata/Onigmo
fork of https://github.com/kkos/oniguruma

si14 · 2025-05-06T10:22:43Z

Progress report: I managed to compile most of Shiki's grammars into a format loosely similar to what vscode does 🎉

@RedCMD @slevithan I think I broadly understand how grammar execution works, but there are still some bits I'm struggling with. I'd appreciate it if you could share your thoughts:

what's the deal with \G and \A? \A seems trivial, but then I can see it being special cased together with \G here. Also, what does \G mean precisely?
why does vscode have a special case for \z?

Thanks in advance!

RedCMD · 2025-05-06T11:25:00Z

VSCode needs a way to tell oniguruma that the current line is the first, middle or last line of the document
oniguruma added options to do so kkos/oniguruma#198
microsoft/vscode-textmate@1aec087
but it caused too big a performance problem
so it had to be reverted microsoft/vscode-textmate@eeff31f

so VSCode instead replaces \A with unicode character \uFFFF �
when the string is NOT the first line of the document
(causing other problems microsoft/vscode-textmate#126)

VSCode attempts something similar with replacing \z with $(?!\n)(?<!\n)
but that regex never matches
because VSCode always places a newline at the end of the string
EDIT: it does match at the end of a (recaptured) string

the \G anchor should only match at the position directly after a begin or while
or the beginning of the next line if begin/while captures the ending newline \n
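
A minimal sketch of the \A / \G replacement described above, for anyone who wants to see the mechanics. `disable_anchors` is a hypothetical helper, not the vscode-textmate function; it pushes a bare U+FFFF rather than an escaped one, and it does not cover the \z case.

```rust
/// Rewrites `\A` and `\G` to the literal U+FFFF when they must not match,
/// e.g. `allow_a == false` when the string is not the first line.
pub fn disable_anchors(pattern: &str, allow_a: bool, allow_g: bool) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(ch) = chars.next() {
        if ch != '\\' {
            out.push(ch);
            continue;
        }
        // Look at the escaped character so that `\\A` (an escaped backslash
        // followed by a plain A) is left alone.
        match chars.next() {
            Some('A') if !allow_a => out.push('\u{FFFF}'),
            Some('G') if !allow_g => out.push('\u{FFFF}'),
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            None => out.push('\\'),
        }
    }
    out
}
```

The rewritten pattern can still match if the target text itself contains U+FFFF, which is the weakness of this trick.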

slevithan · 2025-05-06T12:03:39Z

> what's the deal with \G and \A? \A seems trivial, but then I can see it being special cased together with \G here.

That code you referenced is pretty bad, and hard to understand. It seems to be replacing \A and \G, under certain conditions, with an escaped literal U+FFFF ('\\\uFFFF', which matches the same thing as just '\uFFFF' or '�'). My understanding of this is that it's trying to prevent these particular \A and \G tokens from ever matching, by replacing them with a code point that will rarely (but not never!) be in the target string. This is implied in the names allowA and allowG. There are perfectly good ways to prevent a regex from matching, such as (?!), that are not such a terrible hack. But this is just one of many things that points to the authors of vscode-textmate not being particularly knowledgeable about regexes.

If you look at tokenizeString.ts you'll see this seems to be related to using Oniguruma options ONIG_OPTION_NOT_BEGIN_STRING and ONIG_OPTION_NOT_BEGIN_POSITION (which make \A and \G fail, respectively). But if you're already using these Oniguruma options, I don't know why you need to mess with the patterns. It seems all of this might be related to debugging (see UseOnigurumaFindOptions).

I would skip reproducing any of this. I don't think it's needed, since TextMate grammars aren't able to set these Oniguruma options anyway.

> Also, what does \G mean precisely?

Like ^ or \b, \G is an assertion that either matches an empty string or fails to match (forcing backtracking and possibly overall match failure), based on the match attempt position. It will match at position 0 during the first match attempt for a target string. In subsequent match attempts, it will match if the current position is the start of the match attempt (Oniguruma, Onigmo) or the end of the previous match (.NET, PCRE, Perl, Java, Boost.Regex).

The distinction in my last sentence is subtle but important. It's relevant after zero-length matches, where the read-head advance will make the "end of the previous match" one character prior to the start of the match attempt.
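
To make that distinction concrete, here is a toy model of the two definitions. It is not an engine, only bookkeeping of positions: `GFlavor`, `ScanState`, `after_match` and `g_position` are hypothetical names, and positions are counted in characters.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GFlavor {
    /// Oniguruma, Onigmo: `\G` matches at the start of the match attempt.
    AttemptStart,
    /// .NET, PCRE, Perl, Java, Boost.Regex: `\G` matches at the end of the
    /// previous match.
    PreviousMatchEnd,
}

/// State of a scan loop that repeatedly searches the same target string.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ScanState {
    pub attempt_start: usize,
    pub previous_match_end: usize,
}

/// State after a match spanning `start..end`. After a zero-length match the
/// read head advances by one so the scan does not loop forever.
pub fn after_match(start: usize, end: usize) -> ScanState {
    ScanState {
        attempt_start: if end == start { end + 1 } else { end },
        previous_match_end: end,
    }
}

/// The position at which `\G` can match in the next attempt.
pub fn g_position(flavor: GFlavor, state: &ScanState) -> usize {
    match flavor {
        GFlavor::AttemptStart => state.attempt_start,
        GFlavor::PreviousMatchEnd => state.previous_match_end,
    }
}
```

After a non-empty match both flavors agree; after an empty match at position 3 the first flavor allows `\G` at 4 while the second still expects it at 3.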

JavaScript regexes have flag /y which is similar but far less flexible than assertion \G, since \G can appear anywhere within a pattern (including things like (?!\G)). Oniguruma-To-ES uses several tricks to make \G work with native JavaScript regexes. Sometimes it adds flag y, sometimes it makes other pattern changes, and sometimes it pairs pattern changes with special handling in a RegExp subclass. I can tell you from my work on emulating \G that a lot of TextMate grammars use \G in a lot of different ways.

> why does vscode have a special case for \z?

I can't make sense of it, after a quick look. If that actually applies in the standard case, it's inappropriately changing the definition of \z to not allow matching the position at the end of the string if it's preceded by a line feed. Maybe this was again some kind of debugging thing, or maybe the author thought they were making it equivalent to their understanding of uppercase \Z from other regex flavors (without realizing they could just use \Z in Oniguruma). But really, it's inventing a unique definition that doesn't make sense and is different than the meaning of $, \Z, or \z in any regex flavor. Like I said, lots of the code seems to indicate a poor understanding of regex subtleties and edge cases.

I'm not sure if that's actually being used. But again, I think you can just ignore it. I say that because Oniguruma-To-ES uses a correct definition when it translates \z to JS, yet it manages to match exactly the same strings for ~55,000 regexes when tested against Shiki's 220+ language samples (each grammar is tested against its language sample using both Oniguruma via WASM and JS regexes generated by Oniguruma-To-ES, to generate this report).
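
For reference, the end-of-string definitions discussed above can be written as plain position checks. This is a sketch of my reading of them, assuming `$` can match before a line feed or at the very end of the string; the function names are made up and `pos` is a byte offset into `text`.

```rust
/// `\z`: only at the very end of the string.
pub fn matches_lower_z(text: &str, pos: usize) -> bool {
    pos == text.len()
}

/// `\Z`: at the very end, or just before a final line feed.
pub fn matches_upper_z(text: &str, pos: usize) -> bool {
    pos == text.len() || (pos + 1 == text.len() && text.ends_with('\n'))
}

/// The replacement `$(?!\n)(?<!\n)`: `(?!\n)` rules out positions before a
/// line feed, leaving only the end of the string, and `(?<!\n)` then rejects
/// that position when the previous character is a line feed.
pub fn matches_vscode_z(text: &str, pos: usize) -> bool {
    pos == text.len() && !text.ends_with('\n')
}
```

For a line that ends in "\n", `matches_vscode_z` is false at every position, which lines up with the observation that the replacement never matches when a newline is always appended.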

slevithan · 2025-05-06T12:10:00Z

Whoa, @RedCMD dropping knowledge of the actual history of these changes! 😁 (I started replying before he posted, and I'm not super knowledgeable like he is about TextMate grammars and vscode-textmate.)

I still think you can skip reproducing these hacks if you're planning to use Oniguruma-To-ES with a JS RegExp engine. I'm not deeply knowledgeable of how Shiki interacts with vscode-textmate when it uses Oniguruma-To-ES, but I do know that Oniguruma-To-ES's translations of \A, \G, and \z don't leave these original metasequences behind (since they aren't supported in native JS regexes), so vscode-textmate won't find any of them in the patterns. But it works without being able to do so.

Aside: If you're interested in understanding the precise meaning of particular Oniguruma features (and you have a good understanding of JS regexes), you can try them on the Oniguruma-To-ES demo page. Of course, if it's a complex feature you might only be seeing the translation in a particular context, but playing around with different patterns there can give you a good idea.

Keats mentioned this issue Jan 2, 2025

Investigate tree-sitter to replace syntect #1787

Closed

Keats pinned this issue Jan 2, 2025

damccull mentioned this issue Jan 15, 2025

Request: Syntax highlights for http #2773

Closed

si14 mentioned this issue May 19, 2025

How similar to Oniguruma do we aim to be? fancy-regex/fancy-regex#162

Open
