Investigate replacement for syntect (syntax highlighting) #2758
Comments
Fwiw I had a look at about a dozen grammars, and all of them were using very simple regexes. I believe the |
Most of the tm grammars are written with onig in mind; the regex crate doesn't have look-around or backtracking, both of which are definitely used in some places |
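To make the point above concrete, here is a minimal sketch of the two kinds of constructs in question. It uses JavaScript's built-in RegExp as a stand-in for a backtracking engine such as Oniguruma. The patterns and the `firstMatch` helper are made up for illustration and are not taken from any real grammar. Rust's regex crate documents look-around and backreferences as unsupported, so it would refuse to compile both patterns.

```typescript
// Illustrative patterns only; they are not taken from any real grammar.

// Backreference: \1 must repeat whichever quote character opened the string.
const quoted = /(["'])(?:\\.|(?!\1).)*\1/;

// Lookahead: match an identifier only when a "(" follows, without consuming it.
const callee = /[A-Za-z_]\w*(?=\s*\()/;

// Hypothetical helper: return the text of the first match, or null.
function firstMatch(re: RegExp, text: string): string | null {
  const m = re.exec(text);
  return m ? m[0] : null;
}
```

Neither pattern can be expressed with plain finite-automaton features, which is why swapping in a non-backtracking engine is not a drop-in change.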
Fair, I double checked and realised I didn't pay enough attention, there are plenty of |
re: #1787 (comment)
This is probably more true with popular languages with well maintained grammars. A good counter-example would be the Nix language, which I haven't seen a decent highlighter for, except in the tree-sitter realm. |
Should this thread be given a Help Wanted label to attract contributors? |
Why not offer a choice between the proposed highlighters instead of only one? Aren't there license problems with the VS Code syntaxes, due to Microsoft's policies, that would force people to create new ones, with possible legal issues? This blog post covers Pygments support in Zola, which is not the fastest option but supports a large number of languages and would avoid any legal issues in FLOSS usage: https://c.pgdm.ch/notes/zola-pygments/ There is also Pygments for Rust: https://github.com/Alignof/pygments-rs |
Too much work and inconsistencies.
Can you expand on that? Most of the syntaxes I've seen are MIT licensed.
If you look at that repo, you'll see there isn't a single Rust file. It's probably just a fork of the original pygments codebase. I actually started implementing pygments in Rust years ago for Zola but stopped because the output is nowhere near as nice as syntect/vscode/tree-sitter. IMO the right play is to port https://github.com/microsoft/vscode-textmate to Rust just to benefit from up-to-date syntaxes and the super active community there. It's not a crazy amount of work, except for the fact that they seem to really not like using comments to explain what's going on. Honestly, if no one does it before me, I will likely do it, but it's not going to be in any reasonable time frame x) |
I started a It's very, very early days, so far I've only succeeded in loading all grammars (after backtracking from several dead ends). Assorted, unordered notes in case anyone gets to it before I do, and to remember the context if I don't get to it for a while:
I'll probably continue working on this very slowly. |
More notes!
|
Nice! I got as far as loading and deserializing grammars a few months ago but realised I wouldn't have the time to do anything more on it. I didn't know about oniguruma-to-es though, that's nice.
I remember spending lots of time on it maybe a year ago or so, I can confirm. |
That's solid research, @si14. Also CC @RedCMD who is the author of TmLanguage-Syntax-Highlighter which you highlighted above. As the author of oniguruma-to-es and oniguruma-parser (which Shiki uses to optimize Oniguruma regexes and translate them to JS), I'd be happy to answer questions related to those libraries, if helpful. And yeah, you can't simply use a different regex implementation like Rust's regex crate, fancy-regex, PCRE2, Onigmo, or anything else if you want to support the long tail of TextMate grammars supported by VS Code, Shiki, etc. Oniguruma has lots of features and lots of edge case differences (in both syntax and behavior) compared to other engines. Simply looking at its syntax doc won't give the whole story. Getting oniguruma-to-es to its current ~99.99% support of real-world Oniguruma regexes (resulting in all but one regex out of ~55k being supported in Shiki's 221 provided grammars) required building by far the most sophisticated regex translator in the world. GitHub Linguist made the (IMO terrible) decision to use PCRE instead of Oniguruma (with a too-basic translation layer), and that has resulted in more than a decade of incompatibility bugs and pain for some TextMate grammar authors. |
@slevithan thank you for your kind words and for the massive amount of work you've done! I might take you up on the offer in the future, ha. |
Hello allo TextMate 2.0 doesn't support https://github.com/microsoft/vscode https://github.com/textmate/textmate |
Progress report: I managed to compile most of @RedCMD @slevithan I think I broadly understand how grammar execution works, but there are still some bits I'm struggling with. I'd appreciate it if you could share your thoughts:
Thanks in advance! |
VSCode needs a way to tell oniguruma that the current line is the first, middle or last line of the document so VSCode instead replaces VSCode attempts something similar with replacing the |
That code you referenced is pretty bad, and hard to understand. It seems to be replacing If you look at tokenizeString.ts you'll see this seems to be related to using Oniguruma options I would skip reproducing any of this. I don't think it's needed, since TextMate grammars aren't able to set these Oniguruma options anyway.
Like The distinction in my last sentence is subtle but important. It's relevant after zero-length matches, where the read-head advance will make the "end of the previous match" one character prior to the start of the match attempt. JavaScript regexes have flag
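The zero-length-match case described above can be sketched with a small scanning loop. This is only an illustration: it assumes the JavaScript flag meant here is the sticky flag `y`, which anchors a match at `lastIndex`, and the `scan` function and `Step` type are hypothetical names rather than anything from vscode-textmate or Oniguruma-To-ES. The loop records both "start of the match attempt" and "end of the previous match" so the two can be compared.

```typescript
interface Step {
  attemptStart: number; // index where this match attempt begins
  prevMatchEnd: number; // index where the previous match ended
  matched: string;
}

// Hypothetical helper: repeatedly match `source` with the sticky flag,
// bumping the read head by one after every zero-length match.
function scan(source: string, input: string): Step[] {
  const re = new RegExp(source, "y"); // sticky: must match exactly at lastIndex
  const steps: Step[] = [];
  let pos = 0;
  let prevMatchEnd = 0;
  while (pos <= input.length) {
    re.lastIndex = pos;
    const m = re.exec(input);
    if (m === null) break;
    steps.push({ attemptStart: pos, prevMatchEnd, matched: m[0] });
    prevMatchEnd = pos + m[0].length;
    // Advance past a zero-length match so the loop cannot stall.
    pos = m[0].length === 0 ? prevMatchEnd + 1 : prevMatchEnd;
  }
  return steps;
}
```

Running `scan("a*", "ab")` gives three steps. In the last one the attempt starts at index 2 while the previous match ended at index 1, which is exactly the one-character gap the comment refers to: an anchor defined as "end of the previous match" and one defined as "start of the match attempt" disagree at that point.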
I can't make sense of it, after a quick look. If that actually applies in the standard case, it's inappropriately changing the definition of I'm not sure if that's actually being used. But again, I think you can just ignore it. I say that because Oniguruma-To-ES uses a correct definition when it translates |
Whoa, @RedCMD dropping knowledge of the actual history of these changes! 😁 (I started replying before he posted, and I'm not super knowledgeable like he is about TextMate grammars and vscode-textmate.) I still think you can skip reproducing these hacks if you're planning to use Oniguruma-To-ES with a JS RegExp engine. I'm not deeply knowledgeable of how Shiki interacts with vscode-textmate when it uses Oniguruma-To-ES, but I do know that Oniguruma-To-ES's translations of Aside: If you're interested in understanding the precise meaning of particular Oniguruma features (and you have a good understanding of JS regexes), you can try them on the Oniguruma-To-ES demo page. Of course, if it's a complex feature you might only be seeing the translation in a particular context, but playing around with different patterns there can give you a good idea. |
Previous thread (#1787) was focused on tree-sitter but it might turn out to not be the best choice due to size (some syntaxes are 90MB+) and slowness to load (which is planned to be fixed eventually).
An alternative would be to write a TextMate grammar parser so we can reuse all of the VS Code syntaxes and themes. That ecosystem has a lot of momentum, and the parser should not be too hard to write. I have a branch with just the serde structs in https://github.com/getzola/giallo/tree/tm
We can use https://github.com/rust-onig/rust-onig for the regex library, like syntect is doing.
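For readers unfamiliar with the format, here is a rough sketch of the shape a TextMate grammar takes once deserialized, written in TypeScript rather than as serde structs. It does not reproduce the branch linked above. The field names (`scopeName`, `patterns`, `repository`, `match`, `begin`, `end`, `while`, `include`) come from the TextMate grammar format, while `RawRule`, `RawGrammar`, `collectRegexSources`, and the toy grammar are hypothetical names for illustration. The helper gathers every regex source in a grammar, which is the list a candidate regex engine would have to compile.

```typescript
// Simplified, assumed shapes; real grammars carry more fields (captures, etc.).
interface RawRule {
  name?: string;
  match?: string;
  begin?: string;
  end?: string;
  while?: string;
  include?: string;
  patterns?: RawRule[];
  repository?: { [key: string]: RawRule };
}

interface RawGrammar {
  scopeName: string;
  patterns: RawRule[];
  repository?: { [key: string]: RawRule };
}

// Walk the top-level patterns, then the repository, collecting regex sources.
function collectRegexSources(grammar: RawGrammar): string[] {
  const out: string[] = [];
  const visit = (rule: RawRule): void => {
    for (const src of [rule.match, rule.begin, rule.end, rule.while]) {
      if (src !== undefined) out.push(src);
    }
    (rule.patterns || []).forEach(visit);
    const nested = rule.repository || {};
    Object.keys(nested).forEach((key) => visit(nested[key]));
  };
  grammar.patterns.forEach(visit);
  const repo = grammar.repository || {};
  Object.keys(repo).forEach((key) => visit(repo[key]));
  return out;
}

// Toy grammar, invented for this example.
const toyGrammar: RawGrammar = {
  scopeName: "source.toy",
  patterns: [
    { include: "#string" },
    { name: "keyword.control.toy", match: "\\b(if|else)\\b" },
  ],
  repository: {
    string: {
      name: "string.quoted.double.toy",
      begin: "\"",
      end: "\"",
      patterns: [{ name: "constant.character.escape.toy", match: "\\\\." }],
    },
  },
};
```

Feeding each collected source to the chosen regex library is a quick way to measure how many real-world patterns it accepts before writing any tokenizer logic.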