vLLM guided generation with xGrammars has regressed after vLLM v0.6.5. Very slow TTFT. #156

Open
AlbertoCastelo opened this issue Jan 16, 2025 · 2 comments

Comments

@AlbertoCastelo

See issue I opened in vLLM

@Ubospica
Collaborator

Ubospica commented Jan 17, 2025

Hi @AlbertoCastelo, thanks for raising the issue! Could you provide the grammar you are using, so we can better find the problem?

@AlbertoCastelo
Author

@Ubospica sorry, I completely forgot that I hadn't answered here.

I cannot share the full grammar, but I've reproduced the issue with the example below. This is the intended structure of my response (I know it's not ideal, but I cannot change it at this point):

Preamble with free text (anything except the character sequence "<|start|>")
<|start|>
generate some structured response
<|end|>

Issue

Take a look at the definition of preamble. Is there a better way to avoid a sequence of characters?

Slow grammar

The preamble avoids the "<|" sequence:

root ::= message
message ::= preamble "<|start|>\n" structured-content "\n<|end|>"

preamble ::= ([^<] | "<" [^|])*
structured-content ::= ...

Faster grammar

The preamble only avoids the character "<" and treats it as the signal that the <|start|> block should begin:

root ::= message
message ::= preamble "<|start|>\n" structured-content "\n<|end|>"

preamble ::= [^<]* 
structured-content ::= ...
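
To make the difference between the two `preamble` rules concrete, here is a minimal Python sketch that transliterates each rule into a regular expression and checks a few sample strings. The translation to Python's `re` module, the `accepts` helper, and the sample strings are assumptions for illustration only; this is not how xgrammar matches a grammar internally and it says nothing about speed.

```python
import re

# Regex transliterations of the two `preamble` rules (illustrative
# assumption; xgrammar does not match via Python's `re`).
SLOW_PREAMBLE = re.compile(r'(?:[^<]|<[^|])*')  # ([^<] | "<" [^|])*
FAST_PREAMBLE = re.compile(r'[^<]*')            # [^<]*


def accepts(rule: re.Pattern, text: str) -> bool:
    """Return True if the whole text can be derived from the rule."""
    return rule.fullmatch(text) is not None


# Hypothetical sample preambles.
samples = ["plain text", "if a < b then", "bad <| here", "sneaky <<| here"]
for s in samples:
    slow = accepts(SLOW_PREAMBLE, s)
    fast = accepts(FAST_PREAMBLE, s)
    print(f"{s!r:20} slow={slow} fast={fast}")
```

Under this reading, the faster rule rejects any preamble containing "<" (including harmless text such as "if a < b then"), while the slower rule still admits "<|" when it is preceded by another "<": the pair "<<" is consumed by `"<" [^|]` and the following "|" then matches `[^<]`. Whether that edge case matters depends on what the model actually produces.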

Questions

  • Is there a better way to avoid a sequence of characters?
  • Also, do you think having a lot of free text hurts performance? Do you have any benchmarks on this?
    • Intuitively, I think the more structured the response, the better, because decoding can skip some forward passes (emitting several tokens at once).
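
On the first question, one possible way to exclude the two-character sequence "<|" exactly is to let a run of "<" be followed only by a character that is neither "<" nor "|". As a grammar rule this could be written as `preamble ::= ([^<] | "<"+ [^<|])*`, assuming the grammar dialect supports `+` (otherwise `"<" "<"*`). This rule is a hypothetical suggestion, not something confirmed by the maintainers, and it has not been benchmarked against xgrammar. The Python sketch below only checks its correctness, using a regex transliteration and a brute-force comparison against a plain string test.

```python
import re
from itertools import product

# Hypothetical alternative rule: preamble ::= ([^<] | "<"+ [^<|])*
# Transliterated to a Python regex for checking only.
STRICT_PREAMBLE = re.compile(r'(?:[^<]|<+[^<|])*')


def accepts(text: str) -> bool:
    """Return True if the whole text can be derived from the rule."""
    return STRICT_PREAMBLE.fullmatch(text) is not None


def reference(text: str) -> bool:
    """Plain-string statement of the intent: no "<|" inside the preamble,
    and no dangling "<" at its end (so the next "<|" is the delimiter)."""
    return "<|" not in text and not text.endswith("<")


# Brute-force check over every string of length 0..5 on a tiny alphabet.
for n in range(6):
    for chars in product("a<|", repeat=n):
        s = "".join(chars)
        assert accepts(s) == reference(s), s
```

With this rule, "if a < b then" and "a << b" are accepted, while "bad <| here" and "sneaky <<| here" are rejected. Whether it compiles to something faster or slower than the original rule would need to be measured.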
