Dynamic schema creation #1383

Open
cpfiffer opened this issue Jan 18, 2025 · 2 comments

@cpfiffer
Contributor

cpfiffer commented Jan 18, 2025

I recently noticed GenSON, a library for generating JSON schemas dynamically.

I think we should evaluate ways of providing tools for dynamic JSON schema generation, possibly by simply showcasing how to use GenSON or a similar library. If anyone has other ideas, they're very welcome!

Typically, the way we specify schemas in Outlines is to use Pydantic, like so:

class CalendarEvent(BaseModel):
    description: str
    date: datetime

For well-structured programs, this is a great idea. A simple, clean Pydantic interface enforces good practice: users must provide a Pydantic model that can be used in a type-safe way everywhere in their application.

However, there are many cases where this can be problematic. Pydantic can make it difficult to program with Outlines when the schema must be modified in-place.

My simplest example is this:

class CalendarEvent(BaseModel):
    id: str  # Note the addition of the id field
    description: str
    date: datetime

This ID is stored in my database, or uniquely generated before the model generates the object.

When you give this to the model, it will make up an ID that may not be unique. What I would like to do instead is:

class CalendarEvent(BaseModel):
    id: Literal['abc']
    description: str
    date: datetime

This will force the model to use the ID I provide, and I won't have to do any post-generation clean up to enforce a unique UUID.
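
A minimal sketch of building that pinned model at runtime, assuming Pydantic v2. `pin_id` and `PinnedCalendarEvent` are hypothetical names used only for illustration; they are not part of Outlines or Pydantic.

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, ValidationError, create_model


class CalendarEvent(BaseModel):
    id: str
    description: str
    date: datetime


def pin_id(event_id: str) -> type[BaseModel]:
    """Return a CalendarEvent subclass whose id only accepts event_id.

    Hypothetical helper for illustration.
    """
    return create_model(
        "PinnedCalendarEvent",
        __base__=CalendarEvent,
        # Literal is parameterized at runtime with the ID we already know.
        id=(Literal[event_id], ...),
    )


PinnedEvent = pin_id("abc")
id_schema = PinnedEvent.model_json_schema()["properties"]["id"]
print(id_schema)  # the id property is constrained to the single value "abc"
```

This works, but every pinned ID yields a new class, which is part of why the approach feels clunky.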

This is difficult to do, currently. I wrote an example of how to dynamically create Pydantic models, but it is quite clunky and does not have a convenient user interface. I've included an example of this in the detail block at the end of the issue.

Other examples

Here are a few other cases where dynamic schema creation might be a useful user interface feature.

  1. Function calling. Currently we do gnarly regular expressions, or set up mega-function calling objects. Here we can flexibly define functions that the model may choose from during runtime.
  2. General runtime usage. I often run into cases where I need to change the schema in standard control flow, such as changing enums within larger classes.
  3. Flexibility. Working with Pydantic dynamically is a pain in general. Pydantic is great when you have a fixed structure, but often you may wish to provide flexible schemas conditional on a model response. Imagine that two disconnected systems send each other JSON: if you build a schema from the message Alice sends to Bob, Bob can just replicate that schema and kick it back to Alice in a format Alice understands.
  4. Simplicity. You don't always need internal Python objects that you get from Pydantic. I would often be happy with just a dict for throwaways, especially when I don't want to have a gigantic models/ directory packed with tiny Pydantic classes.
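
As a rough illustration of the function-calling and plain-dict cases above, here is a sketch that assembles a JSON schema at runtime using only the standard library. `function_call_schema` and the two example tools are hypothetical, and whether a given structured-generation backend accepts every keyword used here (`anyOf`, `const`, `additionalProperties`) is an assumption to check.

```python
import json


def function_call_schema(functions: dict[str, dict]) -> dict:
    """Build a JSON schema matching a call to one of the known functions.

    `functions` maps a function name to the JSON schema of its arguments.
    """
    return {
        "anyOf": [
            {
                "type": "object",
                "properties": {
                    "name": {"const": name},
                    "arguments": arguments_schema,
                },
                "required": ["name", "arguments"],
                "additionalProperties": False,
            }
            for name, arguments_schema in functions.items()
        ]
    }


# Hypothetical tools, chosen at runtime.
available = {
    "get_weather": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    "add_event": {
        "type": "object",
        "properties": {"description": {"type": "string"}},
        "required": ["description"],
    },
}

schema = function_call_schema(available)
schema_json = json.dumps(schema)  # a plain string, no Pydantic classes needed
```

Adding or removing a tool is just a dict update before the schema is rebuilt.
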
# imports
import outlines
from pydantic import BaseModel, Field, create_model
from rich import print
from transformers import AutoTokenizer

# Initialize the model
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = outlines.models.transformers(
    model_name,
    device="auto",
)

# load tokenizer to apply chat template
tokenizer = AutoTokenizer.from_pretrained(model_name)

def template(prompt):
    templated = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True)
    return templated

class Task(BaseModel):
    task: str

def LimitedList(
        max_items: int = 5,
        min_items: int = 0,
    ):
    return create_model(
        "LimitedList",
        items=(list[Task], Field(max_length=max_items, min_length=min_items))
    )

# Create the dynamic model
limited_class = LimitedList(
    max_items=10,
    min_items=9
)

# Make a list generator function. Takes a prompt and 
# returns a list of tasks.
list_generator = outlines.generate.json(
    model,
    limited_class
)

prompt = f"""
I'm building a house.

Please provide a list of tasks 
that need to be completed.

Response format:
{limited_class.model_json_schema()}
"""

# Prompt the model
task_list = list_generator(template(prompt))
for idea in task_list.items:
    print(f"  - {idea.task}")
@rlouf
Member

rlouf commented Jan 18, 2025

This is a fairly simple integration, a good first issue.

@g-prz
Contributor

g-prz commented Jan 22, 2025

Hey 😁
Found it interesting and worked quickly on it, down for any comments! 🤓

rlouf pushed a commit that referenced this issue Jan 27, 2025
This PR aims at integrating support of the `genson` package (in
`generate.json`) to be able to use dynamic json schema generation as
proposed in #1383.

3 participants