expose max_recursion_depth in build_regex_from_schema #181

Open: wants to merge 1 commit into base: main


dariogod

Currently, the max_recursion_depth argument is not exposed in Python, so the default value of 3 is always used.
This PR exposes the argument so that complex JSON schemas can be handled.

Contributor

torymur commented Feb 21, 2025

Hi @dariogod

Why would you want to be able to set it? Could you share your use-case, please?

Author

dariogod commented Feb 27, 2025

Hello @torymur

We're using outlines to perform NER on documents. One of our use cases is to extract (nested) personal information.

The (simplified) JSON schema, expressed as Pydantic models, looks something like this:

from typing import Literal

from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Human(BaseModel):
    first_name: str
    last_name: str
    address: Address

class Enterprise(BaseModel):
    name: str
    address: Address

class Owner(BaseModel):
    active: bool
    type: Literal["human", "enterprise"]
    relation: Human | Enterprise

class DocumentExtraction(BaseModel):
    document_number: int
    owners: list[Owner]

We then use the following code snippet:

regex_str = build_regex_from_schema(
    schema_str, 
    whitespace_pattern=r"[ \n]{0,6}", 
    max_recursion_depth=5  # We need to increase this from default (=3)
)

regex_guide = RegexGuide.from_regex(regex_str, tokenizer)

logits_processor = GuideLogitsProcessor(
    tokenizer=self.outlines_tokenizer,
    guide=regex_guide,
)

We then use logits_processor for generation with vLLM.

--
Currently there is no way to increase the default. This leads to the Address object being skipped in the regex string, and therefore in generation output.
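
To make the nesting concrete, here is a minimal sketch that counts the longest chain of "$ref" hops in a schema. The SCHEMA dict is a hand-written approximation of what Pydantic would emit for the models above, and max_ref_depth is a hypothetical helper, not part of outlines or outlines-core. It does not necessarily count depth the same way outlines-core does; it only illustrates how deep the references go.

```python
# Hand-written approximation of the JSON schema for the models above.
# Assumption: only the "$ref" structure matters for this illustration.
STRING = {"type": "string"}
SCHEMA = {
    "$defs": {
        "Address": {
            "type": "object",
            "properties": {
                "street": STRING, "city": STRING,
                "state": STRING, "zip_code": STRING,
            },
        },
        "Human": {
            "type": "object",
            "properties": {
                "first_name": STRING, "last_name": STRING,
                "address": {"$ref": "#/$defs/Address"},
            },
        },
        "Enterprise": {
            "type": "object",
            "properties": {
                "name": STRING,
                "address": {"$ref": "#/$defs/Address"},
            },
        },
        "Owner": {
            "type": "object",
            "properties": {
                "active": {"type": "boolean"},
                "type": {"enum": ["human", "enterprise"]},
                "relation": {"anyOf": [
                    {"$ref": "#/$defs/Human"},
                    {"$ref": "#/$defs/Enterprise"},
                ]},
            },
        },
    },
    "type": "object",
    "properties": {
        "document_number": {"type": "integer"},
        "owners": {"type": "array", "items": {"$ref": "#/$defs/Owner"}},
    },
}


def max_ref_depth(node, defs, stack=()):
    """Return the longest chain of nested "$ref" hops reachable from node."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if ref is not None:
            name = ref.rsplit("/", 1)[-1]
            if name in stack:
                # Recursive schema: count the hop, but do not follow it again.
                return 1
            return 1 + max_ref_depth(defs[name], defs, stack + (name,))
        children = (v for k, v in node.items() if k != "$defs")
        return max((max_ref_depth(v, defs, stack) for v in children), default=0)
    if isinstance(node, list):
        return max((max_ref_depth(v, defs, stack) for v in node), default=0)
    return 0


depth = max_ref_depth(SCHEMA, SCHEMA["$defs"])
```

For this schema the helper reports 3 hops (Owner, then Human or Enterprise, then Address), so Address sits at the deepest level of the reference chain.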

Contributor

torymur commented Mar 11, 2025

Thank you for the example. I understand that in these cases deeper recursion makes sense.

Exposing this particular argument was never a problem. The problem with recursion depth is described here:
https://docs.rs/outlines-core/latest/outlines_core/json_schema/index.html#recursion
https://github.com/dottxt-ai/outlines-core/blob/main/src/json_schema/parsing.rs#L18-L25

Unbounded recursion can easily spiral into a combinatorial explosion, so giving users this flexibility requires protection against huge regex sizes, with a very clear error message.

I imagine this mechanism could be done in many different ways:

  • some kind of regex size estimation mechanism that raises a clear error if the regex would grow beyond a safe threshold
  • allow deeper recursion in restrictive cases, but keep hard caps for unbounded cases (the challenge is in estimating complexity and differentiating between the two)
  • a more sophisticated recursion mechanism that tracks a depth level per parent, not just a simple +1 on every parent
  • etc. (feel free to suggest a better way)

It will have to be thoroughly tested, particularly with unrestricted examples.
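
As a rough illustration of the first idea, the sketch below estimates how many times a self-referencing object gets inlined when it is unrolled to a given depth, and raises a descriptive error past a budget. Everything here is hypothetical: RegexTooLargeError, estimated_expansions, check_regex_budget, and the chars_per_copy and max_chars parameters are names made up for this example and are not part of outlines-core. A real estimate would need to inspect the schema itself.

```python
class RegexTooLargeError(ValueError):
    """Hypothetical error type, not part of outlines-core."""


def estimated_expansions(branching: int, depth: int) -> int:
    """Count inlined copies of a self-referencing object.

    `branching` is the number of recursive fields per object and `depth`
    is how many levels the recursion is unrolled.
    """
    # Level 0 holds 1 copy, level 1 holds `branching` copies, and so on.
    return sum(branching ** level for level in range(depth + 1))


def check_regex_budget(branching: int, depth: int, chars_per_copy: int,
                       max_chars: int = 1_000_000) -> int:
    """Return the estimated regex size, or raise if it exceeds the budget."""
    copies = estimated_expansions(branching, depth)
    estimate = copies * chars_per_copy
    if estimate > max_chars:
        raise RegexTooLargeError(
            f"max_recursion_depth={depth} would produce roughly "
            f"{estimate:,} regex characters ({copies:,} inlined copies), "
            f"above the limit of {max_chars:,}. "
            "Lower the depth or simplify the schema."
        )
    return estimate


# A linked-list-like schema (one recursive field) grows linearly with depth.
small = check_regex_budget(branching=1, depth=5, chars_per_copy=200)

# A tree-like schema (four recursive fields) grows exponentially with depth.
try:
    check_regex_budget(branching=4, depth=10, chars_per_copy=200)
    rejected = False
except RegexTooLargeError:
    rejected = True
```

The same depth value is harmless for the single-field case and far over budget for the four-field case, which is why a cap based on estimated size may be more useful than a cap on depth alone.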
