Today we release IFStruct, a new benchmark to measure how well models generate structured outputs.
A 350M model trained on it outperforms models more than 10x its size.
🧵

LFM2.5-350M starts at 21.10% and reaches 44.90% after training, ahead of Qwen3.5-4B at 36.25% and granite-4.0-h-tiny at 38.75%. Frontier models score near 100%.
(4/n)

IFStruct is particularly well-suited for teams building production workflows that depend on structured output and for anyone looking to train smaller models on the task via RL. IFStruct is available now.
> Benchmark: https://github.com/Liquid4All/ifstruct
> Dataset: http://huggingface.co/datasets/LiquidAI/ifstruct-v1.0
> Blog: https://www.liquid.ai/blog/ifstruct-v1.0
Getting LLMs to output valid JSON is one of the most common production tasks.
But most benchmarks can't tell if your model actually does it well.
Here's how the team at @LiquidAI built IFStruct to measure exactly this (and how they trained a 350M model to beat models 10x its size). 🧵

Structured output is one of the most common things we ask models to do, and it is still where they break.
Most benchmarks test against a clean, finalized schema. Real requests use plain language, paste an annotated example, switch formats halfway, and slip in constraints like "no code fence" or "no commentary."
(2/n)

IFStruct presents requirements in all of those forms: chat requests, bullet lists with explicit paths, raw JSON Schema, annotated JSON or YAML, ASCII tables. Half are rewritten into natural prose. Scoring is binary: an output passes only if every field, type, enum, bound, and count is right, with no invented keys.
The same generator that builds the eval builds training data just as easily. The same yes/no check that scores the benchmark can train the model.
(3/n)
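
To make the "same yes/no check scores the benchmark and trains the model" idea concrete, here is a minimal sketch in Python. It is not IFStruct's validator: the `SPEC` format, the field names, and the `passes` and `reward` functions are all hypothetical, invented for illustration. It only shows how one binary check over types, enums, bounds, counts, and extra keys can double as an RL reward.

```python
import json
from typing import Any

# Hypothetical mini-spec format (an assumption, not IFStruct's): field -> rules.
SPEC = {
    "status": {"type": str, "enum": ["open", "closed"]},
    "priority": {"type": int, "min": 1, "max": 5},
    "tags": {"type": list, "max_items": 3},
}


def passes(output: Any, spec: dict) -> bool:
    """Return True only if every constraint holds and no extra keys appear."""
    if not isinstance(output, dict):
        return False
    if set(output) != set(spec):  # a missing key or an invented key fails
        return False
    for key, rules in spec.items():
        value = output[key]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if isinstance(value, bool) and rules["type"] is not bool:
            return False
        if not isinstance(value, rules["type"]):
            return False
        if "enum" in rules and value not in rules["enum"]:
            return False
        if "min" in rules and value < rules["min"]:
            return False
        if "max" in rules and value > rules["max"]:
            return False
        if "max_items" in rules and len(value) > rules["max_items"]:
            return False
    return True


def reward(raw_text: str, spec: dict) -> float:
    """Binary reward: 1.0 if the raw completion parses as JSON and passes, else 0.0."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        return 0.0  # code fences or commentary around the JSON also land here
    return 1.0 if passes(parsed, spec) else 0.0
```

Because the reward is computed on the raw completion, a response wrapped in a code fence or padded with commentary scores 0.0 under this sketch, which mirrors constraints like "no code fence" mentioned earlier in the thread.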

The results:
LFM2.5-350M (base): 21.10%
LFM2.5-350M (+ RL): 44.90%
Qwen3.5-4B: 36.25%
granite-4.0-h-tiny: 38.75%
After RL training on a held-out set, the 350M model beats models 10x its size.
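
For readers who want the margins spelled out, the arithmetic behind those four numbers can be sketched as below. The scores are the ones quoted in the thread. The parameter counts are an assumption read off the model names (350M and 4B), not figures from the source, and granite-4.0-h-tiny is left out of the size comparison because its name gives no count.

```python
# Pass rates in percent, as quoted in the thread.
scores = {
    "LFM2.5-350M (base)": 21.10,
    "LFM2.5-350M (+ RL)": 44.90,
    "Qwen3.5-4B": 36.25,
    "granite-4.0-h-tiny": 38.75,
}

trained = scores["LFM2.5-350M (+ RL)"]

# Absolute gain from RL training, in percentage points.
rl_gain = round(trained - scores["LFM2.5-350M (base)"], 2)

# Lead of the trained 350M model over each comparison model, in points.
leads = {
    name: round(trained - scores[name], 2)
    for name in ("Qwen3.5-4B", "granite-4.0-h-tiny")
}

# Assumed parameter counts, taken from the model names only.
size_ratio = round(4_000_000_000 / 350_000_000, 1)

print(f"RL gain: {rl_gain} points")
print(f"Leads: {leads}")
print(f"Qwen3.5-4B is about {size_ratio}x the size of the 350M model")
```

Under these assumptions the RL gain is about 23.8 points, the leads are roughly 8.65 and 6.15 points, and the size ratio against Qwen3.5-4B is about 11.4x.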

Most evals do one of two things:
> force the model's output using hard rules
> score content quality alongside format.
The gap IFStruct fills is to answer the question:
"Can a model follow a schema when a user asks for it in plain language?"

The dataset:
Schema requirements are presented in 6 styles (because that's how users actually write them):
• Raw JSON Schema
• Annotated examples
• Conversational chat requests
• Flat path glossaries with field types
• Bullet points with explicit field paths
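
As a rough illustration of what "the same requirement in several styles" could look like, the sketch below renders one tiny requirement three ways. Everything here is hypothetical: the `FIELDS` list, the renderer functions, and the exact text layouts are assumptions for illustration and are not taken from the IFStruct generator or dataset.

```python
import json

# Hypothetical requirement: (field path, JSON type, description).
FIELDS = [
    ("vendor_name", "string", "name of the vendor"),
    ("invoice_total_usd", "number", "invoice total in US dollars"),
]


def as_json_schema(fields) -> str:
    """Render the requirement as raw JSON Schema text."""
    schema = {
        "type": "object",
        "properties": {p: {"type": t, "description": d} for p, t, d in fields},
        "required": [p for p, _, _ in fields],
        "additionalProperties": False,  # no invented keys allowed
    }
    return json.dumps(schema, indent=2)


def as_bullets(fields) -> str:
    """Render the requirement as bullet points with explicit field paths."""
    return "\n".join(f"- {p} ({t}): {d}" for p, t, d in fields)


def as_path_glossary(fields) -> str:
    """Render the requirement as a flat path glossary with field types."""
    return "\n".join(f"{p}: {t}" for p, t, _ in fields)


for render in (as_json_schema, as_bullets, as_path_glossary):
    print(render(FIELDS), end="\n\n")
```

All three outputs describe the same two fields, so a single validator can score a response no matter which style the prompt used.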

The validator:
Scoring is binary: Pass only if every constraint is satisfied.
For example:
{
  "vendor_name": "Acme",
  "invoice_total_usd": 1200,
  "paid_by_bank_transfer": true ← FAIL
}
This fails because the schema requires the key paid_by_bank_transfer_allowed, not paid_by_bank_transfer. (No partial credit.)
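
The key mismatch in that example can be reproduced with a short Python sketch. This is not the IFStruct validator: `check_keys` is a hypothetical helper, and the set of required keys is an assumption (the thread only confirms that `paid_by_bank_transfer_allowed` was required; the other two are presumed from the example).

```python
# Assumed required keys; only the last one is confirmed by the thread.
REQUIRED = {"vendor_name", "invoice_total_usd", "paid_by_bank_transfer_allowed"}

# The model output from the example above.
output = {
    "vendor_name": "Acme",
    "invoice_total_usd": 1200,
    "paid_by_bank_transfer": True,
}


def check_keys(candidate: dict, required: set[str]) -> tuple[bool, set[str], set[str]]:
    """Return (passed, missing keys, invented keys) for an exact key match."""
    present = set(candidate)
    missing = required - present
    invented = present - required
    return (not missing and not invented), missing, invented


ok, missing, invented = check_keys(output, REQUIRED)
print("PASS" if ok else "FAIL", "| missing:", missing, "| invented:", invented)
```

Two of the three keys are correct, yet the result is still a plain fail: the near-miss key counts both as a missing required key and as an invented one, which is what "no partial credit" means in practice.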

@liquidai @dan_sci_phil

Benchmark and dataset are open source.
Blog: https://www.liquid.ai/blog/ifstruct-v1.0 GitHub: https://github.com/Liquid4All/ifstruct Dataset: https://huggingface.co/datasets/LiquidAI/ifstruct-v1.0

@liquidai eval repo on github seems to be private!

@liquidai 👀 https://arxiv.org/abs/2408.11061

@liquidai @Presidentlin