Hugging Face's Thomas Wolf and collaborators launch CADGenBench to benchmark how AI models generate and edit functional 3D CAD parts

Early baselines show GPT-5.5 Pro leading with a 0.387 score

Original post

Thomas Wolf@Thom_Wolf

AI is moving beyond text, images, and code.

Engineering artifacts are becoming a new class of model outputs and evaluating them requires different tools than we use for text, code, or images.

Today we're excited to release CADGenBench, a benchmark for CAD generation and editing.

- Given an engineering drawing → generate a valid 3D CAD model
- Given a STEP file + change request → edit it correctly

The benchmark is tool-agnostic: any CAD stack works (Fusion, Onshape, build123d, SolidWorks, etc.). Submissions are simply STEP files.

Models are scored on:
* geometric accuracy
* topology correctness
* interface compatibility
* CAD validity

The benchmark is open, the ground truth is private, and the leaderboard is live.

Since CAD evaluation is surprisingly subtle, here's how the metrics work 🧵

Michael Rabinovich@MikushRab

Introducing CADGenBench: measure how well AI systems produce engineering-grade 3D parts!

While current models can generate 3D parts, they are far from precise enough to build functional parts. We built a benchmark to systematically measure their capabilities on two tasks:

1. Generation from an engineering drawing of a part
2. Editing: given an existing STEP file and a requested change

The benchmark is tool-agnostic. It makes no assumptions about how you build the model. You can vary the LLM, and you can vary the environment. Use build123d, Onshape, Autodesk, or a pipeline that uses no LLM at all. We open-sourced the scoring engine and a reference baseline built on build123d.

A collaboration between Hugging Face and @mecadoinc!

Submission space: https://huggingface.co/spaces/HuggingAI4Engineering/CADGenBench
Code repository: https://github.com/huggingface/cadgenbench

10:01 AM · Jun 8, 2026 · 1.1K Views

Posts from X

Thomas Wolf@Thom_Wolf

0/ CAD validity

Before any comparison with the ground truth, we first check that the submitted STEP file represents a valid CAD model.

The model must be:
✓ well-formed
✓ watertight
✓ meshable
✓ manifold

If any of these checks fail, the overall score is zero.
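
As a rough illustration of this kind of gate, the sketch below tests a triangle mesh for closedness by counting how many faces share each edge. It is a toy under stated assumptions: the thread does not describe how the CADGenBench scorer runs its checks on STEP files, and `edge_face_counts`, `is_closed_manifold`, and `gated_score` are hypothetical names. The edge test covers only watertightness and edge-manifoldness, not well-formedness or meshability.

```python
from collections import Counter


def edge_face_counts(faces):
    """Count how many triangles use each undirected edge."""
    counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts


def is_closed_manifold(faces):
    """True if every edge is shared by exactly two triangles.

    An edge used once marks a gap (not watertight); an edge used three or
    more times marks a non-manifold junction.
    """
    counts = edge_face_counts(faces)
    return bool(counts) and all(n == 2 for n in counts.values())


def gated_score(faces, raw_score):
    """Hypothetical gate: an invalid model scores zero whatever raw_score is."""
    return raw_score if is_closed_manifold(faces) else 0.0


# A tetrahedron is the smallest closed triangle mesh.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
# Dropping one face leaves three edges that belong to a single triangle.
open_shell = tetrahedron[:3]

print(gated_score(tetrahedron, 0.8))
print(gated_score(open_shell, 0.8))
```

The gate is multiplicative in spirit: a model that fails it gets no partial credit, which matches the "overall score is zero" rule stated above.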

Thomas Wolf@Thom_Wolf

4/ Why three metrics?

The metrics are designed to capture different classes of errors.

Shape similarity measures overall geometry.

Interface match measures whether mating features are present in the correct location and size.

Topology match measures whether the fundamental structure of the part is correct.

None of these metrics can fully replace the others.

Thomas Wolf@Thom_Wolf

3/ Topology match

Geometry is not the whole story.

Two parts can look similar while differing in their fundamental topology:
• 4 holes instead of 2
• 2 disconnected pieces instead of 1
• a missing enclosed cavity

To measure this, we compare the Betti numbers of the generated and reference parts:
• b₀ = number of connected components
• b₁ = number of through-holes / handles
• b₂ = number of enclosed cavities

These quantities are independent of the CAD system used to create the model.
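
One way to obtain Betti numbers for a valid solid, sketched below, is from the Euler characteristic of each closed boundary shell. The sketch assumes a solid embedded in 3D whose boundary shells are closed, orientable surfaces meshed with disk-like faces. In that case b₀ is the number of outer shells, b₂ the number of cavity shells, and b₁ the total genus of all shells. The thread does not say how CADGenBench computes these values, and `genus`, `betti_numbers`, and `topology_matches` are hypothetical names.

```python
def genus(num_vertices, num_edges, num_faces):
    """Genus g of a closed orientable surface, from chi = V - E + F = 2 - 2g."""
    chi = num_vertices - num_edges + num_faces
    if chi % 2 != 0 or chi > 2:
        raise ValueError("counts do not describe a closed orientable surface")
    return (2 - chi) // 2


def betti_numbers(outer_shells, cavity_shells):
    """Betti numbers (b0, b1, b2) of a solid from its boundary shells.

    Each shell is a (V, E, F) tuple for one closed boundary surface.
    Assumes one outer shell per connected piece of the solid.
    """
    b0 = len(outer_shells)
    b1 = sum(genus(*shell) for shell in outer_shells + cavity_shells)
    b2 = len(cavity_shells)
    return b0, b1, b2


def topology_matches(generated, reference):
    """Exact comparison of two (b0, b1, b2) triples."""
    return generated == reference


cube = (8, 12, 6)      # plain block: chi = 2, genus 0
frame = (16, 32, 16)   # block with one through-hole: chi = 0, genus 1

solid_block = betti_numbers([cube], [])
block_with_hole = betti_numbers([frame], [])
hollow_block = betti_numbers([cube], [cube])

print(solid_block, block_with_hole, hollow_block)
print(topology_matches(solid_block, block_with_hole))
```

Because only vertex, edge, and face counts enter the calculation, the result does not depend on which CAD tool produced the geometry.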

Mecado@mecadoinc

@MikushRab Can LLMs understand a drawing? Can they infer design intent? (The answer is... maybe not yet)

Thomas Wolf@Thom_Wolf

1/ Shape similarity

The first question is whether the generated part has the correct overall geometry.

We combine two complementary measures:
• Surface Distance F1 → are the surfaces located where they should be?
• Volume IoU → does the part occupy the same volume?

Together, they measure agreement in both surface placement and occupied material.
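
The two measures can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the parts have already been voxelised (for IoU) and sampled into surface points (for F1). The tolerance, the sampling, and the function names `volume_iou` and `surface_f1` are assumptions, and the thread does not say how CADGenBench combines the two values.

```python
import math


def volume_iou(voxels_a, voxels_b):
    """Intersection over union of two sets of occupied voxel indices."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def surface_f1(points_gen, points_ref, tol):
    """F1 over surface samples at distance tolerance tol.

    Precision: share of generated points within tol of some reference point.
    Recall: share of reference points within tol of some generated point.
    """
    def fraction_near(src, dst):
        if not src:
            return 0.0
        hits = sum(1 for p in src if any(math.dist(p, q) <= tol for q in dst))
        return hits / len(src)

    precision = fraction_near(points_gen, points_ref)
    recall = fraction_near(points_ref, points_gen)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reference: a 2 x 2 x 2 block of voxels; generated: same block shifted by one in x.
ref_voxels = {(x, y, z) for x in range(2) for y in range(2) for z in range(2)}
gen_voxels = {(x + 1, y, z) for (x, y, z) in ref_voxels}
iou = volume_iou(gen_voxels, ref_voxels)

# Four surface samples; two of the generated ones sit far from the reference.
ref_points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
gen_points = [(0, 0, 0), (1, 0, 0.05), (2, 0, 1), (3, 0, 1)]
f1 = surface_f1(gen_points, ref_points, tol=0.1)

print(round(iou, 3), round(f1, 3))
```

The brute-force nearest-point search is quadratic in the number of samples; a real scorer would more likely use a spatial index.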

Colin@clonkius

@MikushRab @grok what are some clever use cases for this

Thomas Wolf@Thom_Wolf

2/ Interface match

Many CAD features exist to interface with other parts.

We therefore evaluate mating features such as:
• bolt holes
• slots
• bosses
• pockets

Each feature is represented as a region that should either remain empty (keep-out) or contain material (keep-in).

This captures errors that may have little impact on overall shape similarity but would prevent a part from assembling correctly.
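
A keep-in / keep-out check of this kind could look roughly like the sketch below. It assumes each region is an axis-aligned box and that the part exposes a point-in-solid test. The box shape, the sampling density, the 0.95 threshold, and the names `box_samples`, `region_satisfied`, and `plate_with_hole` are all illustrative assumptions, not details taken from the benchmark.

```python
from itertools import product


def box_samples(lo, hi, n=5):
    """n**3 evenly spaced sample points strictly inside an axis-aligned box."""
    axes = [
        [l + (h - l) * (i + 0.5) / n for i in range(n)]
        for l, h in zip(lo, hi)
    ]
    return list(product(*axes))


def region_satisfied(is_solid, lo, hi, kind, min_fraction=0.95):
    """Check one interface region against a part's occupancy test.

    kind is "keep-in" (region should contain material) or
    "keep-out" (region should remain empty).
    """
    samples = box_samples(lo, hi)
    filled = sum(1 for p in samples if is_solid(p)) / len(samples)
    if kind == "keep-in":
        return filled >= min_fraction
    if kind == "keep-out":
        return (1.0 - filled) >= min_fraction
    raise ValueError(f"unknown region kind: {kind}")


def plate_with_hole(p):
    """Toy part: a 10 x 10 x 2 plate with a square 2 x 2 hole centred at (5, 5)."""
    x, y, z = p
    in_plate = 0 <= x <= 10 and 0 <= y <= 10 and 0 <= z <= 2
    in_hole = 4 <= x <= 6 and 4 <= y <= 6
    return in_plate and not in_hole


# The hole is where the reference expects it, so this keep-out region is empty.
hole_ok = region_satisfied(plate_with_hole, (4, 4, 0), (6, 6, 2), "keep-out")
# A keep-out region placed over solid plate material fails.
misplaced = region_satisfied(plate_with_hole, (1, 1, 0), (3, 3, 2), "keep-out")

print(hole_ok, misplaced)
```

A hole shifted by a few millimetres barely changes volume overlap for the whole part, but it flips a region check like this one from pass to fail.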

Grok@grok

Here are some clever use cases for CADGenBench:

- Objective leaderboard to track real progress in AI CAD toward production-grade precision (current top ~0.39 shows the gap).
- Fair, tool-agnostic comparison of LLMs/agents across build123d, Onshape, etc. on drawing-to-part and edit tasks.
- Vet AI tools for industry: quantify if generated parts match drawings in shape, topology, and critical mating interfaces.
- Reward signal for training specialized CAD agents or multimodal models focused on functional engineering output.
- Accelerate open research by standardizing evaluation on real mechanical drawings, similar to how benchmarks advanced other AI domains.

It prioritizes manufacturable accuracy over visuals.

2h6