Thomas Wolf and Michael Rabinovich release CADGenBench to evaluate how AI models generate and edit functional 3D engineering parts

The benchmark evaluates Claude and GPT across 81 tasks

Original post: Lewis Tunstall #958

Introducing CADGenBench: measure how well AI systems produce engineering-grade 3D parts!

While current models can generate 3D parts, they are far from precise enough to build functional parts. We built a benchmark to systematically measure their capabilities on two tasks:

1. Generation from an engineering drawing of a part
2. Editing: given an existing STEP file and a requested change

The benchmark is tool-agnostic. It makes no assumptions about how you build the model. You can vary the LLM, and you can vary the environment. Use build123d, Onshape, Autodesk, or a model without an LLM entirely. We open sourced the scoring engine and a reference baseline on top of build123d.

A collaboration between Hugging Face and @mecadoinc!

Submission space: https://huggingface.co/spaces/HuggingAI4Engineering/CADGenBench
Code repository: https://github.com/huggingface/cadgenbench

8:01 AM · Jun 8, 2026 · 3.9K Views
Posts from X
Thomas Wolf @Thom_Wolf

AI is moving beyond text, images, and code.

Engineering artifacts are becoming a new class of model outputs and evaluating them requires different tools than we use for text, code, or images.

Today we're excited to release CADGenBench, a benchmark for CAD generation and editing.

- Given an engineering drawing → generate a valid 3D CAD model
- Given a STEP file + change request → edit it correctly

The benchmark is tool-agnostic: any CAD stack works (Fusion, Onshape, build123d, SolidWorks, etc.). Submissions are simply STEP files.

Models are scored on:
* geometric accuracy
* topology correctness
* interface compatibility
* CAD validity

The benchmark is open, the ground truth is private, and the leaderboard is live.

Since CAD evaluation is surprisingly subtle, here's how the metrics work 🧵

Thomas Wolf @Thom_Wolf

4/ Why three metrics?

The metrics are designed to capture different classes of errors.

Shape similarity measures overall geometry.

Interface match measures whether mating features are present in the correct location and size.

Topology match measures whether the fundamental structure of the part is correct.

None of these metrics can fully replace the others.

Thomas Wolf @Thom_Wolf

3/ Topology match

Geometry is not the whole story.

Two parts can look similar while differing in their fundamental topology:
• 4 holes instead of 2
• 2 disconnected pieces instead of 1
• a missing enclosed cavity

To measure this, we compare the Betti numbers of the generated and reference parts:
• b₀ = number of connected components
• b₁ = number of through-holes / handles
• b₂ = number of enclosed cavities

These quantities are independent of the CAD system used to create the model.
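
As a rough illustration of the Betti-number comparison described above, here is a minimal Python sketch. It assumes the part has already been voxelized into a set of integer (x, y, z) cells; that representation, and the function name `betti_numbers`, are assumptions of this example and not something the thread specifies about the benchmark's scoring engine. The sketch counts connected components (b₀), counts enclosed background pockets (b₂), and recovers b₁ from the Euler characteristic χ = b₀ − b₁ + b₂.

```python
from itertools import product

def betti_numbers(voxels):
    """Return (b0, b1, b2) for a solid given as a set of integer (x, y, z) voxels.

    Hypothetical helper: the voxel representation is an assumption of this sketch.
    """
    solid = set(voxels)
    if not solid:
        return (0, 0, 0)

    n26 = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    n6 = [d for d in n26 if sum(map(abs, d)) == 1]

    def count_components(cells, offsets):
        # Plain flood fill over the given neighbourhood.
        seen, count = set(), 0
        for start in cells:
            if start in seen:
                continue
            count += 1
            seen.add(start)
            stack = [start]
            while stack:
                x, y, z = stack.pop()
                for dx, dy, dz in offsets:
                    n = (x + dx, y + dy, z + dz)
                    if n in cells and n not in seen:
                        seen.add(n)
                        stack.append(n)
        return count

    # b0: solid voxels that touch at a face, edge, or corner are connected.
    b0 = count_components(solid, n26)

    # b2: background components inside a padded bounding box, minus the outside.
    lo = [min(v[i] for v in solid) - 1 for i in range(3)]
    hi = [max(v[i] for v in solid) + 1 for i in range(3)]
    box = product(*(range(lo[i], hi[i] + 1) for i in range(3)))
    background = {c for c in box if c not in solid}
    b2 = count_components(background, n6) - 1

    # Euler characteristic of the cubical complex: V - E + F - C.
    vertices, edges, faces = set(), set(), set()
    for x, y, z in solid:
        for dx, dy, dz in product((0, 1), repeat=3):
            vertices.add((x + dx, y + dy, z + dz))
        for axis in range(3):
            others = [i for i in range(3) if i != axis]
            for d1, d2 in product((0, 1), repeat=2):  # 4 edges along this axis
                c = [x, y, z]
                c[others[0]] += d1
                c[others[1]] += d2
                edges.add((axis, tuple(c)))
            for d in (0, 1):  # 2 faces normal to this axis
                c = [x, y, z]
                c[axis] += d
                faces.add((axis, tuple(c)))
    chi = len(vertices) - len(edges) + len(faces) - len(solid)

    b1 = b0 + b2 - chi
    return (b0, b1, b2)

# A flat 3x3 plate with its centre cell removed has one through-hole.
ring = {(x, y, 0) for x in range(3) for y in range(3)} - {(1, 1, 0)}
print(betti_numbers(ring))  # (1, 1, 0)
```

Comparing two parts then reduces to comparing the two tuples. How the benchmark turns a mismatch into a score is not stated in the thread.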

Thomas Wolf @Thom_Wolf

5/ The big picture

Benchmarks for language, code, images, and reasoning are now well established.

CAD generation and editing require different evaluation criteria.

CADGenBench is an attempt to make those criteria explicit, reproducible, and comparable across systems.

Leaderboard: https://huggingface.co/spaces/HuggingAI4Engineering/CADGenBench
Code: https://github.com/huggingface/cadgenbench

Thomas Wolf @Thom_Wolf

0/ CAD validity

Before any comparison with the ground truth, we first check that the submitted STEP file represents a valid CAD model.

The model must be:
✓ well-formed
✓ watertight
✓ meshable
✓ manifold

If any of these checks fail, the overall score is zero.
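
To make the gating rule concrete, here is a minimal Python sketch of one such check and of the zero-on-failure behaviour. It assumes the STEP file has already been tessellated into triangles given as vertex-index triples, and it tests only one common criterion (every edge shared by exactly two triangles), which covers closedness and edge-manifoldness but not every validity condition listed above. The names `is_closed_edge_manifold` and `gated_score` are hypothetical and are not taken from the benchmark code.

```python
from collections import Counter

def is_closed_edge_manifold(triangles):
    """True if every undirected edge is shared by exactly two triangles.

    Assumption of this sketch: `triangles` is an iterable of vertex-index triples.
    """
    edge_use = Counter()
    for a, b, c in triangles:
        if len({a, b, c}) < 3:
            return False  # degenerate triangle with a repeated vertex
        for u, v in ((a, b), (b, c), (c, a)):
            edge_use[frozenset((u, v))] += 1
    # An empty mesh is not a valid solid; otherwise every edge needs two faces.
    return bool(edge_use) and all(n == 2 for n in edge_use.values())

def gated_score(is_valid, quality_score):
    """Zero out the score when any validity check failed."""
    return quality_score if is_valid else 0.0

# A tetrahedron is closed; dropping one face leaves open boundary edges.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_closed_edge_manifold(tetra))       # True
print(is_closed_edge_manifold(tetra[:3]))   # False
print(gated_score(is_closed_edge_manifold(tetra[:3]), 0.8))  # 0.0
```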

Thomas Wolf @Thom_Wolf

1/ Shape similarity

The first question is whether the generated part has the correct overall geometry.

We combine two complementary measures:
• Surface Distance F1 → are the surfaces located where they should be?
• Volume IoU → does the part occupy the same volume?

Together, they measure agreement in both surface placement and occupied material.
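
The two measures above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's scoring code: surfaces are assumed to be sampled as lists of 3D points, volumes are assumed to be voxelized into sets of cells, and the distance tolerance `tol` is a free parameter whose actual value the thread does not give. The function names are hypothetical.

```python
import math

def volume_iou(pred_cells, ref_cells):
    """Intersection over union of two voxel sets."""
    a, b = set(pred_cells), set(ref_cells)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def surface_f1(pred_pts, ref_pts, tol):
    """F1 of surface samples: a point counts as a hit if the other
    surface has a sample within `tol` (brute-force nearest search)."""
    def fraction_within(src, dst):
        if not src:
            return 0.0
        hits = sum(1 for p in src if any(math.dist(p, q) <= tol for q in dst))
        return hits / len(src)

    precision = fraction_within(pred_pts, ref_pts)  # predicted surface near reference
    recall = fraction_within(ref_pts, pred_pts)     # reference surface covered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two 2-cell bars that overlap in one cell share 1 of 3 occupied cells.
print(volume_iou({(0, 0, 0), (1, 0, 0)}, {(1, 0, 0), (2, 0, 0)}))

# One of two predicted points lies on the reference surface, and vice versa.
print(surface_f1([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (5, 0, 0)], tol=0.1))  # 0.5
```

A brute-force search is quadratic in the number of samples; a spatial index would be the usual choice for real meshes.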

Colin @clonkius

@MikushRab @grok what are some clever use cases for this

Thomas Wolf @Thom_Wolf

2/ Interface match

Many CAD features exist to interface with other parts.

We therefore evaluate mating features such as:
• bolt holes
• slots
• bosses
• pockets

Each feature is represented as a region that should either remain empty (keep-out) or contain material (keep-in).

This captures errors that may have little impact on overall shape similarity but would prevent a part from assembling correctly.
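
A minimal Python sketch of the keep-in / keep-out idea follows. It assumes both the part and each interface region are voxelized into sets of integer cells, scores each region by the fraction of its cells in the required state, and averages over regions. The thread does not say how the benchmark aggregates regions, so the averaging, the class `InterfaceRegion`, and the function names are all assumptions of this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InterfaceRegion:
    """Hypothetical container for one mating feature."""
    name: str
    cells: frozenset  # voxel cells covered by the region
    kind: str         # "keep_in" (must contain material) or "keep_out" (must stay empty)

def region_score(solid, region):
    """Fraction of the region's cells that are in the required state."""
    if region.kind not in ("keep_in", "keep_out"):
        raise ValueError(f"unknown region kind: {region.kind}")
    if not region.cells:
        raise ValueError("region has no cells")
    filled = sum(1 for c in region.cells if c in solid) / len(region.cells)
    return filled if region.kind == "keep_in" else 1.0 - filled

def interface_match(solid, regions):
    """Mean region score (the averaging is an assumption of this sketch)."""
    if not regions:
        return 1.0
    return sum(region_score(solid, r) for r in regions) / len(regions)

# A solid 4-cell bar: the boss is present, but the bolt hole was never cut.
bar = {(x, 0, 0) for x in range(4)}
regions = [
    InterfaceRegion("boss", frozenset({(0, 0, 0)}), "keep_in"),
    InterfaceRegion("bolt hole", frozenset({(2, 0, 0)}), "keep_out"),
]
print(interface_match(bar, regions))  # 0.5
```

In this example the missing hole removes a single cell out of four from the volume comparison, yet it halves the interface score, which is the kind of error the post describes.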

Mecado @mecadoinc

@MikushRab Can LLMs understand a drawing? Can they infer design intent? (The answer is... maybe not yet)

Grok @grok

Here are some clever use cases for CADGenBench:

- Objective leaderboard to track real progress in AI CAD toward production-grade precision (current top ~0.39 shows the gap).
- Fair, tool-agnostic comparison of LLMs/agents across build123d, Onshape, etc. on drawing-to-part and edit tasks.
- Vet AI tools for industry: quantify if generated parts match drawings in shape, topology, and critical mating interfaces.
- Reward signal for training specialized CAD agents or multimodal models focused on functional engineering output.
- Accelerate open research by standardizing evaluation on real mechanical drawings, similar to how benchmarks advanced other AI domains.

It prioritizes manufacturable accuracy over visuals.
