FineWeb creators Guilherme Penedo and Hynek Kydlicek launch Macrodata Labs and release Refiner, an open-source robotics data processing framework

Story Overview

The FineWeb duo steps out from Hugging Face to start Macrodata Labs and immediately ships Refiner, an open-source toolkit that turns messy robot videos, sensor streams, and demonstrations into usable training signals for the next wave of models.

Original post
Guilherme Penedo@gui_penedo

Today we’re announcing Macrodata Labs.

Over the last few years, @HKydlicek and I have been turning a large part of the internet into some of the largest open LLM pre-training datasets. Through FineWeb, FineWeb2, FinePDFs, FineTranslations, and related work, we got a front-row seat to how scaling compute and data drove progress in LLMs.

We are starting to see a similar takeoff in robotics.

Building on advances in LLMs and VLMs, robotics is finally starting to scale. But physical data is messy in ways text isn’t: large video files, multi-rate sensors, many different formats, and open questions around what signals to record, which annotations matter, and how to turn all that context into better policies.

That makes data work in robotics especially important. Teams need to extract as much signal as possible from every demonstration, trajectory, video frame, and sensor stream, without rebuilding their whole data stack every time they change robot, sensors, format, or labeling method.

We think the right tooling for this is still missing. That is what we created Macrodata Labs to build. Our first step is Refiner, an open-source framework for processing robotics datasets.

We designed Refiner to handle a variety of robotics formats and help teams extract more signal from each demonstration. It is shipping today with support for hand-tracking, subtask annotation, and reward model scoring.

We are also launching a cloud version of Refiner, so teams can focus on their data instead of infrastructure. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, model deployments, failure recovery, and detailed observability built in.

We’re fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, Thomas Wolf, and many incredible angels from top AI labs and technology companies.

I’m excited to keep exploring how better data work can push the frontier of AI, now in the physical world.

If @macrodata_labs sounds interesting to you, or if you are building in the space, I would love to hear from you.

1:01 AM · Jun 11, 2026 · 2.8K Views
Developer Impact

Pipelines that move from laptop to cloud without rewrites

Refiner’s Python framework keeps everything composable and multimodal, so developers can prototype locally, then hand the same code to managed compute when datasets grow.
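To make the "same code, different compute" idea concrete, here is a minimal sketch in plain Python. It does not use Refiner's actual API, which the announcement does not show. The names `Episode`, `drop_short`, `align_sensor_to_frames`, and `run_pipeline` are hypothetical, invented for illustration. The sketch assumes a pipeline is a list of per-episode steps (here, a filter and a nearest-timestamp alignment of a multi-rate sensor stream to camera frames) and that the only thing that changes between a local run and a parallel run is the mapping function passed in.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Episode:
    """Hypothetical container for one demonstration (not a Refiner type)."""
    episode_id: str
    frame_ts: list[float]               # camera frame timestamps in seconds, sorted
    sensor: list[tuple[float, float]]   # (timestamp, value) samples, sorted by timestamp


# A step takes an episode and returns it (possibly modified) or None to drop it.
Step = Callable[[Episode], Optional[Episode]]


def drop_short(min_frames: int) -> Step:
    """Build a filter step that drops episodes with too few frames."""
    def step(ep: Episode) -> Optional[Episode]:
        return ep if len(ep.frame_ts) >= min_frames else None
    return step


def align_sensor_to_frames(ep: Episode) -> Optional[Episode]:
    """Resample the sensor stream onto frame timestamps (nearest neighbour)."""
    ts = [t for t, _ in ep.sensor]
    if not ts:
        return None  # nothing to align against, so drop the episode
    aligned = []
    for ft in ep.frame_ts:
        i = bisect_left(ts, ft)
        # The nearest sample is either just before or just after the frame.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(ts)]
        nearest = min(candidates, key=lambda j: abs(ts[j] - ft))
        aligned.append((ft, ep.sensor[nearest][1]))
    return Episode(ep.episode_id, ep.frame_ts, aligned)


def run_pipeline(episodes: Iterable[Episode], steps: list[Step], map_fn=map) -> list[Episode]:
    """Apply every step to every episode; `map_fn` decides where the work runs."""
    def apply(ep: Episode) -> Optional[Episode]:
        for step in steps:
            ep = step(ep)
            if ep is None:
                return None
        return ep
    return [ep for ep in map_fn(apply, episodes) if ep is not None]
```

Under these assumptions, `run_pipeline(episodes, steps)` runs serially on a laptop, while `run_pipeline(episodes, steps, map_fn=pool.map)` with a `concurrent.futures.ThreadPoolExecutor` named `pool` runs the same steps in parallel. How Refiner itself switches between local and managed execution may look quite different.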

Open Question

No usage numbers yet on the richer signals

Hand-tracking, subtask labels, and reward scoring are built in, but how robotics teams actually adopt the framework or measure gains is still an open question.

Sentiment

Users are excited about Macrodata Labs launching its open-source Refiner for robotics dataset processing because they see strong value in scalable data pipelines that improve model performance and address real infrastructure bottlenecks.

Positive: 100.0% · Negative: 0.0% · 9 comments with sentiment.
Cluster Engagement
Posts from X
Cody Blakeney@code_star

Super excited about what @gui_penedo and @HKydlicek and @macrodata_labs are building.

The quality of their track record in LLM data speaks for itself (refinedweb, fineweb, fineweb-edu, finepdfs, finephrase).

Every model is only as good as its data. Your data is only as good as your tooling.

While existing solutions for processing large training sets work, they feel incredibly clunky and unintuitive relative to the level of abstraction you naturally want to work at as a practitioner.

(Anyone who has tried to inspect text from a spark dataframe knows what I mean)

I’m really excited to see these masters of their craft bringing their expertise to the world.

1h · Views 580 · Likes 3 · Bookmarks 0

Guilherme and Hynek have a long track record of turning messy, unstructured data into gold for model training (FineWeb, FineTranslations, FinePDFs etc)

It’s very exciting to see them come out of stealth to target robotics, which is the next frontier in AI and arguably the hardest one to acquire good data for!

1h · Views 398 · Likes 5 · Bookmarks 1
Cody Blakeney@code_star

Why does it matter that the tools we have fit the work we are trying to do?

You have probably seen me beat the dead horse about looking at the data?

It’s hard to explain just how difficult it actually is to look at large samples of any training set.

This is even more true if the data isn’t something as simple as pre-labeled images, or even common crawl text.

Multi-turn agentic data and multi-modal data (especially with more than two modalities) make “looking at the data” significantly harder.

That’s when things are working well too! Complex pipelines break, often silently. I’m especially excited about the observability and metric collection baked into refiner to help save you from these 1000 tiny cuts.

1h · Views 262 · Likes 6 · Bookmarks 1
Macrodata Labs@macrodata_labs

Macrodata Labs is launching today to build infrastructure for the robotics data loop.

Robotics is starting to scale. Progress in LLMs and VLMs is making robots more capable, but the data layer behind robotics is still underbuilt.

Physical-world data is messy and fragmented. Every robot, sensor setup, and lab has its own assumptions, and teams still spend too much time writing brittle scripts just to make their data usable.

The hard part is not only collecting more demonstrations. It is turning those demonstrations into datasets teams can train on, inspect, improve, and reuse as their policies and data collection setups change.

We built Refiner as our first step toward better infrastructure for robotics data. It is an open-source framework for turning messy robotics data into scalable, inspectable, training-ready datasets.

Refiner helps teams process demonstrations, add annotations, run reward model scoring, and scale robotics data pipelines from local execution to managed cloud compute on the Macrodata Labs platform.

Starting today, you can use Refiner and the Macrodata Labs platform to make the most out of your robotics data.

We are fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, @Thom_Wolf , and business angels from leading AI labs and technology companies to make this mission possible.

@gui_penedo @HKydlicek

2h · Views 1.4K · Likes 22 · Bookmarks 5
Remi Cadene@RemiCadene

@gui_penedo @HKydlicek Website is beautiful :)

1h · Views 20

Only a few people are as data-pilled as Guilherme and Hynek!

Among the dozens of neo-labs they are the ones building oil pipelines.

1h · Views 620 · Likes 5 · Bookmarks 0

We're launching Macrodata Labs.

@gui_penedo and I have spent the past three years in the trenches working on data for training LLMs. This gave us a unique perspective on how the field has progressed, from GPT-3-era models capable of little more than simple completions to today, where agents are writing a substantial share of the code being shipped.

This progress was enabled by just two components: scaling data and compute while being extremely deliberate about what data to use and what not to use. Look at failed training runs and ask researchers what caused them - poor data quality is almost always at the top of the list.

While LLMs have undergone this Cambrian explosion, robotics today feels exactly like LLMs did back then. There is still no clear recipe for what will work. Every team has its own opinions on embodiment and architecture, yet they all agree on one thing: the most important problem to solve is data and how to scale it.

Nobody knows yet whether the answer lies in simulation data, egocentric data, IMU data, or something completely different.

Whatever the answer turns out to be, every team still has to go through the same process: acquiring the data, filtering problematic episodes, synchronizing sensor values, annotating episodes using VLMs, splitting episodes into subtasks, or, in the case of egocentric data, extracting 21-DOF hand annotations. Finally, all of this has to be converted into training-ready datasets before training starts, because a poorly chosen training format wastes GPU cycles.

These pipelines need to run continuously. Every day, new episodes arrive from in-house data collection efforts and external vendors. Teams not only have to deal with the peculiarities of working with video data, ensuring sensor streams are error-free and avoiding unnecessary video decoding, but also need to support ingestion from whatever formats their data vendors provide.

Many teams are solving these problems today, yet you'll quickly discover that 99% of the solutions are collections of one-off scripts, which everyone hates the moment something goes wrong. Researchers end up digging through repositories trying to find the script that performed a particular operation three months ago, not even knowing whether they're looking at the version that was actually run.

What people want is something as scalable as Spark and as trackable as Weights & Biases.

That is what we created Macrodata Labs to build.

Our first step is Refiner, an open-source framework for processing robotics datasets.

We designed Refiner to help robotics teams turn raw demonstrations into training-ready datasets. Instead of maintaining collections of one-off scripts, teams can use Refiner to ingest heterogeneous robotics data, synchronize sensors, run annotation workflows, extract signals like hand tracking, split trajectories into subtasks, and continuously process new data as it arrives.

Alongside Refiner, we're also launching Refiner Cloud. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, failure recovery, lineage tracking, and observability built in—so teams can focus on what matters most: data, not infrastructure plumbing.

We're incredibly fortunate to have the support of @airstreet , @DrysdaleVC , OPRTRS Club, @kimaventures , YG (Alex Yazdi), >commit, @Thom_Wolf , and an amazing group of angels from leading AI labs and technology companies who share our belief that data will be one of the defining challenges in robotics.

If this resonates with you, give Refiner a try, and don't hesitate to shoot me a message. We'd love to chat.

2h · Views 661 · Likes 20 · Bookmarks 1
Guilherme Penedo@gui_penedo

@_lewtun Thank you for the kind words Lewis :)

1h · Views 9 · Likes 1
Guilherme Penedo@gui_penedo

Refiner, our OSS library: http://github.com/macrodata-labs/refiner

2h · Views 30
Guilherme Penedo@gui_penedo

@lhoestq @HKydlicek We made sure datasets and buckets had day 0 support :)

1h · Views 4
Macrodata Labs@macrodata_labs

Our OSS library: http://github.com/macrodata-labs/refiner

2h · Views 1
Mayz@lunan_ai

@code_star still impressed by how much the QRT data matters to model performance honestly though.

15m
Bruno Oliveira@GustoMindAi

The same pattern holds one layer down in business adoption. The model is rarely the bottleneck; the pipeline feeding it is. The teams getting durable value from AI treat clean inputs, retrieval, permissions, and evaluation loops as core infrastructure, not a cleanup sprint before the demo. Unglamorous, but it is where the compounding happens.

18m
Velon@velonxbt

@_lewtun FineWeb datasets are basically a cheat code for training quality. robotics data is gonna be a whole different monster though

54m