FineWeb creators Guilherme Penedo and Hynek Kydlicek launch Macrodata Labs and release Refiner, an open-source robotics data processing framework

Super excited about what @gui_penedo and @HKydlicek and @macrodata_labs are building.

The quality of their track record in LLM data speaks for itself (RefinedWeb, FineWeb, FineWeb-Edu, FinePDFs, FinePhrase).

Every model is only as good as its data. Your data is only as good as your tooling.

While existing solutions for processing large training sets work, they feel incredibly clunky and unintuitive at the level of abstraction you naturally want to work at as a practitioner.

(Anyone who has tried to inspect text from a Spark DataFrame knows what I mean.)

I’m really excited to see these masters of their craft bringing their expertise to the world.

Guilherme Penedo @gui_penedo

Today we’re announcing Macrodata Labs.

Over the last few years, @HKydlicek and I have been turning a large part of the internet into some of the largest open LLM pre-training datasets. Through FineWeb, FineWeb2, FinePDFs, FineTranslations, and related work, we got a front-row seat to how scaling compute and data drove progress in LLMs.

We are starting to see a similar takeoff in robotics.

Building on advances in LLMs and VLMs, robotics is finally starting to scale. But physical data is messy in ways text isn’t: large video files, multi-rate sensors, many different formats, and open questions around what signals to record, which annotations matter, and how to turn all that context into better policies.

That makes data work in robotics especially important. Teams need to extract as much signal as possible from every demonstration, trajectory, video frame, and sensor stream, without rebuilding their whole data stack every time they change robot, sensors, format, or labeling method.

We think the right tooling for this is still missing. That is what we created Macrodata Labs to build. Our first step is Refiner, an open-source framework for processing robotics datasets.

We designed Refiner to handle a variety of robotics formats and help teams extract more signal from each demonstration. It is shipping today with support for hand-tracking, subtask annotation, and reward model scoring.

We are also launching a cloud version of Refiner, so teams can focus on their data instead of infrastructure. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, model deployments, failure recovery, and detailed observability built in.

We’re fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, Thomas Wolf, and many incredible angels from top AI labs and technology companies.

I’m excited to keep exploring how better data work can push the frontier of AI, now in the physical world.

If @macrodata_labs sounds interesting to you, or if you are building in the space, I would love to hear from you.

Lewis Tunstall @_lewtun

Guilherme and Hynek have a long track record of turning messy, unstructured data into gold for model training (FineWeb, FineTranslations, FinePDFs, etc.).

It’s very exciting to see them come out of stealth to target robotics, which is the next frontier in AI and arguably the hardest one to acquire good data for!