Today we’re announcing Macrodata Labs.
Over the last few years, @HKydlicek and I have been turning a large part of the internet into some of the largest open LLM pre-training datasets. Through FineWeb, FineWeb2, FinePDFs, FineTranslations, and related work, we got a front-row seat to how scaling compute and data drove progress in LLMs.
We are starting to see a similar takeoff in robotics.
Building on advances in LLMs and VLMs, robotics is finally starting to scale. But physical data is messy in ways text isn’t: large video files, multi-rate sensors, many different formats, and open questions around what signals to record, which annotations matter, and how to turn all that context into better policies.
That makes data work in robotics especially important. Teams need to extract as much signal as possible from every demonstration, trajectory, video frame, and sensor stream, without rebuilding their whole data stack every time they change robot, sensors, format, or labeling method.
We think the right tooling for this is still missing. That is what we created Macrodata Labs to build. Our first step is Refiner, an open-source framework for processing robotics datasets.
We designed Refiner to handle a variety of robotics formats and help teams extract more signal from each demonstration. It is shipping today with support for hand-tracking, subtask annotation, and reward model scoring.
We are also launching a cloud version of Refiner, so teams can focus on their data instead of infrastructure. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, model deployments, failure recovery, and detailed observability built in.
We’re fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, Thomas Wolf, and many incredible angels from top AI labs and technology companies.
I’m excited to keep exploring how better data work can push the frontier of AI, now in the physical world.
If @macrodata_labs sounds interesting to you, or if you are building in the space, I would love to hear from you.









