Introducing STACK: Learning Composable Skills by Discovering Spatial and Temporal Structure with Foundation Models.
How can robots learn skills that generalize when the world changes? The default answer has been more data. It works, but it's expensive and slow.
During my final year at @Stanford, our team explored a different idea: can robots discover the right abstractions from just a handful of expert demonstrations to enable strong generalization?
STACK uses foundation models to discover spatial and temporal structure from a handful of demonstrations, then learns composable skills on top of that structure.
It does this with just 5-10 real-world demonstrations per domain, across three manipulation settings, and without hand-designed task decomposition.
For more details, including real-world demos: https://icra-stack.github.io
(1/n)