
OpenCLIP Gains FSDP2 Support, Task Refactor, And CLAP Audio Integration


I've made extensive changes to OpenCLIP over the past few weeks, merging ideas I've been evolving in some other projects that I'm working on in parallel. If you've built anything around it, take a peek.

Training has been refactored around a lightweight Task-based abstraction, making different model + loss/objective combos much cleaner to swap. FSDP2 support was added, torch.compile support was improved, and some effort went into getting combos of FSDP2 or DDP + compile + activation checkpointing to work (mostly) nicely together.

Native-aspect NaFlex data pipelines and the timm-based NaFlexViT encoder are supported for train and eval (incl. the OpenCLIP-native SIGLIP2 NaFlex models). @mehdidc started this, but I broke it and needed to rethink.

The merging of CLAP (audio-CLIP) modelling, train task, and data pipelines, initiated by @JJitsev, was unblocked thanks to the above cleanup. I just completed the reorganization, and initial CLAP training appears to be functioning 🥳 There is still some verification to do. I will be testing distributed performance on JUPITER (Jülich Supercomputing Centre, Germany) to clear the way for some researchers to abandon their forks :)

11:57 AM · May 22, 2026

Oh yeah, and models built on the native OpenCLIP vanilla ViT encoder or on timm-based ViT encoders (EVA, PE, etc.) can be 'promoted' to NaFlexViT and run in eval (or fine-tuned) w/ NaFlex data pipelines.
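
For readers unfamiliar with the pattern, here is a minimal sketch of what a "Task" abstraction can look like in general: an object that bundles the model pieces with an objective, so the train loop depends only on one small interface. This is not OpenCLIP's actual API. Every name here (`Task`, `PairedEmbeddingTask`, `training_step`, `squared_distance`, `train`) is hypothetical, and plain Python lists stand in for tensors so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Sequence

Vector = Sequence[float]


class Task(Protocol):
    """Hypothetical interface: one step returns a dict of named loss terms."""

    def training_step(self, batch: dict) -> Dict[str, float]: ...


def squared_distance(a: Vector, b: Vector) -> float:
    """Toy objective: squared distance between two paired embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class PairedEmbeddingTask:
    """Bundles two encoders with a loss (all three are swappable callables)."""

    encode_image: Callable[[Vector], Vector]
    encode_text: Callable[[Vector], Vector]
    loss_fn: Callable[[Vector, Vector], float]

    def training_step(self, batch: dict) -> Dict[str, float]:
        image_emb = self.encode_image(batch["image"])
        text_emb = self.encode_text(batch["text"])
        return {"loss": self.loss_fn(image_emb, text_emb)}


def train(task: Task, batches: List[dict]) -> List[float]:
    """The loop only knows the Task interface, not the model or the loss."""
    history = []
    for batch in batches:
        losses = task.training_step(batch)
        history.append(losses["loss"])
    return history


# Toy "encoders": the image encoder doubles each value, the text encoder is identity.
task = PairedEmbeddingTask(
    encode_image=lambda v: [2 * x for x in v],
    encode_text=lambda v: list(v),
    loss_fn=squared_distance,
)
history = train(
    task,
    [
        {"image": [1.0, 2.0], "text": [2.0, 4.0]},
        {"image": [1.0, 0.0], "text": [0.0, 0.0]},
    ],
)
```

Swapping the objective or either encoder means constructing a different task object, and `train` stays untouched. That is the kind of model + loss decoupling the post describes, shown here only in toy form.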

Ross Wightman (@wightmanr)


6:57 PM · May 22, 2026 · 2.8K Views
7:04 PM · May 22, 2026 · 294 Views