Bitstream diffusion has major advantages over its tokenized counterpart.
First, it provides a universal encoding scheme for multimodal data. Language, audio, images, and chemical data can all be represented in a common format.
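As a rough illustration of what a common format could look like, here is a minimal sketch that flattens raw bytes into bits. The helper name `to_bitstream` and the choice of most-significant-bit-first ordering are assumptions made for this example, not something the method prescribes.

```python
def to_bitstream(data: bytes) -> list:
    """Flatten raw bytes into a list of 0/1 ints, most significant bit first.

    Hypothetical helper for illustration only.
    """
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


# Text (UTF-8 bytes) and 8-bit pixel intensities end up in the same representation.
text_bits = to_bitstream("hi".encode("utf-8"))
pixel_bits = to_bitstream(bytes([0, 255]))
```

Any modality that can be serialized to bytes can be fed through the same function, which is the sense in which the encoding is shared.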
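Second, binary encoding compresses vocabulary sizes exponentially! Since n bits can address 2^n distinct entries, a vocabulary of V tokens needs only about log2(V) bits per token. This allows the use of much larger vocabularies with a significantly smaller memory footprint.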
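To make the size argument concrete, the sketch below maps token ids to fixed-width binary codes and back. The function names (`bits_needed`, `token_to_bits`, `bits_to_token`) are hypothetical and chosen for this example; a real model would operate on continuous relaxations of these bits rather than exact integers.

```python
def bits_needed(vocab_size: int) -> int:
    """Smallest code width that gives every token id in [0, vocab_size) a unique code."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_bits(token_id: int, n_bits: int) -> list:
    """Encode a token id as n_bits binary digits, most significant bit first."""
    return [(token_id >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def bits_to_token(bits: list) -> int:
    """Decode a most-significant-bit-first list of 0/1 values back into a token id."""
    token_id = 0
    for bit in bits:
        token_id = (token_id << 1) | bit
    return token_id


# A vocabulary of 50,257 entries fits in 16 bits per token.
width = bits_needed(50257)
code = token_to_bits(5, 4)
```

Doubling the vocabulary adds only one bit to each code, which is why very large vocabularies stay cheap under this encoding.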
Now that continuous diffusion for categorical data is taking off... go bitstream! ;)