Agreed, I think we've reached a point where people are willing to give up the efficiency gains of latents to make their lives a lot simpler. The two-stage sequential dependency in training LDMs is painful, but we put up with it when the payoff is large enough.
As models get more sophisticated and hardware gets faster, more and more people will start to believe the payoff isn't worth the pain anymore. It seems like it's starting to happen here and there for image models; I think it will take quite a bit longer for video models though.
@LiangZheng_06 Latents are about efficiency, but they create additional problems like information loss and decoder quality degradation. When efficiency becomes less of a problem, people feel this is like a bitter lesson and want to learn everything end-to-end.