as an announcement: principal sre gpt 5.5 has been promoted to sr tech lead, and sr staff sre glm 5.2 has been promoted to principal sre and has taken over day-to-day operations of all of our inference systems (and soon all of our scm / vcs clusters as well).
on top of building the SCM features you are about to see (more on this soon), this week has been focused on building out the AI system to run, operate, and automate our public-facing inference system. we are now serving tens of thousands of users, thousands of simultaneous user sessions, and millions of tokens a second from our cluster.
if there is sufficient interest in 'how to run a production-grade inference system at scale', i will write something up, but it's just as much about the custom serving layer (bespoke, rust-based, grpc, etc.) as it is the engine itself (the engine is less important than you would think).
this covers optimizing disagg, balancing prefill vs decode nodes, efficient kv caching and transfer over the pool, per-token tracing to find bugs, and layers of security features and profiles to restrict abusive users without heavy-handed rate limits.
it's been a wild-ass week

