Pleias CTO Pierre-Carl Langlais argues large corporations have the compute but lack the specialized engineering talent to scale LLMs

Story Overview

Large organizations often own ample GPU clusters yet still hit friction when trying to run advanced models for many simultaneous users. The bottleneck highlighted is not hardware scarcity but the narrow pool of engineers who know how to tune inference stacks for high-concurrency, long-context workloads such as those posed by GLM-5.2.

Original post

Alexander Doria@Dorialexander

@_xjdr No but that's really the key thing: average large co might have the compute, just not the skills.

xjdr@_xjdr

@Dorialexander Don't get me wrong, it's doable. It's just a lot harder than it sounds and exponentially harder when you are serving it for more than one person

4:17 AM · Jun 29, 2026 · 125 Views

Developer Impact

Concurrency Exposes Hidden Expertise Gaps

Serving GLM-5.2 to more than a handful of concurrent users requires careful handling of the KV cache, memory fragmentation, and speculative decoding, which most corporate teams have not yet mastered.
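
To make the concurrency point concrete, here is a rough KV-cache sizing sketch in Python for standard multi-head or grouped-query attention. The layer count, KV-head count, head dimension, and context length are hypothetical placeholders, not GLM-5.2's published configuration, and architectures that compress the cache would need a different formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical attention geometry (assumed values, not a real model config)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes fp16/bf16 cache entries


def kv_bytes_per_token(shape: AttentionShape) -> int:
    # One key vector and one value vector per KV head, per layer.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def kv_cache_gib(shape: AttentionShape, context_tokens: int,
                 concurrent_sequences: int) -> float:
    # Cache size grows linearly in both context length and concurrency.
    total = kv_bytes_per_token(shape) * context_tokens * concurrent_sequences
    return total / 2**30


# Illustrative numbers only.
shape = AttentionShape(num_layers=64, num_kv_heads=8, head_dim=128)
for users in (1, 8, 32):
    gib = kv_cache_gib(shape, context_tokens=32_768, concurrent_sequences=users)
    print(f"{users:>2} concurrent sequences -> {gib:.0f} GiB of KV cache")
```

With these assumed values, one 32,768-token sequence needs 8 GiB of cache, 8 sequences need 64 GiB, and 32 need 256 GiB, all before counting model weights. That growth is one reason multi-user serving is so much harder than a single-user demo.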

Open Question

Inference Talent Remains Concentrated

The skills needed appear clustered inside dedicated inference providers rather than distributed across general AI labs or enterprises, leaving an open question about how widely the newest open-weight models can actually be deployed at scale.

Sentiment

Users are enthusiastic about the business opportunities for inference providers in AI scaling, citing aligned incentives with customers and sharing technical solutions like proxy code for single-node model serving.

Positive: 100.0% · Negative: 0.0% · 3 comments with sentiment.

Posts from X

xjdr@_xjdr

@Dorialexander there are probably not a whole lot of people i would imagine could serve glm5.2 to more than a handful of concurrent users who's full time job isn't currently at an inference provider

Alexander Doria@Dorialexander

@_xjdr verbatim a point i've made…

Einar Altsson@EinarAltsson

@Dorialexander @_xjdr Okay, GLM 5 was not yet tackled, so can't comment on that one. Only Qwen 397B for developers on 4 GPUs so far, but no issues there. Maybe just takes some time for vLLM/SGLang to catch up here..

Einar Altsson@EinarAltsson

@Dorialexander @_xjdr Yes, Qwen for coding, Gemma for everything else. vLLM via docker compose is surfaced vanilla (HTTP, completely open). A proxy (that also offers an Anthropic API) is taking care of rate limits and Entra IDs etc. and is the only machine that can connect to it.
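
The setup described above puts rate limiting in a proxy in front of an otherwise open vLLM endpoint. The proxy code itself is not shown in the thread, so what follows is only a generic per-user token-bucket sketch in Python. The class names, the `user_id` key (standing in for whatever identity the proxy resolves, such as an Entra ID), and the rates are all assumptions.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Classic token bucket: refills at a fixed rate up to a burst capacity."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerUserLimiter:
    """Hypothetical helper: one bucket per authenticated user."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(user_id)
        if bucket is None:
            bucket = self.buckets[user_id] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)


# Assumed policy: 1 request per second, bursts of 2, per user.
limiter = PerUserLimiter(rate_per_sec=1.0, capacity=2.0)
```

A proxy would call `limiter.allow(user_id)` before forwarding a request to the model server and return HTTP 429 when it yields `False`. Passing `now` explicitly makes the behaviour easy to test without sleeping.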

Alexander Doria@Dorialexander

@EinarAltsson @_xjdr Depends on the use case but the one most people would look GLM for (local claude code) is not really solved…

Einar Altsson@EinarAltsson

@Dorialexander @_xjdr Really looking forward to the blog post, because so far I did not have any serious issues serving models on a single node for a medium sized company. Efficiency is not an issue with B2B, really 🤷🏻‍♂️

Alexander Doria@Dorialexander

@EinarAltsson @_xjdr ok not bad. and you're running it for coding? with some parallel session management?

Zach Mueller@TheZachMueller

@Dorialexander @_xjdr But what if I set /goal right. Right?

(Obvious sarcasm here)

Einar Altsson@EinarAltsson

@Dorialexander @_xjdr @Dorialexander btw, if you are interested in the proxy I can give you the code, idk 😄

adijo@amplituhedron

@Dorialexander @_xjdr But isn't this exactly why it's a good business opportunity for inference providers? Their incentives are also aligned with customers

Ferbin@Ferbin08

@Dorialexander @_xjdr voice agents need low latency. local solves it.

but you're paying for hardware instead of api calls.