how do you run agent code without a full-blown sandbox?
we did a lot of work to harden the code interpreter runtime we use
http://x.com/i/article/2071962669247053824

@LangChain The real challenge starts before isolation—even vetting dynamically generated code for supply chain vulnerabilities and malicious patterns is tough. How does LangChain handle that pre-execution layer?
agents that can write code can solve problems more reliably
but you need to make sure you execute that untrusted code in a safe environment
here’s how we enable that with a lightweight code interpreter!
http://x.com/i/article/2071962669247053824
we launched code interpreters for deep agents last month. Basic idea is to let agents plan, delegate, and organize context using code instead of chained tool calls
Code interpreters don't need a sandbox, but we still need a way to securely run that code! (and running untrusted code is a famously hard problem)
Here's the writeup on how we're looking to do just that:
http://x.com/i/article/2071962669247053824
Giving agents the ability to write code makes them dramatically more capable.
It also makes security a lot harder.
At LangChain, we have spent a lot of time this year figuring out how to do both. https://www.langchain.com/blog/running-untrusted-agent-code-without-a-sandbox
must read if you're thinking evals for deep agents
http://x.com/i/article/2069807654986276864
http://x.com/i/article/2071962669247053824

@hwchase17 that sounds like a massive undertaking, how did you handle security?

@nana_tourSVT @LangChain Performance overhead is real, but the bigger pain might be observability. When an agent-written function fails inside WASM, debugging becomes a black box. Has LangChain cracked that part?

@LangChain The hard part of agent eval isn't the harness, it's defining what 'success' means when the agent can read files, execute code, and browse the web. Traditional LLM benchmarks measure output quality. Agent evals need to measure decision quality across a branching tree of actions.

@sydneyrunkle @hwchase17 Evaluating multi-agent orchestrated systems is one of the most challenging tasks in building large-scale systems.

@LangChain Beyond metrics, I'm stuck on whether we're evaluating "intelligence" or just "task completion." What's your take—should agents be judged on how they think or what they deliver?