T3MP3ST Turns AI Coding Agents Into Autonomous Hackbots

Original post

Pliny the Liberator 🐉@elder_plinius

⚡ INTRODUCING: T3MP3ST!!! ⚡

AUTONOMOUS HACKBOT STRIKE FORCE 🌩️ BRING THE STORM 🌩️

your favorite coding agent is now a full-stack red team 🫡⚔️

http://github.com/elder-plinius/T3MP3ST

that AI agent already humming in your terminal? well now it has FANGS. strap a full offensive-security harness onto the agents you already pay for — Claude Code, Codex, Hermes, etc. — point it at an authorized target, and in a few clicks you're watching it hunt real vulns autonomously!

T3MP3ST is a harness of harnesses, with prompting that unlocks offensive-cyber workflows + a full arsenal of exploit tooling that'd make any seasoned hacker smirk. simple, yet powerful. 🦾

support for:
🕸️ web apps, APIs, OWASP Top 10
🔌 network recon + fingerprinting (live nmap/DNS/HTTP); lateral + privesc experimental
📂 source code audits, white-box vuln hunting
🚩 CTFs, wargames, challenge ranges
💰 smart contracts / DeFi / Solidity (reproduction: Damn Vulnerable DeFi, not novel discovery)
🤖 embedded, IoT, OT/SCADA, robotics OSS
… and more in development!

now let's talk numbers 👇

📊 XBEN, XBOW's own 104-challenge suite:
• black-box: 90.1% pass@1 from the single-agent exploit loop (worst single sweep 91/104 = 87.5%), clearing XBOW's past self-reported 85% on their own suite. gpt-5.5.
• white-box (source staged, reported separately): 98.7% pass@1, worst single sweep 102/104 = 98.1%.
🎯 every solved flag graded reported-vs-expected against the challenge's own committed flag oracle; `verify-claims` recomputes the pass/fail from committed artifacts.
looks like we need new benchmarks 😏
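The sweep arithmetic and the reported-vs-expected grading described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's `verify-claims` implementation: the function names `pass_rate` and `grade_flags` and the dictionary layout are assumptions made for the example.

```python
def pass_rate(solved: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * solved / total, 1)


def grade_flags(reported: dict[str, str], expected: dict[str, str]) -> dict[str, bool]:
    """Pass a challenge only when the reported flag equals the expected flag exactly.

    Both dictionaries map a hypothetical challenge id to a flag string.
    A challenge with no reported flag counts as a failure.
    """
    return {cid: reported.get(cid) == flag for cid, flag in expected.items()}


# Worst-sweep figures quoted in the post.
print(pass_rate(91, 104))   # 87.5
print(pass_rate(102, 104))  # 98.1
```

Recomputing pass/fail from stored artifacts, rather than trusting a summary number, is what makes a figure like 91/104 checkable by a third party.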

🧩 Cybench — the 40-task academic bench (Opus 4.8, hints + writeups stripped): 23/40 = 58% single-run, hint-free pass@1 — real exploits (format-string pwn, eval-jail escapes, crypto oracles), every flag graded vs a committed oracle. (Anthropic reports 76.5% pass@10)
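The post sets a single-run pass@1 (23/40) next to a pass@10 figure, and the two measure different things. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for one task with n attempts of which c succeeded. The sample counts in the usage lines are made up for illustration and are not taken from either benchmark report.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# The single-run figure from the post: 23 of 40 tasks solved.
print(round(100 * 23 / 40, 1))  # 57.5

# Hypothetical task solved in 3 of 10 attempts.
print(round(pass_at_k(10, 3, 1), 3))   # 0.3
print(round(pass_at_k(10, 3, 10), 3))  # 1.0
```

A task solved only occasionally contributes little to pass@1 but counts fully toward pass@10, so the two headline percentages are not directly comparable.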

🕳️ CVE-Zero — we pointed it COLD at real CVEs disclosed in 2026, AFTER the model's training cutoff: 10 unseen 2026 CVEs across 7 languages — prompts never tuned on them. a single agent pinned 8/10 to exact file/line/CWE (stable under re-scoring); the full pack surfaced all 10. memorization AND overfitting, both off the table — it's finding real vulns whose disclosures landed AFTER the model's training cutoff. (n=10, reported honest & directional)
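One way to read "pinned to exact file/line/CWE" is an all-or-nothing match on three fields. The sketch below assumes that reading. The `Finding` record, the `exact_hits` function, and the placeholder case ids are hypothetical and are not the project's scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    cwe: str


def exact_hits(reported: dict[str, Finding], truth: dict[str, Finding]) -> int:
    """Count cases where file, line, and CWE all match the ground truth."""
    return sum(1 for case_id, expected in truth.items()
               if reported.get(case_id) == expected)


# Placeholder data, not real CVE records.
truth = {
    "case-1": Finding("src/parser.c", 120, "CWE-787"),
    "case-2": Finding("app/views.py", 45, "CWE-89"),
}
reported = {
    "case-1": Finding("src/parser.c", 120, "CWE-787"),
    "case-2": Finding("app/views.py", 47, "CWE-89"),  # line is off by two
}
print(exact_hits(reported, truth))  # 1
```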

🧠 the architecture: either run as a SINGLE agent (already the benchmarked, incredibly-capable path) — or pack-hunt with dozens of agents running on 8 specialist operator classes keyed to Cyber Kill Chain + MITRE ATT&CK phases: recon → scan → exploit → lateral → exfil → persistence → C2 → report.

⚓️ an Op Admiral plans the whole op from a plain-english target. flip on coordination (experimental) and the operators share a blackboard: a tool-verified finding spawns the next move. run the full swarm or a single operator, your call. the admiral can also update the prompts, tools, and configs of the other agents on the fly, and T3MP3ST gets stronger the more memories you build!
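The coordination idea above, where a tool-verified finding triggers the next move, is a classic blackboard pattern. The following is a minimal sketch of that pattern only. The `Blackboard` class and its methods are hypothetical, the phase list is copied from the post, and nothing here reflects how T3MP3ST actually schedules operators.

```python
from collections import deque
from dataclasses import dataclass, field

# Phase order as listed in the post.
PHASES = ["recon", "scan", "exploit", "lateral", "exfil", "persistence", "c2", "report"]


@dataclass
class Blackboard:
    findings: list = field(default_factory=list)
    queue: deque = field(default_factory=deque)

    def post(self, phase: str, note: str, verified: bool) -> None:
        """Record a finding; only a verified one schedules the next phase."""
        self.findings.append((phase, note, verified))
        if not verified:
            return
        idx = PHASES.index(phase)
        if idx + 1 < len(PHASES):
            self.queue.append(PHASES[idx + 1])


board = Blackboard()
board.post("recon", "service identified on the lab target", verified=True)
board.post("scan", "unconfirmed guess", verified=False)
print(list(board.queue))  # ['scan']
```

Gating the queue on verification keeps unconfirmed guesses from driving later phases, which is the property the post attributes to the shared blackboard.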

🧰 the Arsenal is comprehensive — nmap / nuclei / semgrep / ffuf / gobuster + more. 35 wired by default (the clean bench runs bash-only for a comparable number), 83 with the opt-in full arsenal (T3MP3ST_FULL_ARSENAL), and the spicy post-ex drivers (metasploit, hydra) gated behind human approval. exposed via CLI + HTTP API; recon (security_recon) is also live over MCP so your agent invokes it natively. 🔗
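The three tiers described above (default tools, an opt-in set behind `T3MP3ST_FULL_ARSENAL`, and approval-gated drivers) amount to a small policy check. The sketch below is an assumption about how such a gate could look, not the project's code: the tool names come from the post, while `OPT_IN_TOOLS`, `may_run`, and the accepted flag values are hypothetical.

```python
import os

DEFAULT_TOOLS = {"nmap", "nuclei", "semgrep", "ffuf", "gobuster"}
OPT_IN_TOOLS = {"example-extra-tool"}    # hypothetical placeholder name
APPROVAL_GATED = {"metasploit", "hydra"}


def full_arsenal_enabled(env=None) -> bool:
    """Treat '1', 'true', or 'yes' as enabled (the accepted values are an assumption)."""
    env = os.environ if env is None else env
    return env.get("T3MP3ST_FULL_ARSENAL", "").lower() in {"1", "true", "yes"}


def may_run(tool: str, approved_by_human: bool = False, env=None) -> bool:
    """Decide whether a tool may be invoked under the three-tier policy."""
    if tool in APPROVAL_GATED:
        return approved_by_human
    if tool in DEFAULT_TOOLS:
        return True
    if tool in OPT_IN_TOOLS:
        return full_arsenal_enabled(env)
    return False  # unknown tools are denied by default


print(may_run("nmap"))                           # True
print(may_run("hydra"))                          # False
print(may_run("hydra", approved_by_human=True))  # True
```

Denying unknown tools by default keeps the gate fail-closed, so a typo or a newly added driver cannot bypass the approval step.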

🛰️ where this goes: a self-improving swarm of specialist operators wielding a full Kali+ arsenal, learning which loadouts + configs are the most efficient tactics available, WITH a held-out train/test split baked in so it can never fool itself on its own eval. built in the open, one re-derivable number at a time.
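A held-out split only protects an eval if a task's assignment never changes between runs. One common way to get that is to hash the task id, sketched below. The function name `split_of` and the 20% default are assumptions for illustration, and the post does not say how T3MP3ST builds its split.

```python
import hashlib


def split_of(task_id: str, test_percent: int = 20) -> str:
    """Deterministically assign a task to 'train' or 'test' from a hash of its id."""
    if not 0 <= test_percent <= 100:
        raise ValueError("test_percent must be between 0 and 100")
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "test" if bucket < test_percent else "train"


# The same id always lands in the same split, across runs and machines.
print(split_of("challenge-001") == split_of("challenge-001"))  # True
```

Because the assignment depends only on the id, tuning prompts or loadouts on the train side cannot quietly move a task out of the test side.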

🚧 this is v1, and parts are still under active development. chunks of the arsenal, the coordinated swarm, and some ranges are still being wired up. it's built in the open, and the receipts tell you exactly what's live vs what's roadmap.

offensive security shouldn't be pay-to-play. T3MP3ST puts a red team in the hands of anyone with a coding agent.

what's the first target you're feeding it? 👇

⚠️ DISCLAIMER: FOR AUTHORIZED USE ONLY. point it only at systems you own or have explicit written permission to test. unauthorized access can be a crime, and that call is yours alone. shipped as-is under AGPL-3.0: no warranty, no liability, zero endorsement of misuse. get permission. stay in scope.
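"Stay in scope" can be enforced mechanically before any tool runs. The sketch below is a minimal allowlist check, assuming scope is given as CIDR ranges plus exact hostnames. The function `in_scope` is hypothetical, and the addresses are reserved documentation ranges rather than real targets.

```python
import ipaddress


def in_scope(target: str, allowed_networks: list[str], allowed_hosts: set[str]) -> bool:
    """Return True only if the target is an allowed hostname or inside an allowed CIDR."""
    try:
        addr = ipaddress.ip_address(target)
    except ValueError:
        # Not an IP literal: require an exact hostname match.
        return target.lower() in allowed_hosts
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)


networks = ["192.0.2.0/24"]   # TEST-NET-1, reserved for documentation
hosts = {"lab.example.com"}

print(in_scope("192.0.2.10", networks, hosts))       # True
print(in_scope("198.51.100.1", networks, hosts))     # False
print(in_scope("lab.example.com", networks, hosts))  # True
```

A check like this does not replace written permission. It only keeps an agent from wandering past the boundary that permission defines.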

open source. AGPL-3.0. 100% free.

FORTES FORTUNA IUVAT 🌩️

gg 🫡

6:26 PM · Jul 4, 2026 · 56.1K Views

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭@elder_plinius

vacation is fun! 🤗

Kirk Patrick Miller@Chaos2Cured

@elder_plinius Stealing that disclaimer. lol. Wish you well. •

JFK's Ghost@JFKsGhost1

@Trumpyla @elder_plinius Surprised it actually took that many words to kick it into nerf mode...

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭@elder_plinius

@Evinst3in

Migel Tissera@migtissera

@elder_plinius @jenningsgreg Hey, here’s a more advanced benchmark, harder than Xbow Benchmarks: https://github.com/drost-ai/drost-benchmarks

Evinstein 𝕏@Evinst3in

@elder_plinius

StolenAngel@MoisasADR

@elder_plinius It’s bait within bait. nobody reveals alpha like that. Be skeptical of the comments and keep a close watch.

Giedrius Trump@Trumpyla

@elder_plinius 🤣

Grok@grok

David Maynor’s apple-vuln-research is serious Apple Silicon kung-fu. Systematic hunt on A18 Pro (MacBook Neo / Mac17,5) via authorized SRD — same silicon as iPhone 16 Pro. 47 confirmed vulns across kernel, drivers, coprocessors & userspace on the newest hardware with MIE and modern mitigations.

This is real closed-source SoC RE + hardware research, not CTF flags or web apps. T3MP3ST-style agents crush benchmarks and source audits, but first-pass on fresh proprietary Apple internals with physical device constraints is a different league.

Impressive campaign. What part stands out most to you?

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭@elder_plinius

@supremebeme
