The Harness Does the Work: Inside Microsoft's 100-Agent MDASH Architecture That Found 4 Critical Windows RCEs — and Why 'Which Model?' Is the Wrong Question
Microsoft's MDASH multi-model agentic scanning harness found 16 Windows vulnerabilities including 4 Critical RCEs — not because of any single model, but because of a 5-stage pipeline of 100+ specialized agents. The architecture lesson reframes how enterprise teams should think about agentic AI for production security work.
On May 12, Microsoft published a result that the security industry has been waiting for — and mostly dreading. Its new agentic security system, codenamed MDASH (Multi-model Agentic Scanning Harness), helped researchers find 16 vulnerabilities across the Windows networking and authentication stack, including four Critical remote code execution flaws, in components as deep as the kernel TCP/IP stack and the IKEv2 service. Two of those CVEs — one in tcpip.sys, one in ikeext.dll — are the kind of bugs that require cross-file reasoning across six source files, multi-step lifetime analysis, and concurrent subsystem modelling. They are bugs that professional red teams routinely miss on time-boxed engagements, and that the Windows kernel team had not found in years of internal review.
The headline is striking enough. But the actual lesson for enterprise AI teams sits in a single sentence from the blog post, buried halfway through a 17-minute read: “The harness does the work, and the model is one input.”
That is not a throwaway observation. It is the architectural thesis of the whole system — and it changes the frame for every conversation happening right now about which model to buy, which inference provider to contract, and which AI security tool will slot into your SOC next quarter.
What MDASH Actually Is — And What It Isn’t
MDASH is not a security LLM. It is not a fine-tuned model for vulnerability research. It is not a RAG system that retrieves past CVEs and matches patterns against new code. Every one of those framings would be wrong, and every one of them would lead you to ask the wrong question when evaluating similar systems.
MDASH is a staged orchestration system that runs more than 100 specialized agents across an ensemble of models, in a defined pipeline with distinct roles at each stage. The five stages — Prepare, Scan, Validate, Dedup, and Prove — are not sequential prompts in a chain. They are distinct agent cohorts, each with its own tooling, stop criteria, and model configuration. The Scan stage runs specialized auditor agents. The Validate stage runs a separate cohort of debaters — agents whose explicit job is to argue against each finding’s exploitability. Disagreement between auditors and debaters is treated as a signal: high-posterior findings are the ones the debater couldn’t tear down.
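To make the stage separation concrete, here is a minimal sketch of what a staged, multi-cohort harness could look like in code. The five stage names come from Microsoft's post; every class name, field, and the dispatch loop below are assumptions about one plausible shape for such a system, not MDASH's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative sketch only. The stage names come from the blog post; the class
# names, fields, and dispatch logic are assumptions about how a staged,
# multi-cohort harness could be wired, not MDASH's actual interfaces.

@dataclass
class Finding:
    component: str
    hypothesis: str
    survived_debate: bool = False
    proof: Optional[bytes] = None      # set only if a Prove-stage plugin triggers the sink

@dataclass
class Stage:
    name: str                          # prepare | scan | validate | dedup | prove
    model: str                         # the model is a configurable input per cohort
    agent_count: int
    run: Callable[[list], list]

def scan_cohort(findings: list) -> list:
    # Placeholder for a cohort of specialized auditor agents emitting hypotheses.
    return findings

def validate_cohort(findings: list) -> list:
    # Placeholder for debater agents that argue *against* each finding;
    # findings the debaters fail to tear down are the high-posterior ones.
    for f in findings:
        f.survived_debate = True
    return findings

pipeline = [
    Stage("prepare",  model="model-a", agent_count=5,  run=lambda fs: fs),
    Stage("scan",     model="model-a", agent_count=60, run=scan_cohort),
    Stage("validate", model="model-b", agent_count=20, run=validate_cohort),
    Stage("dedup",    model="model-b", agent_count=5,  run=lambda fs: fs),
    Stage("prove",    model="model-c", agent_count=10, run=lambda fs: fs),
]

def run_pipeline(findings: list) -> list:
    for stage in pipeline:
        findings = stage.run(findings)
    # Only findings that survived debate and carry an executable proof are filed.
    return [f for f in findings if f.survived_debate and f.proof is not None]
```

Swapping the model at any stage becomes a configuration change rather than a rebuild, which is the point the rest of the post keeps returning to.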
The Prove stage is where the architecture pays off most visibly. A finding that emerges from the Scan and Validate stages is still just a hypothesis until the Prove stage can construct and execute a triggering input. For CLFS.sys, the Prove stage included a custom plugin that understands on-disk container layout and block-validation sequences well enough to drive a candidate path to its sink. That plugin embeds domain knowledge the foundation models cannot infer from training data — Microsoft-internal filesystem invariants — and that separation of concerns is deliberate. The foundation models reason; the plugins carry facts the models can’t know.
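One plausible shape for that separation is a narrow plugin contract the Prove stage calls into: the model proposes a path to the sink, and the plugin turns it into a concrete input and runs it. The interface below is an assumed sketch, not Microsoft's API, with the CLFS-specific knowledge deliberately stubbed out, since those invariants are exactly the non-public facts a real plugin would carry.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

# Hypothetical interface sketch, not Microsoft's API. The point is the split:
# the reasoning model produces a path-to-sink hypothesis; the plugin supplies
# the domain facts (file formats, validation sequences) needed to act on it.

@dataclass
class ProofAttempt:
    triggered: bool                     # did the crafted input actually reach the sink?
    crash_signature: Optional[str]
    trigger_artifact: Optional[bytes]   # e.g. a crafted log container or packet sequence

class ProvePlugin(ABC):
    @abstractmethod
    def build_trigger(self, hypothesis: str) -> bytes:
        """Turn a model-produced hypothesis into a concrete candidate input."""

    @abstractmethod
    def execute(self, trigger: bytes) -> ProofAttempt:
        """Run the input against an instrumented target and report what happened."""

class ClfsContainerPlugin(ProvePlugin):
    # A real version would encode on-disk container layout and block-validation
    # sequences; stubbed here because those invariants are the whole point of
    # keeping domain knowledge out of the model and inside the plugin.
    def build_trigger(self, hypothesis: str) -> bytes:
        return b"\x00" * 512            # placeholder container image

    def execute(self, trigger: bytes) -> ProofAttempt:
        return ProofAttempt(triggered=False, crash_signature=None, trigger_artifact=None)
```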
This is the architectural pattern that is easy to miss when you focus on benchmark numbers: MDASH’s 88.45% on CyberGym — beating Anthropic Mythos at 83.1% and GPT-5.5 at 81.8% — is not a model capability story. It is a harness capability story. The models used are described as “generally available.” The performance gap comes from the system architecture, not from model secret sauce.
The Two Bugs MDASH Found That Single-Model Systems Couldn’t
The two deep-dive CVEs in the blog post are worth reading in full because they explain, in concrete terms, why single-model systems fail at this class of problem.
CVE-2026-33827 is a use-after-free in tcpip.sys, reachable from the network with no credentials. The vulnerability is not visible within a single function. The lifetime violation — a Path object reference released early, then reused — is separated from the problematic dereference by non-trivial control flow, alternate branches, and multiple validation checks. Without tracking reference ownership across these intermediate states, the bug looks like two independent operations. What makes it detectable by MDASH is cross-file reasoning: an analogous pattern exists elsewhere in the codebase, handled correctly, and the inconsistency between the two sites is the signal. Single-model systems, handed a function at a time, see no inconsistency. MDASH’s debater-auditor structure, which forces agents to compare patterns across files and argue for reachability under concurrent conditions, surfaces exactly that.
CVE-2026-33824, the IKEv2 double-free, spans six source files. The root cause is a shallow memcpy that clones struct bytes without cloning heap allocations the struct points to, leaving two owners of the same pointer. The triggering sequence is deterministic — two UDP packets, no race, no special timing. But seeing the bug requires knowing that the correct pattern exists in ike_D.c immediately after a similar memcpy, and that the missing step at the vulnerable site is only visible against that contrast. This is the kind of analysis that requires a system to hold multiple file contexts simultaneously, compare them, and reason about what’s absent, not just what’s present. That is not a prompt engineering problem. It is an orchestration architecture problem.
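To see why this is an orchestration problem rather than a prompting problem, consider what the harness has to assemble before any model call happens. The toy helper below, with hypothetical paths and prompt wording, illustrates the cross-file contrast: the orchestrator pairs the suspect site with the correctly handled analogue so the model can be asked about what is absent.

```python
from pathlib import Path

# Illustrative only: a toy version of the cross-file contrast described above.
# The orchestrator, not the model, is responsible for putting the vulnerable
# site and its correctly handled analogue into the same context window.
# The prompt wording and example paths are assumptions.

def build_contrast_prompt(suspect_path: str, analogue_path: str) -> str:
    suspect = Path(suspect_path).read_text()
    analogue = Path(analogue_path).read_text()
    return (
        "Both snippets clone a struct with a byte-wise copy.\n"
        "Snippet A (known-correct handling):\n"
        f"{analogue}\n\n"
        "Snippet B (candidate site):\n"
        f"{suspect}\n\n"
        "List every ownership-transfer or deep-copy step present in A but "
        "absent in B, and state which heap allocation ends up with two owners."
    )

# e.g. build_contrast_prompt("ike_suspect_site.c", "ike_D.c")  # paths hypothetical
# A single-pass scanner that sees only the suspect file has no "Snippet A"
# to contrast against, which is why this is an orchestration problem.
```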
What the Retrospective Numbers Actually Mean
Microsoft published two internal retrospective benchmarks alongside the CyberGym number. On pre-patch snapshots of clfs.sys, MDASH achieved 96% recall across 28 confirmed MSRC cases spanning five years. On tcpip.sys, it achieved 100% recall across 7 MSRC cases.
These numbers matter precisely because MSRC cases are not synthetic benchmarks. They are bugs that real attackers found, that required Patch Tuesdays, and that defenders had to react to in production. A system that recovers 96% of a five-year MSRC backlog in a heavily reviewed kernel component is not finding theoretical weaknesses — it is finding the bugs that broke real systems.
The right way to read these numbers is not “MDASH is perfect.” The blog post is admirably honest about this: these are retrospective recall benchmarks on internal code with a finite case count. What they tell you is that the system would have been useful. The forward-looking signal is the Patch Tuesday cohort itself — 16 new CVEs, patched and shipped.
For enterprise teams evaluating AI security tooling, these benchmarks establish a baseline question that most vendors are not yet able to answer: Can you show me retrospective recall on confirmed vulnerability ground truth in production codebases? If the answer is a leaderboard score on a public benchmark with known vulnerability descriptions, that is a very different claim.
The SuperML Take
The thing that enterprise security and platform teams should absorb here is not that Microsoft built an impressive bug-finding tool. Of course they did; they have some of the best kernel security researchers on the planet and the resources to build whatever they want. The thing to absorb is the architectural argument they are making in public: the durable advantage is the harness, not the model.
This argument has teeth precisely because it is self-undermining for Microsoft to make it. MDASH uses Azure AI infrastructure and Microsoft’s own models, among others. Microsoft has every incentive to say “our models are better, use Azure OpenAI.” Instead, the VP of Agentic Security is publishing a blog post that explicitly says the system uses “generally available models” and that the performance comes from the surrounding architecture. That is either a very honest assessment or a very clever positioning move — probably both.
For practitioners, the argument maps directly to the agentic AI deployment decisions most enterprise teams are wrestling with right now. Every organization buying AI security tooling is effectively being asked to bet on a model provider. The MDASH architecture says that bet is structurally less important than the orchestration layer, the validation pipeline, the plugin extensibility, and the stage separation. When a new model arrives — and they arrive every six months — a harness-first architecture absorbs the improvement with a configuration change. A model-first architecture requires a system rebuild.
This is also the argument for building internal agentic capability rather than consuming external point products. The MDASH pipeline — prepare, scan, validate, dedup, prove — is a template, not a trade secret. The specific implementations are proprietary, but the structure is published. An enterprise security team that builds even a modest version of staged multi-agent validation — separate auditors from debaters, require prove-stage evidence before filing — will generate higher-quality findings than a team that asks a single model to “find the bugs” in one pass.
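A minimal sketch of that “modest version” might look like the following, assuming a generic chat-completion client your stack already provides; the prompts, the STANDS/REBUTTED convention, and the filing policy are all assumptions, not MDASH's prompts.

```python
from typing import Optional

# Minimal sketch of the "modest version" described above, assuming a generic
# chat-completion client complete(model, prompt) -> str that your stack provides.
# Prompts, the STANDS/REBUTTED convention, and the filing policy are assumptions.

def complete(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your own model client")

def audit(model: str, code: str) -> list:
    # Auditor pass: propose candidate findings, one per line.
    response = complete(model, f"List memory-safety findings with file and line:\n{code}")
    return [line for line in response.splitlines() if line.strip()]

def debate(model: str, finding: str, code: str) -> bool:
    # Debater pass: a separate agent whose explicit job is to tear the finding down.
    verdict = complete(
        model,
        f"Argue that this finding is NOT reachable or NOT exploitable:\n{finding}\n\n"
        f"Code:\n{code}\n\nEnd with exactly REBUTTED or STANDS.",
    )
    return verdict.strip().endswith("STANDS")

def file_finding(finding: str, proof_artifact: Optional[bytes]) -> None:
    # Filing gate: no ticket without debate survival *and* prove-stage evidence.
    if proof_artifact is None:
        return  # park for another prove attempt instead of filing noise
    print(f"FILE: {finding} ({len(proof_artifact)}-byte reproducer attached)")
```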
The harder truth is that most enterprise security teams don’t have the talent pipeline to build this. Microsoft’s Autonomous Code Security team is half ex-DARPA AIxCC champions. That is not a talent profile most organizations can replicate. Which means the actual production question is: when MDASH opens its preview more broadly, what does the integration architecture look like? How does it fit into an existing AppSec pipeline? What does the handoff between the Prove stage and the engineering triage process look like? Those are the questions worth asking, not which benchmark score it posted.
Architecture Impact
What changes in system design? MDASH establishes a five-stage pipeline pattern — Prepare, Scan, Validate, Dedup, Prove — as the production architecture for agentic vulnerability discovery, replacing single-model or single-pass scanning approaches. Enterprise AppSec teams integrating AI-assisted code review will need to rethink their pipeline stages, introduce explicit debater-auditor separation, and add domain-specific prove plugins for each codebase category. The model becomes a configurable input rather than the fixed foundation of the system.
What new failure mode appears? Plugin-model coupling is the new failure mode to watch. If a prove plugin is calibrated to a specific model’s output format or reasoning pattern, a model upgrade can silently depress prove-stage completion without surfacing as an error — findings clear the Validate stage but never get an executable proof, Prove-stage recall drops, and the system looks healthy from a pipeline-execution standpoint while actually degrading. This is the harness equivalent of a silent data drift failure in production ML systems.
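The obvious mitigation, borrowed from production ML monitoring, is to treat stage-transition rates as first-class metrics rather than relying on pipeline exit codes. A rough sketch, with illustrative thresholds and metric names:

```python
from collections import Counter

# Rough sketch: alert when the fraction of validated findings that reach an
# executable proof falls well below its historical baseline. Thresholds and
# metric names are illustrative assumptions.

def prove_completion_rate(stage_counts: Counter) -> float:
    validated = stage_counts.get("validated", 0)
    proven = stage_counts.get("proven", 0)
    return proven / validated if validated else 1.0

def degraded(current: Counter, baseline_rate: float, tolerance: float = 0.5) -> bool:
    rate = prove_completion_rate(current)
    if rate < baseline_rate * tolerance:
        # Same posture as data-drift alerting: the pipeline "succeeded",
        # but its output distribution moved after a model or plugin change.
        print(f"ALERT: prove completion {rate:.0%} vs baseline {baseline_rate:.0%}; "
              "check plugin/model compatibility since the last model upgrade")
        return True
    return False

# Example: 40% of validated findings historically got proofs; this run only 10% did.
degraded(Counter(validated=50, proven=5), baseline_rate=0.40)
```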
What enterprise teams should evaluate:
- AppSec and Vulnerability Management teams: Evaluate whether your current AI-assisted scanning tools separate auditor and debater (validator) agent roles, or collapse them into a single-pass prompt — that separation is the structural differentiator.
- Platform/MLOps teams: Assess plugin extensibility requirements before vendor selection — a harness that can’t accept domain-specific prove plugins for your codebase categories (kernel, protocol, codec) will cap recall at the model’s general-knowledge ceiling.
- CISO and Security Architecture teams: Model provider contracts and benchmark claims should be evaluated against retrospective recall on ground-truth vulnerability corpora, not just public leaderboard scores on benchmark tasks with known vulnerability descriptions.
Cost / latency / governance / reliability implications: Running 100+ specialized agents through a five-stage pipeline is not cheap. Microsoft doesn’t publish per-scan costs, but the CyberGym evaluation methodology gives a signal — a single benchmark run on a comparable agentic system costs in the range of $2,000–$9,000 depending on context length and model tier. For enterprise codebases scanned continuously, per-repo monthly costs could reach five figures before model-level cost optimization, making the harness-first architecture also a cost-governance problem. Teams should budget pipeline instrumentation from day one to track per-stage token consumption, since the Scan stage (high-volume, many agents) and the Prove stage (long-context, domain plugin overhead) will have very different cost profiles.
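Concretely, that instrumentation can be as simple as a per-stage, per-model token meter feeding a cost rollup. The sketch below assumes each agent call reports its token counts; the price table is a placeholder, not any vendor's actual pricing.

```python
from collections import defaultdict

# Day-one pipeline instrumentation sketch: per-stage token accounting so Scan
# (many agents, shorter contexts) and Prove (fewer agents, long contexts plus
# plugin overhead) can be budgeted separately. Prices are placeholders.

PRICE_PER_1K_TOKENS = {"model-a": 0.005, "model-b": 0.015, "model-c": 0.060}

class StageMeter:
    def __init__(self):
        self.tokens = defaultdict(int)      # (stage, model) -> total tokens

    def record(self, stage: str, model: str, prompt_tokens: int, completion_tokens: int):
        self.tokens[(stage, model)] += prompt_tokens + completion_tokens

    def cost_by_stage(self) -> dict:
        out = defaultdict(float)
        for (stage, model), total in self.tokens.items():
            out[stage] += total / 1000 * PRICE_PER_1K_TOKENS[model]
        return dict(out)

meter = StageMeter()
meter.record("scan", "model-a", prompt_tokens=6_000, completion_tokens=800)
meter.record("prove", "model-c", prompt_tokens=120_000, completion_tokens=4_000)
print(meter.cost_by_stage())   # {'scan': 0.034, 'prove': 7.44}
```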
What to Watch
Google I/O opens May 19–20 with Gemini 4 and agent developer tooling on the agenda — it’s likely we’ll see competing claims about agentic security or code-analysis capabilities announced within the next week. The more interesting signal will be whether any of those announcements include retrospective recall benchmarks on real vulnerability corpora, or whether the CyberGym leaderboard remains the only public apples-to-apples comparison.
The MDASH private preview is accepting sign-ups. Early enterprise access will determine whether the pipeline integrates into existing AppSec workflows (Jira, GitHub Security Advisories, MSRC-equivalent triage systems) or requires a separate interface. That integration story matters more than the benchmark for most production deployments.
The deeper question the industry hasn’t answered yet: if agentic systems can find 96% of a five-year MSRC backlog retrospectively, what is the right cadence for continuous scanning in a DevSecOps pipeline? Daily? Per-commit? The latency and cost profile of a 100-agent pipeline may force weekly or release-gated scans rather than continuous integration — which would change the threat model for what attackers can exploit between scan windows.
Sources
- Defense at AI speed: Microsoft’s new multi-model agentic security system tops leading industry benchmark — Microsoft Security Blog, May 12, 2026
- Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark — GeekWire, May 2026
- Microsoft MDASH finds Windows security flaws with AI — EdTech Innovation Hub
- Agentic AI deployment enters production reality — SiliconANGLE, May 11, 2026
- Defense in depth for autonomous AI agents — Microsoft Security Blog, May 14, 2026
- 92% of Security Pros Concerned About AI Agents — Darktrace, 2026
Want more enterprise AI architecture breakdowns?
Subscribe to SuperML.