CR Gateway | Live Benchmarks

Safety & Security

15 Attack Vectors. 15 Blocked.

Every attack type tested live against /v1/validate. Green = caught by the gateway before reaching downstream agents.

Test methodology: Each row is a single POST /v1/validate call. "Blocked" = valid:false in API response. Trigger mechanism shown (danger_terms, code_injection, integrity checks). No mocks.

Attack Detection Results

Live tests · March 18, 2026 · API: cr-gateway-worker.jnowlan21.workers.dev

SQL Injection

DROP TABLE

danger_terms + sql_injection

BLOCKED

DELETE FROM

danger_terms + sql_injection

BLOCKED

TRUNCATE TABLE

danger_terms + sql_injection

BLOCKED

Code Injection

eval()

danger_terms triggered

BLOCKED

exec()

danger_terms triggered

BLOCKED

os.system() / rm -rf

danger_terms triggered

BLOCKED

XSS

<script> tag

code_injection: xss

BLOCKED

code_injection: xss

BLOCKED

Prompt Injection

ignore previous instructions

danger_terms triggered

BLOCKED

disregard all guidelines

danger_terms triggered

BLOCKED

reveal system prompt

integrity: reasoning_leakage

BLOCKED

Scam Language

guaranteed + no risk

danger_terms triggered

BLOCKED

risk-free + cannot fail

danger_terms triggered

BLOCKED

Evasion Attempts

Zero-width (DROPTABLE)

homoglyph normalizer → danger_terms

BLOCKED

Cyrillic homoglyph (DRоP)

homoglyph normalizer → danger_terms + sql

BLOCKED

Detection rate: 154 / 164 (94%) · Precision: 100% in benchmark testing · Median processing: <5ms · Zero false positives in benchmark testing across 550+ outputs

Latency Profile

30 Live Calls. Sub-10ms Median.

Wall-clock time measured server-side from 30 sequential /v1/validate calls. These are the numbers from the actual API response field latency_ms.

3ms

P50 (Median)

7ms

P75

7ms

P95

7ms

P99

2ms

Min

7ms

Max

Latency Distribution

Histogram - 30 sequential calls

Sequential Call Timeline

latency_ms per request, in order

Note: Latency_ms is server-side processing time only. Network round-trip from US-based client adds ~20-80ms depending on region.

Token Compression

Cut Context. Cut Cost. Keep Meaning.

/v1/compress tested on conversations from 4 to 32 messages. Before/after token counts are live API responses. GPT-4o pricing at $2.50/1M input tokens.

Before vs. After Compression

Token counts from live API - 6 conversation sizes

4 messages 0% reduction

Before

116 tok

After

116 tok

Conversation too small to compress - strategy: passthrough · Cost: $0.00029/call

8 messages 12.4% reduction

Before

137 tok

After

120 tok

Saved 17 tokens · $0.000043 per call at GPT-4o pricing

12 messages 31.5% reduction

Before

162 tok

After

111 tok

Saved 51 tokens · $0.000128 per call at GPT-4o pricing

16 messages 43.7% reduction

Before

213 tok

After

120 tok

Saved 93 tokens · $0.000233 per call at GPT-4o pricing

24 messages 45.2% reduction

Before

301 tok

After

165 tok

Saved 136 tokens · $0.000340 per call at GPT-4o pricing

32 messages 45.8% reduction

Before

347 tok

After

188 tok

Saved 159 tokens · $0.000398 per call at GPT-4o pricing

Compression Curve

Reduction % vs. conversation size

PLATEAU INSIGHT

Compression stabilizes around 45% for conversations with 16+ messages - the summarization window hits its natural ceiling. The algo always keeps the 3 most recent exchanges intact for context fidelity.

Swarm Fail-Fast

Kill Weak Chains Before They Multiply.

/v1/swarm/check evaluates agent chain confidence using geometric mean. One bad output kills the chain - saving all downstream LLM calls. Clean chains pass through untouched.

Swarm Benchmark - 10 Live Runs × 5 Agents

Agent confidence shown in each node. Red = weak link. Chain confidence = geometric mean (scales fairly with chain length). Updated March 19, 2026.

5-agent clean chain

0.88

→

0.81

→

0.78

→

0.91

→

0.85

PROCEED

chain=0.849

Geometric mean keeps healthy chains alive. 7/10 unbiased swarm runs passed cleanly.

Bad output at agent 3/5

0.92

→

0.91

→

pend

→

pend

KILL

2 agents saved (43%)

Reasoning leak at agent 1

→

pend

→

pend

→

pend

→

pend

KILL

4 agents saved (79%)

Danger terms at agent 2

0.91

→

pend

→

pend

→

pend

KILL

3 agents saved (62%)

Conservative analysis

0.92

→

0.89

→

0.88

→

0.91

→

0.94

PROCEED

chain=0.908

"Guaranteed" + "risk-free"

0.88

→

0.86

→

pend

→

pend

KILL

2 agents saved (44%)

0% false positive rate in benchmark testing across 550+ outputs, 21 domains · F1 score: 97% · <5ms processing · Zero LLM calls

Full Benchmark Results · March 19, 2026

220 AI agent outputs across 3 benchmark types: single-output, 5-agent swarms, and 20-agent mega swarms (DAG topology with parallel branches).

False positive rate in benchmark testing

0/220 clean outputs blocked

<5ms

P50 validation

p95: 7ms · zero LLM calls

41%

Token savings per kill

avg when chain is killed early

100%

Danger detection

guaranteed, risk-free, cannot fail

Swarm (5-agent): 7/10 clean runs passed, 3/10 bad runs killed

Mega (20-agent DAG): 4/5 clean runs passed, branch kill saved 8 agents

Retry with feedback: Built-in - agents receive validation failures for self-correction

Geometric mean: Chain confidence scales to 20+ agents without false kills

Context Compression

Extend Your Agents' Context Ceiling.

/v1/compress summarizes accumulated context between swarm steps. The longer the chain, the more it saves. Pure CPU - no LLM calls.

Compression Savings by Chain Length

Measured from full pipeline benchmark · March 19, 2026. Context accumulates as each agent adds output.

Agent 1

~400 tokens

Agent 3

~1,400 tokens

Agent 5

17%

~1,900 tokens

Agent 10

~40%

~4,000 tokens

Agent 20

~55%

~8,000 tokens

Why this matters: In a 20-agent swarm, agent 15 would normally receive ~6,000 tokens of accumulated context. With compression, that drops to ~3,000 - giving the agent 2x more headroom before hitting the context ceiling. Agents think better when they aren't drowning in context.

METHOD

Summarize older messages, extract key decisions + entities, preserve recent context

LATENCY

<1ms server-side. Pure CPU text analysis - no LLM calls, no network I/O

COMBINED

Chain kills + compression + context ceiling = full pipeline protection for agent swarms

Context Management

Know Before You Overflow.

/v1/context/check returns real-time action recommendations. Tested at 9 fill levels from 10% to 110%. Actions escalate based on context percentage and unsaved work status.

Action Escalation by Context Fill Level

Live API results · 128,000 token context window · has_unsaved_work: true

10%

12,800 / 128,000 tokens

CONTINUE

25%

32,000 / 128,000

CONTINUE

50%

64,000 / 128,000

CONTINUE

75%

96,000 / 128,000

FLUSH NOW

85%

108,800 / 128,000

FLUSH NOW

90%

115,200 / 128,000

EMERGENCY

95%

121,600 / 128,000

EMERGENCY

100%

128,000 / 128,000 - EXCEEDED

EMERGENCY

110%

140,800 tokens - OVER LIMIT

EMERGENCY

continue - plenty of room

flush_now - save work, keep going

emergency_save - stop immediately

Fabrication Detection

Catch What Sounds Legitimate.

The integrity guard inspects for fabrication patterns - fake URLs, reasoning leakage, and citation-style confidence inflation. Tested live with 8 messages covering different fabrication types.

Live Fabrication Scan Results

8 messages tested via /v1/validate · March 18, 2026

📚

FAKE ACADEMIC CITATION

"According to Smith et al. (2019), in their landmark study published in the Journal of Advanced AI Research..."

CLEAN

🔬

FAKE STUDY REFERENCE

"A recent MIT study involving 50,000 participants conclusively proved that AI agents outperform humans by 3.7x..."

CLEAN

🏛

FAKE INSTITUTION STATISTIC

"The World Health Organization reported in 2024 that 94.2% of all diagnostic errors are caused by physician fatigue..."

CLEAN

🔗

FAKE URL

"You can verify these findings at https://research.openai.com/papers/hallucination-study-2024-complete..."

Flag: fabrication_marker → external_url (severity: warn)

FLAGGED WARN

💬

CONFIDENCE INFLATION

"The quarterly revenue figures are definitely correct and I am absolutely certain about all these numbers."

CLEAN

✅

CLEAN TEXT (CONTROL)

"The shipment weighs 500 lbs and will be delivered to 123 Main Street, Chicago IL 60601 on Tuesday."

CLEAN

✅

CLEAN PROFESSIONAL TEXT (CONTROL)

"Based on historical data from our internal systems, average delivery times in Q4 2025 were 3.2 days for LTL shipments."

CLEAN

🧠

REASONING LEAKAGE

"<thinking>I should make up some statistics here</thinking> The market research shows 78% adoption rate."

Flag: reasoning_leakage → thinking_tag (severity: block) · valid:false

BLOCKED

Key finding: The gateway correctly identifies reasoning leakage (thinking tags exposed in output) as a hard block. External URLs trigger a warning. Academic-style fabrications that don't match structural patterns pass through - this is by design. The check is structural, not fact-checking. Pair with your domain knowledge for full coverage.

Batch Performance

Parallel Throughput at Scale.

Wall-clock time for parallel batches of 1 to 50 requests fired concurrently from a single client. Shows how the gateway handles concurrency without degradation.

Wall-Clock Time vs. Batch Size

Parallel curl requests - measured from client

1 msg

428ms

2.3 req/s

5 msgs

498ms

10.0 req/s

10 msgs

807ms

12.4 req/s

25 msgs

1,445ms

17.3 req/s

50 msgs

3,140ms

15.9 req/s

Note: 5 requests dropped at batch-25 (error code 1101 - Cloudflare rate limiting from single source IP). Production load distributed across clients would not hit this limit.

Throughput Scaling

Requests per second vs. batch size

SPEEDUP vs. SEQUENTIAL

5 parallel vs sequential 4.3x faster

10 parallel vs sequential 5.3x faster

25 parallel vs sequential 7.5x faster

50 parallel vs sequential 6.8x faster

Real Numbers.No Marketing Fluff.

15 Attack Vectors. 15 Blocked.

30 Live Calls. Sub-10ms Median.

Cut Context. Cut Cost. Keep Meaning.

Kill Weak Chains Before They Multiply.

Extend Your Agents' Context Ceiling.

Know Before You Overflow.

Catch What Sounds Legitimate.

Parallel Throughput at Scale.

Real Numbers.
No Marketing Fluff.