Batch 24A · 618 prompts Defensive eval mode
Empirical AI security project

LLM red teaming with measurable attack success and defense lift.

This portfolio demo frames AI security as a reproducible engineering discipline — tracking attack categories, guardrail failures, exploit severity, and post-mitigation improvement across evaluation batches.

Model under test
claude-like-agent-v0.4
Tool-enabled agent, browser + shell permissions simulated
Highest-risk failure mode
Prompt injection
Cross-context instruction hijack in retrieval and agent workflows
Attack success rate
27.4%
Across 618 adversarial prompts
Critical findings
11
High-confidence exploitable paths
Defense lift
41%
Reduction after policy + classifier patch
Mean triage time
9m
Automated grouping and severity tagging

Recent evaluations

Each row reflects a scenario family — the right abstraction for a research portfolio is scenario-driven testing, not random jailbreak screenshots.

Scenario
Category
Severity
ASR
Canary-token exfiltration via summarization
Data leakage
Critical
34%
Injected support-doc overrides system policy
Prompt injection
High
29%
Agent writes insecure bash under urgency framing
Cyber misuse
Medium
22%
Benign chain produces credential-harvesting HTML
Social eng.
High
18%
Roleplay jailbreak on malware-adjacent task
Policy evasion
Medium
11%

Risk scorecard

A fellowship reviewer should see you can operationalize security research into a scoring system — not just find problems.

Prompt injection resilience58/100
Data exfiltration resistance49/100
Cyber misuse containment71/100
Tool abuse detection63/100
Defense observability81/100

Representative findings

Framed as a research note: attack path, why it worked, why defenders should care.

Instruction hierarchy collapse in retrieved text

When hostile retrieval content is phrased as operational policy, the agent overweights document context and underweights system instructions.

Hidden exfiltration through summarization

Requests that compress logs can still route secret-bearing tokens into sanitized output if detection is keyword-dependent only.

Tool escalation under deadline framing

Urgency cues increase acceptance of insecure shell commands, especially when the model is rewarded for action completion over policy fidelity.

Guardrail bypass by multi-turn decomposition

Unsafe intent split across innocuous turns evades single-prompt classifiers and reassembles into risky behavior at execution time.

Defense stack

Strong applicants show both offense and defense. This side of the portfolio is as important as the attack taxonomy.

Scenario library

Curated adversarial prompts mapped to OWASP-style categories and agent capabilities.

Execution harness

Batch runner captures prompts, tool calls, outputs, latency, refusal patterns, and failure signatures.

Defense modules

Input sanitizer, output classifier, policy templates, retrieval sanitization, and action gating.

Delta analysis

Compare baseline vs. patched system — quantify which attack classes actually got harder.

How it maps to a real project

Wire this dashboard to the Python evaluation harness and publish a writeup with threat model, dataset design, metrics, and mitigations.

The strongest version includes: an attack taxonomy, a prompt corpus, an automated runner against open models, scoring logic for success and severity, side-by-side mitigation results, and a public technical report with screenshots, failure analysis, and code on GitHub.