Empirical AI security project

LLM red teaming with measurable attack success and defense lift.

This portfolio demo frames AI security as a reproducible engineering discipline — tracking attack categories, guardrail failures, exploit severity, and post-mitigation improvement across evaluation batches.

Inspect evaluations Read methodology

Model under test

claude-like-agent-v0.4

Tool-enabled agent, browser + shell permissions simulated

Highest-risk failure mode

Prompt injection

Cross-context instruction hijack in retrieval and agent workflows

Attack success rate

27.4%

Across 618 adversarial prompts

Critical findings

High-confidence exploitable paths

Defense lift

41%

Reduction after policy + classifier patch

Mean triage time

Automated grouping and severity tagging

Recent evaluations

Each row reflects a scenario family — the right abstraction for a research portfolio is scenario-driven testing, not random jailbreak screenshots.

Scenario

Risk scorecard

A fellowship reviewer should see you can operationalize security research into a scoring system — not just find problems.

Prompt injection resilience58/100

Data exfiltration resistance49/100

Cyber misuse containment71/100

Tool abuse detection63/100

Defense observability81/100

Representative findings

Framed as a research note: attack path, why it worked, why defenders should care.

Instruction hierarchy collapse in retrieved text

When hostile retrieval content is phrased as operational policy, the agent overweights document context and underweights system instructions.

Hidden exfiltration through summarization

Requests that compress logs can still route secret-bearing tokens into sanitized output if detection is keyword-dependent only.

Tool escalation under deadline framing

Urgency cues increase acceptance of insecure shell commands, especially when the model is rewarded for action completion over policy fidelity.

Guardrail bypass by multi-turn decomposition

Unsafe intent split across innocuous turns evades single-prompt classifiers and reassembles into risky behavior at execution time.

Defense stack

Strong applicants show both offense and defense. This side of the portfolio is as important as the attack taxonomy.

Scenario library

Curated adversarial prompts mapped to OWASP-style categories and agent capabilities.

Execution harness

Batch runner captures prompts, tool calls, outputs, latency, refusal patterns, and failure signatures.

Defense modules

Input sanitizer, output classifier, policy templates, retrieval sanitization, and action gating.

Delta analysis

Compare baseline vs. patched system — quantify which attack classes actually got harder.

How it maps to a real project

Wire this dashboard to the Python evaluation harness and publish a writeup with threat model, dataset design, metrics, and mitigations.

The strongest version includes: an attack taxonomy, a prompt corpus, an automated runner against open models, scoring logic for success and severity, side-by-side mitigation results, and a public technical report with screenshots, failure analysis, and code on GitHub.