Instruction hierarchy collapse in retrieved text
When hostile retrieval content is phrased as operational policy, the agent overweights document context and underweights system instructions.
This portfolio demo frames AI security as a reproducible engineering discipline — tracking attack categories, guardrail failures, exploit severity, and post-mitigation improvement across evaluation batches.
Each row reflects a scenario family — the right abstraction for a research portfolio is scenario-driven testing, not random jailbreak screenshots.
A fellowship reviewer should see you can operationalize security research into a scoring system — not just find problems.
Framed as a research note: attack path, why it worked, why defenders should care.
When hostile retrieval content is phrased as operational policy, the agent overweights document context and underweights system instructions.
Requests that compress logs can still route secret-bearing tokens into sanitized output if detection is keyword-dependent only.
Urgency cues increase acceptance of insecure shell commands, especially when the model is rewarded for action completion over policy fidelity.
Unsafe intent split across innocuous turns evades single-prompt classifiers and reassembles into risky behavior at execution time.
Strong applicants show both offense and defense. This side of the portfolio is as important as the attack taxonomy.
Curated adversarial prompts mapped to OWASP-style categories and agent capabilities.
Batch runner captures prompts, tool calls, outputs, latency, refusal patterns, and failure signatures.
Input sanitizer, output classifier, policy templates, retrieval sanitization, and action gating.
Compare baseline vs. patched system — quantify which attack classes actually got harder.
Wire this dashboard to the Python evaluation harness and publish a writeup with threat model, dataset design, metrics, and mitigations.