
feat(core): allow guardian prompt overrides from model metadata #13915

Open
charley-oai wants to merge 1 commit into main from guardian-prompt-override

@charley-oai

Summary

  • add a guardian-specific developer-instructions field to model metadata
  • teach core guardian prompt assembly to prefer the selected guardian model's override while keeping the JSON contract appended in code
  • update affected test fixtures and add coverage for the override path

Testing

  • cargo test -p codex-protocol openai_models::tests::model_info_defaults_availability_nux_to_none_when_omitted
  • cargo test -p codex-core guardian_subagent_config
  • cargo test -p codex-api models_client_hits_models_endpoint
  • cargo test -p codex-app-server --test all get_auth_status_no_auth

@charley-oai

@codex review this

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@charley-oai charley-oai force-pushed the guardian-assessment-ui branch from 05060d9 to 04f5f35 Compare March 8, 2026 00:19
@charley-oai charley-oai force-pushed the guardian-prompt-override branch from a759c69 to 7c09e3b Compare March 8, 2026 00:19
Co-authored-by: Codex <noreply@openai.com>
@nidhishgajjar

Orb Code Review (powered by GLM-4.7 on Orb Cloud)

Summary

This PR introduces the ability to override guardian prompts from model metadata, allowing different models to have custom guardian instructions instead of using a single hardcoded prompt. The implementation adds a new guardian_developer_instructions field to ModelInfo and modifies the guardian prompt assembly to prefer model-specific overrides when available.

Architecture

New components:

  1. Model metadata field:
pub struct ModelInfo {
    // ... existing fields
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub guardian_developer_instructions: Option<String>,
    // ... more fields
}
  2. Guardian prompt assembly with override:
fn guardian_policy_prompt(prompt_override: Option<&str>) -> String {
    let prompt = prompt_override
        .unwrap_or(include_str!("guardian_prompt.md"))
        .trim_end();
    format!("{prompt}\n\n{}", guardian_output_contract_prompt())
}
  3. Retrieval from model metadata:
let guardian_model_info = session
    .services
    .models_manager
    .get_model_info(&guardian_model, turn.config.as_ref())
    .await;

let guardian_config = build_guardian_subagent_config(
    // ... other params
    guardian_model_info
        .guardian_developer_instructions
        .as_deref(),
)?;

Analysis

Correctness ✓

The feature implementation:

  • Adds optional field to model metadata
  • Retrieves override from model when available
  • Falls back to default prompt when not specified
  • Preserves JSON contract appending in code

Override logic:

let prompt = prompt_override
    .unwrap_or(include_str!("guardian_prompt.md"))
    .trim_end();

This correctly implements preference for model-specific prompts while maintaining backward compatibility.
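
The fallback behaviour can be tried in isolation. The following is a minimal sketch, not the PR's code: `DEFAULT_GUARDIAN_PROMPT` and `OUTPUT_CONTRACT` are hypothetical stand-ins for `include_str!("guardian_prompt.md")` and `guardian_output_contract_prompt()`, whose real contents are not shown in this review.

```rust
// Stand-ins for the real prompt file and output contract. Both strings are
// hypothetical placeholders, not the text shipped in the PR.
const DEFAULT_GUARDIAN_PROMPT: &str = "default guardian prompt\n";
const OUTPUT_CONTRACT: &str =
    "Respond with JSON: {\"risk_level\": \"low\" | \"medium\" | \"high\"}";

/// Mirrors the shape of the quoted `guardian_policy_prompt`: prefer the
/// override, fall back to the default, and always append the contract.
fn guardian_policy_prompt(prompt_override: Option<&str>) -> String {
    let prompt = prompt_override
        .unwrap_or(DEFAULT_GUARDIAN_PROMPT)
        .trim_end();
    format!("{prompt}\n\n{}", OUTPUT_CONTRACT)
}
```

With this shape, `Some("")` still counts as an override, so an empty string in the metadata would yield a prompt containing only the contract. Whether that case should fall back to the default instead is a design decision the quoted snippet leaves open.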

Code Quality ✓

Minimal and focused changes:

  • Single new field addition to ModelInfo
  • Modified guardian prompt assembly function
  • Updated guardian subagent config building
  • Comprehensive test coverage

Good use of Rust idioms:

  • Uses Option<&str> for optional overrides
  • as_deref() for safe option handling
  • #[serde(default)] for backward compatibility
  • Test fixtures properly updated
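
To illustrate the `as_deref()` point, here is a small sketch using a trimmed-down, hypothetical stand-in for `ModelInfo` that keeps only the new field. The helper name `guardian_override` is an assumption for illustration and does not appear in the PR.

```rust
/// Hypothetical, trimmed-down stand-in for `ModelInfo`; only the new field
/// is kept. `Default` gives `None`, matching an omitted metadata field.
#[derive(Default)]
struct ModelInfo {
    guardian_developer_instructions: Option<String>,
}

/// Borrows the override as `Option<&str>` without cloning the `String`.
/// `as_deref()` maps `Option<String>` to `Option<&str>` through `Deref`.
fn guardian_override(info: &ModelInfo) -> Option<&str> {
    info.guardian_developer_instructions.as_deref()
}
```

Passing `Option<&str>` rather than `&Option<String>` keeps the prompt-assembly function independent of how the caller stores the value, which is likely why the PR's signature takes that type.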

Testing ✓

Comprehensive test coverage:

  1. New functionality test:
#[test]
fn guardian_subagent_config_prefers_model_prompt_override() {
    let guardian_config = build_guardian_subagent_config(
        &test_config(),
        None,
        "active-model",
        None,
        Some("override prompt"),
    )
    .expect("guardian config");

    let instructions = guardian_config
        .developer_instructions
        .expect("guardian instructions");

    assert!(instructions.starts_with("override prompt"));
    assert!(instructions.contains("\"risk_level\": \"low\" | \"medium\" | \"high\""));
}
  2. Updated all existing fixtures:
  • 11 test files updated to include guardian_developer_instructions: None
  • Ensures backward compatibility
  • Maintains existing test behavior

Backward Compatibility ✓

Preserves existing behavior:

  • Field is optional with default None
  • When not specified, uses existing hardcoded prompt
  • All existing tests pass with None value
  • JSON contract is still appended

Why Option is correct:

  • New field is optional for existing models
  • Existing models don't need this field
  • Gradual migration path for model authors

Security ⚠️

Prompt injection concerns:

  • Model metadata is typically controlled by the platform
  • Guardian prompts are critical security controls
  • Consider validating guardian prompts for injection attempts
  • JSON contract appending provides some safety
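
The PR as described contains no validation step. As a rough sketch of what a basic check could look like, the function below rejects blank or oversized overrides. Every name here (`validate_guardian_override`, `OverrideError`, `MAX_OVERRIDE_CHARS`) and the length limit itself are assumptions made for illustration.

```rust
/// Hypothetical upper bound on override length; the PR defines no such limit.
const MAX_OVERRIDE_CHARS: usize = 16_000;

/// Hypothetical error type for rejected overrides.
#[derive(Debug, PartialEq)]
enum OverrideError {
    Empty,
    TooLong { chars: usize },
}

/// Returns the trimmed override if it passes the illustrative checks.
fn validate_guardian_override(raw: &str) -> Result<&str, OverrideError> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        return Err(OverrideError::Empty);
    }
    let chars = trimmed.chars().count();
    if chars > MAX_OVERRIDE_CHARS {
        return Err(OverrideError::TooLong { chars });
    }
    Ok(trimmed)
}
```

Checks like these only catch malformed values. They do not detect an override that is well-formed but instructs the guardian to be permissive, so they would complement, not replace, control over who can publish model metadata.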

Cross-file Impact

Moderate impact:

  • Modified files: 12 (protocol, core, tests)
  • Added field to ModelInfo struct
  • Modified guardian prompt assembly
  • Updated all test fixtures
  • No breaking changes to existing APIs

Assessment

Approve - This is a well-implemented feature that provides model flexibility:

Pros:

  • Enables model-specific guardian prompts
  • Clean, minimal implementation
  • Good backward compatibility
  • Comprehensive test coverage
  • Follows existing code patterns
  • Preserves JSON contract

⚠️ Considerations:

  • Security: Model-specific guardian prompts could weaken security if they are not validated
  • Complexity: Adding per-model configuration increases system complexity
  • Documentation: Need to ensure model authors understand the security implications

Recommendations:

  1. Security validation: Consider adding validation for guardian prompt overrides to ensure they maintain security contracts
  2. Documentation: Document the security implications and best practices for model-specific guardian prompts
  3. Monitoring: Add logging or metrics when custom guardian prompts are used
  4. Audit trail: Consider keeping track of which guardian prompts are used for auditing purposes
  5. User visibility: Consider showing users when a custom guardian prompt is being used
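
One way to support the monitoring and audit recommendations is to have prompt selection report where the prompt came from. This is a sketch under assumptions: `PromptSource` and `select_guardian_prompt` are hypothetical names, and the PR does not include anything like them.

```rust
/// Hypothetical tag recording which prompt was selected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PromptSource {
    Default,
    ModelOverride,
}

/// Picks the prompt and reports its source so the caller can log or count it.
fn select_guardian_prompt<'a>(
    prompt_override: Option<&'a str>,
    default_prompt: &'a str,
) -> (&'a str, PromptSource) {
    match prompt_override {
        Some(prompt) => (prompt, PromptSource::ModelOverride),
        None => (default_prompt, PromptSource::Default),
    }
}
```

A caller could then emit a log line or increment a metric whenever the returned source is `PromptSource::ModelOverride`, and surface the same value in the UI if user visibility is wanted.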

Verdict: This is a well-implemented feature that provides necessary flexibility for model-specific guardian prompts. The implementation is clean, maintains backward compatibility, and has good test coverage. The main concern is ensuring that custom guardian prompts maintain appropriate security boundaries. It should be merged on the understanding that validation and monitoring for custom prompts will follow.
