
MCP Agent Action Guard

Safe AI Agents through Action Classifier

Classifying AI agent actions to ensure safety and reliability

Safe actions for safe AI

Increasing use of LLM agents and MCP raises the risk of harmful tools and harmful uses of tools. Action Guard uses a neural network model to classify actions proposed by autonomous AI agents as harmful or safe. The model is trained on a small dataset of labeled examples. The work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful, unethical, or in violation of predefined guidelines.

Demo

Demo GIF

Implementation

Implementation Diagram

Common causes of harmful actions by AI agents:

  • User attempting to jailbreak the model.
  • Model hallucinating or misunderstanding the context.
  • Model being overconfident in its incorrect knowledge.
  • Lack of proper constraints or guidelines for the agent.
  • Inadequate training data for specific scenarios.
  • MCP server providing incorrect tool descriptions that mislead the agent.
  • Harmful MCP servers returning manipulative text to mislead the model.
  • Observed in experiments: an agent can perform a harmful action and still respond "Sorry, I can't help with that."

New contributions in this project:

  1. HarmActions, a structured dataset of safety-labeled agent actions, complemented with manipulated prompts that trigger harmful or unethical actions.
  2. Action Classifier, a neural classifier trained on HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
  3. A deployment integration via an MCP proxy supporting live action screening using existing MCP servers and clients.
  4. HarmActEval, a benchmark leveraging a new metric, "Harm@k".

Special features:

  • This project introduces the HarmActions dataset and the HarmActEval benchmark to evaluate an AI agent's probability of generating harmful actions.
  • The dataset has been used to train a lightweight neural network model that classifies actions as safe, harmful, or unethical.
  • The model is lightweight and can be easily integrated into existing AI agent frameworks like MCP.
  • This project focuses on classifying actions; it is not related to Guardrails.
  • Integration with MCP (Model Context Protocol) to allow real-time action classification.
  • Unlike OpenAI's "require_approval": "always" flag, this blocks harmful actions without human intervention.

Safety Features:

  • Automatically classifies MCP tool calls before execution.
  • Blocks harmful actions based on the trained model's outputs (see the sketch below this list).
  • Provides detailed classification results.
  • Allows safe actions to proceed normally.
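
This classify-before-execute flow can be sketched as a small pre-execution hook. Only is_action_harmful and HarmfulActionException come from the package; the helper name and the tool-call format below are illustrative assumptions, not the project's exact API.

# Illustrative sketch: screen a proposed MCP tool call before executing it.
from agent_action_guard import is_action_harmful, HarmfulActionException

def screen_tool_call(tool_call: dict) -> dict:
    # Classify the proposed action; confidence is the classifier's score.
    is_harmful, confidence = is_action_harmful(tool_call)
    if is_harmful:
        # Blocked: raise instead of forwarding the call to the MCP server.
        raise HarmfulActionException(tool_call)
    # Safe: let the call proceed normally.
    return tool_call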

Usage:

  1. Clone the repository:
git clone https://github.com/Pro-GenAI/Agent-Action-Guard
cd Agent-Action-Guard
  2. Create a virtual environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
# pip install git+https://github.com/Pro-GenAI/Agent-Action-Guard
  3. Start an MCP server (if not already running):
python agent_action_guard/scripts/sample_mcp_server.py
  4. Use a guarded MCP proxy protected by Action Guard:
pip install git+https://github.com/Pro-GenAI/mcp-proxy-guarded
mcp-proxy-guarded --proxy-to http://localhost:8080/mcp --port 8081
  5. Start the chat server that uses the guarded proxy:
python agent_action_guard/scripts/chat_server.py
  6. To import Action Guard into other projects:
from agent_action_guard import is_action_harmful, HarmfulActionException
# action_dict is the proposed agent action (e.g., an MCP tool call) to screen
is_harmful, confidence = is_action_harmful(action_dict)
if is_harmful:
	raise HarmfulActionException(action_dict)
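
For example, a hypothetical action could be screened and the exception handled before dispatch. The action fields below are illustrative assumptions; the exact format expected by is_action_harmful may differ.

# Hypothetical example; field names are illustrative, not the dataset's exact schema.
action_dict = {"tool": "run_shell", "arguments": {"command": "rm -rf /"}}
try:
    is_harmful, confidence = is_action_harmful(action_dict)
    if is_harmful:
        raise HarmfulActionException(action_dict)
    print(f"Action allowed (confidence: {confidence})")
except HarmfulActionException:
    print("Action blocked before execution")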

Docker Compose

Run the whole demo with Docker Compose. Steps:

  • Copy the example env file and edit values as needed:
cp .env.example .env
# Edit .env to set BACKEND_API_KEY, OPENAI_MODEL*, and NGROK_AUTHTOKEN if you want a public tunnel
  • Build and start the services:
docker-compose up --build
  • Services exposed locally:
    • MCP server: http://localhost:8080
    • Guarded proxy: http://localhost:8081
    • API server: http://localhost:8000
    • Chat (Gradio): http://localhost:7860
    • Ngrok web UI: http://localhost:4040 (optional)
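
As a quick check that the stack is up, the exposed ports can be probed locally. This is only a sketch; it assumes each service responds on its root path, and some may return a non-200 status.

# Smoke-check the locally exposed services (illustrative only).
import urllib.request

services = {
    "MCP server": "http://localhost:8080",
    "Guarded proxy": "http://localhost:8081",
    "API server": "http://localhost:8000",
    "Chat (Gradio)": "http://localhost:7860",
}
for name, url in services.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: HTTP {resp.status}")
    except Exception as exc:  # connection refused or a non-200 root path
        print(f"{name}: {exc}")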

Notes:

  • The ngrok service is optional; set NGROK_AUTHTOKEN in .env if you want a public tunnel. Some ngrok images require an auth token or additional configuration; you can also run ngrok locally and point it at http://localhost:8081.
  • If you run into permission or package build issues in the image, try building locally or adjusting the base image in Dockerfile.

To set up manually without Docker:

python agent_action_guard/scripts/sample_mcp_server.py  # Starts a sample MCP server
export $(grep -v '^#' ./.env | xargs)  # Load env variables from .env
mcp-proxy-guarded --proxy-to http://localhost:8080/mcp --port 8081  # Start guarded MCP proxy
ngrok http --url=troll-prime-ultimately.ngrok-free.app 8081  # Start ngrok for guarded MCP server
python agent_action_guard/scripts/api_server.py  # Start backend
python agent_action_guard/scripts/chat_server.py  # Start Gradio app

Citation

If you find this repository useful in your research, please consider citing:

@article{202510.1415,
	title = {Agent Action Guard: Classifying AI Agent Actions to Ensure Safety and Reliability},
	author = {Praneeth Vadlapati},
	year = {2025},
	month = {October},
	journal = {Preprints},
	publisher = {Preprints},
	doi = {10.20944/preprints202510.1415.v1},
	url = {https://doi.org/10.20944/preprints202510.1415.v1}
}

Limitation

Personally Identifiable Information (PII) detection is not performed by this project, as it can be performed accurately using existing systems.

Based on my past work

Agent-Supervisor: Supervising Actions of Autonomous AI Agents for Ethical Compliance: GitHub

Acknowledgements

  • Thanks to Hugging Face for hosting the hackathon.
  • Thanks to Gradio for the interface of the demo app.
  • Thanks to Anthropic for the Model Context Protocol (MCP) framework.
  • Thanks to OpenAI for providing a Python package to interact with LLMs.
