The growing use of LLM agents and MCP increases the risk of harmful tools being used and of legitimate tools being used in harmful ways. Action Guard uses a neural network model to classify actions proposed by autonomous AI agents as harmful or safe. The model is trained on a small dataset of labeled examples. The work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful, unethical, or violate predefined guidelines.
- User attempting to jailbreak the model.
- Model hallucinating or misunderstanding the context.
- Model being overconfident in its incorrect knowledge.
- Lack of proper constraints or guidelines for the agent.
- Inadequate training data for specific scenarios.
- MCP server providing incorrect tool descriptions that mislead the agent.
- Harmful MCP servers returning manipulative text to mislead the model.
- The experiments showed that a model can perform a harmful action and still respond "Sorry, I can't help with that."
- HarmActions, a structured dataset of safety-labeled agent actions, complemented with manipulated prompts that trigger harmful or unethical actions.
- Action Classifier, a neural classifier trained on the HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
- A deployment integration via an MCP proxy supporting live action screening using existing MCP servers and clients.
- HarmActEval, a benchmark leveraging a new metric, “Harm@k.”
- This project introduces the "HarmActEval" dataset and benchmark to evaluate an AI agent's probability of generating harmful actions.
- The dataset has been used to train a lightweight neural network model that classifies actions as safe, harmful, or unethical.
- The model is lightweight and can be easily integrated into existing AI agent frameworks like MCP.
- This project focuses on classifying actions and is not related to Guardrails.
- Integration with MCP (Model Context Protocol) to allow real-time action classification.
- Unlike OpenAI's `"require_approval": "always"` flag, this blocks harmful actions without human intervention.
Safety Features:
- Automatically classifies MCP tool calls before execution.
- Blocks harmful actions based on the outputs of the trained model.
- Provides detailed classification results.
- Allows safe actions to proceed normally
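Conceptually, the check the proxy applies can be pictured as a small decision function. The sketch below assumes the `is_action_harmful` / `HarmfulActionException` API shown later in this README; the action-dict keys and the `forward` callback are illustrative, not the proxy's actual internals:

```python
from agent_action_guard import is_action_harmful, HarmfulActionException

def screen_and_forward(tool_name: str, arguments: dict, forward):
    """Classify a proposed MCP tool call; block it if harmful, otherwise forward it."""
    action_dict = {"tool": tool_name, "arguments": arguments}  # illustrative structure
    is_harmful, confidence = is_action_harmful(action_dict)
    if is_harmful:
        # Blocked: the exception carries the offending action for reporting back to the client.
        raise HarmfulActionException(action_dict)
    # Safe: pass the call through to the real MCP server (confidence can be logged or returned).
    return forward(tool_name, arguments)
```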
- Clone the repository:
  ```bash
  git clone https://github.com/Pro-GenAI/Agent-Action-Guard
  cd Agent-Action-Guard
  ```
- Create a virtual environment and install dependencies:
  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -e .
  # pip install git+https://github.com/Pro-GenAI/Agent-Action-Guard
  ```
- Start an MCP server (if not already running):
  ```bash
  python agent_action_guard/scripts/sample_mcp_server.py
  ```
- Use a guarded proxy protected by Action Guard:
  ```bash
  pip install git+https://github.com/Pro-GenAI/mcp-proxy-guarded
  mcp-proxy-guarded --proxy-to http://localhost:8080/mcp --port 8081
  ```
- Start the chat server that uses the guarded proxy:
  ```bash
  python agent_action_guard/scripts/chat_server.py
  ```
- To import Action Guard into other projects:
  ```python
  from agent_action_guard import is_action_harmful, HarmfulActionException

  is_harmful, confidence = is_action_harmful(action_dict)
  if is_harmful:
      raise HarmfulActionException(action_dict)
  ```
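In an agent loop, the exception can be caught and turned into a refusal instead of executing the action; a short, illustrative sketch (the `execute_tool` stub and the example action are placeholders):

```python
from agent_action_guard import is_action_harmful, HarmfulActionException

def execute_tool(action_dict: dict) -> str:
    # Stand-in for whatever your agent framework uses to actually run the tool call.
    return f"Executed {action_dict['tool']}"

def guarded_call(action_dict: dict) -> str:
    is_harmful, confidence = is_action_harmful(action_dict)
    if is_harmful:
        raise HarmfulActionException(action_dict)
    return execute_tool(action_dict)

try:
    result = guarded_call({"tool": "send_email", "arguments": {"to": "...", "body": "..."}})
except HarmfulActionException:
    result = "Action blocked by Action Guard."  # respond with a refusal instead of executing
print(result)
```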
Run the whole demo with Docker Compose. Steps:
- Copy the example env file and edit values as needed:
  ```bash
  cp .env.example .env
  # Edit .env to set BACKEND_API_KEY, OPENAI_MODEL*, and NGROK_AUTHTOKEN if you want a public tunnel
  ```
- Build and start the services:
  ```bash
  docker-compose up --build
  ```
- Services exposed locally:
  - MCP server: http://localhost:8080
  - Guarded proxy: http://localhost:8081
  - API server: http://localhost:8000
  - Chat (Gradio): http://localhost:7860
  - Ngrok web UI: http://localhost:4040 (optional)
Notes:
- The `ngrok` service is optional; set `NGROK_AUTHTOKEN` in `.env` if you want a public tunnel. Some ngrok images require an auth token or additional configuration; you can also run `ngrok` locally and point it at `http://localhost:8081`.
- If you run into permission or package build issues in the image, try building locally or adjusting the base image in `Dockerfile`.
To set up manually without Docker:
```bash
python agent_action_guard/scripts/sample_mcp_server.py  # Starts a sample MCP server
export $(grep -v '^#' ./.env | xargs)  # Load env variables from .env
mcp-proxy-guarded --proxy-to http://localhost:8080/mcp --port 8081  # Start guarded MCP proxy
ngrok http --url=troll-prime-ultimately.ngrok-free.app 8081  # Start ngrok for guarded MCP server
python agent_action_guard/scripts/api_server.py  # Start backend
python agent_action_guard/scripts/chat_server.py  # Start Gradio app
```

If you find this repository useful in your research, please consider citing:
```bibtex
@article{202510.1415,
  title = {Agent Action Guard: Classifying AI Agent Actions to Ensure Safety and Reliability},
  year = {2025},
  month = {October},
  publisher = {Preprints},
  author = {Praneeth Vadlapati},
  doi = {10.20944/preprints202510.1415.v1},
  url = {https://doi.org/10.20944/preprints202510.1415.v1},
  journal = {Preprints}
}
```

Personally Identifiable Information (PII) detection is not performed by this project, as it can be performed accurately using other existing systems.
Agent-Supervisor: Supervising Actions of Autonomous AI Agents for Ethical Compliance: GitHub
- Thanks to Hugging Face for hosting the hackathon.
- Thanks to Gradio for the interface of the demo app.
- Thanks to Anthropic for the Model Context Protocol (MCP) framework.
- Thanks to OpenAI for providing a Python package to interact with LLMs.


