feat(core): add TOON encoder for human-readable data serialization #33704
Conversation
CodSpeed Performance Report: merging #33704 will not alter performance.

Force-pushed from 9551d24 to 20b430e.
Implements a Token-Oriented Object Notation (TOON) encoder in `langchain-core` as a human-readable alternative to JSON serialization.

Features:
- Automatic format detection (inline arrays, tabular objects)
- Native support for `BaseMessage`, `Document`, and `Serializable` types
- Configurable options (indent, delimiter, length markers)
- Zero breaking changes, pure addition

Tests: 116 unit tests, all passing
Linting: all checks passed

Inspired-by: https://github.com/jschopplich/toon
Force-pushed from c285460 to fdb5f38.
## TOON Encoder for LangChain 🎯

**TL;DR:** A compact format that saves 30-60% tokens vs. JSON – well suited to LLM prompts and structured output.

### Real Example: Product Search Results

Imagine you query a database for products and want to pass the results to an LLM for analysis:

```python
import json

from langchain_core.toon import encode

# Your database returns products
products = [
    {"id": 1, "name": "Wireless Mouse", "price": 29.99, "stock": 150, "category": "Electronics"},
    {"id": 2, "name": "USB Cable", "price": 12.50, "stock": 300, "category": "Electronics"},
    {"id": 3, "name": "Desk Lamp", "price": 45.00, "stock": 85, "category": "Home"},
    {"id": 4, "name": "Notebook", "price": 8.99, "stock": 500, "category": "Office"},
]

# ❌ Traditional way: JSON (expensive)
json_prompt = f"""
Analyze these products and suggest promotions:
{json.dumps(products, indent=2)}
"""
# Result: ~380 tokens 💸

# ✅ TOON way: compact (cheaper)
toon_prompt = f"""
Analyze these products and suggest promotions:
{encode(products)}
"""
# Result: ~180 tokens 💰 (52% saved!)
```

### Why This Matters
### Quick Start

```python
from langchain_core.toon import encode

# 1. Get data from the database
results = db.query("SELECT * FROM products WHERE category='Electronics'")

# 2. Convert to TOON
toon_data = encode(results)

# 3. Use in a prompt (save 30-60% tokens!)
chain = prompt_template | llm
response = chain.invoke({"products": toon_data})
```

### Benefits
Inspired by the TOON format – a clean-room implementation for Python + LangChain.
ccurme
left a comment
Hello, thanks for this. This seems outside the scope of LangChain, but I'd encourage you to publish it as a standalone package.
## Description

Adds a TOON (Token-Oriented Object Notation) encoder to `langchain-core` – a compact, human-readable format designed for LLM contexts. TOON reduces token usage by 30-60% compared to JSON while maintaining readability, making it well suited to LLM prompts and structured output.

## Motivation
JSON is token-expensive. When passing structured data to LLMs (conversation history, retrieved documents, product catalogs), JSON's verbose syntax wastes tokens and money.
## Real-World Impact

### Example: Product Search Results

JSON (380 tokens):

```json
[{"id": 1, "name": "Wireless Mouse", "price": 29.99, "stock": 150}, ...]
```

TOON (180 tokens):
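The TOON sample output itself did not survive the page extraction. As a hedged illustration only (not this PR's actual encoder), a minimal sketch of the tabular encoding described in this PR – a `[N]` length marker plus a `{field,...}` header followed by delimited rows – might look like:

```python
# Minimal sketch of TOON-style tabular encoding for a uniform list of
# dicts. Illustrative only; encode_tabular is a hypothetical helper,
# not the PR's encoder.
def encode_tabular(key, rows):
    fields = list(rows[0])
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = [header]
    for row in rows:
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)

products = [
    {"id": 1, "name": "Wireless Mouse", "price": 29.99, "stock": 150},
    {"id": 2, "name": "USB Cable", "price": 12.50, "stock": 300},
]
print(encode_tabular("products", products))
# products[2]{id,name,price,stock}:
#   1,Wireless Mouse,29.99,150
#   2,USB Cable,12.5,300
```

The single header row replaces the repeated keys that JSON emits for every object, which is where most of the token savings come from.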
## Why TOON?

### 🔥 Token Efficiency (30-60% savings)

The `[N]` length marker and `{f1,f2}` field headers provide LLM guardrails.

### 🎯 Better for LLMs
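A hedged sketch of why those markers help: a consumer (or the model itself, via instructions) can mechanically check the declared length against the actual row count. `check_length_marker` is an illustrative name, not part of this PR:

```python
import re

# Illustrative guardrail check (not part of the PR): verify that the
# declared [N] length marker matches the number of data rows.
def check_length_marker(toon_text: str) -> bool:
    header, *rows = toon_text.strip().splitlines()
    m = re.match(r"^\w+\[(\d+)\]\{[^}]*\}:$", header)
    if not m:
        raise ValueError("not a tabular TOON block")
    return int(m.group(1)) == len(rows)

sample = """products[2]{id,name}:
  1,Wireless Mouse
  2,USB Cable"""

print(check_length_marker(sample))                  # True
print(check_length_marker(sample + "\n  3,Extra"))  # False: row count drifted
```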
### 🔧 LangChain-Optimized

Native support for `BaseMessage`, `Document`, and `Serializable`.

## Type of Change
## Changes Made

### Core Implementation

New `langchain_core.toon` module with 7 files:

- `__init__.py` – public API with the `encode()` function
- `constants.py` – type definitions and constants
- `normalize.py` – normalization of Python objects to JSON-like values
- `langchain_support.py` – special handling for LangChain types
- `formatters.py` – primitive encoding and string escaping
- `encoder.py` – core encoding logic with pattern detection
- `writer.py` – indentation management

### Features

- Native support for `BaseMessage`, `Document`, and `Serializable`
- Configurable delimiters (`,`, `|`, `\t`) and length markers
- `Literal` types for delimiters and options

### Testing
## Use Cases

1. Compress Conversation History
2. RAG Document Context
3. Structured LLM Output
4. Agent State Inspection
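As a hedged illustration of use case 1, with plain role/content dicts standing in for `BaseMessage` objects and a simple line format standing in for `encode()` (which this PR would provide):

```python
import json

# Hedged stand-in for compressing conversation history: the PR's
# encoder handles BaseMessage natively; here plain dicts and a
# "role: content" line format illustrate the size difference.
history = [
    {"role": "user", "content": "What is TOON?"},
    {"role": "assistant", "content": "A compact, token-oriented serialization format."},
    {"role": "user", "content": "When should I use it over JSON?"},
]

verbose = json.dumps(history, indent=2)
compact = "\n".join(f"{m['role']}: {m['content']}" for m in history)

print(len(verbose), len(compact))  # the compact form is markedly shorter
```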
## Token Savings Benchmarks

Measured with the GPT tokenizer (`o200k_base`); actual savings vary by model.
## Testing

- All checks pass (`make lint`)

## Documentation

## Checklist
## Breaking Changes
None. This is a pure addition with no modifications to existing APIs.
## Additional Context

Inspired by Johann Schopplich's TOON format. This is a clean-room reimplementation optimized for Python + LangChain.
Credit: The TOON format was originally created by Johann Schopplich. This implementation is inspired by his work but written from scratch for the Python/LangChain ecosystem.
## Related Issues
N/A - New feature contribution