
# SGR vs Function Calling: When to Use Each Approach


πŸ‘ The Problem with Function Calling on Local Models (ReAct Agents)

**Reality Check:** Function Calling works great on OpenAI/Anthropic models (80+ BFCL scores) but fails dramatically on local models under 32B parameters when used in true ReAct agents with `tool_mode="auto"`, where the model itself decides when to call tools.

**BFCL Benchmark Results for Qwen3 Models:**

- Qwen3-8B (FC): only 15% accuracy in the BFCL Agentic Web Search category
- Qwen3-4B (FC): only 2% accuracy in the same category
- Qwen3-1.7B (FC): only 4.5% accuracy in the same category
- Even with native FC support, smaller models struggle to decide WHEN to call tools
- Common result: `{"tool_calls": null, "content": "Text instead of tool call"}`

Note: Our team is developing a specialized benchmark for SGR vs ReAct performance on smaller models. Initial testing confirms that the SGR pipeline enables even small models to follow complex task workflows.
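For concreteness, here is a hedged sketch of what that failure mode looks like through an OpenAI-compatible client pointed at a local server. The endpoint, model id, tool spec, and the `handle_with_sgr`/`dispatch_tool_calls` helpers are illustrative placeholders, not a fixed API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed local server

TOOLS = [{  # minimal OpenAI-style tool spec, for illustration only
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Find current BMW X6 prices"}],
    tools=TOOLS,
    tool_choice="auto",  # the model itself decides WHEN to call a tool
)

message = response.choices[0].message
if message.tool_calls is None:
    # Small models often answer in prose instead of emitting a tool call;
    # this is exactly the point where an SGR fallback would take over.
    handle_with_sgr(message.content)          # hypothetical fallback
else:
    dispatch_tool_calls(message.tool_calls)   # hypothetical dispatcher
```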

πŸ‘ SGR Solution: Forced Reasoning β†’ Deterministic Execution

```python
# Phase 1: Structured Output reasoning (100% reliable)
reasoning = model.generate(format="json_schema")
# {"action": "search", "query": "BMW X6 prices", "reason": "need current data"}

# Phase 2: Deterministic execution (no model uncertainty)
result = execute_plan(reasoning.actions)
```
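As a more concrete illustration of Phase 1, here is a minimal sketch using the OpenAI Python SDK's structured-output `parse` helper against a local OpenAI-compatible server; the endpoint, model id, and the `ReasoningStep` schema fields are assumptions for illustration, not part of any SGR spec:

```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

class ReasoningStep(BaseModel):
    action: Literal["search", "calculate", "answer"]  # illustrative action set
    query: str
    reason: str

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# With a server that supports constrained decoding, the response is
# schema-valid by construction: no "text instead of tool call" failure mode.
completion = client.beta.chat.completions.parse(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Find current BMW X6 prices"}],
    response_format=ReasoningStep,
)
reasoning = completion.choices[0].message.parsed
# e.g. ReasoningStep(action='search', query='BMW X6 prices', reason='need current data')
```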

πŸ‘ Architecture by Model Size

| Model Size | Recommended Approach | FC Accuracy | Why Choose This |
|---|---|---|---|
| <14B | Pure SGR + Structured Output | 15-25% | FC practically unusable |
| 14-32B | SGR + FC hybrid | 45-65% | Best of both worlds |
| 32B+ | Native FC with SGR fallback | 85%+ | FC works reliably |
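If you route requests programmatically, this table collapses into a small dispatcher. A sketch, with thresholds simply encoding the recommendations above:

```python
def choose_approach(param_count_b: float) -> str:
    """Pick an execution strategy from the model's parameter count (in billions)."""
    if param_count_b < 14:
        return "pure_sgr"                     # FC practically unusable at this scale
    if param_count_b < 32:
        return "sgr_fc_hybrid"                # SGR decides, FC executes
    return "native_fc_with_sgr_fallback"      # FC is reliable; keep SGR as a safety net
```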

πŸ‘ When to Use SGR vs Function Calling

| Use Case | Best Approach | Why |
|---|---|---|
| Data analysis & structuring | SGR | Controlled reasoning with visibility |
| Document processing | SGR | Step-by-step analysis with justification |
| Local models (<32B) | SGR | Forces reasoning regardless of model limitations |
| Multi-agent systems | Function Calling | Native agent interruption support |
| External API interactions | Function Calling | Direct tool access pattern |
| Production monitoring | SGR | All reasoning steps visible and loggable |

πŸ‘ Real-World Results

**Initial Testing Results:**

- SGR enables even small models to follow structured workflows
- The SGR pipeline provides deterministic execution regardless of model size
- SGR forces reasoning steps that ReAct leaves to model discretion

**Planned Benchmarking:**

- We're developing a comprehensive benchmark comparing SGR vs ReAct across model sizes
- Initial testing shows promising results for SGR on models as small as 4B parameters
- Full metrics and performance comparison coming soon

πŸ‘ Hybrid Approach: The Best of Both Worlds

The optimal solution for many production systems is a hybrid approach:

1. **SGR for decision making** - determine which tools to use
2. **Function Calling for execution** - fetch data and provide an agent-like experience
3. **SGR for final processing** - structure and format the results

This hybrid approach works particularly well for models in the 14-32B range, where Function Calling works sometimes but isn't fully reliable.
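Here is a hedged end-to-end sketch of that pipeline, again assuming an OpenAI-compatible local server; the `Plan` and `FinalAnswer` schemas, the model id, and the `execute_tool` dispatcher are illustrative, not a fixed SGR API:

```python
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

class Plan(BaseModel):
    tool: Literal["web_search", "none"]  # illustrative tool set
    query: str
    reason: str

class FinalAnswer(BaseModel):
    summary: str
    sources: list[str]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODEL = "qwen3-14b"

def hybrid(user_query: str, tools: list[dict]) -> FinalAnswer:
    # 1. SGR for decision making: the model must emit a schema-valid plan.
    plan = client.beta.chat.completions.parse(
        model=MODEL,
        messages=[{"role": "user", "content": user_query}],
        response_format=Plan,
    ).choices[0].message.parsed

    # 2. Function Calling for execution: tool_choice is forced to the planned
    #    tool, so the model only fills in arguments; the unreliable "when to
    #    call" decision was already made by SGR in step 1.
    observation = "no tool call required"
    if plan.tool != "none":
        fc = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": user_query}],
            tools=tools,  # OpenAI-style tool specs, assumed defined by the caller
            tool_choice={"type": "function", "function": {"name": plan.tool}},
        )
        call = fc.choices[0].message.tool_calls[0]
        observation = execute_tool(call.function.name, call.function.arguments)  # hypothetical

    # 3. SGR for final processing: structure the tool output for downstream use.
    return client.beta.chat.completions.parse(
        model=MODEL,
        messages=[
            {"role": "user", "content": user_query},
            {"role": "assistant", "content": plan.model_dump_json()},
            {"role": "user", "content": f"Tool result: {observation}"},
        ],
        response_format=FinalAnswer,
    ).choices[0].message.parsed
```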

**Bottom Line:** Don't force <32B models to pretend they're GPT-4o in ReAct-style agentic workflows with `tool_mode="auto"`. Let them think structurally through SGR, then execute deterministically.
