Description
Feature Description
Expose FireCrawl's complete API configuration options (scrape, crawl, search, extract modes) and add dynamic metadata control for vector database operations. Currently, the FireCrawl loader only exposes basic options, limiting its usefulness for production RAG systems and agent workflows.
Feature Category
Integration
Problem Statement
The current FireCrawl document loader has two critical limitations:
- No Dynamic Metadata Control
When upserting to vector stores (Pinecone, Qdrant), we need document-specific metadata for agent routing and filtering. The current "Additional Metadata" field applies the same metadata to every document in a crawl.
Example: For a sustainability standards knowledge base, each document needs:
- document_id, version, language
- Arrays: scope, standards, tags
- Booleans: is_latest
- Nested structures for classification
The current workaround requires post-processing outside Flowise, which breaks the visual workflow paradigm.
- Missing FireCrawl API Features
Only basic scrape options are exposed. Missing:
- Scrape: formats, onlyMainContent, waitFor, headers, removeBase64Images, includePaths/excludePaths
- Crawl: limit, maxDepth, allowBackwardLinks, allowExternalLinks, scrapeOptions
- Extract: schema, prompt, systemPrompt (structured extraction)
- Search: query filters, limits
- Advanced: batch scraping, stealth mode, proxy config, document parsing (PDF/DOCX/XLSX)
These options are essential for:
- Clean knowledge bases (onlyMainContent removes navbars/footers)
- Large site management (depth/limit controls)
- Dynamic content (waitFor)
- Structured data extraction (Extract mode)
Proposed Solution
Phase 1: Enhanced Metadata (MVP)
Add metadata configuration options:
- Template variables: {"version": "{{urlSegment[2]}}", "type": "{{extractFromUrl(...)}}"}
- Transform function: JavaScript to process metadata per document
- Merge strategy: override/merge/extend with FireCrawl defaults
- Content extraction: LLM-based field extraction from document content
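The template variables and merge strategy above could work roughly as in the following sketch. Everything here is hypothetical: `resolveTemplate`, `resolveMetadata`, `mergeMetadata`, and the reading of the three strategy names are assumptions for illustration, not existing Flowise or FireCrawl APIs, and only the `{{urlSegment[n]}}` placeholder is handled.

```typescript
// Hypothetical sketch: none of these helpers exist in Flowise or FireCrawl today.
type Metadata = Record<string, unknown>;
type MergeStrategy = "override" | "merge" | "extend";

// Split the path of a URL into its non-empty segments, ignoring query and fragment.
function urlSegments(sourceUrl: string): string[] {
  const withoutOrigin = sourceUrl.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]*/i, "");
  const path = withoutOrigin.split(/[?#]/)[0] ?? "";
  return path.split("/").filter((segment) => segment.length > 0);
}

// Replace {{urlSegment[n]}} placeholders with the n-th path segment (0-based).
// Placeholders that cannot be resolved are left untouched so mistakes stay visible.
function resolveTemplate(template: string, sourceUrl: string): string {
  const segments = urlSegments(sourceUrl);
  return template.replace(/\{\{\s*urlSegment\[(\d+)\]\s*\}\}/g, (match: string, index: string) => {
    const segment = segments[Number(index)];
    return segment === undefined ? match : segment;
  });
}

// Resolve every string value of a metadata template for one document.
function resolveMetadata(template: Metadata, sourceUrl: string): Metadata {
  const resolved: Metadata = {};
  for (const key of Object.keys(template)) {
    const value = template[key];
    resolved[key] = typeof value === "string" ? resolveTemplate(value, sourceUrl) : value;
  }
  return resolved;
}

// One possible reading of the three strategies (an assumption, not a spec):
//   override: custom metadata replaces the loader defaults entirely
//   merge:    both are kept, custom values win on conflicting keys
//   extend:   both are kept, loader defaults win on conflicting keys
function mergeMetadata(defaults: Metadata, custom: Metadata, strategy: MergeStrategy): Metadata {
  if (strategy === "override") {
    return { ...custom };
  }
  if (strategy === "merge") {
    return { ...defaults, ...custom };
  }
  return { ...custom, ...defaults };
}
```

With a made-up source URL such as `https://example.org/standards/ghg/2004/corporate`, the template `{"version": "{{urlSegment[2]}}"}` would resolve to `{"version": "2004"}` for that document only, which is the per-document behaviour the shared "Additional Metadata" field cannot provide.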
Phase 2: Expose API Configuration
Add collapsible "Advanced Options" per mode with JSON editor:
Scrape Mode:
{
  "formats": ["markdown", "html", "links"],
  "onlyMainContent": true,
  "waitFor": 2000,
  "headers": {...},
  "removeBase64Images": true,
  "includePaths": ["/docs/**"],
  "excludePaths": ["/blog/**"]
}
Crawl Mode:
{
  "limit": 100,
  "maxDepth": 3,
  "allowBackwardLinks": false,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "onlyMainContent": true
  }
}
Extract Mode:
{
  "schema": {...},
  "prompt": "Extract key metrics...",
  "systemPrompt": "You are a specialist..."
}
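As a rough illustration of how the per-mode JSON editor could feed the loader, the sketch below parses the editor text and builds a request description without sending anything. The `buildRequest` helper is hypothetical, and the endpoint paths and the `url` / `urls` / `query` field names are assumptions about the FireCrawl v1 REST API that should be checked against the documentation linked under Mockups or References.

```typescript
// Hypothetical sketch: paths and target field names are assumptions to verify.
type Mode = "scrape" | "crawl" | "extract" | "search";

interface FirecrawlRequest {
  path: string;
  body: Record<string, unknown>;
}

// Per-mode endpoint path and the body field that carries the target (URL or query).
const MODE_CONFIG: Record<Mode, { path: string; targetField: string; asArray: boolean }> = {
  scrape: { path: "/v1/scrape", targetField: "url", asArray: false },
  crawl: { path: "/v1/crawl", targetField: "url", asArray: false },
  extract: { path: "/v1/extract", targetField: "urls", asArray: true },
  search: { path: "/v1/search", targetField: "query", asArray: false },
};

// Parse the "Advanced Options" JSON text and combine it with the target.
// An empty editor yields a body with only the target, which keeps today's behaviour.
function buildRequest(mode: Mode, target: string, advancedJson: string): FirecrawlRequest {
  const parsed: unknown = advancedJson.trim() === "" ? {} : JSON.parse(advancedJson);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Advanced options must be a JSON object");
  }
  const config = MODE_CONFIG[mode];
  return {
    path: config.path,
    // The target is written last so the advanced options cannot overwrite it.
    body: { ...(parsed as Record<string, unknown>), [config.targetField]: config.asArray ? [target] : target },
  };
}
```

Building the request as plain data, separate from the HTTP call, would also make the advanced options easy to unit test and to preview in the UI before a crawl is started.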
Phase 3: Additional Features
- Batch scraping UI
- Stealth mode toggle
- Proxy configuration
- Document parsing options
Implementation:
- Keep a simple interface as the default (backwards compatible)
- Add "Advanced Configuration" toggle
- Provide schema validation
- Include common presets (docs sites, blogs, standards databases)
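The implementation points above could fit together along these lines. This is a minimal sketch under stated assumptions: `CRAWL_PRESETS`, `resolveCrawlOptions`, and `validateCrawlOptions` are hypothetical names, the preset values are illustrative only, and the validation covers just the crawl options listed in this issue.

```typescript
// Hypothetical sketch: preset names and values are illustrative, not recommendations.
type CrawlOptions = Record<string, unknown>;

const CRAWL_PRESETS: Record<string, CrawlOptions> = {
  docsSite: { limit: 100, maxDepth: 3, allowExternalLinks: false, scrapeOptions: { onlyMainContent: true } },
  blog: { limit: 50, maxDepth: 2, allowExternalLinks: false, scrapeOptions: { onlyMainContent: true } },
};

// With the "Advanced Configuration" toggle off, no extra options are sent,
// so existing flows keep their current behaviour (backwards compatible).
function resolveCrawlOptions(advancedEnabled: boolean, presetName: string | null, overrides: CrawlOptions): CrawlOptions {
  if (!advancedEnabled) {
    return {};
  }
  const preset = presetName !== null ? CRAWL_PRESETS[presetName] : undefined;
  if (presetName !== null && preset === undefined) {
    throw new Error(`Unknown preset: ${presetName}`);
  }
  // User overrides win over preset values.
  return { ...(preset ?? {}), ...overrides };
}

// Return a list of human-readable problems; an empty list means the options passed.
function validateCrawlOptions(options: CrawlOptions): string[] {
  const errors: string[] = [];
  const known = ["limit", "maxDepth", "allowBackwardLinks", "allowExternalLinks", "scrapeOptions"];
  for (const key of Object.keys(options)) {
    if (known.indexOf(key) === -1) {
      errors.push(`unknown option: ${key}`);
    }
  }
  const isWholeNumber = (v: unknown): boolean => typeof v === "number" && isFinite(v) && Math.floor(v) === v;
  if (options.limit !== undefined && !(isWholeNumber(options.limit) && (options.limit as number) >= 1)) {
    errors.push("limit must be a positive integer");
  }
  if (options.maxDepth !== undefined && !(isWholeNumber(options.maxDepth) && (options.maxDepth as number) >= 0)) {
    errors.push("maxDepth must be a non-negative integer");
  }
  for (const flag of ["allowBackwardLinks", "allowExternalLinks"]) {
    if (options[flag] !== undefined && typeof options[flag] !== "boolean") {
      errors.push(`${flag} must be a boolean`);
    }
  }
  const scrape = options.scrapeOptions;
  if (scrape !== undefined && (typeof scrape !== "object" || scrape === null || Array.isArray(scrape))) {
    errors.push("scrapeOptions must be an object");
  }
  return errors;
}
```

Returning a list of messages rather than throwing on the first problem would let the JSON editor show every issue at once.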
Mockups or References
FireCrawl API Documentation:
- Scrape: https://docs.firecrawl.dev/features/scrape
- Crawl: https://docs.firecrawl.dev/features/crawl
- Extract: https://docs.firecrawl.dev/features/extract
- Search: https://docs.firecrawl.dev/features/search
- Batch: https://docs.firecrawl.dev/features/batch-scrape
- Document Parsing: https://docs.firecrawl.dev/features/document-parsing
Example additional metadata:
{
  "document_id": "ghg-corporate-standard-2004",
  "title": "GHG Protocol Corporate Accounting and Reporting Standard",
  "version": "2004",
  "language": "en",
  "scope": ["scope1", "scope2", "scope3", "all"],
  "standards": ["ghg_protocol"],
  "tags": [
    "ghg-protocol",
    "carbon-accounting",
    "emissions-classification",
    "core-standard",
    "scope-boundaries",
    "scope-definition",
    "gl-classification",
    "supplier-classification",
    "validation",
    "direct-emissions",
    "purchased-energy",
    "value-chain"
  ],
  "region": ["GLOBAL"],
  "is_latest": true,
  "source_url": "https://ghgprotocol.org/corporate-standard"
}
Related Issues/PRs:
- chore: update Firecrawl version and add FirecrawlExtractTool #4073 - FireCrawl v1 API update
- feat: add search functionality to FireCrawl with customizable parameters #4535 - Search functionality addition
- Discussion Firecrawl Documentation for Flowise #3028 - Dify's FireCrawl implementation reference
Additional Context
Use cases this solves:
- Multi-document knowledge bases - Different metadata per document based on URL structure
- Agent systems - Rich metadata for routing and filtering decisions
- Compliance tracking - Version, region, standard classification
- Content categorisation - Automatic tagging from document properties
- Clean ingestion - onlyMainContent prevents navigation pollution in vector stores
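To make the routing and filtering use case concrete, here is a small in-memory sketch of how an agent could select documents by per-document metadata. The `DocMetadata` fields mirror the example metadata in this issue; `RouteFilter`, `matchesFilter`, and `selectDocuments` are hypothetical names, and a real deployment would express the same conditions in the vector store's own filter syntax instead.

```typescript
// Hypothetical sketch: an in-memory stand-in for a vector store metadata filter.
interface DocMetadata {
  document_id: string;
  version: string;
  language: string;
  scope: string[];
  standards: string[];
  tags: string[];
  is_latest: boolean;
}

// Every field is optional; an omitted field places no constraint on the result.
interface RouteFilter {
  language?: string;
  scope?: string;
  standard?: string;
  latestOnly?: boolean;
}

function matchesFilter(meta: DocMetadata, filter: RouteFilter): boolean {
  if (filter.language !== undefined && meta.language !== filter.language) return false;
  if (filter.scope !== undefined && meta.scope.indexOf(filter.scope) === -1) return false;
  if (filter.standard !== undefined && meta.standards.indexOf(filter.standard) === -1) return false;
  if (filter.latestOnly === true && !meta.is_latest) return false;
  return true;
}

// Return the ids of all documents an agent should consult for this filter.
function selectDocuments(docs: DocMetadata[], filter: RouteFilter): string[] {
  return docs.filter((doc) => matchesFilter(doc, filter)).map((doc) => doc.document_id);
}
```

None of these filters is possible when every document in a crawl carries the same metadata, since each condition would then match either all documents or none.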
The current workaround is to:
- Manually use FireCrawl
- Export markdown or JSON documents from FireCrawl
- Upload the documents into a text document loader
- Process metadata in the text document loader
- Re-import into the vector store
This defeats the purpose of visual workflow building.