Expand FireCrawl Document Loader with Full API Configuration and Custom Metadata Support #5385

@TravisP-Greener

Description

Feature Description

Expose FireCrawl's complete API configuration options (scrape, crawl, search, extract modes) and add dynamic metadata control for vector database operations. Currently, the FireCrawl loader only exposes basic options, limiting its usefulness for production RAG systems and agent workflows.

Feature Category

Integration

Problem Statement

The current FireCrawl document loader has two critical limitations:

  1. No Dynamic Metadata Control
    When upserting to vector stores (Pinecone, Qdrant), we need document-specific metadata for agent routing and filtering. The current "Additional Metadata" field applies the same metadata to every document in a crawl.

    Example: For a sustainability standards knowledge base, each document needs:

    • document_id, version, language
    • Arrays: scope, standards, tags
    • Booleans: is_latest
    • Nested structures for classification

    The current workaround requires post-processing outside Flowise, breaking the visual workflow paradigm.

  2. Missing FireCrawl API Features
    Only basic scrape options are currently exposed. Missing:

    • Scrape: formats, onlyMainContent, waitFor, headers, removeBase64Images, includePaths/excludePaths
    • Crawl: limit, maxDepth, allowBackwardLinks, allowExternalLinks, scrapeOptions
    • Extract: schema, prompt, systemPrompt (structured extraction)
    • Search: query filters, limits
    • Advanced: batch scraping, stealth mode, proxy config, document parsing (PDF/DOCX/XLSX)

    These options are essential for:

    • Clean knowledge bases (onlyMainContent removes navbars/footers)
    • Large site management (depth/limit controls)
    • Dynamic content (waitFor)
    • Structured data extraction (Extract mode)

Proposed Solution

Phase 1: Enhanced Metadata (MVP)
Add metadata configuration options:

  • Template variables: {"version": "{{urlSegment[2]}}", "type": "{{extractFromUrl(...)}}"}
  • Transform function: JavaScript to process metadata per document
  • Merge strategy: override/merge/extend with FireCrawl defaults
  • Content extraction: LLM-based field extraction from document content
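As a sketch of what the Phase 1 transform function could look like (the function name, document shape, and field derivations here are hypothetical illustrations, not an existing Flowise or FireCrawl API), deriving metadata fields from the URL structure:

```typescript
// Hypothetical shape of a crawled document as seen by a transform hook.
interface CrawledDoc {
  url: string;
  metadata: Record<string, unknown>;
}

// Illustrative per-document transform: derives document_id, version and
// is_latest from the URL path, then merges with FireCrawl's own metadata.
function transformMetadata(doc: CrawledDoc): Record<string, unknown> {
  const segments = new URL(doc.url).pathname.split("/").filter(Boolean);
  return {
    ...doc.metadata,
    document_id: segments[segments.length - 1] ?? "unknown",
    version: segments[1] ?? "unversioned",
    is_latest: !segments.includes("archive"),
  };
}
```

A transform like this would run once per crawled document, which is exactly what the single static "Additional Metadata" blob cannot do today.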

Phase 2: Expose API Configuration
Add collapsible "Advanced Options" per mode with JSON editor:

Scrape Mode:

{
  "formats": ["markdown", "html", "links"],
  "onlyMainContent": true,
  "waitFor": 2000,
  "headers": {...},
  "removeBase64Images": true,
  "includePaths": ["/docs/**"],
  "excludePaths": ["/blog/**"]
}

Crawl Mode:

{
  "limit": 100,
  "maxDepth": 3,
  "allowBackwardLinks": false,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "onlyMainContent": true
  }
}

Extract Mode:

{
  "schema": {...},
  "prompt": "Extract key metrics...",
  "systemPrompt": "You are a specialist..."
}

Phase 3: Additional Features

  • Batch scraping UI
  • Stealth mode toggle
  • Proxy configuration
  • Document parsing options

Implementation:

  • Keep a simple interface as the default (backwards compatible)
  • Add "Advanced Configuration" toggle
  • Provide schema validation
  • Include common presets (docs sites, blogs, standards databases)
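As a sketch of the schema-validation point above (the validator and its error messages are illustrative; only the option names come from the FireCrawl crawl parameters listed earlier), the JSON editor could check user input before any API call is made:

```typescript
// Illustrative validator for crawl-mode options. Option names mirror the
// FireCrawl crawl parameters above; the checks themselves are a sketch.
function validateCrawlOptions(raw: unknown): string[] {
  const errors: string[] = [];
  if (typeof raw !== "object" || raw === null) {
    return ["options must be a JSON object"];
  }
  const opts = raw as Record<string, unknown>;
  if ("limit" in opts && (typeof opts.limit !== "number" || opts.limit < 1)) {
    errors.push("limit must be a positive number");
  }
  if ("maxDepth" in opts && (typeof opts.maxDepth !== "number" || opts.maxDepth < 0)) {
    errors.push("maxDepth must be a non-negative number");
  }
  for (const key of ["allowBackwardLinks", "allowExternalLinks"]) {
    if (key in opts && typeof opts[key] !== "boolean") {
      errors.push(`${key} must be a boolean`);
    }
  }
  return errors;
}
```

Surfacing errors like these in the node UI would catch malformed advanced configuration before a crawl is started.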

Mockups or References

FireCrawl API Documentation:

Example additional metadata

{
  "document_id": "ghg-corporate-standard-2004",
  "title": "GHG Protocol Corporate Accounting and Reporting Standard",
  "version": "2004",
  "language": "en",
  "scope": ["scope1", "scope2", "scope3", "all"],
  "standards": ["ghg_protocol"],
  "tags": [
    "ghg-protocol",
    "carbon-accounting",
    "emissions-classification",
    "core-standard",
    "scope-boundaries",
    "scope-definition",
    "gl-classification",
    "supplier-classification",
    "validation",
    "direct-emissions",
    "purchased-energy",
    "value-chain"
  ],
  "region": ["GLOBAL"],
  "is_latest": true,
  "source_url": "https://ghgprotocol.org/corporate-standard"
}

Related Issues/PRs:

Additional Context

Use cases this solves:

  1. Multi-document knowledge bases - Different metadata per document based on URL structure
  2. Agent systems - Rich metadata for routing and filtering decisions
  3. Compliance tracking - Version, region, standard classification
  4. Content categorisation - Automatic tagging from document properties
  5. Clean ingestion - onlyMainContent prevents navigation pollution in vector stores
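To make use case 2 concrete, here is a sketch of how an agent could route on per-document metadata such as the `scope` and `is_latest` fields in the example above (the filter shape is generic and illustrative, not tied to Pinecone's or Qdrant's filter syntax):

```typescript
// Subset of the per-document metadata used for routing (see example above).
interface DocMeta {
  scope: string[];
  is_latest: boolean;
}

// Illustrative routing filter: keep only current documents whose scope
// covers the requested emissions scope (or covers everything).
function matchesScope(meta: DocMeta, wantScope: string): boolean {
  return (
    meta.is_latest &&
    (meta.scope.includes(wantScope) || meta.scope.includes("all"))
  );
}
```

Filtering like this only works when each document carries its own metadata, which is exactly what the loader cannot attach today.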

The current workaround is to:

  1. Manually use FireCrawl
  2. Export Markdown or JSON documents from FireCrawl
  3. Upload the documents into a text document loader
  4. Process metadata in the text document loader
  5. Re-import to vector store

This defeats the purpose of visual workflow building.

Metadata

Labels

enhancement (New feature or request)
