Expand FireCrawl Document Loader with Full API Configuration and Custom Metadata Support #5385

@TravisP-Greener

Description

Feature Description

Expose FireCrawl's complete API configuration options (scrape, crawl, search, extract modes) and add dynamic metadata control for vector database operations. Currently, the FireCrawl loader only exposes basic options, limiting its usefulness for production RAG systems and agent workflows.

Feature Category

Integration

Problem Statement

The current FireCrawl document loader has two critical limitations:

  1. No Dynamic Metadata Control
    When upserting to vector stores (Pinecone, Qdrant), we need document-specific metadata for agent routing and filtering. The current "Additional Metadata" field applies the same metadata to every document in a crawl.

    Example: For a sustainability standards knowledge base, each document needs:

    • document_id, version, language
    • Arrays: scope, standards, tags
    • Booleans: is_latest
    • Nested structures for classification

    The current workaround requires post-processing outside Flowise, breaking the visual workflow paradigm.

  2. Missing FireCrawl API Features
    Only basic scrape options are currently exposed. Missing:

    • Scrape: formats, onlyMainContent, waitFor, headers, removeBase64Images, includePaths/excludePaths
    • Crawl: limit, maxDepth, allowBackwardLinks, allowExternalLinks, scrapeOptions
    • Extract: schema, prompt, systemPrompt (structured extraction)
    • Search: query filters, limits
    • Advanced: batch scraping, stealth mode, proxy config, document parsing (PDF/DOCX/XLSX)

    These options are essential for:

    • Clean knowledge bases (onlyMainContent removes navbars/footers)
    • Large site management (depth/limit controls)
    • Dynamic content (waitFor)
    • Structured data extraction (Extract mode)

Proposed Solution

Phase 1: Enhanced Metadata (MVP)
Add metadata configuration options:

  • Template variables: {"version": "{{urlSegment[2]}}", "type": "{{extractFromUrl(...)}}"}
  • Transform function: JavaScript to process metadata per document
  • Merge strategy: override/merge/extend with FireCrawl defaults
  • Content extraction: LLM-based field extraction from document content
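As a sketch of what the Phase 1 transform function could look like (the function name, document shape, and field derivations here are hypothetical illustrations, not an existing Flowise or FireCrawl API), deriving metadata fields from the URL structure:

```typescript
// Hypothetical shape of a crawled document as seen by a transform hook.
interface CrawledDoc {
  url: string;
  metadata: Record<string, unknown>;
}

// Illustrative per-document transform: derives document_id, version and
// is_latest from the URL path, then merges with FireCrawl's own metadata.
function transformMetadata(doc: CrawledDoc): Record<string, unknown> {
  const segments = new URL(doc.url).pathname.split("/").filter(Boolean);
  return {
    ...doc.metadata,
    document_id: segments[segments.length - 1] ?? "unknown",
    version: segments[1] ?? "unversioned",
    is_latest: !segments.includes("archive"),
  };
}
```

A transform like this would run once per crawled document, which is exactly what the single static "Additional Metadata" blob cannot do today.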

Phase 2: Expose API Configuration
Add collapsible "Advanced Options" per mode with JSON editor:

Scrape Mode:

{
  "formats": ["markdown", "html", "links"],
  "onlyMainContent": true,
  "waitFor": 2000,
  "headers": {...},
  "removeBase64Images": true,
  "includePaths": ["/docs/**"],
  "excludePaths": ["/blog/**"]
}

Crawl Mode:

{
  "limit": 100,
  "maxDepth": 3,
  "allowBackwardLinks": false,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "onlyMainContent": true
  }
}

Extract Mode:

{
  "schema": {...},
  "prompt": "Extract key metrics...",
  "systemPrompt": "You are a specialist..."
}

Phase 3: Additional Features

  • Batch scraping UI
  • Stealth mode toggle
  • Proxy configuration
  • Document parsing options

Implementation:

  • Keep a simple interface as the default (backwards compatible)
  • Add "Advanced Configuration" toggle
  • Provide schema validation
  • Include common presets (docs sites, blogs, standards databases)
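As a sketch of the schema-validation point above (the validator and its error messages are illustrative; only the option names come from the FireCrawl crawl parameters listed earlier), the JSON editor could check user input before any API call is made:

```typescript
// Illustrative validator for crawl-mode options. Option names mirror the
// FireCrawl crawl parameters above; the checks themselves are a sketch.
function validateCrawlOptions(raw: unknown): string[] {
  const errors: string[] = [];
  if (typeof raw !== "object" || raw === null) {
    return ["options must be a JSON object"];
  }
  const opts = raw as Record<string, unknown>;
  if ("limit" in opts && (typeof opts.limit !== "number" || opts.limit < 1)) {
    errors.push("limit must be a positive number");
  }
  if ("maxDepth" in opts && (typeof opts.maxDepth !== "number" || opts.maxDepth < 0)) {
    errors.push("maxDepth must be a non-negative number");
  }
  for (const key of ["allowBackwardLinks", "allowExternalLinks"]) {
    if (key in opts && typeof opts[key] !== "boolean") {
      errors.push(`${key} must be a boolean`);
    }
  }
  return errors;
}
```

Surfacing errors like these in the node UI would catch malformed advanced configuration before a crawl is started.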

Mockups or References

FireCrawl API Documentation:

Example additional metadata

{
  "document_id": "ghg-corporate-standard-2004",
  "title": "GHG Protocol Corporate Accounting and Reporting Standard",
  "version": "2004",
  "language": "en",
  "scope": ["scope1", "scope2", "scope3", "all"],
  "standards": ["ghg_protocol"],
  "tags": [
    "ghg-protocol",
    "carbon-accounting",
    "emissions-classification",
    "core-standard",
    "scope-boundaries",
    "scope-definition",
    "gl-classification",
    "supplier-classification",
    "validation",
    "direct-emissions",
    "purchased-energy",
    "value-chain"
  ],
  "region": ["GLOBAL"],
  "is_latest": true,
  "source_url": "https://ghgprotocol.org/corporate-standard"
}

Related Issues/PRs:

Additional Context

Use cases this solves:

  1. Multi-document knowledge bases - Different metadata per document based on URL structure
  2. Agent systems - Rich metadata for routing and filtering decisions
  3. Compliance tracking - Version, region, standard classification
  4. Content categorisation - Automatic tagging from document properties
  5. Clean ingestion - onlyMainContent prevents navigation pollution in vector stores
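To make use case 2 concrete, here is a sketch of how an agent could route on per-document metadata such as the `scope` and `is_latest` fields in the example above (the filter shape is generic and illustrative, not tied to Pinecone's or Qdrant's filter syntax):

```typescript
// Subset of the per-document metadata used for routing (see example above).
interface DocMeta {
  scope: string[];
  is_latest: boolean;
}

// Illustrative routing filter: keep only current documents whose scope
// covers the requested emissions scope (or covers everything).
function matchesScope(meta: DocMeta, wantScope: string): boolean {
  return (
    meta.is_latest &&
    (meta.scope.includes(wantScope) || meta.scope.includes("all"))
  );
}
```

Filtering like this only works when each document carries its own metadata, which is exactly what the loader cannot attach today.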

The current workaround is to:

  1. Manually use FireCrawl
  2. Export Markdown or JSON documents from FireCrawl
  3. Upload the documents into a text document loader
  4. Process metadata in the text document loader
  5. Re-import to vector store

This defeats the purpose of visual workflow building.

Metadata

Labels

enhancement (New feature or request)
