The Web Scrape stage extracts content from web URLs using Firecrawl. It handles JavaScript rendering, content extraction, and structured parsing to add web content to your retrieval pipeline.
Stage Category: APPLY (enriches documents with scraped content)
Transformation: N documents → N documents (with web content added)
When to Use
| Use Case | Description |
|---|---|
| URL enrichment | Extract content from URLs in documents |
| Reference expansion | Scrape linked references for context |
| Content aggregation | Pull in external content sources |
| Real-time content | Access current webpage content |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Searching the web | `external_web_search` (Exa) |
| Static content already indexed | Use indexed content |
| High-volume scraping | Pre-index content in Mixpeek |
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url_field` | string | Required | Document field containing the URL to scrape |
| `result_field` | string | `scraped_content` | Field for extracted content |
| `include_markdown` | boolean | `true` | Return content as markdown |
| `include_html` | boolean | `false` | Return raw HTML |
| `include_links` | boolean | `false` | Extract all page links |
| `include_images` | boolean | `false` | Extract image URLs |
| `wait_for` | integer | `0` | Milliseconds to wait for JS rendering |
| `timeout_ms` | integer | `30000` | Request timeout in milliseconds |
Configuration Examples
Basic URL Scraping
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content"
  }
}
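The remaining configurations follow the same stage shape. These are minimal sketches assembled from the parameters documented above; the url_field and result_field values are illustrative.
With JavaScript Rendering
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content",
    "wait_for": 3000
  }
}
Full Content Extraction
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content",
    "include_markdown": true,
    "include_links": true,
    "include_images": true
  }
}
HTML Extraction
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content",
    "include_markdown": false,
    "include_html": true
  }
}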
Output Schema
Markdown Output (default)
{
  "document_id": "doc_123",
  "metadata": {
    "source_url": "https://example.com/article"
  },
  "scraped_content": {
    "markdown": "# Article Title\n\nArticle content here...",
    "title": "Article Title",
    "description": "Meta description",
    "language": "en",
    "status": "success"
  }
}
Full Output (HTML, links, and images)
{
  "document_id": "doc_123",
  "scraped_content": {
    "markdown": "# Article Title\n\n...",
    "html": "<html>...</html>",
    "links": [
      { "text": "Link 1", "href": "https://example.com/link1" },
      { "text": "Link 2", "href": "https://example.com/link2" }
    ],
    "images": [
      { "alt": "Image 1", "src": "https://example.com/img1.jpg" },
      { "alt": "Image 2", "src": "https://example.com/img2.png" }
    ],
    "title": "Article Title",
    "status": "success"
  }
}
Error Case
{
  "document_id": "doc_123",
  "scraped_content": {
    "status": "error",
    "error": "Timeout exceeded",
    "markdown": null
  }
}
Firecrawl Features
| Feature | Description |
|---|---|
| JavaScript rendering | Full browser rendering for SPAs |
| Content extraction | Intelligent main-content detection |
| Markdown conversion | Clean, structured output |
| Anti-bot handling | Bypasses common protections |
Use `wait_for` when scraping JavaScript-heavy sites. Start with 2000-3000 ms and adjust based on page complexity.
Performance
| Metric | Value |
|---|---|
| Latency | 1-10 s (depends on page complexity) |
| Concurrent requests | Up to 5 per pipeline |
| Default timeout | 30 seconds |
| Retry behavior | 2 retries on failure |
Web scraping adds significant latency. Use it sparingly and consider pre-indexing frequently accessed content.
Common Pipeline Patterns
Enrich Documents with Referenced Content
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 10
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.has_url",
        "operator": "eq",
        "value": true
      }
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "web_scrape",
    "parameters": {
      "url_field": "metadata.source_url",
      "result_field": "source_content",
      "wait_for": 2000
    }
  }
]
Scrape and Summarize
[
  {
    "stage_type": "apply",
    "stage_id": "web_scrape",
    "parameters": {
      "url_field": "metadata.url",
      "result_field": "page_content"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "json_transform",
    "parameters": {
      "template": {
        "content": "{{ DOC.page_content.markdown }}",
        "source": "{{ DOC.metadata.url }}"
      }
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Summarize the key points from this webpage"
    }
  }
]
Error Handling
| Error | Behavior |
|---|---|
| Invalid URL | `status: "error"`, pipeline continues |
| Timeout | `status: "error"`, null content |
| 404/403 | `status: "error"`, HTTP status in `error` |
| Rate limited | Retry with backoff |
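Because failed scrapes stay in the pipeline with status: "error", a downstream structured_filter stage can drop them before further processing. A minimal sketch, assuming the default scraped_content result field:
{
  "stage_type": "filter",
  "stage_id": "structured_filter",
  "parameters": {
    "conditions": {
      "field": "scraped_content.status",
      "operator": "eq",
      "value": "success"
    }
  }
}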
Rate Limits and Best Practices
- Batch wisely: Limit to 5-10 URLs per pipeline run
- Cache results: Consider storing scraped content
- Respect robots.txt: Firecrawl handles this automatically
- Use timeouts: Set an appropriate `timeout_ms` for your use case