(Figure: Web Scrape stage showing Firecrawl content extraction from URLs)
The Web Scrape stage extracts content from web URLs using Firecrawl. It handles JavaScript rendering, content extraction, and structured parsing to add web content to your retrieval pipeline.
**Stage Category:** APPLY (enriches documents with scraped content)
**Transformation:** N documents → N documents (with web content added)

When to Use

| Use Case | Description |
| --- | --- |
| URL enrichment | Extract content from URLs in documents |
| Reference expansion | Scrape linked references for context |
| Content aggregation | Pull in external content sources |
| Real-time content | Access current webpage content |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Searching the web | `external_web_search` (Exa) |
| Static content already indexed | Use the indexed content |
| High-volume scraping | Pre-index content in Mixpeek |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url_field` | string | Required | Document field containing the URL to scrape |
| `result_field` | string | `scraped_content` | Field for the extracted content |
| `include_markdown` | boolean | `true` | Return content as markdown |
| `include_html` | boolean | `false` | Return raw HTML |
| `include_links` | boolean | `false` | Extract all page links |
| `include_images` | boolean | `false` | Extract image URLs |
| `wait_for` | integer | `0` | Milliseconds to wait for JS rendering |
| `timeout_ms` | integer | `30000` | Request timeout in milliseconds |

Configuration Examples

{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content"
  }
}
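
For fuller extraction, the optional `include_*` flags can be combined. A sketch that also captures page links and images (flag names are from the parameters table above; field values are illustrative, not recommendations):

{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content",
    "include_links": true,
    "include_images": true
  }
}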

Output Schema

Markdown Output (default)

{
  "document_id": "doc_123",
  "metadata": {
    "source_url": "https://example.com/article"
  },
  "scraped_content": {
    "markdown": "# Article Title\n\nArticle content here...",
    "title": "Article Title",
    "description": "Meta description",
    "language": "en",
    "status": "success"
  }
}

Full Extraction

{
  "document_id": "doc_123",
  "scraped_content": {
    "markdown": "# Article Title\n\n...",
    "html": "<html>...</html>",
    "links": [
      {"text": "Link 1", "href": "https://example.com/link1"},
      {"text": "Link 2", "href": "https://example.com/link2"}
    ],
    "images": [
      {"alt": "Image 1", "src": "https://example.com/img1.jpg"},
      {"alt": "Image 2", "src": "https://example.com/img2.png"}
    ],
    "title": "Article Title",
    "status": "success"
  }
}

Error Case

{
  "document_id": "doc_123",
  "scraped_content": {
    "status": "error",
    "error": "Timeout exceeded",
    "markdown": null
  }
}

Firecrawl Features

| Feature | Description |
| --- | --- |
| JavaScript rendering | Full browser rendering for SPAs |
| Content extraction | Intelligent main-content detection |
| Markdown conversion | Clean, structured output |
| Anti-bot handling | Bypasses common protections |

Use `wait_for` when scraping JavaScript-heavy sites. Start with 2000-3000 ms and adjust based on page complexity.
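
For example, a stage aimed at a JavaScript-heavy page might start in that range (the 2500 ms value below is a starting point to tune, not a benchmark):

{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "scraped_content",
    "wait_for": 2500
  }
}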

Performance

| Metric | Value |
| --- | --- |
| Latency | 1-10 s (depends on page complexity) |
| Concurrent requests | Up to 5 per pipeline |
| Timeout default | 30 seconds |
| Retry behavior | 2 retries on failure |

Web scraping adds significant latency. Use it sparingly and consider pre-indexing frequently accessed content.

Common Pipeline Patterns

Enrich Documents with Referenced Content

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 10
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.has_url",
        "operator": "eq",
        "value": true
      }
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "web_scrape",
    "parameters": {
      "url_field": "metadata.source_url",
      "result_field": "source_content",
      "wait_for": 2000
    }
  }
]

Scrape and Summarize

[
  {
    "stage_type": "apply",
    "stage_id": "web_scrape",
    "parameters": {
      "url_field": "metadata.url",
      "result_field": "page_content"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "json_transform",
    "parameters": {
      "template": {
        "content": "{{ DOC.page_content.markdown }}",
        "source": "{{ DOC.metadata.url }}"
      }
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Summarize the key points from this webpage"
    }
  }
]

Error Handling

| Error | Behavior |
| --- | --- |
| Invalid URL | `status: "error"`, pipeline continues |
| Timeout | `status: "error"`, null content |
| 404/403 | `status: "error"`, HTTP status in the `error` field |
| Rate limited | Retry with backoff |
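
Because failed scrapes keep `status: "error"` and flow on through the pipeline, a downstream `structured_filter` stage can drop them before later stages consume the content. A sketch, assuming the `conditions` syntax shown in Common Pipeline Patterns also applies to scraped fields:

{
  "stage_type": "filter",
  "stage_id": "structured_filter",
  "parameters": {
    "conditions": {
      "field": "scraped_content.status",
      "operator": "eq",
      "value": "success"
    }
  }
}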

Rate Limits and Best Practices

  1. **Batch wisely:** Limit runs to 5-10 URLs per pipeline.
  2. **Cache results:** Consider storing scraped content rather than re-fetching it.
  3. **Respect robots.txt:** Firecrawl handles this automatically.
  4. **Use timeouts:** Set an appropriate `timeout_ms` for your use case, as in the sketch below.
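
A tighter timeout for a latency-sensitive pipeline might look like this (the 10000 ms value is illustrative; tune it to the pages you scrape):

{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "scraped_content",
    "timeout_ms": 10000
  }
}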