The Web Scrape stage extracts content from web URLs using Firecrawl. It handles JavaScript rendering, content extraction, and structured parsing to add web content to your retrieval pipeline.
Stage Category: APPLY (enriches documents with scraped content)
Transformation: N documents → N documents (with web content added)
When to Use
| Use Case | Description |
|---|---|
| URL enrichment | Extract content from URLs in documents |
| Reference expansion | Scrape linked references for context |
| Content aggregation | Pull in external content sources |
| Real-time content | Access current webpage content |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Searching the web | `external_web_search` (Exa) |
| Static content already indexed | Use indexed content |
| High-volume scraping | Pre-index content in Mixpeek |
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url_field` | string | Required | Document field containing the URL to scrape |
| `result_field` | string | `scraped_content` | Field for extracted content |
| `include_markdown` | boolean | `true` | Return content as markdown |
| `include_html` | boolean | `false` | Return raw HTML |
| `include_links` | boolean | `false` | Extract all page links |
| `include_images` | boolean | `false` | Extract image URLs |
| `wait_for` | integer | `0` | Milliseconds to wait for JS rendering |
| `timeout_ms` | integer | `30000` | Request timeout in milliseconds |
Configuration Examples
Basic URL Scraping
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content"
  }
}
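The remaining configurations follow the same stage shape. These are minimal sketches assembled from the parameters documented above; the url_field and result_field values are illustrative.
With JavaScript Rendering
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content",
    "wait_for": 3000
  }
}
Full Content Extraction
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content",
    "include_markdown": true,
    "include_links": true,
    "include_images": true
  }
}
HTML Extraction
{
  "stage_type": "apply",
  "stage_id": "web_scrape",
  "parameters": {
    "url_field": "metadata.source_url",
    "result_field": "page_content",
    "include_markdown": false,
    "include_html": true
  }
}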
Output Schema
Markdown Output (default)
{
  "document_id": "doc_123",
  "metadata": {
    "source_url": "https://example.com/article"
  },
  "scraped_content": {
    "markdown": "# Article Title\n\nArticle content here...",
    "title": "Article Title",
    "description": "Meta description",
    "language": "en",
    "status": "success"
  }
}
Full Output (HTML, links, and images)
{
  "document_id": "doc_123",
  "scraped_content": {
    "markdown": "# Article Title\n\n...",
    "html": "<html>...</html>",
    "links": [
      { "text": "Link 1", "href": "https://example.com/link1" },
      { "text": "Link 2", "href": "https://example.com/link2" }
    ],
    "images": [
      { "alt": "Image 1", "src": "https://example.com/img1.jpg" },
      { "alt": "Image 2", "src": "https://example.com/img2.png" }
    ],
    "title": "Article Title",
    "status": "success"
  }
}
Error Case
{
  "document_id": "doc_123",
  "scraped_content": {
    "status": "error",
    "error": "Timeout exceeded",
    "markdown": null
  }
}
Firecrawl Features
| Feature | Description |
|---|---|
| JavaScript rendering | Full browser rendering for SPAs |
| Content extraction | Intelligent main-content detection |
| Markdown conversion | Clean, structured output |
| Anti-bot handling | Bypasses common protections |
Use `wait_for` when scraping JavaScript-heavy sites. Start with 2000-3000 ms and adjust based on page complexity.
Performance
| Metric | Value |
|---|---|
| Latency | 1-10 s (depends on page complexity) |
| Concurrent requests | Up to 5 per pipeline |
| Default timeout | 30 seconds |
| Retry behavior | 2 retries on failure |
Web scraping adds significant latency. Use it sparingly and consider pre-indexing frequently accessed content.
Common Pipeline Patterns
Enrich Documents with Referenced Content
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 10
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.has_url",
        "operator": "eq",
        "value": true
      }
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "web_scrape",
    "parameters": {
      "url_field": "metadata.source_url",
      "result_field": "source_content",
      "wait_for": 2000
    }
  }
]
Scrape and Summarize
[
  {
    "stage_type": "apply",
    "stage_id": "web_scrape",
    "parameters": {
      "url_field": "metadata.url",
      "result_field": "page_content"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "json_transform",
    "parameters": {
      "template": {
        "content": "{{ DOC.page_content.markdown }}",
        "source": "{{ DOC.metadata.url }}"
      }
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Summarize the key points from this webpage"
    }
  }
]
Error Handling
| Error | Behavior |
|---|---|
| Invalid URL | `status: "error"`, pipeline continues |
| Timeout | `status: "error"`, null content |
| 404/403 | `status: "error"`, HTTP status in `error` |
| Rate limited | Retry with backoff |
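Because failed scrapes stay in the pipeline with status: "error", a downstream structured_filter stage can drop them before further processing. A minimal sketch, assuming the default scraped_content result field:
{
  "stage_type": "filter",
  "stage_id": "structured_filter",
  "parameters": {
    "conditions": {
      "field": "scraped_content.status",
      "operator": "eq",
      "value": "success"
    }
  }
}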
Rate Limits and Best Practices
- Batch wisely: Limit to 5-10 URLs per pipeline run
- Cache results: Consider storing scraped content
- Respect robots.txt: Firecrawl handles this automatically
- Use timeouts: Set an appropriate `timeout_ms` for your use case