[Figure: Web scraper extractor pipeline showing crawling, content extraction, chunking, and multimodal embeddings]
The web scraper extractor recursively crawls websites and extracts multimodal content with semantic embeddings. It automatically discovers and extracts text, code blocks, images, and asset links from web pages. Each extracted document receives E5-Large text embeddings (1024D) for semantic search, Jina Code embeddings (768D) for code snippets, and optional SigLIP visual embeddings (768D) for images. The extractor supports JavaScript-rendered SPAs and includes resilience features such as retry logic, proxy rotation, and captcha detection.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/web_scraper_v1, or fetch them programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
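
For example, a minimal fetch of the extractor definition using Python's requests library (the bearer-token header below is an assumption; use whatever auth scheme your Mixpeek account requires):

import requests

# Fetch the web_scraper_v1 extractor definition.
# The Authorization header shown here is illustrative.
resp = requests.get(
    "https://api.mixpeek.com/v1/collections/features/extractors/web_scraper_v1",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())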

Pipeline Steps

  1. Filter Dataset (if collection_id provided)
    • Filter to specified collection
  2. Crawl Configuration & Setup
    • Parse seed URL and configure crawl parameters
    • Set up URL filtering rules, rendering strategy, resilience options
  3. Recursive Web Crawling
    • BFS-based link traversal with depth limit (see the sketch after this list)
    • JavaScript rendering support (auto-detect or explicit)
    • URL filtering (include/exclude patterns)
    • Resilience: retry logic, proxy rotation, captcha detection
  4. Content Extraction Per Page
    • Extract text content, title, metadata
    • Identify and extract code blocks with language detection
    • Discover images with alt text, dimensions
    • Find asset links (PDFs, documents, archives)
    • Optional: Structured extraction via LLM (response_shape)
  5. Content Chunking (optional)
    • Split page content by strategy: sentences, paragraphs, words, characters
    • Configurable chunk size and overlap
    • Track chunk metadata for joined results
  6. Document Expansion
    • Create separate documents for page content, each code block, each image
    • Preserve parent URL and crawl depth metadata
  7. Multi-Modal Embedding Generation
    • E5-Large (1024D) for page text content
    • Jina Code (768D) for code blocks
    • SigLIP (768D) for images (if generate_image_embeddings=true)
  8. Output
    • Documents with text content, code blocks, images
    • Asset links discovered but not crawled
    • Multiple embeddings per document for hybrid search
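
The crawl in step 3 is, in essence, a breadth-first traversal with depth and page budgets. A minimal sketch, assuming a static fetch and naive link extraction (the real extractor also resolves relative links, applies URL filters, and can render JavaScript):

from collections import deque
import re
import requests

def extract_links(html):
    # Naive absolute-URL href extraction; illustrative only.
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(seed_url, max_depth=2, max_pages=50):
    queue = deque([(seed_url, 0)])  # (url, depth) pairs, BFS order
    seen = {seed_url}
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text  # static fetch only
        pages.append((url, depth, html))
        if depth < max_depth:
            for link in extract_links(html):
                if link not in seen:  # dedupe before enqueueing
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages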

When to Use

| Use Case | Description |
| --- | --- |
| API documentation | Index technical documentation with code examples and diagrams |
| Knowledge base crawling | Extract FAQs, guides, and tutorials from support sites |
| Job board scraping | Find job listings with parsed content and structured fields |
| News aggregation | Collect and index articles with multimodal content |
| Competitive analysis | Monitor competitor websites for content changes |
| Open source docs | Index project documentation from GitHub Pages, ReadTheDocs |
| Product research | Gather product information from multiple websites |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Protected/authenticated content | Configure via custom_headers with auth tokens |
| PDF-only extraction | document_graph_extractor (better OCR, layout detection) |
| Social media scraping | Use platform-specific APIs (Twitter API, Instagram Graph API) |
| E-commerce product catalogs | Use platform APIs when available (better data structure) |
| Very large sites (10K+ pages) | Increase max_pages, implement crawl goal filtering |

Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | Seed URL to start crawling from. Example: https://docs.example.com/api/ |
{
  "url": "https://docs.example.com/getting-started"
}
Input Examples:
| Type | Example |
| --- | --- |
| API documentation | https://docs.openai.com/api/ |
| Knowledge base | https://help.example.com/ |
| Blog | https://blog.example.com/ |
| Job board | https://jobs.example.com/listings |

Output Schema

Each crawled page produces one or more documents depending on content extraction and expansion settings:
| Field | Type | Description |
| --- | --- | --- |
| content | string | Extracted text content from the page |
| title | string | Page title (from the <title> tag or heading) |
| page_url | string | Full URL of the crawled page |
| code_blocks | array | Code blocks found on the page (structure: [{language, code, line_start, line_end}]) |
| images | array | Images found on the page (structure: [{src, alt, title, width, height}]) |
| asset_links | array | Downloadable assets discovered (structure: [{url, file_type, link_text, file_extension}]) |
| chunk_index | integer | Position within page chunks (if chunking enabled) |
| total_chunks | integer | Total chunks from this page (if chunking enabled) |
| crawl_depth | integer | Depth from the seed URL (0 = seed, 1 = links from seed, etc.) |
| parent_url | string | Referrer URL (previous page in the crawl path) |
| intfloat__multilingual_e5_large_instruct | float[1024] | E5-Large text embedding, L2 normalized |
| jinaai__jina_embeddings_v2_base_code | float[768] | Jina Code embedding (if code blocks extracted) |
| google__siglip_base_patch16_224 | float[768] | SigLIP visual embedding (if generate_image_embeddings=true) |
{
  "content": "The REST API provides endpoints for creating, reading, updating, and deleting resources...",
  "title": "REST API Overview - Example Docs",
  "page_url": "https://docs.example.com/api/overview",
  "code_blocks": [
    {
      "language": "python",
      "code": "import requests\nresponse = requests.get('https://api.example.com/users')",
      "line_start": 1,
      "line_end": 2
    }
  ],
  "images": [
    {
      "src": "https://docs.example.com/images/api-flow.png",
      "alt": "API request flow diagram",
      "width": 800,
      "height": 600
    }
  ],
  "asset_links": [
    {
      "url": "https://docs.example.com/downloads/openapi.yaml",
      "file_type": "openapi",
      "link_text": "Download OpenAPI Spec",
      "file_extension": "yaml"
    }
  ],
  "crawl_depth": 2,
  "parent_url": "https://docs.example.com/api/",
  "intfloat__multilingual_e5_large_instruct": [0.023, -0.041, 0.018, ...],
  "jinaai__jina_embeddings_v2_base_code": [0.045, -0.023, ...],
  "google__siglip_base_patch16_224": [0.078, -0.091, ...]
}

Parameters

Crawl Configuration Parameters

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| max_depth | integer | 2 | 0-1000 | Maximum link depth from seed (0 = seed URL only, higher = deeper crawl) |
| max_pages | integer | 50 | 1-1000000 | Maximum pages to crawl in a single run |
| crawl_timeout | integer | 300 | 10-3600 | Maximum time for the crawl in seconds (10s - 1h) |
| crawl_mode | enum | "deterministic" | deterministic, semantic | BFS deterministic or LLM-guided semantic crawling |
| crawl_goal | string | null | - | Goal for semantic crawling (e.g., "find all API endpoints"). Used with crawl_mode: semantic |

Rendering Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| render_strategy | enum | "auto" | Rendering method: static (HTML only), javascript (Puppeteer), auto (auto-detect) |

URL Filtering Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| include_patterns | array | null | Regex patterns for URLs to include (whitelist). Example: ["/docs/.*", "/api/.*"] |
| exclude_patterns | array | null | Regex patterns for URLs to exclude (blacklist). Example: ["/admin/.*", ".*logout.*"] |
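
A sketch of how the two pattern lists interact, assuming patterns are applied with an unanchored regex search (whether the extractor anchors its patterns is not documented here):

import re

include_patterns = [r"/docs/.*", r"/api/.*"]
exclude_patterns = [r"/admin/.*", r".*logout.*"]

def should_crawl(url):
    # Exclude wins; include acts as a whitelist only when non-empty.
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(re.search(p, url) for p in include_patterns)
    return True

should_crawl("https://example.com/docs/intro")   # True
should_crawl("https://example.com/admin/users")  # False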

Content Chunking Parameters

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_strategy | enum | "none" | none, sentences, paragraphs, words, characters | How to split page content |
| chunk_size | integer | 500 | 1-10000 | Target size per chunk (units depend on strategy) |
| chunk_overlap | integer | 50 | 0-5000 | Overlap between consecutive chunks |
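
To see how chunk_size and chunk_overlap interact, here is a sketch of the words strategy: each chunk's start advances by chunk_size minus chunk_overlap, so consecutive chunks share that many words (the extractor's exact boundary handling may differ):

def chunk_words(text, chunk_size=500, chunk_overlap=50):
    # Assumes chunk_overlap < chunk_size.
    words = text.split()
    step = chunk_size - chunk_overlap  # stride between chunk starts
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

With the defaults, a 2,000-word page yields chunks starting at words 0, 450, 900, 1350, and 1800.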

Document Identity Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document_id_strategy | enum | "url" | How to generate document IDs: url (unique per page), position (sequential), content (hash-based) |
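
A sketch of what the three strategies trade off; the actual ID format is internal to the extractor, and these hashes are illustrative only:

import hashlib

def document_id(strategy, url, position, content):
    if strategy == "url":        # stable as long as the URL is stable
        return hashlib.sha256(url.encode()).hexdigest()
    if strategy == "position":   # changes if crawl order changes
        return f"doc-{position}"
    if strategy == "content":    # stable across URL moves, changes on edits
        return hashlib.sha256(content.encode()).hexdigest()
    raise ValueError(strategy)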

Embedding Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| generate_text_embeddings | boolean | true | Generate E5-Large text embeddings for page content |
| generate_code_embeddings | boolean | true | Generate Jina Code embeddings for code blocks |
| generate_image_embeddings | boolean | true | Generate SigLIP embeddings for discovered images |

LLM Structured Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| response_shape | string or object | null | Defines structured extraction: a natural-language description or a JSON schema |
| llm_provider | string | null | LLM provider: openai, google, anthropic (required if using response_shape) |
| llm_model | string | null | Specific LLM model (e.g., gpt-4o-mini, gemini-2.5-flash) |
| llm_api_key | string | null | API key (supports secret vault references like ${vault:openai-key}) |
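
As an illustration, a response_shape for the job board use case might look like the following; the field names are hypothetical, and whether the object form must be a full JSON schema is not specified here:

{
  "parameters": {
    "response_shape": {
      "job_title": "string",
      "company": "string",
      "location": "string",
      "salary_range": "string"
    },
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
    "llm_api_key": "${vault:openai-key}"
  }
}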

Resilience: Retry Parameters

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| max_retries | integer | 3 | 0-10 | Maximum retry attempts on request failure |
| retry_base_delay | number | 1.0 | 0.1-30.0 | Base delay for exponential backoff (seconds) |
| retry_max_delay | number | 30.0 | 1.0-300.0 | Maximum delay between retries (seconds) |
| respect_retry_after | boolean | true | - | Respect the Retry-After header from the server |
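
Under the defaults, the delay before each retry follows capped exponential backoff, roughly as sketched below (whether jitter is added is not documented here):

def retry_delay(attempt, base=1.0, max_delay=30.0):
    # attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at max_delay
    return min(base * (2 ** attempt), max_delay)

[retry_delay(a) for a in range(6)]  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]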

Resilience: Proxy Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| proxies | array | null | Proxy URLs for rotation. Example: ["http://proxy1:8080", "http://proxy2:8080"] |
| rotate_proxy_on_error | boolean | true | Rotate proxy when a request fails |
| rotate_proxy_every_n_requests | integer | 0 | Rotate proxy every N requests (0 = no periodic rotation) |
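
A sketch of the two rotation triggers combined; round-robin ordering is an assumption:

class ProxyRotator:
    def __init__(self, proxies, rotate_every_n=0):
        self.proxies = proxies
        self.index = 0
        self.requests_served = 0
        self.rotate_every_n = rotate_every_n

    def rotate(self):
        # Called on request failure, or periodically below.
        self.index = (self.index + 1) % len(self.proxies)

    def current(self):
        self.requests_served += 1
        if self.rotate_every_n and self.requests_served % self.rotate_every_n == 0:
            self.rotate()
        return self.proxies[self.index]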

Resilience: Captcha Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| captcha_service_provider | string | null | Captcha solving service: 2captcha, anti-captcha, capsolver |
| captcha_service_api_key | string | null | API key for the captcha service (supports secret vault references) |
| detect_captcha | boolean | true | Auto-detect captcha challenges and attempt to solve them |

Resilience: Session Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| persist_cookies | boolean | true | Persist cookies across requests within a single crawl |
| custom_headers | object | null | Custom HTTP headers. Example: {"Authorization": "Bearer token", "User-Agent": "Custom"} |

Politeness Parameters

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| delay_between_requests | number | 0.0 | 0.0-60.0 | Delay between consecutive requests (seconds) |

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "web_scraper",
    "version": "v1",
    "input_mappings": {
      "url": "payload.docs_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.vendor" },
      { "source_path": "metadata.product" }
    ],
    "parameters": {
      "max_depth": 2,
      "max_pages": 50,
      "crawl_timeout": 300,
      "render_strategy": "auto",
      "generate_text_embeddings": true,
      "generate_code_embeddings": true,
      "generate_image_embeddings": false,
      "delay_between_requests": 0.5
    }
  }
}
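
A second example, showing LLM-guided semantic crawling combined with paragraph chunking; the parameter values are illustrative choices, not recommendations:

{
  "feature_extractor": {
    "feature_extractor_name": "web_scraper",
    "version": "v1",
    "input_mappings": {
      "url": "payload.docs_url"
    },
    "parameters": {
      "crawl_mode": "semantic",
      "crawl_goal": "find all API endpoints",
      "max_depth": 3,
      "max_pages": 200,
      "include_patterns": ["/api/.*", "/docs/.*"],
      "chunk_strategy": "paragraphs",
      "chunk_size": 500,
      "chunk_overlap": 50,
      "delay_between_requests": 1.0
    }
  }
}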

Performance & Costs

| Metric | Value |
| --- | --- |
| Average page load | 2-5 seconds (depends on page complexity and rendering) |
| Pages per minute | 12-30 pages (with delays and retries) |
| Code block extraction | ~10ms per 1KB of code |
| Image extraction | ~50ms per 10 images |
| Embedding latency | ~5ms per text page (E5), ~10ms per code block (Jina), ~50ms per image (SigLIP) |
| Cost (Tier 3) | 5 credits per page crawled, 1 credit per code block, 2 credits per image |
| Memory usage | ~100MB base + ~1MB per 100 pages in crawl queue |
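
As a worked example under the Tier 3 pricing above, a 50-page crawl that yields 100 code blocks and 150 images would cost roughly 50 × 5 + 100 × 1 + 150 × 2 = 650 credits.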

Vector Indexes

All three embeddings are stored as Qdrant named vectors for hybrid search:
| Property | Value |
| --- | --- |
| Index 1 name | intfloat__multilingual_e5_large_instruct |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |

| Property | Value |
| --- | --- |
| Index 2 name | jinaai__jina_embeddings_v2_base_code |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |

| Property | Value |
| --- | --- |
| Index 3 name | google__siglip_base_patch16_224 |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Status | Optional (populated if generate_image_embeddings=true) |
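
Because these are Qdrant named vectors, each embedding can be queried individually. A minimal sketch with the qdrant-client library, assuming direct access to the underlying Qdrant collection (the endpoint and collection name are placeholders, not Mixpeek defaults):

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint
query_vec = [0.0] * 1024  # replace with a real E5 query embedding

# Search against the E5 named vector; pass a different `using` value
# (with a 768-D query) to search the Jina or SigLIP vectors instead.
hits = client.query_points(
    collection_name="documents",
    query=query_vec,
    using="intfloat__multilingual_e5_large_instruct",
    limit=5,
)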

Comparison with Other Extractors

| Feature | web_scraper | text_extractor | multimodal_extractor | document_graph_extractor |
| --- | --- | --- | --- | --- |
| Input types | URLs (crawling) | Text only | Video, Image, Text | PDF only |
| Recursive crawling | ✅ Yes |  |  |  |
| Code extraction | ✅ Yes |  |  |  |
| Image extraction | ✅ Yes |  | ✅ Yes |  |
| Multimodal embeddings | ✅ Yes | Text only | ✅ Yes | Text only |
| LLM extraction | ✅ Yes |  |  | ✅ Yes |
| Resilience features | ✅ Yes |  |  |  |
| Best for | Web crawling | Text search | Video/image/text | PDF analysis |
| Cost per page | 5-15 credits | Free (text) | 10-50 credits | 5-50 credits |

Resilience & Robustness

The web scraper includes enterprise-grade resilience features:

Retry Strategy

  • Exponential backoff with configurable base and max delays
  • Respects server Retry-After headers
  • Retries on network errors, timeouts, and temporary failures (5xx)

Proxy Rotation

  • Support for multiple proxies with automatic rotation
  • Rotate on error or periodic rotation every N requests
  • Helps avoid rate limiting and IP bans

Captcha Detection & Solving

  • Auto-detect common captcha types (reCAPTCHA, hCaptcha)
  • Integration with 2captcha, Anti-Captcha, CapSolver services
  • Fallback to manual review if solving fails

Session Management

  • Persistent cookies across requests within a single crawl
  • Custom HTTP headers for authentication
  • Support for API key and bearer token injection

URL Filtering

  • Include patterns (whitelist): Only crawl matching URLs
  • Exclude patterns (blacklist): Skip URLs matching patterns
  • Prevent crawling auth/admin pages, search results, etc.

Limitations

  • Content-only crawling: Does not execute custom JavaScript actions (clicking, form submission, scrolling)
  • Authentication: Limited to HTTP headers (Bearer tokens, API keys). No interactive login flows.
  • Dynamic content: JavaScript rendering adds 2-3x latency per page
  • Large sites: 10K+ page sites may require high max_pages and long timeouts
  • Robots.txt: Does not parse robots.txt; respect site policies manually via delay_between_requests and max_pages
  • Rate limiting: May be blocked by aggressive rate limiting; use proxies and delays