View extractor details at api.mixpeek.com/v1/collections/features/extractors/web_scraper_v1 or fetch them programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
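For example, a minimal sketch of the programmatic fetch using Python's requests library (the Authorization header format is an assumption; adapt it to your account's authentication scheme):

```python
import requests

# Hypothetical sketch: fetch the web_scraper_v1 extractor definition.
# The Bearer-token header is an assumption, not a documented requirement.
resp = requests.get(
    "https://api.mixpeek.com/v1/collections/features/extractors/web_scraper_v1",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
extractor = resp.json()
print(extractor)
```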
Pipeline Steps
1. Filter Dataset (if collection_id provided)
   - Filter to specified collection
2. Crawl Configuration & Setup
   - Parse seed URL and configure crawl parameters
   - Set up URL filtering rules, rendering strategy, resilience options
3. Recursive Web Crawling
   - BFS-based link traversal with depth limit
   - JavaScript rendering support (auto-detect or explicit)
   - URL filtering (include/exclude patterns)
   - Resilience: retry logic, proxy rotation, captcha detection
4. Content Extraction Per Page
   - Extract text content, title, metadata
   - Identify and extract code blocks with language detection
   - Discover images with alt text, dimensions
   - Find asset links (PDFs, documents, archives)
   - Optional: structured extraction via LLM (response_shape)
5. Content Chunking (optional)
   - Split page content by strategy: sentences, paragraphs, words, characters
   - Configurable chunk size and overlap
   - Track chunk metadata for joined results
6. Document Expansion
   - Create separate documents for page content, each code block, and each image
   - Preserve parent URL and crawl depth metadata
7. Multi-Modal Embedding Generation
   - E5-Large (1024D) for page text content
   - Jina Code (768D) for code blocks
   - SigLIP (768D) for images (if generate_image_embeddings=true)
8. Output
   - Documents with text content, code blocks, and images
   - Asset links discovered but not crawled
   - Multiple embeddings per document for hybrid search
When to Use
| Use Case | Description |
|---|---|
| API documentation | Index technical documentation with code examples and diagrams |
| Knowledge base crawling | Extract FAQs, guides, and tutorials from support sites |
| Job board scraping | Find job listings with parsed content and structured fields |
| News aggregation | Collect and index articles with multimodal content |
| Competitive analysis | Monitor competitor websites for content changes |
| Open source docs | Index project documentation from GitHub Pages, ReadTheDocs |
| Product research | Gather product information from multiple websites |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Protected/authenticated content | Configure via custom_headers with auth tokens |
| PDF-only extraction | document_graph_extractor (better OCR, layout detection) |
| Social media scraping | Use platform-specific APIs (Twitter API, Instagram Graph API) |
| E-commerce product catalogs | Use platform APIs when available (better data structure) |
| Very large sites (10K+ pages) | Increase max_pages, implement crawl goal filtering |
Input Schema
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Seed URL to start crawling from. Example: https://docs.example.com/api/ |

Example seed URLs by site type:

| Type | Example |
|---|---|
| API documentation | https://docs.openai.com/api/ |
| Knowledge base | https://help.example.com/ |
| Blog | https://blog.example.com/ |
| Job board | https://jobs.example.com/listings |
Output Schema
Each crawled page produces one or more documents depending on content extraction and expansion settings:

| Field | Type | Description |
|---|---|---|
| content | string | Extracted text content from page |
| title | string | Page title (from <title> tag or heading) |
| page_url | string | Full URL of crawled page |
| code_blocks | array | Code blocks found on page (structure: [{language, code, line_start, line_end}]) |
| images | array | Images found on page (structure: [{src, alt, title, width, height}]) |
| asset_links | array | Downloadable assets discovered (structure: [{url, file_type, link_text, file_extension}]) |
| chunk_index | integer | Position within page chunks (if chunking enabled) |
| total_chunks | integer | Total chunks from this page (if chunking enabled) |
| crawl_depth | integer | Depth from seed URL (0 = seed, 1 = links from seed, etc.) |
| parent_url | string | Referrer URL (previous page in crawl path) |
| intfloat__multilingual_e5_large_instruct | float[1024] | E5-Large text embedding, L2 normalized |
| jinaai__jina_embeddings_v2_base_code | float[768] | Jina Code embedding (if code blocks extracted) |
| google__siglip_base_patch16_224 | float[768] | SigLIP visual embedding (if generate_image_embeddings=true) |
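For illustration, a single output document might look like the following sketch (all field values are invented, and the embedding vectors are truncated):

```python
# Hypothetical example of one output document; values are invented for illustration.
document = {
    "content": "The /v1/search endpoint accepts a query string and returns ranked results...",
    "title": "Search API Reference",
    "page_url": "https://docs.example.com/api/search",
    "code_blocks": [
        {"language": "python", "code": "resp = client.search(query='shoes')",
         "line_start": 12, "line_end": 12},
    ],
    "images": [
        {"src": "https://docs.example.com/img/arch.png", "alt": "Architecture diagram",
         "title": None, "width": 1200, "height": 630},
    ],
    "asset_links": [
        {"url": "https://docs.example.com/files/openapi.pdf", "file_type": "document",
         "link_text": "Download spec", "file_extension": "pdf"},
    ],
    "chunk_index": 0,
    "total_chunks": 3,
    "crawl_depth": 1,
    "parent_url": "https://docs.example.com/api/",
    "intfloat__multilingual_e5_large_instruct": [0.012, -0.034, ...],  # 1024 floats
    "jinaai__jina_embeddings_v2_base_code": [0.101, 0.007, ...],       # 768 floats
    "google__siglip_base_patch16_224": [-0.055, 0.021, ...],           # 768 floats
}
```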
Parameters
Crawl Configuration Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| max_depth | integer | 2 | 0-1000 | Maximum link depth from seed (0 = seed URL only, higher = deeper crawl) |
| max_pages | integer | 50 | 1-1000000 | Maximum pages to crawl in a single run |
| crawl_timeout | integer | 300 | 10-3600 | Maximum time for crawl in seconds (10s - 1h) |
| crawl_mode | enum | "deterministic" | deterministic, semantic | BFS deterministic or LLM-guided semantic crawling |
| crawl_goal | string | null | - | Goal for semantic crawling (e.g., "find all API endpoints"). Used with crawl_mode: semantic |
Rendering Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| render_strategy | enum | "auto" | Rendering method: static (HTML only), javascript (Puppeteer), auto (auto-detect) |
URL Filtering Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| include_patterns | array | null | Regex patterns for URLs to include (whitelist). Example: ["/docs/.*", "/api/.*"] |
| exclude_patterns | array | null | Regex patterns for URLs to exclude (blacklist). Example: ["/admin/.*", ".*logout.*"] |
Content Chunking Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| chunk_strategy | enum | "none" | none, sentences, paragraphs, words, characters | How to split page content |
| chunk_size | integer | 500 | 1-10000 | Target size per chunk (units depend on strategy) |
| chunk_overlap | integer | 50 | 0-5000 | Overlap between consecutive chunks |
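As a rough illustration of how size and overlap interact, a word-based strategy could be approximated with the sliding window below; this is a simplified sketch of the assumed behavior, not the extractor's actual algorithm:

```python
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Illustrative sliding-window chunking by words (assumed behavior).
    Each chunk holds chunk_size words; the next chunk starts
    chunk_size - chunk_overlap words later."""
    words = text.split()
    step = max(chunk_size - chunk_overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Under this interpretation, a 2,000-word page with chunk_size=500 and
# chunk_overlap=50 yields 5 chunks (a new chunk starts every 450 words).
```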
Document Identity Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_id_strategy | enum | "url" | How to generate document IDs: url (unique per page), position (sequential), content (hash-based) |
Embedding Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| generate_text_embeddings | boolean | true | Generate E5-Large text embeddings for page content |
| generate_code_embeddings | boolean | true | Generate Jina Code embeddings for code blocks |
| generate_image_embeddings | boolean | true | Generate SigLIP embeddings for discovered images |
LLM Structured Extraction Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| response_shape | string or object | null | Define structured extraction: natural language description or JSON schema |
| llm_provider | string | null | LLM provider: openai, google, anthropic (required if using response_shape) |
| llm_model | string | null | Specific LLM model (e.g., gpt-4o-mini, gemini-2.5-flash) |
| llm_api_key | string | null | API key (supports secret vault references like ${vault:openai-key}) |
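A hypothetical parameter set for structured extraction is sketched below; the parameter names come from the table above, while the job-listing response_shape is invented for illustration:

```python
# Hypothetical structured-extraction settings; the response_shape value is an
# invented example of a natural-language shape description.
llm_extraction_params = {
    "response_shape": "For each job listing extract: title, company, location, salary_range",
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini",
    "llm_api_key": "${vault:openai-key}",  # secret vault reference, per the table above
}
```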
Resilience: Retry Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| max_retries | integer | 3 | 0-10 | Maximum retry attempts on request failure |
| retry_base_delay | number | 1.0 | 0.1-30.0 | Base delay for exponential backoff (seconds) |
| retry_max_delay | number | 30.0 | 1.0-300.0 | Maximum delay between retries (seconds) |
| respect_retry_after | boolean | true | - | Respect Retry-After header from server |
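The exact backoff formula is not documented; a common interpretation of these parameters is base-delay doubling capped at the maximum, as in this sketch:

```python
# Assumed exponential-backoff schedule: retry_base_delay * 2**attempt, capped at
# retry_max_delay. This is an illustrative interpretation, not the documented formula.
retry_base_delay, retry_max_delay, max_retries = 1.0, 30.0, 3

for attempt in range(max_retries):
    delay = min(retry_base_delay * 2 ** attempt, retry_max_delay)
    print(f"retry {attempt + 1}: wait {delay:.1f}s")
# With the defaults: retry 1 waits 1.0s, retry 2 waits 2.0s, retry 3 waits 4.0s.
```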
Resilience: Proxy Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| proxies | array | null | Proxy URLs for rotation. Example: ["http://proxy1:8080", "http://proxy2:8080"] |
| rotate_proxy_on_error | boolean | true | Rotate proxy when request fails |
| rotate_proxy_every_n_requests | integer | 0 | Rotate proxy every N requests (0 = no periodic rotation) |
Resilience: Captcha Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| captcha_service_provider | string | null | Captcha solving service: 2captcha, anti-captcha, capsolver |
| captcha_service_api_key | string | null | API key for captcha service (supports secret vault references) |
| detect_captcha | boolean | true | Auto-detect captcha challenges and attempt to solve |
Resilience: Session Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| persist_cookies | boolean | true | Persist cookies across requests within single crawl |
| custom_headers | object | null | Custom HTTP headers. Example: {"Authorization": "Bearer token", "User-Agent": "Custom"} |
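For crawling content behind header-based authentication (see the note on protected content in "When NOT to Use"), a hypothetical session configuration might look like this, with placeholder header values:

```python
# Hypothetical session settings for an authenticated crawl; header values are placeholders.
session_params = {
    "persist_cookies": True,
    "custom_headers": {
        "Authorization": "Bearer YOUR_ACCESS_TOKEN",
        "User-Agent": "MyCompanyDocsCrawler/1.0",
    },
}
```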
Politeness Parameters
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| delay_between_requests | number | 0.0 | 0.0-60.0 | Delay between consecutive requests (seconds) |
Configuration Examples
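The sketch below combines the parameters documented above into a plausible configuration for crawling API documentation. It is illustrative only; how these parameters are wrapped into a request depends on how the extractor is attached to your collection, which is not shown here.

```python
# Hypothetical web_scraper parameter set for crawling API documentation.
# Parameter names and value ranges come from the tables above.
web_scraper_params = {
    "url": "https://docs.example.com/api/",
    # Crawl configuration
    "max_depth": 3,
    "max_pages": 200,
    "crawl_timeout": 900,
    "crawl_mode": "deterministic",
    # Rendering and URL filtering
    "render_strategy": "auto",
    "include_patterns": ["/docs/.*", "/api/.*"],
    "exclude_patterns": ["/admin/.*", ".*logout.*"],
    # Content chunking
    "chunk_strategy": "paragraphs",
    "chunk_size": 500,
    "chunk_overlap": 50,
    # Embeddings
    "generate_text_embeddings": True,
    "generate_code_embeddings": True,
    "generate_image_embeddings": False,
    # Politeness
    "delay_between_requests": 1.0,
}
```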
Performance & Costs
| Metric | Value |
|---|---|
| Average page load | 2-5 seconds (depends on page complexity and rendering) |
| Pages per minute | 12-30 pages (with delays and retries) |
| Code block extraction | ~10ms per 1KB of code |
| Image extraction | ~50ms per 10 images |
| Embedding latency | ~5ms per text page (E5), ~10ms per code block (Jina), ~50ms per image (SigLIP) |
| Cost (Tier 3) | 5 credits per page crawled, 1 credit per code block, 2 credits per image |
| Memory usage | ~100MB base + ~1MB per 100 pages in crawl queue |
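As a rough illustration of the Tier 3 rates above, a crawl of 100 pages averaging two code blocks and three images per page would cost about (100 × 5) + (200 × 1) + (300 × 2) = 1,300 credits; actual usage varies with page content and which embeddings are enabled.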
Vector Indexes
All three embeddings are stored as Qdrant named vectors for hybrid search:

| Property | Value |
|---|---|
| Index 1 name | intfloat__multilingual_e5_large_instruct |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |

| Property | Value |
|---|---|
| Index 2 name | jinaai__jina_embeddings_v2_base_code |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |

| Property | Value |
|---|---|
| Index 3 name | google__siglip_base_patch16_224 |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Status | Optional (if generate_image_embeddings=true) |
Comparison with Other Extractors
| Feature | web_scraper | text_extractor | multimodal_extractor | document_graph_extractor |
|---|---|---|---|---|
| Input types | URLs (crawling) | Text only | Video, Image, Text | PDF only |
| Recursive crawling | ✅ Yes | ✗ | ✗ | ✗ |
| Code extraction | ✅ Yes | ✗ | ✗ | ✗ |
| Image extraction | ✅ Yes | ✗ | ✅ Yes | ✗ |
| Multimodal embeddings | ✅ Yes | Text only | ✅ Yes | Text only |
| LLM extraction | ✅ Yes | ✅ Yes | ✗ | ✗ |
| Resilience features | ✅ Yes | ✗ | ✗ | ✗ |
| Best for | Web crawling | Text search | Video/image/text | PDF analysis |
| Cost per page | 5-15 credits | Free (text) | 10-50 credits | 5-50 credits |
Resilience & Robustness
The web scraper includes enterprise-grade resilience features:
Retry Strategy
- Exponential backoff with configurable base and max delays
- Respects server Retry-After headers
- Retries on network errors, timeouts, and temporary failures (5xx)
Proxy Rotation
- Support for multiple proxies with automatic rotation
- Rotate on error or periodic rotation every N requests
- Helps avoid rate limiting and IP bans
Captcha Detection & Solving
- Auto-detect common captcha types (reCAPTCHA, hCaptcha)
- Integration with 2captcha, Anti-Captcha, CapSolver services
- Fallback to manual review if solving fails
Session Management
- Persistent cookies across requests within a single crawl
- Custom HTTP headers for authentication
- Support for API key and bearer token injection
URL Filtering
- Include patterns (whitelist): Only crawl matching URLs
- Exclude patterns (blacklist): Skip URLs matching patterns
- Prevent crawling auth/admin pages, search results, etc.
Limitations
- Content-only crawling: Does not execute custom JavaScript actions (clicking, form submission, scrolling)
- Authentication: Limited to HTTP headers (Bearer tokens, API keys). No interactive login flows.
- Dynamic content: JavaScript rendering adds 2-3x latency per page
- Large sites: 10K+ page sites may require a high max_pages and long timeouts
- Robots.txt: Does not parse robots.txt; crawl politely via delay_between_requests and max_pages
- Rate limiting: May be blocked by aggressive rate limiting; use proxies and delays

