[Figure: course content extractor pipeline showing video segmentation, PDF extraction, code decomposition, and multimodal embeddings]
The course content extractor decomposes educational content into atomic learning units optimized for semantic retrieval. It processes video lectures with automatic transcription, PDF slides with layout-aware extraction, and code archives with function-level granularity. Each unit receives an E5-Large text embedding (1024D), a Jina Code embedding (768D) for code snippets, and an optional SigLIP visual embedding (768D) for figures and screenshots.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/course_content_extractor_v1, or fetch them programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
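
A minimal sketch of the programmatic fetch using plain HTTP. The endpoint comes from the line above; the Bearer-token auth header is an assumption, so check your workspace's credential scheme:

```python
import requests

# Assumed auth scheme; substitute whatever your API credentials require.
headers = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get(
    "https://api.mixpeek.com/v1/collections/features/extractors/course_content_extractor_v1",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
extractor = resp.json()
print(extractor.get("feature_extractor_name"), extractor.get("version"))
```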

Pipeline Steps

  1. Filter Dataset (if collection_id provided)
    • Filter to specified collection
  2. Content Detection & Routing
    • Auto-detect content type: video, PDF, or code archive
    • Route to appropriate processor
  3. Video Segmentation (if video input)
    • Scene-based, SRT subtitle-based, or fixed-interval segmentation
    • Extract transcripts via Whisper ASR (or use provided SRT)
    • OCR video frames for screen text detection
  4. PDF Decomposition (if PDF input)
    • Layout detection: paragraphs, headers, tables, lists, figures, code blocks
    • Layout-aware extraction per element or per page
    • Extract images and figures with bounding boxes
  5. Code Archive Processing (if code input)
    • Extract source files from ZIP archive
    • Segment code into individual functions/classes
    • Auto-detect programming language
  6. Multi-Modal Embedding Generation
    • E5-Large (1024D) for transcripts, PDF text, and captions
    • Jina Code v2 (768D) for code snippets and functions
    • SigLIP (768D) for figures, screenshots, diagrams (optional)
  7. LLM Enrichment (optional: if enrich_with_llm=true)
    • Generate summaries using Gemini
    • Add semantic context and key concepts
  8. Output
    • Learning units with text_content, code_content, screen_text
    • Layout types, timing info, language tags
    • Multiple embeddings per unit for diverse search scenarios

When to Use

| Use Case | Description |
| --- | --- |
| Online courses | Extract lectures, slides, and code into searchable learning units |
| Technical documentation | Decompose guides with code examples into semantic chunks |
| Code tutorials | Segment video + PDF + code into aligned learning units |
| Educational archives | Index historical lecture materials with multiple content types |
| Multilingual learning | Process educational content across 100+ languages |
| API documentation | Extract text, code examples, and diagrams with visual search |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Simple text documents only | text_extractor (faster, simpler) |
| Images and photos only | image_extractor |
| Single PDF documents | document_graph_extractor (better OCR, confidence scoring) |
| Pre-transcribed videos | text_extractor (use transcripts directly) |

Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| video | string | one of three | URL or S3 path to a video file (MP4, WebM, MOV). Maximum: 4 hours. Format is auto-detected. |
| srt | string | optional (with video) | URL or S3 path to an SRT subtitle file. Used if present; otherwise Whisper ASR generates transcripts. |
| pdf | string | one of three | URL or S3 path to a PDF document. Multi-page supported. Maximum: 500 pages. |
| code_archive | string | one of three | URL or S3 path to a ZIP archive containing source code. Maximum: 100MB. |

Exactly one of video, pdf, or code_archive must be provided.

```json
{
  "video": "s3://my-bucket/lectures/intro-to-ml.mp4",
  "srt": "s3://my-bucket/lectures/intro-to-ml.srt"
}
```
Input Examples:
| Type | Example |
| --- | --- |
| Video with subtitles | `{"video": "https://cdn.example.com/lecture.mp4", "srt": "https://cdn.example.com/lecture.srt"}` |
| PDF slides | `{"pdf": "s3://courses/machine-learning/slides-week-1.pdf"}` |
| Code archive | `{"code_archive": "s3://tutorials/python-algorithms.zip"}` |

Output Schema

Each learning unit produces one or more documents depending on content type and expand_to_granular_docs setting:
| Field | Type | Description |
| --- | --- | --- |
| unit_type | string | Type of unit: video_segment, pdf_element, code_function, screen_text, figure |
| doc_type | string | Granular type: transcript, code, screen_text, visual, paragraph, table, list, header, figure |
| text_content | string | Extracted text content |
| code_content | string | Source code (if applicable) |
| code_language | string | Programming language (Python, JavaScript, Java, etc.) |
| screen_text | string | OCR text from video frames or PDF screenshots |
| title | string | Unit title (lecture title, function name, figure caption) |
| start_time | number | Video start time in seconds (video units only) |
| end_time | number | Video end time in seconds (video units only) |
| page_number | integer | PDF page number (0-indexed, PDF units only) |
| element_index | integer | Element position within page (PDF units only) |
| start_line | integer | Start line number (code units only) |
| end_line | integer | End line number (code units only) |
| segment_index | integer | Segment position within source (video units only) |
| element_type | string | PDF layout type: paragraph, header, list, table, figure, code, footer |
| bbox | object | Bounding box {x, y, width, height} (PDF elements with visual positioning) |
| thumbnail_url | string | S3 URL of thumbnail image (video frames, figure screenshots) |
| intfloat__multilingual_e5_large_instruct | float[1024] | E5-Large text embedding, L2 normalized |
| jinaai__jina_embeddings_v2_base_code | float[768] | Jina Code embedding (code units only) |
| google__siglip_base_patch16_224 | float[768] | SigLIP visual embedding (if run_visual_embedding=true) |
| llm_summary | string | LLM-generated summary (if enrich_with_llm=true) |

```json
{
  "unit_type": "video_segment",
  "doc_type": "transcript",
  "text_content": "In this section, we explore supervised learning algorithms...",
  "screen_text": "SUPERVISED LEARNING\n- Regression\n- Classification",
  "title": "Intro to ML: Supervised Learning",
  "start_time": 120.5,
  "end_time": 245.3,
  "segment_index": 3,
  "thumbnail_url": "s3://mixpeek/ns_123/thumbnails/seg_3.jpg",
  "intfloat__multilingual_e5_large_instruct": [0.023, -0.041, 0.018, ...],
  "llm_summary": "Introduction to supervised learning covering regression and classification techniques"
}
```
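
A quick sketch of grouping returned learning units by unit_type; the field names follow the output schema above, and the inline list just stands in for documents retrieved from a collection:

```python
from collections import defaultdict

# Stand-in for learning-unit documents retrieved from a collection.
units = [
    {"unit_type": "video_segment", "doc_type": "transcript", "start_time": 120.5},
    {"unit_type": "code_function", "doc_type": "code", "code_language": "python"},
    {"unit_type": "pdf_element", "doc_type": "table", "page_number": 4},
]

by_type = defaultdict(list)
for unit in units:
    by_type[unit["unit_type"]].append(unit)

for unit_type, members in sorted(by_type.items()):
    print(f"{unit_type}: {len(members)} unit(s)")
```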

Parameters

Video Segmentation Parameters

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| target_segment_duration_ms | integer | 120000 | 30000-600000 | Target duration for each video segment (30 sec - 10 min) |
| min_segment_duration_ms | integer | 30000 | 10000+ | Minimum segment duration to create |
| segmentation_method | string | "scene" | scene, srt, time | Segmentation strategy: scene detection, SRT markers, or fixed time intervals |
| scene_detection_threshold | float | 0.3 | 0.1-0.9 | Scene change sensitivity (lower = more scenes detected) |
| use_whisper_asr | boolean | true | - | Use Whisper ASR for transcription if SRT not provided |
| expand_to_granular_docs | boolean | true | - | Create separate documents for transcript, screen_text, and visual (one per granularity type) |
| ocr_frames_per_segment | integer | 3 | 1-10 | Number of frames to OCR per segment |
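
When use_whisper_asr is enabled and no SRT is supplied, transcription happens inside the pipeline. As a local approximation with the open-source openai-whisper package (the model size used by the service is not documented, so "base" below is an assumption):

```python
import whisper  # pip install openai-whisper (requires ffmpeg)

# Model size is an illustrative assumption, not the service's actual choice.
model = whisper.load_model("base")

result = model.transcribe("lecture.mp4")
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text'].strip()}")
```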

Segmentation Methods

| Method | Description | Best For |
| --- | --- | --- |
| scene | ML-based scene detection (PySceneDetect) | Lectures with natural topic breaks |
| srt | Use SRT subtitle markers as boundaries | Prepared materials with timing metadata |
| time | Fixed time intervals | Uniform segment length regardless of content |
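
The scene method is backed by PySceneDetect per the table above. A minimal sketch follows; note that scene_detection_threshold is normalized to 0.1-0.9, while ContentDetector uses its own frame-difference scale, so the conversion below is an illustrative assumption rather than the extractor's actual mapping:

```python
from scenedetect import ContentDetector, detect  # pip install scenedetect[opencv]

# Illustrative mapping from the extractor's 0.1-0.9 threshold onto
# ContentDetector's scale; not the service's real formula.
normalized_threshold = 0.3
scenes = detect("lecture.mp4", ContentDetector(threshold=normalized_threshold * 90))

for i, (start, end) in enumerate(scenes):
    print(f"segment {i}: {start.get_seconds():.1f}s -> {end.get_seconds():.1f}s")
```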

PDF Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pdf_extraction_mode | string | "per_element" | per_page (one doc per page) or per_element (one doc per detected element) |
| pdf_render_dpi | integer | 150 | DPI for rendering PDF pages (72-300). Higher = better OCR quality, slower |
| detect_code_in_pdf | boolean | true | Automatically detect and tag code blocks in PDF text |
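
The layout model behind per_element extraction is internal to the service. As a rough stand-in, PyMuPDF (an assumed library choice, not the extractor's actual implementation) exposes per-block text with bounding boxes, which maps onto the element_type and bbox output fields:

```python
import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("slides-week-1.pdf")
for page_number, page in enumerate(doc):
    # Each block is (x0, y0, x1, y1, text, block_no, block_type);
    # block_type 1 marks an image block, 0 a text block.
    for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
        kind = "figure" if block_type == 1 else "text"
        print(f"page={page_number} element={block_no} type={kind} "
              f"bbox=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")
```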

Code Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| segment_functions | boolean | true | Segment code files into individual functions/classes |
| supported_languages | array | ["python", "javascript", "java", "go", "rust", "c", "cpp"] | Programming languages to extract and embed |
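
How the service parses each language is not documented. For Python sources, a minimal stand-in using the standard library's ast module shows where the start_line/end_line output fields could come from; the other listed languages would need language-specific parsers (e.g., tree-sitter):

```python
import ast

source = '''\
def train(model, data):
    """Fit the model on the training data."""
    return model.fit(data)

class Trainer:
    def run(self):
        pass
'''

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        # end_lineno requires Python 3.8+
        print(f"{type(node).__name__} {node.name}: lines {node.lineno}-{node.end_lineno}")
```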

Feature Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run_text_embedding | boolean | true | Generate E5-Large text embeddings for transcripts and text content |
| run_code_embedding | boolean | true | Generate Jina Code embeddings for code snippets |
| run_visual_embedding | boolean | false | Generate SigLIP visual embeddings for figures and screenshots |
| visual_embedding_use_case | string | "lecture" | Context for visual embedding: lecture, code_demo, tutorial, presentation, dynamic |
| extract_screen_text | boolean | true | Run OCR on video frames to extract on-screen text |
| generate_thumbnails | boolean | true | Generate and store thumbnail images |
| use_cdn | boolean | false | Use CDN for thumbnail delivery (if available) |
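
Embeddings are computed server-side, but the checkpoint named in the output schema is public, so the text vector can be reproduced locally. A sketch with sentence-transformers (whether the service applies E5-style instruction prefixes to queries is not documented, so none are used here):

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Same public checkpoint named in the output schema.
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

texts = ["In this section, we explore supervised learning algorithms..."]
vectors = model.encode(texts, normalize_embeddings=True)  # L2-normalized
print(vectors.shape)  # (1, 1024)
```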

LLM Enrichment Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enrich_with_llm | boolean | false | Enable LLM-generated summaries and key concept extraction |
| llm_prompt | string | "Summarize this educational content, highlighting key concepts, learning objectives, and main takeaways" | Custom prompt for LLM enrichment |
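
Summaries are generated with Gemini per the pipeline steps, but the variant and call path are internal. To preview what the default llm_prompt yields, a sketch with the google-generativeai package (the model name below is an assumption):

```python
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Model variant is an illustrative assumption, not the service's choice.
model = genai.GenerativeModel("gemini-1.5-flash")

prompt = (
    "Summarize this educational content, highlighting key concepts, "
    "learning objectives, and main takeaways\n\n"
    "In this section, we explore supervised learning algorithms..."
)
print(model.generate_content(prompt).text)
```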

Configuration Examples

```json
{
  "feature_extractor": {
    "feature_extractor_name": "course_content_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "payload.lecture_url",
      "srt": "payload.subtitle_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.course_id" },
      { "source_path": "metadata.lesson_number" }
    ],
    "parameters": {
      "target_segment_duration_ms": 120000,
      "segmentation_method": "scene",
      "scene_detection_threshold": 0.3,
      "use_whisper_asr": true,
      "expand_to_granular_docs": true,
      "ocr_frames_per_segment": 3,
      "run_text_embedding": true,
      "run_code_embedding": true,
      "run_visual_embedding": false,
      "generate_thumbnails": true
    }
  }
}
```
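
If you register this configuration over the API, it normally travels inside a collection-creation request. The endpoint and payload shape below are hypothetical, shown only to illustrate the wiring; confirm both against the API reference:

```python
import json

import requests

with open("course_extractor_config.json") as f:
    config = json.load(f)  # the JSON block above

# NOTE: hypothetical endpoint, auth scheme, and payload shape.
resp = requests.post(
    "https://api.mixpeek.com/v1/collections",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"collection_name": "ml-course", **config},
    timeout=30,
)
print(resp.status_code, resp.text[:200])
```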

Performance & Costs

| Metric | Value |
| --- | --- |
| Video processing | ~1 minute per 10 minutes of video (depends on segmentation) |
| PDF processing | ~2-5 seconds per page (depends on DPI and layout complexity) |
| Code processing | ~50-100ms per 1KB of code |
| Embedding latency | ~5ms per text unit (E5), ~10ms per code unit (Jina), ~50ms per visual unit (SigLIP) |
| Cost (Tier 2) | 20 credits per video minute, 5 credits per PDF page, 2 credits per 1K code tokens |
| GPU acceleration | Recommended for 10+ videos; 2-3x speedup |
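
A back-of-the-envelope Tier 2 estimate using the rates above, for a hypothetical course of 45 video minutes, 30 slide pages, and roughly 12K code tokens:

```python
# Tier 2 rates from the table above.
CREDITS_PER_VIDEO_MINUTE = 20
CREDITS_PER_PDF_PAGE = 5
CREDITS_PER_1K_CODE_TOKENS = 2

video_minutes, pdf_pages, code_tokens = 45, 30, 12_000

credits = (
    video_minutes * CREDITS_PER_VIDEO_MINUTE
    + pdf_pages * CREDITS_PER_PDF_PAGE
    + (code_tokens / 1_000) * CREDITS_PER_1K_CODE_TOKENS
)
print(f"Estimated cost: {credits:.0f} credits")  # 900 + 150 + 24 = 1074
```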

Vector Indexes

All three embeddings are stored as Qdrant named vectors for hybrid search:
| Property | Value |
| --- | --- |
| Index 1 name | intfloat__multilingual_e5_large_instruct |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |

| Property | Value |
| --- | --- |
| Index 2 name | jinaai__jina_embeddings_v2_base_code |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Normalization | L2 normalized |

| Property | Value |
| --- | --- |
| Index 3 name | google__siglip_base_patch16_224 |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | google_siglip_base_v1 |
| Status | Optional (if run_visual_embedding=true) |
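
A sketch of mirroring the three named vectors in a local Qdrant collection with qdrant-client; the vector names and dimensions come from the tables above, while the collection name is arbitrary:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(":memory:")  # local in-memory instance for illustration

client.create_collection(
    collection_name="learning_units",  # arbitrary name
    vectors_config={
        "intfloat__multilingual_e5_large_instruct": VectorParams(size=1024, distance=Distance.COSINE),
        "jinaai__jina_embeddings_v2_base_code": VectorParams(size=768, distance=Distance.COSINE),
        "google__siglip_base_patch16_224": VectorParams(size=768, distance=Distance.COSINE),
    },
)
```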

Comparison with Other Extractors

| Feature | course_content_extractor | text_extractor | multimodal_extractor | document_graph_extractor |
| --- | --- | --- | --- | --- |
| Input types | Video, PDF, Code | Text only | Video, Image, Text | PDF only |
| Segmentation | Scene/SRT/time | Word/sentence/paragraph | N/A | Layout-based |
| Text embeddings | E5-Large (1024D) | E5-Large (1024D) | Vertex AI (1408D) | E5-Large (1024D) |
| Code embeddings | Jina Code (768D) | - | - | - |
| Visual embeddings | SigLIP (768D) optional | - | Vertex AI (1408D) | - |
| Best for | Educational content | Text search | Unified multimodal | Complex PDF layouts |
| Cost per unit | Medium (5-50 credits) | Low (free) | High (10-50 credits) | Medium (5-50 credits) |

Limitations

  • Video length: Optimized for videos up to 4 hours; longer videos may need to be split before ingestion.
  • Transcription quality: Whisper ASR works best with clear audio; noisy lectures may have reduced accuracy.
  • Code extraction: Requires valid ZIP archives; loose files not supported.
  • Language support: Code embedding works with common languages; domain-specific DSLs have reduced accuracy.
  • PDF complexity: Complex layouts with nested tables may have reduced extraction quality.
  • Visual embeddings: Optional and add significant processing cost.