Ingest

Videlicet Ingest Process Overview

  1. Discovery Phase (runs hourly via cron)
  • IngestNewPages job lists all files in the S3 source bucket (prefix “source/”)
  • Filters for PDF and JPG/JPEG files only
  • Checks against existing pages in database to avoid duplicates
  • Creates an individual IngestPage job for each new file found
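
The filtering and duplicate check in the discovery phase can be sketched roughly as follows. This is an illustration in Python, not the project's actual code: the function name select_new_files, matching duplicates by file name, and case-insensitive extension matching are all assumptions, since the notes above do not say how existing pages are compared.

```python
from pathlib import PurePosixPath

# Values taken from the notes above.
ALLOWED_EXTENSIONS = {".pdf", ".jpg", ".jpeg"}
SOURCE_PREFIX = "source/"


def select_new_files(keys, existing_names):
    """Return the S3 keys that should each get their own ingest job.

    keys           -- object keys listed from the source bucket
    existing_names -- names of pages already in the database
                      (assumption: duplicates are detected by file name)
    """
    new_keys = []
    for key in keys:
        if not key.startswith(SOURCE_PREFIX):
            continue  # outside the source prefix
        path = PurePosixPath(key)
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            continue  # not a PDF or JPG/JPEG
        if path.name in existing_names:
            continue  # already ingested
        new_keys.append(key)
    return new_keys
```

Each key returned here would correspond to one per-file ingest job in the next phase.
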
  2. Ingest Phase (per-page processing)
  • Each IngestPage job:
    • Downloads the file from S3

    • Sends it via HTTP PUT to the /videlicet/pages/{name} endpoint

    • The endpoint then:

      i. Saves the raw file to the destination S3 bucket

      ii. Converts to web-displayable JPG using ImageMagick

        * Density: 300 DPI
        * Trims whitespace
        * Resizes to max 2400x3200 pixels
        * Quality: 90%
      

      iii. Saves the JPG to S3 bucket

      iv. Creates the page record in the database with the SHA256 hash as its ID
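
The per-page steps above can be sketched in Python as three small helpers: building the PUT request, building the ImageMagick command line, and deriving the page ID. Everything beyond the values listed above is an assumption made for illustration: the helper names, the base URL passed in by the caller, the "magick" binary name (ImageMagick 7; version 6 uses "convert"), and hashing the raw file bytes (the notes do not say which bytes are hashed).

```python
import hashlib
import urllib.parse
import urllib.request


def build_put_request(base_url, name, data):
    """Build (but do not send) the HTTP PUT for one page file."""
    url = f"{base_url}/videlicet/pages/{urllib.parse.quote(name, safe='')}"
    return urllib.request.Request(url, data=data, method="PUT")


def build_convert_command(src_path, dst_path):
    """Return an ImageMagick argv using the settings listed above."""
    return [
        "magick",               # assumption: ImageMagick 7 binary name
        "-density", "300",      # set before the input so PDFs rasterize at 300 DPI
        src_path,
        "-trim",                # trim surrounding whitespace
        "-resize", "2400x3200", # fit within 2400x3200, keeping aspect ratio
        "-quality", "90",
        dst_path,
    ]


def page_id(data):
    """Hex SHA256 digest (assumption: computed over the raw file bytes)."""
    return hashlib.sha256(data).hexdigest()
```

The argv list is meant to be passed to something like subprocess.run(cmd, check=True) without a shell. If smaller images should never be enlarged, ImageMagick's "2400x3200>" geometry restricts the resize to shrinking only; whether the real endpoint does this is not stated.
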

Parallelization
The system uses two separate job queues with different concurrency limits:

  1. Regular Jobs Queue (JobsQ):

    • Default: 8 concurrent workers
    • Production (fly.toml): 16 concurrent workers
    • Handles: IngestNewPages, EmbedPages, ExtractTextFromPages, Evaluate, etc.
  2. CPU Jobs Queue (JobsQCPU):

    • Default: 1 concurrent worker
    • Production: 1 concurrent worker (no change)
    • Handles: IngestPage jobs only
    • Limited to 1 to prevent overwhelming the system with ImageMagick conversions
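
The two-queue arrangement can be approximated with two worker pools of different sizes. The sketch below is a Python illustration only; the real job system is not described in these notes, so the thread pools, the helper names, and the routing function are assumptions. Only the queue names, the worker counts, and the rule that IngestPage goes to the CPU queue come from the text above.

```python
from concurrent.futures import ThreadPoolExecutor

# Worker counts from the notes above.
DEFAULT_LIMITS = {"JobsQ": 8, "JobsQCPU": 1}
PRODUCTION_LIMITS = {"JobsQ": 16, "JobsQCPU": 1}

# Only IngestPage runs on the CPU queue.
CPU_JOBS = {"IngestPage"}


def queue_for(job_name):
    """Pick the queue a job type belongs to."""
    return "JobsQCPU" if job_name in CPU_JOBS else "JobsQ"


def make_pools(limits=DEFAULT_LIMITS):
    """Create one pool per queue, sized to its concurrency limit."""
    return {name: ThreadPoolExecutor(max_workers=n) for name, n in limits.items()}


def submit(pools, job_name, fn, *args):
    """Run fn on the pool for this job type and return its Future."""
    return pools[queue_for(job_name)].submit(fn, *args)
```

With a single worker on JobsQCPU, IngestPage jobs run one at a time, so at most one ImageMagick conversion is in flight, while the other job types continue in parallel on JobsQ.
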