Ingest
Videlicet Ingest Process Overview
Ingest Process Overview
- Discovery Phase (runs hourly via cron)
  - IngestNewPages job lists all files in the S3 source bucket (prefix "source/")
  - Filters for PDF and JPG/JPEG files only
  - Checks against existing pages in the database to avoid duplicates
  - Creates an individual IngestPage job for each new file found
- Ingest Phase (per-page processing)
  - Each IngestPage job:
    - Downloads the file from S3
    - Sends it via HTTP PUT to the /videlicet/pages/{name} endpoint
    - The endpoint then:
      i. Saves the raw file to the destination S3 bucket
      ii. Converts it to a web-displayable JPG using ImageMagick
        * Density: 300 DPI
        * Trims whitespace
        * Resizes to max 2400x3200 pixels
        * Quality: 90%
      iii. Saves the JPG to the S3 bucket
      iv. Creates a page record in the database with the SHA256 hash as ID
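The discovery filter and the conversion settings described above can be sketched in Python. This is only an illustration under stated assumptions, not the actual implementation: the function names (select_new_files, convert_command, page_id) are hypothetical, duplicates are assumed to be matched by file name, the ImageMagick 6 `convert` binary is assumed, "max 2400x3200" is read as "shrink only if larger", and the SHA256 is assumed to be computed over the raw file bytes.

```python
import hashlib
from pathlib import PurePosixPath

# Extensions accepted by the discovery phase (PDF and JPG/JPEG only).
ALLOWED_EXTENSIONS = {".pdf", ".jpg", ".jpeg"}


def select_new_files(keys, existing_names, prefix="source/"):
    """Return keys under `prefix` that are PDF/JPG/JPEG and not yet ingested."""
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        path = PurePosixPath(key)
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            continue
        # Assumption: an existing page is recognized by its file name.
        if path.name in existing_names:
            continue
        selected.append(key)
    return selected


def convert_command(src, dst):
    """Build an ImageMagick argument list matching the documented settings."""
    return [
        "convert",                # assumption: ImageMagick 6 CLI name
        "-density", "300",        # 300 DPI, set before the input is read
        src,
        "-trim",                  # trim surrounding whitespace
        "-resize", "2400x3200>",  # shrink only if larger than 2400x3200
        "-quality", "90",         # JPG quality 90%
        dst,
    ]


def page_id(data: bytes) -> str:
    """SHA256 hex digest (assumed here to be taken over the raw file bytes)."""
    return hashlib.sha256(data).hexdigest()
```

The argument list returned by convert_command could be passed to subprocess.run; each key returned by select_new_files would correspond to one IngestPage job.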
Parallelization
The system uses two separate job queues with different concurrency limits:
- Regular Jobs Queue (JobsQ):
  - Default: 8 concurrent workers
  - Production (fly.toml): 16 concurrent workers
  - Handles: IngestNewPages, EmbedPages, ExtractTextFromPages, Evaluate, etc.
- CPU Jobs Queue (JobsQCPU):
  - Default: 1 concurrent worker
  - Production: 1 concurrent worker (no change)
  - Handles: IngestPage jobs only
  - Limited to 1 to prevent overwhelming the system with ImageMagick conversions
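The two-queue arrangement can be illustrated with a short Python sketch. It assumes thread pools stand in for the real job queues; the JobQueues class and its method names are hypothetical and only mirror the routing and concurrency limits listed above.

```python
from concurrent.futures import ThreadPoolExecutor

# Job types routed to the CPU queue; everything else uses the regular queue.
CPU_JOB_TYPES = {"IngestPage"}


class JobQueues:
    """Two worker pools with independent concurrency limits."""

    def __init__(self, regular_workers=8, cpu_workers=1):
        # Stand-in for JobsQ: many lightweight jobs can run at once.
        self.regular = ThreadPoolExecutor(max_workers=regular_workers)
        # Stand-in for JobsQCPU: one worker, so conversions run one at a time.
        self.cpu = ThreadPoolExecutor(max_workers=cpu_workers)

    def pool_for(self, job_type):
        """Pick the pool that handles the given job type."""
        return self.cpu if job_type in CPU_JOB_TYPES else self.regular

    def submit(self, job_type, fn, *args):
        """Schedule fn(*args) on the pool matching job_type."""
        return self.pool_for(job_type).submit(fn, *args)

    def shutdown(self):
        """Wait for all queued jobs to finish, then stop both pools."""
        self.regular.shutdown(wait=True)
        self.cpu.shutdown(wait=True)
```

With the defaults this mirrors the 8/1 split; a production-like setup would be JobQueues(regular_workers=16). Because the CPU pool has a single worker, any number of submitted IngestPage jobs still run strictly one after another.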