Ingest

Videlicet Ingest Process Overview

Discovery Phase (runs hourly via cron)

- The IngestNewPages job lists all files in the S3 source bucket (prefix "source/")
- Filters for PDF and JPG/JPEG files only
- Checks against existing pages in the database to avoid duplicates
- Creates an individual IngestPage job for each new file found
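
The filtering and de-duplication steps above can be sketched as a small pure function. This is only an illustration, not the actual job code: the function name `select_new_files`, the case-insensitive extension match, and the idea that duplicates are detected by file name are all assumptions.

```python
from pathlib import PurePosixPath

# Extensions accepted by the discovery phase (PDF and JPG/JPEG only).
ALLOWED_EXTENSIONS = {".pdf", ".jpg", ".jpeg"}


def select_new_files(keys, known_names, prefix="source/"):
    """Return the S3 keys that should each get their own IngestPage job.

    keys        -- object keys listed from the source bucket
    known_names -- names of pages already in the database
                   (assumption: duplicates are matched by file name)
    """
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # outside the source prefix
        path = PurePosixPath(key)
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            continue  # not a PDF or JPG/JPEG
        if path.name in known_names:
            continue  # already ingested
        selected.append(key)
    return selected
```

Keeping the selection logic separate from the S3 listing and the database lookup makes it easy to test without touching either service.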

Ingest Phase (per-page processing)

Each IngestPage job:
- Downloads the file from S3
- Sends it via HTTP PUT to the /videlicet/pages/{name} endpoint
- The endpoint then:
  
  i. Saves the raw file to the destination S3 bucket
  
  ii. Converts to web-displayable JPG using ImageMagick:
     * Density: 300 DPI
     * Trims whitespace
     * Resizes to max 2400x3200 pixels
     * Quality: 90%
  
  iii. Saves the converted JPG to S3
  
  iv. Creates a page record in the database with the SHA-256 hash as its ID
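
As a rough illustration of the conversion and ID steps, the sketch below builds an ImageMagick command line from the settings listed above and computes a SHA-256 hex digest. The helper names are hypothetical, the `magick` binary name assumes ImageMagick 7 (ImageMagick 6 installs typically use `convert`), and hashing the raw file bytes is an assumption, since the notes do not say which bytes are hashed.

```python
import hashlib


def build_convert_command(src_path, dst_path, binary="magick"):
    """Build an ImageMagick argument list for the documented settings."""
    return [
        binary,
        "-density", "300",        # set before the input so PDFs rasterize at 300 DPI
        src_path,
        "-trim",                  # trim surrounding whitespace
        "-resize", "2400x3200>",  # '>' only shrinks images larger than the box
        "-quality", "90",         # JPEG quality 90%
        dst_path,
    ]


def page_id(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as a page ID (assumed input: raw file bytes)."""
    return hashlib.sha256(data).hexdigest()
```

The argument list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids the shell interpreting the `>` in the resize geometry.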

Parallelization

The system uses two separate job queues with different concurrency limits:

Regular Jobs Queue (JobsQ):
- Default: 8 concurrent workers
- Production (fly.toml): 16 concurrent workers
- Handles: IngestNewPages, EmbedPages, ExtractTextFromPages, Evaluate, etc.

CPU Jobs Queue (JobsQCPU):
- Default: 1 concurrent worker
- Production: 1 concurrent worker (no change)
- Handles: IngestPage jobs only
- Limited to 1 to prevent overwhelming the system with ImageMagick conversions
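
The two-queue arrangement can be modeled roughly as follows. This is a minimal sketch and not the system's real queue implementation: the function names are hypothetical, and thread pools merely stand in for whatever job runner is actually used. Only the queue names, job types, and worker counts come from the notes above.

```python
from concurrent.futures import ThreadPoolExecutor

# Worker limits taken from the notes above.
DEFAULT_LIMITS = {"JobsQ": 8, "JobsQCPU": 1}
PRODUCTION_OVERRIDES = {"JobsQ": 16}  # JobsQCPU stays at 1 in production

# Only IngestPage runs on the CPU queue; everything else uses JobsQ.
CPU_JOB_TYPES = {"IngestPage"}


def queue_for(job_type):
    """Pick the queue name for a job type."""
    return "JobsQCPU" if job_type in CPU_JOB_TYPES else "JobsQ"


def worker_limit(queue, production=False):
    """Return the concurrency limit for a queue."""
    limits = dict(DEFAULT_LIMITS)
    if production:
        limits.update(PRODUCTION_OVERRIDES)
    return limits[queue]


def make_executors(production=False):
    """Create one bounded thread pool per queue (illustrative stand-in)."""
    return {
        name: ThreadPoolExecutor(max_workers=worker_limit(name, production))
        for name in DEFAULT_LIMITS
    }
```

With a single worker on JobsQCPU, at most one ImageMagick conversion runs at a time, while discovery and the other job types continue on the wider JobsQ pool.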