That is, $0.0000037 per query. Or: $3.70 per million matched lines. Flat. Linear. No bulk discount needed because the baseline is already negligible.
This is the cost of a fully-loaded production query: nginx ingress, gunicorn dispatch, FastAPI request parsing, Snake SAT vote across the relevant field model, fuzzy fallback if needed, JSON response with audit trail, full TLS round-trip. End to end. Same number whether the query hits Tier 1 (SAT 100%), Tier 2 (fuzzy), or returns no match.
The whole stack lives on one EC2 instance. No queue, no Lambda, no serverless cold start.
| Component | Spec | Monthly |
|---|---|---|
| EC2 r6i.4xlarge | 16 vCPU, 128 GB RAM, eu-west-3 | $590 |
| S3 storage (snake-models-monce) | ~3 GB models + backups | ~$1 |
| Route53 hosted zone | monce.ai & aws.monce.ai | $0.50 |
| Data transfer out | typical workload | ~$5 |
| Total fixed | | ~$597 / mo |
r6i.4xlarge is over-provisioned for the current load (we've seen 21 GB used out of 123 GB at peak with 4 workers). We could downsize to r6i.2xlarge (~$295/mo) at any time. We don't, because the headroom absorbs OCR batch storms and rebuild_all runs without thrashing.
The instance is provisioned 24/7. Variable cost = (instance $/sec) ÷ (sustained q/s).
At a sustained 260 q/s (single-instance benchmark), that's ~$0.00000086 per query amortized. We use a more conservative 50% utilization assumption to get the headline $3.7×10⁻⁶: roughly 130 q/s average, or ~10 million queries per day.
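The amortization formula above can be sketched in a few lines. This is an illustrative calculation, not the authors' own tooling: the function name `cost_per_query` and the 730-hour average month are assumptions, and only the $590 instance line from the table is amortized here.

```python
# Assumption: an average month of 730 hours is used to turn $/month into $/second.
HOURS_PER_MONTH = 730


def cost_per_query(monthly_usd: float, sustained_qps: float) -> float:
    """Amortized cost of one query: (instance $/sec) / (sustained q/s)."""
    usd_per_second = monthly_usd / (HOURS_PER_MONTH * 3600)
    return usd_per_second / sustained_qps


# Instance cost and benchmark throughput taken from the figures quoted above.
benchmark_cost = cost_per_query(monthly_usd=590.0, sustained_qps=260.0)
print(f"${benchmark_cost:.2e} per query at 260 q/s")
```

Under these assumptions the benchmark case lands at roughly $8.6×10⁻⁷ per query, in line with the ~$0.00000086 figure. Halving the assumed throughput doubles the per-query cost, since the instance bill is fixed.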
The natural alternative is a Claude Haiku call to interpret each line. Same task, different substrate.
| Approach | Latency | Cost / line | Determinism | Audit trail | Multi-tenant |
|---|---|---|---|---|---|
| Snake API | 3.7 ms | $3.7×10⁻⁶ | Yes | SAT clauses | 13 factories isolated |
| Claude Haiku 4.5 | ~400 ms | ~$0.0001 | No | Free-text only | Per-call prompting |
| GPT-4o-mini | ~600 ms | ~$0.00015 | No | Free-text only | Per-call prompting |
| Embedding similarity (OpenAI) | ~80 ms | ~$0.00002 | ~Yes | Cosine score | Per-tenant index |
| SageMaker endpoint | ~50 ms | ~$0.0001 | Yes | Model-dependent | Per-endpoint $/hr |
The fixed-cost crossover: if you serve k queries per month, Snake API beats Claude Haiku when the fixed monthly cost spread over k queries falls below Haiku's per-line price:
$597 / k < $0.0001
Solving: k > ~6 million queries/month. Below that, an LLM is cheaper if you have no other reason to run an EC2 instance. Above 6M/month, Snake's economics dominate.
Production load on this EC2 is currently ~10–30M queries/month. We're firmly in Snake territory.
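The crossover can be checked numerically. The sketch below is only an illustration: the helper name `breakeven_queries_per_month` is hypothetical, and it treats the Snake side as pure fixed cost (~$597/mo) and each alternative as pure per-line cost, using the approximate prices from the comparison table.

```python
def breakeven_queries_per_month(fixed_monthly_usd: float, per_line_usd: float) -> float:
    """Monthly volume above which a fixed-cost service beats a pay-per-line one."""
    return fixed_monthly_usd / per_line_usd


FIXED_MONTHLY_USD = 597.0  # total fixed cost from the infrastructure table

# Approximate per-line prices copied from the comparison table.
PER_LINE_USD = {
    "Claude Haiku 4.5": 0.0001,
    "GPT-4o-mini": 0.00015,
    "Embedding similarity (OpenAI)": 0.00002,
}

breakevens = {
    name: breakeven_queries_per_month(FIXED_MONTHLY_USD, price)
    for name, price in PER_LINE_USD.items()
}
for name, k in breakevens.items():
    print(f"{name}: break-even at about {k:,.0f} queries/month")
```

For Claude Haiku this gives about 5.97 million queries per month, which is the ~6M threshold quoted above.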
| Factory | Articles | Synonyms | Global model | Field models | Client model | Total |
|---|---|---|---|---|---|---|
| F1 (VIT) | 1,996 | 7,398 | 32 MB | 32 MB | 22 MB | ~86 MB |
| F3 (Monce) | 1,996 | 7,398 | 32 MB | 32 MB | 22 MB | ~86 MB |
| F4 (VIP) | 6,647 | 21,826 | 120 MB | 117 MB | 115 MB | ~352 MB |
| F9 (Eurovitrage) | 1,929 | 6,106 | 30 MB | 11 MB | 11 MB | ~52 MB |
| F10 (TGVI) | 17,173 | 51,639 | 440 MB | ~270 MB | 149 MB | ~860 MB |
| F13–F18 (FR) | 1.7K–6.8K | 5K–22K | 24–117 MB | ~50–100 MB | 3–9 MB | varies |
| F22 (ES) | 3,988 | 15,714 | 76 MB | ~30 MB | 20 MB | ~126 MB |
| F23 (ES) | 4,255 | 12,752 | 69 MB | ~30 MB | 1 MB | ~100 MB |
| All 13 | ~60K | ~195K | ~1.4 GB | ~700 MB | ~370 MB | ~2.5 GB on disk |
Per-worker resident memory: ~5 GB (4 workers × ~5 GB = 20 GB total, which fits in 128 GB with ample headroom). The 2.5 GB on-disk size includes the serialized population for fuzzy matching; resident memory is larger because of Python object overhead.
A full rebuild_all retrains every factory's article + client models from scratch. Triggered nightly via cron, on-demand by ops, or after a bulk synonym import.
| Factor | Value |
|---|---|
| Wall-clock duration | ~9 min 20 s (32 min including the 13-factory ES expansion) |
| EC2 cost (during rebuild) | ~$0.50 (32 min × $0.94/hr at on-demand r6i.4xlarge) |
| Disk I/O | ~5 GB write to /tmp + ~3 GB upload to S3 |
| S3 PUT requests | ~270 (13 factories × ~21 model files each) |
| S3 cost | ~$0.001 per rebuild |
| Worker reload (SIGHUP) | ~30 s per worker, all 4 in parallel |
Onboarding a new factory (F14 VIO is the most recent example, F22/F23 the most recent locale expansion):
| Step | Cost | Time |
|---|---|---|
| Auto-onboarding cron picks up new factory_id from monce_db | $0 | 4h+15m polling cycle |
| populate_models.py training run | ~$0.05 (5–10 min EC2) | 5–10 min |
| S3 upload | ~$0.001 | 30 s |
| SIGHUP reload | $0 | 30 s |
| Total | ~$0.05 marginal | < 15 min of processing once the cron picks up the new factory |
No code changes needed. Factory IDs are discovered dynamically via S3 listing
(model_manager._load_all_models()) and via monce_db.list_factories().
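As an illustration of discovery by listing, the sketch below derives factory IDs from a list of object keys. It is not the actual model_manager code: the key layout `models/F<id>/...`, the file names, and the function `discover_factory_ids` are all hypothetical, and the keys are passed in as plain strings rather than fetched from S3.

```python
import re

# Hypothetical key layout "models/F<id>/<file>"; the real bucket layout is not documented here.
FACTORY_KEY = re.compile(r"^models/F(\d+)/")


def discover_factory_ids(keys):
    """Return the sorted, de-duplicated factory IDs found in a list of object keys."""
    ids = set()
    for key in keys:
        match = FACTORY_KEY.match(key)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


# Example keys (made up for illustration); unrelated prefixes are ignored.
example_keys = [
    "models/F1/global.pkl",
    "models/F1/client.pkl",
    "models/F22/global.pkl",
    "backups/2026-05-19.tar.gz",
]
print(discover_factory_ids(example_keys))
```

Because IDs come from whatever is present in the listing, a newly trained factory appears on the next reload without any change to a hard-coded list.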
Lambda would be tempting: scale-to-zero, pay-per-invoke. But:
SnakeBatch (the v6 distributed trainer at snakebatch.aws.monce.ai) does
use Lambda — because training is bursty and parallel. Inference is
steady and stateful. Different workload, different infra.
Charles Dana · Monce SAS · snake.aws.monce.ai · deployed 2026-05-20
Co-Authored-By: Claude (Anthropic)