Replicas & Autoscaling
Run multiple instances of a service, spread across regions, scale automatically on load, and roll out safely.
A replica is one running instance of your service (web/worker/cron). Replicas give you capacity, redundancy, and zero-downtime deploys. Forgeon can autoscale replicas up/down based on CPU, memory, RPS, or latency.
Start with min=2 replicas for production web traffic—one to serve, one to roll—then add autoscaling.
Where to manage replicas
- Project → Services → [service] → Scaling
- Manual scale: set Min/Max replicas
- Autoscaling policy: choose signals (CPU, Memory, RPS, Latency, Custom)
- Regions: select where replicas run
Manual scaling (always available)
- Min replicas — the floor (kept running 24/7 in chosen regions)
- Max replicas — the ceiling (autoscaler won’t exceed this)
- Spread — choose one or more regions; traffic is steered to the nearest healthy replica
With min=1 you can’t do zero-downtime rollouts. Use min≥2 for production web services.
Autoscaling policies
Turn on Autoscale to add or remove replicas automatically (see the sketch after this list):
- CPU target — keep average CPU ~60% (good default for APIs)
- Memory target — keep RSS under 75–80% (heap-heavy runtimes)
- RPS target — bound requests per replica (e.g., 200/instance)
- Latency target — keep p95 below SLO (e.g., 300 ms)
- Custom metric — point to an app-emitted gauge (queue depth, concurrency)
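Forgeon’s exact control loop isn’t documented here, but target-tracking autoscalers commonly use proportional scaling: multiply the current replica count by the ratio of observed metric to target, then clamp to the min/max bounds. A minimal sketch of that math (all names hypothetical, not Forgeon’s actual algorithm):

// Hypothetical target-tracking math: desired = ceil(current * observed / target),
// clamped to [min, max]. Not Forgeon's actual implementation.
function desiredReplicas({ current, observed, target, min, max }) {
  const raw = Math.ceil(current * (observed / target))
  return Math.min(max, Math.max(min, raw))
}

// e.g. 4 replicas at 90% CPU with a 60% target -> ceil(4 * 90/60) = 6
desiredReplicas({ current: 4, observed: 90, target: 60, min: 2, max: 10 }) // 6

The same formula works for any of the signals above: swap CPU percentage for RPS per replica, p95 latency, or your custom gauge.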
Controls
- Cooldown — wait before scaling again (e.g., 60–120s)
- Step size — add/remove N replicas per decision (e.g., +2/−1)
- Max surge — extra replicas allowed during deploy (e.g., 25%)
- Max unavailable — replicas allowed to be temporarily out (e.g., 0 for web)
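These controls bound how aggressively the loop acts on each decision. As a sketch of how cooldown and step size typically constrain the raw desired count (hypothetical names, not Forgeon’s implementation):

// Hypothetical: clamp a scaling decision by step size, and skip it during cooldown.
let lastScaleAt = 0
function applyControls(current, desired, { stepUp = 2, stepDown = 1, cooldownMs = 90_000 }) {
  if (Date.now() - lastScaleAt < cooldownMs) return current // still cooling down
  const delta = desired - current
  const step = delta > 0 ? Math.min(delta, stepUp) : Math.max(delta, -stepDown)
  if (step !== 0) lastScaleAt = Date.now()
  return current + step
}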
Regional replicas
- Pick multiple regions for availability and latency.
- Forgeon routes users to the nearest healthy replica; failover happens automatically.
- Databases: prefer a primary region close to your DB, or use read-replicas if your engine supports it.
Start single-region near your database. Add a second region once p95 latency or regional reliability becomes a concern.
Zero-downtime deploys (rolling)
When you deploy a new revision:
- Surge up (optionally) to keep capacity
- New replicas must pass readiness (e.g., /readyz → 200)
- Traffic shifts to new replicas
- Old replicas drain and stop
Return 200 from /readyz only when the DB/cache are reachable and migrations have finished.
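A readiness endpoint along these lines gates traffic until the replica can actually serve. Express shown; db.ping, cache.ping, and migrationsDone are stand-ins for your own checks:

// Hypothetical readiness check: report 200 only once dependencies are usable.
app.get('/readyz', async (_req, res) => {
  try {
    await db.ping()    // stand-in: verify the DB connection pool
    await cache.ping() // stand-in: verify the cache connection
    if (!migrationsDone) throw new Error('migrations pending')
    res.status(200).send('ok')
  } catch {
    res.status(503).send('not ready')
  }
})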
Sticky sessions & websockets
- Sticky sessions (optional) keep a user pinned to one replica (cookie-based). Use if you have in-memory sessions.
- WebSockets and server-sent events are supported; use graceful shutdown so open connections drain before a replica stops (see the sketch below).
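For WebSockets, draining means telling clients to reconnect elsewhere before the replica stops. A sketch using the ws package, assuming wss is your WebSocketServer instance (the 1001 “going away” close code is standard; reconnect logic lives in your client):

// Hypothetical WebSocket drain on shutdown, using the ws package.
process.on('SIGTERM', () => {
  for (const client of wss.clients) {
    client.close(1001, 'server shutting down') // 1001 = going away; clients should reconnect
  }
})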
Graceful shutdowns (don’t cut users off)
- Handle SIGTERM to start draining, close listeners, and finish in-flight work.
- Honor a termination grace period (set on the service) so the platform doesn’t force-kill long requests.
- Keep /healthz returning OK until the process is actually exiting; flip /readyz to failing as soon as you start draining.
// Node example (assumes an Express-style `app`)
let ready = true
app.get('/readyz', (_req, res) => res.sendStatus(ready ? 200 : 503))

const server = app.listen(process.env.PORT, '0.0.0.0')
process.on('SIGTERM', () => {
  ready = false // /readyz now returns 503, so new traffic stops arriving
  server.close(() => process.exit(0)) // exit cleanly once in-flight requests finish
  setTimeout(() => process.exit(1), 15000) // force-exit fallback within the grace period
})
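Keep the force-exit fallback (15s here) comfortably below the service’s termination grace period, so the process exits on its own terms before the platform force-kills it.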