OneSummer Infrastructure
Scale-to-zero architecture for extreme seasonal load. COPPA-compliant from day one.
Architecture Diagram
The architecture is split into three capability zones: a static CDN edge for the SvelteKit frontend, a serverless/container API tier that scales to zero in the off-season, and a managed database tier with connection pooling to absorb burst traffic during Feb–May.
80% of all traffic arrives February through May. The architecture must cost near nothing in summer and fall while being capable of handling concurrent application spikes during peak admission season — without manual intervention.
Design Principles
Scale to Zero
Every component that can auto-scale to zero must do so. No standing infrastructure during Jun–Jan except the database (minimum tier).
Security by Default
COPPA compliance is non-negotiable. Children's PII must be encrypted at rest and in transit, with parental consent gating all data collection.
Managed Over Self-Hosted
Prefer managed services to minimize operational burden. A small team should not be running Postgres or Redis servers manually.
Vendor Agnostic Design
Capabilities are defined first; vendor names are recommendations. Avoid proprietary lock-in at the data and compute layers.
Frontend Hosting
The SvelteKit frontend compiles to static assets (HTML, CSS, JS) at build time via adapter-static; pages that genuinely need server rendering can run as SSR edge functions instead. Static assets are served from a global CDN with no per-request compute cost. This is the most cost-effective option, and the CDN absorbs traffic spikes without any scaling configuration.
For OneSummer, use static prerendering for all public-facing pages (discovery, camp profiles, marketing) and client-side rendering behind authentication. This eliminates serverless function invocations for the vast majority of page loads, reducing cost and latency.
Recommended: Netlify
Netlify's free tier covers 100 GB bandwidth and 300 build minutes per month — sufficient for off-season and early growth. Their SvelteKit adapter is well-maintained and build previews per PR are included at no cost.
Alternatives
| Provider | Free Tier | SvelteKit Support | Trade-offs |
|---|---|---|---|
| Netlify (recommended) | 100 GB BW, 300 build min | Official adapter | Best DX, easy forms/functions |
| Vercel | 100 GB BW, hobby unlimited | First-class | Better if using edge functions heavily |
| Cloudflare Pages | Unlimited BW | Via adapter-cloudflare | Best performance/cost at scale; DX slightly rougher |
| AWS Amplify / S3+CloudFront | 12-mo free tier | Manual config | Most control; most setup overhead |
Seasonal Configuration
- No seasonal tuning required — CDN cost scales linearly with traffic and is effectively zero during low season on free/pro tiers.
- Set aggressive cache headers: `Cache-Control: public, max-age=31536000, immutable` on hashed assets.
- Enable HTTP/2 and Brotli compression (default on all recommended providers).
- Use `_headers` / `netlify.toml` to set `X-Frame-Options`, `Content-Security-Policy`, and `Permissions-Policy` at the CDN edge, at zero compute cost.
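As a rough illustration of the caching rule above, the sketch below picks a `Cache-Control` value by path. It assumes SvelteKit's default build layout, where content-hashed files live under `/_app/immutable/`; the function name and the revalidation policy for unhashed files are illustrative choices, not part of any provider's API.

```typescript
// Hypothetical helper: choose a Cache-Control header for a static asset path.
// Assumes SvelteKit's default output, where hashed files live under /_app/immutable/.
const IMMUTABLE_PREFIX = "/_app/immutable/";

function cacheControlFor(path: string): string {
  if (path.startsWith(IMMUTABLE_PREFIX)) {
    // File names change whenever content changes, so a one-year cache is safe.
    return "public, max-age=31536000, immutable";
  }
  // HTML and other unhashed files are revalidated so new deploys show up promptly.
  return "public, max-age=0, must-revalidate";
}

console.log(cacheControlFor("/_app/immutable/chunks/index.abc123.js"));
console.log(cacheControlFor("/camps/index.html"));
```

The same two rules can be expressed declaratively in `_headers` or `netlify.toml`; the function form is only meant to make the split between hashed and unhashed files explicit.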
Backend / API
The API is a Docker container running a standard HTTP server (Fastify, Hono, or similar). The critical requirement is scale-to-zero: during the off-season, the container should cost nothing when idle. This narrows the field to platforms with built-in zero-scaling.
Scale-to-zero platforms impose cold start latency (typically 500ms–2s for Node containers). For OneSummer's use case — asynchronous application submissions, not real-time gaming — this is an acceptable trade-off. Implement a lightweight health-check keep-alive for the peak Feb–May window only if cold starts cause user-visible delays.
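The conditional keep-alive described above could be gated by a small check like the following. This is a sketch under stated assumptions: the 1500 ms threshold and both function names are hypothetical, and the peak window is taken as February through May in UTC.

```typescript
// Peak window from the text: February through May (UTC months are 0-indexed, so 1..4).
function isPeakSeason(date: Date): boolean {
  const month = date.getUTCMonth();
  return month >= 1 && month <= 4;
}

// Hypothetical policy: keep a machine warm only in peak season AND only when
// cold starts are actually slow. The threshold is an assumed value to tune from real data.
function shouldKeepWarm(date: Date, p95ColdStartMs: number, thresholdMs = 1500): boolean {
  return isPeakSeason(date) && p95ColdStartMs > thresholdMs;
}

console.log(shouldKeepWarm(new Date(Date.UTC(2025, 2, 15)), 1800)); // March, slow cold starts
console.log(shouldKeepWarm(new Date(Date.UTC(2025, 7, 1)), 1800)); // August, off-season
```

Keeping the decision in one function makes it easy to switch the keep-alive off again on June 1 without touching the scheduler that calls it.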
Recommended: Fly.io
Fly.io supports Docker containers with native scale-to-zero: with auto_stop_machines and auto_start_machines enabled and min_machines_running = 0, idle Machines stop and wake again on the first inbound request. The free tier includes 3 shared-CPU VMs and 3 GB storage, which is enough for the API plus background workers.
Alternatives
| Provider | Scale-to-Zero | Free Tier | Notes |
|---|---|---|---|
| Fly.io (recommended) | Yes | 3 VMs, 3 GB | Low cold starts; great CLI; global regions |
| Railway | Yes | $5 credit/mo | Excellent DX; simpler ops than Fly |
| Google Cloud Run | Yes | 2M req/mo free | Best at true serverless scale; GCP ecosystem |
| AWS App Runner | Min 1 instance | None | Does not fully scale to zero; ~$7/mo minimum |
| Azure Container Apps | Yes | 180,000 vCPU-sec | Good scale-to-zero; KEDA-based autoscaling |
Container Configuration
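
    # fly.toml (peak season config)
    [http_service]
      internal_port = 3000
      force_https = true
      auto_stop_machines = "stop"   # stop idle Machines
      auto_start_machines = true    # wake on the first inbound request
      min_machines_running = 0      # scale to zero off-season

      [http_service.concurrency]
        type = "requests"
        soft_limit = 200            # scale up above this
        hard_limit = 250

    [[vm]]
      size = "shared-cpu-1x"
      memory = "512mb"

    # fly.toml has no maximum-count key: burst capacity for Feb–May is capped by
    # the number of Machines created for the app, e.g. `fly scale count 10`.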
Object Storage
Application documents, uploaded profile photos, and camp media are stored in S3-compatible object storage, separate from the container. Cloudflare R2 is recommended for its zero egress fees. Alternatives: AWS S3, Backblaze B2.
Email Delivery
Transactional emails (application confirmations, parental consent requests, status updates) require a dedicated provider. Resend offers 3,000 free emails/month with React Email template support. Alternative: Postmark.
Database
OneSummer requires a managed PostgreSQL database. The database cannot scale to zero — it must persist all user data year-round — but it can run on a minimal instance during the off-season and scale up for Feb–May. Connection pooling is critical: serverless/container platforms open many short-lived connections; without pooling, Postgres will exhaust its connection limit under modest load.
A single Fly.io machine at 250 concurrent requests, each holding a Postgres connection, will hit a max_connections limit almost immediately on a small instance. PgBouncer in transaction mode allows thousands of application-level requests to share a small pool of actual Postgres connections. This is not optional for the Feb–May peak season.
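The arithmetic behind that claim can be made explicit with a short sketch. The hard limit of 250 and the pool size of 20 come from the configs in this document; the `max_connections` value of 100 and all names in the code are assumptions for illustration.

```typescript
interface PoolPlan {
  directConnections: number; // connections needed if every request holds its own
  pooledConnections: number; // connections actually opened to Postgres via the pooler
  exhaustsWithoutPooler: boolean;
}

function planConnections(
  machines: number,
  hardLimitPerMachine: number,
  poolSize: number,
  maxConnections: number,
): PoolPlan {
  // Worst case without pooling: every in-flight request holds one connection.
  const directConnections = machines * hardLimitPerMachine;
  return {
    directConnections,
    // With transaction pooling, Postgres sees at most poolSize connections.
    pooledConnections: Math.min(directConnections, poolSize),
    exhaustsWithoutPooler: directConnections > maxConnections,
  };
}

// One machine at its hard limit against an assumed max_connections of 100.
console.log(planConnections(1, 250, 20, 100));
// Ten machines at burst capacity: the direct count grows, the pooled count does not.
console.log(planConnections(10, 250, 20, 100));
```

The point of the comparison is that the pooled figure stays flat as machines are added, while the direct figure grows with every machine.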
Recommended: Neon
Neon is a serverless Postgres provider that separates storage from compute. The free tier includes 0.5 GB storage, autoscaling compute, and a built-in connection pooler (PgBouncer). Compute scales to zero after a period of inactivity — during the off-season, the database compute cost approaches zero while data remains durable.
Alternatives
| Provider | Pooler Included | Scale-to-Zero | Free Tier | Notes |
|---|---|---|---|---|
| Neon (recommended) | Yes (PgBouncer) | Compute only | 0.5 GB, 0.25 vCPU | Best for serverless; branching for dev/staging |
| Supabase | Yes (Supavisor) | Project pause | 500 MB, pauses after inactivity | Also provides Auth, Storage, Realtime |
| PlanetScale (Vitess) | Yes | No | 5 GB | MySQL-compatible; not Postgres |
| AWS RDS / Aurora Serverless v2 | RDS Proxy ($) | Aurora Serverless | None | Most control; higher cost; ideal at growth scale |
| Fly Postgres | Self-managed | No | Included in Fly plan | Not fully managed; avoid unless ops-mature |
Connection Pooling Architecture

    # Application → PgBouncer → Postgres
    # Use TRANSACTION mode for serverless/container workloads
    DATABASE_URL="postgresql://user:pass@pooler.neon.tech:5432/onesummer?pgbouncer=true"
    # ↑ ?pgbouncer=true is a Prisma-specific flag; omit it for other clients

    DATABASE_DIRECT_URL="postgresql://user:pass@ep-xxx.neon.tech:5432/onesummer"
    # ↑ direct connection for migrations only

    # Recommended pool settings (Neon default)
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 20   # actual Postgres connections
Neon's branching feature creates instant, copy-on-write database snapshots. Use this for: (1) staging branch that mirrors production schema, (2) per-PR preview database branches in CI, and (3) safe migration testing before applying to production.
Migration Strategy
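Use Drizzle ORM or Prisma for schema management. Run migrations via the direct (non-pooled) connection URL only. Never run migrations through the pooler in transaction mode: migration tools rely on session-level features such as advisory locks and session settings, which transaction pooling does not preserve between statements, and long-running DDL can be cut off by pooler timeouts.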
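One way to enforce the direct-URL rule is a guard at the top of the migration script. The sketch below is hypothetical: it assumes the `DATABASE_DIRECT_URL` variable used in this document and treats a hostname containing "pooler" or a `pgbouncer=true` query flag as signs of a pooled URL, which is a heuristic rather than anything a provider guarantees.

```typescript
// Hypothetical guard for a migration script: refuse to run against a pooled URL.
// Heuristics (assumptions): "pooler" in the hostname, or a pgbouncer=true flag.
function isPooledUrl(connectionString: string): boolean {
  const url = new URL(connectionString);
  return url.hostname.includes("pooler") || url.searchParams.get("pgbouncer") === "true";
}

// Returns the URL migrations should use, or throws before any DDL is attempted.
function migrationUrl(env: Record<string, string | undefined>): string {
  const direct = env.DATABASE_DIRECT_URL;
  if (!direct) {
    throw new Error("DATABASE_DIRECT_URL is not set");
  }
  if (isPooledUrl(direct)) {
    throw new Error("Refusing to run migrations through the connection pooler");
  }
  return direct;
}

console.log(isPooledUrl("postgresql://user:pass@pooler.neon.tech:5432/onesummer?pgbouncer=true"));
console.log(isPooledUrl("postgresql://user:pass@ep-xxx.neon.tech:5432/onesummer"));
```

In a deploy job this would typically be called as `migrationUrl(process.env)` before the ORM's migrate command runs.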
Backup Policy
- Neon provides point-in-time recovery (PITR) up to 7 days on free tier, 30 days on Pro.
- Take a manual `pg_dump` backup before every major migration and store it in object storage.
- Set up a daily automated backup job via GitHub Actions cron during peak season.
CI / CD Pipeline
All deployments flow through GitHub Actions. The pipeline enforces test passage, security scanning, and preview deployment before any changes reach production.
GitHub Actions Workflow Structure

    # .github/workflows/

    ci.yml        # runs on every PR
    ├─ lint (eslint + prettier)
    ├─ type-check (tsc --noEmit)
    ├─ unit tests (vitest)
    ├─ integration tests (against Neon branch)
    └─ security scan (npm audit + semgrep)

    preview.yml   # runs on PR open/update
    ├─ build Docker image
    ├─ deploy API to Fly.io preview app
    ├─ deploy frontend to Netlify draft URL
    └─ create/update Neon DB branch

    deploy.yml    # runs on merge to main
    ├─ build + push Docker image to registry
    ├─ run database migrations (direct URL)
    ├─ deploy to Fly.io production
    ├─ deploy frontend to Netlify production
    └─ post-deploy smoke tests

    backup.yml    # cron: 0 2 * * * (2am daily, peak season)
    └─ pg_dump → compress → upload to R2
Container Registry
Push Docker images to GitHub Container Registry (ghcr.io) — free for public repos, $0.008/GB for private. Alternative: Docker Hub. Tag images with the Git SHA for deterministic rollbacks.
Secrets Management
Store all credentials in GitHub Actions Secrets (never in code or .env committed to the repo). Use environment-scoped secrets in GitHub to prevent staging secrets from reaching production jobs. Rotate the Neon database password and Fly API token quarterly.
Environments
| Environment | Purpose | Frontend | API | Database | Trigger |
|---|---|---|---|---|---|
| Production (prod) | Live user traffic | Netlify prod domain | Fly.io prod app | Neon main branch | Merge to main |
| Staging (stage) | Pre-release validation | Netlify branch deploy | Fly.io staging app | Neon staging branch | Merge to staging |
| Preview (preview) | Per-PR review | Netlify deploy preview | Fly.io ephemeral app | Neon PR branch | PR opened/updated |
| Local (local) | Developer machine | localhost:5173 | localhost:3000 | Docker Postgres or Neon dev branch | Manual |
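Preview environments must never contain real user PII. Use synthetic seed data only. A Neon branch is a copy-on-write copy of its parent, data included, so never create preview branches directly from the production branch. Branch from a dedicated parent that holds only the schema and synthetic data, populated with db:seed using anonymized test data. Document this in the contributing guide so all developers follow it.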
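A seed script for preview databases might generate its records along these lines. Everything here is illustrative: the record fields are hypothetical, the emails use the reserved `example.com` domain, and a small seeded generator (mulberry32-style) keeps the output identical across runs so preview branches are reproducible.

```typescript
// Hypothetical record shape for synthetic seed data; no real names or addresses.
interface SeedApplicant {
  id: number;
  displayName: string;
  parentEmail: string;
  birthYear: number;
}

// Small seeded PRNG (mulberry32-style) so every run produces the same data.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function seedApplicants(count: number, seed = 42): SeedApplicant[] {
  const rand = seededRandom(seed);
  return Array.from({ length: count }, (_, i) => ({
    id: i + 1,
    displayName: `Test Camper ${i + 1}`,
    parentEmail: `parent${i + 1}@example.com`,
    birthYear: 2010 + Math.floor(rand() * 8), // 2010 through 2017
  }));
}

console.log(seedApplicants(3));
```

Deterministic output also makes integration tests against a preview branch stable, since assertions can rely on fixed IDs and emails.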
Monitoring & Alerting
Observability for a seasonal platform has two distinct modes: a low-overhead baseline during the off-season, and active monitoring during the Feb–May peak window.
Error Tracking
Sentry (recommended) — free tier covers 5K errors/month. Captures frontend and backend exceptions with stack traces and context. Alternative: Highlight.io.
Metrics & APM
Fly.io built-in metrics for CPU/memory/request latency. For more depth: Grafana Cloud free tier (10K metrics). Alternative: Datadog (expensive at scale).
Uptime / Synthetic
Better Uptime or UptimeRobot — free tier checks every 3 minutes. Alert on HTTP 5xx from the API health endpoint. Critical during Feb–May.
Structured Logging
Fly.io log drain → Logtail / Better Stack. Free tier: 1 GB/month. Use structured JSON logs with request IDs to correlate frontend errors to API calls.
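A structured log line with a request ID could be produced as in the sketch below. The field names (`ts`, `level`, `requestId`, `message`) are assumptions chosen for illustration, not a format required by any log provider.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Hypothetical log shape: one JSON object per line, always carrying a request ID.
function formatLog(
  level: LogLevel,
  message: string,
  requestId: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  // Custom fields are spread first so they cannot overwrite the core keys.
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, requestId, message });
}

// Written to stdout, where a platform log drain can pick it up.
console.log(formatLog("info", "application submitted", "req_123", { route: "/applications" }));
```

If the frontend sends the same request ID in a header and attaches it to its error reports, a frontend exception can be matched to the API log lines for that call.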
Alerting Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| API error rate | > 1% | > 5% | PagerDuty / Slack |
| P95 API latency | > 800ms | > 2000ms | Scale up + investigate |
| DB connection usage | > 70% | > 90% | Increase pool size |
| Disk usage (DB) | > 70% | > 90% | Upgrade storage tier |
| Fly machine count | — | ≥ 8 machines | Capacity review |
| Uptime check failure | 1 failure | 3 consecutive | Immediate page |
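The numeric rows of the table above can be encoded as data and evaluated with one function. This is a minimal sketch: the metric keys are hypothetical names, percentages are written as whole numbers, and the comparison is strictly greater-than, matching the ">" in the table.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Values copied from the alerting table; key names are illustrative.
const THRESHOLDS: Record<string, Threshold> = {
  apiErrorRatePct: { warning: 1, critical: 5 },
  p95LatencyMs: { warning: 800, critical: 2000 },
  dbConnectionUsagePct: { warning: 70, critical: 90 },
  dbDiskUsagePct: { warning: 70, critical: 90 },
};

function classify(metric: string, value: number): Severity {
  const threshold = THRESHOLDS[metric];
  if (!threshold) {
    throw new Error(`Unknown metric: ${metric}`);
  }
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}

console.log(classify("p95LatencyMs", 900));
console.log(classify("dbConnectionUsagePct", 95));
```

Keeping thresholds in one table-like object means the seasonal runbook can adjust them in a single place.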
Seasonal Runbook
- January 15: Enable uptime monitoring alerts. Review Fly.io autoscale limits. Verify connection pooler is healthy.
- February 1: Switch to active monitoring mode. Set Sentry alert frequency to immediate. Enable daily DB backup job.
- May 31: Disable expensive alerting. Reduce DB to minimum tier. Pause read replica.
- Off-season: Weekly uptime check is sufficient. Sentry digest mode.
Cost Projections
All estimates assume a single-founder or small team operating at early-stage scale (thousands of users, not millions). Costs grow linearly with adoption; the architecture supports this without structural changes.
Cost Breakdown by Service
| Service | Provider | Off-Season | Peak Season | Scaling Trigger |
|---|---|---|---|---|
| Frontend CDN | Netlify | $0 | $0 – $19/mo | BW > 100 GB/mo |
| Container API | Fly.io | $0 (scaled to zero) | $10 – $80/mo | Concurrent requests |
| Postgres (compute) | Neon | $0 – $5/mo | $19 – $69/mo | Manual tier upgrade |
| Postgres (storage) | Neon | $0 – $3/mo | $3 – $15/mo | Data volume |
| Redis Cache | Upstash | $0 | $0 – $10/mo | Requests > 10K/day |
| Object Storage | Cloudflare R2 | $0 | $0 – $5/mo | Storage > 10 GB |
| Email | Resend | $0 | $0 – $20/mo | Emails > 3K/mo |
| Auth | Clerk | $0 | $0 – $25/mo | MAU > 10K |
| Monitoring | Sentry + Logtail | $0 | $0 – $26/mo | Error/log volume |
| Estimated Total | - | $0 – $8/mo | $32 – $269/mo | - |
The annual total depends heavily on user adoption during the Feb–May window. At early stage (hundreds of applicants), expect the lower bound. At thousands of concurrent users, expect $400–$600/year — still dramatically less than a traditional always-on infrastructure model that would cost $2,000–$6,000+ annually for the same capability.
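To see how the monthly rows roll up into a yearly figure, the sketch below sums the per-service ranges from the table and projects them over the year. It assumes four peak months (Feb–May) and eight off-season months; the type and function names are illustrative. The result is a bound rather than a forecast, since the upper end assumes every service sits at its highest listed tier for the whole peak window.

```typescript
interface CostRange {
  low: number;
  high: number;
}

interface ServiceCost {
  service: string;
  offSeason: CostRange;
  peak: CostRange;
}

const range = (low: number, high: number): CostRange => ({ low, high });

// Monthly USD figures transcribed from the per-service rows of the table above.
const SERVICES: ServiceCost[] = [
  { service: "Frontend CDN", offSeason: range(0, 0), peak: range(0, 19) },
  { service: "Container API", offSeason: range(0, 0), peak: range(10, 80) },
  { service: "Postgres (compute)", offSeason: range(0, 5), peak: range(19, 69) },
  { service: "Postgres (storage)", offSeason: range(0, 3), peak: range(3, 15) },
  { service: "Redis Cache", offSeason: range(0, 0), peak: range(0, 10) },
  { service: "Object Storage", offSeason: range(0, 0), peak: range(0, 5) },
  { service: "Email (Resend)", offSeason: range(0, 0), peak: range(0, 20) },
  { service: "Auth", offSeason: range(0, 0), peak: range(0, 25) },
  { service: "Monitoring", offSeason: range(0, 0), peak: range(0, 26) },
];

function sumRanges(ranges: CostRange[]): CostRange {
  return ranges.reduce((acc, r) => range(acc.low + r.low, acc.high + r.high), range(0, 0));
}

// Annual bounds: peak months at peak rates, remaining months at off-season rates.
function annualRange(services: ServiceCost[], peakMonths = 4): CostRange {
  const peak = sumRanges(services.map((s) => s.peak));
  const off = sumRanges(services.map((s) => s.offSeason));
  const offMonths = 12 - peakMonths;
  return range(
    peak.low * peakMonths + off.low * offMonths,
    peak.high * peakMonths + off.high * offMonths,
  );
}

console.log(sumRanges(SERVICES.map((s) => s.peak)));
console.log(annualRange(SERVICES));
```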
Security & COPPA Compliance
The Children's Online Privacy Protection Act (COPPA) requires verifiable parental consent before collecting any personal information from children under 13. Violations carry civil penalties up to $51,744 per violation per child. This is not a future concern — it must be addressed before the first real user touches the product.
COPPA Compliance Infrastructure
Age Gating
Collect date of birth at registration. If the user is under 13, immediately pause data collection, store only the age flag (not the DoB) and route to the parental consent flow. Do not create a full profile until consent is verified.
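The gating decision described above might look like the following sketch. The result type and function names are hypothetical; the key property is that the under-13 branch returns only a flag, so the date of birth never has to be persisted for a child.

```typescript
type AgeGateResult =
  | { kind: "proceed" }
  | { kind: "needs_parental_consent"; under13: true };

// Age in whole years on a given date, computed in UTC to avoid timezone drift.
function ageOn(dob: Date, today: Date): number {
  let age = today.getUTCFullYear() - dob.getUTCFullYear();
  const beforeBirthday =
    today.getUTCMonth() < dob.getUTCMonth() ||
    (today.getUTCMonth() === dob.getUTCMonth() && today.getUTCDate() < dob.getUTCDate());
  if (beforeBirthday) {
    age -= 1;
  }
  return age;
}

// Returns only what may be stored: for under-13 users, a flag and no date of birth.
function ageGate(dob: Date, today: Date = new Date()): AgeGateResult {
  if (ageOn(dob, today) < 13) {
    return { kind: "needs_parental_consent", under13: true };
  }
  return { kind: "proceed" };
}

const today = new Date(Date.UTC(2025, 2, 15));
console.log(ageGate(new Date(Date.UTC(2012, 2, 16)), today)); // one day short of 13
console.log(ageGate(new Date(Date.UTC(2012, 2, 15)), today)); // turns 13 today
```

The caller should discard the submitted date of birth as soon as `needs_parental_consent` is returned and route to the consent flow.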
Parental Consent Flow
Send a verifiable parental consent email (VPCE) to the provided parent email. The parent clicks a unique tokenized link, reviews what data will be collected, and explicitly consents. This token is single-use with a 72-hour expiry.
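A single-use token with a 72-hour expiry can be sketched as below, using Node's built-in `crypto` module. The record shape and function names are hypothetical, and storing only a SHA-256 hash of the token is a design choice assumed here, not something the text prescribes.

```typescript
import { createHash, randomBytes } from "node:crypto";

const CONSENT_TTL_MS = 72 * 60 * 60 * 1000; // 72-hour expiry from the text

// Hypothetical stored record: only the hash is kept, never the raw token.
interface ConsentRecord {
  tokenHash: string;
  expiresAt: number; // epoch milliseconds
  usedAt: number | null;
}

type RedeemResult = "ok" | "invalid" | "expired" | "already_used";

function hashToken(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

// Returns the raw token (for the emailed link) and the record to persist.
function issueConsentToken(now: number = Date.now()): { token: string; record: ConsentRecord } {
  const token = randomBytes(32).toString("base64url");
  const record: ConsentRecord = {
    tokenHash: hashToken(token),
    expiresAt: now + CONSENT_TTL_MS,
    usedAt: null,
  };
  return { token, record };
}

// Marks the record as used on success, so the link works exactly once.
function redeemConsentToken(token: string, record: ConsentRecord, now: number = Date.now()): RedeemResult {
  if (hashToken(token) !== record.tokenHash) return "invalid";
  if (record.usedAt !== null) return "already_used";
  if (now >= record.expiresAt) return "expired";
  record.usedAt = now;
  return "ok";
}
```

Against a real database, the "mark as used" step would need to be atomic, for example a single `UPDATE ... WHERE used_at IS NULL` whose affected-row count decides the outcome, so two simultaneous clicks cannot both succeed.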
PII Isolation
Children's PII must be stored in a separate, encrypted Postgres schema with stricter access controls. Application-level code must pass through a COPPA-aware data access layer that logs all reads/writes to child records.
Data Minimization
Collect only what is necessary. Do not run analytics pixels, session recording tools, or third-party ad trackers on any page a child might view. Block all third-party scripts from the application domain.
Infrastructure Security Controls
| Control | Implementation | Layer |
|---|---|---|
| TLS everywhere | Enforced at CDN and Fly.io ingress; no HTTP allowed | Network |
| Secrets management | GitHub Actions Secrets; Fly.io secrets; never in env files | Application |
| Database encryption at rest | Neon encrypts all storage with AES-256 by default | Data |
| Database encryption in transit | Require SSL on all Postgres connections; `?sslmode=require` | Data |
| Authentication | Clerk (recommended) — handles session tokens, MFA, OAuth | Application |
| Rate limiting | Upstash Redis rate limiter at API middleware; 100 req/min per IP | API |
| CSRF protection | SvelteKit CSRF built-in for form actions; API uses Bearer tokens | Application |
| Content Security Policy | Strict CSP header via `netlify.toml`; no `unsafe-inline` | Frontend |
| Input validation | Zod schemas at API boundary; never trust client data | Application |
| SQL injection prevention | Parameterized queries only via ORM; no raw string interpolation | Data |
| Dependency scanning | Dependabot + `npm audit` in CI; weekly automated PR | Pipeline |
| Object storage ACLs | All buckets private by default; presigned URLs for user uploads | Data |
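The rate-limiting control in the table (100 requests per minute per IP) follows the logic sketched below. This is an in-memory fixed-window limiter used purely to show the idea: it stands in for a Redis-backed limiter, keeps state per process, and the class name is hypothetical.

```typescript
// In-memory fixed-window limiter. A stand-in for a Redis-backed limiter:
// state lives in one process and is lost on restart.
class FixedWindowLimiter {
  private readonly counts = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 100, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the caller should respond with 429.
  allow(ip: string, now: number = Date.now()): boolean {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    const entry = this.counts.get(ip);
    if (!entry || entry.windowStart !== windowStart) {
      // First request from this IP in the current window.
      this.counts.set(ip, { windowStart, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) {
      return false;
    }
    entry.count += 1;
    return true;
  }
}

const limiter = new FixedWindowLimiter(); // 100 requests per 60 seconds per IP
console.log(limiter.allow("203.0.113.7"));
```

Because API machines scale out and stop independently, production limits need shared state, which is why the table places the counter in Redis rather than in process memory.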
COPPA Data Handling Schema

    -- Separate schema for child PII
    CREATE SCHEMA child_data;

    -- Minimal illustrative table so the policy below has a target;
    -- the real profile columns are not specified here.
    CREATE TABLE child_data.profiles (
      id             uuid DEFAULT gen_random_uuid() PRIMARY KEY,
      parent_user_id uuid NOT NULL
    );

    -- Row-level security: only the owning parent can read
    ALTER TABLE child_data.profiles ENABLE ROW LEVEL SECURITY;
    CREATE POLICY parent_owns_child ON child_data.profiles
      USING (parent_user_id = current_setting('app.current_user_id')::uuid);

    -- Audit log every access to child records
    CREATE TABLE child_data.access_log (
      id          uuid DEFAULT gen_random_uuid() PRIMARY KEY,
      child_id    uuid NOT NULL,
      accessor_id uuid NOT NULL,
      action      text NOT NULL,
      accessed_at timestamptz DEFAULT now()
    );
Privacy Policy Requirements
The COPPA privacy notice must be separate from the general privacy policy and written in plain language understandable to parents. It must describe: (1) what information is collected from children, (2) how it is used, (3) whether it is disclosed to third parties, and (4) the parent's rights to review, delete, and withdraw consent. Consult a privacy attorney before launch.
Disaster Recovery
Disaster recovery for OneSummer is primarily a data recovery problem. The frontend and API are stateless and redeploy from Git in under 5 minutes. The database is the only component that requires a formal recovery procedure.
Recovery Time Objective (RTO)
Target: < 30 minutes for full service restoration during peak season. The API and frontend can be redeployed in < 5 minutes; database restoration from a recent backup is the dominant recovery time.
Recovery Point Objective (RPO)
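Target: < 1 hour of data loss during peak season. Neon PITR provides this recovery point; the daily pg_dump is the fallback when Neon itself is unavailable, in which case up to 24 hours of data may be lost. During off-season, an RPO of 24 hours is acceptable because traffic is near zero.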
Failure Scenarios and Responses
| Scenario | Detection | Response | Est. RTO |
|---|---|---|---|
| API container crash / OOM | Uptime check + Sentry | Fly.io auto-restarts; if persistent, roll back Docker image to previous SHA | 2–5 min |
| Bad deployment (regression) | Post-deploy smoke tests | `fly deploy --image ghcr.io/onesummer/api:<prev-sha>` | 3–5 min |
| Database corruption / bad migration | Error spike + manual detection | Use Neon PITR to restore to pre-migration timestamp; re-apply clean migration | 15–30 min |
| Neon regional outage | DB connection failure alerts | Restore most recent `pg_dump` backup to Supabase or RDS emergency instance | 30–60 min |
| CDN / Netlify outage | Uptime check | Point DNS to Cloudflare Pages fallback (keep repo connected to both) | 5–10 min |
| Credential compromise | Unusual access patterns / manual report | Rotate all secrets immediately; invalidate all user sessions via Clerk dashboard; audit access logs | 10–20 min |
Database Recovery Runbook

    # Step 1: Identify the restore point
    # For Neon PITR, use the Neon console or CLI
    neon branches create \
      --name recovery-attempt \
      --parent main \
      --timestamp "2025-03-15T14:30:00Z"

    # Step 2: Verify data integrity on the recovery branch
    psql "$RECOVERY_DATABASE_URL" -c "SELECT count(*) FROM applications;"

    # Step 3: Promote the recovery branch to production
    # (swap the DATABASE_URL environment variable in Fly.io)
    fly secrets set DATABASE_URL="$RECOVERY_DATABASE_URL" -a onesummer-api

    # Step 4: Restart the API machines
    fly machines restart -a onesummer-api

    # Step 5: Verify application health
    curl https://api.onesummer.com/health
Annual DR Test
Run a full database recovery drill each January (before peak season begins). Restore the production database to a staging environment using a 30-day-old backup and verify application functionality. Document the test results and update this runbook with any lessons learned.
Launch Checklist
Complete all items before accepting real user data. Items marked with a legal or compliance tag require external review.
- Infrastructure
  - Production domain configured with DNS pointing to Netlify; HTTPS enforced via HSTS with `max-age=31536000`
  - Fly.io production app deployed; `min_machines_running = 0` confirmed; maximum of 10 Machines set for burst capacity
  - Neon production database provisioned; connection pooler URL tested; migrations applied clean on `main` branch
  - All secrets stored in GitHub Actions Secrets and Fly.io secrets; no credentials in code or committed `.env` files
  - Cloudflare R2 bucket created with private ACL; presigned URL generation tested end-to-end
  - Upstash Redis instance provisioned; rate limiting middleware tested and confirmed blocking at threshold
- CI / CD
  - All three GitHub Actions workflows (`ci.yml`, `preview.yml`, `deploy.yml`) passing on a test PR and merge
  - Post-deploy smoke tests hitting at least: health endpoint, auth flow, DB read, DB write
  - Rollback procedure tested: deploy an intentionally broken image, confirm failure detection, execute rollback, confirm recovery
  - Dependabot enabled on the repo with weekly schedule for both npm and Docker base image
- Security
  - All API endpoints authenticated; no unauthenticated routes expose PII
  - SQL injection test suite passing; no raw string interpolation in query builder
  - Database row-level security policies verified on `child_data` schema
  - Penetration test or security review completed (an informal review is acceptable; OWASP Top 10 checklist at minimum)
- COPPA / Legal (requires attorney review)
  - Age gate implemented and tested: users under 13 are blocked from completing profile until parental consent is verified
  - Parental consent email flow tested end-to-end: token generation, email delivery, consent recording, account activation
  - COPPA-compliant privacy notice published, written for a parent audience and reviewed by privacy counsel
  - Data deletion mechanism implemented: parent can request deletion of all child data via a documented process; deletion confirmed within 10 business days
  - No third-party analytics, tracking pixels, or session recording active on any page reachable by a child account
  - Terms of Service and Privacy Policy finalized and linked from footer, registration flow, and cookie banner
- Monitoring
  - Sentry initialized in both frontend and API; test error captured and visible in dashboard
  - Uptime monitoring configured for `api.onesummer.com/health` and `onesummer.com`; SMS/Slack alert tested
  - Logtail log drain connected to Fly.io; structured logs visible with request ID correlation
  - Daily DB backup cron job enabled and first backup confirmed in R2 storage
- Disaster Recovery
  - Neon PITR confirmed working: restored staging to a point 1 hour prior; application loaded correctly
  - Runbook for all failure scenarios reviewed and accessible to all team members (not just the person who wrote it)
  - Emergency contact list current: Neon support, Fly.io support, domain registrar, privacy counsel
To capture the full Feb–May peak season, all checklist items must be complete and the system load-tested before January 15. Use Neon branching and Fly.io preview apps to iterate rapidly on staging without risk to production. The architecture is designed to stay out of your way so you can focus on the product.