Database & Infrastructure Analysis: Self-Hosted EC2 vs. Supabase

This document analyzes the engineering trade-offs between your current Self-Hosted PostgreSQL (Docker on EC2) and Supabase (Managed Cloud PostgreSQL). It specifically evaluates the infrastructure disadvantages of a migration in the context of the 23 active API endpoints currently running on your EC2 instance.

1. Our Existing API Ecosystem on EC2 (23 Active Endpoints)

Your FastAPI backend currently orchestrates 23 active endpoints across 7 active routers:

System & Health (2 Endpoints)

  • GET / - Root status and redirect to API documentation.
  • GET /health - Service health monitor.

Authentication (auth.py — 1 Endpoint)

  • POST /auth/anonymous - Translates client hardware UUIDs (device_id) into secure JWT tokens.

Student Profiles (users.py — 3 Endpoints)

  • GET /users/me - Retrieves student profile data and CEFR levels.
  • PUT /users/me - Updates demographic details (name, age, location).
  • PUT /users/me/settings - Updates focus configurations (grammar, vocab, fluency metrics).

Placement Screening (placements.py — 3 Endpoints)

  • POST /placements - Saves screening level assessments, confidence metrics, and conversation transcript logs.
  • GET /placements/latest - Fetches the student’s most recent placement test.
  • GET /placements - Lists historical placement records for the student.

Conversation Sessions (conversations.py — 6 Endpoints)

  • POST /conversations - Creates a new active call/conversation container.
  • GET /conversations - Lists user conversation logs (paged).
  • GET /conversations/{id} - Retrieves metadata for a specific call session.
  • PUT /conversations/{id}/complete - Concludes the session and records duration.
  • POST /conversations/{id}/messages - Saves individual voice turns, grammar corrections, and STT/LLM/TTS latencies.
  • GET /conversations/{id}/messages - Retrieves chat transcripts for a session.

Analytics & Progress (progress.py — 1 Endpoint)

  • GET /progress/summary - Computes aggregate stats (total calls, speaking time, current CEFR level).

Voice AI Orchestration (llm.py — 4 Endpoints)

  • POST /llm/conversation-starter - Generates AI topic introductions.
  • POST /llm/conversation-turn - Real-time conversational AI loop (Zoe response).
  • POST /llm/placement-turn - Real-time placement interview AI loop.
  • POST /llm/session-analysis - Post-session evaluation (courage metrics, best moments).

Classroom Dashboard (admin.py — 3 Endpoints)

  • GET /admin/stats - Global dashboard reports (requires secret key).
  • GET /admin/conversations/{id}/transcript - Admin review of specific student transcripts.
  • GET /admin/placements/{user_id}/transcript - Admin review of student placement screens.

2. Infrastructure Disadvantages of Converting to Supabase

Migrating this API ecosystem to Supabase (whether migrating the database only, or moving the APIs to serverless Edge Functions) introduces critical infrastructure penalties:

A. The Latency Multiplier (The “Double Network Hop”)

For voice-first learning, total response latency must be kept under 1.5 to 2.0 seconds to maintain conversational flow.
  • Current Setup (EC2 Local): FastAPI calls Postgres via the local Docker network loopback (localhost). Network travel time is <0.1 ms per query.
  • Supabase Setup (Remote DB): Every database query must leave the EC2 instance and travel over the WAN to Supabase’s servers.
    • This adds 15 ms to 50 ms of network latency per query (even inside the same cloud region).
    • In routes like POST /llm/conversation-turn, the code executes multiple database reads and writes sequentially (Select User Profile \rightarrow Fetch Chat History \rightarrow Insert new message turn \rightarrow Log latency data).
    • The Penalty: If each query takes 30ms, you add 150ms of pure network travel delay to every single spoken turn.
[Current Setup]
FastAPI (EC2) <---(Local Loopback <0.1ms)---> Postgres Container (EC2)

[Supabase Setup]
FastAPI (EC2) <---(WAN network transit 15-50ms)---> Supabase DB (Cloud)

B. Split Network Hops for Voice AI

Endpoints like /llm/conversation-turn and /llm/placement-turn must orchestrate external LLM APIs (Gemini/OpenAI) alongside database state.
  1. Hop 1: FastAPI queries the DB to fetch student progress and chat logs.
  2. Hop 2: FastAPI calls Gemini/OpenAI to generate Zoe’s response.
  3. Hop 3: FastAPI writes Zoe’s response and latency telemetry back to the DB.
  • If DB is on Supabase: Hops 1 and 3 travel over the internet to Supabase, while Hop 2 travels to Google/OpenAI. This splits your database traffic from your application traffic, creating a distributed system that is highly vulnerable to network jitter and latency spikes.

C. Serverless “Cold Start” Latency (If migrating APIs completely)

If you move your 23 API endpoints off FastAPI and into Supabase Edge Functions (Deno/TypeScript):
  • Serverless functions run in “cold” micro-containers. If a student hasn’t initiated a conversation in the last 15 minutes, the next API request faces a 1 to 3-second container boot time (Cold Start).
  • The Penalty: The student speaks, and the app freezes for 4 seconds before Zoe even begins processing the audio. This breaks the voice-first experience.

D. WAN Connection Pooling & Stability

  • Current Setup: FastAPI maintains a persistent pool of local TCP sockets directly to Postgres.
  • Supabase Setup: Your API must maintain a connection pool across the open internet to Supabase. Any temporary WAN routing glitch will disrupt SQLAlchemy database transactions, causing random ConnectionResetError exceptions inside the conversational loop. To mitigate this, you must set up and maintain a dedicated transaction proxy (like PgBouncer), adding operational complexity.

3. Data Safety & Backups (The Real Risk)

If your EC2 instance crashes or restarts, your database is NOT lost.
  • Your Docker container writes database files directly to an AWS persistent EBS (Elastic Block Store) SSD volume.
  • EBS volumes are durable, replicated network storage. They survive VM reboots and crashes automatically.

The actual risks of data loss on EC2:

  1. Logical Deletion (App Bugs): If a bug in the code runs a bad query (e.g., DELETE FROM users), the database executes it instantly. Without a point-in-time backup, you cannot restore the database state to “3 hours ago.”
  2. Human Error (Accidental Deletion): A developer running docker compose down -v (which deletes named volumes) or terminating the EC2 instance without disabling the “Delete EBS on termination” setting.
  3. Server Corruption: Sudden power failure on the host physical machine can corrupt active writes to the database filesystem.
  • Self-Hosted EC2: You must manually write, schedule, and test backup scripts (e.g., cron jobs running pg_dump and uploading snapshots to AWS S3) to ensure you have a backup history.
  • Supabase: Automatically takes daily snapshots, logs the write-ahead logs, and manages database backups out-of-the-box.

4. Concurrency & FastAPI Connection Pooling

  • Active Users: Students who currently have the Ella app open (e.g., 1,000 students).
  • Concurrent Users (Database): Students whose app or backend is sending a SQL statement to the database at the exact same millisecond.
    • For 1,000 active students, you will typically see 10 to 50 concurrent DB queries at any given instant.

The Connection Trap in your FastAPI Code:

In /llm/conversation-turn, the database session is opened at the start of the request via Depends(get_db) and stays open while waiting for the Gemini/OpenAI API response (which takes 2 to 3 seconds).
@router.post("/conversation-turn")
async def conversation_turn(
    body: ConversationTurnRequest,
    db: AsyncSession = Depends(get_db), # <--- Opens DB Connection here
):
    # 1. Queries DB (Fast: 1ms)
    # 2. await llm_service.send_conversation_turn(...) # <--- SLOW: Waits 2-3 seconds for LLM
    # 3. Request ends, DB Connection is finally closed.
This blocks the database connection. If 100 students talk to Zoe at the same time, all 100 database connections in your pool will be frozen waiting for LLM responses, and the 101st student will experience a database timeout or crash.

How to Scale Connection Limits:

  • Option A: Current EC2 Setup: You can increase the connection pool size inside your SQLAlchemy engine (pool_size and max_overflow). More importantly, you must change your FastAPI code to query the DB, close the session, call the LLM, and then open a new session to write the result. This frees up database connections instantly.
  • Option B: Supabase: Includes PgBouncer / Supavisor pre-configured. These are connection poolers that queue thousands of client connections and execute them in a tight pool of physical Postgres connection slots. (Note: Even with Supabase, you must still fix the FastAPI code connection trap, otherwise the pooler will run out of slots).

5. Costing & External Service Dependencies

Costing:

  • Self-Hosted EC2: Flat rate (~8/monthfort3.micro; 8/month for t3.micro; ~16/month for t3.small). You can run millions of database queries and active users without the cost changing. You only pay for CPU, RAM, and Disk space.
  • Supabase:
    • Free Tier: Limited to 500 MB of database size. Under 20 minutes of daily active voice usage (~2.5 MB/user/year), you will hit the 500MB limit at ~200 active users.
    • Pro Tier (25/month):Coversupto8GBofdatabasestorage.Ifyourdatabasegrowsbeyond8GB,youarebilledanadditional25/month):** Covers up to 8 GB of database storage. If your database grows beyond 8 GB, you are billed an additional **0.50 per GB per month.

External Service Dependency:

By moving to Supabase, your database depends entirely on a third-party platform. If Supabase goes down, your app goes down. With self-hosted Docker on EC2, you are completely independent.

Comparison Summary

MetricOption A: Current EC2 Setup (Self-Hosted Postgres)Option B: Supabase (Managed Cloud Postgres)
Database CostFree (Shares the existing EC2 instance resources).Free (up to 500MB), then $25/month + storage overrides.
API LatencyUltra-Low (<1ms): Database and API live on the same local loop.Added Network Hops (15ms-50ms): Calls travel over the internet.
Safety (Backups)Manual: Requires you to write custom scripts to backup to S3.Automatic: Daily snapshots, point-in-time recovery out-of-the-box.
DevOps OverheadHigh: You configure replication, poolers, OS patches, and volumes.Zero: Fully managed cloud dashboard.
DependencyNone: You own and run the Docker containers.High: Dependent on Supabase cloud servers.