🎬 Myco Streaming Platform

Performance Troubleshooting Checklist

1 Quantify the Failure - Establish Baseline Metrics
📊 Why This Matters

You cannot fix what you cannot measure. "It lags" is not actionable data. We need specific numbers to identify bottlenecks and track improvements.

  • Set up CloudWatch Dashboard for Real-time Monitoring
    Critical
    Create a dedicated dashboard to monitor key metrics during load events.
    aws cloudwatch put-dashboard --dashboard-name "MycoStreamingPerformance" --dashboard-body file://dashboard-config.json
    Target Metrics: Request latency (p50, p95, p99), Error rates (4xx/5xx), Concurrent viewers, Stream buffer ratio
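A minimal dashboard-config.json for the command above might look like this (the API Gateway name, region, and widget layout are assumptions, not existing resources):

```shell
# Sketch of the dashboard-config.json referenced above
# (ApiName "myco-api" and the region are illustrative)
cat > dashboard-config.json <<'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApiGateway", "Latency", "ApiName", "myco-api", {"stat": "p95"}],
          ["AWS/ApiGateway", "Latency", "ApiName", "myco-api", {"stat": "p99"}],
          ["AWS/ApiGateway", "5XXError", "ApiName", "myco-api", {"stat": "Sum"}]
        ],
        "region": "us-east-1",
        "title": "API latency and errors"
      }
    }
  ]
}
EOF
```

The same file is then pushed with the put-dashboard command above.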
  • Collect Request Performance Metrics
    Critical
    Pull detailed request metrics from CloudWatch during peak load periods.
    aws logs start-query --log-group-name "/aws/apigateway/myco-api" --start-time $(date -d "1 hour ago" +%s) --end-time $(date +%s) --query-string 'fields @timestamp, responseTime | filter responseTime > 2000 | sort @timestamp desc'
    Healthy Targets: p95 latency < 500ms, p99 latency < 1000ms, Error rate < 0.1%
  • Monitor Stream Quality Metrics
    Critical
    Track video-specific performance indicators that directly impact user experience.
    Key Streaming Metrics to Track
    Startup Time: Time from play request to first frame
    Buffer Ratio: % of playback time spent buffering
    Bitrate Adaptation: Frequency of quality changes
    CDN Cache Hit Ratio: % of requests served from cache
    Concurrent Viewer Count: Peak simultaneous streams
    Healthy Targets: Startup < 3s, Buffer ratio < 1%, Cache hit ratio > 95%
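Player-side metrics such as startup time and buffer ratio are not emitted by AWS automatically; one way to record them is a custom CloudWatch metric. The namespace, metric, and dimension names below are illustrative, not existing resources:

```shell
# Publish a player-reported startup time as a custom metric
# (namespace/metric/dimension names are assumptions for this sketch)
aws cloudwatch put-metric-data \
  --namespace "Myco/Streaming" \
  --metric-name StartupTimeMs \
  --dimensions StreamId=stream1 \
  --value 2400 \
  --unit Milliseconds
```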
  • Enable X-Ray Distributed Tracing
    High
    Enable X-Ray in your service configuration or application SDK, then confirm traces are arriving and see exactly where time is spent in each request.
    aws xray get-trace-summaries --start-time $(date -d "1 hour ago" +%s) --end-time $(date +%s)

    Focus Areas: Database queries, external API calls, cache lookups, authentication

  • Set Up Load Testing to Reproduce Issues
    High
    Create reproducible load tests that simulate real user behavior.
    Load Test Scenarios
    Gradual Ramp: Slowly increase concurrent viewers to find breaking point
    Spike Test: Sudden traffic surge (like viral stream start)
    Sustained Load: Steady high traffic for extended periods
    Peak Events: Simulate major live events with 10x normal traffic
    # Example with k6
    k6 run --vus 1000 --duration 10m streaming-load-test.js
2 CDN and Edge Layer Analysis
🌐 Why Start Here

For streaming platforms, the large majority of performance issues originate at the CDN/edge layer. It is your first line of defense and the most common bottleneck.

  • Analyze CloudFront Cache Performance
    Critical
    Check if your CDN is actually protecting your origin from traffic.
    aws cloudfront get-distribution-config --id E1234567890123
    Key CloudFront Metrics to Check
    Cache Hit Ratio: Should be >95% for video segments
    Origin Shield Enabled: Reduces origin load by 90%+
    Cache Behaviors: Different TTLs for segments vs manifests
    Viewer Request Rate: Requests per second per edge location
    Error Rate: 4xx/5xx responses from CloudFront
    Healthy Targets: Cache hit ratio >95%, Origin shield enabled, Error rate <0.5%
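To trend the hit ratio over time, you can pull the CacheHitRate metric. Note this is one of CloudFront's additional metrics and must be enabled on the distribution, and that CloudFront metrics are published in us-east-1; the distribution ID matches the illustrative one above:

```shell
# Cache hit ratio per 5-minute window over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/CloudFront \
  --metric-name CacheHitRate \
  --dimensions Name=DistributionId,Value=E1234567890123 Name=Region,Value=Global \
  --start-time $(date -d "1 hour ago" --iso-8601=seconds) \
  --end-time $(date --iso-8601=seconds) \
  --period 300 --statistics Average \
  --region us-east-1
```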
  • Verify HLS/DASH Manifest Caching
    Critical
    Manifest files are requested every few seconds by every viewer. Improper caching here kills origins.
    curl -I https://d123456789.cloudfront.net/live/stream1/playlist.m3u8
    Check Headers: Cache-Control: max-age=6, X-Cache: Hit from cloudfront

    Common Issue: Setting TTL=0 on manifests works at low scale but causes origin storms at high scale.

  • Review Video Segment Size and Caching
    High
    Segment size affects both buffering and request volume. Wrong size causes cascade failures.
    Optimal Segment Configuration
    Segment Duration: 6-10 seconds (balance between latency and efficiency)
    Segment Size: 1-3MB per segment for good caching
    Cache TTL: 24+ hours for video segments (they never change)
    Adaptive Bitrate: 3-5 quality levels to match network conditions
    GOP Size: Matches segment duration for clean cuts
  • Check MediaPackage Configuration
    High
    If using AWS MediaPackage, verify it's optimized for your traffic patterns.
    aws mediapackage describe-origin-endpoint --id myco-hls-endpoint
    Key Settings: Origin shield enabled, Proper segment retention, CDN authorization configured
  • Verify Geographic Distribution
    Medium
    Ensure CDN has proper coverage for your audience and traffic is routing optimally.
    aws cloudfront get-distribution-config --id E1234567890123 | jq '.DistributionConfig.PriceClass'

    Check if you're using the right price class for your global audience distribution.

3 Load Balancer and Traffic Routing
⚖️ Traffic Distribution Layer

After CDN, the load balancer is your next bottleneck. It distributes traffic to healthy backends and can queue requests when backends are overwhelmed.

  • Monitor ALB/NLB Performance Metrics
    Critical
    Check if your load balancer is becoming a bottleneck under high load.
    aws elbv2 describe-load-balancers --load-balancer-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/myco-alb/1234567890123456
    Critical Load Balancer Metrics
    TargetResponseTime: How long backends take to respond
    RejectedConnectionCount: Connections rejected because the ALB ran out of capacity
    ActiveConnectionCount: Current active connections
    HTTPCode_ELB_5XX_Count: Errors generated by the load balancer itself, not your targets
    SurgeQueueLength / SpilloverCount: Classic Load Balancer only; requests queued, then dropped, when targets are saturated
    Warning Signs: RejectedConnectionCount > 0, TargetResponseTime increasing, ELB-generated 5xx responses
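A hedged example of pulling p95 backend response time for the last hour; the LoadBalancer dimension value is the suffix of the ALB ARN, matching the illustrative ARN above:

```shell
# p95 TargetResponseTime per 5-minute window over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/myco-alb/1234567890123456 \
  --start-time $(date -d "1 hour ago" --iso-8601=seconds) \
  --end-time $(date --iso-8601=seconds) \
  --period 300 \
  --extended-statistics p95
```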
  • Verify Target Health and Distribution
    Critical
    Ensure traffic is evenly distributed across healthy targets.
    aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myco-targets/1234567890123456

    Check for: Unhealthy targets, Uneven request distribution, Cross-zone load balancing enabled

  • Review Health Check Configuration
    High
    Aggressive health checks can overwhelm targets during high load. Verify they're properly tuned.
    Health Check Best Practices
    Interval: 30 seconds (not 10s) to reduce load
    Timeout: 5-10 seconds depending on app response time
    Healthy Threshold: 2-3 consecutive successes
    Unhealthy Threshold: 3-5 consecutive failures
    Health Check Path: Lightweight endpoint, not main app logic
  • Check Connection Draining Settings
    High
    During autoscaling events, improper connection draining can cause request drops.
    aws elbv2 describe-target-group-attributes --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myco-targets/1234567890123456
    Recommended: Deregistration delay 30-60s for web apps, 5-15s for APIs
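Adjusting the delay is a one-line change; a sketch using the illustrative target group ARN above:

```shell
# Lower the deregistration delay to 30s so scale-in events drain quickly
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myco-targets/1234567890123456 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30
```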
  • Verify Sticky Sessions Configuration
    Medium
    If using session affinity, check it's not causing uneven load distribution.

    Best Practice: Use stateless design when possible, or external session store (Redis)

4 Compute Tier Analysis (EC2/ECS/EKS/Lambda)
💻 Application Layer Performance

Your compute layer processes requests and generates responses. Issues here manifest as high CPU/memory usage, slow autoscaling, or connection limits.

  • Analyze Autoscaling Performance
    Critical
    Check if autoscaling can keep up with traffic spikes typical of live streaming events.
    aws autoscaling describe-scaling-activities --auto-scaling-group-name myco-asg --max-items 20
    Autoscaling Optimization Checklist
    Scale-out Time: Should complete in <2 minutes for streaming events
    Cooldown Period: 300s default may be too long for spiky traffic
    Scaling Metric: CPU utilization alone may not capture connection load
    Target Capacity: Keep 20-30% headroom for sudden spikes
    Warm Pool: Pre-warmed instances for instant scaling
    Target: Scale-out in <2 minutes, Scale-in in 5-10 minutes with proper cooldowns
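If you are scaling on CPU, a target-tracking policy is usually the simplest way to keep headroom; a sketch with illustrative names:

```shell
# Target-tracking policy that keeps average CPU around 60%,
# leaving headroom for sudden spikes (ASG/policy names are illustrative)
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name myco-asg \
  --policy-name myco-cpu-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 60.0
  }'
```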
  • Monitor Resource Utilization
    Critical
    Track CPU, memory, network, and disk I/O to identify resource bottlenecks.
    aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=AutoScalingGroupName,Value=myco-asg --start-time $(date -d "1 hour ago" --iso-8601) --end-time $(date --iso-8601) --period 300 --statistics Average,Maximum
    Warning Thresholds: CPU >70%, Memory >80%, Network >80% of instance capacity
  • Check Connection and File Descriptor Limits
    Critical
    Streaming applications often hit connection limits before CPU limits.
    # Check current limits on the instance
    ulimit -n            # File descriptors
    ss -s                # Socket summary
    netstat -an | wc -l  # Total connections
    Common Connection Limit Issues
    File Descriptor Limit: Default 1024 may be too low
    Socket Limits: TIME_WAIT sockets consuming connection pool
    Application Thread Pool: Fixed thread pools causing queuing
    Database Connection Pool: All connections consumed
    WebSocket Connections: Long-lived connections for chat/interactive features
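A quick way to inspect the file-descriptor cap and sketch a persistent raise. The user name and values below are illustrative; the drop-in would be installed as /etc/security/limits.d/myco.conf and takes effect on the next login or service restart:

```shell
# Current soft and hard file-descriptor limits for this shell
ulimit -Sn
ulimit -Hn

# Sketch of a limits.d drop-in raising the cap for the app user
# (user name and values are assumptions for this example)
cat > myco-limits.conf <<'EOF'
myco  soft  nofile  65536
myco  hard  nofile  65536
EOF
```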
  • Review Instance Types and Sizes
    High
    Ensure you're using optimal instance types for your workload characteristics.
    Instance Type Recommendations
    API Servers: C5/C5n for CPU-intensive request processing
    Video Processing: C5n/M5n for high network throughput
    Database: R5/R5n for memory-intensive workloads
    Cache Nodes: R5/X1e for large in-memory datasets
    High Connection Count: C5n/M5n for enhanced networking
  • Validate Application Startup Time
    High
    Slow application startup defeats fast autoscaling. Measure and optimize boot time.
    # Time application startup
    time docker run myco/streaming-app
    Target: Application ready to serve traffic in <30 seconds

    Optimization ideas: Pre-built AMIs, smaller Docker images, lazy initialization, warm pools

5 Database and Storage Layer
🗄️ Data Layer Performance

Database bottlenecks often appear as sudden performance cliffs. A query that works fine at 100 RPS becomes catastrophic at 10,000 RPS.

  • Analyze RDS Performance Insights
    Critical
    Use RDS Performance Insights to identify database bottlenecks during load spikes.
    aws rds describe-db-instances --db-instance-identifier myco-primary
    Key Database Metrics
    DB Load (AAS): Active sessions averaged over time
    CPU Utilization: Database server CPU usage
    Database Connections: Active vs max connections
    Read/Write IOPS: Disk I/O operations per second
    Lock Waits: Queries waiting for locks
    Top SQL Statements: Most resource-intensive queries
    Warning Signs: DB Load > vCPUs, Connection % > 80%, Long-running queries
  • Check for Missing Database Indexes
    Critical
    Queries without proper indexes cause table scans that kill performance at scale.
    -- PostgreSQL: find high-cardinality columns that may need indexes
    SELECT schemaname, tablename, attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'user_sessions' AND n_distinct > 100;
    Common Streaming App Query Patterns
    User Authentication: Index on user_id, email, token fields
    Stream Lookup: Index on stream_id, status, created_at
    Viewer Count: Aggregation queries need composite indexes
    Chat Messages: Index on stream_id + timestamp
    Analytics: Time-based queries need date/time indexes
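Hedged examples of indexes matching the patterns above. Table and column names are assumptions; CONCURRENTLY avoids locking writes while the index builds:

```shell
# Illustrative indexes for the auth and chat query patterns
# ($DATABASE_URL and all table/column names are assumptions)
psql "$DATABASE_URL" <<'SQL'
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_sessions_token
    ON user_sessions (token_hash);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chat_stream_time
    ON chat_messages (stream_id, created_at);
SQL
```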
  • Monitor Connection Pool Health
    Critical
    Database connection pool exhaustion is a common scaling bottleneck.
    aws rds describe-db-instances --db-instance-identifier myco-primary --query 'DBInstances[0].DBParameterGroups'
    Check: max_connections setting, active vs idle connections, connection wait time

    Fix: Use connection pooler like PgBouncer, or AWS RDS Proxy for automatic pooling

  • Analyze DynamoDB Throttling (if applicable)
    High
    DynamoDB hot partitions are extremely common in streaming apps where one viral stream creates hot keys.
    aws dynamodb describe-table --table-name myco-streams --query 'Table.ProvisionedThroughput'
    DynamoDB Hot Partition Signs
    Throttled Requests: ProvisionedThroughputExceededException
    Hot Partition Key: One stream getting 90% of traffic
    Insufficient RCU/WCU: Under-provisioned for peak load
    Poor Key Design: Keys not distributing load evenly
    Missing Global Secondary Indexes: Table scans instead of queries

    Solutions: Use stream_id + timestamp composite keys, enable DynamoDB auto-scaling, add random suffix to hot keys

  • Review Read Replica Strategy
    High
    Read-heavy workloads like user profiles and stream metadata benefit from read replicas.
    aws rds describe-db-instances --query 'DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=`null`]'

    Best Practice: Route read queries to replicas, keep writes on primary, monitor replica lag

6 Cache Layer Analysis (ElastiCache/In-Memory)
⚡ Caching Strategy

Proper caching can reduce database load by 90%+. Missing or misconfigured cache layers often cause database bottlenecks at scale.

  • Monitor Cache Hit Ratios
    Critical
    Low cache hit ratios indicate cache misses that flood the database.
    aws elasticache describe-cache-clusters --cache-cluster-id myco-redis-001
    Target Hit Ratios: Session data >95%, User profiles >90%, Stream metadata >85%

    Monitor: CacheHits vs CacheMisses, Evictions, Memory utilization
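The hit ratio itself is derived from the CacheHits and CacheMisses counters; a quick sanity check of the arithmetic with illustrative numbers:

```shell
# Hit ratio = hits / (hits + misses); the counts here are made up
hits=952000
misses=48000
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.1f", 100 * h / (h + m) }')
echo "cache hit ratio: ${ratio}%"   # prints: cache hit ratio: 95.2%
```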

  • Check Cache Key Strategy
    Critical
    Poor cache key design leads to unnecessary misses and inefficient memory usage.
    Effective Cache Key Patterns
    User Sessions: session:${token_hash}
    Stream Data: stream:${stream_id}:metadata
    Viewer Counts: count:${stream_id}:${time_bucket}
    User Profiles: user:${user_id}:profile
    Chat History: chat:${stream_id}:${page}

    Avoid: Keys without namespaces, keys with user input (injection risk), overly long keys

  • Validate Cache Expiration Strategy
    High
    Wrong TTLs cause either stale data or excessive database hits.
    Recommended TTLs by Data Type
    User Sessions: 24 hours (match session timeout)
    Stream Metadata: 5 minutes (relatively static)
    Viewer Counts: 30 seconds (frequently changing)
    User Profiles: 1 hour (occasionally updated)
    Configuration: 10 minutes (rare changes)
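With Redis, the TTL is set at write time; a sketch using the key patterns from the previous item (redis-cli, the key values, and the $payload variable are assumptions):

```shell
# Stream metadata: relatively static, 5-minute TTL
redis-cli SET "stream:abc123:metadata" "$payload" EX 300

# Viewer count: frequently changing, 30-second TTL
redis-cli SET "count:abc123:202406011200" 1842 EX 30
```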
  • Check Memory Usage and Eviction Policy
    High
    Running out of cache memory triggers evictions that hurt performance.
    aws elasticache describe-cache-clusters --cache-cluster-id myco-redis-001 --show-cache-node-info
    Target: Memory usage <80%, Evictions <1% of total operations

    Eviction Policy: Use allkeys-lru for general caching, volatile-ttl for session data

  • Review Cache Warming Strategy
    Medium
    Cold caches after deployments or failures cause database load spikes.

    Strategy: Pre-load popular streams, warm user sessions gradually, use consistent hashing for multiple cache nodes

7 Application Code Analysis
🔍 Code-Level Performance Issues

Inefficient code patterns that work fine at low load become catastrophic bottlenecks at scale. These require code changes, not just infrastructure scaling.

  • Profile Application Under Load
    Critical
    Use application profiling tools to identify hot code paths and memory leaks.
    Profiling Tools by Language
    Node.js: node --prof, 0x, clinic.js
    Python: cProfile, py-spy, memory_profiler
    Java: JProfiler, async-profiler, JVM metrics
    Go: pprof, go tool trace
    Universal: AWS X-Ray for distributed tracing

    Focus on: CPU hotspots, memory allocation patterns, blocking I/O operations

  • Identify N+1 Query Problems
    Critical
    N+1 queries are the most common database performance killer in streaming apps.
    Common N+1 Patterns in Streaming Apps
    Stream List: Loading streams, then user details for each
    Chat Messages: Fetching messages, then user profiles for each
    Viewer Lists: Getting viewers, then user info for each
    Recommendations: Loading streams, then category/metadata for each
    Analytics: Aggregating data without proper joins

    Solutions: Use SELECT with JOINs, DataLoader pattern, batch API calls, eager loading in ORMs
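As a sketch of the JOIN fix for the stream-list case, one query replaces one-query-per-owner (table and column names are illustrative):

```shell
# One round trip instead of N+1
# ($DATABASE_URL and the schema are assumptions for this example)
psql "$DATABASE_URL" <<'SQL'
SELECT s.stream_id, s.title, u.display_name, u.avatar_url
FROM streams s
JOIN users u ON u.user_id = s.owner_id
WHERE s.status = 'live'
ORDER BY s.created_at DESC
LIMIT 50;
SQL
```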

  • Review Async vs Sync Operations
    High
    Blocking synchronous calls on hot paths limit concurrency and cause thread pool starvation.
    Operations That Should Be Async
    Database Queries: Use connection pooling with async drivers
    External API Calls: Stream health checks, CDN uploads
    Cache Operations: Redis/Memcached lookups
    File I/O: Log writes, config reads
    Notifications: Email, push notifications, webhooks
  • Check Thread Pool and Connection Pool Sizing
    High
    Incorrectly sized pools cause either resource waste or performance bottlenecks.
    Rough Guidelines: DB pool = 2x CPU cores, HTTP client pool = 10x CPU cores, Thread pool = depends on I/O ratio

    Monitor: Pool utilization, queue length, wait times

  • Analyze Memory Usage Patterns
    High
    Memory leaks and excessive garbage collection cause performance degradation over time.
    Common Memory Issues
    Unbounded Caches: In-memory caches without size limits
    WebSocket Leaks: Not cleaning up closed connections
    Event Listener Leaks: Accumulating listeners over time
    Large Object Retention: Holding references to large data structures
    Buffer Pooling: Not reusing buffers for video/media processing
  • Review Error Handling and Retry Logic
    Medium
    Poor error handling can cause cascading failures and retry storms that make problems worse.

    Best Practices: Exponential backoff, circuit breakers, fail-fast for permanent errors, proper timeouts
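Exponential backoff can be wrapped around any flaky command; a minimal sketch capped at five attempts:

```shell
# Retry a command with exponential backoff: 1s, 2s, 4s, 8s between attempts
retry() {
  local attempt=1 delay=1
  until "$@"; do
    if [ "$attempt" -ge 5 ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage: retry curl -fsS https://example.com/healthz
```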

8 Claude Code Integration and Remediation Plan
🤖 Automated Diagnosis and Fixes

With proper access, Claude Code can automate much of the diagnostic work and propose specific fixes. Here's what it can and cannot do.

  • Grant Claude Code AWS Read Access
    Critical
    Provide Claude Code with read-only access to diagnose infrastructure issues.
    Required AWS Permissions
    CloudWatch: GetMetricStatistics, DescribeAlarms, GetLogEvents
    ELB: DescribeLoadBalancers, DescribeTargetHealth
    EC2: DescribeInstances, DescribeAutoScalingGroups
    RDS: DescribeDBInstances, GetPerformanceInsights
    ElastiCache: DescribeCacheClusters
    X-Ray: GetTraceSummaries, BatchGetTraces
    # Example IAM policy for Claude Code read access
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:DescribeAlarms",
            "logs:GetLogEvents",
            "ec2:Describe*",
            "elasticloadbalancing:Describe*",
            "rds:Describe*",
            "elasticache:Describe*",
            "xray:GetTraceSummaries",
            "xray:BatchGetTraces"
          ],
          "Resource": "*"
        }
      ]
    }
  • Provide Repository Access
    Critical
    Claude Code needs access to your application code to identify performance anti-patterns.

    Scope: Application code, infrastructure as code (Terraform/CloudFormation), deployment scripts, configuration files

    Security: Use read-only access initially, then approve specific PRs for fixes

  • Define Claude Code Tasks
    High
    Specify what you want Claude Code to focus on first.
    What Claude Code Can Do Well
    Metrics Analysis: Pull CloudWatch data and identify bottlenecks
    Code Review: Find N+1 queries, memory leaks, blocking operations
    Infrastructure Audit: Review autoscaling, caching, database configs
    Load Test Creation: Write realistic load tests with k6/Artillery
    Monitoring Setup: Add missing metrics, alerts, dashboards
    Performance Fixes: Implement caching, async patterns, query optimization
    What Requires Human Review
    Production Changes: Infrastructure modifications need approval
    Architecture Decisions: Major redesigns require business context
    Budget Impact: Scaling decisions affect costs
    Breaking Changes: API modifications need compatibility review
  • Establish Testing Environment
    High
    Claude Code should test fixes in staging before production deployment.

    Requirements: Staging environment that mirrors production, realistic test data, ability to simulate load

    Safety: All changes tested and approved before production rollout

  • Set Up Continuous Monitoring
    Medium
    Implement monitoring that will catch performance regressions before they impact users.

    Automated Alerts: Latency spikes, error rate increases, capacity thresholds, deployment impact
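A hedged example of one such alert, firing when p99 latency stays above 1s for three consecutive minutes (the alarm name, API name, and SNS topic ARN are assumptions):

```shell
# Alarm on sustained p99 latency breaches
aws cloudwatch put-metric-alarm \
  --alarm-name myco-p99-latency \
  --namespace AWS/ApiGateway \
  --metric-name Latency \
  --dimensions Name=ApiName,Value=myco-api \
  --extended-statistic p99 \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:myco-alerts
```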

📊 Ready to Begin Diagnosis

Complete each phase systematically, starting with Phase 1. Use this checklist to track your progress and ensure no critical steps are missed.
