You cannot fix what you cannot measure. "It lags" is not actionable data. We need specific numbers to identify bottlenecks and track improvements.
-
Set up CloudWatch Dashboard for Real-time Monitoring (Critical)
Create a dedicated dashboard to monitor key metrics during load events.

    aws cloudwatch put-dashboard --dashboard-name "MycoStreamingPerformance" --dashboard-body file://dashboard-config.json

Target Metrics: Request latency (p50, p95, p99), error rates (4xx/5xx), concurrent viewers, stream buffer ratio
-
Collect Request Performance Metrics (Critical)
Pull detailed request metrics from CloudWatch Logs Insights during peak load periods.

    aws logs start-query --log-group-name "/aws/apigateway/myco-api" --start-time $(date -d "1 hour ago" +%s) --end-time $(date +%s) --query-string 'fields @timestamp, responseTime | filter responseTime > 2000 | sort @timestamp desc'
    # start-query returns a query ID; fetch results with: aws logs get-query-results --query-id <id>

Healthy Targets: p95 latency < 500ms, p99 latency < 1000ms, error rate < 0.1%
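To check those latency targets against raw response-time samples exported from your logs, a minimal sketch (the nearest-rank percentile method and the threshold values are taken from the targets above; function names are illustrative):

```python
def percentile(samples, pct):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * N), 1-indexed; ceil via negated floor division.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]

def check_latency_targets(samples_ms):
    """Evaluate the p95 < 500ms and p99 < 1000ms targets from the checklist."""
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {"p95": p95, "p99": p99, "healthy": p95 < 500 and p99 < 1000}
```

Feeding this the `responseTime` column from the query results gives a quick pass/fail against the healthy targets.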
-
Monitor Stream Quality Metrics (Critical)
Track video-specific performance indicators that directly impact user experience.

Key Streaming Metrics to Track:
• Startup Time: Time from play request to first frame
• Buffer Ratio: % of playback time spent buffering
• Bitrate Adaptation: Frequency of quality changes
• CDN Cache Hit Ratio: % of requests served from cache
• Concurrent Viewer Count: Peak simultaneous streams

Healthy Targets: Startup < 3s, buffer ratio < 1%, cache hit ratio > 95%
-
Enable X-Ray Distributed Tracing (High)
Instrument services with the X-Ray SDK (or enable tracing on API Gateway/Lambda) to see exactly where time is spent in each request. To verify that segments are flowing, you can upload a test segment:

    aws xray put-trace-segments --trace-segment-documents file://trace-segment.json

Focus Areas: Database queries, external API calls, cache lookups, authentication
-
Set Up Load Testing to Reproduce Issues (High)
Create reproducible load tests that simulate real user behavior.

Load Test Scenarios:
• Gradual Ramp: Slowly increase concurrent viewers to find the breaking point
• Spike Test: Sudden traffic surge (like a viral stream start)
• Sustained Load: Steady high traffic for extended periods
• Peak Events: Simulate major live events with 10x normal traffic

    # Example with k6
    k6 run --vus 1000 --duration 10m streaming-load-test.js
For streaming platforms, 80% of performance issues are at the CDN/edge layer. This is your first line of defense and most common bottleneck.
-
Analyze CloudFront Cache Performance (Critical)
Check if your CDN is actually protecting your origin from traffic.

    aws cloudfront get-distribution-config --id E1234567890123

Key CloudFront Metrics to Check:
• Cache Hit Ratio: Should be >95% for video segments
• Origin Shield Enabled: Reduces origin load by 90%+
• Cache Behaviors: Different TTLs for segments vs manifests
• Viewer Request Rate: Requests per second per edge location
• Error Rate: 4xx/5xx responses from CloudFront

Healthy Targets: Cache hit ratio >95%, origin shield enabled, error rate <0.5%
-
Verify HLS/DASH Manifest Caching (Critical)
Manifest files are requested every few seconds by every viewer; improper caching here kills origins.

    curl -I https://d123456789.cloudfront.net/live/stream1/playlist.m3u8

Check Headers: Cache-Control: max-age=6, X-Cache: Hit from cloudfront
Common Issue: Setting TTL=0 on manifests works at low scale but causes origin storms at high scale.
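A back-of-envelope calculation shows why TTL=0 is so dangerous at scale. This sketch assumes every viewer polls the manifest once per segment interval and that, with a positive TTL, each edge location refreshes at most once per TTL window; the viewer and edge counts are illustrative assumptions, not measured values:

```python
def origin_rps(viewers, poll_interval_s, manifest_ttl_s, edge_locations=50):
    """Approximate origin manifest requests/sec behind a CDN."""
    client_rps = viewers / poll_interval_s
    if manifest_ttl_s <= 0:
        # TTL=0: every client poll is forwarded to the origin.
        return client_rps
    # With caching, each edge refreshes the manifest at most once per TTL window.
    return min(client_rps, edge_locations / manifest_ttl_s)

# 100k viewers polling every 6s:
#   TTL=0  -> ~16,700 origin requests/sec (origin storm)
#   TTL=6s -> under 10 requests/sec (one refresh per edge per window)
```

Even a tiny TTL that matches the segment duration collapses origin load by three to four orders of magnitude.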
-
Review Video Segment Size and Caching (High)
Segment size affects both buffering and request volume; the wrong size causes cascade failures.

Optimal Segment Configuration:
• Segment Duration: 6-10 seconds (balance between latency and efficiency)
• Segment Size: 1-3MB per segment for good caching
• Cache TTL: 24+ hours for video segments (they never change)
• Adaptive Bitrate: 3-5 quality levels to match network conditions
• GOP Size: Matches segment duration for clean cuts
-
Check MediaPackage Configuration (High)
If using AWS MediaPackage, verify it's optimized for your traffic patterns.

    aws mediapackage describe-origin-endpoint --id myco-hls-endpoint

Key Settings: Origin shield enabled, proper segment retention, CDN authorization configured
-
Verify Geographic Distribution (Medium)
Ensure the CDN has proper coverage for your audience and that traffic is routing optimally.

    aws cloudfront get-distribution-config --id E1234567890123 | jq '.DistributionConfig.PriceClass'

Check that you're using the right price class for your global audience distribution.
After CDN, the load balancer is your next bottleneck. It distributes traffic to healthy backends and can queue requests when backends are overwhelmed.
-
Monitor ALB/NLB Performance Metrics (Critical)
Check if your load balancer is becoming a bottleneck under high load.

    aws elbv2 describe-load-balancers --load-balancer-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/myco-alb/1234567890123456

Critical Load Balancer Metrics:
• TargetResponseTime: How long backends take to respond
• RejectedConnectionCount: Connections rejected because the load balancer ran out of capacity
• ActiveConnectionCount: Current active connections
• SurgeQueueLength / SpilloverCount: Requests queued for targets and requests rejected from a full surge queue (Classic Load Balancer only; ALB/NLB do not emit these)

Warning Signs: RejectedConnectionCount > 0, TargetResponseTime increasing, SurgeQueueLength or SpilloverCount > 0 (CLB)
-
Verify Target Health and Distribution (Critical)
Ensure traffic is evenly distributed across healthy targets.

    aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myco-targets/1234567890123456

Check for: Unhealthy targets, uneven request distribution, cross-zone load balancing enabled
-
Review Health Check Configuration (High)
Aggressive health checks can overwhelm targets during high load. Verify they're properly tuned.

Health Check Best Practices:
• Interval: 30 seconds (not 10s) to reduce load
• Timeout: 5-10 seconds depending on app response time
• Healthy Threshold: 2-3 consecutive successes
• Unhealthy Threshold: 3-5 consecutive failures
• Health Check Path: A lightweight endpoint, not main app logic
-
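The thresholds above trade detection speed against probe load: a dead target keeps receiving traffic for roughly interval x unhealthy-threshold seconds before it is drained. A trivial sketch of that arithmetic (function name is illustrative):

```python
def detection_times(interval_s, healthy_threshold, unhealthy_threshold):
    """Approximate time for a target to be marked unhealthy / healthy again."""
    return {
        # Consecutive failed probes before the target is taken out of rotation.
        "time_to_unhealthy_s": interval_s * unhealthy_threshold,
        # Consecutive successful probes before it receives traffic again.
        "time_to_healthy_s": interval_s * healthy_threshold,
    }
```

With a 30s interval and an unhealthy threshold of 3, detection takes about 90 seconds; halving the interval halves that but doubles probe load on every target.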
Check Connection Draining Settings (High)
During autoscaling events, improper connection draining can cause request drops.

    aws elbv2 describe-target-group-attributes --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myco-targets/1234567890123456

Recommended: Deregistration delay of 30-60s for web apps, 5-15s for APIs
-
Verify Sticky Sessions Configuration (Medium)
If using session affinity, check that it's not causing uneven load distribution.
Best Practice: Use stateless design when possible, or an external session store (Redis)
Your compute layer processes requests and generates responses. Issues here manifest as high CPU/memory usage, slow autoscaling, or connection limits.
-
Analyze Autoscaling Performance (Critical)
Check if autoscaling can keep up with the traffic spikes typical of live streaming events.

    aws autoscaling describe-scaling-activities --auto-scaling-group-name myco-asg --max-items 20

Autoscaling Optimization Checklist:
• Scale-out Time: Should complete in <2 minutes for streaming events
• Cooldown Period: The 300s default may be too long for spiky traffic
• Scaling Metric: CPU utilization alone may not capture connection load
• Target Capacity: Keep 20-30% headroom for sudden spikes
• Warm Pool: Pre-warmed instances for instant scaling

Target: Scale-out in <2 minutes, scale-in over 5-10 minutes with proper cooldowns
-
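The 20-30% headroom rule above can be made concrete as a capacity calculation: provision enough instances that current load sits below (1 - headroom) of total capacity. A sketch, with illustrative per-instance throughput numbers:

```python
import math

def desired_capacity(current_load_rps, per_instance_rps, headroom=0.25):
    """Instances needed so utilization stays below (1 - headroom).

    headroom=0.25 means each instance is only 'counted' at 75% of its
    rated throughput, leaving 25% slack for sudden spikes.
    """
    usable = per_instance_rps * (1 - headroom)
    return math.ceil(current_load_rps / usable)

# 10,000 req/s at 500 req/s per instance:
#   no headroom  -> 20 instances (fully saturated)
#   25% headroom -> 27 instances
```

The extra 7 instances are the price of surviving a spike while scale-out (which itself takes minutes) catches up.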
Monitor Resource Utilization (Critical)
Track CPU, memory, network, and disk I/O to identify resource bottlenecks.

    aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=AutoScalingGroupName,Value=myco-asg --start-time $(date -d "1 hour ago" --iso-8601=seconds) --end-time $(date --iso-8601=seconds) --period 300 --statistics Average Maximum

Warning Thresholds: CPU >70%, memory >80%, network >80% of instance capacity
-
Check Connection and File Descriptor Limits (Critical)
Streaming applications often hit connection limits before CPU limits.

    # Check current limits on the instance
    ulimit -n            # File descriptors
    ss -s                # Socket summary
    netstat -an | wc -l  # Total connections

Common Connection Limit Issues:
• File Descriptor Limit: Default 1024 may be too low
• Socket Limits: TIME_WAIT sockets consuming the connection pool
• Application Thread Pool: Fixed thread pools causing queuing
• Database Connection Pool: All connections consumed
• WebSocket Connections: Long-lived connections for chat/interactive features
-
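To size `ulimit -n` ahead of time rather than discovering the limit in production, a rough budget helps: one descriptor per client connection, plus upstream/database connections, logs, and a safety factor. The figures below are illustrative assumptions, not a formula from AWS:

```python
def fd_budget(concurrent_clients, upstream_conns=200, misc=100, safety=2.0):
    """Suggested file-descriptor limit: clients + upstream + misc, doubled.

    The 2x safety factor absorbs TIME_WAIT sockets and reconnect churn.
    """
    return int((concurrent_clients + upstream_conns + misc) * safety)
```

For 10,000 concurrent viewers this already suggests a limit above 20,000, twenty times the common default of 1024.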
Review Instance Types and Sizes (High)
Ensure you're using optimal instance types for your workload characteristics.

Instance Type Recommendations:
• API Servers: C5/C5n for CPU-intensive request processing
• Video Processing: C5n/M5n for high network throughput
• Database: R5/R5n for memory-intensive workloads
• Cache Nodes: R5/X1e for large in-memory datasets
• High Connection Count: C5n/M5n for enhanced networking
-
Validate Application Startup Time (High)
Slow application startup defeats fast autoscaling. Measure and optimize boot time.

    # Time from container start to the app reporting ready
    # (for a long-running server, measure until the health endpoint first
    # responds, not until the container exits)
    time docker run myco/streaming-app

Target: Application ready to serve traffic in <30 seconds
Optimization ideas: Pre-built AMIs, smaller Docker images, lazy initialization, warm pools
Database bottlenecks often appear as sudden performance cliffs. A query that works fine at 100 RPS becomes catastrophic at 10,000 RPS.
-
Analyze RDS Performance Insights (Critical)
Use RDS Performance Insights to identify database bottlenecks during load spikes. The command below confirms instance configuration and that Performance Insights is enabled; the metrics themselves come from the Performance Insights console or the aws pi API.

    aws rds describe-db-instances --db-instance-identifier myco-primary

Key Database Metrics:
• DB Load (AAS): Active sessions averaged over time
• CPU Utilization: Database server CPU usage
• Database Connections: Active vs max connections
• Read/Write IOPS: Disk I/O operations per second
• Lock Waits: Queries waiting for locks
• Top SQL Statements: Most resource-intensive queries

Warning Signs: DB Load > vCPUs, connection % > 80%, long-running queries
-
Check for Missing Database Indexes (Critical)
Queries without proper indexes cause table scans that kill performance at scale.

    -- PostgreSQL: high-cardinality columns that are candidates for indexing
    -- (pair this with EXPLAIN ANALYZE on your slowest queries)
    SELECT schemaname, tablename, attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'user_sessions' AND n_distinct > 100;

Common Streaming App Query Patterns:
• User Authentication: Index on user_id, email, token fields
• Stream Lookup: Index on stream_id, status, created_at
• Viewer Count: Aggregation queries need composite indexes
• Chat Messages: Index on stream_id + timestamp
• Analytics: Time-based queries need date/time indexes
-
Monitor Connection Pool Health (Critical)
Database connection pool exhaustion is a common scaling bottleneck.

    aws rds describe-db-instances --db-instance-identifier myco-primary | grep -A 5 "DBParameterGroups"

Check: max_connections setting, active vs idle connections, connection wait time
Fix: Use a connection pooler like PgBouncer, or AWS RDS Proxy for automatic pooling
-
Analyze DynamoDB Throttling (High, if applicable)
DynamoDB hot partitions are common in streaming apps, where one viral stream creates hot keys.

    aws dynamodb describe-table --table-name myco-streams --query 'Table.ProvisionedThroughput'

DynamoDB Hot Partition Signs:
• Throttled Requests: ProvisionedThroughputExceededException
• Hot Partition Key: One stream getting 90% of traffic
• Insufficient RCU/WCU: Under-provisioned for peak load
• Poor Key Design: Keys not distributing load evenly
• Missing Global Secondary Indexes: Table scans instead of queries

Solutions: Use stream_id + timestamp composite keys, enable DynamoDB auto-scaling, add a random suffix to hot keys
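The "random suffix" fix spreads one hot stream's writes across several partitions at the cost of fan-out reads. A minimal sketch of the key scheme (shard count and key shapes are illustrative; this is application-side logic, not a DynamoDB API):

```python
import random

NUM_SHARDS = 10  # more shards = more write spread, more read fan-out

def write_key(stream_id):
    """Partition key with a random shard suffix to spread write load."""
    return f"{stream_id}#{random.randrange(NUM_SHARDS)}"

def read_keys(stream_id):
    """All shard keys to query and merge when reading the stream back."""
    return [f"{stream_id}#{i}" for i in range(NUM_SHARDS)]
```

Writes for a viral stream now land on 10 partitions instead of 1; the tradeoff is that every read must query all 10 suffixes and merge the results.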
-
Review Read Replica Strategy (High)
Read-heavy workloads like user profiles and stream metadata benefit from read replicas.

    aws rds describe-db-instances --query 'DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=`null`]'

Best Practice: Route read queries to replicas, keep writes on the primary, monitor replica lag
Proper caching can reduce database load by 90%+. Missing or misconfigured cache layers often cause database bottlenecks at scale.
-
Monitor Cache Hit Ratios (Critical)
Low cache hit ratios indicate cache misses that flood the database.

    aws elasticache describe-cache-clusters --cache-cluster-id myco-redis-001

Target Hit Ratios: Session data >95%, user profiles >90%, stream metadata >85%
Monitor: CacheHits vs CacheMisses, evictions, memory utilization
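The hit ratios above can also be tracked from inside the application, which makes per-data-type targets easy to enforce. A minimal cache-aside wrapper that counts its own hits and misses (hypothetical class, not an ElastiCache API):

```python
class TrackedCache:
    """Cache-aside store that records hits/misses for hit-ratio monitoring."""

    def __init__(self):
        self.store, self.hits, self.misses = {}, 0, 0

    def get_or_load(self, key, loader):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = loader(key)        # fall through to the database on a miss
        self.store[key] = value
        return value

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Exporting `hit_ratio` as a custom metric lets you alert when, say, session-data hits drop below the 95% target.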
-
Check Cache Key Strategy (Critical)
Poor cache key design leads to unnecessary misses and inefficient memory usage.

Effective Cache Key Patterns:
• User Sessions: session:${token_hash}
• Stream Data: stream:${stream_id}:metadata
• Viewer Counts: count:${stream_id}:${time_bucket}
• User Profiles: user:${user_id}:profile
• Chat History: chat:${stream_id}:${page}

Avoid: Keys without namespaces, keys with user input (injection risk), overly long keys
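The patterns above can be centralized in small key-builder functions, which is also where the "no raw user input" rule is enforced. A sketch (function names and the 30-second bucket are illustrative):

```python
import hashlib

def session_key(token):
    # Hash the raw token: never embed user input directly in a key
    # (avoids injection via separators and keeps key length bounded).
    return "session:" + hashlib.sha256(token.encode()).hexdigest()[:32]

def stream_meta_key(stream_id):
    return f"stream:{stream_id}:metadata"

def viewer_count_key(stream_id, ts, bucket_s=30):
    # Time-bucketed so counts roll over naturally every bucket_s seconds.
    return f"count:{stream_id}:{int(ts) // bucket_s}"
```

Routing every cache access through builders like these keeps namespaces consistent and makes it impossible to construct an un-namespaced key by accident.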
-
Validate Cache Expiration Strategy (High)
Wrong TTLs cause either stale data or excessive database hits.

Recommended TTLs by Data Type:
• User Sessions: 24 hours (match session timeout)
• Stream Metadata: 5 minutes (relatively static)
• Viewer Counts: 30 seconds (frequently changing)
• User Profiles: 1 hour (occasionally updated)
• Configuration: 10 minutes (rare changes)
-
Check Memory Usage and Eviction Policy (High)
Running out of cache memory triggers evictions that hurt performance.

    aws elasticache describe-cache-clusters --cache-cluster-id myco-redis-001 --show-cache-node-info

Target: Memory usage <80%, evictions <1% of total operations
Eviction Policy: Use allkeys-lru for general caching, volatile-ttl for session data
-
Review Cache Warming Strategy (Medium)
Cold caches after deployments or failures cause database load spikes.
Strategy: Pre-load popular streams, warm user sessions gradually, use consistent hashing across multiple cache nodes
Inefficient code patterns that work fine at low load become catastrophic bottlenecks at scale. These require code changes, not just infrastructure scaling.
-
Profile Application Under Load (Critical)
Use application profiling tools to identify hot code paths and memory leaks.

Profiling Tools by Language:
• Node.js: node --prof, 0x, clinic.js
• Python: cProfile, py-spy, memory_profiler
• Java: JProfiler, async-profiler, JVM metrics
• Go: pprof, go tool trace
• Universal: AWS X-Ray for distributed tracing

Focus on: CPU hotspots, memory allocation patterns, blocking I/O operations
-
Identify N+1 Query Problems (Critical)
N+1 queries are the most common database performance killer in streaming apps.

Common N+1 Patterns in Streaming Apps:
• Stream List: Loading streams, then user details for each
• Chat Messages: Fetching messages, then user profiles for each
• Viewer Lists: Getting viewers, then user info for each
• Recommendations: Loading streams, then category/metadata for each
• Analytics: Aggregating data without proper joins

Solutions: Use SELECT with JOINs, the DataLoader pattern, batch API calls, eager loading in ORMs
-
Review Async vs Sync Operations (High)
Blocking synchronous calls on hot paths limit concurrency and cause thread pool starvation.

Operations That Should Be Async:
• Database Queries: Use connection pooling with async drivers
• External API Calls: Stream health checks, CDN uploads
• Cache Operations: Redis/Memcached lookups
• File I/O: Log writes, config reads
• Notifications: Email, push notifications, webhooks
-
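For independent lookups like the ones listed above, the usual fix is to fan them out concurrently instead of awaiting each in sequence. A minimal asyncio sketch (function names are illustrative stand-ins for real I/O calls, with `asyncio.sleep` simulating the round trip):

```python
import asyncio

async def fetch_stream_meta(stream_id):
    await asyncio.sleep(0.01)  # stands in for a cache/DB round trip
    return {"id": stream_id, "status": "live"}

async def page_data(stream_ids):
    # gather() runs all lookups concurrently: total wait ~ the slowest
    # single call, not the sum of all calls.
    return await asyncio.gather(*(fetch_stream_meta(s) for s in stream_ids))
```

With 50 lookups at 10ms each, the sequential version takes ~500ms while the gathered version stays near 10ms, which is exactly the latency headroom a hot path needs.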
Check Thread Pool and Connection Pool Sizing (High)
Incorrectly sized pools cause either resource waste or performance bottlenecks.
Rough Guidelines: DB pool ~2x CPU cores, HTTP client pool ~10x CPU cores, thread pool size depends on the I/O-to-compute ratio
Monitor: Pool utilization, queue length, wait times
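The "depends on the I/O ratio" rule can be made concrete with the standard sizing heuristic threads = cores x (1 + wait/compute); the ratios below are illustrative, and real pools should still be validated under load:

```python
def thread_pool_size(cores, wait_ms, compute_ms):
    """Heuristic pool size: cores * (1 + wait/compute).

    A request that waits 90ms on I/O per 10ms of CPU keeps a core idle
    90% of the time, so ~10 threads per core keep the core busy.
    """
    return max(cores, round(cores * (1 + wait_ms / compute_ms)))
```

For a pure-CPU workload (wait=0) this collapses to one thread per core, which is why CPU-bound and I/O-bound services need very different pool sizes.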
-
Analyze Memory Usage Patterns (High)
Memory leaks and excessive garbage collection cause performance degradation over time.

Common Memory Issues:
• Unbounded Caches: In-memory caches without size limits
• WebSocket Leaks: Not cleaning up closed connections
• Event Listener Leaks: Accumulating listeners over time
• Large Object Retention: Holding references to large data structures
• Missing Buffer Pooling: Not reusing buffers for video/media processing
-
Review Error Handling and Retry Logic (Medium)
Poor error handling can cause cascading failures and retry storms that make problems worse.
Best Practices: Exponential backoff, circuit breakers, fail-fast for permanent errors, proper timeouts
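The two core pieces above are small enough to sketch directly: full-jitter exponential backoff (randomized delay up to a capped exponential), and a minimal failure-count circuit breaker. Thresholds and defaults are illustrative; production breakers also need a half-open recovery state:

```python
import random

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)].

    The jitter prevents synchronized retry storms; the cap bounds worst-case wait.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Opens (fails fast) after N consecutive failures; any success resets it."""

    def __init__(self, failure_threshold=5):
        self.failures = 0
        self.failure_threshold = failure_threshold

    @property
    def open(self):
        return self.failures >= self.failure_threshold

    def record(self, success):
        self.failures = 0 if success else self.failures + 1
```

Callers check `breaker.open` before attempting the call and fail fast when it is open, which is what stops a struggling dependency from being hammered by retries.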
With proper access, Claude Code can automate much of the diagnostic work and propose specific fixes. Here's what it can and cannot do.
-
Grant Claude Code AWS Read Access (Critical)
Provide Claude Code with read-only access to diagnose infrastructure issues.

Required AWS Permissions:
• CloudWatch: GetMetricStatistics, DescribeAlarms, GetLogEvents
• ELB: DescribeLoadBalancers, DescribeTargetHealth
• EC2/Auto Scaling: DescribeInstances, DescribeAutoScalingGroups
• RDS: DescribeDBInstances (add the pi: actions if Performance Insights data is needed)
• ElastiCache: DescribeCacheClusters
• X-Ray: GetTraceSummaries, BatchGetTraces

    # Example IAM policy for Claude Code read access
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:DescribeAlarms",
            "logs:GetLogEvents",
            "ec2:Describe*",
            "autoscaling:Describe*",
            "elasticloadbalancing:Describe*",
            "rds:Describe*",
            "elasticache:Describe*",
            "xray:GetTraceSummaries",
            "xray:BatchGetTraces"
          ],
          "Resource": "*"
        }
      ]
    }
-
Provide Repository Access (Critical)
Claude Code needs access to your application code to identify performance anti-patterns.
Scope: Application code, infrastructure as code (Terraform/CloudFormation), deployment scripts, configuration files
Security: Use read-only access initially, then approve specific PRs for fixes
-
Define Claude Code Tasks (High)
Specify what you want Claude Code to focus on first.

What Claude Code Can Do Well:
• Metrics Analysis: Pull CloudWatch data and identify bottlenecks
• Code Review: Find N+1 queries, memory leaks, blocking operations
• Infrastructure Audit: Review autoscaling, caching, database configs
• Load Test Creation: Write realistic load tests with k6/Artillery
• Monitoring Setup: Add missing metrics, alerts, dashboards
• Performance Fixes: Implement caching, async patterns, query optimization

What Requires Human Review:
• Production Changes: Infrastructure modifications need approval
• Architecture Decisions: Major redesigns require business context
• Budget Impact: Scaling decisions affect costs
• Breaking Changes: API modifications need compatibility review
-
Establish Testing Environment (High)
Claude Code should test fixes in staging before production deployment.
Requirements: A staging environment that mirrors production, realistic test data, the ability to simulate load
Safety: All changes tested and approved before production rollout
-
Set Up Continuous Monitoring (Medium)
Implement monitoring that will catch performance regressions before they impact users.
Automated Alerts: Latency spikes, error rate increases, capacity thresholds, deployment impact
📊 Ready to Begin Diagnosis
Complete each phase systematically, starting with Phase 1. Use this checklist to track your progress and ensure no critical steps are missed.