You cannot fix what you cannot measure. "It lags" is not actionable data. We need specific numbers to identify bottlenecks and track improvements.
-
Set up CloudWatch Dashboard for Real-time Monitoring (Critical)
Create a dedicated dashboard to monitor key metrics during load events.

    aws cloudwatch put-dashboard --dashboard-name "MycoStreamingPerformance" --dashboard-body file://dashboard-config.json

Target Metrics: Request latency (p50, p95, p99), error rates (4xx/5xx), concurrent viewers, stream buffer ratio
-
Collect Request Performance Metrics (Critical)
Pull detailed request metrics from CloudWatch Logs Insights during peak load periods.

    aws logs start-query --log-group-name "/aws/apigateway/myco-api" --start-time $(date -d "1 hour ago" +%s) --end-time $(date +%s) --query-string 'fields @timestamp, responseTime | filter responseTime > 2000 | sort @timestamp desc'
    # start-query returns a query ID; fetch results with: aws logs get-query-results --query-id <id>

Healthy Targets: p95 latency < 500ms, p99 latency < 1000ms, error rate < 0.1%
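To check those latency targets against raw response-time samples exported from your logs, a minimal sketch (the nearest-rank percentile method and the threshold values are taken from the targets above; function names are illustrative):

```python
def percentile(samples, pct):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * N), 1-indexed; ceil via negated floor division.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]

def check_latency_targets(samples_ms):
    """Evaluate the p95 < 500ms and p99 < 1000ms targets from the checklist."""
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {"p95": p95, "p99": p99, "healthy": p95 < 500 and p99 < 1000}
```

Feeding this the `responseTime` column from the query results gives a quick pass/fail against the healthy targets.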
-
Monitor Stream Quality Metrics (Critical)
Track video-specific performance indicators that directly impact user experience.

Key Streaming Metrics to Track:
• Startup Time: Time from play request to first frame
• Buffer Ratio: % of playback time spent buffering
• Bitrate Adaptation: Frequency of quality changes
• CDN Cache Hit Ratio: % of requests served from cache
• Concurrent Viewer Count: Peak simultaneous streams

Healthy Targets: Startup < 3s, buffer ratio < 1%, cache hit ratio > 95%
-
Enable X-Ray Distributed Tracing (High)
Instrument services with the X-Ray SDK (or enable tracing on API Gateway/Lambda) to see exactly where time is spent in each request. To verify that segments are flowing, you can upload a test segment:

    aws xray put-trace-segments --trace-segment-documents file://trace-segment.json

Focus Areas: Database queries, external API calls, cache lookups, authentication
-
Set Up Load Testing to Reproduce Issues (High)
Create reproducible load tests that simulate real user behavior.

Load Test Scenarios:
• Gradual Ramp: Slowly increase concurrent viewers to find the breaking point
• Spike Test: Sudden traffic surge (like a viral stream start)
• Sustained Load: Steady high traffic for extended periods
• Peak Events: Simulate major live events with 10x normal traffic

    # Example with k6
    k6 run --vus 1000 --duration 10m streaming-load-test.js
For streaming platforms, 80% of performance issues are at the CDN/edge layer. This is your first line of defense and most common bottleneck.
-
Analyze CloudFront Cache Performance (Critical)
Check if your CDN is actually protecting your origin from traffic.

    aws cloudfront get-distribution-config --id E1234567890123

Key CloudFront Metrics to Check:
• Cache Hit Ratio: Should be >95% for video segments
• Origin Shield Enabled: Reduces origin load by 90%+
• Cache Behaviors: Different TTLs for segments vs manifests
• Viewer Request Rate: Requests per second per edge location
• Error Rate: 4xx/5xx responses from CloudFront

Healthy Targets: Cache hit ratio >95%, origin shield enabled, error rate <0.5%
-
Verify HLS/DASH Manifest Caching (Critical)
Manifest files are requested every few seconds by every viewer; improper caching here kills origins.

    curl -I https://d123456789.cloudfront.net/live/stream1/playlist.m3u8

Check Headers: Cache-Control: max-age=6, X-Cache: Hit from cloudfront
Common Issue: Setting TTL=0 on manifests works at low scale but causes origin storms at high scale.
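A back-of-envelope calculation shows why TTL=0 is so dangerous at scale. This sketch assumes every viewer polls the manifest once per segment interval and that, with a positive TTL, each edge location refreshes at most once per TTL window; the viewer and edge counts are illustrative assumptions, not measured values:

```python
def origin_rps(viewers, poll_interval_s, manifest_ttl_s, edge_locations=50):
    """Approximate origin manifest requests/sec behind a CDN."""
    client_rps = viewers / poll_interval_s
    if manifest_ttl_s <= 0:
        # TTL=0: every client poll is forwarded to the origin.
        return client_rps
    # With caching, each edge refreshes the manifest at most once per TTL window.
    return min(client_rps, edge_locations / manifest_ttl_s)

# 100k viewers polling every 6s:
#   TTL=0  -> ~16,700 origin requests/sec (origin storm)
#   TTL=6s -> under 10 requests/sec (one refresh per edge per window)
```

Even a tiny TTL that matches the segment duration collapses origin load by three to four orders of magnitude.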
-
Review Video Segment Size and Caching (High)
Segment size affects both buffering and request volume; the wrong size causes cascade failures.

Optimal Segment Configuration:
• Segment Duration: 6-10 seconds (balance between latency and efficiency)
• Segment Size: 1-3MB per segment for good caching
• Cache TTL: 24+ hours for video segments (they never change)
• Adaptive Bitrate: 3-5 quality levels to match network conditions
• GOP Size: Matches segment duration for clean cuts
-
Check MediaPackage Configuration (High)
If using AWS MediaPackage, verify it's optimized for your traffic patterns.

    aws mediapackage describe-origin-endpoint --id myco-hls-endpoint

Key Settings: Origin shield enabled, proper segment retention, CDN authorization configured
-
Verify Geographic Distribution (Medium)
Ensure the CDN has proper coverage for your audience and that traffic is routing optimally.

    aws cloudfront get-distribution-config --id E1234567890123 | jq '.DistributionConfig.PriceClass'

Check that you're using the right price class for your global audience distribution.
After CDN, the load balancer is your next bottleneck. It distributes traffic to healthy backends and can queue requests when backends are overwhelmed.
-
Monitor ALB/NLB Performance Metrics (Critical)
Check if your load balancer is becoming a bottleneck under high load.

    aws elbv2 describe-load-balancers --load-balancer-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/myco-alb/1234567890123456

Critical Load Balancer Metrics:
• TargetResponseTime: How long backends take to respond
• RejectedConnectionCount: Connections rejected because the load balancer ran out of capacity
• ActiveConnectionCount: Current active connections
• SurgeQueueLength / SpilloverCount: Requests queued for targets and requests rejected from a full surge queue (Classic Load Balancer only; ALB/NLB do not emit these)

Warning Signs: RejectedConnectionCount > 0, TargetResponseTime increasing, SurgeQueueLength or SpilloverCount > 0 (CLB)
-
Verify Target Health and Distribution (Critical)
Ensure traffic is evenly distributed across healthy targets.

    aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myco-targets/1234567890123456

Check for: Unhealthy targets, uneven request distribution, cross-zone load balancing enabled
-
Review Health Check Configuration (High)
Aggressive health checks can overwhelm targets during high load. Verify they're properly tuned.

Health Check Best Practices:
• Interval: 30 seconds (not 10s) to reduce load
• Timeout: 5-10 seconds depending on app response time
• Healthy Threshold: 2-3 consecutive successes
• Unhealthy Threshold: 3-5 consecutive failures
• Health Check Path: A lightweight endpoint, not main app logic
-
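The thresholds above trade detection speed against probe load: a dead target keeps receiving traffic for roughly interval x unhealthy-threshold seconds before it is drained. A trivial sketch of that arithmetic (function name is illustrative):

```python
def detection_times(interval_s, healthy_threshold, unhealthy_threshold):
    """Approximate time for a target to be marked unhealthy / healthy again."""
    return {
        # Consecutive failed probes before the target is taken out of rotation.
        "time_to_unhealthy_s": interval_s * unhealthy_threshold,
        # Consecutive successful probes before it receives traffic again.
        "time_to_healthy_s": interval_s * healthy_threshold,
    }
```

With a 30s interval and an unhealthy threshold of 3, detection takes about 90 seconds; halving the interval halves that but doubles probe load on every target.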
Check Connection Draining Settings (High)
During autoscaling events, improper connection draining can cause request drops.

    aws elbv2 describe-target-group-attributes --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myco-targets/1234567890123456

Recommended: Deregistration delay of 30-60s for web apps, 5-15s for APIs
-
Verify Sticky Sessions Configuration (Medium)
If using session affinity, check that it's not causing uneven load distribution.
Best Practice: Use stateless design when possible, or an external session store (Redis)
Your compute layer processes requests and generates responses. Issues here manifest as high CPU/memory usage, slow autoscaling, or connection limits.
-
Analyze Autoscaling Performance (Critical)
Check if autoscaling can keep up with the traffic spikes typical of live streaming events.

    aws autoscaling describe-scaling-activities --auto-scaling-group-name myco-asg --max-items 20

Autoscaling Optimization Checklist:
• Scale-out Time: Should complete in <2 minutes for streaming events
• Cooldown Period: The 300s default may be too long for spiky traffic
• Scaling Metric: CPU utilization alone may not capture connection load
• Target Capacity: Keep 20-30% headroom for sudden spikes
• Warm Pool: Pre-warmed instances for instant scaling

Target: Scale-out in <2 minutes, scale-in over 5-10 minutes with proper cooldowns
-
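The 20-30% headroom rule above can be made concrete as a capacity calculation: provision enough instances that current load sits below (1 - headroom) of total capacity. A sketch, with illustrative per-instance throughput numbers:

```python
import math

def desired_capacity(current_load_rps, per_instance_rps, headroom=0.25):
    """Instances needed so utilization stays below (1 - headroom).

    headroom=0.25 means each instance is only 'counted' at 75% of its
    rated throughput, leaving 25% slack for sudden spikes.
    """
    usable = per_instance_rps * (1 - headroom)
    return math.ceil(current_load_rps / usable)

# 10,000 req/s at 500 req/s per instance:
#   no headroom  -> 20 instances (fully saturated)
#   25% headroom -> 27 instances
```

The extra 7 instances are the price of surviving a spike while scale-out (which itself takes minutes) catches up.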
Monitor Resource Utilization (Critical)
Track CPU, memory, network, and disk I/O to identify resource bottlenecks.

    aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=AutoScalingGroupName,Value=myco-asg --start-time $(date -d "1 hour ago" --iso-8601=seconds) --end-time $(date --iso-8601=seconds) --period 300 --statistics Average Maximum

Warning Thresholds: CPU >70%, memory >80%, network >80% of instance capacity
-
Check Connection and File Descriptor Limits (Critical)
Streaming applications often hit connection limits before CPU limits.

    # Check current limits on the instance
    ulimit -n            # File descriptors
    ss -s                # Socket summary
    netstat -an | wc -l  # Total connections

Common Connection Limit Issues:
• File Descriptor Limit: Default 1024 may be too low
• Socket Limits: TIME_WAIT sockets consuming the connection pool
• Application Thread Pool: Fixed thread pools causing queuing
• Database Connection Pool: All connections consumed
• WebSocket Connections: Long-lived connections for chat/interactive features
-
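To size `ulimit -n` ahead of time rather than discovering the limit in production, a rough budget helps: one descriptor per client connection, plus upstream/database connections, logs, and a safety factor. The figures below are illustrative assumptions, not a formula from AWS:

```python
def fd_budget(concurrent_clients, upstream_conns=200, misc=100, safety=2.0):
    """Suggested file-descriptor limit: clients + upstream + misc, doubled.

    The 2x safety factor absorbs TIME_WAIT sockets and reconnect churn.
    """
    return int((concurrent_clients + upstream_conns + misc) * safety)
```

For 10,000 concurrent viewers this already suggests a limit above 20,000, twenty times the common default of 1024.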
Review Instance Types and Sizes (High)
Ensure you're using optimal instance types for your workload characteristics.

Instance Type Recommendations:
• API Servers: C5/C5n for CPU-intensive request processing
• Video Processing: C5n/M5n for high network throughput
• Database: R5/R5n for memory-intensive workloads
• Cache Nodes: R5/X1e for large in-memory datasets
• High Connection Count: C5n/M5n for enhanced networking
-
Validate Application Startup Time (High)
Slow application startup defeats fast autoscaling. Measure and optimize boot time.

    # Time from container start to the app reporting ready
    # (for a long-running server, measure until the health endpoint first
    # responds, not until the container exits)
    time docker run myco/streaming-app

Target: Application ready to serve traffic in <30 seconds
Optimization ideas: Pre-built AMIs, smaller Docker images, lazy initialization, warm pools
Database bottlenecks often appear as sudden performance cliffs. A query that works fine at 100 RPS becomes catastrophic at 10,000 RPS.
-
Analyze RDS Performance Insights (Critical)
Use RDS Performance Insights to identify database bottlenecks during load spikes. The command below confirms instance configuration and that Performance Insights is enabled; the metrics themselves come from the Performance Insights console or the aws pi API.

    aws rds describe-db-instances --db-instance-identifier myco-primary

Key Database Metrics:
• DB Load (AAS): Active sessions averaged over time
• CPU Utilization: Database server CPU usage
• Database Connections: Active vs max connections
• Read/Write IOPS: Disk I/O operations per second
• Lock Waits: Queries waiting for locks
• Top SQL Statements: Most resource-intensive queries

Warning Signs: DB Load > vCPUs, connection % > 80%, long-running queries
-
Check for Missing Database Indexes (Critical)
Queries without proper indexes cause table scans that kill performance at scale.

    -- PostgreSQL: high-cardinality columns that are candidates for indexing
    -- (pair this with EXPLAIN ANALYZE on your slowest queries)
    SELECT schemaname, tablename, attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'user_sessions' AND n_distinct > 100;

Common Streaming App Query Patterns:
• User Authentication: Index on user_id, email, token fields
• Stream Lookup: Index on stream_id, status, created_at
• Viewer Count: Aggregation queries need composite indexes
• Chat Messages: Index on stream_id + timestamp
• Analytics: Time-based queries need date/time indexes
-
Monitor Connection Pool Health (Critical)
Database connection pool exhaustion is a common scaling bottleneck.

    aws rds describe-db-instances --db-instance-identifier myco-primary | grep -A 5 "DBParameterGroups"

Check: max_connections setting, active vs idle connections, connection wait time
Fix: Use a connection pooler like PgBouncer, or AWS RDS Proxy for automatic pooling
-
Analyze DynamoDB Throttling (High, if applicable)
DynamoDB hot partitions are common in streaming apps, where one viral stream creates hot keys.

    aws dynamodb describe-table --table-name myco-streams --query 'Table.ProvisionedThroughput'

DynamoDB Hot Partition Signs:
• Throttled Requests: ProvisionedThroughputExceededException
• Hot Partition Key: One stream getting 90% of traffic
• Insufficient RCU/WCU: Under-provisioned for peak load
• Poor Key Design: Keys not distributing load evenly
• Missing Global Secondary Indexes: Table scans instead of queries

Solutions: Use stream_id + timestamp composite keys, enable DynamoDB auto-scaling, add a random suffix to hot keys
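The "random suffix" fix spreads one hot stream's writes across several partitions at the cost of fan-out reads. A minimal sketch of the key scheme (shard count and key shapes are illustrative; this is application-side logic, not a DynamoDB API):

```python
import random

NUM_SHARDS = 10  # more shards = more write spread, more read fan-out

def write_key(stream_id):
    """Partition key with a random shard suffix to spread write load."""
    return f"{stream_id}#{random.randrange(NUM_SHARDS)}"

def read_keys(stream_id):
    """All shard keys to query and merge when reading the stream back."""
    return [f"{stream_id}#{i}" for i in range(NUM_SHARDS)]
```

Writes for a viral stream now land on 10 partitions instead of 1; the tradeoff is that every read must query all 10 suffixes and merge the results.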
-
Review Read Replica Strategy (High)
Read-heavy workloads like user profiles and stream metadata benefit from read replicas.

    aws rds describe-db-instances --query 'DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=`null`]'

Best Practice: Route read queries to replicas, keep writes on the primary, monitor replica lag
Proper caching can reduce database load by 90%+. Missing or misconfigured cache layers often cause database bottlenecks at scale.
-
Monitor Cache Hit Ratios (Critical)
Low cache hit ratios indicate cache misses that flood the database.

    aws elasticache describe-cache-clusters --cache-cluster-id myco-redis-001

Target Hit Ratios: Session data >95%, user profiles >90%, stream metadata >85%
Monitor: CacheHits vs CacheMisses, evictions, memory utilization
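The hit ratios above can also be tracked from inside the application, which makes per-data-type targets easy to enforce. A minimal cache-aside wrapper that counts its own hits and misses (hypothetical class, not an ElastiCache API):

```python
class TrackedCache:
    """Cache-aside store that records hits/misses for hit-ratio monitoring."""

    def __init__(self):
        self.store, self.hits, self.misses = {}, 0, 0

    def get_or_load(self, key, loader):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = loader(key)        # fall through to the database on a miss
        self.store[key] = value
        return value

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Exporting `hit_ratio` as a custom metric lets you alert when, say, session-data hits drop below the 95% target.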
-
Check Cache Key Strategy (Critical)
Poor cache key design leads to unnecessary misses and inefficient memory usage.

Effective Cache Key Patterns:
• User Sessions: session:${token_hash}
• Stream Data: stream:${stream_id}:metadata
• Viewer Counts: count:${stream_id}:${time_bucket}
• User Profiles: user:${user_id}:profile
• Chat History: chat:${stream_id}:${page}

Avoid: Keys without namespaces, keys with user input (injection risk), overly long keys
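The patterns above can be centralized in small key-builder functions, which is also where the "no raw user input" rule is enforced. A sketch (function names and the 30-second bucket are illustrative):

```python
import hashlib

def session_key(token):
    # Hash the raw token: never embed user input directly in a key
    # (avoids injection via separators and keeps key length bounded).
    return "session:" + hashlib.sha256(token.encode()).hexdigest()[:32]

def stream_meta_key(stream_id):
    return f"stream:{stream_id}:metadata"

def viewer_count_key(stream_id, ts, bucket_s=30):
    # Time-bucketed so counts roll over naturally every bucket_s seconds.
    return f"count:{stream_id}:{int(ts) // bucket_s}"
```

Routing every cache access through builders like these keeps namespaces consistent and makes it impossible to construct an un-namespaced key by accident.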
-
Validate Cache Expiration Strategy (High)
Wrong TTLs cause either stale data or excessive database hits.

Recommended TTLs by Data Type:
• User Sessions: 24 hours (match session timeout)
• Stream Metadata: 5 minutes (relatively static)
• Viewer Counts: 30 seconds (frequently changing)
• User Profiles: 1 hour (occasionally updated)
• Configuration: 10 minutes (rare changes)
-
Check Memory Usage and Eviction Policy (High)
Running out of cache memory triggers evictions that hurt performance.

    aws elasticache describe-cache-clusters --cache-cluster-id myco-redis-001 --show-cache-node-info

Target: Memory usage <80%, evictions <1% of total operations
Eviction Policy: Use allkeys-lru for general caching, volatile-ttl for session data
-
Review Cache Warming Strategy (Medium)
Cold caches after deployments or failures cause database load spikes.
Strategy: Pre-load popular streams, warm user sessions gradually, use consistent hashing across multiple cache nodes
Inefficient code patterns that work fine at low load become catastrophic bottlenecks at scale. These require code changes, not just infrastructure scaling.
-
Profile Application Under Load (Critical)
Use application profiling tools to identify hot code paths and memory leaks.

Profiling Tools by Language:
• Node.js: node --prof, 0x, clinic.js
• Python: cProfile, py-spy, memory_profiler
• Java: JProfiler, async-profiler, JVM metrics
• Go: pprof, go tool trace
• Universal: AWS X-Ray for distributed tracing

Focus on: CPU hotspots, memory allocation patterns, blocking I/O operations
-
Identify N+1 Query Problems (Critical)
N+1 queries are the most common database performance killer in streaming apps.

Common N+1 Patterns in Streaming Apps:
• Stream List: Loading streams, then user details for each
• Chat Messages: Fetching messages, then user profiles for each
• Viewer Lists: Getting viewers, then user info for each
• Recommendations: Loading streams, then category/metadata for each
• Analytics: Aggregating data without proper joins

Solutions: Use SELECT with JOINs, the DataLoader pattern, batch API calls, eager loading in ORMs
-
Review Async vs Sync Operations (High)
Blocking synchronous calls on hot paths limit concurrency and cause thread pool starvation.

Operations That Should Be Async:
• Database Queries: Use connection pooling with async drivers
• External API Calls: Stream health checks, CDN uploads
• Cache Operations: Redis/Memcached lookups
• File I/O: Log writes, config reads
• Notifications: Email, push notifications, webhooks
-
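For independent lookups like the ones listed above, the usual fix is to fan them out concurrently instead of awaiting each in sequence. A minimal asyncio sketch (function names are illustrative stand-ins for real I/O calls, with `asyncio.sleep` simulating the round trip):

```python
import asyncio

async def fetch_stream_meta(stream_id):
    await asyncio.sleep(0.01)  # stands in for a cache/DB round trip
    return {"id": stream_id, "status": "live"}

async def page_data(stream_ids):
    # gather() runs all lookups concurrently: total wait ~ the slowest
    # single call, not the sum of all calls.
    return await asyncio.gather(*(fetch_stream_meta(s) for s in stream_ids))
```

With 50 lookups at 10ms each, the sequential version takes ~500ms while the gathered version stays near 10ms, which is exactly the latency headroom a hot path needs.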
Check Thread Pool and Connection Pool Sizing (High)
Incorrectly sized pools cause either resource waste or performance bottlenecks.
Rough Guidelines: DB pool ~2x CPU cores, HTTP client pool ~10x CPU cores, thread pool size depends on the I/O-to-compute ratio
Monitor: Pool utilization, queue length, wait times
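The "depends on the I/O ratio" rule can be made concrete with the standard sizing heuristic threads = cores x (1 + wait/compute); the ratios below are illustrative, and real pools should still be validated under load:

```python
def thread_pool_size(cores, wait_ms, compute_ms):
    """Heuristic pool size: cores * (1 + wait/compute).

    A request that waits 90ms on I/O per 10ms of CPU keeps a core idle
    90% of the time, so ~10 threads per core keep the core busy.
    """
    return max(cores, round(cores * (1 + wait_ms / compute_ms)))
```

For a pure-CPU workload (wait=0) this collapses to one thread per core, which is why CPU-bound and I/O-bound services need very different pool sizes.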
-
Analyze Memory Usage Patterns (High)
Memory leaks and excessive garbage collection cause performance degradation over time.

Common Memory Issues:
• Unbounded Caches: In-memory caches without size limits
• WebSocket Leaks: Not cleaning up closed connections
• Event Listener Leaks: Accumulating listeners over time
• Large Object Retention: Holding references to large data structures
• Missing Buffer Pooling: Not reusing buffers for video/media processing
-
Review Error Handling and Retry Logic (Medium)
Poor error handling can cause cascading failures and retry storms that make problems worse.
Best Practices: Exponential backoff, circuit breakers, fail-fast for permanent errors, proper timeouts
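The two core pieces above are small enough to sketch directly: full-jitter exponential backoff (randomized delay up to a capped exponential), and a minimal failure-count circuit breaker. Thresholds and defaults are illustrative; production breakers also need a half-open recovery state:

```python
import random

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)].

    The jitter prevents synchronized retry storms; the cap bounds worst-case wait.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Opens (fails fast) after N consecutive failures; any success resets it."""

    def __init__(self, failure_threshold=5):
        self.failures = 0
        self.failure_threshold = failure_threshold

    @property
    def open(self):
        return self.failures >= self.failure_threshold

    def record(self, success):
        self.failures = 0 if success else self.failures + 1
```

Callers check `breaker.open` before attempting the call and fail fast when it is open, which is what stops a struggling dependency from being hammered by retries.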
With proper access, Claude Code can automate much of the diagnostic work and propose specific fixes. Here's what it can and cannot do.
-
Grant Claude Code AWS Read Access (Critical)
Provide Claude Code with read-only access to diagnose infrastructure issues.

Required AWS Permissions:
• CloudWatch: GetMetricStatistics, DescribeAlarms, GetLogEvents
• ELB: DescribeLoadBalancers, DescribeTargetHealth
• EC2/Auto Scaling: DescribeInstances, DescribeAutoScalingGroups
• RDS: DescribeDBInstances (add the pi: actions if Performance Insights data is needed)
• ElastiCache: DescribeCacheClusters
• X-Ray: GetTraceSummaries, BatchGetTraces

    # Example IAM policy for Claude Code read access
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:DescribeAlarms",
            "logs:GetLogEvents",
            "ec2:Describe*",
            "autoscaling:Describe*",
            "elasticloadbalancing:Describe*",
            "rds:Describe*",
            "elasticache:Describe*",
            "xray:GetTraceSummaries",
            "xray:BatchGetTraces"
          ],
          "Resource": "*"
        }
      ]
    }
-
Provide Repository Access (Critical)
Claude Code needs access to your application code to identify performance anti-patterns.
Scope: Application code, infrastructure as code (Terraform/CloudFormation), deployment scripts, configuration files
Security: Use read-only access initially, then approve specific PRs for fixes
-
Define Claude Code Tasks (High)
Specify what you want Claude Code to focus on first.

What Claude Code Can Do Well:
• Metrics Analysis: Pull CloudWatch data and identify bottlenecks
• Code Review: Find N+1 queries, memory leaks, blocking operations
• Infrastructure Audit: Review autoscaling, caching, database configs
• Load Test Creation: Write realistic load tests with k6/Artillery
• Monitoring Setup: Add missing metrics, alerts, dashboards
• Performance Fixes: Implement caching, async patterns, query optimization

What Requires Human Review:
• Production Changes: Infrastructure modifications need approval
• Architecture Decisions: Major redesigns require business context
• Budget Impact: Scaling decisions affect costs
• Breaking Changes: API modifications need compatibility review
-
Establish Testing Environment (High)
Claude Code should test fixes in staging before production deployment.
Requirements: A staging environment that mirrors production, realistic test data, the ability to simulate load
Safety: All changes tested and approved before production rollout
-
Set Up Continuous Monitoring (Medium)
Implement monitoring that will catch performance regressions before they impact users.
Automated Alerts: Latency spikes, error rate increases, capacity thresholds, deployment impact
📊 Ready to Begin Diagnosis
Complete each phase systematically, starting with Phase 1. Use this checklist to track your progress and ensure no critical steps are missed.