Stress Testing: Determining Load Limits
A stress test intentionally overloads the system beyond normal operation to find its breaking point. It answers questions such as: at what RPS do errors start rising? How does the system recover after overload? Where is the bottleneck: the database, CPU, memory, or network?
Methodology
Step 1: Define baseline. Run normal load (50–70% of expected peak) and record metrics: p95 latency, error rate, CPU/memory.
Step 2: Stepwise increase. Raise load in steps of 10–20% every 2–5 minutes. Record point where errors or latency start rising.
Step 3: Find breaking point. Continue until degradation (error rate > 5% or latency > 5x baseline).
Step 4: Recovery. Remove load and observe how quickly the system returns to normal.
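The stop criteria from steps 2–3 can be sketched as a simple predicate. This is an illustrative helper, not part of any tool: the function names and the `(rps, error_rate, p95_ms)` tuple shape are assumptions; the 5% and 5x thresholds come from step 3.

```python
# Hypothetical helper: decide whether a load step shows degradation,
# using the step-3 criteria (error rate > 5% or p95 > 5x baseline).
def is_degraded(error_rate: float, p95_ms: float, baseline_p95_ms: float) -> bool:
    return error_rate > 0.05 or p95_ms > 5 * baseline_p95_ms

def find_breaking_point(steps, baseline_p95_ms):
    """steps: list of (rps, error_rate, p95_ms) per load step, ascending load."""
    for rps, err, p95 in steps:
        if is_degraded(err, p95, baseline_p95_ms):
            return rps
    return None  # no breaking point reached within the tested range

steps = [(100, 0.001, 120), (200, 0.004, 180), (400, 0.08, 900)]
print(find_breaking_point(steps, baseline_p95_ms=110))  # → 400
```

With a baseline p95 of 110 ms, the 400 RPS step is the first to breach a criterion (8% errors), so it is reported as the breaking point.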
k6 Stress Test Scenario
// tests/stress/breaking-point.js
import http from 'k6/http'
import { check, sleep } from 'k6'
import { Rate, Counter } from 'k6/metrics'

const errorRate = new Rate('errors')
const totalRequests = new Counter('total_requests') // k6 derives the per-second rate in the summary

export const options = {
  stages: [
    // Warmup to normal traffic
    { duration: '2m', target: 50 },
    { duration: '3m', target: 50 }, // baseline level
    // Stepwise increase
    { duration: '2m', target: 100 },
    { duration: '3m', target: 100 },
    { duration: '2m', target: 200 },
    { duration: '3m', target: 200 },
    { duration: '2m', target: 400 },
    { duration: '3m', target: 400 },
    { duration: '2m', target: 800 },
    { duration: '3m', target: 800 },
    { duration: '2m', target: 1600 },
    { duration: '3m', target: 1600 },
    // Cooldown and observe recovery
    { duration: '5m', target: 50 },
    { duration: '3m', target: 0 },
  ],
  // Don't abort on threshold breach: we need the full picture
  thresholds: {
    http_req_duration: [
      { threshold: 'p(95)<2000', abortOnFail: false },
    ],
    errors: [
      { threshold: 'rate<0.1', abortOnFail: false },
    ],
  },
}

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000'

export default function () {
  const responses = http.batch([
    ['GET', `${BASE_URL}/api/products?limit=20`],
    ['GET', `${BASE_URL}/api/categories`],
  ])
  responses.forEach((r) => {
    check(r, { 'status 2xx': (res) => res.status >= 200 && res.status < 300 })
    errorRate.add(r.status >= 400)
  })
  totalRequests.add(2)
  sleep(0.1)
}
export function handleSummary(data) {
  // The end-of-test summary only contains run-wide aggregates; a
  // per-stage breakdown needs the time-series output (see the
  // Prometheus section below).
  const p95 = data.metrics.http_req_duration.values['p(95)']
  const errRate = data.metrics.errors ? data.metrics.errors.values.rate : 0
  const report = `
=== STRESS TEST REPORT ===
p95: ${p95.toFixed(0)}ms | Error rate: ${(errRate * 100).toFixed(1)}%
`
  return {
    'stress-results.json': JSON.stringify(data, null, 2),
    stdout: report,
  }
}
Monitoring During Test
Run system metrics collection in parallel:
#!/bin/bash
# scripts/monitor-stress-test.sh
TARGET_HOST="app-server-ip"
INTERVAL=10  # seconds

while true; do
  TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)

  # CPU, memory, load average, established connections (one line per sample)
  ssh "$TARGET_HOST" "
    echo -n '$TIMESTAMP '
    top -bn1 | grep 'Cpu(s)' | awk '{printf \"cpu:%s \", \$2}'
    free | grep Mem | awk '{printf \"mem:%.1f \", \$3/\$2 * 100}'
    awk '{printf \"load:%s \", \$1}' /proc/loadavg
    ss -s | grep -o 'estab [0-9]*' | awk '{printf \"conns:%s\\n\", \$2}'
  "

  # PostgreSQL: active queries and locks
  ssh "$TARGET_HOST" "
    PGPASSWORD=pass psql -U app -d appdb -t -c \"
      SELECT 'active_queries:', count(*) FROM pg_stat_activity
      WHERE state = 'active' AND query NOT LIKE '%pg_stat%';
      SELECT 'long_queries:', count(*) FROM pg_stat_activity
      WHERE state = 'active' AND query_start < NOW() - interval '5 seconds';
      SELECT 'locks:', count(*) FROM pg_locks WHERE NOT granted;
    \"
  "

  sleep $INTERVAL
done | tee stress-monitor.log
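To turn stress-monitor.log into a timeline you can correlate with the load stages, a small parser helps. This is a sketch assuming each sample line has the form `<timestamp> cpu:<n> mem:<n> load:<n> conns:<n>`; adjust the regex to your actual log format.

```python
import re

# Assumed log-line shape (one sample per line):
# "2024-01-01T12:00:00Z cpu:12.3 mem:45.6 load:1.23 conns:321"
LINE_RE = re.compile(
    r'^(?P<ts>\S+) cpu:(?P<cpu>[\d.]+) mem:(?P<mem>[\d.]+) '
    r'load:(?P<load>[\d.]+) conns:(?P<conns>\d+)$'
)

def parse_line(line):
    """Parse one monitor sample; returns None for non-matching lines
    (e.g. the psql output interleaved in the same log)."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    return {
        'ts': d['ts'],
        'cpu': float(d['cpu']),      # %
        'mem': float(d['mem']),      # %
        'load': float(d['load']),    # 1-min load average
        'conns': int(d['conns']),    # established TCP connections
    }

sample = "2024-01-01T12:00:00Z cpu:12.3 mem:45.6 load:1.23 conns:321"
print(parse_line(sample)['cpu'])  # → 12.3
```

Feeding the parsed samples into the same plots as the k6 metrics makes it easy to see whether CPU, memory, or connection count saturates first.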
Analyzing Results with Prometheus + Grafana
# k6 with Prometheus Remote Write. Note: the output reads its settings
# from the process environment; k6's --env flag only populates __ENV
# inside the script and would not configure the output.
K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write \
K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true \
k6 run -o experimental-prometheus-rw tests/stress/breaking-point.js
# Grafana queries for stress test analysis
# RPS in real time
rate(k6_http_reqs_total[30s])
# Error rate over time (find degradation moment)
rate(k6_http_req_failed_total[30s]) / rate(k6_http_reqs_total[30s])
# p95 latency in real time (native histogram is enabled above, so no _bucket suffix)
histogram_quantile(0.95, rate(k6_http_req_duration_seconds[30s]))
# Correlation: load vs latency vs errors
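The same queries can be pulled programmatically through Prometheus's HTTP API (`/api/v1/query_range`) to pinpoint when degradation started. A hedged sketch; the base URL, time range, and 5% threshold are placeholders for your environment:

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

def query_range(base_url, promql, start, end, step='30s'):
    """Fetch a range query from Prometheus and flatten it to
    (unix_ts, value) pairs. start/end are unix timestamps."""
    params = urlencode({'query': promql, 'start': start, 'end': end, 'step': step})
    with urlopen(f'{base_url}/api/v1/query_range?{params}') as resp:
        body = json.load(resp)
    # Each series carries values as [unix_ts, "value"] pairs
    return [(float(ts), float(v))
            for series in body['data']['result']
            for ts, v in series['values']]

def first_degraded_ts(points, threshold=0.05):
    """First timestamp where the error-rate series crosses the threshold."""
    for ts, value in points:
        if value > threshold:
            return ts
    return None

# Example (assumes Prometheus is reachable at this address):
# points = query_range('http://prometheus:9090',
#                      'rate(k6_http_req_failed_total[30s]) / rate(k6_http_reqs_total[30s])',
#                      start=1704100000, end=1704103600)
# print(first_degraded_ts(points))
```

Cross-referencing that timestamp with the VU ramp schedule gives the RPS level at which errors began.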
Identifying Bottleneck
# analyze_stress_results.py
import json

# NOTE: the find_* helpers below depend on the shape of your k6 output
# and are defined elsewhere in the project.

def analyze_breaking_point(results_file):
    with open(results_file) as f:
        data = json.load(f)

    # Extract time series
    metrics = data['metrics']
    analysis = {
        'max_rps_before_errors': find_max_sustainable_rps(metrics),
        'error_threshold_rps': find_error_threshold(metrics),
        'latency_degradation_point': find_latency_degradation(metrics),
        'recovery_time_seconds': find_recovery_time(metrics),
    }

    print("=== Breaking Point Analysis ===")
    print(f"Max sustainable RPS (< 1% errors): {analysis['max_rps_before_errors']}")
    print(f"Error threshold RPS: {analysis['error_threshold_rps']}")
    print(f"p95 > 1s at RPS: {analysis['latency_degradation_point']}")
    print(f"Recovery time after load removal: {analysis['recovery_time_seconds']}s")

    # Recommendations: independent checks, so plain `if`, not `elif`
    if analysis['max_rps_before_errors'] < 100:
        print("\n[!] LOW capacity. Consider: DB connection pooling, caching, horizontal scaling")
    if analysis['recovery_time_seconds'] > 120:
        print("\n[!] SLOW recovery. Consider: circuit breakers, graceful degradation")

    return analysis
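The helper functions are left to the reader; as an illustration, one plausible shape for `find_recovery_time` is shown below, with a simplified signature (a `(unix_ts, p95_ms)` series plus the load-removal timestamp) and an assumed "recovered" criterion of p95 back within 1.5x baseline:

```python
def find_recovery_time(samples, load_removed_at, baseline_p95_ms):
    """samples: list of (unix_ts, p95_ms), sorted by time.
    Returns seconds from load removal until p95 is back within
    1.5x baseline, or None if it never recovers in the data."""
    for ts, p95 in samples:
        if ts >= load_removed_at and p95 <= 1.5 * baseline_p95_ms:
            return ts - load_removed_at
    return None

# Synthetic example: p95 drops back under 165 ms (1.5 x 110) at t=160
samples = [(100, 900.0), (130, 400.0), (160, 150.0), (190, 120.0)]
print(find_recovery_time(samples, load_removed_at=100, baseline_p95_ms=110))  # → 60
```

The 1.5x factor is arbitrary; pick whatever your SLO treats as "back to normal".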
Typical Bottlenecks and Diagnostics
| Symptom | Probable Cause | Diagnostics |
|---|---|---|
| Latency grows, CPU low | DB locks or slow queries | pg_stat_activity, slow query log |
| CPU 100%, few errors | Computational bottleneck | top, application profiler |
| ENOMEM errors | Memory leak or OOM | free -m, /proc/meminfo |
| Connection refused | Connection pool exhausted | pgBouncer stats, netstat |
| 502 Bad Gateway | Worker processes overloaded | Nginx error log, worker_processes |
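The decision logic in the table can be encoded as a first-pass triage function. A sketch only: the thresholds are illustrative, not tuned, and real diagnosis should always confirm with the tools in the Diagnostics column.

```python
# First-pass triage mapping the symptom table to a diagnostic hint.
# Thresholds (90% memory, 50%/95% CPU) are illustrative assumptions.
def classify_bottleneck(p95_growing, cpu_pct, mem_pct, conn_refused, gateway_502):
    if conn_refused:
        return 'Connection pool exhausted: check pgBouncer stats, netstat'
    if gateway_502:
        return 'Workers overloaded: check Nginx error log, worker_processes'
    if mem_pct > 90:
        return 'Memory pressure: check free -m, /proc/meminfo'
    if p95_growing and cpu_pct < 50:
        return 'DB locks or slow queries: check pg_stat_activity, slow query log'
    if cpu_pct > 95:
        return 'Computational bottleneck: check top, application profiler'
    return 'No single obvious bottleneck: correlate metrics over time'

# Latency rising while CPU idles points at the database first
print(classify_bottleneck(True, 30, 60, False, False))
```

The ordering matters: hard failures (refused connections, 502s) are checked before soft symptoms, since they usually mask everything downstream.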
Timeline
A stress test with a stepwise load profile, monitoring, and breaking-point analysis takes 2–3 business days.