Analyzing Bottlenecks from Load Test Results
The load test showed degradation; now find the cause. Analyze sequentially: network → application → database → infrastructure. Fix one bottleneck at a time and rerun the test: removing the first bottleneck usually exposes the next one.
Diagnostic Sequence
High latency or errors
│
├── High p95 latency, CPU < 70%, memory OK
│   └── → Database: slow queries, locks, N+1
│
├── CPU 90–100%, latency grows proportionally
│   └── → Computational bottleneck: profile CPU-hot paths
│
├── Memory grows, swap active
│   └── → Memory leak or heap too small
│
├── ENOMEM / EMFILE / ECONNREFUSED
│   └── → System limits: ulimit, file descriptors, TCP backlog
│
└── 502/504 errors, app OK
    └── → Nginx upstream, load balancer timeout
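For the system-limits branch, the limits in effect for the current process can be read directly. A minimal sketch using the standard-library `resource` module (Unix-only):

```python
import resource

# Soft/hard cap on open file descriptors: EMFILE appears when the soft cap is hit
soft_nofile, hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft_nofile}, hard={hard_nofile}")

# Address-space cap: ENOMEM can fire here long before physical RAM runs out
soft_as, hard_as = resource.getrlimit(resource.RLIMIT_AS)
print(f"address space: soft={soft_as}, hard={hard_as}")  # -1 means unlimited
```

If the soft file-descriptor limit is low, raise it with `ulimit -n` (or `LimitNOFILE=` in a systemd unit) before the process starts, then rerun the test.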
Analyzing PostgreSQL Under Load
-- Queries running right now (execute during the test)
SELECT pid, now() - query_start AS duration,
       state, wait_event_type, wait_event,
       left(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
  AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC;
-- Locks: who is blocking whom
SELECT blocked.pid, blocked.query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
-- Heaviest queries (pg_stat_statements)
SELECT query, calls, mean_exec_time, total_exec_time,
       stddev_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Missing indexes: sequential scans on large tables
SELECT relname, seq_scan, seq_tup_read,
       idx_scan, seq_tup_read / nullif(seq_scan, 0) AS avg_rows_per_seqscan
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND seq_tup_read > 10000
ORDER BY seq_tup_read DESC;
Node.js: CPU Profiler
// server.js: enable V8 CPU profiling on demand via a signal
process.on('SIGUSR1', () => {
  const { Session } = require('inspector')
  const session = new Session()
  session.connect()
  session.post('Profiler.enable')
  session.post('Profiler.start')
  // Profile for 30 seconds
  setTimeout(() => {
    session.post('Profiler.stop', (err, { profile }) => {
      require('fs').writeFileSync('./cpu-profile.cpuprofile', JSON.stringify(profile))
      console.log('CPU profile saved to cpu-profile.cpuprofile')
      session.disconnect()
    })
  }, 30000)
})
// Run under load: kill -USR1 <pid>
// Open in Chrome DevTools → More Tools → JavaScript Profiler
# Alternative: 0x flamegraph
npm install -g 0x
0x --output-dir profile -- node server.js &
APP_PID=$!
# Run k6 test
k6 run tests/load/main.js
# Stop the process; 0x generates the flamegraph on SIGINT
kill -SIGINT $APP_PID
# Open profile/flamegraph.html in a browser
Flamegraph: What to Look For
Wide, flat frames indicate long stretches of on-CPU work. Typical findings:
- JSON.parse/stringify in a hot path: switch to streaming or a schema-based serializer
- bcrypt with a high cost factor: reduce the cost or cache sessions
- Regex compiled inside the hot function: hoist the compiled pattern out
- Synchronous file I/O (fs.readFileSync) inside a request handler
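The regex finding generalizes beyond Node.js. A short Python illustration of hoisting the compiled pattern out of the hot path (function and pattern names are illustrative):

```python
import re

# Bad: the pattern string is re-looked-up in the regex cache on every call
def slug_bad(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Better: compile once at module load; the hot path reuses the compiled object
_SLUG_RE = re.compile(r"[^a-z0-9]+")

def slug_good(title: str) -> str:
    return _SLUG_RE.sub("-", title.lower()).strip("-")

print(slug_good("Load Testing 101"))  # load-testing-101
```

The payoff shows up only at request-handler call rates; at low volume both versions are indistinguishable, which is exactly why the flamegraph is needed to spot it.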
Python: Production Profiler Under Load
# pyinstrument: low-overhead sampling profiler, safe to enable in production
pip install pyinstrument

# Middleware for Flask
from flask import Flask, request, g
from pyinstrument import Profiler

app = Flask(__name__)

@app.before_request
def start_profiler():
    if request.args.get('profile') == 'true':
        g.profiler = Profiler()
        g.profiler.start()

@app.after_request
def stop_profiler(response):
    if hasattr(g, 'profiler'):
        g.profiler.stop()
        # Replace the response body with the HTML report
        response.data = g.profiler.output_html()
        response.content_type = 'text/html'
    return response

# Request with profiling: GET /api/posts?profile=true
Analyzing Connection Pool
# Monitor pgBouncer during test
psql -h localhost -p 6432 pgbouncer -c "SHOW POOLS;"
# What to watch:
# cl_active: clients actively working
# cl_waiting: clients waiting for connection (>0 = problem)
# sv_active: server connections active
# sv_idle: idle connections in pool
# maxwait: maximum wait time (sec)
-- PostgreSQL: connection statistics per database
SELECT datname, count(*) AS total_connections,
       count(*) FILTER (WHERE state = 'active') AS active,
       count(*) FILTER (WHERE state = 'idle') AS idle,
       count(*) FILTER (WHERE wait_event_type = 'Lock') AS waiting_lock
FROM pg_stat_activity
GROUP BY datname;
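The SHOW POOLS counters can be turned into a quick pass/fail check. A sketch with a hypothetical 80% alert threshold (the argument names mirror pgBouncer's SHOW POOLS columns):

```python
def pool_health(cl_active: int, cl_waiting: int, pool_size: int) -> str:
    """Classify pgBouncer pool state from SHOW POOLS counters."""
    if cl_waiting > 0:
        # Clients queueing for a server connection: the pool is the bottleneck
        return "saturated: clients are queueing for a server connection"
    utilization = cl_active / pool_size
    if utilization > 0.8:  # hypothetical alert threshold
        return f"warning: pool {utilization:.0%} utilized"
    return "ok"

print(pool_health(cl_active=18, cl_waiting=0, pool_size=20))  # warning: pool 90% utilized
print(pool_health(cl_active=20, cl_waiting=5, pool_size=20))
```

Sampling this once per second during the test shows whether saturation coincides with the latency degradation point.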
Analyzing k6 Results: Finding Degradation Moment
# parse_k6_results.py
import json
import pandas as pd

def find_degradation_point(json_results: str):
    """Find the degradation moment from the metrics time series"""
    records = []
    with open(json_results) as f:
        for line in f:
            try:
                record = json.loads(line)
                if record.get('type') == 'Point':
                    records.append({
                        'timestamp': record['data']['time'],
                        'metric': record['metric'],
                        'value': record['data']['value']
                    })
            except (json.JSONDecodeError, KeyError):
                continue

    df = pd.DataFrame(records)
    df['timestamp'] = pd.to_datetime(df['timestamp'])

    # Analyze p95 latency in 1-minute windows
    p95_df = df[df['metric'] == 'http_req_duration'].copy()
    p95_df = p95_df.set_index('timestamp').resample('1min')['value'].quantile(0.95)

    # Find the first minute in which p95 exceeded the threshold
    threshold = 500  # ms
    degradation = p95_df[p95_df > threshold]
    if not degradation.empty:
        print(f"Degradation detected at: {degradation.index[0]}")
        print(f"p95 at degradation: {degradation.iloc[0]:.0f}ms")
    else:
        print("No degradation detected (all within threshold)")
    return p95_df
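The threshold logic can be sanity-checked on synthetic data without pandas. A toy sketch with hand-made k6-style "Point" records, using the per-minute maximum as a stand-in for p95 on such tiny windows:

```python
# Synthetic k6 "Point" records: latency fine for two minutes, then degraded
points = [
    {"type": "Point", "metric": "http_req_duration",
     "data": {"time": f"2024-01-01T10:{m:02d}:00Z", "value": v}}
    for m, v in [(0, 120), (0, 180), (1, 150), (1, 200), (2, 640), (2, 720)]
]

threshold = 500  # ms
by_minute = {}
for p in points:
    minute = p["data"]["time"][:16]  # truncate ISO timestamp to the minute
    by_minute.setdefault(minute, []).append(p["data"]["value"])

# With two samples per window, the max stands in for p95
degraded = [m for m, vals in sorted(by_minute.items()) if max(vals) > threshold]
print(degraded[0] if degraded else "no degradation")  # 2024-01-01T10:02
```

On real runs, feed the NDJSON that k6 emits with its JSON output option into the pandas version above; the toy only demonstrates the windowing and threshold step.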
Typical Optimizations After Analysis
| Bottleneck | Symptom | Solution |
|---|---|---|
| N+1 queries to DB | DB active queries >> VU count | DataLoader / eager loading / JOIN |
| Missing index | SeqScan on large table | CREATE INDEX CONCURRENTLY |
| Slow JSON serialization | High CPU, hot path in serialize | Protobuf / simdjson / msgpack |
| Connection pool overflow | cl_waiting > 0 in pgBouncer | Increase pool_size or add replicas |
| GC pauses | Spiky latency without CPU load | Increase heap, tune GC flags |
| Table locks | wait_event = Lock in pg_stat_activity | Optimize operation order, NOWAIT |
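For the N+1 row, the batching idea behind DataLoader fits in a few lines. `FakeDB` and its methods are hypothetical stand-ins for a real data layer, used here only to count issued queries:

```python
class FakeDB:
    """Hypothetical data layer that counts issued queries."""
    def __init__(self, authors):
        self.authors = authors
        self.query_count = 0

    def get_author(self, author_id):
        self.query_count += 1          # one query per call: the N+1 shape
        return self.authors[author_id]

    def get_authors(self, ids):
        self.query_count += 1          # single WHERE id IN (...) query
        return {i: self.authors[i] for i in ids}

def fetch_authors_naive(posts, db):
    return [db.get_author(p["author_id"]) for p in posts]

def fetch_authors_batched(posts, db):
    ids = {p["author_id"] for p in posts}
    authors = db.get_authors(ids)      # collapse N lookups into one round trip
    return [authors[p["author_id"]] for p in posts]

posts = [{"author_id": i % 3} for i in range(100)]
db = FakeDB({0: "ann", 1: "bob", 2: "eve"})
fetch_authors_naive(posts, db)
print(db.query_count)   # 100

db.query_count = 0
fetch_authors_batched(posts, db)
print(db.query_count)   # 1
```

The same collapse is what ORM eager loading (e.g. a `selectin`-style strategy) does under the hood; the "DB active queries >> VU count" symptom disappears because each request now issues a bounded number of queries.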
Timeline
Full bottleneck analysis from load test results, with recommendations and a verification test: 1–2 business days.