Stress testing to determine website load limits

Our company develops, supports, and maintains websites of any complexity, from simple one-page sites to large-scale clustered systems built on microservices. Our developers' expertise is confirmed by vendor certificates.
Development and maintenance of all types of websites:
  • Informational websites or web applications: business card websites, landing pages, corporate websites, online catalogs, quizzes, promo websites, blogs, news resources, informational portals, forums, aggregators
  • E-commerce websites or web applications: online stores, B2B portals, marketplaces, online exchanges, cashback services, currency exchangers, dropshipping platforms, product parsers
  • Business process management web applications: CRM systems, ERP systems, corporate portals, production management systems, information parsers
  • Electronic service websites or web applications: classified ads platforms, online schools, online cinemas, website builders, electronic services portals, video hosting platforms, thematic portals

These are just some of the technical types of websites we work with; each of them can have its own specific features and functionality and can be customized to meet the client's specific needs and goals.


Stress Testing: Determining Load Limits

A stress test intentionally pushes the system beyond its normal operating load to find the breaking point. It answers questions such as: at what RPS do errors start rising? How does the system recover after an overload? Where is the bottleneck: the database, CPU, memory, or network?

Methodology

Step 1: Define baseline. Run normal load (50–70% of expected peak) and record metrics: p95 latency, error rate, CPU/memory.

Step 2: Stepwise increase. Raise load in steps of 10–20% every 2–5 minutes. Record point where errors or latency start rising.

Step 3: Find breaking point. Continue until degradation (error rate > 5% or latency > 5x baseline).

Step 4: Recovery. Remove load and observe how quickly the system returns to normal.
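
The degradation criteria from Steps 3–4 can be encoded as a simple check. This is a sketch, not part of the test tooling; the 5% error and 5x-baseline-latency thresholds are the ones stated above, while the intermediate "warning" thresholds are illustrative assumptions:

```python
def classify_step(baseline_p95_ms, p95_ms, error_rate):
    """Classify one load step against the baseline: 'ok', 'warning', or 'breaking'.

    Thresholds follow Step 3 above: breaking when error rate > 5%
    or p95 latency exceeds 5x baseline. The 'warning' band (2x latency
    or 1% errors) is an illustrative assumption.
    """
    if error_rate > 0.05 or p95_ms > 5 * baseline_p95_ms:
        return 'breaking'
    if p95_ms > 2 * baseline_p95_ms or error_rate > 0.01:
        return 'warning'
    return 'ok'

# Example with a 200 ms baseline p95:
print(classify_step(200, 250, 0.001))   # ok
print(classify_step(200, 600, 0.02))    # warning
print(classify_step(200, 1200, 0.08))   # breaking
```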

k6 Stress Test Scenario

// tests/stress/breaking-point.js
import http from 'k6/http'
import { check, sleep } from 'k6'
import { Rate, Counter } from 'k6/metrics'

const errorRate = new Rate('errors')
const requestsPerSecond = new Counter('requests_per_second')  // total requests; the per-second rate is derived at query time

export const options = {
  stages: [
    // Warmup to normal traffic
    { duration: '2m',  target: 50 },
    { duration: '3m',  target: 50 },   // baseline level

    // Stepwise increase
    { duration: '2m',  target: 100 },
    { duration: '3m',  target: 100 },

    { duration: '2m',  target: 200 },
    { duration: '3m',  target: 200 },

    { duration: '2m',  target: 400 },
    { duration: '3m',  target: 400 },

    { duration: '2m',  target: 800 },
    { duration: '3m',  target: 800 },

    { duration: '2m',  target: 1600 },
    { duration: '3m',  target: 1600 },

    // Cooldown and observe recovery
    { duration: '5m',  target: 50 },
    { duration: '3m',  target: 0 },
  ],

  // Don't abort on threshold breach—need to see full picture
  thresholds: {
    http_req_duration: [
      { threshold: 'p(95)<2000', abortOnFail: false },
    ],
    errors: [
      { threshold: 'rate<0.1', abortOnFail: false }
    ]
  }
}

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000'

export default function() {
  const responses = http.batch([
    ['GET', `${BASE_URL}/api/products?limit=20`],
    ['GET', `${BASE_URL}/api/categories`],
  ])

  responses.forEach(r => {
    check(r, { 'status 2xx': (r) => r.status >= 200 && r.status < 300 })
    errorRate.add(r.status >= 400)
  })

  requestsPerSecond.add(2)
  sleep(0.1)
}

export function handleSummary(data) {
  // The end-of-test summary contains only aggregate metrics; a per-stage
  // breakdown requires the time-series output (see the Prometheus section)
  return {
    'stress-results.json': JSON.stringify(data, null, 2),
    stdout: generateReport(data)
  }
}

function generateReport(data) {
  const p95 = data.metrics.http_req_duration.values['p(95)']
  const errRate = data.metrics.errors.values.rate
  return `
=== STRESS TEST REPORT ===
Overall p95: ${p95.toFixed(0)}ms | Error rate: ${(errRate * 100).toFixed(1)}%
(per-stage breaking point analysis: use the exported time series)
`
}

Monitoring During Test

Run system metrics collection in parallel:

#!/bin/bash
# scripts/monitor-stress-test.sh

TARGET_HOST="app-server-ip"
INTERVAL=10  # seconds

while true; do
  TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)

  # CPU, Memory, Load Average, established connections — one line per sample
  ssh $TARGET_HOST "
    echo -n '$TIMESTAMP '
    top -bn1 | grep 'Cpu(s)' | awk '{printf \"cpu:%s \", \$2}'
    free | grep Mem | awk '{printf \"mem:%.1f \", \$3/\$2 * 100}'
    awk '{printf \"load:%s \", \$1}' /proc/loadavg
    ss -s | grep -o 'estab [0-9]*' | awk '{print \"conns:\" \$2}'
  "

  # PostgreSQL: active queries and locks
  ssh $TARGET_HOST "
    PGPASSWORD=pass psql -U app -d appdb -t -c \"
      SELECT 'active_queries:', count(*) FROM pg_stat_activity
        WHERE state = 'active' AND query NOT LIKE '%pg_stat%';
      SELECT 'long_queries:', count(*) FROM pg_stat_activity
        WHERE state = 'active' AND query_start < NOW() - interval '5 seconds';
      SELECT 'locks:', count(*) FROM pg_locks WHERE NOT granted;
    \"
  "

  sleep $INTERVAL
done | tee stress-monitor.log
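
To correlate these samples with the k6 timeline, the log needs to be parsed back into structured records. A minimal parser, assuming the `timestamp key:value key:value ...` line format the script emits (the field names are whatever the script prints, not a standard format):

```python
def parse_monitor_line(line):
    """Parse one 'TIMESTAMP key:value ...' line from stress-monitor.log
    into a dict of floats keyed by metric name."""
    parts = line.split()
    sample = {'timestamp': parts[0]}  # ISO timestamp, contains ':' itself
    for token in parts[1:]:
        key, _, value = token.partition(':')
        if value:
            sample[key] = float(value)
    return sample

# Example line in the format produced by the monitoring loop:
line = '2024-01-15T12:00:10Z cpu:42.3 mem:71.5 load:3.10 conns:812'
print(parse_monitor_line(line))
```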

Analyzing Results with Prometheus + Grafana

# k6 with Prometheus Remote Write. K6_* settings configure the k6 process
# itself, so they are set as shell environment variables (not via --env,
# which only exposes values to the test script)
K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write \
K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true \
k6 run -o experimental-prometheus-rw tests/stress/breaking-point.js
# Grafana queries for stress test analysis

# RPS in real time
rate(k6_http_reqs_total[30s])

# Error rate over time (find degradation moment)
rate(k6_http_req_failed_total[30s]) / rate(k6_http_reqs_total[30s])

# p95 latency in real time (native histogram, as enabled above; classic
# histograms would use the _bucket series instead)
histogram_quantile(0.95, rate(k6_http_req_duration_seconds[30s]))

# Correlation: plot RPS, p95 latency, and error rate on one dashboard
# to pinpoint the exact moment of degradation

Identifying Bottleneck

# analyze_stress_results.py
import json

def analyze_breaking_point(results_file):
    # Expects pre-aggregated per-interval metrics; the find_* helpers
    # below scan these series (their implementations are omitted here)
    with open(results_file) as f:
        data = json.load(f)

    metrics = data['metrics']

    analysis = {
        'max_rps_before_errors': find_max_sustainable_rps(metrics),
        'error_threshold_rps': find_error_threshold(metrics),
        'latency_degradation_point': find_latency_degradation(metrics),
        'recovery_time_seconds': find_recovery_time(metrics),
    }

    print("=== Breaking Point Analysis ===")
    print(f"Max sustainable RPS (< 1% errors): {analysis['max_rps_before_errors']}")
    print(f"Error threshold RPS: {analysis['error_threshold_rps']}")
    print(f"p95 > 1s at RPS: {analysis['latency_degradation_point']}")
    print(f"Recovery time after load removal: {analysis['recovery_time_seconds']}s")

    # Recommendations
    if analysis['max_rps_before_errors'] < 100:
        print("\n[!] LOW capacity. Consider: DB connection pooling, caching, horizontal scaling")
    if analysis['recovery_time_seconds'] > 120:
        print("\n[!] SLOW recovery. Consider: circuit breakers, graceful degradation")

    return analysis
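
The `find_*` helpers are not shown above. As one illustration, recovery time (Step 4 of the methodology) can be derived from a per-interval error-rate series; the `(timestamp, error_rate)` tuple shape used here is an assumption for the sketch, not a k6 output format:

```python
def find_recovery_time(samples, load_removed_at, baseline_error_rate=0.01):
    """Seconds from load removal until the error rate returns to baseline.

    samples: list of (unix_ts, error_rate) tuples sorted by time;
    the data shape is illustrative, not a k6 format.
    Returns None if the system never recovers in the observed window.
    """
    for ts, error_rate in samples:
        if ts >= load_removed_at and error_rate <= baseline_error_rate:
            return ts - load_removed_at
    return None

# Load removed at t=60; errors drop back to baseline at t=120 -> 60 s
samples = [(0, 0.30), (30, 0.25), (60, 0.12), (90, 0.04), (120, 0.005)]
print(find_recovery_time(samples, load_removed_at=60))  # 60
```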

Typical Bottlenecks and Diagnostics

Symptom                  | Probable Cause               | Diagnostics
-------------------------|------------------------------|----------------------------------
Latency grows, CPU low   | DB locks or slow queries     | pg_stat_activity, slow query log
CPU 100%, few errors     | Computational bottleneck     | top, application profiler
ENOMEM errors            | Memory leak or OOM           | free -m, /proc/meminfo
Connection refused       | Connection pool exhausted    | pgBouncer stats, netstat
502 Bad Gateway          | Worker processes overloaded  | Nginx error log, worker_processes

Timeline

Stress test with stepwise load profile, monitoring, and breaking point analysis—2–3 business days.