Developing a retry system for Bitrix24 integrations

Integrations fail. An external API returns 503, the network hiccups, a banking service goes offline for maintenance. The question isn't whether the integration will fail, but what happens after it does. A retry system is automatic recovery: didn't work now—we'll try again in a minute, an hour, a day. If after N attempts it still fails—notify a human.

Principles That Cannot Be Violated

Idempotency. A retry attempt must produce the same result as the first, without side effects. If an operation creates a payment order in the bank—the repeated call mustn't create a second one. For this, use idempotency_key (unique operation UUID)—the bank or external system ignores duplicates with the same key.
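A minimal sketch of the idea, assuming PHP 8. The uuidV4() helper and the FakeBank class are hypothetical stand-ins written for illustration; a real bank API defines its own field or header for the key and its own deduplication window.

```php
<?php
// Generate a random (version 4) UUID to use as an idempotency key.
function uuidV4(): string
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set RFC 4122 variant
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

// Hypothetical stand-in for an external system that deduplicates by key.
final class FakeBank
{
    /** @var array<string, int> idempotency key => payment order id */
    private array $orders = [];
    private int $nextId = 1;

    public function createPayment(string $idempotencyKey, int $amount): int
    {
        // A repeated key returns the original order instead of creating a new one.
        if (isset($this->orders[$idempotencyKey])) {
            return $this->orders[$idempotencyKey];
        }
        return $this->orders[$idempotencyKey] = $this->nextId++;
    }

    public function orderCount(): int
    {
        return count($this->orders);
    }
}

// The key is generated once, stored with the task, and reused on every retry.
$key = uuidV4();
$bank = new FakeBank();
$first = $bank->createPayment($key, 50000);
$retry = $bank->createPayment($key, 50000); // same key, so no second order
```

The important design point is that the key is created when the task is queued, not when the request is sent; generating a fresh key per attempt would defeat the deduplication.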

Exponential backoff. The first attempt runs immediately, and each retry waits twice as long as the previous one: the first retry after 1 minute, the next after 2 minutes, the next after 4 minutes. This prevents a storm of retry requests when an overloaded service recovers.

Jitter. Add a random component to the delay (±20%). If a thousand operations fail simultaneously and all retry with identical delays, the result is another storm. Jitter spreads the peak.

Maximum retry attempts. After N attempts (usually 5–10), the operation is marked as permanently failed. Then—manual intervention.
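The backoff, jitter, and attempt-limit rules can be combined in a few small functions. This is a sketch under stated assumptions: the function names, the 60-second base, the growth factor of 2 (a parameter, so another factor can be substituted), and the one-day cap are illustrative choices, not fixed requirements.

```php
<?php
// Base delay for a retry: $baseSeconds * $factor^$retryIndex, capped at $maxSeconds.
// $retryIndex is 0 for the first retry, 1 for the second, and so on.
function backoffSeconds(int $retryIndex, int $baseSeconds = 60, int $factor = 2, int $maxSeconds = 86400): int
{
    $delay = $baseSeconds * ($factor ** max(0, $retryIndex));
    return (int) min($delay, $maxSeconds);
}

// Shift the delay by a random amount within ±$ratio of its value.
function withJitter(int $delay, float $ratio = 0.2): int
{
    $spread = (int) round($delay * $ratio);
    return max(0, $delay + random_int(-$spread, $spread));
}

// Delay before the next run, or null when the attempt limit is exhausted.
// $attemptsMade counts every attempt so far, including the one that just failed.
function nextRetryDelay(int $attemptsMade, int $maxAttempts): ?int
{
    if ($attemptsMade >= $maxAttempts) {
        return null; // permanently failed: hand the task over to a human
    }
    return withJitter(backoffSeconds($attemptsMade - 1));
}
```

With the defaults, backoffSeconds() yields 60, 120, 240 seconds for the first three retries, and nextRetryDelay(5, 5) returns null because the limit has been reached.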

Queue Architecture with Retry

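For cloud Bitrix24 there is no server access, so custom PHP code (agents included) cannot run inside the portal. Retry is implemented in an external service (a separate PHP/Node.js server) with a Redis queue or RabbitMQ, which talks to the portal over REST.

For on-premise Bitrix24, two more options are available:

  • Bitrix agents (\CAgent::AddAgent), for simple scenarios with few operations
  • A queue based on an infoblock or HL-block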
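For the agent-based variant on a self-hosted installation, a minimal sketch might look like the following. The agent function name and processDueIntegrationJobs() are hypothetical placeholders; only \CAgent::AddAgent is a Bitrix API, and the registration is guarded so the file can also be loaded outside the Bitrix core.

```php
<?php
// Hypothetical agent function. Bitrix evaluates the registered string on
// schedule; returning the same string keeps the agent scheduled, returning
// an empty string removes it.
function integrationRetryAgent(): string
{
    // processDueIntegrationJobs() is a placeholder for your own queue logic.
    if (function_exists('processDueIntegrationJobs')) {
        processDueIntegrationJobs(10);
    }
    return 'integrationRetryAgent();';
}

// Register once, for example from a module installer.
if (class_exists('CAgent')) {
    \CAgent::AddAgent(
        'integrationRetryAgent();', // name: the PHP code Bitrix will call
        '',                         // module id (empty for a standalone function)
        'N',                        // period: interval counted from the last actual run
        60                          // interval in seconds
    );
}
```

Agents run on hits or on cron depending on the installation's settings, so the effective interval may be longer than 60 seconds on a low-traffic site.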

Task structure in queue:

{
  "id": "uuid-v4",
  "type": "bank_payment_create",
  "payload": {
    "deal_id": 1234,
    "amount": 50000,
    "idempotency_key": "pay-uuid-v4"
  },
  "attempts": 2,
  "max_attempts": 5,
  "next_run_at": "2025-03-13T15:30:00Z",
  "status": "pending",
  "last_error": "Connection timeout"
}
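As an illustration of how a worker might read this structure, the sketch below decodes the task and decides whether it is ready to run. The isDue() helper is a hypothetical name; the field names are the ones from the task above.

```php
<?php
$json = <<<'JSON'
{
  "id": "uuid-v4",
  "type": "bank_payment_create",
  "payload": {"deal_id": 1234, "amount": 50000, "idempotency_key": "pay-uuid-v4"},
  "attempts": 2,
  "max_attempts": 5,
  "next_run_at": "2025-03-13T15:30:00Z",
  "status": "pending",
  "last_error": "Connection timeout"
}
JSON;

// A task may run when it is pending, has attempts left, and its time has come.
function isDue(array $job, DateTimeImmutable $now): bool
{
    return $job['status'] === 'pending'
        && $job['attempts'] < $job['max_attempts']
        && new DateTimeImmutable($job['next_run_at']) <= $now;
}

$job = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
```

Storing next_run_at in UTC, as in the example, avoids ambiguity when the worker and the database run in different time zones.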

Task table: integration_jobs in PostgreSQL or MySQL. A composite index on (status, next_run_at) lets the worker quickly pick the tasks that are ready for execution.
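One possible shape for that table, written as a small PHP migration helper. The column types and sizes are assumptions chosen to work on both PostgreSQL and MySQL; the column names follow the task structure above.

```php
<?php
// DDL for the task table. Types are deliberately generic.
function jobsTableSql(): string
{
    return <<<'SQL'
CREATE TABLE integration_jobs (
    id            CHAR(36)     NOT NULL PRIMARY KEY,
    type          VARCHAR(64)  NOT NULL,
    payload       TEXT         NOT NULL,
    attempts      INT          NOT NULL DEFAULT 0,
    max_attempts  INT          NOT NULL DEFAULT 5,
    next_run_at   TIMESTAMP    NOT NULL,
    status        VARCHAR(16)  NOT NULL DEFAULT 'pending',
    last_error    TEXT         NULL
)
SQL;
}

// Composite index used by the worker's "pending and due" query.
function jobsIndexSql(): string
{
    return 'CREATE INDEX idx_jobs_status_next_run ON integration_jobs (status, next_run_at)';
}

// Apply both statements on an existing PDO connection.
function createJobsTable(PDO $pdo): void
{
    $pdo->exec(jobsTableSql());
    $pdo->exec(jobsIndexSql());
}
```

Putting status first in the index matches a query that filters on an exact status and then on a range of next_run_at.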

Worker Implementation

The worker is a separate process launched by cron every minute (or run as a daemon under Supervisor). Algorithm:

// Grab a batch of due tasks (the repository query uses FOR UPDATE SKIP LOCKED)
$jobs = JobRepository::getPending(limit: 10);

foreach ($jobs as $job) {
    try {
        $job->markRunning();
        $handler = HandlerFactory::create($job->type);
        $handler->execute($job->payload);
        $job->markSuccess();
    } catch (RetryableException $e) {
        // Temporary error: retry unless the attempt limit is exhausted.
        // Assumes $job->attempts counts earlier attempts and that
        // scheduleRetry() increments it.
        if ($job->attempts + 1 >= $job->max_attempts) {
            $job->markFailed($e->getMessage());
            $this->notify($job);
            continue;
        }
        $delay = $this->calcBackoff($job->attempts); // 2^attempts * 60 seconds
        $jitter = (int) ($delay * 0.2);
        $delay += random_int(-$jitter, $jitter); // ±20% jitter
        $job->scheduleRetry($delay, $e->getMessage());
    } catch (FatalException $e) {
        // Business error: don't retry, notify
        $job->markFailed($e->getMessage());
        $this->notify($job);
    }
}

FOR UPDATE SKIP LOCKED is mandatory when several workers run in parallel (it is available in PostgreSQL 9.5+ and MySQL 8.0+). Without it, two workers might take the same task and execute it twice.
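A sketch of how a repository method might claim tasks with this lock. It assumes the integration_jobs table, a PDO connection in exception error mode, and a database that supports SKIP LOCKED; claimPendingJobs() is a hypothetical name, not the article's JobRepository.

```php
<?php
/**
 * Claim up to $limit due tasks for this worker and mark them as running.
 *
 * @return array<int, array<string, mixed>> the claimed rows
 */
function claimPendingJobs(PDO $pdo, int $limit = 10): array
{
    $pdo->beginTransaction();
    try {
        // Rows locked by another worker's transaction are skipped, not waited for.
        $select = $pdo->prepare(
            "SELECT * FROM integration_jobs
             WHERE status = 'pending' AND next_run_at <= CURRENT_TIMESTAMP
             ORDER BY next_run_at
             LIMIT :limit
             FOR UPDATE SKIP LOCKED"
        );
        $select->bindValue(':limit', $limit, PDO::PARAM_INT);
        $select->execute();
        $jobs = $select->fetchAll(PDO::FETCH_ASSOC);

        if ($jobs !== []) {
            // Flip the status inside the same transaction that holds the locks.
            $ids = array_column($jobs, 'id');
            $marks = implode(',', array_fill(0, count($ids), '?'));
            $update = $pdo->prepare(
                "UPDATE integration_jobs SET status = 'running' WHERE id IN ($marks)"
            );
            $update->execute($ids);
        }

        $pdo->commit();
        return $jobs;
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
}
```

The transaction is kept short: it only claims the rows. The handlers themselves run after the commit, so a slow external call never holds database locks.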

Exception Classification

Correctly divide errors into "retry" and "don't retry":

| Error Type | Class | Retry |
|---|---|---|
| HTTP 429 (Rate Limit) | RetryableException | Yes, long delay |
| HTTP 502 / 503 (Bad Gateway / Service Unavailable) | RetryableException | Yes |
| Network timeout | RetryableException | Yes |
| HTTP 401 (Unauthorized) | Special: refresh token, then retry | Yes, once |
| HTTP 400 (Bad Request) | FatalException | No |
| HTTP 422 (Validation Error) | FatalException | No |
| Duplicate operation (idempotency hit) | None, treated as success | No |
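The table can be turned into a small classifier. This is a sketch: the two exception classes are declared here only to make the example self-contained, classifyHttpStatus() is a hypothetical helper, and treating every unlisted status as fatal is an assumption that should be adjusted per integration.

```php
<?php
class RetryableException extends RuntimeException {}
class FatalException extends RuntimeException {}

// Map an HTTP status code to the action from the classification table.
function classifyHttpStatus(int $status): string
{
    return match (true) {
        $status >= 200 && $status < 300          => 'success',
        $status === 401                          => 'refresh_token_then_retry_once',
        in_array($status, [429, 502, 503], true) => 'retry',
        default                                  => 'fatal', // 400, 422 and anything unlisted
    };
}

// Convert the classification into the exception the worker expects.
function throwForStatus(int $status, string $body = ''): void
{
    $action = classifyHttpStatus($status);
    if ($action === 'retry') {
        throw new RetryableException("HTTP $status: $body");
    }
    if ($action === 'fatal') {
        throw new FatalException("HTTP $status: $body");
    }
    // 'success' and the 401 case are handled by the caller.
}
```

Network timeouts usually surface as transport errors rather than status codes, so the HTTP client wrapper would throw RetryableException for them directly.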

Dead Letter Queue

Tasks that exhaust their retry limit move to a Dead Letter Queue (DLQ): a separate table or queue. A DLQ isn't a trash bin, it's a list of things requiring attention. The DLQ interface should support:

  • View failed tasks with complete attempt history
  • Manual retry after fixing the error cause
  • Edit payload (if data needs correction before retry)
  • Batch retry of task groups
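Manual retry with an optional payload correction could be sketched as below. The task is represented as a plain array with the fields from the task structure; requeueFromDlq() and the history key are hypothetical names used for illustration.

```php
<?php
// Return a copy of a dead task that is ready to run again.
// $payloadPatch holds corrected payload fields, if any.
function requeueFromDlq(array $job, DateTimeImmutable $now, array $payloadPatch = []): array
{
    // Keep the previous outcome so the attempt history stays complete.
    $job['history'][] = [
        'attempts' => $job['attempts'],
        'error'    => $job['last_error'] ?? null,
    ];
    // Corrected fields override the old ones; everything else, including
    // the idempotency key, is preserved so the retry stays safe.
    $job['payload'] = array_replace($job['payload'], $payloadPatch);
    $job['attempts'] = 0;
    $job['status'] = 'pending';
    $job['last_error'] = null;
    $job['next_run_at'] = $now->format(DATE_ATOM);
    return $job;
}

// Batch retry is the same operation applied to a group of tasks.
function requeueBatch(array $jobs, DateTimeImmutable $now): array
{
    return array_map(fn (array $job) => requeueFromDlq($job, $now), $jobs);
}
```

Keeping the original idempotency key matters here: if one of the earlier attempts actually reached the external system, the manual retry is recognized as a duplicate instead of creating a second operation.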

Bitrix24 Integration

On a permanent error, or when the retry limit is exceeded, notify the responsible person in Bitrix24:

// The im module must be loaded before CIMNotify is available
if (\Bitrix\Main\Loader::includeModule('im')) {
    \CIMNotify::Add([
        'MESSAGE_TYPE' => IM_MESSAGE_SYSTEM,
        'TO_USER_ID' => $responsibleUserId,
        'MESSAGE' => "Integration: operation #{$job->id} failed after {$job->attempts} attempts. " .
                     "Error: {$job->last_error}. Manual intervention required.",
    ]);
}

Or use the REST method im.notify.system.add if the notification is sent from an external service.
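For the external-service case, the request could be prepared roughly as follows. This is a sketch under assumptions: the base REST URL is a placeholder, the USER_ID and MESSAGE parameter names reflect my reading of the REST documentation, and whether the method is callable with your type of authorization (webhook or application) should be checked against that documentation. Both function names are hypothetical.

```php
<?php
// Build the URL and form body for an im.notify.system.add call.
// $restBaseUrl is the portal's REST endpoint, including any auth part.
function buildNotifyRequest(string $restBaseUrl, int $userId, string $message): array
{
    return [
        'url'  => rtrim($restBaseUrl, '/') . '/im.notify.system.add.json',
        'body' => http_build_query(['USER_ID' => $userId, 'MESSAGE' => $message]),
    ];
}

// Send the prepared request; returns the decoded response or null on failure.
function sendNotifyRequest(array $request): ?array
{
    $context = stream_context_create(['http' => [
        'method'        => 'POST',
        'header'        => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content'       => $request['body'],
        'timeout'       => 10,
        'ignore_errors' => true, // read the body even on 4xx/5xx responses
    ]]);
    $raw = @file_get_contents($request['url'], false, $context);
    if ($raw === false) {
        return null; // network failure: the caller may treat it as retryable
    }
    $decoded = json_decode($raw, true);
    return is_array($decoded) ? $decoded : null;
}
```

The notification call is itself an integration that can fail, so it is reasonable to send it through the same queue rather than inline from the worker.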

Queue Monitoring

| Metric | What It Shows |
|---|---|
| pending_jobs_count | Current load: number of tasks awaiting execution |
| failed_jobs_count | Accumulated error debt |
| avg_retry_count | Average number of attempts until success |
| p99_execution_time | Worker performance |
| dlq_size_delta | DLQ growth or shrinkage |
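Most of these metrics can be derived from the task rows themselves. The sketch below assumes each task is an array with status and attempts fields plus a hypothetical duration_ms field for finished tasks; dlq_size_delta is omitted because it needs two snapshots taken at different times.

```php
<?php
// Nearest-rank percentile; $p is an integer from 1 to 100.
function percentile(array $values, int $p): float
{
    if ($values === []) {
        return 0.0;
    }
    sort($values);
    $rank = intdiv($p * count($values) + 99, 100); // ceil(p * n / 100) in integers
    return (float) $values[max(0, $rank - 1)];
}

// Compute queue metrics from a list of task rows.
function queueMetrics(array $jobs): array
{
    $pending = 0;
    $failed = 0;
    $attempts = [];
    $durations = [];
    foreach ($jobs as $job) {
        if ($job['status'] === 'pending') {
            $pending++;
        } elseif ($job['status'] === 'failed') {
            $failed++;
        } elseif ($job['status'] === 'success') {
            $attempts[] = $job['attempts'];
            $durations[] = $job['duration_ms'];
        }
    }
    return [
        'pending_jobs_count' => $pending,
        'failed_jobs_count'  => $failed,
        'avg_retry_count'    => $attempts === [] ? 0.0 : array_sum($attempts) / count($attempts),
        'p99_execution_time' => percentile($durations, 99),
    ];
}
```

In production these values would more likely come from aggregate SQL queries or a metrics exporter than from loading every row into PHP.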

Development Stages

| Stage | Content | Timeline |
|---|---|---|
| Design | Data schema, error classification, backoff strategy | 2–3 days |
| Task table and repository | CRUD, locks, indexes | 2–3 days |
| Worker | Core logic, exception handling | 3–5 days |
| DLQ and interface | Viewing, manual retry | 3–5 days |
| Notifications | Bitrix24 IM integration | 1–2 days |
| Monitoring | Metrics, dashboard | 2–3 days |

A retry system is a mandatory component of any production integration. Without it, every external service failure turns into lost operations and manual recovery work.