Building Resilient Node.js Applications: Circuit Breakers, Retries & Bulkheads

2025-09-20

In production environments, failure is inevitable. Networks partition, databases time out, and external services become unavailable. The difference between a robust system and a fragile one lies in how gracefully it handles these failures.

This guide covers three essential resilience patterns:

  1. Circuit Breakers - Prevent cascading failures
  2. Retry Mechanisms - Handle transient failures intelligently
  3. Bulkheads - Isolate resources to contain failures

Why Resilience Matters

Consider this scenario: Your e-commerce API depends on a payment service. When the payment service becomes slow, your entire application starts timing out, affecting even non-payment operations like browsing products.

Without resilience patterns:

  • One slow dependency brings down the entire system
  • Resources get exhausted waiting for failed operations
  • Users experience complete service outages

With resilience patterns:

  • Failures are contained and isolated
  • Graceful degradation maintains core functionality
  • System remains responsive under partial failures

1. Circuit Breaker Pattern

The circuit breaker acts like an electrical breaker - it "trips" when failures reach a threshold, preventing further calls to a failing service.

States of a Circuit Breaker

enum CircuitState {
  CLOSED = 'CLOSED',     // Normal operation
  OPEN = 'OPEN',         // Blocking calls
  HALF_OPEN = 'HALF_OPEN' // Testing recovery
}

Implementation

interface CircuitBreakerOptions {
  failureThreshold: number;   // consecutive failures before the circuit opens
  recoveryTimeout: number;    // ms to wait in OPEN before probing with a HALF_OPEN call
  monitoringPeriod: number;   // ms window for failure tracking
  expectedErrors?: Array<new (...args: any[]) => Error>; // error types that should not trip the breaker
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;
  private lastFailureTime?: number;
  private successCount = 0;

  constructor(private options: CircuitBreakerOptions) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error);
      throw error;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= 3) { // Require 3 successes to close
        this.state = CircuitState.CLOSED;
        this.successCount = 0;
      }
    }
  }

  private onFailure(error: Error): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();

    if (this.state === CircuitState.HALF_OPEN) {
      this.state = CircuitState.OPEN;
      this.successCount = 0;
    }

    if (this.failureCount >= this.options.failureThreshold) {
      this.state = CircuitState.OPEN;
    }
  }

  private shouldAttemptReset(): boolean {
    return (
      this.lastFailureTime !== undefined &&
      Date.now() - this.lastFailureTime >= this.options.recoveryTimeout
    );
  }

  getState(): CircuitState {
    return this.state;
  }

  getMetrics() {
    return {
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
    };
  }
}

Using Circuit Breaker with HTTP Calls

import axios from 'axios';

class PaymentService {
  private circuitBreaker: CircuitBreaker;
  private paymentQueue: any; // job queue (e.g. BullMQ), assumed to be injected or created elsewhere

  constructor() {
    this.circuitBreaker = new CircuitBreaker({
      failureThreshold: 5,
      recoveryTimeout: 30000, // 30 seconds
      monitoringPeriod: 10000, // 10 seconds
    });
  }

  async processPayment(paymentData: PaymentRequest): Promise<PaymentResponse> {
    try {
      return await this.circuitBreaker.execute(async () => {
        const response = await axios.post('/api/payment', paymentData, {
          timeout: 5000,
        });
        return response.data;
      });
    } catch (error) {
      // Fallback: Store payment for later processing
      await this.storeForRetry(paymentData);
      throw new PaymentUnavailableError('Payment service temporarily unavailable');
    }
  }

  private async storeForRetry(paymentData: PaymentRequest): Promise<void> {
    // Store in queue for background processing
    await this.paymentQueue.add('process-payment', paymentData);
  }
}
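
The fallback above only parks failed payments; something still has to drain that queue. A minimal sketch of a background consumer, assuming paymentQueue is a BullMQ queue named 'payments' (the queue name, Redis connection, and concurrency are illustrative, not part of the original code):

import { Worker } from 'bullmq'; // BullMQ's job consumer, not the worker_threads Worker used later

const deferredPayments = new Worker(
  'payments',
  async (job) => {
    if (job.name !== 'process-payment') return;

    // Re-attempt the payment now that the outage has (hopefully) passed.
    // If the service is still down, processPayment re-queues it via storeForRetry.
    const payments = new PaymentService();
    await payments.processPayment(job.data);
  },
  {
    connection: { host: 'localhost', port: 6379 },
    concurrency: 2, // keep background retries from competing with live traffic
  }
);

deferredPayments.on('failed', (job, err) => {
  console.error(`Deferred payment job ${job?.id} failed: ${err.message}`);
});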

2. Intelligent Retry Mechanisms

Not all failures should trigger retries. Implement smart retry logic that distinguishes between transient and permanent failures.

Exponential Backoff with Jitter

interface RetryOptions {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
  jitter: boolean;
  retryableErrors?: Array<new (...args: any[]) => Error>;
}

class RetryManager {
  constructor(private options: RetryOptions) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    let lastError: Error | undefined;
    
    for (let attempt = 1; attempt <= this.options.maxAttempts; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error as Error;
        
        if (!this.shouldRetry(error as Error, attempt)) {
          throw error;
        }

        if (attempt < this.options.maxAttempts) {
          const delay = this.calculateDelay(attempt);
          await this.sleep(delay);
        }
      }
    }
    
    throw lastError!;
  }

  private shouldRetry(error: Error, attempt: number): boolean {
    if (attempt >= this.options.maxAttempts) {
      return false;
    }

    // Don't retry client errors (4xx)
    if (error.message.includes('400') || error.message.includes('404')) {
      return false;
    }

    // Retry on network errors, timeouts, 5xx errors
    return (
      error.message.includes('timeout') ||
      error.message.includes('ECONNRESET') ||
      error.message.includes('500') ||
      error.message.includes('502') ||
      error.message.includes('503')
    );
  }

  private calculateDelay(attempt: number): number {
    let delay = this.options.baseDelay * Math.pow(this.options.backoffMultiplier, attempt - 1);
    delay = Math.min(delay, this.options.maxDelay);

    if (this.options.jitter) {
      // Add random jitter to prevent thundering herd
      delay = delay * (0.5 + Math.random() * 0.5);
    }

    return delay;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
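
For intuition about the backoff curve: with baseDelay: 1000, backoffMultiplier: 2, and jitter enabled, the wait after the first failure lands between 500 and 1000 ms, after the second between 1000 and 2000 ms, and so on, capped at maxDelay. A quick sketch wrapping an HTTP call (the URL is illustrative):

const retry = new RetryManager({
  maxAttempts: 3,
  baseDelay: 1000,      // ~1s before the second attempt
  maxDelay: 10000,
  backoffMultiplier: 2, // ~2s before the third attempt, capped at maxDelay
  jitter: true,         // each delay lands between 50% and 100% of the value above
});

// Retries timeouts, network errors, and 5xx responses; gives up immediately on 4xx (see shouldRetry)
const stock = await retry.execute(async () => {
  const response = await fetch('https://inventory.example.com/api/stock/42');
  if (!response.ok) {
    throw new Error(`Inventory service error: ${response.status}`);
  }
  return response.json();
});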

Database Retry Example

class DatabaseService {
  private retryManager: RetryManager;

  constructor() {
    this.retryManager = new RetryManager({
      maxAttempts: 3,
      baseDelay: 1000,
      maxDelay: 10000,
      backoffMultiplier: 2,
      jitter: true,
    });
  }

  async findUser(id: string): Promise<User | null> {
    return this.retryManager.execute(async () => {
      try {
        return await this.db.user.findUnique({ where: { id } });
      } catch (error) {
        // Log the attempt
        console.log(`Database query failed, will retry: ${(error as Error).message}`);
        throw error;
      }
    });
  }

  async createUser(userData: CreateUserRequest): Promise<User> {
    return this.retryManager.execute(async () => {
      return await this.db.user.create({ data: userData });
    });
  }
}

3. Bulkhead Pattern

Bulkheads isolate resources to prevent cascading failures. Like compartments in a ship, if one fails, others remain functional.

Thread Pool Isolation

import { Worker } from 'worker_threads';

class BulkheadExecutor {
  private pools: Map<string, WorkerPool> = new Map();

  constructor() {
    // Create isolated pools for different operations
    this.pools.set('cpu-intensive', new WorkerPool(2, './workers/cpu-worker.js'));
    this.pools.set('io-operations', new WorkerPool(5, './workers/io-worker.js'));
    this.pools.set('external-api', new WorkerPool(3, './workers/api-worker.js'));
  }

  async execute<T>(poolName: string, task: any): Promise<T> {
    const pool = this.pools.get(poolName);
    if (!pool) {
      throw new Error(`Pool ${poolName} not found`);
    }

    return pool.execute<T>(task);
  }
}

class WorkerPool {
  private workers: Worker[] = [];
  private idleWorkers: Worker[] = [];
  private queue: Array<{ task: any; resolve: Function; reject: Function }> = [];

  constructor(private size: number, private workerScript: string) {
    this.initializePool();
  }

  private initializePool(): void {
    for (let i = 0; i < this.size; i++) {
      const worker = new Worker(this.workerScript);
      this.workers.push(worker);
      this.idleWorkers.push(worker);
    }
  }

  async execute<T>(task: any): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push({ task, resolve, reject });
      this.processQueue();
    });
  }

  private processQueue(): void {
    if (this.queue.length === 0 || this.idleWorkers.length === 0) {
      return;
    }

    const { task, resolve, reject } = this.queue.shift()!;
    const worker = this.idleWorkers.pop()!;

    // Return the worker to the idle set and pick up the next queued task
    const release = () => {
      clearTimeout(timeout);
      worker.removeAllListeners('message');
      worker.removeAllListeners('error');
      this.idleWorkers.push(worker);
      this.processQueue();
    };

    // Fail the task if the worker takes too long, but still free the slot
    const timeout = setTimeout(() => {
      release();
      reject(new Error('Worker timeout'));
    }, 30000);

    worker.once('message', (result) => {
      release();
      resolve(result);
    });

    worker.once('error', (error) => {
      console.error('Worker error:', error);
      release();
      reject(error);
    });

    worker.postMessage(task);
  }
}
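
The pool only dispatches tasks; each workerScript is a separate file. A minimal sketch of what ./workers/cpu-worker.js might look like (the validate-order handling is an assumption, chosen to match the order example later in this guide):

// ./workers/cpu-worker.js (illustrative sketch)
import { parentPort } from 'worker_threads';

parentPort?.on('message', (task) => {
  let result;

  switch (task.type) {
    case 'validate-order':
      // Stand-in for the CPU-heavy validation offloaded by the main thread
      result = { valid: Boolean(task.data?.userId && task.data?.amount > 0) };
      break;
    default:
      result = { error: `Unknown task type: ${task.type}` };
  }

  // Hand the result back to the WorkerPool's 'message' listener
  parentPort?.postMessage(result);
});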

Connection Pool Isolation

import { Pool } from 'pg';

class DatabaseBulkhead {
  private readPool: Pool;
  private writePool: Pool;
  private analyticsPool: Pool;

  constructor() {
    // Separate connection pools for different operations
    this.readPool = new Pool({
      connectionString: process.env.DATABASE_URL,
      max: 10, // Max 10 connections for reads
      idleTimeoutMillis: 30000,
    });

    this.writePool = new Pool({
      connectionString: process.env.DATABASE_URL,
      max: 5, // Max 5 connections for writes
      idleTimeoutMillis: 30000,
    });

    this.analyticsPool = new Pool({
      connectionString: process.env.ANALYTICS_DATABASE_URL,
      max: 3, // Separate pool for heavy analytics queries
      idleTimeoutMillis: 30000,
    });
  }

  async executeRead<T = any>(query: string, params?: any[]): Promise<T[]> {
    const client = await this.readPool.connect();
    try {
      const result = await client.query(query, params);
      return result.rows;
    } finally {
      client.release();
    }
  }

  async executeWrite<T = any>(query: string, params?: any[]): Promise<T[]> {
    const client = await this.writePool.connect();
    try {
      const result = await client.query(query, params);
      return result.rows;
    } finally {
      client.release();
    }
  }

  async executeAnalytics<T = any>(query: string, params?: any[]): Promise<T[]> {
    const client = await this.analyticsPool.connect();
    try {
      const result = await client.query(query, params);
      return result.rows;
    } finally {
      client.release();
    }
  }
}
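
Routing queries through the right pool is then a one-liner, and a slow analytics query can only exhaust its own three connections without starving reads or writes. A short usage sketch (table and column names are illustrative):

const db = new DatabaseBulkhead();

// Served from the 10-connection read pool
const products = await db.executeRead<{ id: string; name: string }>(
  'SELECT id, name FROM products WHERE category = $1',
  ['electronics']
);

// Heavy aggregation goes to the isolated analytics pool
const revenue = await db.executeAnalytics<{ month: Date; total: string }>(
  "SELECT date_trunc('month', created_at) AS month, SUM(amount) AS total FROM orders GROUP BY 1"
);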

4. Combining All Patterns

Here's how to combine circuit breakers, retries, and bulkheads for maximum resilience:

class ResilientService {
  private circuitBreaker: CircuitBreaker;
  private retryManager: RetryManager;
  private bulkhead: BulkheadExecutor;
  private dbBulkhead: DatabaseBulkhead;

  constructor() {
    this.circuitBreaker = new CircuitBreaker({
      failureThreshold: 5,
      recoveryTimeout: 30000,
      monitoringPeriod: 10000,
    });

    this.retryManager = new RetryManager({
      maxAttempts: 3,
      baseDelay: 1000,
      maxDelay: 10000,
      backoffMultiplier: 2,
      jitter: true,
    });

    this.bulkhead = new BulkheadExecutor();
    this.dbBulkhead = new DatabaseBulkhead();
  }

  async processOrder(orderData: OrderRequest): Promise<OrderResponse> {
    // Use bulkhead for CPU-intensive validation
    const validationResult = await this.bulkhead.execute<{ valid: boolean }>('cpu-intensive', {
      type: 'validate-order',
      data: orderData,
    });

    if (!validationResult.valid) {
      throw new Error('Invalid order data');
    }

    // Use circuit breaker + retry for external payment
    const paymentResult = await this.retryManager.execute(async () => {
      return this.circuitBreaker.execute(async () => {
        return this.callPaymentService(orderData.payment);
      });
    });

    // Use database bulkhead for order creation
    const order = await this.dbBulkhead.executeWrite(
      'INSERT INTO orders (user_id, amount, status) VALUES ($1, $2, $3) RETURNING *',
      [orderData.userId, orderData.amount, 'pending']
    );

    return {
      orderId: order[0].id,
      status: 'created',
      paymentId: paymentResult.id,
    };
  }

  private async callPaymentService(paymentData: any): Promise<any> {
    // External API call with timeout
    const response = await fetch('/api/external/payment', {
      method: 'POST',
      body: JSON.stringify(paymentData),
      headers: { 'Content-Type': 'application/json' },
      signal: AbortSignal.timeout(5000), // 5 second timeout
    });

    if (!response.ok) {
      throw new Error(`Payment service error: ${response.status}`);
    }

    return response.json();
  }
}
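
The caller decides how to degrade when even these combined defenses give up. A small sketch of graceful degradation at the call site (the fallback response shape is an assumption):

const service = new ResilientService();

async function handleCheckout(orderData: OrderRequest) {
  try {
    return await service.processOrder(orderData);
  } catch (error) {
    // Graceful degradation: don't fail the user outright; acknowledge the
    // order and let a background job complete it once dependencies recover
    console.error('Order processing degraded:', (error as Error).message);
    return { status: 'accepted-pending-processing' };
  }
}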

5. Monitoring and Observability

Track the health of your resilience patterns:

class ResilienceMetrics {
  private metrics = {
    circuitBreakerStates: new Map<string, CircuitState>(),
    retryAttempts: new Map<string, number>(),
    bulkheadUtilization: new Map<string, number>(),
  };

  recordCircuitBreakerState(name: string, state: CircuitState): void {
    this.metrics.circuitBreakerStates.set(name, state);
    
    // Send to monitoring system (Prometheus, DataDog, etc.)
    console.log(`Circuit breaker ${name} state: ${state}`);
  }

  recordRetryAttempt(operation: string, attempt: number): void {
    this.metrics.retryAttempts.set(operation, attempt);
    
    if (attempt > 1) {
      console.log(`Retry attempt ${attempt} for ${operation}`);
    }
  }

  recordBulkheadUtilization(pool: string, utilization: number): void {
    this.metrics.bulkheadUtilization.set(pool, utilization);
    
    if (utilization > 0.8) {
      console.warn(`High bulkhead utilization: ${pool} at ${utilization * 100}%`);
    }
  }

  getHealthCheck(): HealthStatus {
    const openCircuits = Array.from(this.metrics.circuitBreakerStates.entries())
      .filter(([, state]) => state === CircuitState.OPEN);

    const highRetryOperations = Array.from(this.metrics.retryAttempts.entries())
      .filter(([, attempts]) => attempts > 2);

    const overloadedPools = Array.from(this.metrics.bulkheadUtilization.entries())
      .filter(([, utilization]) => utilization > 0.9);

    return {
      healthy: openCircuits.length === 0 && overloadedPools.length === 0,
      openCircuits: openCircuits.map(([name]) => name),
      highRetryOperations: highRetryOperations.map(([name]) => name),
      overloadedPools: overloadedPools.map(([name]) => name),
    };
  }
}

interface HealthStatus {
  healthy: boolean;
  openCircuits: string[];
  highRetryOperations: string[];
  overloadedPools: string[];
}
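
Exposing this as an HTTP endpoint gives load balancers - and the Docker HEALTHCHECK in section 7 - something to poll. A minimal sketch using Node's built-in http module (port and route mirror the Dockerfile below):

import http from 'http';

const metrics = new ResilienceMetrics();

http
  .createServer((req, res) => {
    if (req.url === '/health') {
      const status = metrics.getHealthCheck();
      // 200 while all circuits are closed and pools have headroom,
      // 503 so the orchestrator can take the instance out of rotation
      res.writeHead(status.healthy ? 200 : 503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(status));
      return;
    }
    res.writeHead(404);
    res.end();
  })
  .listen(3000);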

6. Best Practices

Configuration Management

interface ResilienceConfig {
  circuitBreaker: {
    failureThreshold: number;
    recoveryTimeout: number;
  };
  retry: {
    maxAttempts: number;
    baseDelay: number;
    maxDelay: number;
  };
  bulkhead: {
    pools: Record<string, { size: number; timeout: number }>;
  };
}

// Environment-specific configurations
const configs: Record<string, ResilienceConfig> = {
  development: {
    circuitBreaker: { failureThreshold: 3, recoveryTimeout: 10000 },
    retry: { maxAttempts: 2, baseDelay: 500, maxDelay: 5000 },
    bulkhead: {
      pools: {
        'cpu-intensive': { size: 1, timeout: 10000 },
        'io-operations': { size: 2, timeout: 5000 },
      },
    },
  },
  production: {
    circuitBreaker: { failureThreshold: 5, recoveryTimeout: 30000 },
    retry: { maxAttempts: 3, baseDelay: 1000, maxDelay: 10000 },
    bulkhead: {
      pools: {
        'cpu-intensive': { size: 4, timeout: 30000 },
        'io-operations': { size: 10, timeout: 15000 },
      },
    },
  },
};
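
Selecting the right block at startup can be as simple as keying off NODE_ENV, with a conservative fallback; a small sketch:

const env = process.env.NODE_ENV ?? 'development';
const resilienceConfig: ResilienceConfig = configs[env] ?? configs.development;

// CircuitBreakerOptions also needs a monitoringPeriod, which ResilienceConfig
// doesn't carry, so it is supplied here
const paymentBreaker = new CircuitBreaker({
  ...resilienceConfig.circuitBreaker,
  monitoringPeriod: 10000,
});

const retryManager = new RetryManager({
  ...resilienceConfig.retry,
  backoffMultiplier: 2,
  jitter: true,
});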

Testing Resilience Patterns

describe('Circuit Breaker', () => {
  let circuitBreaker: CircuitBreaker;
  let mockService: jest.Mock;

  beforeEach(() => {
    circuitBreaker = new CircuitBreaker({
      failureThreshold: 2,
      recoveryTimeout: 1000,
      monitoringPeriod: 500,
    });
    mockService = jest.fn();
  });

  it('should open circuit after threshold failures', async () => {
    mockService.mockRejectedValue(new Error('Service error'));

    // First failure
    await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.CLOSED);

    // Second failure - should open circuit
    await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);

    // Third call should be blocked
    await expect(circuitBreaker.execute(mockService)).rejects.toThrow('Circuit breaker is OPEN');
    expect(mockService).toHaveBeenCalledTimes(2);
  });

  it('should transition to half-open after recovery timeout', async () => {
    // Open the circuit
    mockService.mockRejectedValue(new Error('Service error'));
    await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
    await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
    
    expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);

    // Wait for recovery timeout
    await new Promise(resolve => setTimeout(resolve, 1100));

    // Next call should transition to half-open
    mockService.mockResolvedValue('success');
    await circuitBreaker.execute(mockService);
    
    expect(circuitBreaker.getState()).toBe(CircuitState.HALF_OPEN);
  });
});
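
The same approach works for the retry logic; a sketch asserting that a transient failure is retried and a client error is not (same jest setup assumed):

describe('RetryManager', () => {
  const retryManager = new RetryManager({
    maxAttempts: 3,
    baseDelay: 10, // keep test runs fast
    maxDelay: 50,
    backoffMultiplier: 2,
    jitter: false,
  });

  it('should retry transient failures until success', async () => {
    const operation = jest.fn()
      .mockRejectedValueOnce(new Error('503 Service Unavailable'))
      .mockResolvedValue('ok');

    await expect(retryManager.execute(operation)).resolves.toBe('ok');
    expect(operation).toHaveBeenCalledTimes(2);
  });

  it('should not retry client errors', async () => {
    const operation = jest.fn().mockRejectedValue(new Error('404 Not Found'));

    await expect(retryManager.execute(operation)).rejects.toThrow('404');
    expect(operation).toHaveBeenCalledTimes(1);
  });
});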

7. Production Deployment Tips

Environment Variables

# Circuit Breaker Settings
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30000

# Retry Settings
RETRY_MAX_ATTEMPTS=3
RETRY_BASE_DELAY=1000
RETRY_MAX_DELAY=10000

# Bulkhead Settings
BULKHEAD_CPU_POOL_SIZE=4
BULKHEAD_IO_POOL_SIZE=10
BULKHEAD_API_POOL_SIZE=5

# Database Pool Settings
DB_READ_POOL_SIZE=10
DB_WRITE_POOL_SIZE=5
DB_ANALYTICS_POOL_SIZE=3
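
These values can be folded into the typed options at startup; a minimal sketch (the envInt helper is illustrative):

// Parse an integer env var, falling back when unset or malformed
function envInt(name: string, fallback: number): number {
  const parsed = Number.parseInt(process.env[name] ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

const circuitBreakerOptions: CircuitBreakerOptions = {
  failureThreshold: envInt('CIRCUIT_BREAKER_FAILURE_THRESHOLD', 5),
  recoveryTimeout: envInt('CIRCUIT_BREAKER_RECOVERY_TIMEOUT', 30000),
  monitoringPeriod: 10000,
};

const retryOptions: RetryOptions = {
  maxAttempts: envInt('RETRY_MAX_ATTEMPTS', 3),
  baseDelay: envInt('RETRY_BASE_DELAY', 1000),
  maxDelay: envInt('RETRY_MAX_DELAY', 10000),
  backoffMultiplier: 2,
  jitter: true,
};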

Docker Configuration

FROM node:20-alpine

WORKDIR /app

# Install dependencies
COPY package*.json ./
RUN npm ci --omit=dev

# Copy application code
COPY . .

# Set resource limits for resilience
ENV NODE_OPTIONS="--max-old-space-size=1024"

# Health check endpoint (node:alpine ships busybox wget; curl is not installed by default)
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1

EXPOSE 3000
CMD ["node", "dist/index.js"]

Summary

Building resilient Node.js applications requires implementing multiple defense layers:

  • Circuit Breakers prevent cascading failures by stopping calls to failing services
  • Retry Mechanisms handle transient failures with intelligent backoff strategies
  • Bulkheads isolate resources to contain failures and prevent resource exhaustion

Key takeaways:

  • Combine patterns - Use circuit breakers with retries for external services
  • Monitor everything - Track metrics to identify issues before they become outages
  • Test failure scenarios - Regularly test your resilience patterns with chaos engineering
  • Configure per environment - Different environments need different thresholds
  • Graceful degradation - Always have fallback strategies when services fail

Remember: Resilience is not about preventing failures—it's about failing gracefully and recovering quickly.

Happy building! 🚀
