Building Resilient Node.js Applications: Circuit Breakers, Retries & Bulkheads
2025-09-20
In production environments, failure is inevitable. Networks partition, databases time out, and external services become unavailable. The difference between a robust system and a fragile one lies in how gracefully it handles these failures.
This guide covers three essential resilience patterns:
- Circuit Breakers - Prevent cascading failures
- Retry Mechanisms - Handle transient failures intelligently
- Bulkheads - Isolate resources to contain failures
Why Resilience Matters
Consider this scenario: Your e-commerce API depends on a payment service. When the payment service becomes slow, your entire application starts timing out, affecting even non-payment operations like browsing products.
Without resilience patterns:
- One slow dependency brings down the entire system
- Resources get exhausted waiting for failed operations
- Users experience complete service outages
With resilience patterns:
- Failures are contained and isolated
- Graceful degradation maintains core functionality
- System remains responsive under partial failures
1. Circuit Breaker Pattern
The circuit breaker acts like an electrical circuit breaker - it "trips" when failures reach a threshold, preventing further calls to a failing service. It moves between three states: CLOSED goes to OPEN after enough consecutive failures, OPEN goes to HALF_OPEN once a recovery timeout has elapsed, and HALF_OPEN returns to CLOSED after a few successful probe calls (or back to OPEN on the first failure).
States of a Circuit Breaker
enum CircuitState {
CLOSED = 'CLOSED', // Normal operation
OPEN = 'OPEN', // Blocking calls
HALF_OPEN = 'HALF_OPEN' // Testing recovery
}
Implementation
interface CircuitBreakerOptions {
failureThreshold: number;
recoveryTimeout: number;
monitoringPeriod: number; // reserved for rolling-window failure tracking (unused in this simplified version)
expectedErrors?: Array<new (...args: any[]) => Error>; // errors that should not trip the breaker (unused below)
}
class CircuitBreaker {
private state: CircuitState = CircuitState.CLOSED;
private failureCount = 0;
private lastFailureTime?: number;
private successCount = 0;
constructor(private options: CircuitBreakerOptions) {}
async execute<T>(operation: () => Promise<T>): Promise<T> {
if (this.state === CircuitState.OPEN) {
if (this.shouldAttemptReset()) {
this.state = CircuitState.HALF_OPEN;
} else {
throw new Error('Circuit breaker is OPEN');
}
}
try {
const result = await operation();
this.onSuccess();
return result;
} catch (error) {
this.onFailure(error as Error);
throw error;
}
}
private onSuccess(): void {
this.failureCount = 0;
if (this.state === CircuitState.HALF_OPEN) {
this.successCount++;
if (this.successCount >= 3) { // Require 3 successes to close
this.state = CircuitState.CLOSED;
this.successCount = 0;
}
}
}
private onFailure(error: Error): void {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.state === CircuitState.HALF_OPEN) {
this.state = CircuitState.OPEN;
this.successCount = 0;
}
if (this.failureCount >= this.options.failureThreshold) {
this.state = CircuitState.OPEN;
}
}
private shouldAttemptReset(): boolean {
return (
this.lastFailureTime !== undefined &&
Date.now() - this.lastFailureTime >= this.options.recoveryTimeout
);
}
getState(): CircuitState {
return this.state;
}
getMetrics() {
return {
state: this.state,
failureCount: this.failureCount,
successCount: this.successCount,
};
}
}
Using Circuit Breaker with HTTP Calls
import axios from 'axios';
class PaymentService {
private circuitBreaker: CircuitBreaker;
constructor() {
this.circuitBreaker = new CircuitBreaker({
failureThreshold: 5,
recoveryTimeout: 30000, // 30 seconds
monitoringPeriod: 10000, // 10 seconds
});
}
async processPayment(paymentData: PaymentRequest): Promise<PaymentResponse> {
try {
return await this.circuitBreaker.execute(async () => {
const response = await axios.post('/api/payment', paymentData, {
timeout: 5000,
});
return response.data;
});
} catch (error) {
// Fallback: Store payment for later processing
await this.storeForRetry(paymentData);
throw new PaymentUnavailableError('Payment service temporarily unavailable');
}
}
private async storeForRetry(paymentData: PaymentRequest): Promise<void> {
// Store in a queue for background processing; `paymentQueue` is assumed to be
// a job queue (e.g. BullMQ) injected elsewhere in this class
await this.paymentQueue.add('process-payment', paymentData);
}
}
2. Intelligent Retry Mechanisms
Not all failures should trigger retries. Implement smart retry logic that distinguishes between transient and permanent failures.
Exponential Backoff with Jitter
interface RetryOptions {
maxAttempts: number;
baseDelay: number;
maxDelay: number;
backoffMultiplier: number;
jitter: boolean;
retryableErrors?: Array<new (...args: any[]) => Error>;
}
class RetryManager {
constructor(private options: RetryOptions) {}
async execute<T>(operation: () => Promise<T>): Promise<T> {
let lastError: Error | undefined;
for (let attempt = 1; attempt <= this.options.maxAttempts; attempt++) {
try {
return await operation();
} catch (error) {
lastError = error as Error;
if (!this.shouldRetry(error as Error, attempt)) {
throw error;
}
if (attempt < this.options.maxAttempts) {
const delay = this.calculateDelay(attempt);
await this.sleep(delay);
}
}
}
throw lastError!;
}
private shouldRetry(error: Error, attempt: number): boolean {
if (attempt >= this.options.maxAttempts) {
return false;
}
// Don't retry client errors (4xx) - the request itself is at fault
if (error.message.includes('400') || error.message.includes('404')) {
return false;
}
// Retry on network errors, timeouts, and 5xx responses.
// (Matching on message strings keeps the example short; real code should
// inspect error codes or the HTTP status attached to the error instead.)
return (
error.message.includes('timeout') ||
error.message.includes('ECONNRESET') ||
error.message.includes('500') ||
error.message.includes('502') ||
error.message.includes('503')
);
}
private calculateDelay(attempt: number): number {
let delay = this.options.baseDelay * Math.pow(this.options.backoffMultiplier, attempt - 1);
delay = Math.min(delay, this.options.maxDelay);
if (this.options.jitter) {
// Add random jitter to prevent thundering herd
delay = delay * (0.5 + Math.random() * 0.5);
}
return delay;
}
private sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
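To make the backoff schedule concrete, here is a small hypothetical usage sketch (the URL and values are illustrative, and the call is assumed to run inside an async function). With a base delay of 1000 ms and a multiplier of 2, the waits before the second and third attempts are roughly 1000 ms and 2000 ms, capped at maxDelay and then scaled into the 50-100% range by jitter:
// Hypothetical usage of the RetryManager defined above
const retry = new RetryManager({
  maxAttempts: 3,
  baseDelay: 1000, // wait ~1000 ms after the first failure (x0.5-1.0 with jitter)
  maxDelay: 10000, // wait ~2000 ms after the second failure, never more than 10 s
  backoffMultiplier: 2,
  jitter: true,
});

const stock = await retry.execute(async () => {
  const res = await fetch('https://inventory.internal/api/stock/42'); // placeholder URL
  if (!res.ok) throw new Error(`Inventory service returned ${res.status}`);
  return res.json();
});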
Database Retry Example
class DatabaseService {
private retryManager: RetryManager;
constructor() {
this.retryManager = new RetryManager({
maxAttempts: 3,
baseDelay: 1000,
maxDelay: 10000,
backoffMultiplier: 2,
jitter: true,
});
}
async findUser(id: string): Promise<User | null> {
return this.retryManager.execute(async () => {
try {
// `db` is assumed to be an injected Prisma client (not shown here)
return await this.db.user.findUnique({ where: { id } });
} catch (error) {
// Log the failed attempt; the RetryManager decides whether to try again
console.warn(`Database query failed, may retry: ${(error as Error).message}`);
throw error;
}
});
}
async createUser(userData: CreateUserRequest): Promise<User> {
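// Caution: only wrap idempotent operations in retries. If the first INSERT
// actually committed before the error surfaced, a retry can create duplicates;
// guard writes with unique constraints or an idempotency key.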
return this.retryManager.execute(async () => {
return await this.db.user.create({ data: userData });
});
}
}
3. Bulkhead Pattern
Bulkheads isolate resources so that a failure in one area cannot drag down the rest of the system. Like the watertight compartments in a ship's hull, if one compartment floods, the others stay sealed and the ship keeps floating.
Thread Pool Isolation
import { Worker } from 'worker_threads';
class BulkheadExecutor {
private pools: Map<string, WorkerPool> = new Map();
constructor() {
// Create isolated pools for different operations
this.pools.set('cpu-intensive', new WorkerPool(2, './workers/cpu-worker.js'));
this.pools.set('io-operations', new WorkerPool(5, './workers/io-worker.js'));
this.pools.set('external-api', new WorkerPool(3, './workers/api-worker.js'));
}
async execute<T>(poolName: string, task: any): Promise<T> {
const pool = this.pools.get(poolName);
if (!pool) {
throw new Error(`Pool ${poolName} not found`);
}
return pool.execute<T>(task);
}
}
class WorkerPool {
private workers: Worker[] = [];
private idleWorkers: Worker[] = [];
private queue: Array<{ task: any; resolve: Function; reject: Function }> = [];
constructor(private size: number, private workerScript: string) {
this.initializePool();
}
private initializePool(): void {
for (let i = 0; i < this.size; i++) {
this.createWorker();
}
}
private createWorker(): void {
const worker = new Worker(this.workerScript);
this.workers.push(worker);
// Workers start out idle; per-task listeners are attached in processQueue
this.idleWorkers.push(worker);
}
async execute<T>(task: any): Promise<T> {
return new Promise<T>((resolve, reject) => {
this.queue.push({ task, resolve, reject });
this.processQueue();
});
}
private processQueue(): void {
if (this.queue.length === 0 || this.idleWorkers.length === 0) {
return;
}
const { task, resolve, reject } = this.queue.shift()!;
// Take an idle worker (rather than indexing by a counter) so workers that finish out of order are reused correctly
const worker = this.idleWorkers.pop()!;
// Return a healthy worker to the idle list and pick up the next queued task
const release = () => {
worker.removeAllListeners('message');
worker.removeAllListeners('error');
this.idleWorkers.push(worker);
this.processQueue();
};
// Discard a dead or stuck worker and replace it with a fresh one
const replace = () => {
void worker.terminate();
this.workers = this.workers.filter(w => w !== worker);
this.createWorker();
this.processQueue();
};
const timeout = setTimeout(() => {
replace();
reject(new Error('Worker timeout'));
}, 30000);
worker.once('message', (result) => {
clearTimeout(timeout);
release();
resolve(result);
});
worker.once('error', (error) => {
clearTimeout(timeout);
console.error('Worker error:', error);
replace();
reject(error);
});
worker.postMessage(task);
}
}
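Worker threads are not the only way to build a bulkhead. For I/O-bound work, a simple in-process concurrency limiter (essentially a counting semaphore) gives the same isolation with far less machinery. The sketch below is a minimal, hypothetical variant; the class name and the limit of 3 are illustrative:
// At most `limit` operations run concurrently; the rest wait in a FIFO queue
class ConcurrencyBulkhead {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private limit: number) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Park the caller; release() hands the slot over, so don't increment here
      await new Promise<void>(resolve => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await operation();
    } finally {
      this.release();
    }
  }

  private release(): void {
    const next = this.waiting.shift();
    if (next) {
      next(); // transfer the slot directly to the next waiter
    } else {
      this.active--;
    }
  }
}

// Hypothetical usage: allow at most 3 concurrent calls to the payment provider
const paymentBulkhead = new ConcurrencyBulkhead(3);
// const receipt = await paymentBulkhead.execute(() => callPaymentService(data));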
Connection Pool Isolation
import { Pool } from 'pg';
class DatabaseBulkhead {
private readPool: Pool;
private writePool: Pool;
private analyticsPool: Pool;
constructor() {
// Separate connection pools for different operations
this.readPool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 10, // Max 10 connections for reads
idleTimeoutMillis: 30000,
});
this.writePool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 5, // Max 5 connections for writes
idleTimeoutMillis: 30000,
});
this.analyticsPool = new Pool({
connectionString: process.env.ANALYTICS_DATABASE_URL,
max: 3, // Separate pool for heavy analytics queries
idleTimeoutMillis: 30000,
});
}
async executeRead<T = any>(query: string, params?: any[]): Promise<T[]> {
const client = await this.readPool.connect();
try {
const result = await client.query(query, params);
return result.rows as T[];
} finally {
client.release();
}
}
async executeWrite<T = any>(query: string, params?: any[]): Promise<T[]> {
const client = await this.writePool.connect();
try {
const result = await client.query(query, params);
return result.rows as T[];
} finally {
client.release();
}
}
async executeAnalytics<T = any>(query: string, params?: any[]): Promise<T[]> {
const client = await this.analyticsPool.connect();
try {
const result = await client.query(query, params);
return result.rows as T[];
} finally {
client.release();
}
}
}
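Usage then becomes a matter of routing each query to the pool that matches its workload (the table names below are illustrative, and the calls are assumed to run inside an async function). A runaway analytics query can exhaust only its own three connections, while reads and writes keep flowing:
const db = new DatabaseBulkhead();

// Browsing traffic uses the read pool
const products = await db.executeRead('SELECT * FROM products WHERE category = $1', ['books']);
// Checkout writes go through the smaller write pool
await db.executeWrite('UPDATE inventory SET stock = stock - 1 WHERE product_id = $1', [42]);
// Heavy reporting is confined to the analytics pool
const report = await db.executeAnalytics('SELECT user_id, SUM(amount) FROM orders GROUP BY user_id');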
4. Combining All Patterns
Here's how to combine circuit breakers, retries, and bulkheads for maximum resilience:
class ResilientService {
private circuitBreaker: CircuitBreaker;
private retryManager: RetryManager;
private bulkhead: BulkheadExecutor;
private dbBulkhead: DatabaseBulkhead;
constructor() {
this.circuitBreaker = new CircuitBreaker({
failureThreshold: 5,
recoveryTimeout: 30000,
monitoringPeriod: 10000,
});
this.retryManager = new RetryManager({
maxAttempts: 3,
baseDelay: 1000,
maxDelay: 10000,
backoffMultiplier: 2,
jitter: true,
});
this.bulkhead = new BulkheadExecutor();
this.dbBulkhead = new DatabaseBulkhead();
}
async processOrder(orderData: OrderRequest): Promise<OrderResponse> {
// Use bulkhead for CPU-intensive validation
const validationResult = await this.bulkhead.execute<{ valid: boolean }>('cpu-intensive', {
type: 'validate-order',
data: orderData,
});
if (!validationResult.valid) {
throw new Error('Invalid order data');
}
// Use retry around the circuit breaker for the external payment call: while the
// breaker is open, each attempt fails fast instead of waiting on the slow dependency
const paymentResult = await this.retryManager.execute(async () => {
return this.circuitBreaker.execute(async () => {
return this.callPaymentService(orderData.payment);
});
});
// Use database bulkhead for order creation
const order = await this.dbBulkhead.executeWrite(
'INSERT INTO orders (user_id, amount, status) VALUES ($1, $2, $3) RETURNING *',
[orderData.userId, orderData.amount, 'pending']
);
return {
orderId: order[0].id,
status: 'created',
paymentId: paymentResult.id,
};
}
private async callPaymentService(paymentData: any): Promise<any> {
// External API call with timeout
const response = await fetch('/api/external/payment', {
method: 'POST',
body: JSON.stringify(paymentData),
headers: { 'Content-Type': 'application/json' },
signal: AbortSignal.timeout(5000), // 5 second timeout
});
if (!response.ok) {
throw new Error(`Payment service error: ${response.status}`);
}
return response.json();
}
}
5. Monitoring and Observability
Track the health of your resilience patterns:
class ResilienceMetrics {
private metrics = {
circuitBreakerStates: new Map<string, CircuitState>(),
retryAttempts: new Map<string, number>(),
bulkheadUtilization: new Map<string, number>(),
};
recordCircuitBreakerState(name: string, state: CircuitState): void {
this.metrics.circuitBreakerStates.set(name, state);
// Send to monitoring system (Prometheus, DataDog, etc.)
console.log(`Circuit breaker ${name} state: ${state}`);
}
recordRetryAttempt(operation: string, attempt: number): void {
this.metrics.retryAttempts.set(operation, attempt);
if (attempt > 1) {
console.log(`Retry attempt ${attempt} for ${operation}`);
}
}
recordBulkheadUtilization(pool: string, utilization: number): void {
this.metrics.bulkheadUtilization.set(pool, utilization);
if (utilization > 0.8) {
console.warn(`High bulkhead utilization: ${pool} at ${utilization * 100}%`);
}
}
getHealthCheck(): HealthStatus {
const openCircuits = Array.from(this.metrics.circuitBreakerStates.entries())
.filter(([, state]) => state === CircuitState.OPEN);
const highRetryOperations = Array.from(this.metrics.retryAttempts.entries())
.filter(([, attempts]) => attempts > 2);
const overloadedPools = Array.from(this.metrics.bulkheadUtilization.entries())
.filter(([, utilization]) => utilization > 0.9);
return {
healthy: openCircuits.length === 0 && overloadedPools.length === 0,
openCircuits: openCircuits.map(([name]) => name),
highRetryOperations: highRetryOperations.map(([name]) => name),
overloadedPools: overloadedPools.map(([name]) => name),
};
}
}
interface HealthStatus {
healthy: boolean;
openCircuits: string[];
highRetryOperations: string[];
overloadedPools: string[];
}
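These metrics are most useful when exposed over an HTTP health endpoint that a load balancer or the Docker HEALTHCHECK shown later can poll. A minimal Express sketch, assuming a shared ResilienceMetrics instance (the port and route are illustrative):
import express from 'express';

const app = express();
const metrics = new ResilienceMetrics();

// Return 200 while healthy, 503 when a breaker is open or a pool is saturated,
// so orchestrators and container health checks can react automatically
app.get('/health', (_req, res) => {
  const status = metrics.getHealthCheck();
  res.status(status.healthy ? 200 : 503).json(status);
});

app.listen(3000);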
6. Best Practices
Configuration Management
interface ResilienceConfig {
circuitBreaker: {
failureThreshold: number;
recoveryTimeout: number;
};
retry: {
maxAttempts: number;
baseDelay: number;
maxDelay: number;
};
bulkhead: {
pools: Record<string, { size: number; timeout: number }>;
};
}
// Environment-specific configurations
const configs: Record<string, ResilienceConfig> = {
development: {
circuitBreaker: { failureThreshold: 3, recoveryTimeout: 10000 },
retry: { maxAttempts: 2, baseDelay: 500, maxDelay: 5000 },
bulkhead: {
pools: {
'cpu-intensive': { size: 1, timeout: 10000 },
'io-operations': { size: 2, timeout: 5000 },
},
},
},
production: {
circuitBreaker: { failureThreshold: 5, recoveryTimeout: 30000 },
retry: { maxAttempts: 3, baseDelay: 1000, maxDelay: 10000 },
bulkhead: {
pools: {
'cpu-intensive': { size: 4, timeout: 30000 },
'io-operations': { size: 10, timeout: 15000 },
},
},
},
};
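Selecting the active profile can then be keyed on NODE_ENV, with a fallback to the development settings (a small sketch; the breaker instance shown is illustrative):
// Pick the profile for the current environment; unknown values fall back to development
const env = process.env.NODE_ENV ?? 'development';
const resilienceConfig: ResilienceConfig = configs[env] ?? configs.development;

const paymentBreaker = new CircuitBreaker({
  ...resilienceConfig.circuitBreaker,
  monitoringPeriod: 10000,
});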
Testing Resilience Patterns
describe('Circuit Breaker', () => {
let circuitBreaker: CircuitBreaker;
let mockService: jest.Mock;
beforeEach(() => {
circuitBreaker = new CircuitBreaker({
failureThreshold: 2,
recoveryTimeout: 1000,
monitoringPeriod: 500,
});
mockService = jest.fn();
});
it('should open circuit after threshold failures', async () => {
mockService.mockRejectedValue(new Error('Service error'));
// First failure
await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
expect(circuitBreaker.getState()).toBe(CircuitState.CLOSED);
// Second failure - should open circuit
await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
// Third call should be blocked
await expect(circuitBreaker.execute(mockService)).rejects.toThrow('Circuit breaker is OPEN');
expect(mockService).toHaveBeenCalledTimes(2);
});
it('should transition to half-open after recovery timeout', async () => {
// Open the circuit
mockService.mockRejectedValue(new Error('Service error'));
await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
await expect(circuitBreaker.execute(mockService)).rejects.toThrow();
expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
// Wait for recovery timeout
await new Promise(resolve => setTimeout(resolve, 1100));
// Next call should transition to half-open
mockService.mockResolvedValue('success');
await circuitBreaker.execute(mockService);
expect(circuitBreaker.getState()).toBe(CircuitState.HALF_OPEN);
});
});
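The retry path deserves the same treatment. A sketch of a Jest test asserting that transient failures are retried and eventually succeed (delays are kept tiny so the test stays fast):
describe('RetryManager', () => {
  it('retries transient failures and eventually succeeds', async () => {
    const retry = new RetryManager({
      maxAttempts: 3,
      baseDelay: 10, // keep the test fast
      maxDelay: 50,
      backoffMultiplier: 2,
      jitter: false,
    });

    // Two transient failures, then success
    const flaky = jest.fn()
      .mockRejectedValueOnce(new Error('ECONNRESET'))
      .mockRejectedValueOnce(new Error('503 Service Unavailable'))
      .mockResolvedValue('ok');

    await expect(retry.execute(flaky)).resolves.toBe('ok');
    expect(flaky).toHaveBeenCalledTimes(3);
  });
});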
7. Production Deployment Tips
Environment Variables
# Circuit Breaker Settings
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=30000
# Retry Settings
RETRY_MAX_ATTEMPTS=3
RETRY_BASE_DELAY=1000
RETRY_MAX_DELAY=10000
# Bulkhead Settings
BULKHEAD_CPU_POOL_SIZE=4
BULKHEAD_IO_POOL_SIZE=10
BULKHEAD_API_POOL_SIZE=5
# Database Pool Settings
DB_READ_POOL_SIZE=10
DB_WRITE_POOL_SIZE=5
DB_ANALYTICS_POOL_SIZE=3
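Reading these into typed options keeps magic numbers out of the code. A small helper with fallbacks is one way to do it (the names match the variables above; the defaults mirror the production profile):
// Parse an integer environment variable, falling back to a default when unset or invalid
function envInt(name: string, fallback: number): number {
  const parsed = Number.parseInt(process.env[name] ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

const circuitBreakerOptions: CircuitBreakerOptions = {
  failureThreshold: envInt('CIRCUIT_BREAKER_FAILURE_THRESHOLD', 5),
  recoveryTimeout: envInt('CIRCUIT_BREAKER_RECOVERY_TIMEOUT', 30000),
  monitoringPeriod: 10000,
};

const retryOptions: RetryOptions = {
  maxAttempts: envInt('RETRY_MAX_ATTEMPTS', 3),
  baseDelay: envInt('RETRY_BASE_DELAY', 1000),
  maxDelay: envInt('RETRY_MAX_DELAY', 10000),
  backoffMultiplier: 2,
  jitter: true,
};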
Docker Configuration
FROM node:20-alpine
WORKDIR /app
# Install dependencies
COPY package*.json ./
RUN npm ci --omit=dev
# Copy application code
COPY . .
# Set resource limits for resilience
ENV NODE_OPTIONS="--max-old-space-size=1024"
# Health check against the /health endpoint (alpine images ship busybox wget, not curl)
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
CMD wget -qO- http://localhost:3000/health || exit 1
EXPOSE 3000
CMD ["node", "dist/index.js"]
Summary
Building resilient Node.js applications requires implementing multiple defense layers:
- Circuit Breakers prevent cascading failures by stopping calls to failing services
- Retry Mechanisms handle transient failures with intelligent backoff strategies
- Bulkheads isolate resources to contain failures and prevent resource exhaustion
Key takeaways:
- Combine patterns - Use circuit breakers with retries for external services
- Monitor everything - Track metrics to identify issues before they become outages
- Test failure scenarios - Regularly test your resilience patterns with chaos engineering
- Configure per environment - Different environments need different thresholds
- Graceful degradation - Always have fallback strategies when services fail
Remember: Resilience is not about preventing failures—it's about failing gracefully and recovering quickly.
Happy building! 🚀