Building a Production-Grade Postback System: Lessons from the Trenches

While building an affiliate marketing platform, I spent three months on a feature most users will never see: the postback system. What started as "just send HTTP requests when conversions happen" turned into a fascinating journey through distributed systems, reliability patterns, and production-grade infrastructure.

This isn't about affiliate marketing—it's about reliable event delivery, a problem every developer faces when building integrations, webhooks, payment systems, or any event-driven architecture.

The Problem: Event Delivery at Scale

A postback system delivers conversion events to partner servers. When a user completes an action (sale, signup, click), you need to notify multiple partners reliably. Simple, right?

Wrong.

Here's what I learned the hard way:

Partner servers go down randomly (and don't tell you)
Networks are unreliable—timeouts happen constantly
Duplicate delivery can cost real money
A slow partner can bring down your entire system
You need delivery guarantees, not "best effort"

The stakes are high: Failed postbacks mean lost revenue, broken partnerships, and compliance issues.

Why This Matters Beyond Affiliate Marketing

Before we dive deep, understand that postback systems are everywhere:

Use Case	Event Type	Why It Needs This
Payment Gateways	Transaction confirmations	Money is involved—duplicates cost thousands
E-commerce	Order status webhooks	Inventory sync, shipping triggers
Ad Networks	Click/impression callbacks	Attribution accuracy, billing
SaaS Integrations	Webhook deliveries (Stripe-style)	Third-party app functionality
IoT Platforms	Device event notifications	Real-time device monitoring
Microservices	Inter-service events	Distributed system communication

If you've ever built webhooks, API integrations, or notification systems—this is for you.

Architecture Evolution: From Naive to Production-Grade

Level 0: The Naive Approach (Don't Do This)

// ❌ What I started with (production disaster waiting to happen)
app.post('/conversion', async (req, res) => {
  const conversion = await saveConversion(req.body);
  
  // Send postbacks synchronously
  for (const partner of partners) {
    try {
      await fetch(partner.url, {
        method: 'POST',
        body: JSON.stringify(conversion),
        headers: { 'Content-Type': 'application/json' }
      });
    } catch (error) {
      console.log('Failed:', error.message); // Lost forever!
    }
  }
  
  res.json({ success: true });
});

Why this fails:

User waits for all partner responses (10+ seconds)
One slow partner times out the entire request
Failed postbacks are lost forever
No retry mechanism
No idempotency protection
API becomes unresponsive under load

Production impact: 503 errors, angry users, lost conversions.

Level 1: Queue-Based Architecture

The first major improvement: decouple ingestion from delivery.

import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';

const connection = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  maxRetriesPerRequest: null
});

const postbackQueue = new Queue('postbacks', { connection });

// ✅ API endpoint becomes fast
app.post('/conversion', async (req, res) => {
  const conversion = await saveConversion(req.body);
  
  // Queue for async processing
  await postbackQueue.add('send-postback', {
    conversionId: conversion.id,
    partners: partners.map(p => p.id),
    timestamp: Date.now()
  });
  
  res.json({ success: true }); // Instant response
});

// Worker handles delivery
const worker = new Worker('postbacks', async (job) => {
  const { conversionId, partners } = job.data;
  const conversion = await getConversion(conversionId);
  
  for (const partnerId of partners) {
    await deliverPostback(partnerId, conversion);
  }
}, { connection });

What this solves:

API responds instantly (< 50ms)
Failures don't block user requests
Workers can scale independently
Jobs live in Redis, so they survive API and worker restarts (and Redis restarts too, if AOF or RDB persistence is enabled)

What's still missing:

No retry logic
No idempotency
No delivery guarantees
No monitoring

Level 2: Retry Strategy with Exponential Backoff

Networks fail. Servers go down. You need intelligent retries.

interface RetryConfig {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
  jitter: boolean;
}

class RetryManager {
  constructor(private config: RetryConfig) {}

  async executeWithRetry<T>(
    operation: () => Promise<T>,
    context: string
  ): Promise<T> {
    let lastError: Error;
    
    for (let attempt = 1; attempt <= this.config.maxAttempts; attempt++) {
      try {
        const result = await operation();
        
        if (attempt > 1) {
          // Log successful retry
          logger.info('Retry succeeded', { context, attempt });
        }
        
        return result;
      } catch (error) {
        lastError = error as Error;
        
        // Don't retry on client errors (4xx)
        if (this.isClientError(error)) {
          logger.error('Client error - no retry', { context, error });
          throw error;
        }
        
        // Don't retry on last attempt
        if (attempt === this.config.maxAttempts) {
          logger.error('Max retries reached', { context, attempt });
          throw error;
        }
        
        // Calculate delay with exponential backoff
        const delay = this.calculateDelay(attempt);
        logger.warn('Retrying after delay', { context, attempt, delay });
        
        await this.sleep(delay);
      }
    }
    
    throw lastError!;
  }

  private calculateDelay(attempt: number): number {
    // Exponential: 1s, 2s, 4s, 8s, 16s... (with baseDelay = 1000 and backoffMultiplier = 2)
    let delay = this.config.baseDelay * 
                Math.pow(this.config.backoffMultiplier, attempt - 1);
    
    // Cap at maxDelay
    delay = Math.min(delay, this.config.maxDelay);
    
    // Add jitter to prevent thundering herd
    if (this.config.jitter) {
      delay = delay * (0.5 + Math.random() * 0.5);
    }
    
    return delay;
  }

  private isClientError(error: any): boolean {
    return error.response?.status >= 400 && error.response?.status < 500;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

Integration with BullMQ:

const postbackQueue = new Queue('postbacks', {
  connection,
  defaultJobOptions: {
    attempts: 10,
    backoff: {
      type: 'exponential',
      delay: 2000 // Start with 2s
    },
    removeOnComplete: 100, // Keep last 100 successful jobs
    removeOnFail: 1000      // Keep last 1000 failed jobs
  }
});

const worker = new Worker('postbacks', async (job) => {
  const { conversionId, partnerId } = job.data;
  
  try {
    await retryManager.executeWithRetry(
      () => sendPostback(partnerId, conversionId),
      `postback-${partnerId}-${conversionId}`
    );
  } catch (error) {
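    // After the final attempt fails, move the job to the dead letter queue.
    // attemptsMade counts attempts that have already failed, so add 1 for the current one.
    if (job.attemptsMade + 1 >= (job.opts.attempts ?? 1)) {
      await deadLetterQueue.add('failed-postback', {
        ...job.data,
        error: (error as Error).message,
        attempts: job.attemptsMade + 1,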
        failedAt: new Date()
      });
    }
    throw error; // Let BullMQ handle the retry
  }
}, {
  connection,
  concurrency: 10 // Process 10 jobs in parallel
});

Why exponential backoff?

Linear backoff (1s, 2s, 3s) keeps retry pressure on a server that is still recovering
Exponential backoff gives servers time to recover
Jitter prevents synchronized retries ("thundering herd")
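
To make the schedule concrete, here is a small sketch of the same arithmetic as `calculateDelay`, written as standalone functions. The function names and the sample values (1s base, multiplier 2, 20s cap) are illustrative assumptions, not settings from the production system.

```typescript
// Delay before retry number `attempt` (1-based), without jitter.
function backoffDelay(
  attempt: number,
  baseDelayMs: number,
  multiplier: number,
  maxDelayMs: number
): number {
  const raw = baseDelayMs * Math.pow(multiplier, attempt - 1);
  return Math.min(raw, maxDelayMs);
}

// "Equal jitter": keep half the delay fixed and randomise the other half,
// which matches the 50%-100% range used by calculateDelay.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return delayMs * (0.5 + random() * 0.5);
}

const schedule = [1, 2, 3, 4, 5, 6].map((a) => backoffDelay(a, 1000, 2, 20000));
console.log(schedule); // 1000, 2000, 4000, 8000, 16000, 20000 (capped)
console.log(withJitter(8000, () => 0)); // 4000, the lower bound for an 8s delay
```

Passing the random source as a parameter keeps the jitter testable; in production the default `Math.random` is enough.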

The Idempotency Problem

The nightmare scenario:

Partner's server processes postback successfully
Returns 500 error due to internal issue (not network)
Your system retries
Partner processes it again
Duplicate conversion = double payout = angry partners

The solution: Idempotency keys

import crypto from 'crypto';

interface PostbackEvent {
  conversionId: string;
  partnerId: string;
  amount: number;
  currency: string;
  timestamp: string;
  idempotencyKey: string;
  signature: string;
}

class IdempotencyManager {
  constructor(private redis: Redis) {}

  generateKey(conversion: Conversion, partnerId: string): string {
    // Create unique key from conversion + partner + timestamp window
    const data = `${conversion.id}:${partnerId}:${this.getTimeWindow()}`;
    return crypto.createHash('sha256').update(data).digest('hex');
  }

  async checkAndStore(key: string, ttl: number = 86400): Promise<boolean> {
    // Atomic check-and-set using Redis
    const result = await this.redis.set(
      `idempotency:${key}`,
      '1',
      'EX',
      ttl,
      'NX' // Only set if not exists
    );
    
    return result === 'OK'; // true = can proceed, false = duplicate
  }

  async isProcessed(key: string): Promise<boolean> {
    const exists = await this.redis.exists(`idempotency:${key}`);
    return exists === 1;
  }

  private getTimeWindow(): string {
    // Truncate to the start of the current hour for the idempotency window
    const now = new Date();
    now.setMinutes(0, 0, 0);
    return now.toISOString();
  }
}

class PostbackDeliveryService {
  constructor(
    private idempotencyManager: IdempotencyManager,
    private retryManager: RetryManager
  ) {}

  async deliver(conversion: Conversion, partner: Partner): Promise<DeliveryResult> {
    // Generate idempotency key
    const idempotencyKey = this.idempotencyManager.generateKey(
      conversion,
      partner.id
    );
    
    // Check if already processed
    const canProceed = await this.idempotencyManager.checkAndStore(idempotencyKey);
    
    if (!canProceed) {
      logger.info('Duplicate postback detected', {
        conversionId: conversion.id,
        partnerId: partner.id,
        idempotencyKey
      });
      
      return {
        status: 'skipped',
        reason: 'duplicate',
        idempotencyKey
      };
    }
    
    // Generate HMAC signature for security
    const signature = this.generateSignature(conversion, partner.secret);
    
    // Build postback payload
    const event: PostbackEvent = {
      conversionId: conversion.id,
      partnerId: partner.id,
      amount: conversion.amount,
      currency: conversion.currency,
      timestamp: new Date().toISOString(),
      idempotencyKey,
      signature
    };
    
    // Deliver with retries
    try {
      const result = await this.retryManager.executeWithRetry(
        () => this.sendHTTP(partner.url, event),
        `postback-${partner.id}-${conversion.id}`
      );
      
      return {
        status: 'success',
        idempotencyKey,
        response: result
      };
    } catch (error) {
      // Even on failure, keep idempotency key to prevent duplicates
      return {
        status: 'failed',
        idempotencyKey,
        error: error.message
      };
    }
  }

  private generateSignature(conversion: Conversion, secret: string): string {
    const payload = `${conversion.id}:${conversion.amount}:${conversion.currency}`;
    return crypto
      .createHmac('sha256', secret)
      .update(payload)
      .digest('hex');
  }

  private async sendHTTP(url: string, event: PostbackEvent): Promise<any> {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Idempotency-Key': event.idempotencyKey,
        'X-Signature': event.signature
      },
      body: JSON.stringify(event),
      signal: AbortSignal.timeout(10000) // 10s timeout
    });

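    if (!response.ok) {
      // Attach the status so RetryManager.isClientError can tell 4xx from 5xx
      const error = new Error(`HTTP ${response.status}: ${response.statusText}`);
      (error as any).response = { status: response.status };
      throw error;
    }

    // Partners often reply with an empty or non-JSON body; that is not a delivery failure
    const text = await response.text();
    try {
      return JSON.parse(text);
    } catch {
      return text;
    }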
  }
}
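
Idempotency keys only help if the receiving side honours them. As a rough sketch of what a partner could do with the `X-Idempotency-Key` and `X-Signature` values, assuming the same `id:amount:currency` signing string as `generateSignature`: the `handlePostback` function, the `Outcome` type, and the in-memory `seenKeys` set are all hypothetical, and a real receiver would use a durable store.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

interface IncomingPostback {
  conversionId: string;
  amount: number;
  currency: string;
  idempotencyKey: string;
  signature: string;
}

// Hypothetical receiver-side state; stands in for a database table.
const seenKeys = new Set<string>();

function sign(
  p: { conversionId: string; amount: number; currency: string },
  secret: string
): string {
  return createHmac('sha256', secret)
    .update(`${p.conversionId}:${p.amount}:${p.currency}`)
    .digest('hex');
}

type Outcome = 'processed' | 'duplicate' | 'rejected';

function handlePostback(event: IncomingPostback, secret: string): Outcome {
  const expected = Buffer.from(sign(event, secret), 'hex');
  const received = Buffer.from(event.signature, 'hex');
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (expected.length !== received.length || !timingSafeEqual(expected, received)) {
    return 'rejected';
  }
  if (seenKeys.has(event.idempotencyKey)) {
    return 'duplicate'; // acknowledge, but do not pay out again
  }
  seenKeys.add(event.idempotencyKey);
  return 'processed';
}
```

A receiver along these lines would answer a duplicate with a 2xx response, so the sender stops retrying without the conversion being counted twice.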

Key design decisions:

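Time window: Idempotency keys include an hourly window, not the exact timestamp
- Workers with slightly different clocks still derive the same key, except right at an hour boundary
- Repeats for the same conversion and partner inside the window are deduplicated; a deliberate re-send in a later hour gets a fresh key
- Caveat: a retry that crosses an hour boundary also gets a fresh key, so deduplication is not guaranteed there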
Redis TTL: Keys expire after 24 hours
- Balances memory usage with safety
- Configurable per partner needs
Store on attempt, not success: Even failed deliveries store key
- Prevents retry storms on intermittent failures
- Partner might have processed before returning error

Circuit Breaker Pattern for Partner Health

When a partner's server goes down, you don't want to keep hammering it with requests. Circuit breakers stop requests to failing services.

enum CircuitState {
  CLOSED = 'CLOSED',     // Normal operation
  OPEN = 'OPEN',         // Blocking requests
  HALF_OPEN = 'HALF_OPEN' // Testing recovery
}

interface CircuitBreakerConfig {
  failureThreshold: number;
  successThreshold: number;
  timeout: number;
  monitoringWindow: number;
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures: number = 0;
  private successes: number = 0;
  private lastFailureTime?: number;
  private stateChangeTime: number = Date.now();

  constructor(
    private partnerId: string,
    private config: CircuitBreakerConfig
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if circuit should transition
    this.checkStateTransition();

    if (this.state === CircuitState.OPEN) {
      throw new CircuitOpenError(
        `Circuit breaker OPEN for partner ${this.partnerId}`
      );
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error as Error);
      throw error;
    }
  }

  private onSuccess(): void {
    this.failures = 0;

    if (this.state === CircuitState.HALF_OPEN) {
      this.successes++;
      
      // Require multiple successes before closing
      if (this.successes >= this.config.successThreshold) {
        // Capture before transitionTo() resets stateChangeTime
        const downtime = Date.now() - this.stateChangeTime;
        this.transitionTo(CircuitState.CLOSED);
        this.successes = 0;
        
        logger.info('Circuit breaker closed', {
          partnerId: this.partnerId,
          downtime
        });
      }
    }
  }

  private onFailure(error: Error): void {
    this.failures++;
    this.lastFailureTime = Date.now();

    if (this.state === CircuitState.HALF_OPEN) {
      // Single failure in half-open reopens circuit
      this.transitionTo(CircuitState.OPEN);
      this.successes = 0;
      
      logger.warn('Circuit breaker reopened', {
        partnerId: this.partnerId,
        error: error.message
      });
      return; // Already OPEN: skip the threshold check so we don't transition twice
    }

    if (this.failures >= this.config.failureThreshold) {
      this.transitionTo(CircuitState.OPEN);
      
      logger.error('Circuit breaker opened', {
        partnerId: this.partnerId,
        failures: this.failures,
        error: error.message
      });
    }
  }

  private checkStateTransition(): void {
    if (this.state === CircuitState.OPEN && this.shouldAttemptReset()) {
      // Capture before transitionTo() resets stateChangeTime
      const downtime = Date.now() - this.stateChangeTime;
      this.transitionTo(CircuitState.HALF_OPEN);
      
      logger.info('Circuit breaker half-open', {
        partnerId: this.partnerId,
        downtime
      });
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    return Date.now() - this.lastFailureTime >= this.config.timeout;
  }

  private transitionTo(newState: CircuitState): void {
    const oldState = this.state;
    this.state = newState;
    this.stateChangeTime = Date.now();
    
    // Emit metrics for monitoring
    metrics.circuitBreakerStateChange(this.partnerId, oldState, newState);
  }

  getState(): CircuitState {
    return this.state;
  }

  getMetrics() {
    return {
      state: this.state,
      failures: this.failures,
      successes: this.successes,
      stateAge: Date.now() - this.stateChangeTime
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

Integration with delivery service:

class PartnerCircuitManager {
  private breakers: Map<string, CircuitBreaker> = new Map();

  constructor(private deliveryService: PostbackDeliveryService) {}

  getBreaker(partnerId: string): CircuitBreaker {
    if (!this.breakers.has(partnerId)) {
      this.breakers.set(partnerId, new CircuitBreaker(partnerId, {
        failureThreshold: 5,
        successThreshold: 3,
        timeout: 60000, // 1 minute
        monitoringWindow: 10000
      }));
    }
    return this.breakers.get(partnerId)!;
  }

  async deliverWithCircuitBreaker(
    partner: Partner,
    conversion: Conversion
  ): Promise<DeliveryResult> {
    const breaker = this.getBreaker(partner.id);

    try {
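      return await breaker.execute(async () => {
        const result = await this.deliveryService.deliver(conversion, partner);
        // deliver() reports failures via status instead of throwing;
        // rethrow so the breaker can count them
        if (result.status === 'failed') {
          throw new Error(result.error ?? 'Postback delivery failed');
        }
        return result;
      });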
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        // Don't queue for retry - circuit is open
        return {
          status: 'circuit_open',
          partnerId: partner.id,
          reason: 'Partner temporarily unavailable'
        };
      }
      throw error;
    }
  }

  getAllStates(): Map<string, CircuitState> {
    const states = new Map<string, CircuitState>();
    for (const [partnerId, breaker] of this.breakers) {
      states.set(partnerId, breaker.getState());
    }
    return states;
  }
}

Why circuit breakers matter:

Prevent cascading failures: One bad partner doesn't DoS your system
Faster failure detection: Stop wasting resources on down services
Automatic recovery: Tests service health and closes the circuit again once the partner recovers
Better user experience: Fail fast instead of timing out
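
The three states are easiest to follow in a trace. The sketch below is a deliberately stripped-down breaker with an injected clock, so each transition can be reproduced deterministically. `TinyBreaker` and its thresholds are illustrative assumptions, not the class used above.

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Simplified breaker: time is passed in explicitly instead of read from Date.now().
class TinyBreaker {
  state: State = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private successThreshold: number;
  private resetTimeoutMs: number;

  constructor(failureThreshold: number, successThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.successThreshold = successThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // Returns true if a request may be sent at time `now`.
  allow(now: number): boolean {
    if (this.state === 'OPEN' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN';
      this.successes = 0;
    }
    return this.state !== 'OPEN';
  }

  // Records the outcome of a request that was allowed through.
  record(ok: boolean, now: number): void {
    if (ok) {
      this.failures = 0;
      if (this.state === 'HALF_OPEN' && ++this.successes >= this.successThreshold) {
        this.state = 'CLOSED';
      }
      return;
    }
    this.failures++;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = now;
    }
  }
}

const breaker = new TinyBreaker(2, 2, 1000);
breaker.record(false, 0);
breaker.record(false, 10);
console.log(breaker.state);       // OPEN after two failures
console.log(breaker.allow(500));  // false: still inside the reset timeout
console.log(breaker.allow(1200)); // true: timeout elapsed, now HALF_OPEN
breaker.record(true, 1210);
breaker.record(true, 1220);
console.log(breaker.state);       // CLOSED after two successes
```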

Advanced: Postback Template Engine

Partners need different data formats. Building a flexible template system was crucial.

interface PostbackTemplate {
  url: string;
  method: 'GET' | 'POST';
  headers?: Record<string, string>;
  bodyFormat?: 'json' | 'form' | 'xml';
  macros: Record<string, string>;
}

class TemplateEngine {
  private macros: Map<string, (conversion: Conversion) => string> = new Map();

  constructor() {
    this.registerDefaultMacros();
  }

  private registerDefaultMacros(): void {
    // Register common macros
    this.macros.set('CONVERSION_ID', (c) => c.id);
    this.macros.set('AMOUNT', (c) => c.amount.toString());
    this.macros.set('CURRENCY', (c) => c.currency);
    this.macros.set('TIMESTAMP', (c) => new Date(c.createdAt).getTime().toString());
    this.macros.set('USER_ID', (c) => c.userId);
    this.macros.set('CLICK_ID', (c) => c.clickId || '');
    this.macros.set('IP_ADDRESS', (c) => c.ipAddress);
    this.macros.set('USER_AGENT', (c) => encodeURIComponent(c.userAgent));
    
    // Advanced macros
    this.macros.set('TRANSACTION_ID', (c) => c.transactionId || '');
    this.macros.set('PRODUCT_ID', (c) => c.productId || '');
    this.macros.set('CATEGORY', (c) => c.category || '');
    this.macros.set('COUNTRY', (c) => c.country || '');
    
    // Custom formatting
    this.macros.set('AMOUNT_CENTS', (c) => (c.amount * 100).toFixed(0));
    this.macros.set('ISO_TIMESTAMP', (c) => new Date(c.createdAt).toISOString());
  }

  compile(template: PostbackTemplate, conversion: Conversion): CompiledPostback {
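    // Replace macros in URL. Keys of template.macros are the placeholder names
    // used in the URL; values name the registered macro that supplies the data.
    let url = template.url;
    for (const [placeholder, macro] of Object.entries(template.macros)) {
      const replacement = this.resolveMacro(macro, conversion);
      // split/join replaces every occurrence without regex or `$` escaping issues
      url = url.split(`{${placeholder}}`).join(replacement);
    }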

    // Replace macros in headers
    const headers: Record<string, string> = {};
    if (template.headers) {
      for (const [key, value] of Object.entries(template.headers)) {
        headers[key] = this.replaceMacros(value, conversion);
      }
    }

    // Build body based on format
    let body: any = null;
    if (template.method === 'POST') {
      body = this.buildBody(template, conversion);
    }

    return { url, method: template.method, headers, body };
  }

  private resolveMacro(macro: string, conversion: Conversion): string {
    const resolver = this.macros.get(macro);
    if (!resolver) {
      logger.warn('Unknown macro', { macro });
      return '';
    }
    return resolver(conversion);
  }

  private replaceMacros(text: string, conversion: Conversion): string {
    let result = text;
    for (const [macro, resolver] of this.macros) {
      result = result.replace(
        new RegExp(`{${macro}}`, 'g'),
        resolver(conversion)
      );
    }
    return result;
  }

  private buildBody(template: PostbackTemplate, conversion: Conversion): any {
    if (template.bodyFormat === 'json') {
      return this.buildJsonBody(template.macros, conversion);
    } else if (template.bodyFormat === 'form') {
      return this.buildFormBody(template.macros, conversion);
    } else if (template.bodyFormat === 'xml') {
      return this.buildXmlBody(template.macros, conversion);
    }
    return null;
  }

  private buildJsonBody(macros: Record<string, string>, conversion: Conversion): any {
    const body: any = {};
    for (const [key, macro] of Object.entries(macros)) {
      body[key] = this.resolveMacro(macro, conversion);
    }
    return body;
  }

  private buildFormBody(macros: Record<string, string>, conversion: Conversion): string {
    const params = new URLSearchParams();
    for (const [key, macro] of Object.entries(macros)) {
      params.append(key, this.resolveMacro(macro, conversion));
    }
    return params.toString();
  }

  private buildXmlBody(macros: Record<string, string>, conversion: Conversion): string {
    let xml = '<?xml version="1.0" encoding="UTF-8"?><postback>';
    for (const [key, macro] of Object.entries(macros)) {
      const value = this.resolveMacro(macro, conversion);
      xml += `<${key}>${this.escapeXml(value)}</${key}>`;
    }
    xml += '</postback>';
    return xml;
  }

  private escapeXml(text: string): string {
    return text
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;')
      .replace(/"/g, '&quot;')
      .replace(/'/g, '&apos;');
  }
}

interface CompiledPostback {
  url: string;
  method: 'GET' | 'POST';
  headers: Record<string, string>;
  body: any;
}

Example usage:

// Partner 1: Simple GET request
const partner1Template: PostbackTemplate = {
  url: 'https://partner1.com/postback?id={CONVERSION_ID}&amount={AMOUNT}&currency={CURRENCY}',
  method: 'GET',
  macros: {
    CONVERSION_ID: 'CONVERSION_ID',
    AMOUNT: 'AMOUNT',
    CURRENCY: 'CURRENCY'
  }
};

// Partner 2: POST with JSON body
const partner2Template: PostbackTemplate = {
  url: 'https://partner2.com/api/conversions',
  method: 'POST',
  headers: {
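    // Note: API_KEY is not one of the default macros registered above; it has
    // to be registered separately, or the placeholder is sent literally
    'Authorization': 'Bearer {API_KEY}',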
    'Content-Type': 'application/json'
  },
  bodyFormat: 'json',
  macros: {
    transaction_id: 'CONVERSION_ID',
    amount_cents: 'AMOUNT_CENTS',
    click_id: 'CLICK_ID',
    timestamp: 'ISO_TIMESTAMP'
  }
};

// Partner 3: Form-encoded POST
const partner3Template: PostbackTemplate = {
  url: 'https://partner3.com/track',
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded'
  },
  bodyFormat: 'form',
  macros: {
    txid: 'TRANSACTION_ID',
    payout: 'AMOUNT',
    ip: 'IP_ADDRESS'
  }
};
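
For GET postbacks, values substituted into the query string generally need URL encoding, otherwise a space or `&` in a value can corrupt the request. A small sketch of encoding-aware expansion follows. The `expandUrl` helper is hypothetical, the `ua` and `k` parameters are made up for the demo, and, unlike `resolveMacro`, it leaves unknown placeholders in place so misconfigurations stay visible.

```typescript
// Replace {MACRO} placeholders in a URL template, URL-encoding each value.
// Unknown macros are left untouched rather than replaced with ''.
function expandUrl(template: string, values: Record<string, string>): string {
  return template.replace(/\{([A-Z_]+)\}/g, (placeholder: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name)
      ? encodeURIComponent(values[name])
      : placeholder
  );
}

const expanded = expandUrl(
  'https://partner1.com/postback?id={CONVERSION_ID}&amount={AMOUNT}&ua={USER_AGENT}&k={API_KEY}',
  { CONVERSION_ID: 'conv 1', AMOUNT: '19.99', USER_AGENT: 'Mozilla/5.0 (X11)' }
);
console.log(expanded);
// https://partner1.com/postback?id=conv%201&amount=19.99&ua=Mozilla%2F5.0%20(X11)&k={API_KEY}
```

If encoding happens at this layer, individual macros such as `USER_AGENT` should return raw values to avoid double encoding.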

Monitoring & Observability

You can't fix what you can't see. Comprehensive monitoring is critical.

interface DeliveryMetrics {
  partnerId: string;
  successCount: number;
  failureCount: number;
  averageLatency: number;
  p95Latency: number;
  p99Latency: number;
  lastSuccessAt?: Date;
  lastFailureAt?: Date;
  circuitState: CircuitState;
  queueDepth: number;
}

class MetricsCollector {
  private metrics: Map<string, DeliveryMetrics> = new Map();
  private latencies: Map<string, number[]> = new Map();

  recordSuccess(partnerId: string, latency: number): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.successCount++;
    metric.lastSuccessAt = new Date();
    
    this.recordLatency(partnerId, latency);
    this.updateLatencyPercentiles(partnerId);
  }

  recordFailure(partnerId: string, error: Error): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.failureCount++;
    metric.lastFailureAt = new Date();
    
    // Log to error tracking (Sentry, etc.)
    logger.error('Postback delivery failed', {
      partnerId,
      error: error.message,
      stack: error.stack
    });
  }

  recordCircuitState(partnerId: string, state: CircuitState): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.circuitState = state;
  }

  recordQueueDepth(partnerId: string, depth: number): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.queueDepth = depth;
  }

  private recordLatency(partnerId: string, latency: number): void {
    if (!this.latencies.has(partnerId)) {
      this.latencies.set(partnerId, []);
    }
    
    const latencies = this.latencies.get(partnerId)!;
    latencies.push(latency);
    
    // Keep only last 1000 measurements
    if (latencies.length > 1000) {
      latencies.shift();
    }
  }

  private updateLatencyPercentiles(partnerId: string): void {
    const latencies = this.latencies.get(partnerId);
    if (!latencies || latencies.length === 0) return;
    
    const sorted = [...latencies].sort((a, b) => a - b);
    const metric = this.getOrCreateMetric(partnerId);
    
    metric.averageLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
    metric.p95Latency = sorted[Math.floor(sorted.length * 0.95)];
    metric.p99Latency = sorted[Math.floor(sorted.length * 0.99)];
  }

  getMetrics(partnerId: string): DeliveryMetrics | undefined {
    return this.metrics.get(partnerId);
  }

  getAllMetrics(): DeliveryMetrics[] {
    return Array.from(this.metrics.values());
  }

  getHealthCheck(): HealthCheckResult {
    const metrics = this.getAllMetrics();
    
    const openCircuits = metrics.filter(m => m.circuitState === CircuitState.OPEN);
    const slowPartners = metrics.filter(m => m.p95Latency > 5000); // > 5s
    const failingPartners = metrics.filter(m => {
      const total = m.successCount + m.failureCount;
      return total > 0 && (m.failureCount / total) > 0.1; // > 10% failure rate
    });

    return {
      healthy: openCircuits.length === 0 && failingPartners.length === 0,
      openCircuits: openCircuits.map(m => m.partnerId),
      slowPartners: slowPartners.map(m => m.partnerId),
      failingPartners: failingPartners.map(m => ({
        partnerId: m.partnerId,
        failureRate: (m.failureCount / (m.successCount + m.failureCount))
      })),
      totalQueueDepth: metrics.reduce((sum, m) => sum + m.queueDepth, 0)
    };
  }

  private getOrCreateMetric(partnerId: string): DeliveryMetrics {
    if (!this.metrics.has(partnerId)) {
      this.metrics.set(partnerId, {
        partnerId,
        successCount: 0,
        failureCount: 0,
        averageLatency: 0,
        p95Latency: 0,
        p99Latency: 0,
        circuitState: CircuitState.CLOSED,
        queueDepth: 0
      });
    }
    return this.metrics.get(partnerId)!;
  }
}

interface HealthCheckResult {
  healthy: boolean;
  openCircuits: string[];
  slowPartners: string[];
  failingPartners: Array<{ partnerId: string; failureRate: number }>;
  totalQueueDepth: number;
}
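
There are several conventions for picking a percentile out of a sorted sample, and they can differ by one position on small samples. For comparison with the index arithmetic in `updateLatencyPercentiles`, here is a sketch of the nearest-rank method. The `percentile` function is a hypothetical helper, not part of the collector above.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Returns undefined for empty input.
function percentile(samples: number[], p: number): number | undefined {
  if (samples.length === 0) return undefined;
  const sorted = samples.slice().sort((x, y) => x - y);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

const latenciesMs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
console.log(percentile(latenciesMs, 50)); // 50
console.log(percentile(latenciesMs, 95)); // 100
```

With only a handful of samples, p95 and p99 both collapse to the maximum, so percentile-based alerts are more meaningful once a partner has accumulated a reasonable number of measurements.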

Real-time alerting:

class AlertManager {
  constructor(
    private metricsCollector: MetricsCollector,
    private slackWebhook: string
  ) {
    // Check health every minute
    setInterval(() => this.checkAndAlert(), 60000);
  }

  async checkAndAlert(): Promise<void> {
    const health = this.metricsCollector.getHealthCheck();
    
    // Alert on open circuits
    if (health.openCircuits.length > 0) {
      await this.sendAlert({
        severity: 'high',
        title: 'Circuit Breakers Open',
        message: `${health.openCircuits.length} partner(s) unavailable: ${health.openCircuits.join(', ')}`,
        color: 'danger'
      });
    }

    // Alert on high failure rates
    for (const failing of health.failingPartners) {
      if (failing.failureRate > 0.25) { // > 25%
        await this.sendAlert({
          severity: 'medium',
          title: 'High Failure Rate',
          message: `Partner ${failing.partnerId}: ${(failing.failureRate * 100).toFixed(1)}% failures`,
          color: 'warning'
        });
      }
    }

    // Alert on queue backup
    if (health.totalQueueDepth > 10000) {
      await this.sendAlert({
        severity: 'critical',
        title: 'Queue Backup',
        message: `${health.totalQueueDepth} postbacks pending delivery`,
        color: 'danger'
      });
    }

    // Alert on slow partners
    if (health.slowPartners.length > 0) {
      await this.sendAlert({
        severity: 'low',
        title: 'Slow Partners',
        message: `${health.slowPartners.length} partner(s) responding slowly: ${health.slowPartners.join(', ')}`,
        color: 'warning'
      });
    }
  }

  private async sendAlert(alert: Alert): Promise<void> {
    try {
      await fetch(this.slackWebhook, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          attachments: [{
            color: alert.color,
            title: `[${alert.severity.toUpperCase()}] ${alert.title}`,
            text: alert.message,
            ts: Math.floor(Date.now() / 1000)
          }]
        })
      });
    } catch (error) {
      logger.error('Failed to send alert', { error });
    }
  }
}

interface Alert {
  severity: 'low' | 'medium' | 'high' | 'critical';
  title: string;
  message: string;
  color: 'good' | 'warning' | 'danger';
}
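
Because `checkAndAlert` runs every minute, a partner that stays down for an hour would produce sixty near-identical Slack messages. One common mitigation is a per-alert cooldown. The sketch below is an assumption about how that could look; `AlertThrottle` and its key scheme are hypothetical and not part of the `AlertManager` above.

```typescript
// Suppress repeats of the same alert inside a cooldown window.
class AlertThrottle {
  private lastSent = new Map<string, number>();
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if an alert with this key should be sent at time `now`,
  // and records the send time when it does.
  shouldSend(key: string, now: number = Date.now()): boolean {
    const previous = this.lastSent.get(key);
    if (previous !== undefined && now - previous < this.cooldownMs) {
      return false;
    }
    this.lastSent.set(key, now);
    return true;
  }
}

const throttle = new AlertThrottle(15 * 60 * 1000); // 15-minute cooldown (illustrative)
console.log(throttle.shouldSend('circuit-open:partner-1', 0));      // true
console.log(throttle.shouldSend('circuit-open:partner-1', 60000));  // false: inside cooldown
console.log(throttle.shouldSend('circuit-open:partner-1', 900000)); // true: cooldown elapsed
```

A key built from the alert title plus the affected partner IDs keeps distinct incidents separate while still collapsing repeats of the same one.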

Dashboard endpoint:

app.get('/api/postback-metrics', async (req, res) => {
  const metrics = metricsCollector.getAllMetrics();
  const health = metricsCollector.getHealthCheck();
  
  res.json({
    health,
    metrics: metrics.map(m => ({
      partnerId: m.partnerId,
      successRate: m.successCount / (m.successCount + m.failureCount) || 0,
      averageLatency: Math.round(m.averageLatency),
      p95Latency: Math.round(m.p95Latency),
      circuitState: m.circuitState,
      queueDepth: m.queueDepth,
      lastSuccess: m.lastSuccessAt?.toISOString(),
      lastFailure: m.lastFailureAt?.toISOString()
    }))
  });
});

Database Schema for Audit Trail

Maintain a complete audit log for compliance and debugging:

-- Postback delivery logs
CREATE TABLE postback_logs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  conversion_id UUID NOT NULL REFERENCES conversions(id),
  partner_id UUID NOT NULL REFERENCES partners(id),
  
  -- Request details
  idempotency_key VARCHAR(64) NOT NULL,
  url TEXT NOT NULL,
  method VARCHAR(10) NOT NULL,
  headers JSONB,
  body JSONB,
  
  -- Response details
  status VARCHAR(20) NOT NULL, -- 'success', 'failed', 'skipped', 'circuit_open'
  http_status_code INTEGER,
  response_body TEXT,
  error_message TEXT,
  
  -- Timing
  latency_ms INTEGER,
  attempt_number INTEGER NOT NULL DEFAULT 1,
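  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Indexes for queries (PostgreSQL has no inline INDEX clause, and index names must be unique per schema)
CREATE INDEX idx_postback_logs_conversion_partner ON postback_logs (conversion_id, partner_id);
CREATE INDEX idx_postback_logs_status_created ON postback_logs (status, created_at);
CREATE INDEX idx_postback_logs_idempotency ON postback_logs (idempotency_key);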

-- Dead letter queue for failed deliveries
CREATE TABLE postback_dlq (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  conversion_id UUID NOT NULL,
  partner_id UUID NOT NULL,
  
  -- Failure details
  error_message TEXT NOT NULL,
  attempts INTEGER NOT NULL,
  last_attempt_at TIMESTAMP NOT NULL,
  
  -- Original payload
  payload JSONB NOT NULL,
  
  -- Status tracking
  status VARCHAR(20) NOT NULL DEFAULT 'pending', -- 'pending', 'retrying', 'resolved', 'abandoned'
  resolved_at TIMESTAMP,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_postback_dlq_status_created ON postback_dlq (status, created_at);

-- Partner health metrics (time-series data)
CREATE TABLE partner_metrics (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  partner_id UUID NOT NULL REFERENCES partners(id),
  
  -- Metrics window
  window_start TIMESTAMP NOT NULL,
  window_end TIMESTAMP NOT NULL,
  
  -- Counters
  success_count INTEGER NOT NULL DEFAULT 0,
  failure_count INTEGER NOT NULL DEFAULT 0,
  
  -- Latency stats
  avg_latency_ms INTEGER,
  p95_latency_ms INTEGER,
  p99_latency_ms INTEGER,
  
  -- Circuit breaker state
  circuit_state VARCHAR(20),
  
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_partner_metrics_partner_window ON partner_metrics (partner_id, window_start);

TypeScript repository:

class PostbackLogRepository {
  constructor(private db: Database) {}

  async logAttempt(log: PostbackLogEntry): Promise<void> {
    await this.db.query(`
      INSERT INTO postback_logs (
        conversion_id, partner_id, idempotency_key,
        url, method, headers, body,
        status, http_status_code, response_body, error_message,
        latency_ms, attempt_number
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13)
    `, [
      log.conversionId,
      log.partnerId,
      log.idempotencyKey,
      log.url,
      log.method,
      JSON.stringify(log.headers),
      JSON.stringify(log.body),
      log.status,
      log.httpStatusCode,
      log.responseBody,
      log.errorMessage,
      log.latencyMs,
      log.attemptNumber
    ]);
  }

  async getDeliveryHistory(
    conversionId: string,
    partnerId: string
  ): Promise<PostbackLogEntry[]> {
    const result = await this.db.query(`
      SELECT * FROM postback_logs
      WHERE conversion_id = $1 AND partner_id = $2
      ORDER BY created_at DESC
    `, [conversionId, partnerId]);
    
    return result.rows;
  }

  async getFailedDeliveries(limit: number = 100): Promise<PostbackLogEntry[]> {
    const result = await this.db.query(`
      SELECT * FROM postback_logs
      WHERE status = 'failed'
      ORDER BY created_at DESC
      LIMIT $1
    `, [limit]);
    
    return result.rows;
  }
}

Performance Optimization Techniques

1. Batch Processing for Multiple Partners

class BatchPostbackProcessor {
  async processConversion(conversion: Conversion, partners: Partner[]): Promise<void> {
    // Group partners by priority
    const highPriority = partners.filter(p => p.priority === 'high');
    const normalPriority = partners.filter(p => p.priority === 'normal');
    
    // Send high priority immediately
    await Promise.all(
      highPriority.map(partner => 
        postbackQueue.add('send-postback', {
          conversionId: conversion.id,
          partnerId: partner.id
        }, {
          priority: 1 // Higher priority in queue
        })
      )
    );
    
    // Batch normal priority (reduces queue overhead)
    if (normalPriority.length > 0) {
      await postbackQueue.add('send-batch-postback', {
        conversionId: conversion.id,
        partnerIds: normalPriority.map(p => p.id)
      }, {
        priority: 10
      });
    }
  }
}

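// Rate limiter for parallel processing (declared before the worker that uses it)
import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({
  maxConcurrent: 50, // Max 50 concurrent requests
  minTime: 20 // Minimum 20ms between request starts (at most 50 starts per second)
});

// Worker handles both single and batch jobs
const worker = new Worker('postbacks', async (job) => {
  if (job.name === 'send-postback') {
    const { conversionId, partnerId } = job.data;
    await deliveryService.deliver(conversionId, partnerId);
    return;
  }

  if (job.name === 'send-batch-postback') {
    const { conversionId, partnerIds } = job.data;
    
    // Process in parallel with concurrency limit. allSettled waits for every
    // partner, so one failing partner does not cut the batch short
    const results = await Promise.allSettled(
      partnerIds.map(partnerId =>
        limiter.schedule(() => 
          deliveryService.deliver(conversionId, partnerId)
        )
      )
    );

    // Fail the job so the queue retries it; idempotency keys protect the
    // partners that already succeeded from a second delivery
    const failed = results.filter(r => r.status === 'rejected');
    if (failed.length > 0) {
      throw new Error(`${failed.length} of ${partnerIds.length} deliveries failed`);
    }
  }
}, { connection });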

2. Connection Pooling

import axios from 'axios';
import http from 'http';
import https from 'https';

// Reuse HTTP connections
const httpAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000
});

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000
});

const axiosInstance = axios.create({
  httpAgent,
  httpsAgent,
  timeout: 10000
});

class HTTPClient {
  async post(url: string, data: any, headers: any): Promise<any> {
    return axiosInstance.post(url, data, { headers });
  }
}

3. Caching Partner Configurations

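class PartnerConfigCache {
  private cache: Map<string, { partner: Partner; expiresAt: number }> = new Map();
  private ttl: number = 300000; // 5 minutes

  // Assumes a Prisma-style client, as implied by the findUnique call below
  constructor(private db: PrismaClient) {}

  async getPartner(partnerId: string): Promise<Partner> {
    // Check cache first, ignoring entries that are past their TTL
    const cached = this.cache.get(partnerId);
    if (cached && cached.expiresAt > Date.now()) {
      return cached.partner;
    }

    // Fetch from database
    const partner = await this.db.partner.findUnique({
      where: { id: partnerId }
    });

    if (!partner) {
      this.cache.delete(partnerId); // Drop any stale entry
      throw new Error(`Partner ${partnerId} not found`);
    }

    // Cache with an expiry timestamp. Unlike a per-entry setTimeout, this
    // cannot evict a fresher entry early after invalidate(), and it keeps
    // no timers alive
    this.cache.set(partnerId, { partner, expiresAt: Date.now() + this.ttl });

    return partner;
  }

  invalidate(partnerId: string): void {
    this.cache.delete(partnerId);
  }

  clear(): void {
    this.cache.clear();
  }
}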

Testing Strategy

Unit Tests

import { describe, it, expect, jest, beforeEach } from '@jest/globals';

describe('RetryManager', () => {
  let retryManager: RetryManager;

  beforeEach(() => {
    retryManager = new RetryManager({
      maxAttempts: 3,
      baseDelay: 100,
      maxDelay: 1000,
      backoffMultiplier: 2,
      jitter: false
    });
  });

  it('should succeed on first attempt', async () => {
    const operation = jest.fn().mockResolvedValue('success');
    
    const result = await retryManager.executeWithRetry(operation, 'test');
    
    expect(result).toBe('success');
    expect(operation).toHaveBeenCalledTimes(1);
  });

  it('should retry on transient failure', async () => {
    const operation = jest.fn()
      .mockRejectedValueOnce(new Error('timeout'))
      .mockResolvedValueOnce('success');
    
    const result = await retryManager.executeWithRetry(operation, 'test');
    
    expect(result).toBe('success');
    expect(operation).toHaveBeenCalledTimes(2);
  });

  it('should not retry on client error', async () => {
    const error = new Error('400 Bad Request');
    const operation = jest.fn().mockRejectedValue(error);
    
    await expect(
      retryManager.executeWithRetry(operation, 'test')
    ).rejects.toThrow('400 Bad Request');
    
    expect(operation).toHaveBeenCalledTimes(1);
  });

  it('should throw after max attempts', async () => {
    const operation = jest.fn().mockRejectedValue(new Error('503'));
    
    await expect(
      retryManager.executeWithRetry(operation, 'test')
    ).rejects.toThrow('503');
    
    expect(operation).toHaveBeenCalledTimes(3);
  });
});

describe('CircuitBreaker', () => {
  let circuitBreaker: CircuitBreaker;

  beforeEach(() => {
    circuitBreaker = new CircuitBreaker('test-partner', {
      failureThreshold: 2,
      successThreshold: 3,
      timeout: 1000,
      monitoringWindow: 500
    });
  });

  it('should open after threshold failures', async () => {
    const operation = jest.fn().mockRejectedValue(new Error('failure'));
    
    // First failure
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.CLOSED);
    
    // Second failure - opens circuit
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
    
    // Third call blocked
    await expect(circuitBreaker.execute(operation)).rejects.toThrow('Circuit breaker is OPEN');
    expect(operation).toHaveBeenCalledTimes(2);
  });

  it('should transition to half-open after timeout', async () => {
    // Open the circuit
    const operation = jest.fn().mockRejectedValue(new Error('failure'));
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
    
    // Wait for timeout
    await new Promise(resolve => setTimeout(resolve, 1100));
    
    // Should attempt in half-open
    operation.mockResolvedValue('success');
    await circuitBreaker.execute(operation);
    expect(circuitBreaker.getState()).toBe(CircuitState.HALF_OPEN);
  });
});

Integration Tests

describe('Postback Delivery Integration', () => {
  let testServer: Server;
  let requestLog: any[] = [];

  // Record every request so assertions on requestLog work with any handler
  const recordRequest = (req: any) => {
    requestLog.push({
      method: req.method,
      url: req.url,
      headers: req.headers,
      body: req.body
    });
  };

  const defaultHandler = (req: any, res: any) => {
    recordRequest(req);
    res.status(200).json({ success: true });
  };

  beforeAll(async () => {
    // Start mock partner server
    testServer = createTestServer(defaultHandler);
  });

  afterAll(() => {
    testServer.close();
  });

  beforeEach(() => {
    requestLog = [];
    // Restore the default handler in case a previous test replaced it
    testServer.setHandler(defaultHandler);
  });

  it('should deliver postback with retry on failure', async () => {
    let attempts = 0;
    testServer.setHandler((req, res) => {
      recordRequest(req);
      attempts++;
      if (attempts < 3) {
        res.status(500).json({ error: 'Server error' });
      } else {
        res.status(200).json({ success: true });
      }
    });

    const conversion = createTestConversion();
    const partner = createTestPartner({ url: testServer.url });

    await deliveryService.deliver(conversion, partner);

    expect(attempts).toBe(3);
    expect(requestLog.length).toBe(3);
  });

  it('should prevent duplicate delivery with idempotency', async () => {
    const conversion = createTestConversion();
    const partner = createTestPartner({ url: testServer.url });

    // Send twice
    await deliveryService.deliver(conversion, partner);
    await deliveryService.deliver(conversion, partner);

    // Should only deliver once
    expect(requestLog.length).toBe(1);
  });

  it('should apply template correctly', async () => {
    const conversion = createTestConversion({
      id: 'conv-123',
      amount: 50.00,
      currency: 'USD'
    });

    const partner = createTestPartner({
      url: testServer.url + '/track',
      template: {
        url: '{url}?txid={CONVERSION_ID}&amount={AMOUNT}&currency={CURRENCY}',
        method: 'GET',
        macros: {
          CONVERSION_ID: 'CONVERSION_ID',
          AMOUNT: 'AMOUNT',
          CURRENCY: 'CURRENCY'
        }
      }
    });

    await deliveryService.deliver(conversion, partner);

    expect(requestLog[0].url).toContain('txid=conv-123');
    expect(requestLog[0].url).toContain('amount=50');
    expect(requestLog[0].url).toContain('currency=USD');
  });
});

Deployment Architecture

Docker Compose Setup

version: '3.8'

services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://user:pass@postgres:5432/postback
      REDIS_URL: redis://redis:6379
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1'
          memory: 1G

  worker:
    build: .
    command: node dist/worker.js
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://user:pass@postgres:5432/postback
      REDIS_URL: redis://redis:6379
      WORKER_CONCURRENCY: 10
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '2'
          memory: 2G

  postgres:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postback
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    # BullMQ requires noeviction: an LRU policy lets Redis silently evict queued jobs
    command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy noeviction
    ports:
      - "6379:6379"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

volumes:
  postgres_data:
  redis_data:
  grafana_data:

Environment Configuration

# .env.production
NODE_ENV=production

# Database
DATABASE_URL=postgresql://user:pass@postgres:5432/postback
DB_POOL_MIN=5
DB_POOL_MAX=20

# Redis
REDIS_URL=redis://redis:6379
REDIS_MAX_RETRIES=3

# Queue Configuration
QUEUE_CONCURRENCY=10
QUEUE_MAX_JOBS_PER_WORKER=100

# Circuit Breaker
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=60000

# Retry Configuration
RETRY_MAX_ATTEMPTS=10
RETRY_BASE_DELAY=2000
RETRY_MAX_DELAY=60000

# Monitoring
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SENTRY_DSN=https://your-sentry-dsn

# Rate Limiting
RATE_LIMIT_PER_PARTNER=1000
RATE_LIMIT_WINDOW=60000
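
Environment variables always arrive as strings, so a typo such as RETRY_MAX_ATTEMPTS=ten only surfaces when the first retry runs. Below is a minimal sketch of validating a few of the values above at startup. The `PostbackConfig` shape and the `readInt` and `loadConfig` helpers are hypothetical names for illustration, and the fallback values simply mirror the file above:

```typescript
interface PostbackConfig {
  queueConcurrency: number;
  circuitBreakerFailureThreshold: number;
  circuitBreakerRecoveryTimeoutMs: number;
  retryMaxAttempts: number;
  retryBaseDelayMs: number;
  retryMaxDelayMs: number;
}

type Env = Record<string, string | undefined>;

// Parses a non-negative integer, falling back when the variable is unset.
function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Builds a typed config object and rejects inconsistent combinations early.
function loadConfig(env: Env): PostbackConfig {
  const config: PostbackConfig = {
    queueConcurrency: readInt(env, 'QUEUE_CONCURRENCY', 10),
    circuitBreakerFailureThreshold: readInt(env, 'CIRCUIT_BREAKER_FAILURE_THRESHOLD', 5),
    circuitBreakerRecoveryTimeoutMs: readInt(env, 'CIRCUIT_BREAKER_RECOVERY_TIMEOUT', 60000),
    retryMaxAttempts: readInt(env, 'RETRY_MAX_ATTEMPTS', 10),
    retryBaseDelayMs: readInt(env, 'RETRY_BASE_DELAY', 2000),
    retryMaxDelayMs: readInt(env, 'RETRY_MAX_DELAY', 60000),
  };

  if (config.retryBaseDelayMs > config.retryMaxDelayMs) {
    throw new Error('RETRY_BASE_DELAY must not exceed RETRY_MAX_DELAY');
  }
  return config;
}
```

In a Node process this would typically be called once at startup as loadConfig(process.env), so a bad value stops the deploy instead of a delivery.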

Real-World Lessons & Gotchas

1. The Timeout Cascade Problem

What happened: One partner's server was taking 29 seconds to respond, just under my 30-second timeout. Under load, every worker ended up blocked waiting on that one partner.

Solution: Adaptive timeouts per partner based on historical latency:

class AdaptiveTimeoutManager {
  getTimeout(partnerId: string): number {
    const metrics = metricsCollector.getMetrics(partnerId);
    if (!metrics) return 10000; // Default 10s

    // Timeout = P95 latency + 50% buffer, capped at 30s
    const timeout = Math.min(metrics.p95Latency * 1.5, 30000);
    return Math.max(timeout, 5000); // Minimum 5s
  }
}

2. The Duplicate Payment Disaster

What happened: A partner's load balancer returned 502 after the partner had already processed the request. We retried, and they charged twice.

Solution: Server-side idempotency with longer TTL:

// Store idempotency key for 7 days, not 24 hours
await idempotencyManager.checkAndStore(key, 604800);
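
A longer TTL only helps if every retry of the same delivery carries the same key and the partner can see it. The following sketch replays the incident above under one big assumption: the partner deduplicates on an `Idempotency-Key` header. That header is a common convention, but whether a given partner honors it has to be confirmed with them. `deliverWithKey` and `createFlakyPartner` are hypothetical names, and backoff between attempts is left out for brevity:

```typescript
// A transport function: sends one request and resolves to the HTTP status.
type Send = (headers: Record<string, string>) => Promise<number>;

// Retries a delivery, sending the SAME idempotency key on every attempt so a
// partner that already processed the request can recognize the replay.
async function deliverWithKey(
  send: Send,
  idempotencyKey: string,
  maxAttempts: number
): Promise<number> {
  let status = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    status = await send({ 'Idempotency-Key': idempotencyKey });
    if (status >= 200 && status < 300) return status;
  }
  return status; // Last failing status after all attempts are used up
}

// Simulated partner: processes the first request but answers 502, then
// recognizes the replayed key and acknowledges without charging again.
function createFlakyPartner() {
  const seenKeys = new Set<string>();
  let charges = 0;

  const send: Send = async (headers) => {
    const key = headers['Idempotency-Key'];
    if (seenKeys.has(key)) return 200; // Replay: no second charge
    seenKeys.add(key);
    charges++;
    return 502; // Processed, but the load balancer reports a failure
  };

  return { send, getCharges: () => charges };
}
```

If the key were regenerated on each attempt, the simulated partner would charge once per retry, which is exactly the failure described above.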

3. The Redis Memory Explosion

What happened: We queued 1M jobs during a partner outage, and Redis ran out of memory.

Solution: Queue size limits with backpressure:

const queueSize = await postbackQueue.count();
if (queueSize > 100000) {
  // Stop accepting new jobs
  throw new QueueOverloadError('Postback queue is full');
}
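
The snippet above can be wrapped into a reusable guard. This is a minimal sketch, assuming a queue object that exposes `count()` and `add()` (BullMQ's Queue does). The `JobQueue` interface, the `enqueueWithBackpressure` helper, and this definition of `QueueOverloadError` are illustrative names, not the production code:

```typescript
class QueueOverloadError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'QueueOverloadError';
  }
}

// Minimal shape of the queue this sketch depends on.
interface JobQueue<T> {
  count(): Promise<number>;
  add(name: string, data: T): Promise<unknown>;
}

// Rejects new jobs once the backlog exceeds maxQueueSize. The check and the
// add are two separate calls, so under concurrent producers the limit is
// approximate rather than exact.
async function enqueueWithBackpressure<T>(
  queue: JobQueue<T>,
  name: string,
  data: T,
  maxQueueSize: number = 100000
): Promise<void> {
  const queueSize = await queue.count();
  if (queueSize > maxQueueSize) {
    throw new QueueOverloadError(
      `Postback queue is full (${queueSize} jobs, limit ${maxQueueSize})`
    );
  }
  await queue.add(name, data);
}
```

The caller still has to decide what a rejection means. Two common options are answering the upstream request with a 503 so it is retried later, or persisting the conversion in the database and enqueueing it once the backlog drains.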

4. The Silent Failure

What happened: Partner changed their API contract. Postbacks "succeeded" (200 OK) but weren't processed.

Solution: Validate partner responses:

async function validateResponse(response: any, partner: Partner): Promise<boolean> {
  if (partner.responseValidation) {
    // Check for expected fields
    const valid = partner.responseValidation.requiredFields.every(
      field => response[field] !== undefined
    );
    
    if (!valid) {
      logger.warn('Invalid response from partner', {
        partnerId: partner.id,
        response
      });
      return false;
    }
  }
  return true;
}

What This Enables Beyond Affiliate Marketing

The patterns you've learned apply to:

1. Payment Webhooks

// Stripe-style webhook delivery
class PaymentWebhookService extends PostbackDeliveryService {
  async notifyMerchant(transaction: Transaction, merchant: Merchant) {
    return this.deliver({
      event: 'payment.succeeded',
      transaction_id: transaction.id,
      amount: transaction.amount
    }, merchant.webhookUrl);
  }
}

2. Microservices Event Bus

// Internal event delivery between services
class EventBus extends PostbackDeliveryService {
  async publish(event: DomainEvent) {
    const subscribers = await this.getSubscribers(event.type);
    
    for (const subscriber of subscribers) {
      await this.deliver(event, subscriber.endpoint);
    }
  }
}

3. IoT Device Notifications

// Real-time device event delivery
class IoTEventDelivery extends PostbackDeliveryService {
  async notifyDevice(deviceId: string, command: DeviceCommand) {
    const device = await this.getDevice(deviceId);
    
    return this.deliver(command, device.callbackUrl);
  }
}

4. SaaS Integration Platform

// Zapier-style integration delivery
class IntegrationEngine extends PostbackDeliveryService {
  async triggerWorkflow(trigger: Trigger, data: any) {
    const workflows = await this.getWorkflows(trigger.type);
    
    for (const workflow of workflows) {
      await this.deliver(data, workflow.webhookUrl);
    }
  }
}

Performance Benchmarks

From my production system:

Metric	Value
Throughput	100K postbacks/hour
Latency (P50)	250ms
Latency (P95)	1.2s
Latency (P99)	3.5s
Success Rate	99.2%
Worker Count	5 instances
Queue Depth (avg)	2,500 jobs
Memory per Worker	512MB
CPU per Worker	0.5 cores
Cost per Million	~$15
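
To sanity-check numbers like these against worker capacity, Little's law is handy: requests in flight equal throughput multiplied by latency. The sketch below is back-of-the-envelope arithmetic only. `requiredConcurrency` is a hypothetical helper, and the percentile latencies stand in for the mean latency, which the table does not report:

```typescript
// Little's law: average requests in flight = arrival rate x average latency.
function requiredConcurrency(postbacksPerHour: number, avgLatencyMs: number): number {
  const perSecond = postbacksPerHour / 3600;
  return perSecond * (avgLatencyMs / 1000);
}

// 100K postbacks/hour is about 27.8 per second.
const atP50 = requiredConcurrency(100000, 250);  // about 6.9 requests in flight
const atP99 = requiredConcurrency(100000, 3500); // about 97.2 requests in flight
```

With 5 workers at a concurrency of 10 each, as in the Compose file above, there are 50 delivery slots. That is comfortable at typical latency, but it would not be enough if every request were as slow as the P99, which is one more argument for timeouts and circuit breakers.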

Summary: Key Takeaways

Building a production-grade postback system taught me:

1. Never Trust the Network

Implement retries with exponential backoff
Use circuit breakers to prevent cascading failures
Always have timeouts (and make them adaptive)

2. Idempotency is Non-Negotiable

Generate unique keys for every delivery
Store them with appropriate TTL
Protect against duplicates at all costs

3. Queues are Your Safety Net

Decouple ingestion from delivery
Persist jobs to prevent data loss
Scale workers independently from API

4. Monitor Everything

Track success rates per partner
Alert on anomalies proactively
Maintain audit logs for debugging

5. Design for Failure

Failures will happen—plan for them
Have fallback strategies (DLQ, manual retry)
Test failure scenarios regularly

6. Performance Matters

Use connection pooling
Batch when possible
Cache partner configurations
Monitor resource usage

7. These Patterns are Universal

Webhooks, event buses, notifications—all need this
The principles apply to any event delivery system
Learn once, apply everywhere

What's Next?

If I were to extend this system, I'd add:

Multi-region deployment for global latency reduction
GraphQL subscriptions for real-time delivery status
ML-based anomaly detection for partner health
A/B testing framework for postback variations
Webhook debugging tools (request inspection, replay)

Resources & References

BullMQ Documentation: https://docs.bullmq.io/
Stripe Webhook Best Practices: https://stripe.com/docs/webhooks/best-practices
AWS SQS vs Redis for queues: Performance comparison
Circuit Breaker Pattern: Martin Fowler's article
Idempotency in Distributed Systems: Engineering blog posts

Built something similar? Have questions? The patterns here are battle-tested in production handling millions of events. The principles apply whether you're building affiliate tracking, payment webhooks, or any event-driven system.

Remember: Reliability isn't a feature—it's a requirement. 🚀