Building a Production-Grade Postback System: Lessons from the Trenches

2025-10-02

While building an affiliate marketing platform, I spent three months on a feature most users will never see: the postback system. What started as "just send HTTP requests when conversions happen" turned into a fascinating journey through distributed systems, reliability patterns, and production-grade infrastructure.

This isn't about affiliate marketing—it's about reliable event delivery, a problem every developer faces when building integrations, webhooks, payment systems, or any event-driven architecture.

The Problem: Event Delivery at Scale

A postback system delivers conversion events to partner servers. When a user completes an action (sale, signup, click), you need to notify multiple partners reliably. Simple, right?

Wrong.

Here's what I learned the hard way:

  • Partner servers go down randomly (and don't tell you)
  • Networks are unreliable—timeouts happen constantly
  • Duplicate delivery can cost real money
  • A slow partner can bring down your entire system
  • You need delivery guarantees, not "best effort"

The stakes are high: Failed postbacks mean lost revenue, broken partnerships, and compliance issues.

Why This Matters Beyond Affiliate Marketing

Before we dive deep, understand that postback systems are everywhere:

| Use Case | Event Type | Why It Needs This |
|----------|------------|-------------------|
| Payment Gateways | Transaction confirmations | Money is involved; duplicates cost thousands |
| E-commerce | Order status webhooks | Inventory sync, shipping triggers |
| Ad Networks | Click/impression callbacks | Attribution accuracy, billing |
| SaaS Integrations | Webhook deliveries (Stripe-style) | Third-party app functionality |
| IoT Platforms | Device event notifications | Real-time device monitoring |
| Microservices | Inter-service events | Distributed system communication |

If you've ever built webhooks, API integrations, or notification systems—this is for you.


Architecture Evolution: From Naive to Production-Grade

Level 0: The Naive Approach (Don't Do This)

// ❌ What I started with (production disaster waiting to happen)
app.post('/conversion', async (req, res) => {
  const conversion = await saveConversion(req.body);
  
  // Send postbacks synchronously
  for (const partner of partners) {
    try {
      await fetch(partner.url, {
        method: 'POST',
        body: JSON.stringify(conversion),
        headers: { 'Content-Type': 'application/json' }
      });
    } catch (error) {
      console.log('Failed:', error.message); // Lost forever!
    }
  }
  
  res.json({ success: true });
});

Why this fails:

  • User waits for all partner responses (10+ seconds)
  • One slow partner times out the entire request
  • Failed postbacks are lost forever
  • No retry mechanism
  • No idempotency protection
  • API becomes unresponsive under load

Production impact: 503 errors, angry users, lost conversions.


Level 1: Queue-Based Architecture

The first major improvement: decouple ingestion from delivery.

import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';

const connection = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  maxRetriesPerRequest: null
});

const postbackQueue = new Queue('postbacks', { connection });

// ✅ API endpoint becomes fast
app.post('/conversion', async (req, res) => {
  const conversion = await saveConversion(req.body);
  
  // Queue for async processing
  await postbackQueue.add('send-postback', {
    conversionId: conversion.id,
    partners: partners.map(p => p.id),
    timestamp: Date.now()
  });
  
  res.json({ success: true }); // Instant response
});

// Worker handles delivery
const worker = new Worker('postbacks', async (job) => {
  const { conversionId, partners } = job.data;
  const conversion = await getConversion(conversionId);
  
  for (const partnerId of partners) {
    await deliverPostback(partnerId, conversion);
  }
}, { connection });

What this solves:

  • API responds instantly (< 50ms)
  • Failures don't block user requests
  • Workers can scale independently
  • Jobs are stored in Redis, so they survive API and worker restarts (and Redis restarts, if persistence is enabled)

What's still missing:

  • No retry logic
  • No idempotency
  • No delivery guarantees
  • No monitoring
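
One step that makes the later retry logic easier is queueing one job per partner rather than one job per conversion, so a failing partner can be retried without re-sending to the others. Here is a minimal sketch of that fan-out. `buildPartnerJobs` is a hypothetical helper, not part of the original code, and the `-` separator in the ID is my own choice. The `jobId` field is BullMQ's job option for custom IDs; as far as I know, adding a job whose ID already exists in the queue is ignored.

```typescript
// Hypothetical helper (not from the original code): expands one conversion
// into one job descriptor per partner.
interface PartnerJob {
  name: string;
  data: { conversionId: string; partnerId: string; timestamp: number };
  opts: { jobId: string };
}

function buildPartnerJobs(
  conversionId: string,
  partnerIds: string[],
  now: number = Date.now()
): PartnerJob[] {
  // De-duplicate partner IDs so the same partner is never queued twice
  const unique = Array.from(new Set(partnerIds));
  return unique.map((partnerId) => ({
    name: 'send-postback',
    data: { conversionId, partnerId, timestamp: now },
    // Deterministic ID: the same conversion/partner pair always maps to the same job
    opts: { jobId: `postback-${conversionId}-${partnerId}` }
  }));
}

const jobs = buildPartnerJobs('conv_1', ['p1', 'p2', 'p1'], 0);
console.log(jobs.map((j) => j.opts.jobId));
```

The resulting array has the `{ name, data, opts }` shape that `postbackQueue.addBulk(jobs)` accepts, so the endpoint still makes a single round trip to Redis.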

Level 2: Retry Strategy with Exponential Backoff

Networks fail. Servers go down. You need intelligent retries.

interface RetryConfig {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
  jitter: boolean;
}

class RetryManager {
  constructor(private config: RetryConfig) {}

  async executeWithRetry<T>(
    operation: () => Promise<T>,
    context: string
  ): Promise<T> {
    let lastError: Error;
    
    for (let attempt = 1; attempt <= this.config.maxAttempts; attempt++) {
      try {
        const result = await operation();
        
        if (attempt > 1) {
          // Log successful retry
          logger.info('Retry succeeded', { context, attempt });
        }
        
        return result;
      } catch (error) {
        lastError = error as Error;
        
        // Don't retry on client errors (4xx)
        if (this.isClientError(error)) {
          logger.error('Client error - no retry', { context, error });
          throw error;
        }
        
        // Don't retry on last attempt
        if (attempt === this.config.maxAttempts) {
          logger.error('Max retries reached', { context, attempt });
          throw error;
        }
        
        // Calculate delay with exponential backoff
        const delay = this.calculateDelay(attempt);
        logger.warn('Retrying after delay', { context, attempt, delay });
        
        await this.sleep(delay);
      }
    }
    
    throw lastError!;
  }

  private calculateDelay(attempt: number): number {
    // Exponential: 1s, 2s, 4s, 8s, 16s...
    let delay = this.config.baseDelay * 
                Math.pow(this.config.backoffMultiplier, attempt - 1);
    
    // Cap at maxDelay
    delay = Math.min(delay, this.config.maxDelay);
    
    // Add jitter to prevent thundering herd
    if (this.config.jitter) {
      delay = delay * (0.5 + Math.random() * 0.5);
    }
    
    return delay;
  }

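  private isClientError(error: any): boolean {
    const status = error.response?.status;
    // 408 (request timeout) and 429 (rate limited) are 4xx but still worth retrying
    return status >= 400 && status < 500 && status !== 408 && status !== 429;
  }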

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

Integration with BullMQ:

const postbackQueue = new Queue('postbacks', {
  connection,
  defaultJobOptions: {
    attempts: 10,
    backoff: {
      type: 'exponential',
      delay: 2000 // Start with 2s
    },
    removeOnComplete: 100, // Keep last 100 successful jobs
    removeOnFail: 1000      // Keep last 1000 failed jobs
  }
});

const worker = new Worker('postbacks', async (job) => {
  const { conversionId, partnerId } = job.data;
  
  try {
    await retryManager.executeWithRetry(
      () => sendPostback(partnerId, conversionId),
      `postback-${partnerId}-${conversionId}`
    );
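  } catch (error) {
    // After the final attempt fails, move the job to the dead letter queue.
    // Inside the processor, attemptsMade counts attempts that already finished,
    // so the current attempt is attemptsMade + 1 (verify against your BullMQ version).
    if (job.attemptsMade + 1 >= (job.opts.attempts ?? 1)) {
      await deadLetterQueue.add('failed-postback', {
        ...job.data,
        error: (error as Error).message,
        attempts: job.attemptsMade + 1,
        failedAt: new Date()
      });
    }
    throw error; // Let BullMQ handle the retry
  }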
}, {
  connection,
  concurrency: 10 // Process 10 jobs in parallel
});

Why exponential backoff?

  • Linear backoff (1s, 2s, 3s) keeps retry traffic nearly constant, which can overwhelm a server that is still recovering
  • Exponential backoff gives servers time to recover
  • Jitter prevents synchronized retries ("thundering herd")
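
To make the schedule concrete, here is the same delay formula as `calculateDelay`, pulled out into a standalone function so the numbers can be checked. The config values (1s base, 30s cap, multiplier 2) are illustrative assumptions, not settings from my production system.

```typescript
// Standalone version of the delay formula used by RetryManager.calculateDelay.
// The config values below are illustrative assumptions.
interface BackoffConfig {
  baseDelay: number;         // ms before the first retry
  maxDelay: number;          // upper bound in ms
  backoffMultiplier: number; // growth factor per attempt
}

function backoffDelay(attempt: number, cfg: BackoffConfig): number {
  const raw = cfg.baseDelay * Math.pow(cfg.backoffMultiplier, attempt - 1);
  return Math.min(raw, cfg.maxDelay);
}

// With jitter enabled, the code above scales the delay by a factor in [0.5, 1.0),
// so the actual wait falls between half the delay and the full delay.
function jitterBounds(delay: number): [number, number] {
  return [delay * 0.5, delay];
}

const cfg: BackoffConfig = { baseDelay: 1000, maxDelay: 30000, backoffMultiplier: 2 };
const schedule = [1, 2, 3, 4, 5, 6].map((attempt) => backoffDelay(attempt, cfg));

console.log(schedule);           // [1000, 2000, 4000, 8000, 16000, 30000]
console.log(jitterBounds(4000)); // [2000, 4000]
```

The sixth delay would be 32s uncapped, which is where `maxDelay` takes over.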

The Idempotency Problem

The nightmare scenario:

  1. Partner's server processes postback successfully
  2. Returns 500 error due to internal issue (not network)
  3. Your system retries
  4. Partner processes it again
  5. Duplicate conversion = double payout = angry partners

The solution: Idempotency keys

import crypto from 'crypto';

interface PostbackEvent {
  conversionId: string;
  partnerId: string;
  amount: number;
  currency: string;
  timestamp: string;
  idempotencyKey: string;
  signature: string;
}

class IdempotencyManager {
  constructor(private redis: Redis) {}

  generateKey(conversion: Conversion, partnerId: string): string {
    // Create unique key from conversion + partner + timestamp window
    const data = `${conversion.id}:${partnerId}:${this.getTimeWindow()}`;
    return crypto.createHash('sha256').update(data).digest('hex');
  }

  async checkAndStore(key: string, ttl: number = 86400): Promise<boolean> {
    // Atomic check-and-set using Redis
    const result = await this.redis.set(
      `idempotency:${key}`,
      '1',
      'EX',
      ttl,
      'NX' // Only set if not exists
    );
    
    return result === 'OK'; // true = can proceed, false = duplicate
  }

  async isProcessed(key: string): Promise<boolean> {
    const exists = await this.redis.exists(`idempotency:${key}`);
    return exists === 1;
  }

  private getTimeWindow(): string {
    // Truncate to the start of the current hour for the idempotency window
    const now = new Date();
    now.setMinutes(0, 0, 0);
    return now.toISOString();
  }
}

class PostbackDeliveryService {
  constructor(
    private idempotencyManager: IdempotencyManager,
    private retryManager: RetryManager
  ) {}

  async deliver(conversion: Conversion, partner: Partner): Promise<DeliveryResult> {
    // Generate idempotency key
    const idempotencyKey = this.idempotencyManager.generateKey(
      conversion,
      partner.id
    );
    
    // Check if already processed
    const canProceed = await this.idempotencyManager.checkAndStore(idempotencyKey);
    
    if (!canProceed) {
      logger.info('Duplicate postback detected', {
        conversionId: conversion.id,
        partnerId: partner.id,
        idempotencyKey
      });
      
      return {
        status: 'skipped',
        reason: 'duplicate',
        idempotencyKey
      };
    }
    
    // Generate HMAC signature for security
    const signature = this.generateSignature(conversion, partner.secret);
    
    // Build postback payload
    const event: PostbackEvent = {
      conversionId: conversion.id,
      partnerId: partner.id,
      amount: conversion.amount,
      currency: conversion.currency,
      timestamp: new Date().toISOString(),
      idempotencyKey,
      signature
    };
    
    // Deliver with retries
    try {
      const result = await this.retryManager.executeWithRetry(
        () => this.sendHTTP(partner.url, event),
        `postback-${partner.id}-${conversion.id}`
      );
      
      return {
        status: 'success',
        idempotencyKey,
        response: result
      };
    } catch (error) {
      // Even on failure, keep idempotency key to prevent duplicates
      return {
        status: 'failed',
        idempotencyKey,
        error: (error as Error).message
      };
    }
  }

  private generateSignature(conversion: Conversion, secret: string): string {
    const payload = `${conversion.id}:${conversion.amount}:${conversion.currency}`;
    return crypto
      .createHmac('sha256', secret)
      .update(payload)
      .digest('hex');
  }

  private async sendHTTP(url: string, event: PostbackEvent): Promise<any> {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Idempotency-Key': event.idempotencyKey,
        'X-Signature': event.signature
      },
      body: JSON.stringify(event),
      signal: AbortSignal.timeout(10000) // 10s timeout
    });

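    if (!response.ok) {
      // Attach the status so RetryManager.isClientError can read error.response.status
      throw Object.assign(
        new Error(`HTTP ${response.status}: ${response.statusText}`),
        { response: { status: response.status } }
      );
    }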

    return response.json();
  }
}
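
On the receiving side, a partner can recompute the HMAC and compare it with the `X-Signature` header. The following is a rough sketch of what that check might look like, assuming the partner knows the shared secret and signs the same `id:amount:currency` string as `generateSignature`. `verifySignature` and `expectedSignature` are hypothetical names for this illustration.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Fields the sender signs in generateSignature: id, amount, currency
interface SignedFields {
  conversionId: string;
  amount: number;
  currency: string;
}

function expectedSignature(fields: SignedFields, secret: string): string {
  const payload = `${fields.conversionId}:${fields.amount}:${fields.currency}`;
  return createHmac('sha256', secret).update(payload).digest('hex');
}

// Hypothetical receiver-side check for the X-Signature header
function verifySignature(fields: SignedFields, secret: string, received: string): boolean {
  const expected = Buffer.from(expectedSignature(fields, secret), 'hex');
  const actual = Buffer.from(received, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first
  if (expected.length !== actual.length) return false;
  return timingSafeEqual(expected, actual);
}

const fields: SignedFields = { conversionId: 'conv_1', amount: 19.99, currency: 'USD' };
const signature = expectedSignature(fields, 'shared-secret');

console.log(verifySignature(fields, 'shared-secret', signature));                      // true
console.log(verifySignature({ ...fields, amount: 1999 }, 'shared-secret', signature)); // false
```

The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.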

Key design decisions:

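  1. Time window: Idempotency keys include an hourly window, not an exact timestamp

    • Prevents issues with slight clock skew
    • Allows legitimate retries within a reasonable timeframe
    • Trade-off: a retry that crosses an hour boundary produces a different key and is treated as a new delivery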
  2. Redis TTL: Keys expire after 24 hours

    • Balances memory usage with safety
    • Configurable per partner needs
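  3. Store on attempt, not success: Even failed deliveries store the key

    • Prevents retry storms on intermittent failures
    • Partner might have processed the postback before returning an error
    • Trade-off: until the key expires, a later call to deliver() for the same conversion and partner is skipped as a duplicate, so retries have to happen inside the first call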
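
The check-and-store semantics are easy to exercise without Redis. Below is a small in-memory stand-in for the `SET ... EX ... NX` call, intended for unit tests only. The class name and the injected clock are assumptions made for this sketch.

```typescript
// In-memory stand-in for the Redis `SET key value EX ttl NX` call, for tests only.
// The class name and injected clock are assumptions made for this sketch.
class InMemoryIdempotencyStore {
  private expiresAt = new Map<string, number>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Returns true if the caller may proceed, false if the key is a live duplicate
  checkAndStore(key: string, ttlSeconds: number = 86400): boolean {
    const current = this.now();
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > current) {
      return false; // key still held: duplicate
    }
    this.expiresAt.set(key, current + ttlSeconds * 1000);
    return true;
  }
}

let fakeNow = 0;
const store = new InMemoryIdempotencyStore(() => fakeNow);

const first = store.checkAndStore('conv_1:partner_a', 60);
const second = store.checkAndStore('conv_1:partner_a', 60);
fakeNow = 61000; // advance past the 60s TTL
const afterExpiry = store.checkAndStore('conv_1:partner_a', 60);

console.log(first, second, afterExpiry); // true false true
```

Unlike Redis, this version is only safe within a single process; the point of `NX` in the real implementation is that the check and the write happen atomically across all workers.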

Circuit Breaker Pattern for Partner Health

When a partner's server goes down, you don't want to keep hammering it with requests. Circuit breakers stop requests to failing services.

enum CircuitState {
  CLOSED = 'CLOSED',     // Normal operation
  OPEN = 'OPEN',         // Blocking requests
  HALF_OPEN = 'HALF_OPEN' // Testing recovery
}

interface CircuitBreakerConfig {
  failureThreshold: number;
  successThreshold: number;
  timeout: number;
  monitoringWindow: number;
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures: number = 0;
  private successes: number = 0;
  private lastFailureTime?: number;
  private stateChangeTime: number = Date.now();

  constructor(
    private partnerId: string,
    private config: CircuitBreakerConfig
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if circuit should transition
    this.checkStateTransition();

    if (this.state === CircuitState.OPEN) {
      throw new CircuitOpenError(
        `Circuit breaker OPEN for partner ${this.partnerId}`
      );
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error as Error);
      throw error;
    }
  }

  private onSuccess(): void {
    this.failures = 0;

    if (this.state === CircuitState.HALF_OPEN) {
      this.successes++;
      
      // Require multiple successes before closing
      if (this.successes >= this.config.successThreshold) {
        // Measure time in the previous state before transitionTo resets stateChangeTime
        const downtime = Date.now() - this.stateChangeTime;
        this.transitionTo(CircuitState.CLOSED);
        this.successes = 0;
        
        logger.info('Circuit breaker closed', {
          partnerId: this.partnerId,
          downtime
        });
      }
    }
  }

  private onFailure(error: Error): void {
    this.failures++;
    this.lastFailureTime = Date.now();

    if (this.state === CircuitState.HALF_OPEN) {
      // Single failure in half-open reopens circuit
      this.transitionTo(CircuitState.OPEN);
      this.successes = 0;
      
      logger.warn('Circuit breaker reopened', {
        partnerId: this.partnerId,
        error: error.message
      });
      return; // Already reopened; skip the threshold check below
    }

    if (this.failures >= this.config.failureThreshold) {
      this.transitionTo(CircuitState.OPEN);
      
      logger.error('Circuit breaker opened', {
        partnerId: this.partnerId,
        failures: this.failures,
        error: error.message
      });
    }
  }

  private checkStateTransition(): void {
    if (this.state === CircuitState.OPEN && this.shouldAttemptReset()) {
      // Measure time spent OPEN before transitionTo resets stateChangeTime
      const downtime = Date.now() - this.stateChangeTime;
      this.transitionTo(CircuitState.HALF_OPEN);
      
      logger.info('Circuit breaker half-open', {
        partnerId: this.partnerId,
        downtime
      });
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    return Date.now() - this.lastFailureTime >= this.config.timeout;
  }

  private transitionTo(newState: CircuitState): void {
    const oldState = this.state;
    this.state = newState;
    this.stateChangeTime = Date.now();
    
    // Emit metrics for monitoring
    metrics.circuitBreakerStateChange(this.partnerId, oldState, newState);
  }

  getState(): CircuitState {
    return this.state;
  }

  getMetrics() {
    return {
      state: this.state,
      failures: this.failures,
      successes: this.successes,
      stateAge: Date.now() - this.stateChangeTime
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

Integration with delivery service:

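class PartnerCircuitManager {
  private breakers: Map<string, CircuitBreaker> = new Map();

  // deliveryService is used by deliverWithCircuitBreaker below
  constructor(private deliveryService: PostbackDeliveryService) {}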

  getBreaker(partnerId: string): CircuitBreaker {
    if (!this.breakers.has(partnerId)) {
      this.breakers.set(partnerId, new CircuitBreaker(partnerId, {
        failureThreshold: 5,
        successThreshold: 3,
        timeout: 60000, // 1 minute
        monitoringWindow: 10000
      }));
    }
    return this.breakers.get(partnerId)!;
  }

  async deliverWithCircuitBreaker(
    partner: Partner,
    conversion: Conversion
  ): Promise<DeliveryResult> {
    const breaker = this.getBreaker(partner.id);

    try {
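      return await breaker.execute(async () => {
        const result = await this.deliveryService.deliver(conversion, partner);
        // deliver() reports failures via status rather than throwing,
        // so rethrow here or the breaker never counts them
        if (result.status === 'failed') {
          throw new Error(result.error ?? 'Postback delivery failed');
        }
        return result;
      });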
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        // Don't queue for retry - circuit is open
        return {
          status: 'circuit_open',
          partnerId: partner.id,
          reason: 'Partner temporarily unavailable'
        };
      }
      throw error;
    }
  }

  getAllStates(): Map<string, CircuitState> {
    const states = new Map<string, CircuitState>();
    for (const [partnerId, breaker] of this.breakers) {
      states.set(partnerId, breaker.getState());
    }
    return states;
  }
}

Why circuit breakers matter:

  • Prevent cascading failures: One bad partner doesn't DoS your system
  • Faster failure detection: Stop wasting resources on down services
  • Automatic recovery: Tests service health and closes the circuit again once the partner recovers
  • Better user experience: Fail fast instead of timing out
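
The state machine is easier to follow when you can step through it with a fake clock. Here is a condensed, synchronous model of the breaker above. `MiniBreaker` and its thresholds are illustrative assumptions, and unlike the full class it resets the failure counter when it opens.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Condensed model of the circuit breaker with an injectable clock.
// Names and thresholds are illustrative assumptions.
class MiniBreaker {
  state: BreakerState = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private successThreshold: number;
  private timeoutMs: number;
  private now: () => number;

  constructor(failureThreshold: number, successThreshold: number, timeoutMs: number, now: () => number) {
    this.failureThreshold = failureThreshold;
    this.successThreshold = successThreshold;
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Returns false when the call is rejected because the circuit is open
  allowRequest(): boolean {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.timeoutMs) {
      this.state = 'HALF_OPEN';
      this.successes = 0;
    }
    return this.state !== 'OPEN';
  }

  recordSuccess(): void {
    this.failures = 0;
    if (this.state === 'HALF_OPEN' && ++this.successes >= this.successThreshold) {
      this.state = 'CLOSED';
    }
  }

  recordFailure(): void {
    this.failures++;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}

let clock = 0;
const breaker = new MiniBreaker(3, 2, 60000, () => clock);
const seen: BreakerState[] = [];

for (let i = 0; i < 3; i++) breaker.recordFailure();
seen.push(breaker.state);                 // OPEN after 3 consecutive failures
const rejected = !breaker.allowRequest(); // still inside the 60s timeout
clock = 60000;
breaker.allowRequest();
seen.push(breaker.state);                 // HALF_OPEN once the timeout has elapsed
breaker.recordSuccess();
breaker.recordSuccess();
seen.push(breaker.state);                 // CLOSED after 2 trial successes

console.log(seen, rejected); // ['OPEN', 'HALF_OPEN', 'CLOSED'] true
```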

Advanced: Postback Template Engine

Partners need different data formats. Building a flexible template system was crucial.

interface PostbackTemplate {
  url: string;
  method: 'GET' | 'POST';
  headers?: Record<string, string>;
  bodyFormat?: 'json' | 'form' | 'xml';
  macros: Record<string, string>;
}

class TemplateEngine {
  private macros: Map<string, (conversion: Conversion) => string> = new Map();

  constructor() {
    this.registerDefaultMacros();
  }

  private registerDefaultMacros(): void {
    // Register common macros
    this.macros.set('CONVERSION_ID', (c) => c.id);
    this.macros.set('AMOUNT', (c) => c.amount.toString());
    this.macros.set('CURRENCY', (c) => c.currency);
    this.macros.set('TIMESTAMP', (c) => new Date(c.createdAt).getTime().toString());
    this.macros.set('USER_ID', (c) => c.userId);
    this.macros.set('CLICK_ID', (c) => c.clickId || '');
    this.macros.set('IP_ADDRESS', (c) => c.ipAddress);
    this.macros.set('USER_AGENT', (c) => encodeURIComponent(c.userAgent));
    
    // Advanced macros (optional fields fall back to an empty string)
    this.macros.set('TRANSACTION_ID', (c) => c.transactionId || '');
    this.macros.set('PRODUCT_ID', (c) => c.productId || '');
    this.macros.set('CATEGORY', (c) => c.category || '');
    this.macros.set('COUNTRY', (c) => c.country || '');
    
    // Custom formatting
    this.macros.set('AMOUNT_CENTS', (c) => (c.amount * 100).toFixed(0));
    this.macros.set('ISO_TIMESTAMP', (c) => new Date(c.createdAt).toISOString());
  }

  compile(template: PostbackTemplate, conversion: Conversion): CompiledPostback {
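    // Replace {PLACEHOLDER} tokens in the URL. Each key in template.macros is the
    // placeholder name; each value names the registered macro that supplies the data.
    let url = template.url;
    for (const [placeholder, macroName] of Object.entries(template.macros)) {
      const replacement = this.resolveMacro(macroName, conversion);
      // Escape the braces and use a replacer function so "$" in values is taken literally
      url = url.replace(new RegExp(`\\{${placeholder}\\}`, 'g'), () => replacement);
    }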

    // Replace macros in headers
    const headers: Record<string, string> = {};
    if (template.headers) {
      for (const [key, value] of Object.entries(template.headers)) {
        headers[key] = this.replaceMacros(value, conversion);
      }
    }

    // Build body based on format
    let body: any = null;
    if (template.method === 'POST') {
      body = this.buildBody(template, conversion);
    }

    return { url, method: template.method, headers, body };
  }

  private resolveMacro(macro: string, conversion: Conversion): string {
    const resolver = this.macros.get(macro);
    if (!resolver) {
      logger.warn('Unknown macro', { macro });
      return '';
    }
    return resolver(conversion);
  }

  private replaceMacros(text: string, conversion: Conversion): string {
    let result = text;
    for (const [macro, resolver] of this.macros) {
      result = result.replace(
        new RegExp(`{${macro}}`, 'g'),
        resolver(conversion)
      );
    }
    return result;
  }

  private buildBody(template: PostbackTemplate, conversion: Conversion): any {
    if (template.bodyFormat === 'json') {
      return this.buildJsonBody(template.macros, conversion);
    } else if (template.bodyFormat === 'form') {
      return this.buildFormBody(template.macros, conversion);
    } else if (template.bodyFormat === 'xml') {
      return this.buildXmlBody(template.macros, conversion);
    }
    return null;
  }

  private buildJsonBody(macros: Record<string, string>, conversion: Conversion): any {
    const body: any = {};
    for (const [key, macro] of Object.entries(macros)) {
      body[key] = this.resolveMacro(macro, conversion);
    }
    return body;
  }

  private buildFormBody(macros: Record<string, string>, conversion: Conversion): string {
    const params = new URLSearchParams();
    for (const [key, macro] of Object.entries(macros)) {
      params.append(key, this.resolveMacro(macro, conversion));
    }
    return params.toString();
  }

  private buildXmlBody(macros: Record<string, string>, conversion: Conversion): string {
    let xml = '<?xml version="1.0" encoding="UTF-8"?><postback>';
    for (const [key, macro] of Object.entries(macros)) {
      const value = this.resolveMacro(macro, conversion);
      xml += `<${key}>${this.escapeXml(value)}</${key}>`;
    }
    xml += '</postback>';
    return xml;
  }

  private escapeXml(text: string): string {
    return text
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;')
      .replace(/"/g, '&quot;')
      .replace(/'/g, '&apos;');
  }
}

interface CompiledPostback {
  url: string;
  method: 'GET' | 'POST';
  headers: Record<string, string>;
  body: any;
}

Example usage:

// Partner 1: Simple GET request
const partner1Template: PostbackTemplate = {
  url: 'https://partner1.com/postback?id={CONVERSION_ID}&amount={AMOUNT}&currency={CURRENCY}',
  method: 'GET',
  macros: {
    CONVERSION_ID: 'CONVERSION_ID',
    AMOUNT: 'AMOUNT',
    CURRENCY: 'CURRENCY'
  }
};

// Partner 2: POST with JSON body
const partner2Template: PostbackTemplate = {
  url: 'https://partner2.com/api/conversions',
  method: 'POST',
  headers: {
    // {API_KEY} is not one of the default macros; it needs its own registered resolver
    'Authorization': 'Bearer {API_KEY}',
    'Content-Type': 'application/json'
  },
  bodyFormat: 'json',
  macros: {
    transaction_id: 'CONVERSION_ID',
    amount_cents: 'AMOUNT_CENTS',
    click_id: 'CLICK_ID',
    timestamp: 'ISO_TIMESTAMP'
  }
};

// Partner 3: Form-encoded POST
const partner3Template: PostbackTemplate = {
  url: 'https://partner3.com/track',
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded'
  },
  bodyFormat: 'form',
  macros: {
    txid: 'TRANSACTION_ID',
    payout: 'AMOUNT',
    ip: 'IP_ADDRESS'
  }
};
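
One thing to watch with URL templates is encoding: values such as user agents contain characters that break query strings. Here is a minimal standalone sketch of placeholder expansion that URL-encodes every substituted value. `expandUrlTemplate` is a hypothetical helper, separate from the `TemplateEngine` class, and the sample values are made up.

```typescript
// Hypothetical standalone helper: expands {TOKEN} placeholders in a URL template
// and URL-encodes every substituted value.
function expandUrlTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{([A-Z_]+)\}/g, (match: string, token: string) => {
    const value = values[token];
    // Leave unknown placeholders untouched so misconfigurations stay visible
    return value === undefined ? match : encodeURIComponent(value);
  });
}

const expanded = expandUrlTemplate(
  'https://partner1.com/postback?id={CONVERSION_ID}&amount={AMOUNT}&ua={USER_AGENT}&x={UNKNOWN}',
  { CONVERSION_ID: 'conv 1', AMOUNT: '19.99', USER_AGENT: 'Mozilla/5.0 (X11)' }
);

console.log(expanded);
// https://partner1.com/postback?id=conv%201&amount=19.99&ua=Mozilla%2F5.0%20(X11)&x={UNKNOWN}
```

If you encode at substitution time like this, macros should return raw values; encoding inside the macro as well would double-encode.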

Monitoring & Observability

You can't fix what you can't see. Comprehensive monitoring is critical.

interface DeliveryMetrics {
  partnerId: string;
  successCount: number;
  failureCount: number;
  averageLatency: number;
  p95Latency: number;
  p99Latency: number;
  lastSuccessAt?: Date;
  lastFailureAt?: Date;
  circuitState: CircuitState;
  queueDepth: number;
}

class MetricsCollector {
  private metrics: Map<string, DeliveryMetrics> = new Map();
  private latencies: Map<string, number[]> = new Map();

  recordSuccess(partnerId: string, latency: number): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.successCount++;
    metric.lastSuccessAt = new Date();
    
    this.recordLatency(partnerId, latency);
    this.updateLatencyPercentiles(partnerId);
  }

  recordFailure(partnerId: string, error: Error): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.failureCount++;
    metric.lastFailureAt = new Date();
    
    // Log to error tracking (Sentry, etc.)
    logger.error('Postback delivery failed', {
      partnerId,
      error: error.message,
      stack: error.stack
    });
  }

  recordCircuitState(partnerId: string, state: CircuitState): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.circuitState = state;
  }

  recordQueueDepth(partnerId: string, depth: number): void {
    const metric = this.getOrCreateMetric(partnerId);
    metric.queueDepth = depth;
  }

  private recordLatency(partnerId: string, latency: number): void {
    if (!this.latencies.has(partnerId)) {
      this.latencies.set(partnerId, []);
    }
    
    const latencies = this.latencies.get(partnerId)!;
    latencies.push(latency);
    
    // Keep only last 1000 measurements
    if (latencies.length > 1000) {
      latencies.shift();
    }
  }

  private updateLatencyPercentiles(partnerId: string): void {
    const latencies = this.latencies.get(partnerId);
    if (!latencies || latencies.length === 0) return;
    
    const sorted = [...latencies].sort((a, b) => a - b);
    const metric = this.getOrCreateMetric(partnerId);
    
    metric.averageLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
    metric.p95Latency = sorted[Math.floor(sorted.length * 0.95)];
    metric.p99Latency = sorted[Math.floor(sorted.length * 0.99)];
  }

  getMetrics(partnerId: string): DeliveryMetrics | undefined {
    return this.metrics.get(partnerId);
  }

  getAllMetrics(): DeliveryMetrics[] {
    return Array.from(this.metrics.values());
  }

  getHealthCheck(): HealthCheckResult {
    const metrics = this.getAllMetrics();
    
    const openCircuits = metrics.filter(m => m.circuitState === CircuitState.OPEN);
    const slowPartners = metrics.filter(m => m.p95Latency > 5000); // > 5s
    const failingPartners = metrics.filter(m => {
      const total = m.successCount + m.failureCount;
      return total > 0 && (m.failureCount / total) > 0.1; // > 10% failure rate
    });

    return {
      healthy: openCircuits.length === 0 && failingPartners.length === 0,
      openCircuits: openCircuits.map(m => m.partnerId),
      slowPartners: slowPartners.map(m => m.partnerId),
      failingPartners: failingPartners.map(m => ({
        partnerId: m.partnerId,
        failureRate: (m.failureCount / (m.successCount + m.failureCount))
      })),
      totalQueueDepth: metrics.reduce((sum, m) => sum + m.queueDepth, 0)
    };
  }

  private getOrCreateMetric(partnerId: string): DeliveryMetrics {
    if (!this.metrics.has(partnerId)) {
      this.metrics.set(partnerId, {
        partnerId,
        successCount: 0,
        failureCount: 0,
        averageLatency: 0,
        p95Latency: 0,
        p99Latency: 0,
        circuitState: CircuitState.CLOSED,
        queueDepth: 0
      });
    }
    return this.metrics.get(partnerId)!;
  }
}

interface HealthCheckResult {
  healthy: boolean;
  openCircuits: string[];
  slowPartners: string[];
  failingPartners: Array<{ partnerId: string; failureRate: number }>;
  totalQueueDepth: number;
}
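
The percentile math is worth checking in isolation, because off-by-one conventions differ. The sketch below uses the nearest-rank definition, which is a slightly different convention from the floor-based index in `updateLatencyPercentiles`; the sample latencies are made up.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  // 1-based rank, computed with integer-friendly arithmetic to avoid float drift
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative latencies in ms: 1, 2, ..., 100
const latenciesMs = Array.from({ length: 100 }, (_, i) => i + 1);

console.log(percentile(latenciesMs, 95)); // 95
console.log(percentile(latenciesMs, 99)); // 99
console.log(percentile([120], 99));       // 120
```

On the same 100 samples, the floor-based index picks the 96th value for p95. Either convention is fine for alerting as long as you use one consistently.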

Real-time alerting:

class AlertManager {
  constructor(
    private metricsCollector: MetricsCollector,
    private slackWebhook: string
  ) {
    // Check health every minute
    setInterval(() => this.checkAndAlert(), 60000);
  }

  async checkAndAlert(): Promise<void> {
    const health = this.metricsCollector.getHealthCheck();
    
    // Alert on open circuits
    if (health.openCircuits.length > 0) {
      await this.sendAlert({
        severity: 'high',
        title: 'Circuit Breakers Open',
        message: `${health.openCircuits.length} partner(s) unavailable: ${health.openCircuits.join(', ')}`,
        color: 'danger'
      });
    }

    // Alert on high failure rates
    for (const failing of health.failingPartners) {
      if (failing.failureRate > 0.25) { // > 25%
        await this.sendAlert({
          severity: 'medium',
          title: 'High Failure Rate',
          message: `Partner ${failing.partnerId}: ${(failing.failureRate * 100).toFixed(1)}% failures`,
          color: 'warning'
        });
      }
    }

    // Alert on queue backup
    if (health.totalQueueDepth > 10000) {
      await this.sendAlert({
        severity: 'critical',
        title: 'Queue Backup',
        message: `${health.totalQueueDepth} postbacks pending delivery`,
        color: 'danger'
      });
    }

    // Alert on slow partners
    if (health.slowPartners.length > 0) {
      await this.sendAlert({
        severity: 'low',
        title: 'Slow Partners',
        message: `${health.slowPartners.length} partner(s) responding slowly: ${health.slowPartners.join(', ')}`,
        color: 'warning'
      });
    }
  }

  private async sendAlert(alert: Alert): Promise<void> {
    try {
      await fetch(this.slackWebhook, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          attachments: [{
            color: alert.color,
            title: `[${alert.severity.toUpperCase()}] ${alert.title}`,
            text: alert.message,
            ts: Math.floor(Date.now() / 1000)
          }]
        })
      });
    } catch (error) {
      logger.error('Failed to send alert', { error });
    }
  }
}

interface Alert {
  severity: 'low' | 'medium' | 'high' | 'critical';
  title: string;
  message: string;
  color: 'good' | 'warning' | 'danger';
}
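
Because `checkAndAlert` runs every minute, an open circuit would post the same Slack message every minute until it closes. One option is a cooldown gate in front of `sendAlert`. `AlertCooldown`, the key format, and the 15-minute window are assumptions for this sketch.

```typescript
// Hypothetical cooldown gate: suppresses repeats of the same alert key
// until cooldownMs has elapsed since it was last sent.
class AlertCooldown {
  private lastSent = new Map<string, number>();
  private cooldownMs: number;
  private now: () => number;

  constructor(cooldownMs: number, now: () => number = () => Date.now()) {
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  shouldSend(key: string): boolean {
    const current = this.now();
    const last = this.lastSent.get(key);
    if (last !== undefined && current - last < this.cooldownMs) {
      return false; // same alert went out recently
    }
    this.lastSent.set(key, current);
    return true;
  }
}

let tick = 0;
const cooldown = new AlertCooldown(15 * 60 * 1000, () => tick);
const decisions: boolean[] = [];

decisions.push(cooldown.shouldSend('circuit_open:partner_a')); // first alert goes out
tick = 60000;
decisions.push(cooldown.shouldSend('circuit_open:partner_a')); // one minute later: suppressed
decisions.push(cooldown.shouldSend('queue_backup'));           // different key: sent
tick = 16 * 60 * 1000;
decisions.push(cooldown.shouldSend('circuit_open:partner_a')); // cooldown elapsed: sent again

console.log(decisions); // [true, false, true, true]
```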

Dashboard endpoint:

app.get('/api/postback-metrics', async (req, res) => {
  const metrics = metricsCollector.getAllMetrics();
  const health = metricsCollector.getHealthCheck();
  
  res.json({
    health,
    metrics: metrics.map(m => ({
      partnerId: m.partnerId,
      successRate: m.successCount / (m.successCount + m.failureCount) || 0, // 0 when there is no traffic yet
      averageLatency: Math.round(m.averageLatency),
      p95Latency: Math.round(m.p95Latency),
      circuitState: m.circuitState,
      queueDepth: m.queueDepth,
      lastSuccess: m.lastSuccessAt?.toISOString(),
      lastFailure: m.lastFailureAt?.toISOString()
    }))
  });
});

Database Schema for Audit Trail

Maintain a complete audit log for compliance and debugging:

-- Postback delivery logs
CREATE TABLE postback_logs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  conversion_id UUID NOT NULL REFERENCES conversions(id),
  partner_id UUID NOT NULL REFERENCES partners(id),
  
  -- Request details
  idempotency_key VARCHAR(64) NOT NULL,
  url TEXT NOT NULL,
  method VARCHAR(10) NOT NULL,
  headers JSONB,
  body JSONB,
  
  -- Response details
  status VARCHAR(20) NOT NULL, -- 'success', 'failed', 'skipped', 'circuit_open'
  http_status_code INTEGER,
  response_body TEXT,
  error_message TEXT,
  
  -- Timing
  latency_ms INTEGER,
  attempt_number INTEGER NOT NULL DEFAULT 1,
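  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Indexes for queries (PostgreSQL has no inline INDEX clause in CREATE TABLE)
CREATE INDEX idx_postback_logs_conversion_partner ON postback_logs (conversion_id, partner_id);
CREATE INDEX idx_postback_logs_status_created ON postback_logs (status, created_at);
CREATE INDEX idx_postback_logs_idempotency ON postback_logs (idempotency_key);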

-- Dead letter queue for failed deliveries
CREATE TABLE postback_dlq (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  conversion_id UUID NOT NULL,
  partner_id UUID NOT NULL,
  
  -- Failure details
  error_message TEXT NOT NULL,
  attempts INTEGER NOT NULL,
  last_attempt_at TIMESTAMP NOT NULL,
  
  -- Original payload
  payload JSONB NOT NULL,
  
  -- Status tracking
  status VARCHAR(20) NOT NULL DEFAULT 'pending', -- 'pending', 'retrying', 'resolved', 'abandoned'
  resolved_at TIMESTAMP,
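  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Index names are unique per schema in PostgreSQL, so prefix them with the table name
CREATE INDEX idx_postback_dlq_status_created ON postback_dlq (status, created_at);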

-- Partner health metrics (time-series data)
CREATE TABLE partner_metrics (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  partner_id UUID NOT NULL REFERENCES partners(id),
  
  -- Metrics window
  window_start TIMESTAMP NOT NULL,
  window_end TIMESTAMP NOT NULL,
  
  -- Counters
  success_count INTEGER NOT NULL DEFAULT 0,
  failure_count INTEGER NOT NULL DEFAULT 0,
  
  -- Latency stats
  avg_latency_ms INTEGER,
  p95_latency_ms INTEGER,
  p99_latency_ms INTEGER,
  
  -- Circuit breaker state
  circuit_state VARCHAR(20),
  
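  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- PostgreSQL has no inline INDEX clause, so the index is created separately
CREATE INDEX idx_partner_metrics_partner_window ON partner_metrics (partner_id, window_start);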

TypeScript repository:

class PostbackLogRepository {
  constructor(private db: Database) {}

  async logAttempt(log: PostbackLogEntry): Promise<void> {
    await this.db.query(`
      INSERT INTO postback_logs (
        conversion_id, partner_id, idempotency_key,
        url, method, headers, body,
        status, http_status_code, response_body, error_message,
        latency_ms, attempt_number
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13)
    `, [
      log.conversionId,
      log.partnerId,
      log.idempotencyKey,
      log.url,
      log.method,
      JSON.stringify(log.headers),
      JSON.stringify(log.body),
      log.status,
      log.httpStatusCode,
      log.responseBody,
      log.errorMessage,
      log.latencyMs,
      log.attemptNumber
    ]);
  }

  async getDeliveryHistory(
    conversionId: string,
    partnerId: string
  ): Promise<PostbackLogEntry[]> {
    const result = await this.db.query(`
      SELECT * FROM postback_logs
      WHERE conversion_id = $1 AND partner_id = $2
      ORDER BY created_at DESC
    `, [conversionId, partnerId]);
    
    return result.rows;
  }

  async getFailedDeliveries(limit: number = 100): Promise<PostbackLogEntry[]> {
    const result = await this.db.query(`
      SELECT * FROM postback_logs
      WHERE status = 'failed'
      ORDER BY created_at DESC
      LIMIT $1
    `, [limit]);
    
    return result.rows;
  }
}
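
One detail the repository glosses over: `getDeliveryHistory` and `getFailedDeliveries` return `result.rows` as-is, but the columns are snake_case while `logAttempt` reads camelCase fields. A minimal sketch of a mapper, assuming the driver returns rows keyed by column name. The `PostbackLogRow` type, the `toLogEntry` name, and the trimmed-down `PostbackLogEntry` shape are all assumptions for illustration:

```typescript
// Assumed row shape: one property per postback_logs column used here.
interface PostbackLogRow {
  conversion_id: string;
  partner_id: string;
  idempotency_key: string;
  url: string;
  method: string;
  status: string;
  http_status_code: number | null;
  error_message: string | null;
  latency_ms: number | null;
  attempt_number: number;
  created_at: Date;
}

// Assumed entry shape: a subset of the fields logAttempt reads, plus createdAt.
interface PostbackLogEntry {
  conversionId: string;
  partnerId: string;
  idempotencyKey: string;
  url: string;
  method: string;
  status: string;
  httpStatusCode?: number;
  errorMessage?: string;
  latencyMs?: number;
  attemptNumber: number;
  createdAt: Date;
}

// Translate one database row into the camelCase entry; SQL NULL becomes undefined.
function toLogEntry(row: PostbackLogRow): PostbackLogEntry {
  return {
    conversionId: row.conversion_id,
    partnerId: row.partner_id,
    idempotencyKey: row.idempotency_key,
    url: row.url,
    method: row.method,
    status: row.status,
    httpStatusCode: row.http_status_code ?? undefined,
    errorMessage: row.error_message ?? undefined,
    latencyMs: row.latency_ms ?? undefined,
    attemptNumber: row.attempt_number,
    createdAt: row.created_at
  };
}
```

With something like this in place, the query methods could return `result.rows.map(toLogEntry)` instead of the raw rows.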

Performance Optimization Techniques

1. Batch Processing for Multiple Partners

class BatchPostbackProcessor {
  async processConversion(conversion: Conversion, partners: Partner[]): Promise<void> {
    // Group partners by priority
    const highPriority = partners.filter(p => p.priority === 'high');
    const normalPriority = partners.filter(p => p.priority === 'normal');
    
    // Send high priority immediately
    await Promise.all(
      highPriority.map(partner => 
        postbackQueue.add('send-postback', {
          conversionId: conversion.id,
          partnerId: partner.id
        }, {
          priority: 1 // Higher priority in queue
        })
      )
    );
    
    // Batch normal priority (reduces queue overhead)
    if (normalPriority.length > 0) {
      await postbackQueue.add('send-batch-postback', {
        conversionId: conversion.id,
        partnerIds: normalPriority.map(p => p.id)
      }, {
        priority: 10
      });
    }
  }
}

// Rate limiter for parallel processing
import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({
  maxConcurrent: 50, // Max 50 concurrent requests
  minTime: 20 // Minimum 20ms between requests
});

// Worker processes batches
const worker = new Worker('postbacks', async (job) => {
  if (job.name === 'send-batch-postback') {
    const { conversionId, partnerIds } = job.data;
    
    // Process in parallel with concurrency limit
    await Promise.all(
      partnerIds.map((partnerId: string) =>
        limiter.schedule(() => 
          deliveryService.deliver(conversionId, partnerId)
        )
      )
    );
  }
}, { connection });
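
One caveat with `Promise.all` in the batch worker: if a single partner fails, the whole job rejects, and a queue-level retry would re-send to partners that already succeeded. Idempotency keys should absorb that, but it is wasted work. A sketch of collecting per-partner outcomes instead, by catching each rejection individually. `deliverToAll` and the `deliver` callback are hypothetical stand-ins for `deliveryService.deliver` wrapped in the limiter:

```typescript
// Hypothetical signature standing in for the rate-limited deliveryService.deliver call.
type DeliverFn = (conversionId: string, partnerId: string) => Promise<void>;

interface BatchResult {
  succeeded: string[];
  failed: { partnerId: string; error: string }[];
}

async function deliverToAll(
  conversionId: string,
  partnerIds: string[],
  deliver: DeliverFn
): Promise<BatchResult> {
  // Each promise is converted into an outcome object, so none of them can reject the batch.
  const outcomes = await Promise.all(
    partnerIds.map(partnerId =>
      deliver(conversionId, partnerId).then(
        () => ({ partnerId, error: null as string | null }),
        (err: unknown) => ({
          partnerId,
          error: err instanceof Error ? err.message : String(err)
        })
      )
    )
  );

  const result: BatchResult = { succeeded: [], failed: [] };
  for (const outcome of outcomes) {
    if (outcome.error === null) {
      result.succeeded.push(outcome.partnerId);
    } else {
      result.failed.push({ partnerId: outcome.partnerId, error: outcome.error });
    }
  }
  return result;
}
```

The failed partner IDs could then be re-queued as individual `send-postback` jobs, so only the deliveries that actually failed are retried.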

2. Connection Pooling

import axios from 'axios';
import http from 'http';
import https from 'https';

// Reuse HTTP connections
const httpAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000
});

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000
});

const axiosInstance = axios.create({
  httpAgent,
  httpsAgent,
  timeout: 10000
});

class HTTPClient {
  async post(url: string, data: any, headers: any): Promise<any> {
    return axiosInstance.post(url, data, { headers });
  }
}

3. Caching Partner Configurations

class PartnerConfigCache {
  // Each entry carries its own expiry, so a stale timer can never evict a fresh value
  private cache: Map<string, { partner: Partner; expiresAt: number }> = new Map();
  private ttl: number = 300000; // 5 minutes

  // Assumes a Prisma-style client, matching the findUnique call below
  constructor(private db: PrismaClient) {}

  async getPartner(partnerId: string): Promise<Partner> {
    // Check cache first, skipping entries past their TTL
    const cached = this.cache.get(partnerId);
    if (cached && cached.expiresAt > Date.now()) {
      return cached.partner;
    }

    // Fetch from database
    const partner = await this.db.partner.findUnique({
      where: { id: partnerId }
    });

    if (!partner) {
      throw new Error(`Partner ${partnerId} not found`);
    }

    // Cache for future use, stamped with its expiry time
    this.cache.set(partnerId, {
      partner,
      expiresAt: Date.now() + this.ttl
    });

    return partner;
  }

  invalidate(partnerId: string): void {
    this.cache.delete(partnerId);
  }

  clear(): void {
    this.cache.clear();
  }
}

Testing Strategy

Unit Tests

import { describe, it, expect, jest, beforeEach } from '@jest/globals';

describe('RetryManager', () => {
  let retryManager: RetryManager;

  beforeEach(() => {
    retryManager = new RetryManager({
      maxAttempts: 3,
      baseDelay: 100,
      maxDelay: 1000,
      backoffMultiplier: 2,
      jitter: false
    });
  });

  it('should succeed on first attempt', async () => {
    const operation = jest.fn().mockResolvedValue('success');
    
    const result = await retryManager.executeWithRetry(operation, 'test');
    
    expect(result).toBe('success');
    expect(operation).toHaveBeenCalledTimes(1);
  });

  it('should retry on transient failure', async () => {
    const operation = jest.fn()
      .mockRejectedValueOnce(new Error('timeout'))
      .mockResolvedValueOnce('success');
    
    const result = await retryManager.executeWithRetry(operation, 'test');
    
    expect(result).toBe('success');
    expect(operation).toHaveBeenCalledTimes(2);
  });

  it('should not retry on client error', async () => {
    const error = new Error('400 Bad Request');
    const operation = jest.fn().mockRejectedValue(error);
    
    await expect(
      retryManager.executeWithRetry(operation, 'test')
    ).rejects.toThrow('400 Bad Request');
    
    expect(operation).toHaveBeenCalledTimes(1);
  });

  it('should throw after max attempts', async () => {
    const operation = jest.fn().mockRejectedValue(new Error('503'));
    
    await expect(
      retryManager.executeWithRetry(operation, 'test')
    ).rejects.toThrow('503');
    
    expect(operation).toHaveBeenCalledTimes(3);
  });
});

describe('CircuitBreaker', () => {
  let circuitBreaker: CircuitBreaker;

  beforeEach(() => {
    circuitBreaker = new CircuitBreaker('test-partner', {
      failureThreshold: 2,
      successThreshold: 3,
      timeout: 1000,
      monitoringWindow: 500
    });
  });

  it('should open after threshold failures', async () => {
    const operation = jest.fn().mockRejectedValue(new Error('failure'));
    
    // First failure
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.CLOSED);
    
    // Second failure - opens circuit
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
    
    // Third call blocked
    await expect(circuitBreaker.execute(operation)).rejects.toThrow('Circuit breaker is OPEN');
    expect(operation).toHaveBeenCalledTimes(2);
  });

  it('should transition to half-open after timeout', async () => {
    // Open the circuit
    const operation = jest.fn().mockRejectedValue(new Error('failure'));
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    await expect(circuitBreaker.execute(operation)).rejects.toThrow();
    expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
    
    // Wait for timeout
    await new Promise(resolve => setTimeout(resolve, 1100));
    
    // Should attempt in half-open
    operation.mockResolvedValue('success');
    await circuitBreaker.execute(operation);
    expect(circuitBreaker.getState()).toBe(CircuitState.HALF_OPEN);
  });
});

Integration Tests

describe('Postback Delivery Integration', () => {
  let testServer: Server;
  let requestLog: any[] = [];

  beforeAll(async () => {
    // Start mock partner server
    testServer = createTestServer((req, res) => {
      requestLog.push({
        method: req.method,
        url: req.url,
        headers: req.headers,
        body: req.body
      });
      res.status(200).json({ success: true });
    });
  });

  afterAll(() => {
    testServer.close();
  });

  beforeEach(() => {
    requestLog = [];
  });

  it('should deliver postback with retry on failure', async () => {
    let attempts = 0;
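    testServer.setHandler((req, res) => {
      attempts++;
      // The override replaces the default handler, so record the request here too
      requestLog.push({ method: req.method, url: req.url });
      if (attempts < 3) {
        res.status(500).json({ error: 'Server error' });
      } else {
        res.status(200).json({ success: true });
      }
    });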

    const conversion = createTestConversion();
    const partner = createTestPartner({ url: testServer.url });

    await deliveryService.deliver(conversion, partner);

    expect(attempts).toBe(3);
    expect(requestLog.length).toBe(3);
  });

  it('should prevent duplicate delivery with idempotency', async () => {
    const conversion = createTestConversion();
    const partner = createTestPartner({ url: testServer.url });

    // Send twice
    await deliveryService.deliver(conversion, partner);
    await deliveryService.deliver(conversion, partner);

    // Should only deliver once
    expect(requestLog.length).toBe(1);
  });

  it('should apply template correctly', async () => {
    const conversion = createTestConversion({
      id: 'conv-123',
      amount: 50.00,
      currency: 'USD'
    });

    const partner = createTestPartner({
      url: testServer.url + '/track',
      template: {
        url: '{url}?txid={CONVERSION_ID}&amount={AMOUNT}&currency={CURRENCY}',
        method: 'GET',
        macros: {
          CONVERSION_ID: 'CONVERSION_ID',
          AMOUNT: 'AMOUNT',
          CURRENCY: 'CURRENCY'
        }
      }
    });

    await deliveryService.deliver(conversion, partner);

    expect(requestLog[0].url).toContain('txid=conv-123');
    expect(requestLog[0].url).toContain('amount=50');
    expect(requestLog[0].url).toContain('currency=USD');
  });
});

Deployment Architecture

Docker Compose Setup

version: '3.8'

services:
  app:
    build: .
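    ports:
      # A fixed host port suits Swarm's ingress routing; under plain
      # `docker compose up`, the 3 replicas below would collide on it
      - "3000:3000"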
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://user:pass@postgres:5432/postback
      REDIS_URL: redis://redis:6379
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1'
          memory: 1G

  worker:
    build: .
    command: node dist/worker.js
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://user:pass@postgres:5432/postback
      REDIS_URL: redis://redis:6379
      WORKER_CONCURRENCY: 10
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '2'
          memory: 2G

  postgres:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postback
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
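    # Queue data must never be evicted: BullMQ expects noeviction, and an LRU policy
    # would silently drop jobs under memory pressure
    command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy noeviction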
    ports:
      - "6379:6379"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

volumes:
  postgres_data:
  redis_data:
  grafana_data:

Environment Configuration

# .env.production
NODE_ENV=production

# Database
DATABASE_URL=postgresql://user:pass@postgres:5432/postback
DB_POOL_MIN=5
DB_POOL_MAX=20

# Redis
REDIS_URL=redis://redis:6379
REDIS_MAX_RETRIES=3

# Queue Configuration
QUEUE_CONCURRENCY=10
QUEUE_MAX_JOBS_PER_WORKER=100

# Circuit Breaker
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=60000

# Retry Configuration
RETRY_MAX_ATTEMPTS=10
RETRY_BASE_DELAY=2000
RETRY_MAX_DELAY=60000

# Monitoring
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SENTRY_DSN=https://your-sentry-dsn

# Rate Limiting
RATE_LIMIT_PER_PARTNER=1000
RATE_LIMIT_WINDOW=60000
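
Environment variables arrive as strings, so it helps to parse and validate them once at startup rather than at each point of use. A sketch covering the retry and circuit-breaker variables above. `loadDeliveryConfig`, `readPositiveInt`, and the `DeliveryConfig` shape are hypothetical, and the defaults simply mirror the values in the sample file:

```typescript
interface DeliveryConfig {
  retryMaxAttempts: number;
  retryBaseDelayMs: number;
  retryMaxDelayMs: number;
  circuitFailureThreshold: number;
  circuitRecoveryTimeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Read one variable as a positive integer, falling back when it is unset or empty.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  const value = Number(raw);
  if (!isFinite(value) || Math.floor(value) !== value || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function loadDeliveryConfig(env: Env): DeliveryConfig {
  const config: DeliveryConfig = {
    retryMaxAttempts: readPositiveInt(env, 'RETRY_MAX_ATTEMPTS', 10),
    retryBaseDelayMs: readPositiveInt(env, 'RETRY_BASE_DELAY', 2000),
    retryMaxDelayMs: readPositiveInt(env, 'RETRY_MAX_DELAY', 60000),
    circuitFailureThreshold: readPositiveInt(env, 'CIRCUIT_BREAKER_FAILURE_THRESHOLD', 5),
    circuitRecoveryTimeoutMs: readPositiveInt(env, 'CIRCUIT_BREAKER_RECOVERY_TIMEOUT', 60000)
  };

  // Cross-field check: a base delay above the cap would make backoff meaningless.
  if (config.retryBaseDelayMs > config.retryMaxDelayMs) {
    throw new Error('RETRY_BASE_DELAY must not exceed RETRY_MAX_DELAY');
  }
  return config;
}
```

In a Node process this would be called once as `loadDeliveryConfig(process.env)`, so a typo in the deployment config fails the boot instead of surfacing as odd retry behavior later.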

Real-World Lessons & Gotchas

1. The Timeout Cascade Problem

What happened: A partner's server was responding in 29 seconds, just under my 30-second timeout, so no request ever timed out. Under load, every worker sat waiting on that one partner.

Solution: Adaptive timeouts per partner based on historical latency:

class AdaptiveTimeoutManager {
  getTimeout(partnerId: string): number {
    const metrics = metricsCollector.getMetrics(partnerId);
    if (!metrics) return 10000; // Default 10s

    // Timeout = P95 latency + 50% buffer, capped at 30s
    const timeout = Math.min(metrics.p95Latency * 1.5, 30000);
    return Math.max(timeout, 5000); // Minimum 5s
  }
}
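
To make the clamping concrete, here is the same formula as a standalone function with a few sample inputs. `computeTimeoutMs` is a hypothetical name; the constants are the ones from the class above:

```typescript
const MIN_TIMEOUT_MS = 5000;
const MAX_TIMEOUT_MS = 30000;
const DEFAULT_TIMEOUT_MS = 10000;

// P95 latency plus a 50% buffer, clamped to the [5s, 30s] range.
function computeTimeoutMs(p95LatencyMs?: number): number {
  if (p95LatencyMs === undefined) return DEFAULT_TIMEOUT_MS;
  const buffered = p95LatencyMs * 1.5;
  return Math.max(Math.min(buffered, MAX_TIMEOUT_MS), MIN_TIMEOUT_MS);
}

console.log(computeTimeoutMs(800));   // 5000  (1200 raised to the 5s floor)
console.log(computeTimeoutMs(4000));  // 6000
console.log(computeTimeoutMs(29000)); // 30000 (43500 capped at 30s)
console.log(computeTimeoutMs());      // 10000 (no metrics yet)
```

Note that a partner with a 29-second P95, like the one in this incident, still lands on the 30-second cap. The adaptive part protects you from partners that are normally fast; for consistently slow ones, the cap itself (or the circuit breaker) has to do the work.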

2. The Duplicate Payment Disaster

What happened: Partner's load balancer returned 502 after processing. We retried. They charged twice.

Solution: Server-side idempotency with longer TTL:

// Store idempotency key for 7 days, not 24 hours
await idempotencyManager.checkAndStore(key, 604800);
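
A longer TTL only helps if every retry of the same delivery carries the same key, so that whoever checks it (your own store or the partner) can recognize the repeat. A minimal sketch of deriving a stable key, assuming the key is scoped to one conversion and one partner. `buildIdempotencyKey` is a hypothetical helper, not necessarily how the key is generated elsewhere in this system; the SHA-256 hex digest is 64 characters, which fits the `VARCHAR(64)` column in the schema:

```typescript
import { createHash } from 'crypto';

// Same conversion + same partner always yields the same 64-character key,
// no matter which worker or which attempt computes it.
function buildIdempotencyKey(conversionId: string, partnerId: string): string {
  return createHash('sha256')
    .update(`${conversionId}:${partnerId}`)
    .digest('hex');
}
```

Because the key depends only on the conversion and the partner, a retry after an ambiguous 502 reuses it automatically, with no extra state to carry through the queue.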

3. The Redis Memory Explosion

What happened: Queued 1M jobs during partner outage. Redis ran out of memory.

Solution: Queue size limits with backpressure:

const queueSize = await postbackQueue.count();
if (queueSize > 100000) {
  // Stop accepting new jobs
  throw new QueueOverloadError('Postback queue is full');
}
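
The snippet above leaves `QueueOverloadError` undefined. A sketch of how the check could be packaged, assuming only that the queue exposes an async `count()` as used above. `ensureQueueCapacity` and the `CountableQueue` interface are hypothetical names:

```typescript
// Carries the observed size so callers can log it or expose it in a response.
class QueueOverloadError extends Error {
  constructor(message: string, public readonly queueSize: number) {
    super(message);
    this.name = 'QueueOverloadError';
  }
}

// Minimal view of the queue: anything with an async count() works, including a test double.
interface CountableQueue {
  count(): Promise<number>;
}

// Resolves with the current size when there is room, rejects when the limit is exceeded.
async function ensureQueueCapacity(queue: CountableQueue, maxSize: number): Promise<number> {
  const size = await queue.count();
  if (size > maxSize) {
    throw new QueueOverloadError(`Postback queue is full (${size} > ${maxSize})`, size);
  }
  return size;
}
```

An ingestion endpoint could call `ensureQueueCapacity(postbackQueue, 100000)` before enqueueing and translate the error into a 503, so upstream callers back off instead of piling more jobs into Redis.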

4. The Silent Failure

What happened: Partner changed their API contract. Postbacks "succeeded" (200 OK) but weren't processed.

Solution: Validate partner responses:

async function validateResponse(response: any, partner: Partner): Promise<boolean> {
  if (partner.responseValidation) {
    // Check for expected fields
    const valid = partner.responseValidation.requiredFields.every(
      field => response[field] !== undefined
    );
    
    if (!valid) {
      logger.warn('Invalid response from partner', {
        partnerId: partner.id,
        response
      });
      return false;
    }
  }
  return true;
}

What This Enables Beyond Affiliate Marketing

The patterns you've learned apply to:

1. Payment Webhooks

// Stripe-style webhook delivery
class PaymentWebhookService extends PostbackDeliveryService {
  async notifyMerchant(transaction: Transaction, merchant: Merchant) {
    return this.deliver({
      event: 'payment.succeeded',
      transaction_id: transaction.id,
      amount: transaction.amount
    }, merchant.webhookUrl);
  }
}

2. Microservices Event Bus

// Internal event delivery between services
class EventBus extends PostbackDeliveryService {
  async publish(event: DomainEvent) {
    const subscribers = await this.getSubscribers(event.type);
    
    for (const subscriber of subscribers) {
      await this.deliver(event, subscriber.endpoint);
    }
  }
}

3. IoT Device Notifications

// Real-time device event delivery
class IoTEventDelivery extends PostbackDeliveryService {
  async notifyDevice(deviceId: string, command: DeviceCommand) {
    const device = await this.getDevice(deviceId);
    
    return this.deliver(command, device.callbackUrl);
  }
}

4. SaaS Integration Platform

// Zapier-style integration delivery
class IntegrationEngine extends PostbackDeliveryService {
  async triggerWorkflow(trigger: Trigger, data: any) {
    const workflows = await this.getWorkflows(trigger.type);
    
    for (const workflow of workflows) {
      await this.deliver(data, workflow.webhookUrl);
    }
  }
}

Performance Benchmarks

From my production system:

| Metric | Value |
|---|---|
| Throughput | 100K postbacks/hour |
| Latency (P50) | 250ms |
| Latency (P95) | 1.2s |
| Latency (P99) | 3.5s |
| Success Rate | 99.2% |
| Worker Count | 5 instances |
| Queue Depth (avg) | 2,500 jobs |
| Memory per Worker | 512MB |
| CPU per Worker | 0.5 cores |
| Cost per Million | ~$15 |

Summary: Key Takeaways

Building a production-grade postback system taught me:

1. Never Trust the Network

  • Implement retries with exponential backoff
  • Use circuit breakers to prevent cascading failures
  • Always have timeouts (and make them adaptive)

2. Idempotency is Non-Negotiable

  • Generate unique keys for every delivery
  • Store them with appropriate TTL
  • Protect against duplicates at all costs

3. Queues are Your Safety Net

  • Decouple ingestion from delivery
  • Persist jobs to prevent data loss
  • Scale workers independently from API

4. Monitor Everything

  • Track success rates per partner
  • Alert on anomalies proactively
  • Maintain audit logs for debugging

5. Design for Failure

  • Failures will happen—plan for them
  • Have fallback strategies (DLQ, manual retry)
  • Test failure scenarios regularly

6. Performance Matters

  • Use connection pooling
  • Batch when possible
  • Cache partner configurations
  • Monitor resource usage

7. These Patterns are Universal

  • Webhooks, event buses, notifications—all need this
  • The principles apply to any event delivery system
  • Learn once, apply everywhere

What's Next?

If I were to extend this system, I'd add:

  1. Multi-region deployment for global latency reduction
  2. GraphQL subscriptions for real-time delivery status
  3. ML-based anomaly detection for partner health
  4. A/B testing framework for postback variations
  5. Webhook debugging tools (request inspection, replay)


Built something similar? Have questions? The patterns here are battle-tested in production handling millions of events. The principles apply whether you're building affiliate tracking, payment webhooks, or any event-driven system.

Remember: Reliability isn't a feature—it's a requirement. 🚀
