Building a Production-Grade Postback System: Lessons from the Trenches
2025-10-02
While building an affiliate marketing platform, I spent three months on a feature most users will never see: the postback system. What started as "just send HTTP requests when conversions happen" turned into a fascinating journey through distributed systems, reliability patterns, and production-grade infrastructure.
This isn't about affiliate marketing—it's about reliable event delivery, a problem every developer faces when building integrations, webhooks, payment systems, or any event-driven architecture.
The Problem: Event Delivery at Scale
A postback system delivers conversion events to partner servers. When a user completes an action (sale, signup, click), you need to notify multiple partners reliably. Simple, right?
Wrong.
Here's what I learned the hard way:
- Partner servers go down randomly (and don't tell you)
- Networks are unreliable—timeouts happen constantly
- Duplicate delivery can cost real money
- A slow partner can bring down your entire system
- You need delivery guarantees, not "best effort"
The stakes are high: Failed postbacks mean lost revenue, broken partnerships, and compliance issues.
Why This Matters Beyond Affiliate Marketing
Before we dive deep, understand that postback systems are everywhere:
| Use Case | Event Type | Why It Needs This |
|---|---|---|
| Payment Gateways | Transaction confirmations | Money is involved—duplicates cost thousands |
| E-commerce | Order status webhooks | Inventory sync, shipping triggers |
| Ad Networks | Click/impression callbacks | Attribution accuracy, billing |
| SaaS Integrations | Webhook deliveries (Stripe-style) | Third-party app functionality |
| IoT Platforms | Device event notifications | Real-time device monitoring |
| Microservices | Inter-service events | Distributed system communication |
If you've ever built webhooks, API integrations, or notification systems—this is for you.
Architecture Evolution: From Naive to Production-Grade
Level 0: The Naive Approach (Don't Do This)
// ❌ What I started with (production disaster waiting to happen)
app.post('/conversion', async (req, res) => {
const conversion = await saveConversion(req.body);
// Send postbacks synchronously
for (const partner of partners) {
try {
await fetch(partner.url, {
method: 'POST',
body: JSON.stringify(conversion),
headers: { 'Content-Type': 'application/json' }
});
} catch (error) {
console.log('Failed:', error.message); // Lost forever!
}
}
res.json({ success: true });
});
Why this fails:
- User waits for all partner responses (10+ seconds)
- One slow partner times out the entire request
- Failed postbacks are lost forever
- No retry mechanism
- No idempotency protection
- API becomes unresponsive under load
Production impact: 503 errors, angry users, lost conversions.
Level 1: Queue-Based Architecture
The first major improvement: decouple ingestion from delivery.
import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';
const connection = new Redis({
host: process.env.REDIS_HOST,
port: 6379,
maxRetriesPerRequest: null
});
const postbackQueue = new Queue('postbacks', { connection });
// ✅ API endpoint becomes fast
app.post('/conversion', async (req, res) => {
const conversion = await saveConversion(req.body);
// Queue for async processing
await postbackQueue.add('send-postback', {
conversionId: conversion.id,
partners: partners.map(p => p.id),
timestamp: Date.now()
});
res.json({ success: true }); // Instant response
});
// Worker handles delivery
const worker = new Worker('postbacks', async (job) => {
const { conversionId, partners } = job.data;
const conversion = await getConversion(conversionId);
for (const partnerId of partners) {
await deliverPostback(partnerId, conversion);
}
}, { connection });
What this solves:
- API responds instantly (< 50ms)
- Failures don't block user requests
- Workers can scale independently
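- Queued jobs live in Redis rather than in process memory, so they survive API and worker restarts (provided Redis persistence such as AOF or RDB snapshots is enabled)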
What's still missing:
- No retry logic
- No idempotency
- No delivery guarantees
- No monitoring
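One step toward per-partner retries and idempotency is to enqueue one job per partner instead of one job per conversion, so a failing partner cannot force redelivery to the others. The sketch below only builds the job descriptors; `buildPostbackJobs` and the job ID format are illustrative assumptions, not part of the original system.

```typescript
// Hypothetical helper: fan a conversion out into one job per partner.
interface PostbackJobData {
  conversionId: string;
  partnerId: string;
}

interface PostbackJob {
  name: string;
  data: PostbackJobData;
  opts: { jobId: string };
}

function buildPostbackJobs(conversionId: string, partnerIds: string[]): PostbackJob[] {
  // De-duplicate partner IDs so a repeated entry cannot enqueue twice.
  const unique = Array.from(new Set(partnerIds));
  return unique.map((partnerId) => ({
    name: 'send-postback',
    data: { conversionId, partnerId },
    // A deterministic job ID identifies the conversion/partner pair.
    // The "_" separator is an assumption chosen to avoid ":" in IDs.
    opts: { jobId: `postback_${conversionId}_${partnerId}` },
  }));
}

const jobs = buildPostbackJobs('conv-1', ['p1', 'p2', 'p1']);
console.log(jobs.map((j) => j.opts.jobId));
```

The `{ name, data, opts }` shape matches what BullMQ's `Queue.addBulk` accepts, so the array could be handed to `postbackQueue.addBulk(jobs)`. BullMQ documents that adding a job whose ID already exists is ignored, which gives a first layer of duplicate protection at enqueue time.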
Level 2: Retry Strategy with Exponential Backoff
Networks fail. Servers go down. You need intelligent retries.
interface RetryConfig {
maxAttempts: number;
baseDelay: number;
maxDelay: number;
backoffMultiplier: number;
jitter: boolean;
}
class RetryManager {
constructor(private config: RetryConfig) {}
async executeWithRetry<T>(
operation: () => Promise<T>,
context: string
): Promise<T> {
let lastError: Error;
for (let attempt = 1; attempt <= this.config.maxAttempts; attempt++) {
try {
const result = await operation();
if (attempt > 1) {
// Log successful retry
logger.info('Retry succeeded', { context, attempt });
}
return result;
} catch (error) {
lastError = error as Error;
// Don't retry on client errors (4xx)
if (this.isClientError(error)) {
logger.error('Client error - no retry', { context, error });
throw error;
}
// Don't retry on last attempt
if (attempt === this.config.maxAttempts) {
logger.error('Max retries reached', { context, attempt });
throw error;
}
// Calculate delay with exponential backoff
const delay = this.calculateDelay(attempt);
logger.warn('Retrying after delay', { context, attempt, delay });
await this.sleep(delay);
}
}
throw lastError!;
}
private calculateDelay(attempt: number): number {
// Exponential: 1s, 2s, 4s, 8s, 16s...
let delay = this.config.baseDelay *
Math.pow(this.config.backoffMultiplier, attempt - 1);
// Cap at maxDelay
delay = Math.min(delay, this.config.maxDelay);
// Add jitter to prevent thundering herd
if (this.config.jitter) {
delay = delay * (0.5 + Math.random() * 0.5);
}
return delay;
}
private isClientError(error: any): boolean {
return error.response?.status >= 400 && error.response?.status < 500;
}
private sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
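`isClientError` treats every 4xx response as permanent. Some 4xx codes, notably 408 (Request Timeout) and 429 (Too Many Requests), usually describe transient conditions, so a delivery system may prefer to retry them. A minimal classifier, assuming the hypothetical name `isRetryableStatus`:

```typescript
// Hypothetical classifier: decide whether an HTTP outcome is worth retrying.
function isRetryableStatus(status: number | undefined): boolean {
  // No status means no response arrived at all (timeout, DNS failure, reset).
  if (status === undefined) return true;
  // Transient client-side codes: request timeout and rate limiting.
  if (status === 408 || status === 429) return true;
  // Server errors are assumed to be transient.
  return status >= 500 && status <= 599;
}

console.log([undefined, 400, 404, 408, 429, 500, 503].map(isRetryableStatus));
```

Whether 429 should be retried immediately or only after honoring a `Retry-After` header is a policy choice this sketch leaves open.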
Integration with BullMQ:
const postbackQueue = new Queue('postbacks', {
connection,
defaultJobOptions: {
attempts: 10,
backoff: {
type: 'exponential',
delay: 2000 // Start with 2s
},
removeOnComplete: 100, // Keep last 100 successful jobs
removeOnFail: 1000 // Keep last 1000 failed jobs
}
});
const worker = new Worker('postbacks', async (job) => {
const { conversionId, partnerId } = job.data;
try {
await retryManager.executeWithRetry(
() => sendPostback(partnerId, conversionId),
`postback-${partnerId}-${conversionId}`
);
} catch (error) {
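// On the final attempt, move the job to the dead letter queue.
// Inside the processor, attemptsMade counts earlier attempts only,
// so the attempt currently running is attemptsMade + 1.
if (job.attemptsMade + 1 >= (job.opts.attempts ?? 1)) {
await deadLetterQueue.add('failed-postback', {
...job.data,
error: (error as Error).message,
attempts: job.attemptsMade + 1,
failedAt: new Date()
});
}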
throw error; // Let BullMQ handle the retry
}
}, {
connection,
concurrency: 10 // Process 10 jobs in parallel
});
Why exponential backoff?
- Linear backoff (1s, 2s, 3s) retries too often and can overwhelm a server that is still recovering
- Exponential backoff gives servers time to recover
- Jitter prevents synchronized retries ("thundering herd")
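To make the schedule concrete, the sketch below reproduces the arithmetic of `calculateDelay` as two small pure functions. The config values (1s base, 30s cap, multiplier 2) are example assumptions.

```typescript
interface BackoffConfig {
  baseDelay: number; // milliseconds
  maxDelay: number; // milliseconds
  backoffMultiplier: number;
}

// Delay before retry number `attempt` (1-based), without jitter.
function backoffDelay(attempt: number, config: BackoffConfig): number {
  const raw = config.baseDelay * Math.pow(config.backoffMultiplier, attempt - 1);
  return Math.min(raw, config.maxDelay);
}

// Keep half of the delay fixed and randomize the other half,
// which yields a value in [delay / 2, delay).
function withJitter(delay: number, random: () => number = Math.random): number {
  return delay * (0.5 + random() * 0.5);
}

const config: BackoffConfig = { baseDelay: 1000, maxDelay: 30000, backoffMultiplier: 2 };
const schedule = [1, 2, 3, 4, 5, 6, 7].map((n) => backoffDelay(n, config));
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

Passing the random source as a parameter keeps the jitter testable: `withJitter(1000, () => 0)` gives the lower bound of 500.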
The Idempotency Problem
The nightmare scenario:
- Partner's server processes postback successfully
- Returns 500 error due to internal issue (not network)
- Your system retries
- Partner processes it again
- Duplicate conversion = double payout = angry partners
The solution: Idempotency keys
import crypto from 'crypto';
interface PostbackEvent {
conversionId: string;
partnerId: string;
amount: number;
currency: string;
timestamp: string;
idempotencyKey: string;
signature: string;
}
class IdempotencyManager {
constructor(private redis: Redis) {}
generateKey(conversion: Conversion, partnerId: string): string {
// Create unique key from conversion + partner + timestamp window
const data = `${conversion.id}:${partnerId}:${this.getTimeWindow()}`;
return crypto.createHash('sha256').update(data).digest('hex');
}
async checkAndStore(key: string, ttl: number = 86400): Promise<boolean> {
// Atomic check-and-set using Redis
const result = await this.redis.set(
`idempotency:${key}`,
'1',
'EX',
ttl,
'NX' // Only set if not exists
);
return result === 'OK'; // true = can proceed, false = duplicate
}
async isProcessed(key: string): Promise<boolean> {
const exists = await this.redis.exists(`idempotency:${key}`);
return exists === 1;
}
private getTimeWindow(): string {
// Truncate to the start of the current hour for the idempotency window
const now = new Date();
now.setMinutes(0, 0, 0);
return now.toISOString();
}
}
class PostbackDeliveryService {
constructor(
private idempotencyManager: IdempotencyManager,
private retryManager: RetryManager
) {}
async deliver(conversion: Conversion, partner: Partner): Promise<DeliveryResult> {
// Generate idempotency key
const idempotencyKey = this.idempotencyManager.generateKey(
conversion,
partner.id
);
// Check if already processed
const canProceed = await this.idempotencyManager.checkAndStore(idempotencyKey);
if (!canProceed) {
logger.info('Duplicate postback detected', {
conversionId: conversion.id,
partnerId: partner.id,
idempotencyKey
});
return {
status: 'skipped',
reason: 'duplicate',
idempotencyKey
};
}
// Generate HMAC signature for security
const signature = this.generateSignature(conversion, partner.secret);
// Build postback payload
const event: PostbackEvent = {
conversionId: conversion.id,
partnerId: partner.id,
amount: conversion.amount,
currency: conversion.currency,
timestamp: new Date().toISOString(),
idempotencyKey,
signature
};
// Deliver with retries
try {
const result = await this.retryManager.executeWithRetry(
() => this.sendHTTP(partner.url, event),
`postback-${partner.id}-${conversion.id}`
);
return {
status: 'success',
idempotencyKey,
response: result
};
} catch (error) {
// Even on failure, keep idempotency key to prevent duplicates
return {
status: 'failed',
idempotencyKey,
error: error.message
};
}
}
private generateSignature(conversion: Conversion, secret: string): string {
const payload = `${conversion.id}:${conversion.amount}:${conversion.currency}`;
return crypto
.createHmac('sha256', secret)
.update(payload)
.digest('hex');
}
private async sendHTTP(url: string, event: PostbackEvent): Promise<any> {
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'X-Idempotency-Key': event.idempotencyKey,
'X-Signature': event.signature
},
body: JSON.stringify(event),
signal: AbortSignal.timeout(10000) // 10s timeout
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${response.statusText}`);
}
return response.json();
}
}
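On the receiving side, a partner can recompute the HMAC and compare it with the `X-Signature` header. The sketch below mirrors the `id:amount:currency` payload used by `generateSignature`; the function names and the `SignedFields` type are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

interface SignedFields {
  conversionId: string;
  amount: number;
  currency: string;
}

// Same construction as generateSignature: HMAC-SHA256 over "id:amount:currency".
function signPostback(fields: SignedFields, secret: string): string {
  const payload = `${fields.conversionId}:${fields.amount}:${fields.currency}`;
  return createHmac('sha256', secret).update(payload).digest('hex');
}

// Hypothetical receiver-side check.
function verifyPostback(fields: SignedFields, signature: string, secret: string): boolean {
  const expected = Buffer.from(signPostback(fields, secret), 'hex');
  const received = Buffer.from(signature, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(expected, received);
}
```

Because the signature covers only three fields, a receiver following this scheme should trust only those fields from the payload.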
Key design decisions:
- Time window: Idempotency keys include an hourly window, not the exact timestamp
  - Prevents issues with slight clock skew
  - Allows legitimate retries within a reasonable timeframe
- Redis TTL: Keys expire after 24 hours
  - Balances memory usage with safety
  - Configurable per partner needs
- Store on attempt, not success: Even failed deliveries store the key
  - Prevents retry storms on intermittent failures
  - Partner might have processed before returning an error
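The key only prevents double payouts if the receiving side honors it. A partner-side handler might look like the following sketch, which records the outcome per `X-Idempotency-Key` and replays it on a repeat. The class name and the in-memory `Map` are assumptions; a real receiver would need shared, durable storage.

```typescript
// Hypothetical partner-side handler: applies each idempotency key at most once.
type HandleResult = { status: 'processed' | 'duplicate'; payout: number };

class IdempotentReceiver {
  // In-memory for illustration only.
  private seen = new Map<string, number>();

  handle(idempotencyKey: string, payout: number): HandleResult {
    const previous = this.seen.get(idempotencyKey);
    if (previous !== undefined) {
      // Replay the original outcome instead of paying out again.
      return { status: 'duplicate', payout: previous };
    }
    this.seen.set(idempotencyKey, payout);
    return { status: 'processed', payout };
  }
}

const receiver = new IdempotentReceiver();
console.log(receiver.handle('key-1', 25).status); // processed
console.log(receiver.handle('key-1', 25).status); // duplicate
```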
Circuit Breaker Pattern for Partner Health
When a partner's server goes down, you don't want to keep hammering it with requests. Circuit breakers stop requests to failing services.
enum CircuitState {
CLOSED = 'CLOSED', // Normal operation
OPEN = 'OPEN', // Blocking requests
HALF_OPEN = 'HALF_OPEN' // Testing recovery
}
interface CircuitBreakerConfig {
failureThreshold: number;
successThreshold: number;
timeout: number;
monitoringWindow: number;
}
class CircuitBreaker {
private state: CircuitState = CircuitState.CLOSED;
private failures: number = 0;
private successes: number = 0;
private lastFailureTime?: number;
private stateChangeTime: number = Date.now();
constructor(
private partnerId: string,
private config: CircuitBreakerConfig
) {}
async execute<T>(operation: () => Promise<T>): Promise<T> {
// Check if circuit should transition
this.checkStateTransition();
if (this.state === CircuitState.OPEN) {
throw new CircuitOpenError(
`Circuit breaker OPEN for partner ${this.partnerId}`
);
}
try {
const result = await operation();
this.onSuccess();
return result;
} catch (error) {
this.onFailure(error as Error);
throw error;
}
}
private onSuccess(): void {
this.failures = 0;
if (this.state === CircuitState.HALF_OPEN) {
this.successes++;
// Require multiple successes before closing
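if (this.successes >= this.config.successThreshold) {
// Measure time in the previous state before transitionTo() resets
// stateChangeTime; measuring afterwards always yields ~0
const downtime = Date.now() - this.stateChangeTime;
this.transitionTo(CircuitState.CLOSED);
this.successes = 0;
logger.info('Circuit breaker closed', {
partnerId: this.partnerId,
downtime
});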
}
}
}
private onFailure(error: Error): void {
this.failures++;
this.lastFailureTime = Date.now();
if (this.state === CircuitState.HALF_OPEN) {
// Single failure in half-open reopens circuit
this.transitionTo(CircuitState.OPEN);
this.successes = 0;
logger.warn('Circuit breaker reopened', {
partnerId: this.partnerId,
error: error.message
});
// Already reopened; skip the threshold check to avoid a second transition
return;
}
if (this.failures >= this.config.failureThreshold) {
this.transitionTo(CircuitState.OPEN);
logger.error('Circuit breaker opened', {
partnerId: this.partnerId,
failures: this.failures,
error: error.message
});
}
}
private checkStateTransition(): void {
if (this.state === CircuitState.OPEN && this.shouldAttemptReset()) {
// Measure before transitionTo() resets stateChangeTime
const downtime = Date.now() - this.stateChangeTime;
this.transitionTo(CircuitState.HALF_OPEN);
logger.info('Circuit breaker half-open', {
partnerId: this.partnerId,
downtime
});
}
}
private shouldAttemptReset(): boolean {
if (!this.lastFailureTime) return false;
return Date.now() - this.lastFailureTime >= this.config.timeout;
}
private transitionTo(newState: CircuitState): void {
const oldState = this.state;
this.state = newState;
this.stateChangeTime = Date.now();
// Emit metrics for monitoring
metrics.circuitBreakerStateChange(this.partnerId, oldState, newState);
}
getState(): CircuitState {
return this.state;
}
getMetrics() {
return {
state: this.state,
failures: this.failures,
successes: this.successes,
stateAge: Date.now() - this.stateChangeTime
};
}
}
class CircuitOpenError extends Error {
constructor(message: string) {
super(message);
this.name = 'CircuitOpenError';
}
}
Integration with delivery service:
class PartnerCircuitManager {
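private breakers: Map<string, CircuitBreaker> = new Map();
// deliverWithCircuitBreaker() below relies on this injected service
constructor(private deliveryService: PostbackDeliveryService) {}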
getBreaker(partnerId: string): CircuitBreaker {
if (!this.breakers.has(partnerId)) {
this.breakers.set(partnerId, new CircuitBreaker(partnerId, {
failureThreshold: 5,
successThreshold: 3,
timeout: 60000, // 1 minute
monitoringWindow: 10000
}));
}
return this.breakers.get(partnerId)!;
}
async deliverWithCircuitBreaker(
partner: Partner,
conversion: Conversion
): Promise<DeliveryResult> {
const breaker = this.getBreaker(partner.id);
try {
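return await breaker.execute(async () => {
const result = await this.deliveryService.deliver(conversion, partner);
// deliver() reports failures as a result instead of throwing, so
// surface them as errors or the breaker never counts a failure
if (result.status === 'failed') {
throw new Error(result.error ?? 'Postback delivery failed');
}
return result;
});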
} catch (error) {
if (error instanceof CircuitOpenError) {
// Don't queue for retry - circuit is open
return {
status: 'circuit_open',
partnerId: partner.id,
reason: 'Partner temporarily unavailable'
};
}
throw error;
}
}
getAllStates(): Map<string, CircuitState> {
const states = new Map<string, CircuitState>();
for (const [partnerId, breaker] of this.breakers) {
states.set(partnerId, breaker.getState());
}
return states;
}
}
Why circuit breakers matter:
- Prevent cascading failures: One bad partner doesn't DOS your system
- Faster failure detection: Stop wasting resources on down services
- Automatic recovery: Tests service health and closes the circuit again when the partner is ready
- Better user experience: Fail fast instead of timing out
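The three states are easier to reason about as a pure transition function with the clock passed in. This is a simplified sketch of the same state machine, not the class above; `step`, the `'tick'` event, and `openedAt` are names introduced here for illustration.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type BreakerEvent = 'success' | 'failure' | 'tick';

interface Breaker {
  state: BreakerState;
  failures: number;
  successes: number;
  openedAt: number; // ms timestamp of the last transition to OPEN
}

interface Thresholds {
  failureThreshold: number;
  successThreshold: number;
  timeout: number; // ms to wait in OPEN before probing
}

function step(b: Breaker, event: BreakerEvent, now: number, t: Thresholds): Breaker {
  switch (b.state) {
    case 'CLOSED': {
      if (event === 'success') return { ...b, failures: 0 };
      if (event === 'failure') {
        const failures = b.failures + 1;
        return failures >= t.failureThreshold
          ? { state: 'OPEN', failures, successes: 0, openedAt: now }
          : { ...b, failures };
      }
      return b;
    }
    case 'OPEN': {
      // Only elapsed time moves OPEN to HALF_OPEN; requests are blocked meanwhile.
      if (event === 'tick' && now - b.openedAt >= t.timeout) {
        return { ...b, state: 'HALF_OPEN', successes: 0 };
      }
      return b;
    }
    case 'HALF_OPEN': {
      // A single failed probe reopens the circuit and restarts the timeout.
      if (event === 'failure') {
        return { state: 'OPEN', failures: b.failures + 1, successes: 0, openedAt: now };
      }
      if (event === 'success') {
        const successes = b.successes + 1;
        return successes >= t.successThreshold
          ? { state: 'CLOSED', failures: 0, successes: 0, openedAt: b.openedAt }
          : { ...b, successes };
      }
      return b;
    }
  }
}
```

With `failureThreshold: 2` and `timeout: 1000`, two failures open the circuit, a tick after one second moves it to half-open, and `successThreshold` successful probes close it again.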
Advanced: Postback Template Engine
Partners need different data formats. Building a flexible template system was crucial.
interface PostbackTemplate {
url: string;
method: 'GET' | 'POST';
headers?: Record<string, string>;
bodyFormat?: 'json' | 'form' | 'xml';
macros: Record<string, string>;
}
class TemplateEngine {
private macros: Map<string, (conversion: Conversion) => string> = new Map();
constructor() {
this.registerDefaultMacros();
}
private registerDefaultMacros(): void {
// Register common macros
this.macros.set('CONVERSION_ID', (c) => c.id);
this.macros.set('AMOUNT', (c) => c.amount.toString());
this.macros.set('CURRENCY', (c) => c.currency);
this.macros.set('TIMESTAMP', (c) => new Date(c.createdAt).getTime().toString());
this.macros.set('USER_ID', (c) => c.userId);
this.macros.set('CLICK_ID', (c) => c.clickId || '');
this.macros.set('IP_ADDRESS', (c) => c.ipAddress);
this.macros.set('USER_AGENT', (c) => encodeURIComponent(c.userAgent));
// Advanced macros
this.macros.set('TRANSACTION_ID', (c) => c.transactionId || '');
this.macros.set('PRODUCT_ID', (c) => c.productId || '');
this.macros.set('CATEGORY', (c) => c.category || '');
this.macros.set('COUNTRY', (c) => c.country || '');
// Custom formatting
this.macros.set('AMOUNT_CENTS', (c) => (c.amount * 100).toFixed(0));
this.macros.set('ISO_TIMESTAMP', (c) => new Date(c.createdAt).toISOString());
}
compile(template: PostbackTemplate, conversion: Conversion): CompiledPostback {
// Replace macros in URL
let url = template.url;
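// Each entry maps a placeholder name to the macro that fills it
for (const [placeholder, macro] of Object.entries(template.macros)) {
const replacement = this.resolveMacro(macro, conversion);
// split/join replaces every occurrence without regex or `$` escaping concerns
url = url.split(`{${placeholder}}`).join(replacement);
}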
// Replace macros in headers
const headers: Record<string, string> = {};
if (template.headers) {
for (const [key, value] of Object.entries(template.headers)) {
headers[key] = this.replaceMacros(value, conversion);
}
}
// Build body based on format
let body: any = null;
if (template.method === 'POST') {
body = this.buildBody(template, conversion);
}
return { url, method: template.method, headers, body };
}
private resolveMacro(macro: string, conversion: Conversion): string {
const resolver = this.macros.get(macro);
if (!resolver) {
logger.warn('Unknown macro', { macro });
return '';
}
return resolver(conversion);
}
private replaceMacros(text: string, conversion: Conversion): string {
let result = text;
for (const [macro, resolver] of this.macros) {
result = result.replace(
new RegExp(`{${macro}}`, 'g'),
resolver(conversion)
);
}
return result;
}
private buildBody(template: PostbackTemplate, conversion: Conversion): any {
if (template.bodyFormat === 'json') {
return this.buildJsonBody(template.macros, conversion);
} else if (template.bodyFormat === 'form') {
return this.buildFormBody(template.macros, conversion);
} else if (template.bodyFormat === 'xml') {
return this.buildXmlBody(template.macros, conversion);
}
return null;
}
private buildJsonBody(macros: Record<string, string>, conversion: Conversion): any {
const body: any = {};
for (const [key, macro] of Object.entries(macros)) {
body[key] = this.resolveMacro(macro, conversion);
}
return body;
}
private buildFormBody(macros: Record<string, string>, conversion: Conversion): string {
const params = new URLSearchParams();
for (const [key, macro] of Object.entries(macros)) {
params.append(key, this.resolveMacro(macro, conversion));
}
return params.toString();
}
private buildXmlBody(macros: Record<string, string>, conversion: Conversion): string {
let xml = '<?xml version="1.0" encoding="UTF-8"?><postback>';
for (const [key, macro] of Object.entries(macros)) {
const value = this.resolveMacro(macro, conversion);
xml += `<${key}>${this.escapeXml(value)}</${key}>`;
}
xml += '</postback>';
return xml;
}
private escapeXml(text: string): string {
return text
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&apos;');
}
}
interface CompiledPostback {
url: string;
method: 'GET' | 'POST';
headers: Record<string, string>;
body: any;
}
Example usage:
// Partner 1: Simple GET request
const partner1Template: PostbackTemplate = {
url: 'https://partner1.com/postback?id={CONVERSION_ID}&amount={AMOUNT}&currency={CURRENCY}',
method: 'GET',
macros: {
CONVERSION_ID: 'CONVERSION_ID',
AMOUNT: 'AMOUNT',
CURRENCY: 'CURRENCY'
}
};
// Partner 2: POST with JSON body
const partner2Template: PostbackTemplate = {
url: 'https://partner2.com/api/conversions',
method: 'POST',
headers: {
'Authorization': 'Bearer {API_KEY}',
'Content-Type': 'application/json'
},
bodyFormat: 'json',
macros: {
transaction_id: 'CONVERSION_ID',
amount_cents: 'AMOUNT_CENTS',
click_id: 'CLICK_ID',
timestamp: 'ISO_TIMESTAMP'
}
};
// Partner 3: Form-encoded POST
const partner3Template: PostbackTemplate = {
url: 'https://partner3.com/track',
method: 'POST',
headers: {
'Content-Type': 'application/x-www-form-urlencoded'
},
bodyFormat: 'form',
macros: {
txid: 'TRANSACTION_ID',
payout: 'AMOUNT',
ip: 'IP_ADDRESS'
}
};
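To show what expansion produces for a GET template, here is a standalone sketch that replaces `{NAME}` placeholders with URL-encoded values. `expandUrl` is a hypothetical helper, separate from `TemplateEngine`, and the choice to leave unknown placeholders untouched is an assumption.

```typescript
// Hypothetical standalone expander: replaces {NAME} placeholders with
// URL-encoded values and leaves unknown placeholders untouched.
function expandUrl(template: string, values: Record<string, string>): string {
  return template.replace(/\{([A-Z_]+)\}/g, (match: string, key: string) => {
    const value = values[key];
    return value === undefined ? match : encodeURIComponent(value);
  });
}

const expanded = expandUrl(
  'https://partner1.com/postback?id={CONVERSION_ID}&amount={AMOUNT}&currency={CURRENCY}&sub={UNKNOWN}',
  { CONVERSION_ID: 'conv 42', AMOUNT: '19.99', CURRENCY: 'USD' }
);
console.log(expanded);
// https://partner1.com/postback?id=conv%2042&amount=19.99&currency=USD&sub={UNKNOWN}
```

Encoding at substitution time keeps values containing spaces, `&`, or `=` from corrupting the query string.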
Monitoring & Observability
You can't fix what you can't see. Comprehensive monitoring is critical.
interface DeliveryMetrics {
partnerId: string;
successCount: number;
failureCount: number;
averageLatency: number;
p95Latency: number;
p99Latency: number;
lastSuccessAt?: Date;
lastFailureAt?: Date;
circuitState: CircuitState;
queueDepth: number;
}
class MetricsCollector {
private metrics: Map<string, DeliveryMetrics> = new Map();
private latencies: Map<string, number[]> = new Map();
recordSuccess(partnerId: string, latency: number): void {
const metric = this.getOrCreateMetric(partnerId);
metric.successCount++;
metric.lastSuccessAt = new Date();
this.recordLatency(partnerId, latency);
this.updateLatencyPercentiles(partnerId);
}
recordFailure(partnerId: string, error: Error): void {
const metric = this.getOrCreateMetric(partnerId);
metric.failureCount++;
metric.lastFailureAt = new Date();
// Log to error tracking (Sentry, etc.)
logger.error('Postback delivery failed', {
partnerId,
error: error.message,
stack: error.stack
});
}
recordCircuitState(partnerId: string, state: CircuitState): void {
const metric = this.getOrCreateMetric(partnerId);
metric.circuitState = state;
}
recordQueueDepth(partnerId: string, depth: number): void {
const metric = this.getOrCreateMetric(partnerId);
metric.queueDepth = depth;
}
private recordLatency(partnerId: string, latency: number): void {
if (!this.latencies.has(partnerId)) {
this.latencies.set(partnerId, []);
}
const latencies = this.latencies.get(partnerId)!;
latencies.push(latency);
// Keep only last 1000 measurements
if (latencies.length > 1000) {
latencies.shift();
}
}
private updateLatencyPercentiles(partnerId: string): void {
const latencies = this.latencies.get(partnerId);
if (!latencies || latencies.length === 0) return;
const sorted = [...latencies].sort((a, b) => a - b);
const metric = this.getOrCreateMetric(partnerId);
metric.averageLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
metric.p95Latency = sorted[Math.floor(sorted.length * 0.95)];
metric.p99Latency = sorted[Math.floor(sorted.length * 0.99)];
}
getMetrics(partnerId: string): DeliveryMetrics | undefined {
return this.metrics.get(partnerId);
}
getAllMetrics(): DeliveryMetrics[] {
return Array.from(this.metrics.values());
}
getHealthCheck(): HealthCheckResult {
const metrics = this.getAllMetrics();
const openCircuits = metrics.filter(m => m.circuitState === CircuitState.OPEN);
const slowPartners = metrics.filter(m => m.p95Latency > 5000); // > 5s
const failingPartners = metrics.filter(m => {
const total = m.successCount + m.failureCount;
return total > 0 && (m.failureCount / total) > 0.1; // > 10% failure rate
});
return {
healthy: openCircuits.length === 0 && failingPartners.length === 0,
openCircuits: openCircuits.map(m => m.partnerId),
slowPartners: slowPartners.map(m => m.partnerId),
failingPartners: failingPartners.map(m => ({
partnerId: m.partnerId,
failureRate: (m.failureCount / (m.successCount + m.failureCount))
})),
totalQueueDepth: metrics.reduce((sum, m) => sum + m.queueDepth, 0)
};
}
private getOrCreateMetric(partnerId: string): DeliveryMetrics {
if (!this.metrics.has(partnerId)) {
this.metrics.set(partnerId, {
partnerId,
successCount: 0,
failureCount: 0,
averageLatency: 0,
p95Latency: 0,
p99Latency: 0,
circuitState: CircuitState.CLOSED,
queueDepth: 0
});
}
return this.metrics.get(partnerId)!;
}
}
interface HealthCheckResult {
healthy: boolean;
openCircuits: string[];
slowPartners: string[];
failingPartners: Array<{ partnerId: string; failureRate: number }>;
totalQueueDepth: number;
}
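The percentile arithmetic is easy to check on a small sample. The sketch below uses the nearest-rank definition, which can differ by one position from the `Math.floor(length * 0.95)` index used in `updateLatencyPercentiles`; the sample latencies are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Example latencies in milliseconds, with one slow outlier.
const latencies = [120, 80, 95, 2000, 110, 105, 90, 100, 85, 115];
console.log(percentile(latencies, 50)); // 100
console.log(percentile(latencies, 95)); // 2000
```

The single 2000 ms outlier leaves the median at 100 ms but dominates p95, which is why the health check looks at tail latency rather than the average.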
Real-time alerting:
class AlertManager {
constructor(
private metricsCollector: MetricsCollector,
private slackWebhook: string
) {
// Check health every minute
setInterval(() => this.checkAndAlert(), 60000);
}
async checkAndAlert(): Promise<void> {
const health = this.metricsCollector.getHealthCheck();
// Alert on open circuits
if (health.openCircuits.length > 0) {
await this.sendAlert({
severity: 'high',
title: 'Circuit Breakers Open',
message: `${health.openCircuits.length} partner(s) unavailable: ${health.openCircuits.join(', ')}`,
color: 'danger'
});
}
// Alert on high failure rates
for (const failing of health.failingPartners) {
if (failing.failureRate > 0.25) { // > 25%
await this.sendAlert({
severity: 'medium',
title: 'High Failure Rate',
message: `Partner ${failing.partnerId}: ${(failing.failureRate * 100).toFixed(1)}% failures`,
color: 'warning'
});
}
}
// Alert on queue backup
if (health.totalQueueDepth > 10000) {
await this.sendAlert({
severity: 'critical',
title: 'Queue Backup',
message: `${health.totalQueueDepth} postbacks pending delivery`,
color: 'danger'
});
}
// Alert on slow partners
if (health.slowPartners.length > 0) {
await this.sendAlert({
severity: 'low',
title: 'Slow Partners',
message: `${health.slowPartners.length} partner(s) responding slowly: ${health.slowPartners.join(', ')}`,
color: 'warning'
});
}
}
private async sendAlert(alert: Alert): Promise<void> {
try {
await fetch(this.slackWebhook, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
attachments: [{
color: alert.color,
title: `[${alert.severity.toUpperCase()}] ${alert.title}`,
text: alert.message,
ts: Math.floor(Date.now() / 1000)
}]
})
});
} catch (error) {
logger.error('Failed to send alert', { error });
}
}
}
interface Alert {
severity: 'low' | 'medium' | 'high' | 'critical';
title: string;
message: string;
color: 'good' | 'warning' | 'danger';
}
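Because `checkAndAlert` runs every minute, a condition that persists for an hour would post sixty identical alerts. One common mitigation is a per-alert cooldown. The sketch below is an assumption about how that could be added; `AlertThrottle` and its key scheme are hypothetical.

```typescript
// Hypothetical cooldown: suppress repeats of the same alert within a window.
class AlertThrottle {
  private lastSent = new Map<string, number>();
  private cooldownMs: number;
  private now: () => number;

  // The clock is injectable so the behavior can be tested without waiting.
  constructor(cooldownMs: number, now: () => number = Date.now) {
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true when the alert identified by `key` should be sent.
  shouldSend(key: string): boolean {
    const current = this.now();
    const last = this.lastSent.get(key);
    if (last !== undefined && current - last < this.cooldownMs) return false;
    this.lastSent.set(key, current);
    return true;
  }
}
```

In `sendAlert`, a guard such as `if (!throttle.shouldSend(alert.title)) return;` would limit each alert type to one message per cooldown window.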
Dashboard endpoint:
app.get('/api/postback-metrics', async (req, res) => {
const metrics = metricsCollector.getAllMetrics();
const health = metricsCollector.getHealthCheck();
res.json({
health,
metrics: metrics.map(m => ({
partnerId: m.partnerId,
successRate: m.successCount / (m.successCount + m.failureCount) || 0,
averageLatency: Math.round(m.averageLatency),
p95Latency: Math.round(m.p95Latency),
circuitState: m.circuitState,
queueDepth: m.queueDepth,
lastSuccess: m.lastSuccessAt?.toISOString(),
lastFailure: m.lastFailureAt?.toISOString()
}))
});
});
Database Schema for Audit Trail
Maintain a complete audit log for compliance and debugging:
-- Postback delivery logs
CREATE TABLE postback_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
conversion_id UUID NOT NULL REFERENCES conversions(id),
partner_id UUID NOT NULL REFERENCES partners(id),
-- Request details
idempotency_key VARCHAR(64) NOT NULL,
url TEXT NOT NULL,
method VARCHAR(10) NOT NULL,
headers JSONB,
body JSONB,
-- Response details
status VARCHAR(20) NOT NULL, -- 'success', 'failed', 'skipped', 'circuit_open'
http_status_code INTEGER,
response_body TEXT,
error_message TEXT,
-- Timing
latency_ms INTEGER,
attempt_number INTEGER NOT NULL DEFAULT 1,
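created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
-- Indexes for queries (PostgreSQL has no inline INDEX clause, so they are created separately)
CREATE INDEX idx_postback_logs_conversion_partner ON postback_logs (conversion_id, partner_id);
CREATE INDEX idx_postback_logs_status_created ON postback_logs (status, created_at);
CREATE INDEX idx_postback_logs_idempotency ON postback_logs (idempotency_key);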
-- Dead letter queue for failed deliveries
CREATE TABLE postback_dlq (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
conversion_id UUID NOT NULL,
partner_id UUID NOT NULL,
-- Failure details
error_message TEXT NOT NULL,
attempts INTEGER NOT NULL,
last_attempt_at TIMESTAMP NOT NULL,
-- Original payload
payload JSONB NOT NULL,
-- Status tracking
status VARCHAR(20) NOT NULL DEFAULT 'pending', -- 'pending', 'retrying', 'resolved', 'abandoned'
resolved_at TIMESTAMP,
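created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
-- Index names must be unique within a PostgreSQL schema, so prefix them with the table name
CREATE INDEX idx_postback_dlq_status_created ON postback_dlq (status, created_at);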
-- Partner health metrics (time-series data)
CREATE TABLE partner_metrics (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
partner_id UUID NOT NULL REFERENCES partners(id),
-- Metrics window
window_start TIMESTAMP NOT NULL,
window_end TIMESTAMP NOT NULL,
-- Counters
success_count INTEGER NOT NULL DEFAULT 0,
failure_count INTEGER NOT NULL DEFAULT 0,
-- Latency stats
avg_latency_ms INTEGER,
p95_latency_ms INTEGER,
p99_latency_ms INTEGER,
-- Circuit breaker state
circuit_state VARCHAR(20),
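created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
-- PostgreSQL has no inline INDEX clause, so the index is created separately
CREATE INDEX idx_partner_metrics_partner_window ON partner_metrics (partner_id, window_start);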
TypeScript repository:
class PostbackLogRepository {
constructor(private db: Database) {}
async logAttempt(log: PostbackLogEntry): Promise<void> {
await this.db.query(`
INSERT INTO postback_logs (
conversion_id, partner_id, idempotency_key,
url, method, headers, body,
status, http_status_code, response_body, error_message,
latency_ms, attempt_number
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13)
`, [
log.conversionId,
log.partnerId,
log.idempotencyKey,
log.url,
log.method,
JSON.stringify(log.headers),
JSON.stringify(log.body),
log.status,
log.httpStatusCode,
log.responseBody,
log.errorMessage,
log.latencyMs,
log.attemptNumber
]);
}
async getDeliveryHistory(
conversionId: string,
partnerId: string
): Promise<PostbackLogEntry[]> {
const result = await this.db.query(`
SELECT * FROM postback_logs
WHERE conversion_id = $1 AND partner_id = $2
ORDER BY created_at DESC
`, [conversionId, partnerId]);
return result.rows;
}
async getFailedDeliveries(limit: number = 100): Promise<PostbackLogEntry[]> {
const result = await this.db.query(`
SELECT * FROM postback_logs
WHERE status = 'failed'
ORDER BY created_at DESC
LIMIT $1
`, [limit]);
return result.rows;
}
}
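`SELECT *` returns the table's snake_case column names, while `logAttempt` reads camelCase properties, so rows usually need a mapping step before they are returned as log entries. A minimal sketch, assuming a subset of columns and a hypothetical `LogEntry` shape (the text does not define `PostbackLogEntry`):

```typescript
// Assumed row shape for a subset of postback_logs columns.
interface PostbackLogRow {
  conversion_id: string;
  partner_id: string;
  idempotency_key: string;
  status: string;
  http_status_code: number | null;
  latency_ms: number | null;
  attempt_number: number;
  created_at: Date;
}

// Assumed application-side shape, for illustration only.
interface LogEntry {
  conversionId: string;
  partnerId: string;
  idempotencyKey: string;
  status: string;
  httpStatusCode?: number;
  latencyMs?: number;
  attemptNumber: number;
  createdAt: Date;
}

function rowToLogEntry(row: PostbackLogRow): LogEntry {
  return {
    conversionId: row.conversion_id,
    partnerId: row.partner_id,
    idempotencyKey: row.idempotency_key,
    status: row.status,
    // SQL NULL becomes an absent optional field.
    httpStatusCode: row.http_status_code ?? undefined,
    latencyMs: row.latency_ms ?? undefined,
    attemptNumber: row.attempt_number,
    createdAt: row.created_at,
  };
}
```

With this in place, a repository method could end with `return result.rows.map(rowToLogEntry);` instead of returning raw rows.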
Performance Optimization Techniques
1. Batch Processing for Multiple Partners
class BatchPostbackProcessor {
async processConversion(conversion: Conversion, partners: Partner[]): Promise<void> {
// Group partners by priority
const highPriority = partners.filter(p => p.priority === 'high');
const normalPriority = partners.filter(p => p.priority === 'normal');
// Send high priority immediately
await Promise.all(
highPriority.map(partner =>
postbackQueue.add('send-postback', {
conversionId: conversion.id,
partnerId: partner.id
}, {
priority: 1 // Higher priority in queue
})
)
);
// Batch normal priority (reduces queue overhead)
if (normalPriority.length > 0) {
await postbackQueue.add('send-batch-postback', {
conversionId: conversion.id,
partnerIds: normalPriority.map(p => p.id)
}, {
priority: 10
});
}
}
}
// Worker processes batches
const worker = new Worker('postbacks', async (job) => {
if (job.name === 'send-batch-postback') {
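const { conversionId, partnerIds } = job.data;
// deliver() expects full objects, so load the conversion before fanning out
const conversion = await getConversion(conversionId);
// Process in parallel with concurrency limit
await Promise.all(
partnerIds.map((partnerId: string) =>
limiter.schedule(async () => {
// partnerConfigCache is an assumed instance of PartnerConfigCache (section 3)
const partner = await partnerConfigCache.getPartner(partnerId);
return deliveryService.deliver(conversion, partner);
})
)
);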
}
}, { connection });
// Rate limiter for parallel processing
import Bottleneck from 'bottleneck';
const limiter = new Bottleneck({
maxConcurrent: 50, // Max 50 concurrent requests
minTime: 20 // Minimum 20ms between requests
});
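The concurrency cap itself is a small piece of logic. For readers who want to see what a limiter does without the library, here is a dependency-free sketch; `mapWithConcurrency` is a hypothetical helper and does not replicate Bottleneck's `minTime` spacing.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  if (limit < 1) throw new RangeError('limit must be at least 1');
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index]);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

If any call rejects, the returned promise rejects with that error, so callers that want every partner attempted would catch errors inside `fn` and return a result object instead.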
2. Connection Pooling
import axios from 'axios';
import http from 'http';
import https from 'https';
// Reuse HTTP connections
const httpAgent = new http.Agent({
keepAlive: true,
maxSockets: 100,
maxFreeSockets: 10,
timeout: 60000
});
const httpsAgent = new https.Agent({
keepAlive: true,
maxSockets: 100,
maxFreeSockets: 10,
timeout: 60000
});
const axiosInstance = axios.create({
httpAgent,
httpsAgent,
timeout: 10000
});
class HTTPClient {
async post(url: string, data: any, headers: any): Promise<any> {
return axiosInstance.post(url, data, { headers });
}
}
3. Caching Partner Configurations
class PartnerConfigCache {
private cache: Map<string, Partner> = new Map();
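private ttl: number = 300000; // 5 minutes
// getPartner() reads this.db; a Prisma-style client is assumed here
constructor(private db: PrismaClient) {}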
async getPartner(partnerId: string): Promise<Partner> {
// Check cache first
const cached = this.cache.get(partnerId);
if (cached) {
return cached;
}
// Fetch from database
const partner = await this.db.partner.findUnique({
where: { id: partnerId }
});
if (!partner) {
throw new Error(`Partner ${partnerId} not found`);
}
// Cache for future use
this.cache.set(partnerId, partner);
// Auto-expire after TTL
setTimeout(() => {
this.cache.delete(partnerId);
}, this.ttl);
return partner;
}
invalidate(partnerId: string): void {
this.cache.delete(partnerId);
}
clear(): void {
this.cache.clear();
}
}
Testing Strategy
Unit Tests
import { describe, it, expect, jest, beforeEach } from '@jest/globals';
describe('RetryManager', () => {
let retryManager: RetryManager;
beforeEach(() => {
retryManager = new RetryManager({
maxAttempts: 3,
baseDelay: 100,
maxDelay: 1000,
backoffMultiplier: 2,
jitter: false
});
});
it('should succeed on first attempt', async () => {
const operation = jest.fn().mockResolvedValue('success');
const result = await retryManager.executeWithRetry(operation, 'test');
expect(result).toBe('success');
expect(operation).toHaveBeenCalledTimes(1);
});
it('should retry on transient failure', async () => {
const operation = jest.fn()
.mockRejectedValueOnce(new Error('timeout'))
.mockResolvedValueOnce('success');
const result = await retryManager.executeWithRetry(operation, 'test');
expect(result).toBe('success');
expect(operation).toHaveBeenCalledTimes(2);
});
it('should not retry on client error', async () => {
const error = new Error('400 Bad Request');
const operation = jest.fn().mockRejectedValue(error);
await expect(
retryManager.executeWithRetry(operation, 'test')
).rejects.toThrow('400 Bad Request');
expect(operation).toHaveBeenCalledTimes(1);
});
it('should throw after max attempts', async () => {
const operation = jest.fn().mockRejectedValue(new Error('503'));
await expect(
retryManager.executeWithRetry(operation, 'test')
).rejects.toThrow('503');
expect(operation).toHaveBeenCalledTimes(3);
});
});
describe('CircuitBreaker', () => {
let circuitBreaker: CircuitBreaker;
beforeEach(() => {
circuitBreaker = new CircuitBreaker('test-partner', {
failureThreshold: 2,
successThreshold: 3,
timeout: 1000,
monitoringWindow: 500
});
});
it('should open after threshold failures', async () => {
const operation = jest.fn().mockRejectedValue(new Error('failure'));
// First failure
await expect(circuitBreaker.execute(operation)).rejects.toThrow();
expect(circuitBreaker.getState()).toBe(CircuitState.CLOSED);
// Second failure - opens circuit
await expect(circuitBreaker.execute(operation)).rejects.toThrow();
expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
// Third call blocked
await expect(circuitBreaker.execute(operation)).rejects.toThrow('Circuit breaker is OPEN');
expect(operation).toHaveBeenCalledTimes(2);
});
it('should transition to half-open after timeout', async () => {
// Open the circuit
const operation = jest.fn().mockRejectedValue(new Error('failure'));
await expect(circuitBreaker.execute(operation)).rejects.toThrow();
await expect(circuitBreaker.execute(operation)).rejects.toThrow();
expect(circuitBreaker.getState()).toBe(CircuitState.OPEN);
// Wait for timeout
await new Promise(resolve => setTimeout(resolve, 1100));
// Should attempt in half-open
operation.mockResolvedValue('success');
await circuitBreaker.execute(operation);
expect(circuitBreaker.getState()).toBe(CircuitState.HALF_OPEN);
});
});
Integration Tests
describe('Postback Delivery Integration', () => {
let testServer: Server;
let requestLog: any[] = [];
beforeAll(async () => {
// Start mock partner server
testServer = createTestServer((req, res) => {
requestLog.push({
method: req.method,
url: req.url,
headers: req.headers,
body: req.body
});
res.status(200).json({ success: true });
});
});
afterAll(() => {
testServer.close();
});
beforeEach(() => {
requestLog = [];
});
it('should deliver postback with retry on failure', async () => {
let attempts = 0;
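testServer.setHandler((req, res) => {
// This handler replaces the default one, so it must log requests itself
requestLog.push({ method: req.method, url: req.url });
attempts++;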
if (attempts < 3) {
res.status(500).json({ error: 'Server error' });
} else {
res.status(200).json({ success: true });
}
});
const conversion = createTestConversion();
const partner = createTestPartner({ url: testServer.url });
await deliveryService.deliver(conversion, partner);
expect(attempts).toBe(3);
expect(requestLog.length).toBe(3);
});
it('should prevent duplicate delivery with idempotency', async () => {
const conversion = createTestConversion();
const partner = createTestPartner({ url: testServer.url });
// Send twice
await deliveryService.deliver(conversion, partner);
await deliveryService.deliver(conversion, partner);
// Should only deliver once
expect(requestLog.length).toBe(1);
});
it('should apply template correctly', async () => {
const conversion = createTestConversion({
id: 'conv-123',
amount: 50.00,
currency: 'USD'
});
const partner = createTestPartner({
url: testServer.url + '/track',
template: {
url: '{url}?txid={CONVERSION_ID}&amount={AMOUNT}&currency={CURRENCY}',
method: 'GET',
macros: {
CONVERSION_ID: 'CONVERSION_ID',
AMOUNT: 'AMOUNT',
CURRENCY: 'CURRENCY'
}
}
});
await deliveryService.deliver(conversion, partner);
expect(requestLog[0].url).toContain('txid=conv-123');
expect(requestLog[0].url).toContain('amount=50');
expect(requestLog[0].url).toContain('currency=USD');
});
});
Deployment Architecture
Docker Compose Setup
version: '3.8'
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://user:pass@postgres:5432/postback
      REDIS_URL: redis://redis:6379
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1'
          memory: 1G
  worker:
    build: .
    command: node dist/worker.js
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://user:pass@postgres:5432/postback
      REDIS_URL: redis://redis:6379
      WORKER_CONCURRENCY: 10
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '2'
          memory: 2G
  postgres:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postback
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    ports:
      - "5432:5432"
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    # BullMQ requires noeviction: an LRU policy can silently evict queued jobs
    command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy noeviction
    ports:
      - "6379:6379"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
volumes:
  postgres_data:
  redis_data:
  grafana_data:
Environment Configuration
# .env.production
NODE_ENV=production
# Database
DATABASE_URL=postgresql://user:pass@postgres:5432/postback
DB_POOL_MIN=5
DB_POOL_MAX=20
# Redis
REDIS_URL=redis://redis:6379
REDIS_MAX_RETRIES=3
# Queue Configuration
QUEUE_CONCURRENCY=10
QUEUE_MAX_JOBS_PER_WORKER=100
# Circuit Breaker
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=60000
# Retry Configuration
RETRY_MAX_ATTEMPTS=10
RETRY_BASE_DELAY=2000
RETRY_MAX_DELAY=60000
# Monitoring
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SENTRY_DSN=https://your-sentry-dsn
# Rate Limiting
RATE_LIMIT_PER_PARTNER=1000
RATE_LIMIT_WINDOW=60000
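To get a feel for what the retry settings above mean in practice, the sketch below computes the wait before each retry. It assumes plain exponential backoff with a multiplier of 2 and no jitter; the multiplier is an assumption, since the env file does not set one, and `backoffDelay` and `retrySchedule` are hypothetical helper names.

```typescript
interface RetryConfig {
  maxAttempts: number;       // total attempts, including the first
  baseDelay: number;         // ms
  maxDelay: number;          // ms
  backoffMultiplier: number;
}

// Delay before retry number `attempt` (1-based):
// baseDelay * multiplier^(attempt - 1), capped at maxDelay.
function backoffDelay(attempt: number, cfg: RetryConfig): number {
  const raw = cfg.baseDelay * Math.pow(cfg.backoffMultiplier, attempt - 1);
  return Math.min(raw, cfg.maxDelay);
}

// With maxAttempts total attempts there are maxAttempts - 1 waits between them.
function retrySchedule(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < cfg.maxAttempts; attempt++) {
    delays.push(backoffDelay(attempt, cfg));
  }
  return delays;
}

const cfg: RetryConfig = {
  maxAttempts: 10,      // RETRY_MAX_ATTEMPTS
  baseDelay: 2000,      // RETRY_BASE_DELAY
  maxDelay: 60000,      // RETRY_MAX_DELAY
  backoffMultiplier: 2  // assumed, not in the env file
};

const schedule = retrySchedule(cfg);
const totalWaitMs = schedule.reduce((sum, d) => sum + d, 0);
console.log(schedule, totalWaitMs);
```

Under these assumptions the delays hit the 60-second cap on the sixth retry, and all ten attempts are used up after about five minutes of waiting (302,000 ms). If partner outages typically last longer than that, the dead letter queue ends up doing most of the work.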
Real-World Lessons & Gotchas
1. The Timeout Cascade Problem
What happened: A partner's server was taking 29 seconds to respond, just under my 30-second timeout. Under load, every worker ended up stuck waiting on slow responses.
Solution: Adaptive timeouts per partner based on historical latency:
class AdaptiveTimeoutManager {
getTimeout(partnerId: string): number {
const metrics = metricsCollector.getMetrics(partnerId);
if (!metrics) return 10000; // Default 10s
// Timeout = P95 latency + 50% buffer, capped at 30s
const timeout = Math.min(metrics.p95Latency * 1.5, 30000);
return Math.max(timeout, 5000); // Minimum 5s
}
}
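The `p95Latency` figure used above has to come from somewhere. Here is a minimal sketch of one way to derive it, assuming a fixed-size window of recent latency samples per partner and the nearest-rank percentile method. `LatencyWindow` and `adaptiveTimeout` are hypothetical names, not part of the `metricsCollector` used in this post.

```typescript
// Hypothetical sliding window of recent latencies (ms) for one partner.
class LatencyWindow {
  private samples: number[] = [];
  private maxSamples: number;

  constructor(maxSamples: number = 1000) {
    this.maxSamples = maxSamples;
  }

  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > this.maxSamples) {
      this.samples.shift(); // drop the oldest sample
    }
  }

  // Nearest-rank percentile: the smallest sample such that at least p% of
  // samples are <= it. Returns undefined when there is no data yet.
  percentile(p: number): number | undefined {
    if (this.samples.length === 0) return undefined;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
  }
}

// Same clamping rule as getTimeout above: P95 * 1.5, between 5s and 30s.
function adaptiveTimeout(window: LatencyWindow): number {
  const p95 = window.percentile(95);
  if (p95 === undefined) return 10000; // default 10s when there is no history
  return Math.max(Math.min(p95 * 1.5, 30000), 5000);
}

const fast = new LatencyWindow();
for (let i = 1; i <= 100; i++) fast.record(i * 10); // 10ms .. 1000ms

const slow = new LatencyWindow();
for (let i = 0; i < 100; i++) slow.record(29000);   // the 29-second partner

console.log(fast.percentile(95), adaptiveTimeout(fast), adaptiveTimeout(slow));
```

One thing the sketch makes visible: a partner that consistently answers in 29 seconds still lands on the 30-second cap under this rule. The adaptive timeout helps most with partners that are normally fast, so a lower cap or a per-partner concurrency limit may still be needed for the chronically slow ones.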
2. The Duplicate Payment Disaster
What happened: Partner's load balancer returned 502 after processing. We retried. They charged twice.
Solution: Server-side idempotency with a longer TTL:
// Store idempotency key for 7 days (604800 seconds), not 24 hours
await idempotencyManager.checkAndStore(key, 604800);
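A minimal sketch of what a `checkAndStore`-style check could look like, assuming the key is derived deterministically from the conversion and partner IDs so that every retry of the same delivery produces the same key. The in-memory `Map` stands in for Redis, and `idempotencyKey` and `InMemoryIdempotencyStore` are hypothetical names; the real `idempotencyManager` may work differently.

```typescript
// Hypothetical helper: every retry of the same delivery yields the same key.
function idempotencyKey(conversionId: string, partnerId: string): string {
  return `postback:${conversionId}:${partnerId}`;
}

// In-memory stand-in for a Redis-backed store (assumption for illustration).
class InMemoryIdempotencyStore {
  private expiresAt = new Map<string, number>();

  // Returns true when the key is new (and records it), false when the key
  // was already stored and its TTL has not yet elapsed.
  checkAndStore(key: string, ttlSeconds: number, nowMs: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > nowMs) {
      return false;
    }
    this.expiresAt.set(key, nowMs + ttlSeconds * 1000);
    return true;
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const SEVEN_DAYS_S = 7 * 24 * 60 * 60; // 604800 seconds
const ONE_DAY_S = 24 * 60 * 60;        // 86400 seconds

const key = idempotencyKey('conv-123', 'partner-9');

// 7-day TTL: a retry two days later is still recognized as a duplicate.
const store = new InMemoryIdempotencyStore();
const first = store.checkAndStore(key, SEVEN_DAYS_S, 0);
const retryDay2 = store.checkAndStore(key, SEVEN_DAYS_S, 2 * DAY_MS);
const retryDay8 = store.checkAndStore(key, SEVEN_DAYS_S, 8 * DAY_MS);

// 24-hour TTL: the same retry two days later slips through as "new".
const shortStore = new InMemoryIdempotencyStore();
shortStore.checkAndStore(key, ONE_DAY_S, 0);
const shortRetryDay2 = shortStore.checkAndStore(key, ONE_DAY_S, 2 * DAY_MS);

console.log(first, retryDay2, retryDay8, shortRetryDay2);
```

A check like this only stops our side from sending the same delivery twice. In the 502 case the first send was recorded as failed, so the retry is legitimate from our point of view. The partner can only avoid charging twice if it receives the same key on both requests (for example in a request header) and deduplicates on it.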
3. The Redis Memory Explosion
What happened: Queued 1M jobs during partner outage. Redis ran out of memory.
Solution: Queue size limits with backpressure:
const queueSize = await postbackQueue.count();
if (queueSize > 100000) {
// Stop accepting new jobs
throw new QueueOverloadError('Postback queue is full');
}
4. The Silent Failure
What happened: Partner changed their API contract. Postbacks "succeeded" (200 OK) but weren't processed.
Solution: Validate partner responses:
async function validateResponse(response: any, partner: Partner): Promise<boolean> {
if (partner.responseValidation) {
// Check for expected fields
const valid = partner.responseValidation.requiredFields.every(
field => response[field] !== undefined
);
if (!valid) {
logger.warn('Invalid response from partner', {
partnerId: partner.id,
response
});
return false;
}
}
return true;
}
What This Enables Beyond Affiliate Marketing
The patterns you've learned apply to:
1. Payment Webhooks
// Stripe-style webhook delivery
class PaymentWebhookService extends PostbackDeliveryService {
async notifyMerchant(transaction: Transaction, merchant: Merchant) {
return this.deliver({
event: 'payment.succeeded',
transaction_id: transaction.id,
amount: transaction.amount
}, merchant.webhookUrl);
}
}
2. Microservices Event Bus
// Internal event delivery between services
class EventBus extends PostbackDeliveryService {
async publish(event: DomainEvent) {
const subscribers = await this.getSubscribers(event.type);
for (const subscriber of subscribers) {
await this.deliver(event, subscriber.endpoint);
}
}
}
3. IoT Device Notifications
// Real-time device event delivery
class IoTEventDelivery extends PostbackDeliveryService {
async notifyDevice(deviceId: string, command: DeviceCommand) {
const device = await this.getDevice(deviceId);
return this.deliver(command, device.callbackUrl);
}
}
4. SaaS Integration Platform
// Zapier-style integration delivery
class IntegrationEngine extends PostbackDeliveryService {
async triggerWorkflow(trigger: Trigger, data: any) {
const workflows = await this.getWorkflows(trigger.type);
for (const workflow of workflows) {
await this.deliver(data, workflow.webhookUrl);
}
}
}
Performance Benchmarks
From my production system:
| Metric | Value |
|---|---|
| Throughput | 100K postbacks/hour |
| Latency (P50) | 250ms |
| Latency (P95) | 1.2s |
| Latency (P99) | 3.5s |
| Success Rate | 99.2% |
| Worker Count | 5 instances |
| Queue Depth (avg) | 2,500 jobs |
| Memory per Worker | 512MB |
| CPU per Worker | 0.5 cores |
| Cost per Million Postbacks | ~$15 |
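A quick sanity check on these numbers, assuming the system is in steady state so that Little's law applies (average queue depth = throughput times average time in queue). The variable names are illustrative.

```typescript
const postbacksPerHour = 100000;
const avgQueueDepth = 2500;
const workerCount = 5;

// Overall and per-worker delivery rate
const perSecond = postbacksPerHour / 3600;          // about 27.8 postbacks/s
const perWorkerPerSecond = perSecond / workerCount; // about 5.6 postbacks/s

// Little's law: time in queue = queue depth / throughput
const avgWaitSeconds = avgQueueDepth / perSecond;   // about 90 s

console.log(
  perSecond.toFixed(1),
  perWorkerPerSecond.toFixed(1),
  avgWaitSeconds.toFixed(0)
);
```

If both averages hold, a job spends roughly 90 seconds queued before it is delivered. That suggests the latency rows measure the HTTP request itself rather than time spent in the queue, though the table does not say which.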
Summary: Key Takeaways
Building a production-grade postback system taught me:
1. Never Trust the Network
- Implement retries with exponential backoff
- Use circuit breakers to prevent cascading failures
- Always have timeouts (and make them adaptive)
2. Idempotency is Non-Negotiable
- Generate unique keys for every delivery
- Store them with appropriate TTL
- Protect against duplicates at all costs
3. Queues are Your Safety Net
- Decouple ingestion from delivery
- Persist jobs to prevent data loss
- Scale workers independently from API
4. Monitor Everything
- Track success rates per partner
- Alert on anomalies proactively
- Maintain audit logs for debugging
5. Design for Failure
- Failures will happen—plan for them
- Have fallback strategies (DLQ, manual retry)
- Test failure scenarios regularly
6. Performance Matters
- Use connection pooling
- Batch when possible
- Cache partner configurations
- Monitor resource usage
7. These Patterns are Universal
- Webhooks, event buses, notifications—all need this
- The principles apply to any event delivery system
- Learn once, apply everywhere
What's Next?
If I were to extend this system, I'd add:
- Multi-region deployment for global latency reduction
- GraphQL subscriptions for real-time delivery status
- ML-based anomaly detection for partner health
- A/B testing framework for postback variations
- Webhook debugging tools (request inspection, replay)
Resources & References
- BullMQ Documentation: https://docs.bullmq.io/
- Stripe Webhook Best Practices: https://stripe.com/docs/webhooks/best-practices
- AWS SQS vs Redis for queues: Performance comparison
- Circuit Breaker Pattern: Martin Fowler's article
- Idempotency in Distributed Systems: Engineering blog posts
Built something similar? Have questions? The patterns here are battle-tested in production handling millions of events. The principles apply whether you're building affiliate tracking, payment webhooks, or any event-driven system.
Remember: Reliability isn't a feature—it's a requirement. 🚀