Sendry — Event-Driven API Observability Platform

Overview

Sendry is a production-grade, self-hosted API observability platform engineered to capture, aggregate, and visualize high-throughput API metrics without introducing latency overhead or blocking application request loops.

The platform solves a fundamental problem in modern software engineering: how to monitor systems comprehensively without degrading the performance of the applications being monitored.

By leveraging an event-driven, fully decoupled architecture, Sendry ingests API telemetry in under 2 milliseconds, publishes events to a durable message queue, and processes analytics asynchronously without touching the critical path of production requests.

Core Design Principle:

Observability systems must never compromise the reliability or performance of the systems they monitor.

The Problem

Most observability platforms fall into one of three categories:

Category 1: Expensive SaaS Solutions. Enterprise APM tools charge premium rates per host or per event. For teams with high traffic volumes, costs become prohibitive. Monthly bills can exceed infrastructure costs themselves.

Category 2: Complex Self-Hosted Systems. Open-source alternatives like Prometheus or ELK require significant operational expertise and infrastructure investment. They demand dedicated DevOps resources to maintain, scale, and troubleshoot.

Category 3: Performance-Blocking Solutions. Synchronous instrumentation introduces measurable latency. Every database write, every HTTP call, and every aggregation operation during request execution delays the response to users. This creates an unacceptable trade-off between visibility and performance.

What Engineers Actually Need

Teams don't require enterprise APM complexity. They need answers to operational questions:

Which API endpoints generate the most traffic?
What is the current system error rate?
Which services experience performance degradation?
Where are the reliability bottlenecks?

Sendry was designed to provide precisely these insights without the complexity, cost, or performance penalty of existing solutions.

System Architecture

Design Overview

Sendry implements a four-layer event-driven architecture that completely decouples telemetry ingestion from data persistence and analytics processing:

Production Application
        │
        ├─ Sendry SDK Middleware (Captures metrics)
        │
        ▼
Ingestion API (Express.js)
        │
        ├─ API Key Validation
        ├─ Payload Parsing
        │
        ▼
RabbitMQ Message Queue
        │
        ├─ Durable buffering
        ├─ Traffic spike absorption
        │
        ▼
Background Consumer Worker
        │
        ├─ Circuit breaker protection
        ├─ Retry mechanisms
        ├─ Deduplication logic
        │
        ├──────────────────┬──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
    MongoDB          PostgreSQL         Dead-Letter Queue
   (Raw Events)    (Aggregated Metrics)   (Failed Events)
        │                  │                  │
        └──────────────────┴──────────────────┘
                     │
                     ▼
              Analytics API
                     │
                     ▼
            Real-Time Dashboard

Layer Responsibilities

Ingest Layer: Validates incoming telemetry, verifies API credentials, and publishes events to the message queue. It returns 202 Accepted immediately, without waiting for any downstream work, which keeps ingest response times under the 2 ms target regardless of processing load.

Queue Layer: RabbitMQ absorbs traffic spikes and provides durable storage. If consumer workers become overwhelmed, the queue buffers events without blocking the production application. Consumers use manual acknowledgments, so an event leaves the queue only after it has been processed successfully.

Processing Layer: Background consumer workers pull events from the queue, validate schemas, perform deduplication, and coordinate dual-write persistence. They retry transient database failures using exponential backoff with jitter.

Storage Layer: MongoDB stores raw, unstructured event payloads with automatic 30-day expiration. PostgreSQL maintains pre-aggregated hourly time-series metrics optimized for analytical queries. This separation allows independent optimization of write-heavy and read-heavy workloads.

Architecture Diagrams

[Diagram: System Architecture]

[Diagram: Ingestion Sequence]

Core Technical Features

Lightweight Monitoring SDK

Applications integrate a minimal Express.js middleware that automatically captures relevant telemetry:

Request endpoint and HTTP method
Response status code
Latency measurement (sub-millisecond precision)
Client IP address and user agent
Request size and response size

Critically, metrics are collected and sent asynchronously after the HTTP response has already been delivered to the client. Because the send happens off the response path, telemetry adds no latency to the response itself.
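The deferred-send idea can be sketched as follows. This is a minimal illustration, not Sendry's actual SDK: the names createTelemetryMiddleware, TelemetryRecord, and send are hypothetical, and the request/response shapes are trimmed-down stand-ins modeled on Express's req and res objects. It reads time through an injectable clock that defaults to Date.now(); an SDK aiming for sub-millisecond precision would likely swap in a high-resolution timer such as process.hrtime.bigint().

```typescript
// Minimal shapes for the parts of the request/response this sketch touches.
// They are stand-ins modeled on Express's req/res; no Express import is needed.
interface RequestLike {
  method: string;
  path: string;
}

interface ResponseLike {
  statusCode: number;
  on(eventName: "finish", listener: () => void): unknown;
}

// Hypothetical telemetry record; field names are illustrative only.
interface TelemetryRecord {
  method: string;
  endpoint: string;
  statusCode: number;
  latencyMs: number;
}

type SendFn = (record: TelemetryRecord) => void;

// Returns a middleware that starts a timer when the request arrives and
// reports the measurement only after the response has finished.
function createTelemetryMiddleware(
  send: SendFn,
  now: () => number = () => Date.now(),
) {
  return (req: RequestLike, res: ResponseLike, next: () => void): void => {
    const startedAt = now();
    res.on("finish", () => {
      try {
        send({
          method: req.method,
          endpoint: req.path,
          statusCode: res.statusCode,
          latencyMs: now() - startedAt,
        });
      } catch {
        // Telemetry failures are swallowed: monitoring must not break the app.
      }
    });
    // Hand control to the application immediately; nothing above blocks.
    next();
  };
}
```

With Express, the returned function would be registered through app.use(...). The try/catch around send reflects the core design principle: a telemetry failure is dropped rather than surfaced to the monitored application.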

Decoupled Event Processing Pipeline

Rather than processing telemetry synchronously, Sendry implements an event-driven architecture:

Ingest API receives telemetry payload
Validates API key against credential store
Publishes event to RabbitMQ queue
Returns HTTP 202 Accepted immediately
Consumer pulls from queue independently
Processes and persists to dual databases

This separation means traffic spikes are absorbed by the queue without overwhelming the application. Peak loads are handled gracefully through buffering rather than backpressure.
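The ingest side of this pipeline reduces to a small request handler. The sketch below models it as a plain function so that it runs without Express or a broker. The names handleIngest and IngestDeps are hypothetical, and the 401 and 400 status codes are assumptions made for illustration; the source only specifies 202 Accepted on success and 503 when the queue is unavailable.

```typescript
interface IngestRequest {
  apiKey: string | undefined;
  body: unknown;
}

// Hypothetical dependencies, injected so the handler stays testable.
interface IngestDeps {
  // Looks the key up in the credential store.
  isValidApiKey(key: string): boolean;
  // Hands the serialized event to the queue; throws if the broker is unavailable.
  publish(payload: string): void;
}

interface IngestResponse {
  status: number;
}

function handleIngest(req: IngestRequest, deps: IngestDeps): IngestResponse {
  // Step 1: reject requests without a valid API key (401 is an assumption).
  if (!req.apiKey || !deps.isValidApiKey(req.apiKey)) {
    return { status: 401 };
  }
  // Step 2: only accept object payloads (400 is an assumption).
  if (typeof req.body !== "object" || req.body === null) {
    return { status: 400 };
  }
  // Step 3: publish and return at once; no database work happens here.
  try {
    deps.publish(JSON.stringify(req.body));
  } catch {
    return { status: 503 };
  }
  return { status: 202 };
}
```

The handler never touches a database. Everything after publish belongs to the consumer, which is what keeps the ingest response time independent of processing load.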

Production-Grade Message Processing

The RabbitMQ consumer implements multiple reliability patterns:

Durable queues with persistent message storage
Manual acknowledgments to prevent message loss
Exponential backoff retry strategy with randomized jitter
Dead-letter queues for unprocessable messages
Graceful shutdown procedures

Unprocessable events (schema validation failures or repeated processing errors) are automatically routed to a dead-letter queue for manual inspection rather than blocking the primary queue.
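A sketch of how a consumer might decide what to do with a failed message. The source does not specify the jitter variant, base delay, cap, or retry limit, so the "full jitter" formula and the defaults below (100 ms base, 30 s cap, 5 attempts) are assumptions, as are the names backoffDelayMs and planAfterFailure.

```typescript
// Full-jitter backoff: pick a random delay between 0 and an exponentially
// growing ceiling, so that many retrying workers do not wake up in lockstep.
function backoffDelayMs(
  attempt: number, // zero-based retry index
  baseMs = 100,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

type FailureKind = "schema_invalid" | "transient_error";

type NextAction =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" };

// Decides the fate of a message after its Nth failed attempt (1-based).
function planAfterFailure(
  kind: FailureKind,
  failedAttempts: number,
  maxAttempts = 5,
  random: () => number = Math.random,
): NextAction {
  // Retrying cannot fix a schema violation, so it goes straight to the DLQ.
  if (kind === "schema_invalid") {
    return { action: "dead_letter" };
  }
  // Repeated processing errors are also dead-lettered once retries run out.
  if (failedAttempts >= maxAttempts) {
    return { action: "dead_letter" };
  }
  return {
    action: "retry",
    delayMs: backoffDelayMs(failedAttempts - 1, 100, 30_000, random),
  };
}
```

Either way the primary queue keeps moving: a message is retried after a delay or parked in the dead-letter queue, and it never blocks the messages behind it.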

Circuit Breaker Pattern

To prevent cascading failures, both the ingestion API and consumer implement three-state circuit breakers:

Closed State: Normal operation; requests flow to downstream systems
Open State: Error threshold exceeded (5 errors in 10 seconds); requests fail immediately without attempting downstream calls
Half-Open State: After a 30-second cooldown, test messages verify recovery

If RabbitMQ becomes unavailable, the ingestion API fails fast: it returns 503 Service Unavailable immediately instead of hanging or blocking the production application. Monitoring infrastructure failures therefore do not propagate to user-facing systems.
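The three states can be captured in a small class. The sketch below uses the thresholds stated above (5 errors in 10 seconds, 30-second cooldown) as defaults. The class shape, option names, and the synchronous call method are assumptions made to keep the example short; a real breaker around broker or database calls would wrap async operations.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold?: number; // failures within the window that trip the breaker
  windowMs?: number;         // rolling window for counting failures
  cooldownMs?: number;       // time spent open before a trial call is allowed
  now?: () => number;        // injectable clock, handy for tests
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failureTimes: number[] = [];
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(options: BreakerOptions = {}) {
    this.failureThreshold = options.failureThreshold ?? 5;
    this.windowMs = options.windowMs ?? 10_000;
    this.cooldownMs = options.cooldownMs ?? 30_000;
    this.now = options.now ?? (() => Date.now());
  }

  getState(): BreakerState {
    // An open breaker becomes half-open once the cooldown has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  call<T>(operation: () => T): T {
    if (this.getState() === "open") {
      // Fail fast: the downstream system is not contacted at all.
      throw new Error("circuit open: downstream call skipped");
    }
    try {
      const value = operation();
      this.recordSuccess();
      return value;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  private recordSuccess(): void {
    // A successful trial call in half-open state closes the breaker.
    if (this.state === "half-open") {
      this.state = "closed";
      this.failureTimes = [];
    }
  }

  private recordFailure(): void {
    const t = this.now();
    if (this.state === "half-open") {
      this.trip(t); // the trial call failed, so reopen immediately
      return;
    }
    // Keep only failures inside the rolling window, then add this one.
    this.failureTimes = this.failureTimes.filter((x) => t - x < this.windowMs);
    this.failureTimes.push(t);
    if (this.failureTimes.length >= this.failureThreshold) {
      this.trip(t);
    }
  }

  private trip(t: number): void {
    this.state = "open";
    this.openedAt = t;
    this.failureTimes = [];
  }
}
```

In the ingestion API, an error thrown by an open breaker would be mapped to the 503 response, so the caller gets an answer immediately instead of waiting on a broker timeout.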

Idempotent Event Processing

With manual acknowledgments, RabbitMQ provides at-least-once delivery, so the same message can be delivered more than once (for example, when a consumer crashes before acknowledging). Sendry handles this safely through:

Unique event identifiers on every message
In-memory cache of processed event IDs (capped at 100,000 entries)
Deduplication check before database writes

This ensures analytics remain accurate even when network retries deliver duplicate messages.
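A capped deduplication cache along these lines can be built on a plain Set, which iterates in insertion order. The sketch below is one possible shape, with the hypothetical name RecentEventIds; evicting the oldest inserted ID first is an assumption about how the cap is enforced.

```typescript
class RecentEventIds {
  private readonly seen = new Set<string>();
  private readonly maxEntries: number;

  constructor(maxEntries = 100_000) {
    this.maxEntries = maxEntries;
  }

  // Returns true the first time an ID is seen and false for a duplicate.
  markIfNew(eventId: string): boolean {
    if (this.seen.has(eventId)) {
      return false;
    }
    this.seen.add(eventId);
    if (this.seen.size > this.maxEntries) {
      // A Set iterates in insertion order, so the first value is the oldest.
      const oldest = this.seen.values().next().value;
      if (oldest !== undefined) {
        this.seen.delete(oldest);
      }
    }
    return true;
  }

  get size(): number {
    return this.seen.size;
  }
}
```

A consumer would call markIfNew(eventId) before writing and skip the write when it returns false. Note the window this implies: a duplicate that arrives after its ID has been evicted, or after the worker restarts, is not caught by the cache alone.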

Dual Database Architecture

Different data access patterns require different storage engines:

MongoDB: Stores raw, unstructured event payloads with full request/response bodies. Enables schemaless ingestion and flexible queries. Automatically expires records after 30 days using TTL indexes, preventing unbounded storage growth.

PostgreSQL: Stores pre-aggregated hourly time-series metrics. Enables fast analytical queries optimized for dashboard rendering. Uses efficient UPSERT operations to atomically merge raw events into hourly buckets.

This separation allows optimization of write throughput (MongoDB) independently from query performance (PostgreSQL).
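The two storage behaviors described above map onto well-known database features: a MongoDB TTL index (the expireAfterSeconds index option) and a PostgreSQL INSERT ... ON CONFLICT ... DO UPDATE statement. The sketch below writes both as TypeScript constants. The field, table, and column names (receivedAt, hourly_metrics, and so on) are hypothetical, since the source does not show the actual schema.

```typescript
// 30 days expressed in seconds, the unit a TTL index expects.
const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60;

// Arguments one could pass to a MongoDB createIndex call. The indexed field
// must hold a date; documents expire about 30 days after that date.
const rawEventsTtlIndex = {
  keys: { receivedAt: 1 },
  options: { expireAfterSeconds: THIRTY_DAYS_SECONDS },
};

// Parameterized UPSERT that folds one raw event into its hourly bucket.
// It assumes a unique constraint on the five columns named in ON CONFLICT.
const upsertHourlyMetricSql = `
INSERT INTO hourly_metrics
  (bucket_start, service, endpoint, method, status_code, request_count, total_latency_ms)
VALUES ($1, $2, $3, $4, $5, 1, $6)
ON CONFLICT (bucket_start, service, endpoint, method, status_code)
DO UPDATE SET
  request_count    = hourly_metrics.request_count + 1,
  total_latency_ms = hourly_metrics.total_latency_ms + EXCLUDED.total_latency_ms;
`;
```

Because the UPSERT is a single statement, concurrent consumers merging into the same bucket do not need application-level locking; the database resolves the conflict atomically.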

Database Schema Visualization

[Diagram: Database Entity Relationships]

Real-Time Analytics Dashboard

The React-based dashboard provides operational visibility:

Request volume trends over time
Error rate distribution by status code
Average, p50, p95, p99 latency metrics
Top endpoints ranked by traffic
Service health status indicators
Endpoint performance comparison tables

All metrics are served from pre-aggregated PostgreSQL tables, ensuring dashboard queries complete in under 100 milliseconds regardless of total event volume.

Dashboard Preview

[Screenshot: Sendry Dashboard]

Engineering Challenges and Solutions

Challenge: Reliable Event Processing Under Network Failures

Problem: Network partitions, server crashes, and transient failures can cause event loss if not handled carefully. The system must provide guarantees that telemetry is never discarded.

Solution:

Implemented durable RabbitMQ queues with persistent storage to disk
Consumer uses manual acknowledgments instead of auto-acknowledgment
Events remain in queue until explicitly acknowledged after successful database writes
Graceful shutdown procedures allow in-flight operations to complete before termination
Dead-letter queues capture unprocessable messages for manual investigation

Result: Even if consumer workers crash mid-processing, events are not lost and will be reprocessed after recovery.
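The acknowledge-after-write rule is the heart of this guarantee. The sketch below shows the ordering with synchronous stand-ins for what would be async database and broker calls; handleDelivery, ConsumerDeps, and DeliveryControls are hypothetical names, not Sendry's actual interfaces.

```typescript
// Hypothetical persistence hooks; the real ones would talk to the databases.
interface ConsumerDeps {
  writeRawEvent(payload: string): void;          // raw event store (MongoDB)
  mergeIntoHourlyMetrics(payload: string): void; // hourly metrics (PostgreSQL)
}

// The two things a consumer can tell the broker about a delivery.
interface DeliveryControls {
  ack(): void;
  nack(requeue: boolean): void;
}

function handleDelivery(
  payload: string,
  deps: ConsumerDeps,
  delivery: DeliveryControls,
): void {
  try {
    deps.writeRawEvent(payload);
    deps.mergeIntoHourlyMetrics(payload);
  } catch {
    // A write failed: give the message back to the broker for another try.
    delivery.nack(true);
    return;
  }
  // Acknowledge only after both writes succeeded, so a crash before this
  // line leaves the message in the queue for redelivery.
  delivery.ack();
}
```

If the first write succeeds and the second fails, redelivery replays the first write as well. That is why the pipeline also needs the deduplication and idempotent writes described elsewhere in this document.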

Challenge: Absorbing Traffic Spikes Without Backpressure

Problem: If a production application sends 10,000 requests per second during a traffic spike, the monitoring system must not block or slow down the application waiting for these requests to be processed.

Solution:

Implemented RabbitMQ as a durable buffer between ingestion and processing
Ingestion API publishes to queue and returns immediately (202 Accepted)
Consumer pulls from queue at its own pace, independent of ingestion rate
Horizontal scaling: additional consumer processes can be spawned to drain queue faster
Prefetch controls prevent any single worker from being overwhelmed

Result: Traffic spikes are absorbed by the queue without propagating backpressure to production systems.
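A back-of-the-envelope model makes the buffering behavior concrete. The 10,000 events per second spike comes from the problem statement above; the drain rates, spike duration, and baseline traffic are made-up numbers for illustration, and the linear model ignores batching and broker overhead.

```typescript
// Queue depth after a period of steady ingest and steady draining.
function queueDepthAfter(
  ingestPerSec: number,
  drainPerSec: number,
  seconds: number,
  startingDepth = 0,
): number {
  return Math.max(0, startingDepth + (ingestPerSec - drainPerSec) * seconds);
}

// Time needed to empty a backlog; Infinity if consumers cannot keep up.
function secondsToDrain(
  depth: number,
  ingestPerSec: number,
  drainPerSec: number,
): number {
  const netDrain = drainPerSec - ingestPerSec;
  return netDrain > 0 ? depth / netDrain : Infinity;
}

// A 60-second spike at 10,000 events/s against consumers draining 4,000/s.
const backlogAfterSpike = queueDepthAfter(10_000, 4_000, 60);

// Traffic falls back to 1,000 events/s: recovery time with the same workers,
// and with twice the consumer capacity.
const recoverySeconds = secondsToDrain(backlogAfterSpike, 1_000, 4_000);
const recoverySecondsScaled = secondsToDrain(backlogAfterSpike, 1_000, 8_000);
```

Under these assumed rates the spike leaves a backlog of 360,000 events, which drains in 120 seconds, or in roughly 51 seconds once consumer capacity is doubled. The application sees none of this; only the queue depth changes.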

Challenge: Handling Duplicate Message Delivery

Problem: Message broker retry mechanisms can deliver the same message multiple times, causing duplicate entries in analytics. This corrupts metrics accuracy.

Solution:

Every event receives a unique hash identifier
Consumer maintains an in-memory Set of processed event IDs
Before writing to databases, consumer checks Set for duplicates
Set is capped at 100,000 entries to keep memory use bounded
When the size limit is reached, the oldest entries are discarded first

Result: Duplicate messages are transparently filtered with zero additional database overhead.

Challenge: Efficient Analytics on Massive Event Volumes

Problem: Scanning billions of raw events for every dashboard query is prohibitively expensive. Real-time dashboards would time out regularly.

Solution:

Implemented incremental aggregation during event processing
Consumer groups raw events into hourly buckets (service, endpoint, method, status code)
PostgreSQL UPSERT operations atomically merge into hourly metrics table
Dashboard queries scan pre-aggregated metrics tables instead of raw events
Storage size reduced by 98% through aggregation

Result: Dashboard queries complete in 50-100ms regardless of total event volume.
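The bucketing step can be illustrated in memory. The sketch below groups events by hour, service, endpoint, method, and status code, as described above, and merges each event into a running bucket, which is the same merge an UPSERT performs in the database. The type and function names, and the choice of aggregated fields, are assumptions.

```typescript
interface RawApiEvent {
  service: string;
  endpoint: string;
  method: string;
  statusCode: number;
  latencyMs: number;
  timestampMs: number; // Unix epoch milliseconds
}

interface HourlyBucket {
  requestCount: number;
  totalLatencyMs: number;
  maxLatencyMs: number;
}

const HOUR_MS = 60 * 60 * 1000;

// Truncates the timestamp to the start of its UTC hour and joins it with
// the grouping dimensions to form the bucket key.
function bucketKey(e: RawApiEvent): string {
  const hourStart = Math.floor(e.timestampMs / HOUR_MS) * HOUR_MS;
  return [
    new Date(hourStart).toISOString(),
    e.service,
    e.endpoint,
    e.method,
    String(e.statusCode),
  ].join("|");
}

// Merges one event into its bucket, creating the bucket on first sight.
function mergeEvent(buckets: Map<string, HourlyBucket>, e: RawApiEvent): void {
  const key = bucketKey(e);
  const current = buckets.get(key);
  if (current === undefined) {
    buckets.set(key, {
      requestCount: 1,
      totalLatencyMs: e.latencyMs,
      maxLatencyMs: e.latencyMs,
    });
    return;
  }
  current.requestCount += 1;
  current.totalLatencyMs += e.latencyMs;
  current.maxLatencyMs = Math.max(current.maxLatencyMs, e.latencyMs);
}
```

Average latency for a bucket is totalLatencyMs divided by requestCount, so a dashboard query reads one row per bucket however many raw events that hour contained.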

Challenge: Cross-Domain Session Management in Development

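Problem: When testing locally, the frontend (localhost:5173) and the API (localhost:5000) run on different origins over plain HTTP. Cookie settings that suit production (SameSite=None, which requires Secure) do not work reliably in that setup, so browsers drop the session cookie. Testing required complex workarounds.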

Solution:

Implemented environment-aware cookie negotiation in authentication middleware
Development environment: SameSite=Lax, Secure=false
Production environment: SameSite=None, Secure=true
Automatic detection based on NODE_ENV

Result: Development and production environments work seamlessly without manual configuration changes.
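The environment switch amounts to a few lines. In this sketch the function name sessionCookieOptions is hypothetical; the attribute values are the ones listed above, and httpOnly follows the HTTP-only cookie policy mentioned under Security & Access Control. The lowercase "lax" and "none" values match what Express's res.cookie accepts for sameSite.

```typescript
interface SessionCookieOptions {
  httpOnly: boolean;
  sameSite: "lax" | "none";
  secure: boolean;
}

// Chooses cookie attributes from NODE_ENV. Anything other than
// "production" is treated as a development-style environment.
function sessionCookieOptions(nodeEnv: string | undefined): SessionCookieOptions {
  const isProduction = nodeEnv === "production";
  return {
    httpOnly: true,
    // SameSite=None is only honored together with Secure, so the two
    // settings are switched as a pair.
    sameSite: isProduction ? "none" : "lax",
    secure: isProduction,
  };
}
```

The authentication middleware would pass the result as the options argument when setting the session cookie, for example sessionCookieOptions(process.env.NODE_ENV).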

Production-Grade System Design

The platform implements several critical patterns required for reliable, scalable systems:

Security & Access Control

JWT-based authentication with HTTP-only secure cookies
API key management with per-client credential rotation
Role-based access control (RBAC) with admin, operator, and viewer roles
API key validation on every incoming telemetry request
Automatic token expiration and refresh mechanisms

Reliability Patterns

Circuit breaker pattern with three states (closed, open, half-open)
Exponential backoff with randomized jitter for retry handling
Dead-letter queues for handling unprocessable events
Health checks on all critical services
Graceful shutdown procedures with in-flight operation completion

Observability

Structured logging in JSON format for easy parsing
Request tracing with unique IDs for debugging
Performance metrics collection on all critical paths
Error aggregation and alerting
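As an illustration of the first two items, a structured log line with a request ID might be produced like this. The function name formatLogLine and the field names (timestamp, level, message, requestId) are assumptions; the source does not specify a log schema.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Builds one JSON log line. Extra fields are spread first so they cannot
// overwrite the core keys that log tooling is expected to rely on.
function formatLogLine(
  level: LogLevel,
  message: string,
  requestId: string,
  fields: Record<string, unknown> = {},
  now: () => Date = () => new Date(),
): string {
  return JSON.stringify({
    ...fields,
    timestamp: now().toISOString(),
    level,
    message,
    requestId,
  });
}
```

Carrying the same requestId from the ingest handler through the queue message to the consumer's log lines is what allows a single event to be followed across the decoupled layers.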

Infrastructure

Containerized deployments using Docker
Docker Compose for local development environment
Support for horizontal scaling (multiple consumer workers)
Connection pooling for database and message broker
Automated database migrations

Multi-Tenant Architecture

Namespace isolation between customers
Per-customer API keys and rate limiting
Separate metric aggregations by customer
Billing-ready event counting infrastructure
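Per-customer rate limiting can be sketched as a counter keyed by API key. The source does not name the algorithm, so the fixed-window approach, the class name FixedWindowRateLimiter, and the in-memory Map below are assumptions; a deployment with several ingest instances would need shared state instead.

```typescript
interface WindowState {
  windowStart: number;
  count: number;
}

class FixedWindowRateLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true if the request is within the key's budget for the
  // current window, false if it should be rejected.
  allow(apiKey: string): boolean {
    const t = this.now();
    const current = this.windows.get(apiKey);
    // Start a fresh window on first use or once the old one has expired.
    if (current === undefined || t - current.windowStart >= this.windowMs) {
      this.windows.set(apiKey, { windowStart: t, count: 1 });
      return true;
    }
    if (current.count >= this.limit) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

Because each key has its own window, one customer exhausting its budget has no effect on another customer's requests, which is the isolation property this section is after.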

Technology Stack & Architecture

Frontend Layer

React 18 with Vite for optimized bundling
TanStack Query for server state management
ApexCharts for real-time metric visualization
Tailwind CSS for responsive design
Axios for REST API communication

Backend Services

Node.js runtime for high concurrency
Express.js framework for HTTP routing
JWT middleware for authentication
Structured logging middleware

Message Queue & Streaming

RabbitMQ 3.x for durable event queuing
Manual acknowledgments for delivery guarantees
Dead-letter exchanges for failure handling
Topic-based routing for multi-tenant isolation

Data Storage

MongoDB 6.0 for schemaless event storage
TTL indexes for automatic data expiration
PostgreSQL 15 for time-series aggregations
Connection pooling and query optimization

Deployment & Operations

Docker containerization for reproducible deployments
Docker Compose for local development
Environment-based configuration management
Graceful shutdown handling

Development Tools

ESLint for code quality
Jest for unit testing
Postman collections for API testing
Docker for isolated local environment

Key Architectural Insights

Building Sendry provided deep experience with production system design patterns. The project demonstrated how architectural decisions made early in development have enormous downstream consequences.

Decoupled Architecture Enables Scalability

The fundamental insight was separating concerns into independent layers. By keeping database writes out of the request path through message queueing, the system became horizontally scalable without architectural changes. Additional consumer processes can be spawned to increase throughput without modifying the ingestion layer.

Traffic Spike Absorption

One of the main jobs of a message broker is to absorb traffic variance. During peak loads, the queue buffers events naturally, preventing cascading failures. Bottlenecks show up as growing queue depth rather than as application errors, which makes it visible when capacity should be expanded.

Idempotency Eliminates Whole Classes of Bugs

By designing all operations to be idempotent, the system became dramatically more resilient. Events can be replayed, retried, or reprocessed without corrupting state. This single design decision eliminated entire categories of subtle concurrency bugs.

Event-Driven Architecture Requires Careful State Management

The tradeoff for decoupling is increased complexity in state management. Monitoring the health of each layer independently became critical. Dead-letter queues surfaced issues that synchronous systems would expose immediately through exceptions.

Observability Must Be Built In

Building a monitoring system that initially had no monitoring of its own was challenging. The project benefited enormously from structured logging, unique request IDs, and per-component health checks. These weren't afterthoughts but architectural requirements.

Database Trade-offs

The dual-database approach involved accepting higher operational complexity for significant performance gains. MongoDB handled write throughput that PostgreSQL alone would have struggled with. The trade-off was justified by the scale (billions of events) and access patterns (write-heavy raw events, read-heavy aggregations).

Circuit Breakers Prevent Chaos

Without circuit breakers, infrastructure failures propagate as cascading errors. With them, failures are contained and visible. The three-state pattern is remarkably elegant for handling transient failures differently from persistent outages.

Impact and Outcomes

The platform demonstrates how careful architectural design can solve operational problems that most teams try to solve with money (expensive SaaS platforms) or complexity (heavyweight open-source solutions).

Key achievements:

No added latency on the request path of monitored applications
Handles 10,000+ events per second on commodity hardware
Accepts events with sub-2ms ingestion latency
Scales horizontally by adding consumer workers
Gracefully degrades under failure
Provides real-time operational visibility
Self-hosted, reducing long-term costs

The project reinforced a core engineering principle: reliability is not a feature added after launch, but a property built into the architecture from day one.
