Sendry — Event-Driven API Observability Platform

Event-driven API observability platform with asynchronous telemetry ingestion, RabbitMQ processing, DLQ handling, circuit breakers, and real-time monitoring dashboards.

Timeline

3–4 months

Role

Full Stack Developer

Team

Solo

Status
In-progress

Technology Stack

React
Node.js
Express.js
MongoDB
PostgreSQL
RabbitMQ
JWT
Docker
Redis
Tailwind CSS

Overview

Sendry is a production-grade, self-hosted API observability platform engineered to capture, aggregate, and visualize high-throughput API metrics without introducing latency overhead or blocking application request loops.

The platform addresses a fundamental problem in modern software engineering: how to monitor systems comprehensively without slowing down the applications being monitored.

By leveraging an event-driven, fully decoupled architecture, Sendry ingests API telemetry in under 2 milliseconds, publishes events to a durable message queue, and processes analytics asynchronously without touching the critical path of production requests.

Core Design Principle:

Observability systems must never compromise the reliability or performance of the systems they monitor.

The Problem

Most observability platforms fall into one of three categories:

Category 1: Expensive SaaS Solutions

Enterprise APM tools charge premium rates per host or per event. For teams with high traffic volumes, costs become prohibitive, and monthly bills can exceed the infrastructure costs themselves.

Category 2: Complex Self-Hosted Systems

Open-source alternatives like Prometheus or ELK require significant operational expertise and infrastructure investment. They demand dedicated DevOps resources to maintain, scale, and troubleshoot.

Category 3: Performance-Blocking Solutions

Synchronous instrumentation introduces measurable latency. Every database write, HTTP call, and aggregation operation performed during request execution delays the response to users. This creates an unacceptable trade-off between visibility and performance.

What Engineers Actually Need

Teams don't require enterprise APM complexity. They need answers to operational questions:

  • Which API endpoints generate the most traffic?
  • What is the current system error rate?
  • Which services experience performance degradation?
  • Where are the reliability bottlenecks?

Sendry was designed to provide precisely these insights without the complexity, cost, or performance penalty of existing solutions.


System Architecture

Design Overview

Sendry implements a four-layer event-driven architecture that completely decouples telemetry ingestion from data persistence and analytics processing:

Production Application
        │
        ├─ Sendry SDK Middleware (Captures metrics)
        │
        ▼
Ingestion API (Express.js)
        │
        ├─ API Key Validation
        ├─ Payload Parsing
        │
        ▼
RabbitMQ Message Queue
        │
        ├─ Durable buffering
        ├─ Traffic spike absorption
        │
        ▼
Background Consumer Worker
        │
        ├─ Circuit breaker protection
        ├─ Retry mechanisms
        ├─ Deduplication logic
        │
        ├──────────────────┬──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
    MongoDB          PostgreSQL         Dead-Letter Queue
   (Raw Events)    (Aggregated Metrics)   (Failed Events)
        │                  │                  │
        └──────────────────┴──────────────────┘
                     │
                     ▼
              Analytics API
                     │
                     ▼
            Real-Time Dashboard

Layer Responsibilities

Ingest Layer: Validates incoming telemetry, verifies API credentials, and publishes events to the message queue. Returns immediately with a 202 Accepted response, guaranteeing sub-2ms response times regardless of downstream processing load.

Queue Layer: RabbitMQ absorbs traffic spikes and provides durable storage. If consumer workers become overwhelmed, the queue buffers events without blocking the production application. Consumers use manual acknowledgments so that events are not lost during failures.

Processing Layer: Background consumer workers pull events from the queue, validate schemas, perform deduplication, and coordinate dual-write persistence. Workers retry with exponential backoff and jitter to handle transient database failures gracefully.

Storage Layer: MongoDB stores raw, unstructured event payloads with automatic 30-day expiration. PostgreSQL maintains pre-aggregated hourly time-series metrics optimized for analytical queries. This separation allows independent optimization of write-heavy and read-heavy workloads.

Architecture Diagrams

[Figure: System Architecture]

[Figure: Ingestion Sequence]


Core Technical Features

Lightweight Monitoring SDK

Applications integrate a minimal Express.js middleware that automatically captures relevant telemetry:

  • Request endpoint and HTTP method
  • Response status code
  • Latency measurement (sub-millisecond precision)
  • Client IP address and user agent
  • Request size and response size

Critically, metrics are collected and sent asynchronously after the HTTP response has already been delivered to the client. This keeps telemetry off the request's critical path and avoids adding latency to application responses.
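A minimal sketch of such a middleware, assuming an Express-style (req, res, next) signature. The names createSendryMiddleware and send are hypothetical and are not the actual SDK API; send stands in for whatever transport ships the event to the ingestion endpoint.

```javascript
// Hypothetical sketch: capture telemetry only after the response has been sent.
// `send` is an assumed callback that ships the event (for example an HTTP POST).
function createSendryMiddleware(send) {
  return function sendryMiddleware(req, res, next) {
    const startedAt = process.hrtime.bigint(); // nanosecond-resolution clock

    // 'finish' fires once the response has been handed off for delivery,
    // so the work below does not delay the client.
    res.on("finish", () => {
      const elapsedNs = process.hrtime.bigint() - startedAt;
      const event = {
        endpoint: req.originalUrl || req.url,
        method: req.method,
        statusCode: res.statusCode,
        latencyMs: Number(elapsedNs) / 1e6,
        ip: req.ip,
        userAgent: req.headers["user-agent"],
      };
      // Defer sending and swallow errors: telemetry must never break the app.
      setImmediate(() => {
        Promise.resolve()
          .then(() => send(event))
          .catch(() => {});
      });
    });

    next();
  };
}
```

In an Express app this would be registered with app.use(createSendryMiddleware(send)) before the route handlers, so that every request is timed.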

Decoupled Event Processing Pipeline

Rather than processing telemetry synchronously, Sendry implements an event-driven architecture:

  1. Ingest API receives telemetry payload
  2. Validates API key against credential store
  3. Publishes event to RabbitMQ queue
  4. Returns HTTP 202 Accepted immediately
  5. Consumer pulls from queue independently
  6. Processes and persists to dual databases

This separation means traffic spikes are absorbed by the queue without overwhelming the application. Peak loads are handled gracefully through buffering rather than backpressure.
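Steps 1 to 4 can be sketched as a small, framework-independent handler. This is an illustration under assumptions, not the project's code: isValidApiKey and publish are hypothetical injected dependencies (a credential lookup and a queue publisher that returns false when the event could not be queued).

```javascript
// Hypothetical sketch of the ingest path. `isValidApiKey` and `publish` are
// assumed, injected dependencies (credential lookup and queue publisher).
function createIngestHandler({ isValidApiKey, publish }) {
  return function handleIngest(apiKey, payload) {
    // Steps 1-2: reject requests without a recognised API key.
    if (!isValidApiKey(apiKey)) {
      return { status: 401, body: { error: "invalid API key" } };
    }
    // Minimal shape check only; schema validation is left to the consumer.
    if (payload === null || typeof payload !== "object") {
      return { status: 400, body: { error: "payload must be a JSON object" } };
    }
    // Step 3: hand the event to the queue. No database work happens here.
    const queued = publish(Buffer.from(JSON.stringify(payload)));
    if (!queued) {
      return { status: 503, body: { error: "queue unavailable" } };
    }
    // Step 4: acknowledge receipt; processing continues asynchronously.
    return { status: 202, body: { accepted: true } };
  };
}
```

Because the handler only validates and publishes, its cost does not depend on how busy the databases or consumers are.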

Production-Grade Message Processing

The RabbitMQ consumer implements multiple reliability patterns:

  • Durable queues with persistent message storage
  • Manual acknowledgments to prevent message loss
  • Exponential backoff retry strategy with randomized jitter
  • Dead-letter queues for unprocessable messages
  • Graceful shutdown procedures

Unprocessable events (schema validation failures or repeated processing errors) are automatically routed to a dead-letter queue for manual inspection rather than blocking the primary queue.
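The retry policy can be illustrated with a short sketch. It assumes the "full jitter" variant of exponential backoff and an attempt limit of five; the base delay, cap, limit, and function names are all illustrative rather than Sendry's actual settings.

```javascript
// Exponential backoff with "full jitter": the delay is drawn uniformly from
// [0, min(maxMs, baseMs * 2^attempt)). All default values are illustrative.
function backoffDelayMs(attempt, { baseMs = 100, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Decide what to do with a message that has just failed for the
// (attempt + 1)-th time. `maxAttempts` is an assumed limit.
function nextAction(attempt, maxAttempts = 5) {
  return attempt + 1 >= maxAttempts
    ? { type: "dead-letter" }
    : { type: "retry", delayMs: backoffDelayMs(attempt) };
}
```

Randomizing the delay spreads retries out so that many workers recovering from the same outage do not all hit the database at the same instant.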

Circuit Breaker Pattern

To prevent cascading failures, both the ingestion API and consumer implement three-state circuit breakers:

  • Closed State: Normal operation, requests flow to downstream systems
  • Open State: Threshold exceeded (5 errors in 10 seconds), requests immediately fail without attempting downstream calls
  • Half-Open State: After 30-second cooldown, test messages verify recovery

If RabbitMQ crashes, the circuit opens and the Ingest API fails fast, returning 503 Service Unavailable without hanging or blocking the production application. This keeps monitoring infrastructure failures from propagating to user-facing systems.
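A compact sketch of the three-state breaker, using the thresholds quoted above (5 failures within 10 seconds, 30-second cooldown). The class and method names are hypothetical, and as a simplification every call is allowed while the breaker is half-open rather than a single trial call.

```javascript
// Three-state circuit breaker. Thresholds mirror the figures quoted in the
// text; the class and method names are hypothetical.
class CircuitBreaker {
  constructor({ failureThreshold = 5, windowMs = 10000, cooldownMs = 30000, now = Date.now } = {}) {
    Object.assign(this, { failureThreshold, windowMs, cooldownMs, now });
    this.state = "closed";
    this.failures = []; // timestamps of recent failures
    this.openedAt = 0;
  }

  // Returns true if a downstream call may be attempted right now.
  canAttempt() {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow trial calls
    }
    return this.state !== "open";
  }

  recordSuccess() {
    this.state = "closed";
    this.failures = [];
  }

  recordFailure() {
    const t = this.now();
    if (this.state === "half-open") return this.trip(t); // trial call failed
    // Keep only failures inside the sliding window, then add this one.
    this.failures = this.failures.filter((ts) => t - ts < this.windowMs);
    this.failures.push(t);
    if (this.failures.length >= this.failureThreshold) this.trip(t);
  }

  trip(t) {
    this.state = "open";
    this.openedAt = t;
    this.failures = [];
  }
}
```

A caller checks canAttempt() before publishing, returns 503 straight away when it is false, and reports the outcome of each attempted call through recordSuccess() or recordFailure().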

Idempotent Event Processing

With manual acknowledgments, RabbitMQ provides at-least-once delivery, which means the same message can be delivered more than once. Sendry handles this safely through:

  • Unique event identifiers on every message
  • In-memory cache of processed event IDs (capped at 100,000 entries)
  • Deduplication check before database writes

This ensures analytics remain accurate even when network retries deliver duplicate messages.
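The bounded cache can be sketched with a plain JavaScript Set, which iterates in insertion order. The class name SeenEvents and the method markIfNew are hypothetical. Note that a cache like this lives in one process only, so it does not survive restarts and is not shared between workers.

```javascript
// Bounded set of recently seen event IDs. A JavaScript Set iterates in
// insertion order, so the first value is always the oldest entry.
class SeenEvents {
  constructor(maxEntries = 100000) {
    this.maxEntries = maxEntries;
    this.ids = new Set();
  }

  // Returns true the first time an ID is seen, false for duplicates.
  markIfNew(eventId) {
    if (this.ids.has(eventId)) return false;
    if (this.ids.size >= this.maxEntries) {
      const oldest = this.ids.values().next().value;
      this.ids.delete(oldest); // evict the oldest entry to stay within the cap
    }
    this.ids.add(eventId);
    return true;
  }
}
```

The consumer would call markIfNew(event.id) before any database write and skip the event when it returns false.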

Dual Database Architecture

Different data access patterns require different storage engines:

MongoDB: Stores raw, unstructured event payloads with full request/response bodies. Enables schemaless ingestion and flexible queries. Automatically expires records after 30 days using TTL indexes, preventing unbounded storage growth.

PostgreSQL: Stores pre-aggregated hourly time-series metrics. Enables fast analytical queries optimized for dashboard rendering. Uses efficient UPSERT operations to atomically merge raw events into hourly buckets.

This separation allows optimization of write throughput (MongoDB) independently from query performance (PostgreSQL).
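As an illustration of both storage mechanisms, the sketch below defines a 30-day TTL index spec and builds a parameterised upsert for the hourly table. The table name hourly_metrics, the column names, and the createdAt field are assumptions (the real schema is not shown here), and the ON CONFLICT clause presumes a unique constraint on the five key columns.

```javascript
// Hypothetical table, column, and field names; the real schema is not shown.
const THIRTY_DAYS_IN_SECONDS = 30 * 24 * 60 * 60; // 2,592,000

// TTL index spec, in the (keys, options) shape used by MongoDB's createIndex.
const rawEventsTtlIndex = {
  keys: { createdAt: 1 },
  options: { expireAfterSeconds: THIRTY_DAYS_IN_SECONDS },
};

// Truncate a timestamp to the start of its UTC hour.
function hourBucket(timestamp) {
  const d = new Date(timestamp);
  d.setUTCMinutes(0, 0, 0);
  return d.toISOString();
}

// Build a parameterised PostgreSQL upsert that adds one event to its hourly row.
function buildHourlyUpsert(event) {
  const text = `
    INSERT INTO hourly_metrics
      (service, endpoint, method, status_code, bucket_start, request_count, total_latency_ms)
    VALUES ($1, $2, $3, $4, $5, 1, $6)
    ON CONFLICT (service, endpoint, method, status_code, bucket_start)
    DO UPDATE SET
      request_count    = hourly_metrics.request_count + 1,
      total_latency_ms = hourly_metrics.total_latency_ms + EXCLUDED.total_latency_ms`;
  const values = [
    event.service, event.endpoint, event.method, event.statusCode,
    hourBucket(event.timestamp), event.latencyMs,
  ];
  return { text, values };
}
```

The first event in an hour inserts a new row and later events increment it, all within a single statement, so concurrent workers do not need a separate read before writing.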

Database Schema Visualization

[Figure: Database Entity Relationships]

Real-Time Analytics Dashboard

The React-based dashboard provides operational visibility:

  • Request volume trends over time
  • Error rate distribution by status code
  • Average, p50, p95, p99 latency metrics
  • Top endpoints ranked by traffic
  • Service health status indicators
  • Endpoint performance comparison tables

All metrics are served from pre-aggregated PostgreSQL tables, ensuring dashboard queries complete in under 100 milliseconds regardless of total event volume.

Dashboard Preview

[Figure: Sendry Dashboard]


Engineering Challenges and Solutions

Challenge: Reliable Event Processing Under Network Failures

Problem: Network partitions, server crashes, and transient failures can cause event loss if not handled carefully. The system must provide guarantees that telemetry is never discarded.

Solution:

  • Implemented durable RabbitMQ queues with persistent storage to disk
  • Consumer uses manual acknowledgments instead of auto-acknowledgment
  • Events remain in queue until explicitly acknowledged after successful database writes
  • Graceful shutdown procedures allow in-flight operations to complete before termination
  • Dead-letter queues capture unprocessable messages for manual investigation

Result: Even if consumer workers crash mid-processing, events are not lost and will be reprocessed after recovery.
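The acknowledge-after-write rule can be sketched as follows. This is a simplified illustration: channel is assumed to expose ack and nack in the style of the amqplib channel API, saveRaw and saveAggregate are hypothetical async functions for the two database writes, and the retry delay is omitted for brevity.

```javascript
// Hypothetical consumer handler. `channel` is assumed to expose ack/nack in the
// style of the amqplib channel API; `saveRaw` and `saveAggregate` are assumed
// async functions that write to MongoDB and PostgreSQL respectively.
function createMessageHandler({ channel, saveRaw, saveAggregate }) {
  return async function handleMessage(msg) {
    let event;
    try {
      event = JSON.parse(msg.content.toString());
    } catch {
      // Malformed payloads can never succeed: reject without requeue so the
      // broker can route them to a dead-letter exchange, if one is configured.
      channel.nack(msg, false, false);
      return "dead-lettered";
    }
    try {
      await saveRaw(event);
      await saveAggregate(event);
      channel.ack(msg); // acknowledge only after both writes succeed
      return "acked";
    } catch {
      // Leave the message with the broker so it is redelivered later.
      channel.nack(msg, false, true);
      return "requeued";
    }
  };
}
```

If the worker crashes before channel.ack runs, the broker still holds the message as unacknowledged and redelivers it, which is why downstream writes need to tolerate duplicates.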


Challenge: Absorbing Traffic Spikes Without Backpressure

Problem: If a production application sends 10,000 requests per second during a traffic spike, the monitoring system must not block or slow down the application waiting for these requests to be processed.

Solution:

  • Implemented RabbitMQ as a durable buffer between ingestion and processing
  • Ingestion API publishes to queue and returns immediately (202 Accepted)
  • Consumer pulls from queue at its own pace, independent of ingestion rate
  • Horizontal scaling: additional consumer processes can be spawned to drain queue faster
  • Prefetch controls prevent any single worker from being overwhelmed

Result: Traffic spikes are absorbed by the queue without propagating backpressure to production systems.
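A back-of-the-envelope helper makes the buffering trade-off concrete. Every rate used here is an illustrative input, not a measurement from Sendry, and the function names are hypothetical.

```javascript
// Back-of-the-envelope queue sizing. All rates are illustrative inputs,
// not measurements from Sendry.
function estimateBacklog({ ingestPerSec, drainPerWorkerPerSec, workers, spikeSeconds }) {
  const drainPerSec = drainPerWorkerPerSec * workers;
  // The queue only grows while events arrive faster than workers drain them.
  const growthPerSec = Math.max(0, ingestPerSec - drainPerSec);
  return { drainPerSec, peakBacklog: growthPerSec * spikeSeconds };
}

// Seconds needed to clear a backlog once traffic returns to `baselinePerSec`.
function secondsToDrain(backlog, drainPerSec, baselinePerSec) {
  const spare = drainPerSec - baselinePerSec;
  return spare > 0 ? backlog / spare : Infinity;
}
```

For example, a 60-second spike at 10,000 events per second against two workers that each drain an assumed 2,000 events per second leaves a backlog of 360,000 events. At a baseline of 1,000 events per second, that backlog clears in 120 seconds. Adding workers shrinks both numbers without any change to the ingestion side.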


Challenge: Handling Duplicate Message Delivery

Problem: Message broker retry mechanisms can deliver the same message multiple times, causing duplicate entries in analytics. This corrupts metrics accuracy.

Solution:

  • Every event receives a unique hash identifier
  • Consumer maintains an in-memory Set of processed event IDs
  • Before writing to databases, consumer checks Set for duplicates
  • Set is capped at 100,000 entries to bound memory use
  • When the cap is reached, the oldest entries are discarded first

Result: Duplicate messages are filtered in memory before any database write, so deduplication adds no extra database queries.


Challenge: Efficient Analytics on Massive Event Volumes

Problem: Scanning billions of raw events for every dashboard query is prohibitively expensive. Real-time dashboards would time out regularly.

Solution:

  • Implemented incremental aggregation during event processing
  • Consumer groups raw events into hourly buckets (service, endpoint, method, status code)
  • PostgreSQL UPSERT operations atomically merge into hourly metrics table
  • Dashboard queries scan pre-aggregated metrics tables instead of raw events
  • Storage size reduced by 98% through aggregation

Result: Dashboard queries complete in 50-100ms regardless of total event volume.
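The bucketing step can be sketched as a pure function over a batch of events. The event field names and the aggregate columns (requestCount, totalLatencyMs, maxLatencyMs) are assumptions made for illustration.

```javascript
// Group a batch of raw events into hourly buckets keyed by
// (service, endpoint, method, status code). Field names are assumptions.
function aggregateHourly(events) {
  const buckets = new Map();
  for (const e of events) {
    const hour = new Date(e.timestamp);
    hour.setUTCMinutes(0, 0, 0); // truncate to the start of the UTC hour
    const bucketStart = hour.toISOString();
    const key = [e.service, e.endpoint, e.method, e.statusCode, bucketStart].join("|");
    const row = buckets.get(key) ?? {
      service: e.service, endpoint: e.endpoint, method: e.method,
      statusCode: e.statusCode, bucketStart,
      requestCount: 0, totalLatencyMs: 0, maxLatencyMs: 0,
    };
    row.requestCount += 1;
    row.totalLatencyMs += e.latencyMs;
    row.maxLatencyMs = Math.max(row.maxLatencyMs, e.latencyMs);
    buckets.set(key, row);
  }
  return [...buckets.values()];
}
```

Average latency is then totalLatencyMs / requestCount per row. Percentiles cannot be recovered from sums and counts alone, so metrics such as p95 would need additional per-bucket data, for example a latency histogram.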


Challenge: Cross-Origin Session Management in Development

Problem: When testing locally, the frontend (localhost:5173) and the API (localhost:5000) run on different origins over plain HTTP, and session cookies configured for production (SameSite=None, Secure=true) were not accepted by the browser. Testing required complex workarounds.

Solution:

  • Implemented environment-aware cookie negotiation in authentication middleware
  • Development environment: SameSite=Lax, Secure=false
  • Production environment: SameSite=None, Secure=true
  • Automatic detection based on NODE_ENV

Result: Development and production environments work seamlessly without manual configuration changes.
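The environment switch is small enough to show in full. This is a sketch with a hypothetical function name; the returned object uses the option names accepted by Express's res.cookie().

```javascript
// Pick session-cookie attributes from NODE_ENV, matching the two profiles
// listed above. Option names follow Express's res.cookie() options.
function sessionCookieOptions(nodeEnv = process.env.NODE_ENV) {
  const isProduction = nodeEnv === "production";
  return {
    httpOnly: true, // not readable from client-side JavaScript
    sameSite: isProduction ? "none" : "lax",
    secure: isProduction, // browsers require Secure when SameSite=None
  };
}
```

Any value other than "production", including an unset NODE_ENV, falls back to the development profile.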


Production-Grade System Design

The platform implements several critical patterns required for reliable, scalable systems:

Security & Access Control

  • JWT-based authentication with HTTP-only secure cookies
  • API Key management with per-client credential rotation
  • Role-based access control (RBAC) with admin, operator, and viewer roles
  • API key validation on every incoming telemetry request
  • Automatic token expiration and refresh mechanisms
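
A role check for the three roles might look like the sketch below. The roles come from the list above, but their ordering (admin above operator above viewer) and the function name are assumptions, since the actual permission model is not described.

```javascript
// Assumed hierarchy: admin > operator > viewer. The roles are named in the
// text, but this ordering is illustrative.
const ROLE_RANK = new Map([["viewer", 1], ["operator", 2], ["admin", 3]]);

// Returns true if `userRole` is at least as privileged as `requiredRole`.
function hasAtLeastRole(userRole, requiredRole) {
  if (!ROLE_RANK.has(requiredRole)) {
    throw new Error(`unknown required role: ${requiredRole}`);
  }
  const have = ROLE_RANK.get(userRole) ?? 0; // unknown roles get no access
  return have >= ROLE_RANK.get(requiredRole);
}
```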

Reliability Patterns

  • Circuit breaker pattern with three states (closed, open, half-open)
  • Exponential backoff with randomized jitter for retry handling
  • Dead-letter queues for handling unprocessable events
  • Health checks on all critical services
  • Graceful shutdown procedures with in-flight operation completion

Observability

  • Structured logging in JSON format for easy parsing
  • Request tracing with unique IDs for debugging
  • Performance metrics collection on all critical paths
  • Error aggregation and alerting
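
Structured logging with request IDs can be sketched without any logging library. The field names (time, level, msg, requestId) and both function names are assumptions for illustration.

```javascript
// Build one JSON log line. Core fields are written last so that caller-supplied
// fields cannot overwrite them. Field names are assumptions.
function formatLogLine(level, msg, fields = {}, now = new Date()) {
  return JSON.stringify({ ...fields, time: now.toISOString(), level, msg });
}

// Create a logger bound to a request ID so every line can be correlated.
// `write` defaults to stdout and can be replaced, for example in tests.
function createRequestLogger(requestId, write = (line) => process.stdout.write(line + "\n")) {
  const log = (level) => (msg, fields) =>
    write(formatLogLine(level, msg, { ...fields, requestId }));
  return { info: log("info"), warn: log("warn"), error: log("error") };
}
```

Because each line is a single JSON object carrying the same requestId, all log entries for one request can be found by filtering on that field.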

Infrastructure

  • Containerized deployments using Docker
  • Docker Compose for local development environment
  • Support for horizontal scaling (multiple consumer workers)
  • Connection pooling for database and message broker
  • Automated database migrations

Multi-Tenant Architecture

  • Namespace isolation between customers
  • Per-customer API keys and rate limiting
  • Separate metric aggregations by customer
  • Billing-ready event counting infrastructure

Technology Stack & Architecture

Frontend Layer

  • React 18 with Vite for optimized bundling
  • TanStack Query for server state management
  • ApexCharts for real-time metric visualization
  • Tailwind CSS for responsive design
  • Axios for REST API communication

Backend Services

  • Node.js runtime for high concurrency
  • Express.js framework for HTTP routing
  • JWT middleware for authentication
  • Structured logging middleware

Message Queue & Streaming

  • RabbitMQ 3.x for durable event queuing
  • Manual acknowledgments for delivery guarantees
  • Dead-letter exchanges for failure handling
  • Topic-based routing for multi-tenant isolation

Data Storage

  • MongoDB 6.0 for schemaless event storage
  • TTL indexes for automatic data expiration
  • PostgreSQL 15 for time-series aggregations
  • Connection pooling and query optimization

Deployment & Operations

  • Docker containerization for reproducible deployments
  • Docker Compose for local development
  • Environment-based configuration management
  • Graceful shutdown handling

Development Tools

  • ESLint for code quality
  • Jest for unit testing
  • Postman collections for API testing
  • Docker for isolated local environment

Key Architectural Insights

Building Sendry provided deep experience with production system design patterns. The project demonstrated how architectural decisions made early in development have enormous downstream consequences.

Decoupled Architecture Enables Scalability

The fundamental insight was separating concerns into independent layers. Keeping database writes out of the request path through message queueing made the system horizontally scalable without architectural changes. Additional consumer processes can be spawned to increase throughput without modifying the ingestion layer.

Traffic Spike Absorption

In this design, the message broker's main job is to absorb traffic variance. During peak loads, the queue buffers events naturally, preventing cascading failures. Bottlenecks appear gracefully as increasing queue depth rather than as application errors, which makes it visible when capacity should be expanded.

Idempotency Eliminates Whole Classes of Bugs

Designing all operations to be idempotent made the system dramatically more resilient. Events can be replayed, retried, or reprocessed without corrupting state. This single design decision eliminated entire categories of subtle concurrency bugs.

Event-Driven Architecture Requires Careful State Management

The tradeoff for decoupling is increased complexity in state management. Monitoring the health of each layer independently became critical. Dead-letter queues surfaced issues that synchronous systems would expose immediately through exceptions.

Observability Must Be Built In

Building a monitoring system that had no monitoring of its own at first was challenging. The project benefited enormously from structured logging, unique request IDs, and per-component health checks. These were architectural requirements, not afterthoughts.

Database Trade-offs

The dual-database approach involved accepting higher operational complexity for significant performance gains. MongoDB handled write throughput that PostgreSQL alone would have struggled with. The trade-off was justified by the scale (billions of events) and access patterns (write-heavy raw events, read-heavy aggregations).

Circuit Breakers Prevent Chaos

Without circuit breakers, infrastructure failures propagate as cascading errors. With them, failures are contained and visible. The three-state pattern is remarkably elegant for handling transient failures differently from persistent outages.


Impact and Outcomes

The platform demonstrates how careful architectural design can solve operational problems that most teams try to solve with money (expensive SaaS platforms) or complexity (heavyweight open-source solutions).

Key achievements:

  • Zero latency impact on monitored applications
  • Handles 10,000+ events per second on commodity hardware
  • Accepts events with sub-2ms ingestion latency
  • Scales horizontally by adding consumer workers
  • Gracefully degrades under failure
  • Provides real-time operational visibility
  • Self-hosted, reducing long-term costs

The project reinforced a core engineering principle: reliability is not a feature added after launch, but a property built into the architecture from day one.