How complex can you make log forwarding?

You’d imagine not very - after all, what’s there to complicate?

You get the logs, move them to a processor, and then just write them to persistent storage - right?

You’d be surprised how complex you can make that…

This is a simplified view of what I discovered in the wild.

  +---------------+                   +-----------------+
  |  Application  | ---- Stdout ----> |  Docker Engine  |
  +---------------+                   +-----------------+
                                               |
                                              \./
  +---------------+                   +-----------------+
  |    Logspout   | <--- Listens ---> |  All stdout/s   |
  +---------------+                   +-----------------+
          |
         \./
  +---------------+                   +-----------------+
  |   Fluentbit   | --- Forward ----> |     HAProxy     |
  +---------------+                   +-----------------+
                                               |
                                              \./
  +---------------+                   +-----------------+
  | ElasticSearch | <---- Writes ---- |     Fluentd     |
  +---------------+                   +-----------------+

I absolutely fell off my chair when I saw this. What in everything that’s holy…

There are so many moving parts to this setup… And the worst part is, they’re not independently scalable.

Instead, they’re all tied 1:1 to either VMs or containers…

This hideous setup is a huge pain to maintain and scale.

It gave me no rest.

I had no choice but to entirely rewrite the whole thing.

The Rewrite

Ideally, I wanted to implement a GELF approach where the Application would log directly to fluentd for processing. And in a sense, that’s still the goal!

Intermediate Setup

  +---------------+                   +------------------+
  |  Application  | ---- Stdout ----> | Docker LogDriver |
  +---------------+                   +------------------+
                                               |
                                              \./
  +---------------+                   +-----------------+
  |    Fluentd    | <--- Forwards --- |    AWS NLB      |
  +---------------+                   +-----------------+
          |
         \./
  +---------------+
  | ElasticSearch |
  +---------------+

Now this is more like it!

The Docker Engine now uses the fluentd log driver to forward the raw logs through the AWS NLB to the fluentd service hosted on AWS Fargate with auto-scaling; after processing, the logs are ultimately written into ElasticSearch.
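
For illustration, here’s a minimal sketch of that log-driver wiring using the Docker SDK for Python - in a real deployment it would more likely sit in the ECS task definition’s logConfiguration (or in the daemon.json defaults), and the NLB hostname below is just a placeholder:

  import docker
  from docker.types import LogConfig

  client = docker.from_env()

  # Placeholder NLB DNS name fronting the Fargate-hosted fluentd service;
  # 24224 is fluentd's default forward port.
  fluentd_logging = LogConfig(
      type=LogConfig.types.FLUENTD,
      config={
          "fluentd-address": "logging-nlb.internal.example:24224",
          "tag": "myapp.{{.Name}}",  # Go template: tag records with the container name
      },
  )

  client.containers.run("myapp:latest", detach=True, log_config=fluentd_logging)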

This new setup has increased the resiliency and reliability of the logging infrastructure by almost entirely detaching it from the underlying instances and breaking it into a more decentralized architecture.

However, I’m not 100% happy with this… It can be better…

Final Setup

  +---------------+                   +------------------+
  |  Application  | ----- GELF -----> |     AWS NLB      |
  +---------------+                   +------------------+
                                               |
                                              \./
  +---------------+                   +-----------------+
  | ElasticSearch | <---- Writes ---- |    Fluentd      |
  +---------------+                   +-----------------+

This would - and eventually will - be ideal. The Applications themselves can send raw logs directly to the processor using GELF and freely add any additional fields to the log.
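
To make that concrete, here’s a minimal sketch of what “sending GELF directly” looks like from the application side, using only the Python standard library - the endpoint and the extra fields are placeholders, and the fluentd side is assumed to have a GELF input listening on that port:

  import json
  import socket
  import time

  def send_gelf(host: str, port: int, short_message: str, **extra_fields) -> None:
      """Send a single GELF 1.1 record over TCP (null-byte framed)."""
      record = {
          "version": "1.1",
          "host": socket.gethostname(),
          "short_message": short_message,
          "timestamp": time.time(),
          "level": 6,  # syslog severity: informational
      }
      # Additional fields must be prefixed with "_" per the GELF spec.
      record.update({f"_{key}": value for key, value in extra_fields.items()})

      # GELF over TCP delimits records with a null byte; over UDP the JSON
      # would be sent as a single (optionally compressed) datagram instead.
      with socket.create_connection((host, port)) as sock:
          sock.sendall(json.dumps(record).encode("utf-8") + b"\0")

  # Placeholder endpoint: the NLB fronting the Fargate-hosted fluentd service.
  send_gelf("logging-nlb.internal.example", 12201, "payment captured",
            service="checkout", env="production", order_id=42)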

The final setup will have only 2 moving parts:

  • The ECS-hosted Application
  • The Fargate-hosted Fluentd

The simplicity of this solution outweighs the slight extra compute overhead of implementing GELF.

However, there is a real possibility of log-loss, as there is no “local” log aggregator that could spool logs if the NLB becomes unavailable.

This is a niche case that could be handled by adding a local fluentbit as a proxy, but the probability of it happening is very low.

So low that it’s not worth the effort of adding fluentbit locally.

If you run 6 or more instances of fluentd (at least two per AZ across three AZs), you have single redundancy in every AZ - one instance per AZ can fail - further minimizing the risk of log-loss.

Furthermore, if you enable Cross-AZ Load Balancing on the NLB, you get 1+5 redundancy - every instance can receive traffic from every AZ, so up to five of the six can fail - which makes the risk of log-loss virtually zero.

Next Step?

I’ll likely rewrite the fluentd processing pipelines, as they seem to be overly complex for little reason… Once the GELF switch is done, a lot of the hydrated information can be taken verbatim from the actual source of truth - the Application.
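
In day-to-day application code that probably won’t be hand-rolled sockets but a stock GELF handler for the logging framework. As a rough sketch (assuming the graypy package; the endpoint and fields are placeholders), attaching the context once in the Application lets fluentd pass it through verbatim instead of reconstructing it in the pipeline:

  import logging

  import graypy  # third-party GELF handlers for Python's logging module

  logger = logging.getLogger("checkout")
  logger.setLevel(logging.INFO)
  # Placeholder endpoint: the NLB fronting the Fargate-hosted fluentd service.
  logger.addHandler(graypy.GELFUDPHandler("logging-nlb.internal.example", 12201))

  # Attach static context once; it is shipped as additional GELF fields
  # (_service, _env), so the fluentd pipeline no longer has to enrich records.
  app_log = logging.LoggerAdapter(logger, {"service": "checkout", "env": "production"})

  app_log.info("payment captured for order 42")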

So let’s see what I actually end up doing!