import * as React from 'react'
  /* @jsx mdx */
import { mdx } from '@mdx-js/react';
/* @jsxRuntime classic */

/* @jsx mdx */

export const _frontmatter = {
  "title": "Monitoring serverless apps",
  "description": "Server code is invisible, how do you know when it breaks?",
  "image": "./img/serverless-monitoring.png"
};
const layoutProps = {
  _frontmatter
};
const MDXLayout = "wrapper";
export default function MDXContent({
  components,
  ...props
}) {
  return <MDXLayout {...layoutProps} {...props} components={components} mdxType="MDXLayout">
    <h1 {...{
      "id": "monitoring-serverless-apps"
    }}>{`Monitoring serverless apps`}</h1>
    <p><img parentName="p" {...{
        "src": "/ea377120be207d9fd91175b9de46a425/serverless-monitoring.svg",
        "alt": null
      }}></img></p>
    <p>{`When clients have a bug, you can tell. Go through the flow, click around, see what happens.`}</p>
    <p>{`But code on the server is invisible. And `}<a parentName="p" {...{
        "href": "/robust-backend-design"
      }}>{`always broken`}</a>{`. A distributed system is never 100% error-free.`}</p>
    <p>{`A `}<a parentName="p" {...{
        "href": "/serverless-architecture-principles"
      }}>{`good architecture`}</a>{` lets you ignore many errors. The system recovers on its own.`}</p>
    <p>{`What about the bad errors? And how do you debug code you can't see?`}</p>
    <h2 {...{
      "id": "observability"
    }}>{`Observability`}</h2>
    <p>{`Observability is the art of understanding the internal state of a system based on its outputs. It's a continuous process.`}</p>
    <p>{`A good system lets you:`}</p>
    <ul>
      <li parentName="ul">{`understand what's going on`}</li>
      <li parentName="ul">{`see trends`}</li>
      <li parentName="ul">{`figure out what happened after an error`}</li>
      <li parentName="ul"><em parentName="li">{`predict`}</em>{` errors`}</li>
      <li parentName="ul">{`know when there's an emergency`}</li>
      <li parentName="ul">{`understand how to fix an emergency`}</li>
    </ul>
    <p>{`Those are design goals. There is no right answer. Observability is an art and getting it right takes practice.`}</p>
    <p>{`But there are guidelines you can follow. You'll need:`}</p>
    <ul>
      <li parentName="ul"><strong parentName="li">{`logs`}</strong>{` are immutable events that happened in your system. They follow a structured format and offer information about what happened where and when.`}</li>
      <li parentName="ul"><strong parentName="li">{`metrics`}</strong>{` are aggregate events over time. They tell you how much of what is happening, how long it takes, and help you understand trends.`}</li>
      <li parentName="ul"><strong parentName="li">{`traces`}</strong>{` are journeys through the system. A sequence of events that contributed to a bigger result.`}</li>
    </ul>
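    <p>{`A log that follows a structured format can be as small as one JSON object per line. Here's a minimal sketch. The logEvent helper and the field names (service, userId, durationMs) are assumptions for illustration, not part of any library.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-javascript"
      }}>{`// Hypothetical helper: prints one JSON object per line so log tools can filter by field.
function logEvent(event, fields = {}) {
  const entry = {
    ...fields, // where it happened, plus any identifiers you pass in
    timestamp: new Date().toISOString(), // when it happened
    event, // what happened
  };
  console.log(JSON.stringify(entry));
  return entry;
}

logEvent("checkout.completed", { service: "payments", userId: "u_123", durationMs: 87 });
`}</code></pre>
    <p>{`Metrics and traces can then be derived from these entries: count them over time for a metric, group them by a shared identifier for a trace.`}</p>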
    <div id="lock" />
    <h3 {...{
      "id": "questions-to-ask-yourself"
    }}>{`Questions to ask yourself`}</h3>
    <p>{`Observability is part of your development process. You can't tack it on later. `}</p>
    <p>{`I like to ask these questions when building:`}</p>
    <ol>
      <li parentName="ol">{`How will you know this works?`}</li>
      <li parentName="ol">{`How will you know this broke?`}</li>
      <li parentName="ol">{`How will you deduce where it broke?`}</li>
      <li parentName="ol">{`How will you figure out how it broke?`}</li>
      <li parentName="ol">{`How will you know which payload broke it?`}</li>
    </ol>
    <p>{`There is no right answer. It takes a few emergencies to dial it in and save the info you need.`}</p>
    <h3 {...{
      "id": "when-do-you-need-observability"
    }}>{`When do you need observability?`}</h3>
    <p>{`Always. 😛`}</p>
    <p>{`It depends. How critical is your software?`}</p>
    <p>{`When you have 10 users, eh I'd focus on getting users. When you have 100 users, eh they'll tell you when there's a bug.`}</p>
    <p>{`You'll see stranger and stranger bugs the more users you have. A 1-in-1000 bug shows up about once a day when 1000 users each hit that code path daily. At Google scale, tiny impossible-to-reproduce bugs happen every minute.`}</p>
    <p>{`That's when observability shines. Understanding bugs you can't reproduce.`}</p>
    <p><em parentName="p">{`PS: you don't need traditional "monitoring" in a serverless system. Your server is never down, your memory is never full, your storage never runs out, your CPU is never busy.`}</em></p>
    <h2 {...{
      "id": "what-to-measure"
    }}>{`What to measure`}</h2>
    <p>{`Deciding what to measure is an art. You'll get it wrong.`}</p>
    <p>{`You build a system asking the 5 questions we mentioned. Add a bunch of measurements and walk away.`}</p>
    <p>{`A few days pass and something goes wrong.`}</p>
    <p>{`You go through the logs. There are too many. You ignore 80%.`}</p>
    <p>{`You realize the 20% that are useful don't have enough info. Despite your best efforts, you can't be certain what happened.`}</p>
    <p>{`Adjust what you log, add the info you wish you had, remove the info you didn't need. Next time will be better.`}</p>
    <p>{`It's an iterative process :)`}</p>
    <h3 {...{
      "id": "typical-useful-logs-and-events"
    }}>{`Typical useful logs and events`}</h3>
    <p>{`You need two types of logs:`}</p>
    <ol>
      <li parentName="ol">{`The system is ticking along`}</li>
      <li parentName="ol">{`Errors`}</li>
    </ol>
    <p>{`Happy logs work like breadcrumbs. `}</p>
    <p>{`You leave them behind so you can later trace a path through the system. How did this user get into that state? Are we seeing bottlenecks? Did event B that always comes after event A suddenly stop coming?`}</p>
    <p>{`You want to know when a typical behavior changes.`}</p>
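    <p>{`Spotting a missing event B can be sketched as a small check over your breadcrumbs. This assumes every log entry carries a traceId that ties one user journey together. The function name, the log shape, and the event names are hypothetical.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-javascript"
      }}>{`// Assumed log shape: { traceId, event }.
// Returns the traceIds that logged the first event but never the second.
function missingFollowUps(logs, first, second) {
  const sawSecond = new Set(
    logs.filter((l) => l.event === second).map((l) => l.traceId)
  );
  return logs
    .filter((l) => l.event === first && !sawSecond.has(l.traceId))
    .map((l) => l.traceId);
}

const logs = [
  { traceId: "a", event: "order.created" },
  { traceId: "a", event: "order.paid" },
  { traceId: "b", event: "order.created" }, // trace b never reached order.paid
];
console.log(missingFollowUps(logs, "order.created", "order.paid"));
`}</code></pre>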
    <p>{`Errors – always log errors. Add as much debugging info as possible. Print the whole stack trace, the exact error, and any identifiers you'll need to reproduce the bug.`}</p>
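    <p>{`Here's one way that could look in code. It's a sketch under assumptions: logError and withErrorLogging are hypothetical helpers, and the context fields are whatever identifiers you'd need to reproduce the bug.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-javascript"
      }}>{`// Hypothetical helper: logs the exact error, the whole stack trace, and identifiers.
function logError(error, context = {}) {
  const entry = {
    ...context, // identifiers such as a request id or the payload
    level: "error",
    timestamp: new Date().toISOString(),
    name: error.name,
    message: error.message,
    stack: error.stack,
  };
  console.error(JSON.stringify(entry));
  return entry;
}

// Runs any async piece of work and logs a failure before passing it on.
async function withErrorLogging(work, context) {
  try {
    return await work();
  } catch (error) {
    logError(error, context);
    throw error; // rethrow so the caller or platform can still retry
  }
}
`}</code></pre>
    <p>{`Rethrowing matters. Swallow the error and your logs look fine while the request quietly fails.`}</p>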
    <h3 {...{
      "id": "metrics-to-track"
    }}>{`Metrics to track`}</h3>
    <p>{`Specific metrics depend on what you care about. `}</p>
    <p>{`Got a function that needs to be fast? Measure its speed. Got a suspected bottleneck? Measure requests waiting. Got a flaky process? Measure error rate.`}</p>
    <p>{`At the least, you'll want to measure 3 metrics for each part of your system:`}</p>
    <ol>
      <li parentName="ol"><strong parentName="li">{`Throughput`}</strong>{` – how many requests are you processing`}</li>
      <li parentName="ol"><strong parentName="li">{`Error rate`}</strong>{` – how many errors happen`}</li>
      <li parentName="ol"><strong parentName="li">{`Failure rate`}</strong>{` – how many requests never succeed`}</li>
    </ol>
    <p><em parentName="p">{`"Part of the system"`}</em>{` means an end-to-end process as seen by a user. Don't sweat individual pieces unless you identify a problem that needs a closer look.`}</p>
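    <p>{`To make the three metrics concrete, here's a rough sketch that computes them from a list of finished requests. The record shape is an assumption, and so is the reading of error rate as the share of requests that hit at least one error, even if a retry later succeeded.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-javascript"
      }}>{`// Assumed record shape: { errors: number of failed attempts, succeeded: boolean }
function summarize(requests, windowHours) {
  const total = requests.length;
  const withErrors = requests.filter((r) => r.errors > 0).length;
  const failed = requests.filter((r) => !r.succeeded).length;
  return {
    throughputPerHour: total / windowHours,
    errorRate: total === 0 ? 0 : withErrors / total,
    failureRate: total === 0 ? 0 : failed / total,
  };
}

const stats = summarize(
  [
    { errors: 0, succeeded: true },
    { errors: 2, succeeded: true }, // recovered after retries
    { errors: 3, succeeded: false }, // never succeeded
    { errors: 0, succeeded: true },
  ],
  2 // the requests above arrived within a 2 hour window
);
console.log(stats); // throughputPerHour 2, errorRate 0.5, failureRate 0.25
`}</code></pre>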
    <h3 {...{
      "id": "when-to-alarm"
    }}>{`When to alarm`}</h3>
    <p>{`Metrics help when you look at them, logs help when you're solving a problem. Alarms come to you.`}</p>
    <p>{`An escalating system works best:`}</p>
    <ul>
      <li parentName="ul">{`email for small problems`}</li>
      <li parentName="ul">{`Slack when the fire grows`}</li>
      <li parentName="ul">{`SMS for critical issues`}</li>
    </ul>
    <p>{`You'll want to set alarms for high error and failure rates (what counts as high depends on your system) and for anomalies in throughput. When a 100/hour event drops to zero, something's wrong.`}</p>
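    <p>{`The escalation logic itself is tiny. Here's a sketch with made-up thresholds. The numbers, the pickAlarm name, and the choice to treat a drop to zero as critical are all assumptions you'd tune for your own system.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-javascript"
      }}>{`// Hypothetical thresholds: the error rate at which each channel kicks in.
const THRESHOLDS = { email: 0.01, slack: 0.05, sms: 0.2 };

function pickAlarm({ errorRate, throughput, expectedThroughput }) {
  // An event that normally flows and suddenly drops to zero is treated as critical here.
  if (expectedThroughput > 0 && throughput === 0) return "sms";
  if (errorRate >= THRESHOLDS.sms) return "sms";
  if (errorRate >= THRESHOLDS.slack) return "slack";
  if (errorRate >= THRESHOLDS.email) return "email";
  return null; // nothing worth interrupting anyone for
}

console.log(pickAlarm({ errorRate: 0.07, throughput: 95, expectedThroughput: 100 }));
`}</code></pre>
    <p>{`In practice your monitoring tool evaluates rules like these for you. The sketch only shows the shape of the decision.`}</p>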
    <p>{`How to set alarms depends on your tool of choice. On AWS, CloudWatch offers basic support, and I've loved Datadog in the past. Anomaly detection on Datadog is wonderful. 👌`}</p>
    <h2 {...{
      "id": "distributed-logging"
    }}>{`Distributed logging`}</h2>
    <p>{`Logging is the core of your observability toolbox. Metrics and traces build on top of logs.`}</p>
    <p>{`In a serverless system, you can't sign into a server to see the logs. There's no server and your system is distributed across many services.`}</p>
    <p>{`You'll need a distributed logging system.`}</p>
    <p>{`On AWS, you can achieve this through CloudWatch, a service that collects output from your Lambdas and offers a basic UI.`}</p>
    <h3 {...{
      "id": "statsd"
    }}>{`StatsD`}</h3>
    <p>{`When you outgrow default CloudWatch metrics and need deeper insights, there's a rich ecosystem of tools and resources waiting for you. All built on top of a de facto standard: `}<a parentName="p" {...{
        "href": "https://github.com/statsd/statsd"
      }}>{`StatsD`}</a>{`.`}</p>
    <p>{`StatsD is an open source daemon that collects metrics written in a specific text format. It aggregates them and sends them to a central location without interfering with your code.`}</p>
    <p>{`You can `}<a parentName="p" {...{
        "href": "https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-custom-metrics-statsd.html"
      }}>{`use StatsD with CloudWatch`}</a>{` to collect custom metrics. These show up in the CloudWatch UI.`}</p>
    <p>{`Print logs like this:`}</p>
    <pre><code parentName="pre" {...{}}>{`console.log("MetricName:value|type|@sample_rate|#tag1,tag2")
`}</code></pre>
    <p>{`Sample rate and tags are optional.`}</p>
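    <p>{`A small helper keeps those lines consistent. This sketch follows the StatsD line format, where the sample rate is prefixed with @ and the tag list with # (tags are an extension supported by tools like DogStatsD and the CloudWatch agent). The statsdLine name and the metric names are hypothetical.`}</p>
    <pre><code parentName="pre" {...{
        "className": "language-javascript"
      }}>{`// Builds one StatsD-style line. Common types: "c" counter, "g" gauge, "ms" timer.
function statsdLine(name, value, type, { sampleRate, tags = [] } = {}) {
  let line = name + ":" + value + "|" + type;
  if (sampleRate !== undefined) line += "|@" + sampleRate; // optional
  if (tags.length > 0) line += "|#" + tags.join(","); // optional
  return line;
}

console.log(statsdLine("checkout.duration", 87, "ms"));
// checkout.duration:87|ms

console.log(statsdLine("checkout.errors", 1, "c", { sampleRate: 0.5, tags: ["env:prod"] }));
// checkout.errors:1|c|@0.5|#env:prod
`}</code></pre>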
    <p>{`When you print in that format, you can connect a number of 3rd party tools that give you power beyond the CloudWatch UI. `}<a parentName="p" {...{
        "href": "https://www.datadoghq.com"
      }}>{`Datadog`}</a>{` has been a great choice for me.`}</p>
    <h2 {...{
      "id": "make-a-pulse-dashboard"
    }}>{`Make a pulse dashboard`}</h2>
    <p>{`The final step in your observability journey is a nice dashboard. Something you can look at, see wiggly lines, and say `}<em parentName="p">{`"Yep, system's working."`}</em></p>
    <p>{`Dashboard your critical metrics. What you're focusing on `}<em parentName="p">{`right now`}</em>{`. 5 or 6 at most.`}</p>
    <p>{`You can build detailed dashboards for specific parts of your system. That comes when your project grows.`}</p>
    <p>{`A typical core dashboard includes:`}</p>
    <ul>
      <li parentName="ul">{`global request throughput`}</li>
      <li parentName="ul">{`global error rate`}</li>
      <li parentName="ul">{`global failure rate`}</li>
      <li parentName="ul">{`response times`}</li>
    </ul>
    <p>{`And remember: A metric that isn't actionable is pure vanity and should be removed.`}</p>
    <p>{`In the next chapter, we look at the split between localhost and production.`}</p>

    </MDXLayout>;
}
;
MDXContent.isMDXComponent = true;
      