Merge pull request #90 from sumit419/monitoring_and_metrics
Monitoring and metrics
@@ -2,7 +2,7 @@

<img src="img/sos.png" width=200 >

In early 2019, we started visiting campuses across India to recruit the best and brightest minds to ensure LinkedIn, and all the services that make up its complex technology stack, is always available for everyone. This critical function at Linkedin falls under the purview of the Site Engineering team and Site Reliability Engineers (SREs) who are Software Engineers specializing in reliability. SREs apply the principles of computer science and engineering to the design, development and operation of computer systems: generally, large scale, distributed ones

In early 2019, we started visiting campuses across India to recruit the best and brightest minds to ensure LinkedIn, and all the services that make up its complex technology stack, is always available for everyone. This critical function at LinkedIn falls under the purview of the Site Engineering team and Site Reliability Engineers (SREs), who are Software Engineers specializing in reliability. SREs apply the principles of computer science and engineering to the design, development, and operation of computer systems: generally, large-scale, distributed ones.

As we continued on this journey, we started getting a lot of questions from these campuses about what exactly the site reliability engineering role entails, and how someone could learn the skills and disciplines involved to become a successful site reliability engineer. Fast forward a few months, and a few of these campus students had joined LinkedIn either as interns or as full-time engineers to become a part of the Site Engineering team; we also had a few lateral hires who joined our organization who were not from a traditional SRE background. That's when a few of us got together and started to think about how to onboard new graduate engineers to the Site Engineering team.
@@ -20,9 +20,10 @@ In this course, we are focusing on building strong foundational skills. The cour

- [NoSQL concepts](https://linkedin.github.io/school-of-sre/databases_nosql/intro/)
- [Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/)
- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)
- [Metrics and Monitoring](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/introduction/)
- [Security](https://linkedin.github.io/school-of-sre/security/intro/)

We believe continuous learning helps in acquiring deeper knowledge and competencies and in expanding your skill set; every module includes added references that can serve as a guide for further learning. Our hope is that by going through these modules you will build the essential skills required of a Site Reliability Engineer.

At Linkedin, we are using this curriculum for onboarding our non-traditional hires and new college grads into the SRE role. We had multiple rounds of successful onboarding experience with new employees and the course helped them be productive in a very short period of time. This motivated us to open source the content for helping other organizations in onboarding new engineers into the role and provide guidance for aspiring individuals to get into the role. We realize that the initial content we created is just a starting point and we hope that the community can help in the journey of refining and expanding the content. Checkout [the contributing guide](./CONTRIBUTING.md) to get started.

At LinkedIn, we are using this curriculum to onboard our non-traditional hires and new college grads into the SRE role. We have had multiple rounds of successful onboarding with new employees, and the course helped them become productive in a very short period of time. This motivated us to open source the content to help other organizations onboard new engineers into the role and to provide guidance for aspiring individuals looking to get into it. We realize that the initial content we created is just a starting point, and we hope the community can help in the journey of refining and expanding it. Check out [the contributing guide](./CONTRIBUTING.md) to get started.
@@ -0,0 +1,29 @@

# Proactive monitoring using alerts

Earlier we discussed different ways to collect key metric data points from a service and its underlying infrastructure. This data gives us a better understanding of how the service is performing. One of the main objectives of monitoring is to detect any service degradations early (reduce Mean Time To Detect) and notify stakeholders so that the issues are either avoided or can be fixed early, thus reducing Mean Time To Recover (MTTR). For example, if you are notified when resource usage by a service exceeds 90 percent, you can take preventive measures to avoid any service breakdown due to a shortage of resources. On the other hand, when a service goes down due to an issue, early detection and notification of such incidents can help you quickly fix the issue.


<p align="center"> Figure 8: An alert notification received on Slack </p>

Today most of the available monitoring services provide a mechanism to set up alerts on one metric or a combination of metrics to actively monitor service health. These alerts have a set of defined rules or conditions, and when a rule is broken, you are notified. These rules can be as simple as notifying when a metric value exceeds n, or as complex as a week over week (WoW) comparison of standard deviation over a period of time. Monitoring tools notify you about an active alert, and most of these tools support notifications over instant messaging (IM) platforms, SMS, email, or phone calls. Figure 8 shows a sample alert notification received on Slack for memory usage exceeding 90 percent of total RAM space on the host.
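The sketch below is a minimal illustration of such an alert rule expressed in code, assuming the `psutil` package is available and a hypothetical Slack incoming-webhook URL; in practice you would rely on the alerting features of your monitoring stack rather than a hand-rolled script.

```python
import psutil
import requests

# Hypothetical Slack incoming-webhook URL; replace with your own.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
MEMORY_THRESHOLD_PERCENT = 90


def check_memory_and_alert():
    """Send a Slack notification when memory usage crosses the threshold."""
    used_percent = psutil.virtual_memory().percent
    if used_percent > MEMORY_THRESHOLD_PERCENT:
        message = (
            f":red_circle: Memory usage is at {used_percent:.1f}% "
            f"(threshold: {MEMORY_THRESHOLD_PERCENT}%)"
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)


if __name__ == "__main__":
    check_memory_and_alert()
```

Run periodically (for example from cron or a scheduler), this implements the simplest kind of rule described above: notify when a metric value exceeds a fixed threshold.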
@@ -0,0 +1,40 @@

# Best practices for monitoring

When setting up monitoring for a service, keep the following best practices in mind.

- **Use the right metric type** -- Most of the libraries available today offer various metric types. Choose the appropriate metric type for monitoring your system. Following are the types of metrics and their purposes.

    - **Gauge --** *Gauge* is a constant type of metric. After the metric is initialized, the metric value does not change unless you intentionally update it.

    - **Timer --** *Timer* measures the time taken to complete a task.

    - **Counter --** *Counter* counts the number of occurrences of a particular event.

    For more information about these metric types, see [Data Types](https://statsd.readthedocs.io/en/v0.5.0/types.html). A short usage sketch follows this list.

- **Avoid over-monitoring** -- Monitoring can be a significant engineering endeavor. Therefore, be sure not to spend too much time and resources on monitoring services, yet make sure all important metrics are captured.

- **Prevent alert fatigue** -- Set alerts for metrics that are important and actionable. If you receive too many non-critical alerts, you might start ignoring alert notifications over time. As a result, critical alerts might get overlooked.

- **Have a runbook for alerts** -- For every alert, make sure you have a document explaining what actions and checks need to be performed when the alert fires. This enables any engineer on the team to handle the alert and take necessary actions, without any help from others.
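As an illustration of these metric types, here is a minimal sketch using the [statsd](https://statsd.readthedocs.io/) Python client; the metric names, the prefix, and the local StatsD daemon address are assumptions made for the example.

```python
import time
import statsd

# Assumes a StatsD daemon listening on localhost:8125 (the default).
client = statsd.StatsClient("localhost", 8125, prefix="url_shortener")

# Counter: count every shorten request handled.
client.incr("requests.shorten")

# Timer: measure how long a block of work takes.
with client.timer("db.lookup_time"):
    time.sleep(0.05)  # stand-in for a real database lookup

# Gauge: record a point-in-time value, e.g. current queue depth.
client.gauge("queue.depth", 42)
```

Because StatsD metrics are sent over UDP by default, these calls are fire-and-forget and add negligible overhead to the instrumented code path.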
@@ -0,0 +1,101 @@

# Command-line tools

Most Linux distributions today come with a set of tools that monitor the system's performance. These tools help you measure and understand various subsystem statistics (CPU, memory, network, and so on). Let's look at some of the tools that are predominantly used.

- `ps`/`top` -- The process status command (`ps`) displays information about all the currently running processes in a Linux system. The `top` command is similar to the `ps` command, but it periodically updates the information displayed until the program is terminated. An advanced version of `top`, called `htop`, has a more user-friendly interface and some additional features. These command-line utilities come with options to modify the operation and output of the command. Following are some important options supported by the `ps` command.

    - `-p <pid1, pid2,...>` -- Displays information about processes that match the specified process IDs. Similarly, you can use `-u <uid>` and `-g <gid>` to display information about processes belonging to a specific user or group.

    - `-a` -- Displays information about other users' processes, as well as one's own.

    - `-x` -- When displaying processes matched by other options, includes processes that do not have a controlling terminal.


<p align="center"> Figure 2: Results of top command </p>

- `ss` -- The socket statistics command (`ss`) displays information about network sockets on the system. This tool is the successor of [netstat](https://man7.org/linux/man-pages/man8/netstat.8.html), which is deprecated. Following are some command-line options supported by the `ss` command:

    - `-t` -- Displays TCP sockets. Similarly, `-u` displays UDP sockets, `-x` is for UNIX domain sockets, and so on.

    - `-l` -- Displays only listening sockets.

    - `-n` -- Instructs the command not to resolve service names; port numbers are displayed instead.


<p align="center"> Figure 3: List of listening sockets on a system </p>

- `free` -- The `free` command displays memory usage statistics on the host, such as available memory, used memory, and free memory. Most often, this command is used with the `-h` command-line option, which displays the statistics in a human-readable format.


<p align="center"> Figure 4: Memory statistics on a host in human-readable form </p>

- `df` -- The `df` command displays disk space usage statistics. The `-i` command-line option is also often used to display [inode](https://en.wikipedia.org/wiki/Inode) usage statistics. The `-h` command-line option is used for displaying statistics in a human-readable format.


<p align="center"> Figure 5: Disk usage statistics on a system in human-readable form </p>

- `sar` -- The `sar` utility monitors various subsystems, such as CPU and memory, in real time. This data can be stored in a file specified with the `-o` option. This tool helps to identify anomalies.

- `iftop` -- The interface top command (`iftop`) displays bandwidth utilization by a host on an interface. This command is often used to identify bandwidth usage by active connections. The `-i` option specifies which network interface to watch.


<p align="center"> Figure 6: Network bandwidth usage by active connections on the host </p>

- `tcpdump` -- The `tcpdump` command is a network monitoring tool that captures network packets flowing over the network and displays a description of the captured packets. The following options are available:

    - `-i <interface>` -- Interface to listen on

    - `host <IP/hostname>` -- Filters traffic going to or from the specified host

    - `src/dst` -- Displays one-way traffic from the source (src) or to the destination (dst)

    - `port <port number>` -- Filters traffic to or from a particular port


<p align="center"> Figure 7: tcpdump of packets on docker0 interface on a host </p>
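When you need these statistics inside your own tooling rather than at an interactive shell, a library such as [psutil](https://psutil.readthedocs.io/) exposes much of the same data programmatically. The snippet below is a small sketch of that idea; the library choice and the fields printed are illustrative assumptions, not part of the tools covered above.

```python
import psutil

# Roughly what `top` shows: per-CPU utilization and memory usage.
print("CPU usage per core (%):", psutil.cpu_percent(interval=1, percpu=True))
print("Memory used (%):", psutil.virtual_memory().percent)

# Roughly what `df -h` shows for the root filesystem.
root = psutil.disk_usage("/")
print(f"Disk /: {root.used / 2**30:.1f} GiB used of {root.total / 2**30:.1f} GiB")

# Roughly what `ss -tln` shows: listening TCP ports.
listening = [
    conn.laddr.port
    for conn in psutil.net_connections(kind="tcp")
    if conn.status == psutil.CONN_LISTEN
]
print("Listening TCP ports:", sorted(set(listening)))
```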
@@ -0,0 +1,52 @@

# Conclusion

A robust monitoring and alerting system is necessary for maintaining and troubleshooting a system. A dashboard with key metrics can give you an overview of service performance, all in one place. Well-defined alerts (with realistic thresholds and notifications) further enable you to quickly identify any anomalies in the service infrastructure and in resource saturation. By taking necessary actions, you can avoid any service degradations and decrease MTTD for service breakdowns.

In addition to in-house monitoring, monitoring real user experience can help you to understand service performance as perceived by the users. Many modules are involved in serving the user, and most of them are out of your control. Therefore, you need to have real-user monitoring in place.

Metrics give very abstract details on service performance. To get a better understanding of the system and for faster recovery during incidents, you might want to implement the other two pillars of observability: logs and tracing. Logs and trace data can help you understand what led to service failure or degradation.

## References

Following are some resources to learn more about monitoring and observability:

- [Google SRE book: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)

- [Mastering Distributed Tracing, by Yuri Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)

- [Monitoring and Observability](https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c)

- [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8)

- Engineering blogs on [LinkedIn](https://engineering.linkedin.com/blog/topic/monitoring), [Grafana](https://grafana.com/blog/), [Elastic.co](https://www.elastic.co/blog/), [OpenTelemetry](https://medium.com/opentelemetry)
@@ -0,0 +1,281 @@

# Prerequisites

- [Linux Basics](https://linkedin.github.io/school-of-sre/linux_basics/intro/)

- [Python and the Web](https://linkedin.github.io/school-of-sre/python_web/intro/)

- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)

- [Linux Networking Fundamentals](https://linkedin.github.io/school-of-sre/linux_networking/intro/)

## What to expect from this course

Monitoring is an integral part of any system. As an SRE, you need to have a basic understanding of monitoring a service infrastructure. By the end of this course, you will gain a better understanding of the following topics:

- What is monitoring?

- What needs to be measured

- How the metrics gathered can be used to improve business decisions and overall reliability

- Proactive monitoring with alerts

- Log processing and its importance

- What is observability?

    - Distributed tracing

    - Logs

    - Metrics

## What is not covered in this course

- Guide to setting up a monitoring infrastructure

- Deep dive into different monitoring technologies and benchmarking or comparison of any tools

## Course content

- [Introduction](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/introduction/#introduction)

    - [Four golden signals of monitoring](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/introduction/#four-golden-signals-of-monitoring)

    - [Why is monitoring important?](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/introduction/#why-is-monitoring-important)

- [Command-line tools](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/command-line_tools/)

- [Third-party monitoring](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/third-party_monitoring/)

- [Proactive monitoring using alerts](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/alerts/)

- [Best practices for monitoring](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/best_practices/)

- [Observability](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/observability/)

    - [Logs](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/observability/#logs)

    - [Tracing](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/observability/#tracing)

- [Conclusion](https://linkedin.github.io/school-of-sre/metrics_and_monitoring/conclusion/)
# Introduction

Monitoring is a process of collecting real-time performance metrics from a system, analyzing the data to derive meaningful information, and displaying the data to the users. In simple terms, you measure various metrics regularly to understand the state of the system, including, but not limited to, user requests, latency, and error rate. *What gets measured, gets fixed*---if you can measure something, you can reason about it, understand it, discuss it, and act upon it with confidence.

## Four golden signals of monitoring

When setting up monitoring for a system, you need to decide what to measure. The four golden signals of monitoring provide a good understanding of service performance and lay a foundation for monitoring a system. These four golden signals are

- Traffic

- Latency

- Error

- Saturation

These metrics help you to understand the system performance and bottlenecks, and to create a better end-user experience. As discussed in the [Google SRE book](https://sre.google/sre-book/monitoring-distributed-systems/), if you can measure only four metrics of your service, focus on these four. Let's look at each of the four golden signals.

- **Traffic** -- *Traffic* gives a better understanding of the service demand. Often referred to as *service QPS* (queries per second), traffic is a measure of requests served by the service. This signal helps you to decide when a service needs to be scaled up to handle increasing customer demand and scaled down to be cost-effective.

- **Latency** -- *Latency* is the measure of time taken by the service to process the incoming request and send the response. Measuring service latency helps in the early detection of slow degradation of the service. Distinguishing between the latency of successful requests and the latency of failed requests is important. For example, an [HTTP 5XX error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses) triggered due to loss of connection to a database or other critical backend might be served very quickly. However, because an HTTP 500 error indicates a failed request, factoring 500s into overall latency might result in misleading calculations.

- **Error (rate)** -- *Error* is the measure of failed client requests. These failures can be easily identified based on the response codes ([HTTP 5XX errors](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)). There might be cases where the response is considered erroneous due to wrong result data or due to policy violations. For example, you might get an [HTTP 200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200) response, but the body has incomplete data, or the response time is breaching the agreed-upon [SLA](https://en.wikipedia.org/wiki/Service-level_agreement)s. Therefore, you need to have other mechanisms (code logic or [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming))) in place to capture errors in addition to the response codes.

- **Saturation** -- *Saturation* is a measure of the resource utilization by a service. This signal tells you the state of service resources and how full they are. These resources include memory, compute, network I/O, and so on. Service performance slowly degrades even before resource utilization is at 100 percent. Therefore, having a utilization target is important. An increase in latency is a good indicator of saturation; measuring the [99th percentile](https://medium.com/@ankur_anand/an-in-depth-introduction-to-99-percentile-for-programmers-22e83a00caf) of latency can help in the early detection of saturation.

Depending on the type of service, you can measure these signals in different ways. For example, you might measure queries per second served for a web server. In contrast, for a database server, transactions performed and database sessions created give you an idea about the traffic handled by the database server. With the help of additional code logic (monitoring libraries and instrumentation), you can measure these signals periodically and store them for future analysis. Although these metrics give you an idea about the performance at the service end, you also need to ensure that the same user experience is delivered at the client end. Therefore, you might need to monitor the service from outside the service infrastructure, which is discussed under third-party monitoring.
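As a toy illustration of how such signals can be derived from raw request data, the sketch below computes traffic, error rate, and p99 latency from a list of (status code, latency) samples; the sample data and the one-minute window are assumptions made purely for the example.

```python
import statistics

# Hypothetical samples collected over a one-minute window:
# (HTTP status code, latency in milliseconds)
samples = [(200, 35), (200, 42), (500, 8), (200, 51), (404, 30), (200, 39)]

window_seconds = 60
traffic_qps = len(samples) / window_seconds

errors = [s for s in samples if s[0] >= 500]
error_rate = len(errors) / len(samples)

# Exclude failed requests so that fast 5XX responses do not skew latency.
ok_latencies = [latency for status, latency in samples if status < 500]
p99_latency = statistics.quantiles(ok_latencies, n=100)[98]

print(f"traffic: {traffic_qps:.2f} qps")
print(f"error rate: {error_rate:.1%}")
print(f"p99 latency: {p99_latency:.1f} ms")
```

In a real service these calculations are typically performed by the monitoring library or the metrics server rather than by hand-written code.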
## Why is monitoring important?

Monitoring plays a key role in the success of a service. As discussed earlier, monitoring provides performance insights for understanding service health. With access to historical data collected over time, you can build intelligent applications to address specific needs. Some of the key use cases follow:

- **Reduction in time to resolve issues** -- With a good monitoring infrastructure in place, you can identify issues quickly and resolve them, which reduces the impact caused by the issues.

- **Business decisions** -- Data collected over a period of time can help you make business decisions such as determining the product release cycle, which features to invest in, and geographical areas to focus on. Decisions based on long-term data can improve the overall product experience.

- **Resource planning** -- By analyzing historical data, you can forecast service compute-resource demands, and you can properly allocate resources. This enables financially effective decisions without compromising the end-user experience.

Before we dive deeper into monitoring, let's understand some basic terminology.

- **Metric** -- A metric is a quantitative measure of a particular system attribute---for example, memory or CPU

- **Node or host** -- A physical server, virtual machine, or container where an application is running

- **QPS** -- *Queries Per Second*, a measure of traffic served by the service per second

- **Latency** -- The time interval between user action and the response from the server---for example, time spent after sending a query to a database before the first response bit is received

- **Error rate** -- Number of errors observed over a particular time period (usually a second)

- **Graph** -- In monitoring, a graph is a representation of one or more values of metrics collected over time

- **Dashboard** -- A dashboard is a collection of graphs that provide an overview of system health

- **Incident** -- An incident is an event that disrupts the normal operations of a system

- **MTTD** -- *Mean Time To Detect* is the time interval between the beginning of a service failure and the detection of such failure

- **MTTR** -- *Mean Time To Resolve* is the time spent to fix a service failure and bring the service back to its normal state

Before we discuss monitoring an application, let us look at the monitoring infrastructure. Following is an illustration of a basic monitoring system.


<p align="center"> Figure 1: Illustration of a monitoring infrastructure </p>
Figure 1 shows a monitoring infrastructure mechanism for aggregating metrics on the system, and collecting and storing the data for display. In addition, a monitoring infrastructure includes alert subsystems for notifying concerned parties during any abnormal behavior. Let's look at each of these infrastructure components:

- **Host metrics agent --** A *host metrics agent* is a process running on the host that collects performance statistics for host subsystems such as memory, CPU, and network. These metrics are regularly relayed to a metrics collector for storage and visualization. Some examples are [collectd](https://collectd.org/), [telegraf](https://www.influxdata.com/time-series-platform/telegraf/), and [metricbeat](https://www.elastic.co/beats/metricbeat).

- **Metric aggregator --** A *metric aggregator* is a process running on the host. Applications running on the host collect service metrics using [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)). Collected metrics are sent either to the aggregator process or directly to the metrics collector over an API, if available. Received metrics are aggregated periodically and relayed to the metrics collector in batches. An example is [StatsD](https://github.com/statsd/statsd).

- **Metrics collector --** A *metrics collector* process collects all the metrics from the metric aggregators running on multiple hosts. The collector decodes this data and stores it in the database. Metric collection and storage might be taken care of by one single service such as [InfluxDB](https://www.influxdata.com/), which we discuss next. An example is [carbon daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html).

- **Storage --** A time-series database stores all of these metrics. Examples are [OpenTSDB](http://opentsdb.net/), [Whisper](https://graphite.readthedocs.io/en/stable/whisper.html), and [InfluxDB](https://www.influxdata.com/).

- **Metrics server --** A *metrics server* can be as basic as a web server that graphically renders metric data. In addition, the metrics server provides aggregation functionalities and APIs for fetching metric data programmatically. Some examples are [Grafana](https://github.com/grafana/grafana) and [Graphite-Web](https://github.com/graphite-project/graphite-web).

- **Alert manager --** The *alert manager* regularly polls the available metric data and, if any anomalies are detected, notifies you. Each alert has a set of rules for identifying such anomalies. Today many metrics servers such as [Grafana](https://github.com/grafana/grafana) support alert management. We discuss alerting [in detail later](#proactive-monitoring-using-alerts). Examples are [Grafana](https://github.com/grafana/grafana) and [Icinga](https://icinga.com/).
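To make the hand-off to the collector concrete, here is a minimal sketch that pushes one metric sample to a Graphite carbon daemon using its plaintext line protocol (`<metric path> <value> <timestamp>`); the host, port, and metric name are assumptions made for the example.

```python
import socket
import time

CARBON_HOST = "localhost"  # assumed carbon-cache daemon
CARBON_PORT = 2003         # default plaintext listener port


def send_metric(path, value, timestamp=None):
    """Send one sample in Graphite's plaintext line format."""
    timestamp = timestamp or int(time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))


send_metric("servers.web01.memory.used_percent", 72.5)
```

In practice, the host metrics agents and aggregators listed above do exactly this kind of relay on your behalf, batched and on a schedule.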
@@ -0,0 +1,151 @@

# Observability

Engineers often use the term observability when referring to building reliable systems. *Observability* is a term derived from control theory; it is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Service infrastructures used on a daily basis are becoming more and more complex; proactive monitoring alone is not sufficient to quickly resolve issues causing application failures. With monitoring, you can keep known past failures from recurring, but with a complex service architecture, many unknown factors can cause potential problems. To address such cases, you can make the service observable. An observable system provides highly granular insights into the implicit failure modes. In addition, an observable system furnishes ample context about its inner workings, which unlocks the ability to uncover deeper systemic issues.

Monitoring enables failure detection; observability helps in gaining a better understanding of the system. Among engineers, there is a common misconception that monitoring and observability are two different things. Actually, observability is a superset of monitoring; that is, monitoring improves service observability. The goal of observability is not only to detect problems, but also to understand where the issue is and what is causing it. In addition to metrics, observability has two more pillars: logs and traces, as shown in Figure 9. Although these three components do not make a system 100 percent observable, they are the most important and powerful components that give a better understanding of the system. Each of these pillars has its flaws, which are described in [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8).


<p align="center"> Figure 9: Three pillars of observability </p>

Because we have covered metrics already, let's look at the other two pillars (logs and traces).
#### Logs

Logs (often referred to as *events*) are a record of activities performed by a service during its run time, with a corresponding timestamp. Metrics give abstract information about degradations in a system, and logs give a detailed view of what is causing these degradations. Logs created by the applications and infrastructure components help in effectively understanding the behavior of the system by providing details on application errors, exceptions, and event timelines. Logs help you to go back in time to understand the events that led to a failure. Therefore, examining logs is essential to troubleshooting system failures.

Log processing involves the aggregation of different logs from individual applications and their subsequent shipment to central storage. Moving logs to central storage helps to preserve the logs in case the application instances are inaccessible, or the application crashes due to a failure. After the logs are available in a central place, you can analyze them to derive sensible information. For audit and compliance purposes, you archive these logs on the central storage for a certain period of time. Log analyzers fetch useful information from log lines, such as request user information, request URL (feature), and response headers (such as content length) and response time. This information is grouped based on these attributes and made available to you through a visualization tool for quick understanding.

You might be wondering how this log information helps. This information gives a holistic view of activities performed on all the involved entities. For example, let's say someone is performing a DoS (denial of service) attack on a web application. With the help of log processing, you can quickly look at top client IPs derived from access logs and identify where the attack is coming from.

Similarly, if a feature in an application is causing a high error rate when accessed with a particular request parameter value, the results of log analysis can help you to quickly identify the misbehaving parameter value and take further action.
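A log analyzer for the DoS example above can be as simple as the following sketch, which counts requests per client IP from an access log in the common nginx/Apache format; the log path and format are assumptions made for the illustration. Centralized platforms such as the ELK stack described next do the same kind of work at scale.

```python
from collections import Counter

# Assumes an nginx/Apache-style access log where the client IP is the first field,
# e.g. 203.0.113.7 - - [01/Jan/2021:10:00:00 +0000] "GET /abc HTTP/1.1" 200 512
ACCESS_LOG = "/var/log/nginx/access.log"

ip_counts = Counter()
with open(ACCESS_LOG) as log_file:
    for line in log_file:
        client_ip = line.split(" ", 1)[0]
        ip_counts[client_ip] += 1

# The heaviest clients; an abnormal spike from a single IP hints at a DoS source.
for ip, count in ip_counts.most_common(10):
    print(f"{ip}\t{count}")
```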

<p align="center"> Figure 10: Log processing and analysis using ELK stack </p>

Figure 10 shows a log processing platform using ELK (Elasticsearch, Logstash, Kibana), which provides centralized log processing. Beats is a collection of lightweight data shippers that can ship logs, audit data, network data, and so on over the network. In this use case specifically, we are using Filebeat as a log shipper. Filebeat watches service log files and ships the log data to Logstash. Logstash parses these logs and transforms the data, preparing it to be stored on Elasticsearch. The transformed log data is stored on Elasticsearch and indexed for fast retrieval. Kibana searches and displays the log data stored on Elasticsearch. Kibana also provides a set of visualizations for graphically displaying summaries derived from log data.

Storing logs is expensive: extensive logging of every event on the server takes up considerable storage space, and with an increasing number of services, this cost can increase proportionally to the number of services.
#### Tracing

So far, we covered the importance of metrics and logging. Metrics give an abstract overview of the system, and logging gives a record of events that occurred. Imagine a complex distributed system with multiple microservices, where a user request is processed by multiple microservices in the system. Metrics and logging give you some information about how these requests are being handled by the system, but they fail to provide detailed information across all the microservices and how they affect a particular client request. If a slow downstream microservice is leading to increased response times, you need detailed visibility across all involved microservices to identify that microservice. The answer to this need is a request tracing mechanism.

A trace is a series of spans, where each span is a record of events performed by different microservices to serve the client's request. In simple terms, a trace is a log of client-request serving derived from various microservices across different physical machines. Each span includes span metadata such as trace ID and span ID, and context, which includes information about transactions performed.


<p align="center"> Figure 11: Trace and spans for a URL shortener request </p>

Figure 11 is a graphical representation of a trace captured on the [URL shortener](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/) example we covered earlier while learning Python.

Similar to monitoring, the tracing infrastructure comprises a few modules for collecting traces, storing them, and accessing them. Each microservice runs a tracing library that collects traces in the background, creates in-memory batches, and submits them to the tracing backend. The tracing backend normalizes the received trace data and stores it on persistent storage. Tracing data comes from multiple different microservices; therefore, trace storage is often organized to store data incrementally and is indexed by trace identifier. This organization helps in the reconstruction of trace data and in visualization. Figure 12 illustrates the anatomy of distributed tracing.


<p align="center"> Figure 12: Anatomy of distributed tracing </p>

Today a set of tools and frameworks are available for building distributed tracing solutions. Following are some of the popular tools:

- [OpenTelemetry](https://opentelemetry.io/): Observability framework for cloud-native software

- [Jaeger](https://www.jaegertracing.io/): Open-source distributed tracing solution

- [Zipkin](https://zipkin.io/): Open-source distributed tracing solution
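As a small illustration of how spans are created in application code, the sketch below uses the OpenTelemetry Python SDK to emit a parent span and a child span for a hypothetical URL-shortener request, exporting them to the console; the span and attribute names are assumptions for the example, and a real setup would export to a backend such as Jaeger or Zipkin instead.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export finished spans to stdout for demonstration purposes.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("url_shortener")

# Parent span for the incoming request, with a child span for the DB lookup.
with tracer.start_as_current_span("GET /shorten") as span:
    span.set_attribute("http.method", "GET")
    with tracer.start_as_current_span("db.lookup_short_url"):
        pass  # stand-in for the actual database call
```

The tracing library batches these spans in memory and submits them to the configured backend, matching the collection flow described above.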
@@ -0,0 +1,37 @@

# Third-party monitoring

Today most cloud providers offer a variety of monitoring solutions. In addition, a number of companies such as [Datadog](https://www.datadoghq.com/) offer monitoring-as-a-service. In this section, we are not covering monitoring-as-a-service in depth.

In recent years, more and more people have gained access to the internet. Many services are offered online to cater to the increasing user base. As a result, web pages are becoming larger, with increased client-side scripts. Users want these services to be fast and error-free. From the service point of view, when the response body is composed, an HTTP 200 OK response is sent, and everything looks okay. But there might be errors during transmission or on the client side. As previously mentioned, monitoring services from within the service infrastructure gives good visibility into service health, but this is not enough. You need to monitor the user experience, specifically the availability of services for clients. A number of third-party services, such as [Catchpoint](https://www.catchpoint.com/) and [Pingdom](https://www.pingdom.com/), are available for achieving this goal.

Third-party monitoring services can generate synthetic traffic simulating user requests from various parts of the world, to ensure the service is globally accessible. Other third-party monitoring solutions for real user monitoring (RUM) provide performance statistics, such as service uptime and response time, from different geographical locations. This allows you to monitor the user experience from these locations, which might have different internet backbones, different operating systems, and different browsers and browser versions. [Catchpoint Global Monitoring Network](https://pages.catchpoint.com/overview-video) is a comprehensive 3-minute video that explains the importance of monitoring the client experience.
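Conceptually, a synthetic check is just a scripted request issued from outside your infrastructure and timed. The sketch below is a minimal, self-hosted illustration of that idea using the `requests` library (the probed URL and timeout are assumptions); commercial services run such probes continuously from many global vantage points and aggregate the results for you.

```python
import requests

TARGET_URL = "https://example.com/health"  # hypothetical health endpoint
TIMEOUT_SECONDS = 5


def synthetic_check(url):
    """Probe the endpoint once and report availability and response time."""
    try:
        response = requests.get(url, timeout=TIMEOUT_SECONDS)
        return {
            "up": response.status_code == 200,
            "status": response.status_code,
            "response_time_ms": response.elapsed.total_seconds() * 1000,
        }
    except requests.RequestException as exc:
        return {"up": False, "error": str(exc)}


print(synthetic_check(TARGET_URL))
```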
@@ -57,6 +57,14 @@ nav:
      - Availability: systems_design/availability.md
      - Fault Tolerance: systems_design/fault-tolerance.md
      - Conclusion: systems_design/conclusion.md
  - Metrics and Monitoring:
      - Introduction: metrics_and_monitoring/introduction.md
      - Command-line Tools: metrics_and_monitoring/command-line_tools.md
      - Third-party Monitoring: metrics_and_monitoring/third-party_monitoring.md
      - Proactive Monitoring with Alerts: metrics_and_monitoring/alerts.md
      - Best Practices for Monitoring: metrics_and_monitoring/best_practices.md
      - Observability: metrics_and_monitoring/observability.md
      - Conclusion: metrics_and_monitoring/conclusion.md
  - Security:
      - Introduction: security/intro.md
      - Fundamentals of Security: security/fundamentals.md