Tuesday, August 8, 2023

HashiCorp Vault observability: Monitoring Vault at scale

Observability is the ability to measure the internal states of a system by examining its outputs. In the context of HashiCorp Vault, the key outputs to examine are log files, telemetry metrics, and data scraped from API endpoints.

A mature Vault monitoring and observability strategy simplifies finding answers to important Vault questions. For example:

  • Are there any security threats to secret data or the service itself?
  • Are all compliance requirements satisfied?
  • Who has access to what? Is secret access meeting ‘least privilege’ requirements?
  • Is the Vault cluster healthy and adequately handling current workloads?
  • Can the Vault cluster handle anticipated workloads and growth?
  • What Vault features are being used the most?
  • Are there opportunities to optimize client usage patterns to reduce the operating cost of Vault?
  • Are the service-level agreement (SLA) and operational-level agreement (OLA) for consumers of the Vault service being met?

This post will walk through how to architect a well-rounded Vault monitoring strategy with log analysis, telemetry analysis, and API/synthetic monitoring.

Vault monitoring strategy

A comprehensive, production-grade HashiCorp Vault monitoring strategy should include three major components:

  1. Log analysis: Detecting runtime errors, granular usage monitoring, and audit request activity
  2. Telemetry analysis: Monitoring the health of the various Vault internals, and aggregated usage data
  3. API and synthetic monitoring: Monitoring the actual response times users are experiencing, SLA/OLA reporting, and ensuring Vault service/API endpoints are available

While there is some overlap in capabilities between the components, they are each focused on a different aspect of observability. When combined, they enable quick identification, analysis, and resolution of issues.

For example, a Vault operator might receive an alert from a synthetic monitor that detected a breached SLA in authenticating to Vault and reading a secret. Telemetry data might show that Vault is experiencing an abnormally high login request volume, resulting in high disk I/O. A subsequent audit log analysis would identify the source of the login traffic.

Perhaps the offender is a runaway application or team not conforming to Vault usage best practices. The operator could then have a conversation with that team or quickly push a fix to that app. They could also use telemetry and log data to set a reasonable rate limit for the app in question, preventing this situation from occurring again.

Infrastructure monitoring

Infrastructure monitoring is a critical component of a comprehensive monitoring/observability strategy. Infrastructure and host-level events that should be adequately logged and observed include:

  • Hardware failures
  • Network failures
  • Kernel errors/warnings
  • File handle exhaustion
  • Storage consumption
  • Remote access events
  • Events relevant to the health and security of the Vault hosts

If the infrastructure hosting Vault is not healthy or stable, Vault's reliability and performance will most likely suffer. Logs provided by the operating system in use, as well as metrics/telemetry agents, can be used as a source for these events.

There are many publicly available guides to system/host-level monitoring, and organizations should follow industry standards and best practices. If an organization already uses a specific log and metrics analysis solution, that vendor likely provides useful guidance.

Vault performance and health

To ensure adherence to the SLAs defined by the organization, it is important to know how the Vault software is performing on the platform or infrastructure that hosts it. This includes understanding how much of the allocated system resources Vault uses on average and during busy periods, such as a large deployment or other events that drive high request rates. It is also important to know the rate and timing of requests made to Vault: request-handling speed and periods of high request volume are key signals to monitor for anomalous activity.

Vault service consumption

In order to know how Vault is being used, it is important to understand that Vault handles requests from both applications and users. Analyzing the requests that Vault is handling can help answer some important questions, such as:

  • What types of operations are being performed?
  • What features of Vault are being leveraged?
  • Which teams, entities, or applications are using Vault the most?

An understanding of Vault service consumption can help Vault operators be sure that teams and applications are using Vault properly. It can help them discover patterns of use that are inefficient or that go against best practices. For example, teams may not be practicing proper token hygiene or setting reasonable time-to-live (TTL) values. A team might be unnecessarily calling Vault throughout their application's runtime instead of fetching secrets once at deploy time, creating a hard runtime dependency on Vault and a high volume of requests.

If a company uses chargebacks to recoup the cost of running Vault within the organization, Vault service consumption data can help determine each team’s bill. Service consumption data can help Vault operators identify and partner with the top teams by usage or with teams that use a particular feature to test Vault changes or upgrades in a development environment before going to production.

Log analysis

There are two types of Vault logs: the Vault operational log and the Vault audit log. Both logs contain useful and important information for teams operating a Vault service. It is important to understand what information exists in each of these logs to understand why and how they should be monitored.

The Vault logs should be sent to a log aggregation and analysis tool that supports searching the log data and building reports and dashboards from it.

Vault operational log

Like many modern apps, Vault writes details about its internal operations and subsystems to standard output and standard error. On systemd-based Linux distributions, the journald daemon automatically captures Vault’s output to the system journal. Depending on the Linux distribution and specific journald configuration, the journald logs are typically found in log files matching one of these patterns:

  •  /var/log/journal*
  •  /var/log/syslog*
  •  /var/log/messages*

It’s also possible to configure systemd to send the logs from a specific unit to a separate file, such as /var/log/vault*.

The events logged in the Vault operational log match the format of many other common system logs and are time-stamped and categorized by severity.
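
As a quick illustration, here is a minimal sketch that scans an operational log for warning and error entries. It assumes Vault's output has been redirected to a file such as /var/log/vault.log (as described above) and that each line follows Vault's typical "timestamp [LEVEL] module: message" layout; the path and pattern are assumptions to adjust for your environment.

# Minimal sketch: count WARN/ERROR entries in a Vault operational log.
# Assumes log lines look like: 2023-08-08T12:00:00.000Z [ERROR] core: ...
# The file path is a placeholder for wherever systemd/journald writes Vault's output.
import re
from collections import Counter

LOG_PATH = "/var/log/vault.log"  # hypothetical path
LINE_RE = re.compile(r"^\S+\s+\[(?P<level>[A-Z]+)\]\s+(?P<module>[\w.\-]+):")

def summarize(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.match(line)
            if m and m.group("level") in ("WARN", "ERROR"):
                counts[(m.group("level"), m.group("module"))] += 1
    return counts

if __name__ == "__main__":
    for (level, module), n in summarize(LOG_PATH).most_common():
        print(f"{level:5} {module:30} {n}")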

Example event types

Important event types logged in the Vault operational log include:

  • Vault Enterprise license expiry
  • Vault seal/unseal events
  • Replication-related events
  • Vault cluster events (raft, quorum, active/passive nodes, leader election)
  • Audit log failures
  • Network connection issues
  • Secrets engine errors
  • Storage backend events
  • TLS certificate errors

It is important to note that some of these events are also exposed in Vault telemetry. When the events are available in both logs and telemetry, it is up to the team implementing the monitoring to determine which source to use for monitoring/alerting purposes. The logs will likely have more context as they are often more verbose and can be compared with other log events occurring just before or just after the triggered event. For more information, see Vault operational log details in our documentation.

Vault audit log

This log keeps a detailed record of all requests to Vault, and the associated responses, in JSON format. Sensitive fields in the request and response events are hashed with a salt value using HMAC-SHA256 before being written to the log.

The Vault audit log is not enabled by default; it must be explicitly configured by enabling one or more audit devices in Vault. Supported audit device types include file, syslog, and socket.

Because audit device failures can block Vault from processing further requests, we recommend configuring at least two audit devices. The audit devices can be of the same type. For example, it is possible to configure two file audit devices with each on a separate and independent disk volume.
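
As a sketch of what that setup can look like, the snippet below enables two file audit devices through Vault's sys/audit API, each pointing at a different volume. The Vault address, token, and file paths are placeholders, and the same result can be achieved with the vault audit enable CLI command; this is typically a one-time step performed by an operator or by infrastructure automation.

# Minimal sketch: enable two file audit devices on separate volumes.
# VAULT_ADDR/VAULT_TOKEN and the file paths are assumptions for illustration.
import os
import requests

VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.com:8200")
HEADERS = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}

AUDIT_DEVICES = {
    "file-primary":   {"type": "file", "options": {"file_path": "/vault/audit/vault-audit.log"}},
    "file-secondary": {"type": "file", "options": {"file_path": "/vault-audit2/vault-audit.log"}},
}

for name, config in AUDIT_DEVICES.items():
    resp = requests.post(f"{VAULT_ADDR}/v1/sys/audit/{name}",
                         json=config, headers=HEADERS, timeout=10)
    resp.raise_for_status()  # 204 No Content on success
    print(f"enabled audit device: {name}")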

Example event types

Events that can be found in the Vault audit log include:

  • Authentication successes/failures for both humans and machines
  • The details of any request made to Vault
    • Read/write/update/delete/list
    • Changes to Vault configuration or Vault policies
    • Secret engines enabled/disabled
    • Dynamic credentials creation
    • Interactions with root protected endpoints
  • For each request, the response that Vault sends to the client is also captured
    • Successes versus failures
    • Access denied or allowed

Typically, the events logged in the Vault audit log are authenticated requests or attempts to authenticate. Unauthenticated actions can be found in the Vault operational log.
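
Because each audit entry is a single JSON object, granular usage questions can often be answered with a small amount of post-processing even before the logs reach a full analysis platform. The sketch below assumes a file audit device writing newline-delimited JSON and tallies request events by path and by the authenticated entity's display name; the log path is a placeholder.

# Minimal sketch: tally Vault audit log requests by path and display name.
# Assumes a file audit device writing one JSON object per line; the fields used
# (type, request.path, auth.display_name) follow Vault's audit log format.
import json
from collections import Counter

AUDIT_LOG = "/vault/audit/vault-audit.log"  # hypothetical path

by_path, by_identity = Counter(), Counter()
with open(AUDIT_LOG, encoding="utf-8") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        if event.get("type") != "request":
            continue
        by_path[event.get("request", {}).get("path", "unknown")] += 1
        by_identity[event.get("auth", {}).get("display_name", "unauthenticated")] += 1

print("Top paths:", by_path.most_common(10))
print("Top identities:", by_identity.most_common(10))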

For more information, see Audit device notes in our documentation.

Telemetry

Vault telemetry provides both real-time and interval-based metrics about the status and usage of each Vault deployment. It is useful for determining current cluster health and identifying issues before they become critical.

Some metrics should be observed for anomalous values whereas others have specific recommended values on which to trigger alerts. Profile telemetry data over time to observe any abnormalities in resource usage, consumption patterns, and overall load. Establish baselines and trends, with any significant deviation indicating a potential problem.

Please note that new metrics are added periodically in new Vault releases, so some may be unavailable for teams using older versions of Vault. You can view available metrics by selecting your Vault version from the dropdown on the Vault telemetry internals page.

Telemetry configuration

Telemetry is enabled and configured using the telemetry stanza in Vault’s configuration file. Like most changes to the config file, enabling telemetry requires a restart of the Vault service on each node. Vault provides built-in support for multiple telemetry providers, including:

  • Circonus
  • DogStatsD
  • Prometheus
  • Stackdriver
  • StatsD
  • Statsite

It is up to each organization to select an appropriate telemetry provider compatible with its chosen monitoring tool. Depending on the selection, one can either stream telemetry to an available monitoring endpoint or scrape this data from the Prometheus-compatible /v1/sys/metrics API endpoint.
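
For example, a scrape of the Prometheus-formatted endpoint can be done with a short script like the minimal sketch below. It assumes Prometheus support has been enabled in the telemetry stanza and that the token used is permitted to read sys/metrics; the single check on vault.core.unsealed is only an illustration of the critical metrics discussed later.

# Minimal sketch: scrape Vault's Prometheus-compatible metrics endpoint and
# print a single gauge. Address, token, and the metric checked are assumptions.
import os
import requests

VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.com:8200")
HEADERS = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}

resp = requests.get(f"{VAULT_ADDR}/v1/sys/metrics",
                    params={"format": "prometheus"},
                    headers=HEADERS, timeout=10)
resp.raise_for_status()

for line in resp.text.splitlines():
    # Prometheus exposition format: "<metric>{labels} <value>"
    if line.startswith("vault_core_unsealed"):
        print(line)  # e.g. vault_core_unsealed{cluster="..."} 1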

Vault’s server process aggregates runtime metrics about performance every 10 seconds. It also includes high-cardinality usage data such as token, entity, and secret counts. These high cardinality items are aggregated every 10 minutes, by default, but this frequency is tunable by adjusting the usage_gauge_period property in the telemetry stanza. Bear in mind that high-cardinality metrics put a larger load on Vault than real-time metrics. For this reason, it is best to avoid collecting them more frequently than the default without performance testing and a good reason.

We also recommend avoiding providers that don’t support labels (such as vanilla StatsD), as this results in a flattened metric key that requires additional processing to be useful. For example, the vault.token.count.by_policy metric would display as separate metrics (shown below) instead of a single metric with multiple labels that can be split or filtered on; a sketch of that extra parsing follows the example.

vault.token.count.by_policy.mycluster.ns1.policy1
vault.token.count.by_policy.mycluster.ns1.policy2
vault.token.count.by_policy.mycluster.ns2.policy3
vault.token.count.by_policy.mycluster.ns2.policy4
…
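
To make the point concrete, here is a minimal sketch of the extra processing a flattened key requires, assuming the "<metric>.<cluster>.<namespace>.<policy>" layout shown above. With a label-aware provider, none of this is necessary.

# Minimal sketch: recover labels from a flattened StatsD-style metric key,
# assuming the "<metric>.<cluster>.<namespace>.<policy>" layout shown above.
PREFIX = "vault.token.count.by_policy."

def parse(flat_key: str) -> dict:
    cluster, namespace, policy = flat_key[len(PREFIX):].split(".", 2)
    return {"metric": "vault.token.count.by_policy",
            "cluster": cluster, "namespace": namespace, "policy": policy}

print(parse("vault.token.count.by_policy.mycluster.ns1.policy1"))
# {'metric': 'vault.token.count.by_policy', 'cluster': 'mycluster',
#  'namespace': 'ns1', 'policy': 'policy1'}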

A detailed write-up on one Vault monitoring pattern option is available in our documentation at Monitor telemetry & audit device log data.

Critical metrics

As a starting point, the most critical metrics that could indicate an immediate threat to Vault stability are listed below. Create alerts for these metrics.

Operational

  •  vault.core.unsealed
  •  vault.core.leadership_lost
  •  vault.core.leadership_setup_failed
  •  vault.core.license.expiration_time_epoch
  •  vault.autopilot.node.healthy
  •  vault.raft.leader.lastContact
  •  vault.raft.commitTime
  •  vault.audit.log_request_failure
  •  vault.audit.log_response_failure
  •  vault.autosnapshots.save.errors
  •  vault.runtime.total_gc_pause_ns
  •  vault.wal.flushReady
  •  vault.wal.persistWALs

System

  • File descriptors
  • Memory usage
  • CPU usage and CPU IO wait
  • Disk IO latency and remaining disk capacity

Usage

To better understand the request load on Vault, start with the metrics below. You might alert on anomalous changes and sudden spikes in request load.

  •  vault.token.creation
  •  vault.expire.num_leases
  •  vault.core.in_flight_requests
  •  vault.core.handle_request.count
  •  vault.core.handle_login_request.count

Note: Unauthenticated requests to endpoints that are not handled at Vault’s outer HTTP layer, such as sys/replication/status, are also counted in the vault.core.handle_login_request metric. As a result, this metric may show login activity even in an otherwise idle cluster that is not receiving any client authentication requests.

For specific telemetry monitoring recommendations, please see our Telemetry metrics reference. Specifics on values that should trigger an alert are called out in the “what to look for” section of key metrics on that page.

A note on metric names

Metric names and how they are formatted can vary depending on monitoring tool, telemetry provider, and whether the metrics are coming from HashiCorp-managed HCP Vault or from a self-managed Vault deployment.

HCP Vault emits a subset of the metrics available in the self-hosted Vault Enterprise release. This is meant to simplify monitoring by exposing only metrics that are actionable by operators while abstracting away those that are ultimately HashiCorp’s responsibility as a service provider. These metric names may appear slightly different from those emitted by a self-managed Vault and will be prefixed with hcp. For more information, please reference the HCP Vault metrics guide.

Missing metrics

While each node in a cluster emits many metrics, there are exceptions.

Certain metrics are emitted only when there is a matching event, so it is normal to be “missing” data in some areas. Examples include the vault.core.leadership_setup_failed and vault.core.leadership_lost metrics.

Furthermore, some metrics emit only from the current cluster leader node because only the leader actively handles write operations and various other tasks. In a typical Vault cluster, non-leader nodes are in a standby state where they service read requests and forward all write requests to the leader.

Examples of metrics emitted only by the leader include:

  1. Replication metrics like vault.wal.flushReady, vault.wal.persistWALs, and vault.replication.wal.last_wal
  2. Lease metrics like vault.expire.*
  3. Leadership metrics like vault.core.leadership_lost

In Vault versions prior to 1.13.0, 1.12.3, and 1.11.12, a further subset of metrics is emitted only by the leader and by performance standby nodes, not by other standby nodes; this applies to all standby nodes in disaster recovery (DR) secondary clusters on those earlier versions. One such example is the vault.core.unsealed metric, which is reported only by the leader in a DR secondary cluster. This is important to note when viewing dashboards and configuring alerts.

Synthetic monitoring

Synthetic monitoring involves simulating user interactions instead of relying on real user traffic to a service. This type of monitoring is valuable in measuring the performance of, and detecting issues with, the Vault service. The data collected provides a snapshot of what users are actually experiencing when interacting with Vault, which is particularly useful for SLA/OLA reporting.

Some monitoring solutions (e.g. Datadog, Dynatrace, and Splunk) provide out-of-the-box support for synthetic monitoring. If your chosen tool does not, you can build a simple script, run it on a recurring schedule, and stream the results to a service endpoint such as the Splunk HTTP Event Collector.

For each run, measure and track:

  1. Successes
  2. Failures
  3. Total execution time
  4. Execution time of each individual step (Auth/read/write/delete/token revocation)

Using these measures, you can build an accurate understanding of how the Vault service is performing, from the perspective of the people and machines consuming it.
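
A minimal harness for capturing those measures might look like the sketch below, assuming results are shipped to an HTTP collector such as a Splunk HTTP Event Collector endpoint (the URL and token are placeholders). The individual Vault steps are supplied as callables, for example the scenario steps shown later in this post.

# Minimal sketch of a synthetic-monitor harness: time each step, record
# success/failure and total duration, then ship the run summary to an HTTP
# collector. The collector URL/token and the step functions are assumptions.
import time
import requests

COLLECTOR_URL = "https://splunk.example.com:8088/services/collector/event"  # hypothetical
COLLECTOR_TOKEN = "00000000-0000-0000-0000-000000000000"                    # hypothetical

def run_monitor(steps):
    """steps: list of (name, zero-argument callable) executed in order."""
    results, start = [], time.monotonic()
    for name, fn in steps:
        t0 = time.monotonic()
        try:
            fn()
            results.append({"step": name, "ok": True, "seconds": time.monotonic() - t0})
        except Exception as exc:
            results.append({"step": name, "ok": False, "seconds": time.monotonic() - t0,
                            "error": str(exc)})
            break  # later steps usually depend on earlier ones
    return {"steps": results,
            "ok": all(r["ok"] for r in results),
            "total_seconds": time.monotonic() - start}

def ship(summary):
    requests.post(COLLECTOR_URL,
                  headers={"Authorization": f"Splunk {COLLECTOR_TOKEN}"},
                  json={"event": summary}, timeout=10).raise_for_status()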

Synthetic monitoring recommendations

  1. Run monitors at a defined interval. One minute is a reasonable starting point.
    1. Consider your Vault SLA/OLA/SLO, usage patterns, and infrastructure hardware when selecting your interval.
  2. Run monitors from the location(s) where clients are running so that results reflect what client applications and users are experiencing.
  3. Follow best practices around lease and token creation. Use short TTLs and explicit revocation wherever possible.
  4. Store critical data, like success/failure and execution time, long-term.
  5. Aggregate this data in your monitoring solution for evaluation and analysis.
  6. Visualize the data on relevant dashboards.
  7. Create new monitors as usage patterns expand, new secret engines are added, and new auth methods are enabled.
  8. Review and update existing monitors to ensure they’re still relevant and useful.
  9. Consider sending an alert upon an SLA/OLA/SLO breach.

Synthetic monitoring examples

To effectively monitor HashiCorp Vault, the organization’s platform team should design comprehensive synthetic monitoring scenarios. These scenarios should mimic real-world user interactions and include critical functionality and features of Vault. Here are some starter examples:

Simple K/V

  1. Authenticate with Vault using an enabled auth method and retrieve a service token.
  2. Use the service token to write a secret.
  3. Use the service token to retrieve a secret.
  4. Compare the secret to expected values.
  5. Use the service token to delete the secret.
  6. Revoke the service token.
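
Here is a minimal sketch of that scenario against Vault's HTTP API. It assumes an AppRole auth method and a KV version 2 secrets engine mounted at secret/; the mount point, role credentials, and secret path are placeholders, and the steps can be plugged into the harness sketched earlier.

# Minimal sketch of the "Simple K/V" synthetic scenario using Vault's HTTP API.
# Assumes an AppRole auth method and a KV v2 engine mounted at "secret/";
# role credentials and the secret path are placeholders.
import os
import requests

VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.com:8200")
SECRET_PATH = "synthetic-monitor/canary"          # hypothetical KV v2 path
EXPECTED = {"probe": "vault-synthetic-monitor"}   # known value to compare against

def login() -> str:
    # 1. Authenticate and retrieve a (short-TTL) service token.
    resp = requests.post(f"{VAULT_ADDR}/v1/auth/approle/login",
                         json={"role_id": os.environ["MONITOR_ROLE_ID"],
                               "secret_id": os.environ["MONITOR_SECRET_ID"]},
                         timeout=10)
    resp.raise_for_status()
    return resp.json()["auth"]["client_token"]

def run_simple_kv():
    token = login()
    headers = {"X-Vault-Token": token}
    base = f"{VAULT_ADDR}/v1/secret/data/{SECRET_PATH}"

    # 2. Write the secret.
    requests.post(base, json={"data": EXPECTED}, headers=headers, timeout=10).raise_for_status()

    # 3 & 4. Read it back and compare with the expected value.
    read = requests.get(base, headers=headers, timeout=10)
    read.raise_for_status()
    assert read.json()["data"]["data"] == EXPECTED, "secret did not round-trip"

    # 5. Delete the latest version of the secret.
    requests.delete(base, headers=headers, timeout=10).raise_for_status()

    # 6. Revoke the service token.
    requests.post(f"{VAULT_ADDR}/v1/auth/token/revoke-self",
                  headers=headers, timeout=10).raise_for_status()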

Replication

  1. Authenticate with the Vault primary and retrieve a service token.
  2. Use the service token to write a secret to the Vault primary.
  3. Use the service token to retrieve the secret from a Vault performance secondary.
  4. Compare the secret to expected values.

Dashboarding

When designing a dashboard, focus on how to visualize data in an easily consumable format. Consider the target user group of each dashboard. The best way to construct and break out dashboards depends on an enterprise’s choice of tools, service architecture, and team skill set. What is natural and obvious to one team may be unclear to another.

A dashboard should satisfy a particular need or answer a particular question, such as:

  • What is the current high-level health state of each Vault cluster?
  • Are SLA/OLA/SLO measures being met?
  • How is the Vault service being consumed? Which teams or apps are the top consumers?
  • How can I analyze and troubleshoot an issue occurring in one of the deployments?
  • Where are high TTL tokens created?
  • Are any quotas or rate limits being violated?

Include dashboard-wide filters for different dimensions such as:

  • Cluster
  • Storage backend
  • Environment
  • Host

Consider also including the following dashboard-wide filters on a consumption-focused dashboard:

  • Namespace
  • Auth method
  • Mount point
  • Creation TTL
  • Token type
  • Secret engine

After completing prototype dashboards, build automation around the deployment and ongoing maintenance of them. It is important to drive consistency in naming and tile configuration across environments (i.e. dev should look the same as prod). An operator having to hunt for important data complicates analysis and wastes valuable time during an outage.

Reporting

Similar to dashboarding, generating reports based on Vault data provides operators and management insights into compliance, security, access patterns, performance, and Vault adoption. Teams typically generate reports using a combination of the previously mentioned observability mechanisms and, in some cases, custom scripts that extract data directly from Vault.

Generating reports

Most enterprises find it valuable to correlate key indicators with organizational constructs (e.g. team, business unit, application, service, etc.). This is considerably easier when working with a well-defined path structure, naming convention, and tagging standard.

Path structure should tie back to how teams are organized within the enterprise. This makes it easier to generate a report that is split by each team or business unit. Reference the Vault namespace and mount structuring guide for more information.

Make naming consistent wherever possible. The most critical items are paths, policies, and namespaces. This not only eases report generation, but also allows humans to more quickly understand the structure and data within Vault.

Many Vault constructs — like namespaces, entities, entity aliases, and KV secrets — support custom metadata tagging. We recommend seeding key organizational information associated with each of these constructs via custom tags. This provides another way to map data back to the many dimensions of your business.
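
As one hedged example of what that seeding can look like, the sketch below writes owning-team and cost-center tags into the custom metadata of a KV version 2 secret via its metadata endpoint. The mount point, secret path, and tag names are illustrative only; the same idea applies to entity and entity alias metadata.

# Minimal sketch: attach organizational tags to a KV v2 secret's custom metadata.
# Mount point, secret path, and tag keys/values are assumptions for illustration.
import os
import requests

VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.com:8200")
HEADERS = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}

payload = {
    "custom_metadata": {
        "owning_team": "payments-platform",
        "business_unit": "consumer-banking",
        "cost_center": "cc-1234",
    }
}

resp = requests.post(f"{VAULT_ADDR}/v1/secret/metadata/payments/api-keys",
                     json=payload, headers=HEADERS, timeout=10)
resp.raise_for_status()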

Reporting recommendations

The reporting needs of each organization vary, but the examples below are common across many of our customers.

Security

  • Anomalous request activity
  • Active root tokens
  • Long-lived leases or tokens by team

Usage

  • Number of transactions by team and overall
  • Number of leases by team
  • Long-lived leases or tokens by team
  • Number and size of KV secrets by team

Performance

  • SLA/OLA reporting
  • Response time (read, write, auth) (month-to-month trending)

Executive summary

  • Vault adoption
  • Automation index: Number of human logins versus machine logins
  • Dynamic secrets adoption: Dynamic vs. KV workloads

Conclusion

Comprehensive monitoring and observability of Vault is one of the most important components for operating Vault successfully as a shared service within an organization. If issues arise, proper monitoring can help a platform team confidently identify the source and the impact, enabling quicker issue resolution.

With the right strategy, organizations proactively discover risks and address them before they impact Vault consumers. A complete monitoring strategy offers a clear overview of how Vault is being implemented and utilized throughout the organization, enabling platform teams and leadership to make informed decisions based on data-driven insights.
