Diagnose server issues
When operating Vault, you can encounter issues during server startup due to a range of root causes, from incorrect server configuration to operating environment constraints.
Challenge
To effectively troubleshoot and resolve problems with Vault, you must examine and combine information from 3 distinct sources to arrive at root causes:
- Operating system environment conditions, such as user limits.
- Vault server configuration file.
- Vault server log output as described in the Vault Server Logs section of the Troubleshooting Vault tutorial.
The Vault server configuration is essential to troubleshooting startup issues, while the log can reveal helpful warnings or errors from Vault that can have root causes related to the operating environment.
Gathering information from the system environment and server logs to determine a root cause can be an arduous process, especially when in an outage situation.
It is a task that is ideally suited for automation to ensure that the results are consistent, repeatable, and arrive quickly when needed.
A tool that can help the Vault operator gather and interpret this information will reduce the troubleshooting burden, lower time to root cause analysis, and considerably reduce downtime during an outage.
Solution
Vault version 1.8.0 introduces a new diagnose
sub-command for the operator
CLI command that assists operators with readily identifying causes to the most commonly encountered server configuration and startup issues.
The command can be used with the actual configuration for the server you wish to diagnose. The typical workflow is to invoke diagnose against server configuration and data while the server is down. There is also an option that allows for performing diagnosis against a running server that you will learn about later.
More information about diagnose is available from the operator diagnose documentation, or by invoking vault operator diagnose -help
from a terminal session.
Here is an actual output example to familiarize you with the types of checks performed and reported on by diagnose.
In this example case diagnose was executed against a Vault Community Edition server.
The diagnose resulted in failure about storage along with some warnings about disk usage, licensing, and TLS.
The command aims to explain results in clear language, so the results are often self-explanatory. It also provides guidance to help with resolving warnings and failures, such as the recommendation to have at least 1GB of space free per partition, for example.
What is checked during diagnose?
At a high level, the diagnose command currently checks and reports on these common root causes of server startup issues.
- Environment
- User limits: maximum open files
- Storage capacities
- Configuration
- Access configured storage backend
- Access HA storage backend
- Create seal
- Setup core
- Redirect address
- Cluster address
- Listeners
- TLS configuration
- Seal
You will learn more about the types of failures, warnings, and recommendations from diagnose in the hands on scenario.
Prerequisites
To perform the steps in the scenario, you need:
- Vault 1.8 or later; the Community Edition can be used for this tutorial.
- The Install Vault tutorial can guide you through installation.
- jq to handle JSON output from Vault CLI.
Scenario introduction
You will attempt to operate a local Vault server from the command line within a terminal session using the provided example configuration file.
First, you will use diagnose to check the example configuration.
Then, using the information from diagnose, you will resolve a reported failure in the environment.
Launch Terminal
This tutorial includes a free interactive command-line lab that lets you follow along on actual cloud infrastructure.
Prepare environment
Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT
.
$ mkdir -pm 0000 /tmp/learn-vault-diagnose/data && \ export LEARN_VAULT=/tmp/learn-vault-diagnose
Write the example configuration
You will begin the scenario with the example configuration file, vault-server.hcl
.
Write it to the scenario home directory.
$ cat > "${LEARN_VAULT}"/vault-server.hcl << EOFapi_addr = "http://127.0.0.1:8200"cluster_addr = "http://127.0.0.1:8201"cluster_name = "learn-diagnose-cluster"default_lease_ttl = "10h"disable_mlock = truemax_lease_ttl = "10h"pid_file = "$LEARN_VAULT/pidfile"ui = truelistener "tcp" { address = "127.0.0.1:8200" tls_disable = "true" tls_cert_file = "vault.crt" tls_key_file = "vault.key"}backend "file" { path = "$LEARN_VAULT/data" node_id = "learn-diagnose-server"}EOF
Execute diagnose
Execute diagnose to check the initial example configuration.
$ vault operator diagnose -config $LEARN_VAULT/vault-server.hcl
Your output should resemble this example.
Vault v1.8.0 (82a99f14eb6133f99a975e653d4dac21c17505c7)Results:[ failure ] Vault Diagnose [ warning ] Check Operating System It is recommended to have at least 1 GB of space free per partition. [ success ] Check Open File Limits: Open file limits are set to 16384. [ success ] Check Disk Usage: / usage ok. [ warning ] Check Disk Usage: /dev is %!d(float64=100) percent full. [ success ] Check Disk Usage: /System/Volumes/VM usage ok. [ success ] Check Disk Usage: /System/Volumes/Preboot usage ok. [ success ] Check Disk Usage: /System/Volumes/Update usage ok. [ success ] Check Disk Usage: /System/Volumes/Data usage ok. [ warning ] Check Disk Usage: /System/Volumes/Data/home has %d bytes full. [ success ] Parse Configuration [ failure ] Check Storage [ success ] Create Storage Backend [ failure ] Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied [ skipped ] Check Service Discovery: No service registration configured. [ success ] Create Vault Server Configuration Seals [ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration. [ success ] Create Core Configuration [ success ] Initialize Randomness for Core [ success ] HA Storage [ success ] Create HA Storage Backend [ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured. [ success ] Determine Redirect Address [ success ] Check Cluster Address: Cluster address is logically valid and can be found. [ success ] Check Core Creation [ skipped ] Check For Autoloaded License: License check will not run on OSS Vault. [ warning ] Start Listeners [ warning ] Check Listener TLS: Listener at address 127.0.0.1:8200: TLS is disabled in a listener config stanza. [ success ] Create Listeners [ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal. [ success ] Check Server Before Runtime [ success ] Finalize Shamir Seal
Note that the diagnose resulted in overall failure on line 4, and there is a failure message about storage at lines 16 and 18, along with a warning about the listener TLS configuration at lines 31-32.
The storage related failure on line 18 Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied points to an issue with the Vault data directory, so try confirming the modes on that directory.
$ ls -l /tmp/learn-vault-diagnose/total 8d--------- 2 devops wheel 64 Jul 15 17:40 data-rw-r--r-- 1 devops wheel 580 Jul 15 17:40 vault-server.hcl
The permissions are too restrictive on the data directory.
To understand the Vault log messages around this issue at this point, attempt to start a Vault server with the configuration.
$ vault server -config $LEARN_VAULT/vault-server.hclWARNING! Unable to read storage migration status.2021-07-26T11:22:44.552-0400 [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""2021-07-26T11:22:44.554-0400 [WARN] storage migration check error: error="open /tmp/learn-vault-diagnose/data/core/_migration: permission denied"
The Vault server emits a similar permission denied error about the data path when attempting to access the core storage migration key.
Press control
+ c
to stop the server.
Change the mode to 0700
so that Vault can write to the file storage backend configured for this path.
$ chmod 0700 /tmp/learn-vault-diagnose/data
Execute the diagnose command again to re-check the configuration.
$ vault operator diagnose -config $LEARN_VAULT/vault-server.hcl
Your output should resemble this example.
Vault v1.8.0 (82a99f14eb6133f99a975e653d4dac21c17505c7)Results:[ warning ] Vault Diagnose [ warning ] Check Operating System [ success ] Check Open File Limits: Open file limits are set to 16384. [ success ] Check Disk Usage: / usage ok. [ warning ] Check Disk Usage: /dev is %!d(float64=100) percent full. It is recommended to have more than five percent of the partition free. [ success ] Check Disk Usage: /System/Volumes/VM usage ok. [ success ] Check Disk Usage: /System/Volumes/Preboot usage ok. [ success ] Check Disk Usage: /System/Volumes/Update usage ok. [ success ] Check Disk Usage: /System/Volumes/Data usage ok. [ warning ] Check Disk Usage: /System/Volumes/Data/home has %d bytes full. It is recommended to have at least 1 GB of space free per partition. [ success ] Parse Configuration [ success ] Check Storage [ success ] Create Storage Backend [ success ] Check Storage Access [ skipped ] Check Service Discovery: No service registration configured. [ success ] Create Vault Server Configuration Seals [ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration. [ success ] Create Core Configuration [ success ] Initialize Randomness for Core [ success ] HA Storage [ success ] Create HA Storage Backend [ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured. [ success ] Determine Redirect Address [ success ] Check Cluster Address: Cluster address is logically valid and can be found. [ success ] Check Core Creation [ skipped ] Check For Autoloaded License: License check will not run on OSS Vault. [ warning ] Start Listeners [ warning ] Check Listener TLS: Listener at address 127.0.0.1:8200: TLS is disabled in a listener config stanza. [ success ] Create Listeners [ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal. [ success ] Check Server Before Runtime [ success ] Finalize Shamir Seal
Now the failure about storage is resolved, but there is at least one warning in the diagnose output remaining.
Note
Depending on your environment, you might notice other warnings not present in the example output, such as warnings about storage volume capacity, open files, or more. You can also attempt to resolve those for a completely passing result, but it is not necessary to do so for the purposes of this tutorial.
The warning details that TLS is disabled for the listener.
This is an important warning to note, as while it will not stop you from operating Vault (for example in a dev or QA capacity), best practices detailed in Production Hardening recommend operating Vault with end-to-end TLS enabled for production use.
Given that there are no failures, the Vault server should start now even with any warnings present in the diagnose output.
Attempt once again to start the server.
$ vault server -config $LEARN_VAULT/vault-server.hcl==> Vault server configuration: Api Address: http://127.0.0.1:8200 Cgo: disabled Cluster Address: https://127.0.0.1:8201 Go Version: go1.16.5 Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled") Log Level: info Mlock: supported: false, enabled: false Recovery Mode: false Storage: file Version: Vault v1.8.0 Version Sha: 89492a0b7ff1504be2a13e1a92adcfe809aeaf78==> Vault server started! Log data will stream in below:2021-07-26T11:41:40.426-0400 [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
The Vault server has successfully started, confirming resolution of the storage path permission issue.
Doing it live
You can also check a running Vault server by using a -skip
flag to the diagnose command line and specifying the Vault subsystem that diagnose should skip checking. This helps to avoid errors such as Error initializing listener of type tcp: listen tcp 127.0.0.1:8200: bind: address already in use
when using diagnose against a running server.
In a new terminal session, try using diagnose while the Vault server is up and running, but this time use the -skip
flag and specify listener
so that diagnose skips the listener configuration.
$ vault operator diagnose -config=$LEARN_VAULT/vault-server.hcl --skip=listener
Note
To continue resolution of all diagnose warnings in this example configuration requires a valid TLS certificate and key, and setting tls_disable = "false"
or removal of the line entirely. That is beyond the scope of this tutorial, which aims to provide a simple introduction to diagnose.
Cleanup
In the terminal where you most recently started the Vault server, press
control
+c
to stop the server.Remove the temporary directory containing Vault server configuration and data.
$ rm -rf "$LEARN_VAULT"
Summary
You learned about the diagnose sub-command and how to use it with Vault configuration while Vault is not running, and also how to use the -skip
flag to diagnose a running Vault server.
You learned about the common root causes that diagnose checks for, and some of the warnings and failures that can result.