Troubleshooting vSAN Performance

Executive Summary

An infrastructure that delivers sufficient performance for applications is table stakes for data center administrators. A lack of sufficient and predictable performance can impact not only the VMs that run in an environment, but also the consumers who use those applications. Determining the root cause of performance issues in any environment can be a challenge, but in environments running dozens, if not hundreds, of virtual workloads, pinpointing the exact causes and understanding the options for mitigation can be difficult for even an experienced administrator.

VMware vSAN is a distributed storage solution that is fully integrated into VMware vSphere. By aggregating local storage devices in each host across a cluster, vSAN takes a unique and innovative approach to providing cluster-wide shared storage and data services to all virtual workloads running in a cluster. While it eliminates many of the design, operation, and performance challenges associated with three-tier architectures using storage arrays, it introduces additional considerations in diagnosing and mitigating performance issues that may be storage related.

This document will help the reader better understand how to identify, quantify, and remediate performance issues of real workloads running in an all-flash, vSAN-powered environment. It is not a step-by-step guide for all possible situations, but rather a framework of considerations for addressing problems that are perceived to be performance related. The example provided in Appendix C illustrates how this framework can be used. The information provided assumes an understanding of virtualization, vSAN, infrastructures, and applications.

Diagnostics and Remediation Overview

vSAN environments may experience performance challenges in a variety of circumstances. These include:

  • Proof of Concept (PoC) phase using synthetic testing, or performance benchmarking
  • Initial migration of production workloads to vSAN
  • Normal day-to-day operation of production workloads
  • Evolving demands of production workloads

The primary focus of this document is production workloads in a vSAN environment. Many of the same mitigation steps can be used to evaluate performance challenges when using synthetic I/O testing during an initial PoC. The vSAN Performance Evaluation Checklist offers a collection of guidance and practices for PoCs that will be helpful for customers in that phase of the process.

Accurately diagnosing performance issues in a production environment requires care, persistence, and a correct understanding of the factors that commonly contribute to performance challenges.

Contributing Factors

Several factors influence the expected outcome of system performance in a customer’s environment, and the behavior of workloads for that specific organization. Most fall into one of five categories, which are not mutually exclusive. These factors contribute to the performance vSAN is able to provide, as well as the performance perceived by users and administrators. When reviewing previous performance issues in any architecture where a root cause was determined, you’ll find that the cause can often be traced back to one or more of these five categories.

Figure 1. Contributing factors

Understanding the contributing factors to a performance issue is critical to knowing what information needs to be collected to begin the process of diagnosis and mitigation.

Process of Diagnosis and Mitigation

A process for diagnosis and mitigation helps work through the problem and address it in a clear and systematic way. Without this level of discipline, speculation and attempted remedies will be scattered and ineffective in addressing the actual issue. This process can be broken down into five steps:

    1. Identify and quantify. This step helps to clearly define the issue. Clarifying questions can help properly qualify the problem statement, which allows for a more targeted approach to addressing it. This process helps sort out real versus perceived issues, and focuses on the end result and supporting symptoms without implying the cause of the issue.

    2. Discovery/Review – Environment. This step reviews the current configuration. This will help eliminate previously unnoticed basic configuration or topology issues that might be plaguing the environment.

    3. Discovery/Review – Workload. This step will help the reader review the applications and workflows. This will help a virtualization administrator better understand what the application is attempting to perform, and why.

    4. Performance Metrics – Insight. This step reviews some of the key performance metrics to view, and how to interpret some of the findings the reader may see when observing their workloads. It clarifies what the performance metrics mean, and how they relate to each other.

    5. Mitigation – Options in potential software and hardware changes. This step will help the reader step through the potential actions for mitigation.
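
The ordering of the five steps above can be sketched as a small checklist that refuses to skip ahead. This is only an illustration of the discipline the process asks for, not part of any vSAN tooling; the `Step` class and the `new_workflow`, `next_step`, and `complete` helpers are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Step names taken from the five-step process described above.
STEP_NAMES = [
    "Identify and quantify",
    "Discovery/Review - Environment",
    "Discovery/Review - Workload",
    "Performance Metrics - Insight",
    "Mitigation",
]


@dataclass
class Step:
    """One step of the workflow, with free-form findings attached."""
    name: str
    done: bool = False
    notes: List[str] = field(default_factory=list)


def new_workflow() -> List[Step]:
    """Create a fresh workflow with every step still open."""
    return [Step(name) for name in STEP_NAMES]


def next_step(workflow: List[Step]) -> Optional[Step]:
    """Return the first step not yet completed, or None when all are done."""
    for step in workflow:
        if not step.done:
            return step
    return None


def complete(workflow: List[Step], name: str, note: str) -> None:
    """Record a finding and close a step, but only if it is the next one due."""
    current = next_step(workflow)
    if current is None:
        raise ValueError("all steps are already complete")
    if current.name != name:
        raise ValueError(f"finish {current.name!r} before {name!r}")
    current.notes.append(note)
    current.done = True
```

Calling `complete(wf, "Mitigation", ...)` on a fresh workflow raises `ValueError`, which mirrors the point that mitigation should follow, not precede, a clear problem statement and a review of the environment and workload.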

A performance issue can be defined in a myriad of ways. For storage performance issues with production workloads, the primary indicator is I/O latency as seen by the guest VM running the application(s). Latency and other critical metrics are discussed in greater detail in steps 4 and 5 of this diagnosis workflow. With guest VM latency being the leading symptom of insufficient performance of a production workload, and with an understanding of the influencing factors that contribute to storage performance (shown in Figure 1), the troubleshooting workflow for vSAN can be visualized as shown in Figure 2 below.
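
As one way to quantify guest VM latency rather than rely on perception, the sketch below summarizes a set of latency samples and compares them against a baseline taken when the workload was considered healthy. It assumes the samples are latency readings in milliseconds exported from whatever monitoring tool is in use; no vSAN API is called, and the function names and the `tolerance` factor of 1.5 are hypothetical choices made for illustration.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_latency(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return mean, median, and 95th percentile of latency samples (ms)."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank method: the smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


def exceeds_baseline(
    current_ms: Sequence[float],
    baseline_ms: Sequence[float],
    tolerance: float = 1.5,  # hypothetical threshold, tune per workload
) -> bool:
    """True when the current p95 latency is more than `tolerance` times the baseline p95."""
    current = summarize_latency(current_ms)
    baseline = summarize_latency(baseline_ms)
    return current["p95"] > tolerance * baseline["p95"]
```

Comparing a high percentile instead of the mean keeps occasional slow I/Os visible, since those are often what users notice. Which percentile and tolerance are appropriate depends on the application.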

Figure 2. A visual representation of a performance troubleshooting workflow

This document follows the process of diagnosis and mitigation described in Figure 2, and elaborates on each step in an appropriate level of detail. Recommendations are provided on specific metrics to monitor (in Step 4 and Appendix A), and what they mean to the applications and the environment. Additionally, the reader will find a summary of useful tools (in Appendix B) should there be a desire to explore the details at a deeper level. Troubleshooting performance issues can be difficult even under the best of circumstances. It is made worse by skipping the steps that gather the information needed to understand and define the problem. Therefore, the information provided here places emphasis on understanding the environment and workloads over associating each potential performance problem with a single fix.

This document does not cover the details of how to evaluate vSAN for a PoC environment, nor does it provide detail on how to run synthetic I/O based performance benchmarks. The information provided here closely aligns with the recommendations found in the vSAN Performance Evaluation Checklist, which is a great resource to level-set an environment for performance evaluation.

Download the Troubleshooting vSAN Performance technical white paper (April 2019).
