We in the IT community often obsess over show horses—the fastest and/or most powerful devices or systems—because we love those big and small numbers they showcase. The reality, however, is that most users simply don’t need the ostentatious power of show horses. They need workhorses, systems that can serve up real workloads on a day-to-day basis at an affordable price. We have spent a fair amount of time looking into DataON’s Hyperconverged Infrastructure (HCI) offerings, and even gave the DataON HCI-224 with Intel® Optane™ SSDs our Editor’s Choice award last year. In this article, we will look at another HCI-224 two-node cluster. However, this one features a unique mix of storage: Intel Optane SSDs front end Intel® SSD D5-P4326 15.36TB with QLC 3D NAND, creating a system that optimizes capacity, performance and cost.
Before diving into this cluster, however, we will first discuss why DataON went with QLC for its storage capacity tier, and provide a review of Microsoft Azure Stack HCI, DataON and two-node HCI clusters.
Intel® SSD D5-P4326 Series
Using QLC-based Intel SSD D5-P4326 for capacity storage in this HCI cluster is a logical choice, as it delivers solid, reliable, and cost-efficient performance. We have seen faster SSDs for sure, but the SSD D5-P4326 finds the right balance between performance and cost, with a massive 15.36TB capacity per drive. This combination is due to its underlying architecture. Using Intel® QLC 3D NAND technology, Intel is able to drive the cost of this device down, while increasing its capacity.
Intel was one of the first storage vendors to make QLC-based drives. QLC, or quad-level cell, technology stores four bits of data in a single cell, while the older TLC, MLC and SLC technologies store only three, two, or one bit(s) per cell, respectively. Because of this higher density, QLC drives have a lower cost per GB of storage. Furthermore, Intel’s 3D NAND technology stacks these cells vertically on the chip, further increasing storage density. There is a compromise though. To take full advantage of the Intel SSD D5-P4326, writes need to be buffered before going to the QLC-based drive. QLC SSDs are ideally suited for capacity-optimized, read-heavy workloads. As such, platforms like an HCI cluster need to use an appropriate cache device in front of the QLC SSDs to deliver consistent performance. In the case of the DataON HCI-224, four Intel Optane SSD DC P4800X NVMe 750GB 2.5” drives are used per node to absorb writes before moving data down into the QLC layer. This approach prevents excessive writes from causing performance degradation of the QLC layer. The net result is that customers get a seamless experience and an ideal blend of Intel Optane-based performance with QLC-based capacity.
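To make the bits-per-cell comparison concrete, here is a minimal Python sketch. The bit counts come from the paragraph above; the helper names are ours, and the only added fact is that a cell holding n bits has to distinguish 2^n charge levels. That is a general property of multi-level NAND, not a figure from Intel.

```python
# Bits stored per NAND cell for each cell type named in the text.
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def charge_levels(bits: int) -> int:
    """A cell storing n bits must distinguish 2**n charge levels."""
    return 2 ** bits


def density_gain(cell_type: str, baseline: str = "TLC") -> float:
    """Bits per cell relative to a baseline cell type (illustrative helper)."""
    return BITS_PER_CELL[cell_type] / BITS_PER_CELL[baseline]


for name, bits in BITS_PER_CELL.items():
    print(f"{name}: {bits} bit(s) per cell, {charge_levels(bits)} charge levels")

print(f"QLC vs. TLC density: {density_gain('QLC'):.2f}x")
```

QLC holds a third more data per cell than TLC, but each cell has to resolve 16 levels instead of 8. This is generally given as the reason QLC favors read-heavy workloads and benefits from a write buffer in front of it.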
Microsoft Azure Stack HCI
Briefly, Microsoft Azure Stack HCI is an on-premise implementation of Microsoft Azure Cloud Services. Basically, Microsoft brought their existing HCI technology into the Azure Stack family so their customers can run virtualized applications on-premises with direct access to Azure management services such as backup and disaster recovery.
Azure Stack HCI should not be confused with Azure or Azure Stack Hub. Whereas Azure is a public cloud service, Azure Stack Hub and Azure Stack HCI are on-premise solutions. Furthermore, Azure Stack Hub runs Azure OS with Azure Services and is an IaaS and PaaS solution. Azure Stack HCI, on the other hand, runs Windows Server OS with Azure Services and allows you to run virtualized workloads in the same manner that you are used to, with the added benefit of being able to connect to the Azure cloud for additional services. This is a huge difference and allows IT administrators to use the same tools and management stack on Azure Stack HCI as they use with Azure.
Azure Stack HCI uses Hyper-V for its hypervisor, Storage Spaces Direct for storage, Microsoft Software Defined Networking (SDN) for networking, and Windows Admin Center (WAC) for its management. Azure Stack HCI runs on standard x86 servers and other commodity components.
WAC is a locally deployed, browser-based management platform that can manage both on-premise and Azure cloud-based instances of Windows 10 and Windows Server. WAC is installed on a Windows system and uses PowerShell scripts. It also uses Windows Management Instrumentation (WMI) over WinRM (Windows Remote Management) to monitor and manage Windows systems, including HCI clusters and Azure virtual machines.
WAC’s main dashboard gives an overview of CPU, memory, networking, and disk activity for the systems being monitored. On the left side of the screen, WAC also includes a number of system management and browsing tools including Certificates, Devices, Events, Files, Local Users and Groups, Firewall, Processes, Registry, Roles and Features, Services, and Storage.
DataON was one of the first companies to take advantage of WAC’s open framework and ported its Management Utility Software Tool (MUST) extension to WAC. DataON MUST provides infrastructure visibility, monitoring, and management for Windows server-based HCI, networking, and storage.
DataON HCI
Although Azure Stack HCI uses commodity hardware components, these items must be engineered to work together in order to deliver optimal outcomes. In some ways, it is easier to design high performance systems than workhorse systems. With high performance systems, you can select best-of-breed components and ignore cost. But with workhorses, you need to evaluate the cost/performance of the components and then tune them to optimize their performance. It takes just as much engineering effort, if not more, to deliver a value-oriented system, and this system engineering is where we continue to be impressed with DataON.
DataON has a strong partnership with both Microsoft and Intel, and they capitalized on these relationships when engineering systems for Azure Stack HCI. DataON’s HCI Intel Select solutions can be pre-configured and shipped in their own rack, ready to deploy immediately. This delivery method is not only useful in the datacenter, but also proves beneficial for systems deployed at the edge where existing IT infrastructure and personnel are either limited or nonexistent.
2-Node HCI Clusters
We recently published an article on Microsoft Azure Stack HCI two-node clusters (2NCs); below is a summary. We found that a 2NC could, for many use cases, provide the resilience an organization needs, and that 2NCs are less complex and less costly than a traditional three-node or four-node cluster. DataON was one of the first vendors to recognize the value of 2NCs and embrace them. 2NCs are not new to DataON: in September 2017, it announced the first two commercially available Kepler-47 HCI systems for Windows Server 2016 Storage Spaces Direct (now Azure Stack HCI).
DataON’s 2NC implementation can survive a drive failure and a server failure at the same time. It does this with a RAID 5 + 1 style layout: parity resiliency within each server, mirrored across to the other server. Microsoft calls this ability “nested resiliency” and added it to Storage Spaces Direct in Windows Server 2019. Again, 2NCs are not the right technology choice for everyone, but they can provide a reliable and cost-effective solution to many organizations.
Build and Design
The Azure Stack HCI cluster we are working with here was built on the DataON HCI-224 all-flash NVMe platform. These servers were 2U in size with 24 NVMe bays up front, offering plenty of expansion in the rear for PCIe-based components. The labeling contrasted sharply with the matte-black drive caddies, making it easy to spot specific drives in case of a needed swap-out. Labeling isn’t uncommon, but the extent of it here was extraordinary. Our deployment had each node labeled (1 and 2), as well as several other items, making it easy to deploy and manage DataON systems in the datacenter.
The nodes in this testing included dual 2nd Gen Intel® Xeon® Scalable Gold 6248 2.5 GHz, 20-Core, 28MB Cache processors, as well as eight Samsung 32GB DDR4 2933MHz ECC-Registered RDIMMs (256GB total per node), and dual Intel S4510 480GB SATA M.2 boot drives.
For storage, each node came with four Intel Optane SSD DC P4800X NVMe 750GB 2.5” drives (used for caching), and four Intel SSD D5-P4326 15.36TB 2.5” QLC drives (capacity storage tier).
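As a quick sanity check on the storage layout, the following Python sketch totals the cache and capacity tiers from the drive counts and sizes listed above. It uses decimal terabytes and reports raw capacity only. Usable capacity after nested resiliency will be lower, and we make no attempt to estimate it here. The variable names are ours.

```python
NODES = 2

# Cache tier: Intel Optane SSD DC P4800X, 750GB each
CACHE_DRIVES_PER_NODE = 4
CACHE_DRIVE_TB = 0.75

# Capacity tier: Intel SSD D5-P4326, 15.36TB each
CAPACITY_DRIVES_PER_NODE = 4
CAPACITY_DRIVE_TB = 15.36

cache_tb_per_node = CACHE_DRIVES_PER_NODE * CACHE_DRIVE_TB
capacity_tb_per_node = CAPACITY_DRIVES_PER_NODE * CAPACITY_DRIVE_TB
raw_cluster_tb = NODES * capacity_tb_per_node

# Share of raw capacity that is fronted by Optane cache on each node
cache_ratio = cache_tb_per_node / capacity_tb_per_node

print(f"Cache per node:        {cache_tb_per_node:.2f} TB")
print(f"Raw capacity per node: {capacity_tb_per_node:.2f} TB")
print(f"Raw capacity, cluster: {raw_cluster_tb:.2f} TB")
print(f"Cache-to-capacity:     {cache_ratio:.1%}")
```

Each node pairs 3TB of Optane cache with 61.44TB of raw QLC capacity, so the cache is a little under 5% of the capacity tier.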
The nodes were connected to each other via Mellanox ConnectX-4 EN dual port QSFP28 40/56 GbE cards using 3M Mellanox LinkX ETH 40GbE, 40Gb/s, QSFP passive copper cables.
Obviously, DataON spent a fair amount of time and thought on the configuration and component selection for this system to balance performance and cost. We were very interested to see how the Intel SSD D5-P4326 SSDs would perform as the capacity tier. Paired with Intel Optane SSDs as a high-performance cache, the QLC 3D NAND drives should provide cost-effective flash capacity in a role that used to be the domain of large but sluggish hard drives.
In the StorageReview lab, we deployed the two storage nodes and switches as diagrammed below.
Testing
To get a feel for how a small cluster like this can perform in an edge use case, we set up several Microsoft SQL Server tests. The goal was to examine full cluster performance to ensure DataON could make proper use of the Intel Optane technology and Intel QLC SSDs. Secondarily, we wanted to examine capabilities of just a single node, to get a sense of how this solution handles the loss of a node, either for planned updates or in the event of a more serious failure.
Our test plan leveraged Quest’s Benchmark Factory using the TPC-C profile as the load generator for the SQL Server VMs we deployed. We configured eight VMs (four per node), which offered a good balance of CPU and disk activity for the cluster. The workload generators were hosted on a system outside of this environment and connected to this cluster over 10GbE networking.
SQL Server Testing Configuration (per VM)
- Windows Server 2019
- Storage Footprint: 800GB allocated, 620GB used
- 8 vCPUs
- 60GB RAM (55GB in failed mode configuration)
- SQL Server 2019
- Database Size: 1,500 scale
- Virtual Client Load: 15,000
- RAM Buffer: 48GB
- Test Length: 3 hours
- 15 minutes preconditioning
- 45 minutes sample period
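To put the per-VM configuration in cluster terms, this small Python sketch multiplies it out across the eight VMs used at peak load and compares the vCPU count with the physical cores described in the build section (two nodes, two 20-core processors each). It is a back-of-the-envelope tally with our own variable names, in decimal terabytes, not output from the test harness.

```python
VM_COUNT = 8              # peak load: four VMs per node
VCPUS_PER_VM = 8
ALLOCATED_GB_PER_VM = 800
USED_GB_PER_VM = 620

NODES = 2
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 20     # Xeon Gold 6248

total_vcpus = VM_COUNT * VCPUS_PER_VM
physical_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
vcpus_per_core = total_vcpus / physical_cores

allocated_tb = VM_COUNT * ALLOCATED_GB_PER_VM / 1000
used_tb = VM_COUNT * USED_GB_PER_VM / 1000

print(f"vCPUs: {total_vcpus} across {physical_cores} physical cores "
      f"({vcpus_per_core:.2f} vCPUs per core)")
print(f"Storage: {allocated_tb:.2f} TB allocated, {used_tb:.2f} TB used")
```

At peak load the eight VMs use 64 vCPUs against 80 physical cores and hold just under 5TB of database data.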
In our tests we focused on latency performance, with the transaction performance level remaining constant with Benchmark Factory.
With a load of four VMs total (two per node), we measured an average latency of 2.5ms with an aggregate transaction load of 12,649 TPS.
Increasing the load to six VMs, the average latency increased slightly to 4ms with an aggregate transaction load of 18,967 TPS.
At the peak load of eight VMs (four per node), latency topped out at 6.5ms average, with an aggregate transaction load of 25,277 TPS.
Throughout these tests, we clearly saw the benefit of having the Optane SSDs in this mix. They took the brunt of the writes, freeing up the QLC SSDs for responsive reads as the high-speed capacity tier. Even as we doubled the workload from four to eight SQL Server VMs hitting this HCI cluster, average latency stayed in the single-digit milliseconds (2.5ms to 6.5ms), showing this configuration to be well-suited for workloads that may burst from time to time.
While performance in a fully operational environment is important, another consideration is how the workloads will operate if a node in the cluster goes offline, or workloads need to be migrated for system maintenance. To test this scenario, we kept our full load of eight VMs and migrated them to a single node. In this setup, we measured an average latency of just 4.5ms, which was better than the 6.5ms we saw with both nodes online. Part of this is from the removal of storage overhead in single node operation.
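The results reported above can be summarized with a short Python sketch. It works only from the figures in this section and derives three numbers: throughput per VM at each load step, the latency growth from four to eight VMs, and the latency change when all eight VMs run on one node. The names are ours.

```python
# (VM count, average latency in ms, aggregate TPS) as reported in the text
RESULTS = [
    (4, 2.5, 12_649),
    (6, 4.0, 18_967),
    (8, 6.5, 25_277),
]
DEGRADED_LATENCY_MS = 4.5  # eight VMs running on a single node

# Throughput each VM sustains at every load step
per_vm_tps = [tps / vms for vms, _, tps in RESULTS]

# Latency at eight VMs relative to four VMs
latency_growth = RESULTS[-1][1] / RESULTS[0][1]

# Relative latency change, single node vs. both nodes, at eight VMs
degraded_change = (DEGRADED_LATENCY_MS - RESULTS[-1][1]) / RESULTS[-1][1]

for (vms, latency_ms, tps), each in zip(RESULTS, per_vm_tps):
    print(f"{vms} VMs: {tps:,} TPS total, {each:,.0f} TPS per VM, {latency_ms} ms")
print(f"Latency growth, 4 to 8 VMs: {latency_growth:.1f}x")
print(f"Single-node vs. two-node latency at 8 VMs: {degraded_change:+.0%}")
```

Throughput per VM stays at roughly 3,160 TPS at every step, which is expected because Benchmark Factory holds the transaction rate constant. Average latency grows 2.6x as the VM count doubles, and single-node latency is about 31% lower than the two-node figure at the same load.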
Conclusion
For this project, we ran a series of SQL tests on the system to illustrate performance under workloads that are commonly found in edge and SMB use cases. Our goal was to understand how effectively Microsoft Azure Stack HCI in this DataON cluster was able to leverage hardware to bring about the desired results. Specifically, this means providing a solution that offers a rare combination of performance and value.
We can confirm through our testing that DataON’s component selection was indeed successful in creating a cost-effective Azure Stack HCI SDS solution that performs extremely well. This is partly due to their choice to use the Intel SSD D5-P4326 for capacity storage, which efficiently takes advantage of Intel Optane SSDs for tiering.
This is a critical notion, as the QLC SSDs provide massive, dense capacity to the cluster, while still providing the TCO benefits that come with flash storage. To hammer the point home, the QLC drives enable 15.36TB of capacity per 2.5” drive bay. It would take eight 2TB HDDs in RAID 0 to match that capacity, or a switch to a 3.5” chassis to take advantage of larger, but even slower, HDDs. Either way, the performance drop-off from the Intel QLC drive to hard drives is more than considerable; it’s a dramatic difference when it comes to application responsiveness.
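The drive-count comparison above is simple arithmetic, sketched here in Python. The 15.36TB and 2TB figures are from the text; the helper function is ours, and RAID 0 is treated as a plain sum of drive capacities.

```python
import math

QLC_DRIVE_TB = 15.36   # one Intel SSD D5-P4326
HDD_TB = 2.0           # one 2TB hard drive


def hdds_to_match(target_tb: float, hdd_tb: float) -> int:
    """Smallest number of striped HDDs whose combined capacity meets target_tb."""
    return math.ceil(target_tb / hdd_tb)


hdd_count = hdds_to_match(QLC_DRIVE_TB, HDD_TB)
striped_tb = hdd_count * HDD_TB

print(f"{hdd_count} x {HDD_TB:.0f}TB HDDs = {striped_tb:.0f}TB "
      f"to cover one {QLC_DRIVE_TB}TB QLC drive")
```

Seven drives would fall short at 14TB, so it takes eight drive bays of 2TB HDDs to cover what a single QLC bay holds.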
As much as we’d like all reads and writes to come from the Optane SSDs (as they’re the highest performing media in this configuration), sometimes there’s going to be a miss. In that case, the QLC SSD performance will trounce hard drives, protecting the HCI cluster from performance irregularities common in topologies that combine flash and hard drives. In fact, we saw such balanced performance here that going forward, companies in general may need to rethink HDD/flash design and lean more toward QLC/Optane design to reap the most benefits in HCI.
The other major concern around two-node clusters is performance while in a degraded state. We tested this by failing a node and giving the entire SQL workload to a single node. In this case, SQL was more responsive and performed a little better than with both nodes online, mostly due to the reduced overhead from node-to-node communications. Of course, it’s not suggested to run in a degraded state like this for long, but it’s reassuring to know that it can be done without sacrificing performance.
Overall, the HCI-224 cluster with D5-P4326 QLC SSDs was simple to deploy, easy to use, and powerful enough for a wide range of workloads. Its price point also makes it available to a wide swath of users. On top of that, this system has been certified for Microsoft Windows Server 2019 and validated as an Intel Select Solution.
This report is sponsored by DataON. All views and opinions expressed in this report are based on our unbiased view of the product(s) under consideration.