StorageReview takes the first of several comprehensive looks at tagged command queuing and its implications for the desktop as well as the server world. Does TCQ rate? How does it mesh with RAID arrays? Is SATA TCQ as effective as SCSI TCQ? Find the answers to these questions and more in SR’s latest!
See also How We Test Drives
See also Western Digital Raptor WD740GD Review
See also Seagate Cheetah 10K.6 Review
Introduction

Over the last few years, Western Digital has maintained a virtual vice-lock on the high-performance, high-capacity desktop and enthusiast markets. The venerable WD Caviar series has combined enviable speed and capacity with reasonable prices. However, aside from a relatively obscure, short-lived SCSI line, when it came to the lucrative enterprise arena the firm simply watched from a distance as titans such as Seagate, Maxtor, and Hitachi battled for market share. A little over a year ago, WD tested the enterprise waters with the introduction of the world's first 10,000 RPM ATA drive, the Raptor WD360GD. The Raptor paired SCSI-class mechanics with the new and relatively inexpensive Serial ATA interface in an attempt to undercut the rather hefty premiums that SCSI subsystems demanded. StorageReview's performance results, however, revealed that while the WD360GD delivered world-class single-user results, its multi-user performance remained unimpressive when contrasted with existing 10k RPM SCSI units.
The WD360GD lacked a key element that the SCSI world has enjoyed for years: tagged command queuing (TCQ), a feature that intelligently reorders requests to minimize actuator movement. In September 2003, Western Digital announced the follow-up Raptor WD740GD, a second-generation unit that brought a host of improvements to the line. Though the doubling of the Raptor's capacity to 74 gigabytes is the most visible improvement, the most intriguing is undoubtedly the implementation of TCQ.
TCQ In Brief
Not to be confused with operating-system reordering and optimization, tagged command queuing is a hardware-level process designed to streamline the delivery of data in highly-random accesses under heavy loads. Without TCQ, a drive can only accept a single command at a time. It thus operates on a first-come, first-served basis, completing requests in the order they are received. This is not always the most effective way to service data requests, especially in an intensive, non-localized environment.
Through the process of tagged command queuing, a host adapter adds special tags to individual commands. The drive itself, privy to its own physical layout of sectors across three dimensions, can take into account rotation and seek distances and reorder commands to serve them more efficiently. Requested data is thus returned to the controller in a more streamlined manner; it can then use the additional information it added earlier to transparently return the data to the operating system.
Consider the diagram to the left. In a traditional, non-queued paradigm, the drive would accept the request for data piece A, move the actuator and retrieve it, accept the request for B, retrieve it, then move to piece C. A drive that can buffer and queue requests, however, would be able to retrieve A, then opt to retrieve C first, followed by B, resulting in a net savings of time in completing these three requests.
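To make the savings concrete, here is a minimal sketch of the A/B/C scenario in Python, assuming a toy cost model in which servicing a request costs one unit of actuator travel per cylinder; the cylinder numbers are purely illustrative and are not drawn from any real drive.

```python
# Toy model: actuator travel under FIFO servicing vs. queue reordering.
# Cylinder positions are illustrative, not taken from a real drive.

def seek_cost(path, start=0):
    """Total actuator travel, in cylinders, to service requests in order."""
    total, pos = 0, start
    for cylinder in path:
        total += abs(cylinder - pos)
        pos = cylinder
    return total

requests = {"A": 100, "B": 900, "C": 150}  # request -> target cylinder

fifo = seek_cost([requests[r] for r in ("A", "B", "C")])    # first-come, first-served
queued = seek_cost([requests[r] for r in ("A", "C", "B")])  # drive serves nearby C before B

print(f"FIFO travel:      {fifo} cylinders")    # 100 + 800 + 750 = 1650
print(f"Reordered travel: {queued} cylinders")  # 100 + 50 + 750 = 900
```

Real implementations go further than this distance-only model, weighing rotational position as well as seek length when choosing the next command, but the principle is the same: less time spent positioning means more requests serviced per second.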
TCQ must be supported by both the controller and the hard drive itself. It was introduced to the SCSI world as early as 1990 and was formally codified into the SCSI-2 standard by 1994. The feature rapidly proved itself invaluable in the world of multi-user servers and is today consistently deployed across virtually all host adapters and disks. Likewise, TCQ was formally implemented in the 1998 ATA-4 standard. Unlike SCSI devices, however, ATA drives simply were not used in enterprise applications where features like hot-swappability and low access times were paramount. Further, the traditional ATA stronghold, single-user machines, just did not benefit from TCQ; indeed, in many cases the additional imposed overhead actually reduced rather than enhanced performance in these areas. As a result, the feature went largely ignored by the industry.
Today, however, the advent of Serial ATA, its associated hot swap features, and its promised interoperability with the upcoming Serial Attached SCSI (SAS) standard has resulted in a brightening future for ATA in the enterprise. The forthcoming SATA II standard includes provisions to incorporate tagged command queuing a la ATA-4’s standard. Native SATA drive architectures such as the Seagate Barracuda 7200.8 and Maxtor MaXLine III tout the inclusion of “Native SATA” tagged command queuing, or “Native Command Queuing” (NCQ) for short. NCQ’s fundamental paradigm is identical to that of tagged command queuing; the NCQ moniker simply differentiates the SATA II standard from the existing ATA-4 model.
With its deep research and development pockets, Seagate was the only manufacturer to avoid a less expensive and faster-to-market PATA-to-SATA bridge for its first SATA products. For financial and temporal reasons, other manufacturers such as Western Digital introduced their first products with bridged operation. The Raptor WD740GD is one of these designs. While the practical ramifications are negligible (it is bottom-line performance that counts, after all!), the Raptor's bridge prevents it from using the SATA II NCQ standard. Thus, to implement tagged command queuing into its budding enterprise-oriented line in a timely fashion, Western Digital opted to include ATA-4-style TCQ in the Raptor. Fortunately for WD, the firm has received an enthusiastic response from many controller manufacturers. Most firms designing NCQ-enabled SATA host adapters are also incorporating Raptor-style queuing. One such manufacturer is Promise Technology.
The Tests
In this first of what will be several articles examining the effects of SATA’s tagged command queuing, we will take a look at how the upcoming Promise FastTrak TX4200 compares to the currently shipping, non-TCQ-enabled FastTrak S150 TX4. The relationship between these two controllers is especially interesting as the TX4200 is simply a FastTrak S150 TX4 with added TCQ code. The S150 TX4, in turn, is simply a RAID-enabled SATA150TX4, the SR Testbed’s long-standing reference SATA controller. A direct contrast between the two Promise RAID controllers can thus isolate the effects of TCQ from other variables.
The Promise FastTrak TX4200 features:
- 4 Serial ATA Ports for up to 4 drives
- RAID 0/1/10 and JBOD
- 32-Bit / 33-66 MHz PCI Operation
- NCQ & SATA TCQ Support
TCQ, of course, has been around for some time in the SCSI world- all current host adapters, RAID controllers, and hard drives support a very mature implementation. To discover what disadvantages, if any, SATA TCQ suffers when contrasted with more established SCSI solutions, results from a Mylex AcceleRaid 170 RAID controller paired with up to four 73 GB Seagate Cheetah 10K.6 drives have been included in these tests.
The Mylex AcceleRaid 170 features:
- 1 68-pin LVD Port for up to 15 drives
- RAID levels 0, 1, 0+1, 3, 5, 10, 30, 50, JBOD
- 32 MB ECC SDRAM Cache
- 32-bit / 33 MHz PCI Operation
Though TCQ confers benefits even when a single drive operates under heavy random loads, its true potential shines when there are also multiple actuators to work with. Hence, the tests that follow also take a hard look at the scaling provided by arrays in both multi-user and single-user scenarios- our first formal take on RAID in over two years.
In the following tests, Testbed3’s hardware and benchmarks sort out the multiple dimensions of potential performance drivers:
- How does TCQ benefit multi-user and single-user performance?
- How does TCQ affect a RAID array’s ability to scale performance upwards as more drives are added?
- How does SATA TCQ stack up against SCSI’s implementation?
- How does a RAID array scale under increasingly heavy random I/O?
- What benefits does a RAID array deliver to the highly-localized I/O that dominates non-server (single-user) use?
Since these tests take advantage of the standard SR testbed, let us take a moment to consider a potential limitation of the machine’s hardware, the 33 MHz, 32-bit PCI slot.
Limitations of the PCI Bus
The 133 MB/sec limit of the standard 32-bit, 33 MHz PCI bus may be of concern to some, especially those seeking, for various reasons, to maximize sequential transfer rates. The practical real-world limit sits slightly below that threshold: STR tests associated with the results below top out at 126 MB/sec. A single Raptor in its outer zone can push nearly 72 MB/sec while a Cheetah 10K.6 can do 69 MB/sec; it takes only two of either to saturate the PCI bus.
Let us take a closer look, however, at just how important STR is in the majority of applications. The StorageReview File Server DriveMark generates an average transfer size of 22 kilobytes. In other words, the average generated I/O operation in the suite consists of repositioning the actuator to the desired location followed by the reading or writing of 22 KB of data. In the same vein, the SR Office DriveMark's average transfer size is 23 KB. The SR High-End DriveMark, based on a suite of applications that includes video and audio editing, is the only test that reaches significantly beyond these sizes, generating a relatively high 69.5 KB transfer per I/O.
A single Raptor WD740GD, with its maximum transfer rate of 72 MB/sec, can transfer 22 KB in roughly 0.3 milliseconds.
A PCI-throttled RAID0 array, capped at 126 MB/sec, can transfer 22 KB of data in roughly 0.17 milliseconds.
The diagram to the right illustrates the relationship between positioning and transfers in typical single- and multi-user scenarios. Observe how the time spent positioning the actuator and platter (red) dominates the relatively small amount of time spent reading/writing the data itself (yellow). Even an asymptotic case of an infinite transfer rate unleashed through an infinitely fast bus would only eliminate the yellow portion of the total time it takes to service one request.
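A back-of-the-envelope calculation underscores the point. The sketch below assumes a 4.5 ms average seek (a plausible figure for a 10,000 RPM drive, not a quoted spec) and combines it with the half-revolution average rotational latency and the 72 MB/sec outer-zone transfer rate and 22 KB request size cited above.

```python
# Rough service-time breakdown for one random 22 KB request on a 10k RPM drive.
# The seek figure is an assumption; rotational latency is half a revolution.

avg_seek_ms = 4.5                          # assumed average seek time
avg_rotation_ms = 0.5 * 60_000 / 10_000    # half a revolution at 10,000 RPM = 3.0 ms
transfer_ms = 22 / (72 * 1024) * 1000      # 22 KB at 72 MB/sec ~ 0.30 ms

total_ms = avg_seek_ms + avg_rotation_ms + transfer_ms
print(f"positioning: {avg_seek_ms + avg_rotation_ms:.1f} ms")
print(f"transfer:    {transfer_ms:.2f} ms ({transfer_ms / total_ms:.0%} of total)")
```

Under these assumptions the transfer itself accounts for roughly 4% of the request's service time; even an infinitely fast bus could claw back only that sliver.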
Therefore, while the PCI bus can limit sequential transfer rates, its practical effect in capping real-world speed in typical use is not nearly as significant as one may believe at first blush. As a result, the scaling demonstrated in this article also represents the increases one will gain from arrays operating on higher-speed buses.
Our Third Raptor WD740GD Sample
The evaluation sample provided to SR by Western Digital for our review published last January was manufactured on December 4, 2003. For this review, WD sent us four more samples, all dated March 4, 2004. Though much of the focus of this article rests on multi-drive arrays, for control purposes it was necessary to retest a single drive from this new batch on our reference Promise SATA150TX4 controller. Some differences arise:
Note the differences in the final digit of the extended MDL designation when comparing our second and third samples. The December unit ends with a zero while the March unit concludes with a 1. Why the change? First, we should point out that all manufacturers quietly and regularly refresh the firmware on all their drives after initial release, either to correct bugs or to tweak performance as piles of configuration experience pour in.
Second, as has been painfully obvious over the past several months, SATA command queuing has proven to be a constantly moving target while drive and controller manufacturers alike continue to tweak their products. As controller manufacturers such as Pacific Digital, Silicon Image, and Promise Technology continue to develop pre-release adapter samples, drive manufacturers such as Western Digital are forced to re-optimize firmware to obtain the best results. The same has been true in the opposite direction. Though Western Digital announced the WD740GD in September of 2003 and though the units widely available through the channel since last December feature TCQ functionality, the Raptor team is nonetheless driven to regularly reassess the drive's potential companion host adapters and retune firmware accordingly. The result is the "00FLA1" revision, a unit better suited for the state of today's TCQ-enabled host adapters, albeit with a very slight drop in certain performance measures.
Glancing at the figures above reveals that while there are differences, they are for all intents and purposes trivial. One simply will not notice the difference in speed in subjective use. Attempts to specifically procure the earlier 00FLA0 revision would likely prove frustrating and fruitless. We would not sweat the difference.
A Word on Organization
Presenting the following results can be quite daunting. Many different dimensions of performance emerge when one attempts to form the “big picture.” How does performance increase when all other variables save queue depth remain constant? What kind of benefits result from adding more drives to an array? How does choosing mirroring over striping affect performance? The list of questions runs on. As a result, we have avoided use of our standard “HTML-generated” graphs in favor of static graphs. Hopefully, they accurately convey the myriad of information to be gleaned.
Without further ado, let us take a look at some results!
Multi-User Performance as Queue Depths Increase

First, a look at how performance scales under the highly-random StorageReview File Server DriveMark as queue depths (load) increase. For information on the FS DriveMark, please review this explanation. For comparison purposes, results for Testbed3's standard Promise SATA150TX4 SATA controller and Adaptec AHA-29160 SCSI controller have been included in the single-drive scenario. The RAID0 results presented below feature a 64 KB stripe size.
By the time the load reaches 4 outstanding I/Os, the 29160 pulls to the head of the pack, edging past the AcceleRaid. The FastTrak S150 TX4 and SATA150TX4 continue to perform identically while the TX4200, still bringing up the rear, manages to close some of the gap.
A depth of 16 yields considerable changes. The Adaptec vaults to the front of the pack with an impressive 237 I/Os per second. The Mylex continues to scale well, but the TX4200/Raptor pair’s TCQ implementation starts to make its presence felt and places just slightly behind.
When we finally reach a heavy load of 64 outstanding requests, a clear hierarchy emerges. Adaptec SCSI host adapters have always been popular, reliable choices. The 29160 also proves to be a great performer by leaving the other controllers in the dust under heavy load. Though the TX4200 and AcceleRaid fail to reach such lofty heights, they nonetheless deliver the scaling under load that one would expect from subsystems that feature TCQ. By contrast, the two non-TCQ ATA controllers bring up the rear of the pack, a marked contrast from their swift performance when only one request remains outstanding.
When depths hit 16, the S150 TX4’s slope levels out, indicative of a lower rate of scaling than the AcceleRaid and the TX4200. Finally, at 64 I/Os the AcceleRaid and the TX4200 enjoy a significant advantage over the S150 TX4 with the TX4200 achieving a high of 385 I/Os per second.
The non-TCQ S150 TX4 falls behind the other two adapters by the time queues reach 16. When depths hit 64 outstanding I/Os, the TX4200 once again sets itself ahead of the others.
Multi-User Performance as the Number of Drives Increases

Contrasting performance between controllers as the number of drives in a RAID0 array increases also yields telling results. The graphs that follow demonstrate the effects of adding more disks while holding all other variables constant. The results presented below feature a 64 KB stripe size.
When a second drive is added to the array, however, the AcceleRaid scales a bit less than the ATA controllers; all three place very close to one another.
By the time we reach a drive count of three and four, the AcceleRaid's gentle slope keeps it behind both Promise controllers. The increased actuator count allows the S150 TX4 to overcome its lack of TCQ functionality and pull ahead of the AcceleRaid as well as the TX4200, though by relatively small margins.
The SCSI AcceleRaid offers better performance than the SATA TX4200 with just one drive. As more disks are added to the array, however, the TX4200 scales better; its greater slope culminates with a 13% lead over the AcceleRaid in a 4-drive array.
Multi-User Mirroring

Results culled from RAID0 arrays as drive counts increase are interesting from an academic standpoint mainly due to the linear nature in which one can add independent actuators. Admittedly, however, practical use of RAID0 in production servers is quite limited: performance gains are more than offset by the significant increase in risk. Should one drive fail in a four-drive striped array, all data would be lost.

Mirroring (RAID1) is a much more likely scenario. In such an array, every piece of data is written to at least two disks. While writes must occur in unison, reads do not; as a result, intelligently-designed RAID controllers do offer a performance increase by issuing independent reads across multiple actuators. RAID1 delivers the benefits of redundancy should one drive fail while also offering improved performance through two separate read mechanisms. RAID01 offers redundancy across two sets of RAID0 arrays: should one array fail, data remains preserved on the other. RAID10 mirrors the data on two drives, then stripes the resulting array with another mirrored pair. Both RAID01 and RAID10 enhance performance with two write and up to four read mechanisms. The Mylex controller offers both RAID01 and RAID10 while the Promise units incorporate RAID10.
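For readers curious how a striped mirror actually lays out data, the following sketch maps a logical offset to physical drives in a hypothetical four-drive RAID10 array using the 64 KB stripe size from our tests; real controllers vary in layout details, so treat this only as an illustration of the concept.

```python
# Hypothetical four-drive RAID10 layout: two mirrored pairs, striped.
# Stripe ordering and mirror placement differ between real controllers.

STRIPE_KB = 64      # stripe size used in the article's tests
MIRROR_PAIRS = 2    # four drives = two mirrored pairs

def raid10_targets(logical_kb):
    """Return (drive, physical offset in KB) for both copies of a logical offset."""
    stripe = logical_kb // STRIPE_KB
    pair = stripe % MIRROR_PAIRS                          # which pair owns this stripe
    offset = (stripe // MIRROR_PAIRS) * STRIPE_KB + logical_kb % STRIPE_KB
    return [(2 * pair, offset), (2 * pair + 1, offset)]  # primary and mirror

# A request at logical offset 200 KB lands on stripe 3, owned by pair 1:
print(raid10_targets(200))  # [(2, 72), (3, 72)]
```

Every write must hit both drives of the owning pair, but a read at the same offset can be satisfied by whichever of the two actuators is better positioned; that asymmetry is the source of the independent-read benefit described above.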
As one would expect, striping a pair of mirrored arrays (RAID10) delivers even greater performance benefits. Here the modest S150 TX4 actually jumps out to a significant early lead at a relatively light load of 4 outstanding requests. Things pretty much even out by the time load hits 16; by 64, the S150 TX4’s gentler improvements lead it to fall behind somewhat. In a RAID10 configuration, the TX4200 does not exhibit the same odd drop between 16 and 64. Rather, it scales as expected.
Single-User Performance

Though evidence has been presented to the contrary, a combination of overzealous marketing as well as a general lack of knowledge has resulted in the proliferation of RAID among power users running single-user workstations. While a considerable argument may be made for the redundancy provided by RAID1, the increases in transfer rates and high-I/O random access performance delivered by RAID0 simply do not benefit most non-server uses. Likewise, though the latest Raptor delivers outstanding single-user performance, Western Digital has worked to incorporate TCQ functionality in the drive not to widen its single-user lead over the competition but rather to ensure that the Raptor stakes its claim as a viable alternative in the traditionally SCSI-based server world.
Even so, we realize that many readers worldwide have been eagerly awaiting results for the Raptor mated with an appropriate TCQ controller. StorageReview’s Desktop DriveMarks offer an unrivaled opportunity to assess just how much difference RAID arrays make when it comes to non-server, single-user applications. Here we will take a look at how RAID, TCQ, SCSI, and SATA all impact performance. The RAID0 results presented below feature a 64 KB stripe size.
On the SCSI side of things, the Adaptec AHA-29160 and Mylex AcceleRaid 170 switch places. While the 29160 proves itself the superior performer in single-drive, multi-user scenarios, the AcceleRaid delivers better single-user scores across the board. Note, however, that even when handicapped by TCQ operation, the Raptor manages to hold onto its lead over the Cheetah in three out of the four single-user tests. Western Digital’s experience in designing great single-user read-ahead and write-back buffer strategies continues to shine through.
Let us move on and examine how the RAID controllers scale in each access pattern as drive counts increase.
Upgrading to three drives reveals significantly diminishing returns. Moving to four disks actually causes scores to regress. While some may argue that the testbed's 32-bit, 33 MHz PCI bus limits the gains achieved by striping, readers should remember that, especially at these I/O levels, such limits are not significant. Regardless of drive count, the S150 TX4's simpler, streamlined operation grants it a considerable performance advantage over the other two controllers.
Single-User Mirroring
In all other access patterns, the three controllers exhibit very slight changes here and there, with no significant trend discernible. Note that once again the lowly non-TCQ S150 TX4 delivers the best performance in all mirrored combinations.
Conclusions

From the plethora of data presented above, we can draw several conclusions:

1. SATA TCQ and SATA RAID have the potential to deliver benefits to the server market just as great as those of SCSI TCQ and SCSI RAID. The Promise FastTrak TX4200 and the Mylex AcceleRaid 170 are, respectively, entry-level RAID controllers for the SATA and SCSI interfaces. The former, in fact, is little more than Promise's current FastTrak S150 TX4 controller with SATA TCQ (and NCQ) functionality added.
Unlike other major hard disk players, Western Digital does not have an established SCSI-based product line to protect. As a result, the firm seeks a competitive advantage through offering SCSI-style mechanics and functionality at prices associated with the more cost-conscious ATA interface. A quick check at the time of this writing with StorageReview sponsor HyperMicro prices 73 GB Cheetah 10K.6s at $339 each, Raptor WD740GDs at $219 each, and the AcceleRaid 170 at $379. When released this August, the Promise TX4200’s price will be on par with that of the S150 TX4 that it is meant to displace. It runs $159 at HyperMicro.
Hence, the following pricing (excluding cables and accessories) arises: four 73 GB Cheetah 10K.6 drives plus the AcceleRaid 170 total $1,735, while four Raptor WD740GDs plus the TX4200 come to $1,035 - roughly 40% less for the SATA subsystem.
While WD has delivered a solution that can match a SCSI-based solution’s speed and scalability, one must also keep in mind the key factors of infrastructure and reliability. As with TCQ itself, SATA’s support hardware such as backplanes, all-in-one solutions, and the like remain in their infancy when contrasted to the maturity and longevity of SCSI hardware. Also keep in mind that while Western Digital claims an enterprise-class 1.2 million hour MTTF spec and backs the Raptors with a 5-year warranty, the line is still new and remains relatively unproven compared to established solutions such as Seagate’s Cheetah series. Finally, remember that the prices listed above represent the cost of the storage subsystem alone- factoring in the total cost of server hardware when motherboards, CPU, and RAM are considered can dilute the difference significantly.
In the end, the potential for SATA to invade the entry- and mid-level server market is there. The performance is definitely there. If the Raptor’s reliability proves comparable to the competition and if the infrastructure/support hardware surface, WD will have a viable contender.
2. Command queuing is meant to assist multi-user situations, not single-user setups. With the recent release of Intel’s 9xx chipsets, pundits and enthusiasts everywhere have been proclaiming that command queuing is the next big thing for the desktop. Wrong. As evidenced by the disparities between the FastTrak S150 TX4 and TX4200 (otherwise identical except for the latter’s added TCQ functionality), command queuing introduces significant overhead that fails to pay for itself performance-wise in the highly-localized, lower-depth instances that even the heaviest single-user multitasking generates. It is becoming clear, in fact, that the maturity and across-the-board implementation of TCQ in the SCSI world is one of the principal reasons why otherwise mechanically superior SCSI drives stumble when compared to ATA units. Consider that out of the 24 combinations yielded from the four single-user access patterns, one-to-four drive RAID0 arrays, and RAID1/10 mirrored arrays presented above, the non-TCQ S150 TX4 comes out on top in every case by a large margin. TCQ is only meant for servers, much like the technology mentioned just below.
3. RAID helps multi-user applications far more than it does single-user scenarios. The enthusiasm of the power user community combined with the marketing apparatus of firms catering to such crowds has led to an extraordinarily erroneous belief that striping data across two or more drives yields significant performance benefits for the majority of non-server uses. This could not be farther from the truth! Non-server use, even in heavy multitasking situations, generates lower-depth, highly-localized access patterns where read-ahead and write-back strategies dominate. Theory has told those willing to listen that striping does not yield significant performance benefits. Some time ago, a controlled, empirical test backed what theory suggested. Doubts still lingered: irrationally, many believed that results would somehow be different if the array were based on a SATA or SCSI interface. As shown above, the results are the same. Save your time, money, and data: leave RAID for the servers!
We’re far from finished here! Competing SATA TCQ products from Pacific Digital Corp. and Highpoint Technologies are currently available while Silicon Image has a chipset yet to be incorporated into a shipping product. We’ll continue to work with other controller manufacturers to bring readers Raptor TCQ results paired with a variety of products. SATA NCQ is also just now entering prime time. As always, StorageReview will be there.