Monday, November 29, 2010

Everything you know about disks is wrong

I love high-scale systems and, more than anything, I love data from real systems. I've learned over the years that no environment is crueler, less forgiving, or harder to satisfy than real production workloads. Synthetic tests at scale are instructive, but nothing catches my attention like data from real, high-scale, production systems. Consequently, I really liked the disk population studies from Google and CMU at FAST 2007 (Failure Trends in a Large Disk Population and Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?). These two papers presented actual results from independent production disk populations of roughly 100,000 drives each. My quick summary of these two papers is basically "all you have learned about disks so far is probably wrong."

Disk failures are often the #1 or #2 component failure in a storage system, usually just ahead of memory. Occasionally fan failures lead disks, but that isn't the common case. We now have publicly available data on disk failures, but not much has been published on other component failure rates and even less on overall storage stack failure rates. Cloud storage systems are multi-tiered, distributed systems involving hundreds to even tens of thousands of servers and huge quantities of software. Modeling the failure rates of discrete components in the stack is difficult but, with the large amount of component failure data available to large fleet operators, it can be done. What's much more difficult to model are correlated failures.

Essentially, there are two challenges encountered when attempting to model overall storage system reliability: 1) availability of component failure data and 2) correlated failures. The former is available to very large fleet owners but is often unavailable publicly. Two notable exceptions are the disk reliability data from the two FAST'07 conference papers mentioned above. Other than these two data points, there is little credible component failure data publicly available. Admittedly, component manufacturers do publish MTBF data, but these data are often owned by the marketing rather than the engineering teams and they range between optimistic and epic works of fiction.
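
A quick way to see why the datasheet numbers look like fiction is to convert a quoted MTBF into the annualized failure rate (AFR) it implies and compare that against field data. A minimal sketch in Python, assuming an exponential (constant hazard) failure model and an illustrative 1,000,000-hour datasheet MTBF:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760

def mtbf_to_afr(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming an
    exponential (constant hazard rate) failure model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# Illustrative datasheet-style MTBF of 1,000,000 hours (an assumption for this sketch)
print(f"Implied AFR: {mtbf_to_afr(1_000_000):.2%}")  # ~0.87% per year
# The FAST'07 field studies reported annual failure/replacement rates
# several times higher than what datasheet MTBF figures like this imply.
```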

Even with good quality component failure data, modeling storage system failure modes and data durability remains incredibly difficult. What makes this hard is the second issue above: correlated failure. Failures don't always happen alone; many are correlated, and certain types of rare failures can take down the entire fleet or large parts of it. Just about every model assumes failure independence and then works out data durability to many decimal points. It makes for impressive models with long strings of nines, but the challenge is that the model is only as good as its inputs. And one of the most important model inputs is the assumption of component failure independence, which is violated by every real-world system of any complexity. Basically, these failure models are good at telling you when your design is not good enough, but they can never tell you how good your design actually is nor whether it is good enough.
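
To make that concrete, here is a toy version of the kind of independence-assuming durability model being criticized: it estimates the probability that all replicas of an object are lost before re-replication completes. The failure rate, repair window, and replica count below are illustrative assumptions, not measured values:

```python
# Toy durability model that assumes fully independent disk failures.
# All parameters below are illustrative assumptions.
afr = 0.03            # assumed annual disk failure rate (3%)
repair_hours = 24.0   # assumed time to re-replicate after a failure
replicas = 3

hourly_failure_prob = afr / (24 * 365)

# Probability that, once one replica fails, each remaining replica also
# fails within the repair window (independence assumed):
p_all_others_fail = (hourly_failure_prob * repair_hours) ** (replicas - 1)

# Expected first failures per object per year times that probability:
annual_loss_prob = afr * replicas * p_all_others_fail

print(f"Annual data-loss probability: {annual_loss_prob:.2e}")
# Prints a comfortingly tiny number (a long string of nines of durability),
# but only because the model assumes failures never arrive together.
```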

Where the models break down is in modeling rare events and non-independent failures. The best way to understand common correlated failure modes is to study storage systems at scale over longer periods of time. This won't help us understand the impact of very rare events. For example, two thousand years of history would not have helped us model or predict that an airplane would be flown into the World Trade Center. And the odds of it happening again 16 minutes and 20 seconds later would certainly have seemed close to impossible. Studying historical storage system failure data will not help us understand the potential negative impact of very rare black swan events, but it does help greatly in understanding the more common failure modes, including correlated or non-independent failures.

Murray Stokely recently sent me "Availability in Globally Distributed Storage Systems," which is the work of a team from Google and Columbia University. They look at a high-scale storage system at Google that includes multiple clusters of Bigtable, which is layered over GFS, which is in turn implemented as a user-mode application over the Linux file system. You might remember Stokely from a post I did back in March titled Using a Market Economy. In this more recent paper, the authors study tens of Google storage cells, each of which is comprised of between 1,000 and 7,000 servers, over a one-year period. The storage cells studied are from multiple datacenters in different regions and are being used by different projects within Google.

Conclusion: Two-way redundancy in two different datacenters is considerably more durable than four-way redundancy in a single datacenter (a back-of-envelope sketch after the bullets below illustrates why).

· Correlation among node failures dwarfs all other contributions to unavailability in production environments.

· Disk failures can result in permanent data loss but transitory node failures account for the majority of unavailability.
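
A back-of-envelope sketch of why the cross-datacenter result holds (as flagged above): if some failure events take out many nodes in a datacenter at once, adding more replicas inside that one datacenter stops helping, while a second datacenter does. The event probabilities below are invented purely for illustration:

```python
# Illustrative replica-placement comparison under correlated failures.
# Both probabilities are invented for illustration, not measured values.
p_node = 1e-3   # chance a given replica is independently unavailable
p_dc   = 1e-4   # chance an entire datacenter is unavailable (correlated event)

# Four replicas in one datacenter: data is unavailable if the datacenter
# is down, or if all four replicas are independently down.
p_4way_one_dc = p_dc + (1 - p_dc) * p_node ** 4

# Two replicas in each of two datacenters: data is unavailable only if
# both sites are effectively down (site outage or both local replicas down).
p_site_down = p_dc + (1 - p_dc) * p_node ** 2
p_2way_two_dcs = p_site_down ** 2

print(f"4-way, single datacenter: {p_4way_one_dc:.2e}")   # dominated by the correlated term, ~1e-4
print(f"2-way in two datacenters: {p_2way_two_dcs:.2e}")  # ~1e-8, several orders of magnitude better
```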

To read more: http://research.google.com/pubs/pub36737.html

The abstract of the paper:

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

Friday, November 12, 2010

Flash on Servers

The original case for network storage lies in the benefits of aggregating disks, which include more random IOPS and bandwidth as well as the management simplicity of concentrating data in a centralized place. A single application on a single host has historically had only limited ability to consume that aggregate bandwidth.

Flash in a networked storage environment does not get the same benefits from sharing. Since a single flash drive has enough IOPS to satisfy most applications, the equation changes. Is network storage still fast when the network latency is noticeable compared to flash? We can aggregate 5 SSDs to get 40,000 IOPS, but will the latency be a problem?
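
Some rough latency arithmetic frames the question; the figures below are assumptions for illustration, not measurements. Once the device answers in tens to hundreds of microseconds, a network round trip that was negligible against a disk seek becomes the dominant term:

```python
# Illustrative latency arithmetic (assumed figures, not measurements).
flash_read_us  = 100.0   # assumed flash device read latency
disk_seek_us   = 5000.0  # assumed disk seek + rotation time
network_rtt_us = 500.0   # assumed network + storage-stack round trip

for name, device_us in [("disk", disk_seek_us), ("flash", flash_read_us)]:
    local = device_us
    remote = device_us + network_rtt_us
    print(f"{name:5s} local {local:6.0f} us, networked {remote:6.0f} us "
          f"({remote / local:.1f}x slower over the network)")
# Against a disk the network adds ~10%; against flash it multiplies the
# response time several times over, which is the concern raised above.
```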

We will see flash on compute servers. Servers can already accommodate flash, and it is cheap enough that 250GB comes in at less than 10% of the server cost.

A new tier in the storage hierarchy: Steve argues we will see storage split into two tiers: the IOPS tier and the capacity tier. The IOPS-Tier will offer the best $/IOPS with low latency and will have sufficient capacity for the active data in most typical enterprise systems. It will be the new "primary storage." The Capacity-Tier will offer the best $/GB, with deduplication, compression, and good serial performance.
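
The two-tier split falls out directly when you compare the two metrics. The device prices and characteristics below are rough assumptions for illustration only:

```python
# Rough $/GB vs $/IOPS comparison (all figures are assumptions for illustration).
devices = {
    # name: (capacity_gb, random_iops, price_usd)
    "SSD (IOPS tier)":      (250,  8000, 500),
    "SATA disk (capacity)": (2000,  150, 150),
}

for name, (gb, iops, price) in devices.items():
    print(f"{name:22s}  ${price / gb:6.3f}/GB   ${price / iops:7.4f}/IOPS")
# The SSD wins decisively on $/IOPS; the disk wins decisively on $/GB.
# That split is the argument for a separate IOPS tier and capacity tier.
```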

Let's outline three broad categories of host-based flash usage: DAS (Direct Attached Storage), Primary Storage Service, and Cache.

· Host-Based Flash: DAS: A single point of failure, so you would use normal backup or application-based replication (as in Exchange or Oracle). Good for boot files, VM images, and temp files.

· Host-Based Flash: Primary Storage Service: File or block. Requires mirroring or RAID to peers. Allows capacity sharing. Steve believes Microsoft and VMware are likely to compete here; the integration is complex. Making reliable primary storage is hard, and this may impact the roles of folks in IT.

· Host-Based Flash: Cache: File or block. This will be a write-through cache; if you do write-back, you will occasionally lose data. Using a cache allows for centralized data management (in a remote network storage server). This can help preserve the current IT storage management roles by providing automatic "tiering" (a minimal sketch of the write-through behavior follows this list).
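
Here is the minimal write-through sketch referenced in the last bullet. All names are hypothetical; a dictionary stands in for the networked primary storage and another for the host-based flash:

```python
# Minimal write-through cache sketch (hypothetical, not a real product's API).
class WriteThroughCache:
    def __init__(self, backing_store: dict):
        self.backing = backing_store  # stands in for networked primary storage
        self.cache = {}               # stands in for host-based flash

    def read(self, key):
        if key not in self.cache:          # miss: fill from primary storage
            self.cache[key] = self.backing[key]
        return self.cache[key]

    def write(self, key, value):
        self.backing[key] = value          # write-through: primary updated first
        self.cache[key] = value            # then the local flash copy
        # A write-back cache would acknowledge before updating self.backing,
        # which is why losing the flash there can lose recent writes.

primary = {"block-0": b"old"}
cache = WriteThroughCache(primary)
cache.write("block-0", b"new")
assert primary["block-0"] == b"new"  # primary storage is always current
print(cache.read("block-0"))
```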

So, data management spans the IOPS tier and the Capacity tier. You need automated data movement and global access; manual placement is impractical at huge scale. It is going to be important to have a technology-independent language to specify the desired properties for the data: say what properties you want, not how to get them. New SLOs (Service Level Objectives) for max latency, bandwidth, availability, etc.
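
One way to picture such a technology-independent, declarative specification; the field names and values here are hypothetical, not from the talk:

```python
# Hypothetical declarative SLO: state the desired properties of the data,
# not which tier or device should hold it.
from dataclasses import dataclass

@dataclass
class StorageSLO:
    max_read_latency_ms: float
    min_bandwidth_mb_s: float
    availability: float   # e.g., 0.9999
    durability: float     # e.g., 0.99999999999

# The placement system, not the administrator, decides whether data with
# this SLO lands on the IOPS tier, the capacity tier, or both.
oltp_slo = StorageSLO(max_read_latency_ms=1.0,
                      min_bandwidth_mb_s=50.0,
                      availability=0.9999,
                      durability=0.99999999999)
print(oltp_slo)
```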

In summary, host-based flash is the new IOPS-Tier. It will replace high-performance primary storage in the cloud. The IOPS-Tier is not going to be economical as secondary or archival storage since its $/GB is higher than spinning media. The Capacity-Tier will be implemented as network storage and is optimized for the best $/GB. Data management between these two tiers will be important.

There was a lot of discussion around the SLOs (Service Level Objectives). Steve indicated that their vision was not a super-flexible set of knobs but rather a small set of options for defining data characteristics. Margo offered that one important aspect was the "value" of different parts of the data, which can, in turn, tell the system how hard to work to keep that data available.

Someone asked if there was any performance analysis for 3-tier storage (memory, IOPS-Tier, Capacity-Tier) over more traditional 2-tier storage (memory and disk). Steve answered that there had not been an explicit performance analysis.

Stonebraker observed that enterprises seem to like NAS/SAN but that the cloud guys are using direct attached storage. Are the forward-looking guys walking away from NAS/SAN? Steve replied that there are people very interested in NAS/SAN because it gives them a way to control the data and make servers stateless. Adding in the IOPS-Tier is a simple extension.

This was a fascinating and engaging presentation!
