Curtain Call!

By Mikey Butler, Vice President, R&D IaaS, EVault

I hate stealth mode!

We’ve been keeping our OpenStorage effort under wraps for the last 14+ months. Because what EVault and Seagate are doing with OpenStack is very bold, exciting and, frankly, cool, this hasn’t been easy.

Thankfully, we recently GA’d our LTS2 offering, which gives me the chance to raise the curtain on our OpenStack mission!

My reveal will come in multiple posts. Today, I’ll disclose the mission and summarize its top ten technical challenges. Follow-on posts will expand on those challenges and what we’ve learned tackling them.

Our Mission: Over the next five years, EVault and Seagate will create the world’s largest, most durable, most cost-effective, and easiest-to-adopt disk archival cloud.

Our Top Ten Challenges

  1. Scale (maximal economies of scale): We’ve done extensive financial modeling over the past several months, and it makes clear that our archive cloud must exceed 8 exabytes to achieve our pricing objectives. That’s 2 million 4 TB drives before any resiliency overhead!
  2. Density (minimal server resources): Our financial modeling also shows that the ratio of disk spindles to CPUs strongly influences cost. We are shooting for 500+ disks per server in our finished cloud.
  3. Storage overhead (minimal cost for resiliency): Swift currently supports only replication for resiliency. If we followed the recommended practice of three replicas per object (the original plus two copies), the 2 million drives of usable storage we need at scale would become 6 million!! Clearly we need more efficient resiliency models. Our target is at most 30-40% resiliency overhead.
  4. Power awareness (minimal power consumption): Our models show that when disk drives spin 24x7, 365 days a year, power is the #1 contributor to operating cost. We want 93% of the drives in our cloud powered down at any one time, with the remaining 7% powered up and providing object location and health information.
  5. Self-healing (minimal downtime): As the axiom goes, at scale all the unlikely events happen. Modeling done at a major university for Sandia Labs shows that at 1 million drives one can expect a significant disk fault every 6 seconds!! Clearly RAID is not the answer at cloud scale. Our archive cloud will need to push the limits of self-healing and self-monitoring.
  6. Low Touch (minimal human intervention): Human beings are the single most likely source of error in the data center. Industry data shows that human intervention results in a problem 20% of the time. Our cloud must aim for a zero-touch, eventual-healing model. Ideally, no one should need to intrude on the operating infrastructure at all! Labor cost is another reason for low touch. We are likely to have 2,000+ racks of equipment, and with enterprise staffing models we would not come close to our opex targets for salaries. We need to achieve a ratio of 1 operator for every 100 racks of equipment.
  7. Durability (no data lost, ever!): We want 13+ nines of durability (at least 99.99999999999%), with objects distributed across multiple disks, storage nodes, data centers, and geographic risk zones.
  8. Longevity (coping with technology and object evolution): We are an archive cloud and hence are committing to retain objects for potentially decades (50+ years). Our cloud must tolerate varying storage capacities and technologies over time. Who knows, in fifty years information technology may be holographic or biological. We certainly know that today’s technology will be obsolete. Whatever the future technology is, we are committing to our customers that their data will be on it, so our cloud must cope with migrating customers’ data to it. We also face the issue of object format migration: today’s data representations will evolve and become obsolete several times over a 50+ year retention period. We must seamlessly migrate our customers’ objects to future formats while keeping them accessible at all times.
  9. Interoperability (easily extending our customers’ infrastructure): We want our cloud to integrate easily and seamlessly into our customers’ existing infrastructures. Our cloud must at all times support the leading cloud APIs (S3 is a current example) as well as best-of-breed application integration technologies. We envision our cloud coupled tightly enough to our customers’ infrastructure to allow data processing across storage tiers.
  10. Next Gen Disk Drives (I’ve saved the best for last): Number 10 is not a challenge but a wonderful technology breakthrough, which we will eventually use for LTS2: the new Seagate Kinetic Drive. Kinetic is a dramatic re-imagining of the disk drive. Instead of the familiar SATA, SAS, or SCSI block storage device, Kinetic is an Ethernet-connected, object-friendly key-value drive with a generous amount of compute on board! Now disk drives can do much more on their own, helping us more easily address many of the challenges above. For example, with Kinetic, drives can perform object consistency checking and inter-drive object migration on their own, without servers in the data path, which implies fewer servers (see #2). LOVE IT!
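
The sizing arithmetic behind items 1 through 3 can be checked with a few lines of Python. This is a back-of-envelope sketch, not EVault’s financial model: the 8 EB, 4 TB, 500-disk, and three-replica figures come from the list above, while the 10+4 erasure-code layout (k data fragments plus m parity fragments, for an overhead of m/k) is a hypothetical stand-in for the "more efficient resiliency models" the list calls for.

```python
# Back-of-envelope fleet sizing for challenges 1-3.
# From the post: 8 EB usable, 4 TB drives, 500 disks per server,
# three replicas per object, and a 30-40% resiliency-overhead target.
# The 10+4 erasure-code layout is a hypothetical example, not a stated design.

TB = 10**12
EB = 10**18


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def raw_drives(usable_drives: int, overhead: float) -> int:
    """Drives needed once a fractional resiliency overhead is added (0.4 means +40%)."""
    return round(usable_drives * (1 + overhead))


usable = ceil_div(8 * EB, 4 * TB)   # drives holding 8 EB with no redundancy
servers = ceil_div(usable, 500)     # servers for the usable drives alone

# Three full replicas store every byte three times: 200% overhead.
replicated = raw_drives(usable, 2.0)

# A k+m erasure code stores k data fragments plus m parity fragments: overhead m/k.
k, m = 10, 4                        # hypothetical parameters
erasure_coded = raw_drives(usable, m / k)

print(f"usable drives:        {usable:,}")         # 2,000,000
print(f"servers at 500 disks: {servers:,}")        # 4,000
print(f"3x replication:       {replicated:,}")     # 6,000,000
print(f"{k}+{m} erasure code:  {erasure_coded:,}")  # 2,800,000
```

A 10+4 layout lands right at the 40% ceiling; wider stripes such as 12+4 (about 33%) lower the overhead further, generally at the cost of touching more drives for each read or repair.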
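
The power target in item 4 can be sketched the same way. The 2 million drives and the 7% powered-up share come from the list; the per-drive wattages (ACTIVE_W and STANDBY_W) are hypothetical placeholders chosen only to illustrate the calculation.

```python
# Sketch for challenge 4: drive counts and relative power for a fleet that keeps
# only 7% of its drives spun up. Wattages are hypothetical placeholders,
# not Seagate specifications.

TOTAL_DRIVES = 2_000_000
ACTIVE_W = 8.0    # hypothetical draw of a spinning drive, in watts
STANDBY_W = 0.5   # hypothetical draw of a spun-down drive, in watts


def powered_up(total: int, percent_up: int) -> int:
    """Number of drives spinning when percent_up percent of the fleet is active."""
    return total * percent_up // 100


def fleet_watts(total: int, percent_up: int) -> float:
    """Total fleet draw with the given share of drives spinning and the rest in standby."""
    up = powered_up(total, percent_up)
    return up * ACTIVE_W + (total - up) * STANDBY_W


up = powered_up(TOTAL_DRIVES, 7)
always_on = fleet_watts(TOTAL_DRIVES, 100)
mostly_off = fleet_watts(TOTAL_DRIVES, 7)

print(f"drives spinning: {up:,}")                              # 140,000
print(f"power vs. always-on: {mostly_off / always_on:.1%}")    # 12.8%
```

Once 93% of drives are idle, the standby draw dominates, so the achievable savings hinge on how little a spun-down drive consumes; with the placeholder numbers above, fleet power falls to roughly an eighth of the always-on figure.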
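
Items 5 and 6 are easiest to appreciate as rates. The sketch below converts the figure from the university modeling cited in item 5 (a significant fault every 6 seconds across 1 million drives) into a per-drive rate and scales it to the 2-million-drive fleet from item 1. It assumes faults grow linearly with drive count and that the 2,000 racks in item 6 house that same fleet; both are assumptions made only for illustration.

```python
# Sketch for challenges 5 and 6: turning a fleet-wide fault interval into rates.
# Assumptions (not from the post): faults scale linearly with drive count, and
# the 2,000 racks of challenge 6 hold the 2-million-drive fleet of challenge 1.

SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000


def per_drive_rate(drives: int, seconds_between_faults: float) -> float:
    """Fault events per drive per year implied by a fleet-wide fault interval."""
    return SECONDS_PER_YEAR / seconds_between_faults / drives


def faults_per_day(drives: int, rate: float) -> float:
    """Expected fleet-wide fault events per day for a given per-drive annual rate."""
    return drives * rate * SECONDS_PER_DAY / SECONDS_PER_YEAR


rate = per_drive_rate(1_000_000, 6)            # about 5.26 events per drive-year
daily_1m = faults_per_day(1_000_000, rate)     # about 14,400 per day
daily_2m = faults_per_day(2_000_000, rate)     # about 28,800 per day

operators = 2_000 // 100                       # 1 operator per 100 racks -> 20
per_operator = daily_2m / operators            # about 1,440 per operator per day

print(f"{rate:.2f} faults per drive-year")
print(f"{daily_2m:,.0f} faults/day across {operators} operators "
      f"({per_operator:,.0f} each)")
```

Under those assumptions, each of the 20 operators would face roughly 1,440 fault events a day, which illustrates why the list calls for a zero-touch, eventual-healing model rather than human repair.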
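
For item 7, durability "nines" translate directly into expected losses. This sketch assumes 13 nines means a per-object annual loss probability of 10^-13 (the list does not state a time window) and uses a hypothetical archive of one trillion objects; the 50-year horizon comes from item 8.

```python
import math

# Sketch for challenge 7. Assumption: "13 nines" is read as a per-object annual
# loss probability of 1e-13; the post does not state the time window.
ANNUAL_LOSS = 1e-13
OBJECTS = 10**12     # hypothetical archive of one trillion objects
YEARS = 50           # the retention horizon from challenge 8


def nines(loss_probability: float) -> float:
    """Durability expressed as a count of nines."""
    return -math.log10(loss_probability)


def loss_over(years: int, annual_loss: float) -> float:
    """Probability an object is lost within `years`, i.e. 1 - (1 - p)**years.

    log1p/expm1 keep precision when p is tiny and 1 - p rounds toward 1.
    """
    return -math.expm1(years * math.log1p(-annual_loss))


per_object_50y = loss_over(YEARS, ANNUAL_LOSS)   # about 5e-12
expected_lost_per_year = OBJECTS * ANNUAL_LOSS   # about 0.1 objects
expected_lost_50y = OBJECTS * per_object_50y     # about 5 objects

print(f"nines: {nines(ANNUAL_LOSS):.1f}")        # 13.0
print(f"expected losses per year: {expected_lost_per_year:.2f}")
print(f"expected losses over {YEARS} years: {expected_lost_50y:.2f}")
```

The calculation shows why durability targets have to grow with object count and retention time: the same per-object probability yields more expected losses as the archive and its horizon get larger.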
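
The format-migration requirement in item 8 suggests a chain of single-step upgraders, so that an object written in any past format can be brought forward to the current one. The sketch below is a minimal illustration under that assumption; StoredObject, UPGRADERS, and the version 1/2/3 layouts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical archived object: a format version plus its payload."""
    format_version: int
    payload: dict


# Hypothetical upgrade steps: the function at key n converts version n to n + 1.
Upgrader = Callable[[dict], dict]
UPGRADERS: Dict[int, Upgrader] = {
    # v1 -> v2: add an explicit text-encoding field (builds a new dict).
    1: lambda p: {**p, "encoding": "utf-8"},
    # v2 -> v3: move the text under "body" and the encoding under "meta".
    2: lambda p: {"body": p["text"], "meta": {"encoding": p["encoding"]}},
}
CURRENT_VERSION = 3


def migrate(obj: StoredObject) -> StoredObject:
    """Apply upgrade steps in order until the object reaches CURRENT_VERSION."""
    version, payload = obj.format_version, obj.payload
    while version < CURRENT_VERSION:
        step = UPGRADERS.get(version)
        if step is None:
            raise ValueError(f"no upgrade path from format version {version}")
        payload = step(payload)
        version += 1
    return StoredObject(version, payload)


old = StoredObject(1, {"text": "quarterly ledger"})
new = migrate(old)
print(new)
```

Because migrate returns a new object and leaves the original untouched, the old copy stays readable until the upgraded one is verified, which is one way to keep objects accessible throughout a migration.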
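
The point in item 10 about drive-side consistency checking and drive-to-drive migration can be illustrated with an in-memory stand-in. KeyValueDrive and its put, get, scrub, copy_to, and corrupt methods are hypothetical and do not reproduce the real Kinetic protocol or client library. The sketch only shows the shape of the idea: each drive keeps a SHA-256 checksum beside every value, verifies its own contents, and is repaired from a peer drive.

```python
import hashlib
from typing import Dict, List, Tuple


class KeyValueDrive:
    """Hypothetical in-memory stand-in for an Ethernet key-value drive (not the Kinetic API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._store: Dict[bytes, Tuple[bytes, bytes]] = {}  # key -> (value, sha256 digest)

    def put(self, key: bytes, value: bytes) -> None:
        """Store a value together with its checksum."""
        self._store[key] = (value, hashlib.sha256(value).digest())

    def get(self, key: bytes) -> bytes:
        return self._store[key][0]

    def scrub(self) -> List[bytes]:
        """Recompute every checksum locally; return keys whose data no longer matches."""
        return [k for k, (v, digest) in self._store.items()
                if hashlib.sha256(v).digest() != digest]

    def copy_to(self, peer: "KeyValueDrive", key: bytes) -> None:
        """Model a drive-to-drive copy with no server handling the bytes."""
        peer.put(key, self.get(key))

    def corrupt(self, key: bytes) -> None:
        """Test hook: alter the stored value without updating its checksum."""
        value, digest = self._store[key]
        self._store[key] = (value[::-1], digest)


drive_a = KeyValueDrive("drive-a")
drive_b = KeyValueDrive("drive-b")

drive_a.put(b"obj-1", b"archived bytes")
drive_a.copy_to(drive_b, b"obj-1")   # replicate drive to drive

drive_a.corrupt(b"obj-1")            # simulate silent corruption on drive A
damaged = drive_a.scrub()            # drive A detects the bad key itself
for key in damaged:
    drive_b.copy_to(drive_a, key)    # heal from the healthy peer

print(damaged)                       # [b'obj-1']
print(drive_a.scrub())               # []
```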

Can you understand now why I used those adjectives: bold, exciting and cool…and why it’s been hard to be stealthy!?!

Cheers until the next time,
Mikey