Tuesday, April 3, 2012

#CloudStack and #Hadoop: a Match Made in the Cloud

Today Citrix announced that CloudStack would become the cloud platform project in the Apache Software Foundation. I’m excited not just because CloudStack will be an incredibly vibrant and successful project by itself; I also believe there is a tremendous amount of synergy between CloudStack and other cloud-related projects in the Apache Software Foundation. I look forward to continuing to work with, for example, the Apache Libcloud and Deltacloud projects.
I am most excited, however, about the prospect of integrating with the Apache Hadoop project. Known primarily as the technology for Big Data applications, Hadoop has gained widespread adoption in the industry. Just as CloudStack was inspired by Amazon’s EC2 service, Hadoop is modeled after Google’s MapReduce and Google File System technologies. And just like CloudStack, Hadoop is implemented in Java.
At the lowest level, the Hadoop Distributed File System (HDFS) is a distributed and scalable file system. HDFS is designed to run on a large number of hosts and achieves reliability by automatically replicating data across multiple hosts. The Hadoop project also includes a MapReduce engine and the HBase distributed database (modeled after Google’s BigTable). MapReduce and HBase run on top of HDFS. Highly reliable and highly efficient, Hadoop technology is being used by some of the largest cloud companies, including eBay, Yahoo! and Facebook.
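To make the replication idea concrete, here is a minimal sketch in Python. It is a toy model, not HDFS's actual placement policy (which is rack-aware and considerably more involved). The function names `place_replicas` and `surviving_blocks`, the host names, and the round-robin strategy are all assumptions made for illustration; the only detail taken from HDFS itself is the default replication factor of 3.

```python
from itertools import cycle, islice


def place_replicas(block_ids, hosts, replication=3):
    """Assign each block to `replication` distinct hosts, round-robin.

    Hypothetical helper: a simplified stand-in for a real placement policy.
    """
    if replication > len(hosts):
        raise ValueError("need at least as many hosts as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Start at a different host for each block so load spreads evenly.
        start = i % len(hosts)
        placement[block] = list(islice(cycle(hosts), start, start + replication))
    return placement


def surviving_blocks(placement, failed_hosts):
    """Return the blocks that still have at least one replica on a live host."""
    failed = set(failed_hosts)
    return [
        block
        for block, replicas in placement.items()
        if any(host not in failed for host in replicas)
    ]
```

With three replicas per block, any two hosts can fail without losing data: calling `surviving_blocks(place_replicas(["b0", "b1", "b2"], ["h1", "h2", "h3", "h4"]), ["h1", "h2"])` still returns all three blocks.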
Today, CloudStack users already run Hadoop on CloudStack, implementing a service very similar to Amazon’s Elastic MapReduce (EMR). For cloud service providers, Hadoop represents a significant amount of workload that can be readily moved to the cloud. Enterprise deployments can achieve tremendous savings by leveraging the same CloudStack infrastructure to host Big Data workloads. Users also leverage CloudStack’s bare metal provisioning capabilities to build high-performance Hadoop clusters.
Working closely with the Hadoop development community, we have started to explore other ways to integrate CloudStack and Hadoop. Because of its scalability, reliability, performance, and maturity, HDFS is a great object store solution for an IaaS cloud. We have started the development of an S3 API front-end for HDFS. Once that work is complete, the combination of CloudStack and Hadoop will provide features equivalent to Amazon’s EC2 and S3 services.
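One piece of any S3-style front-end for a file system is translating a bucket and object key into a file path. The sketch below shows one way that mapping could look in Python. It is purely illustrative and does not describe the actual CloudStack design: the `/s3` root directory and the function names `s3_to_hdfs_path` and `hdfs_path_to_s3` are hypothetical, and no Hadoop API is called.

```python
import posixpath

ROOT = "/s3"  # hypothetical HDFS directory that holds all buckets


def s3_to_hdfs_path(bucket, key, root=ROOT):
    """Map an S3-style (bucket, key) pair to a file path under `root`."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket name must be non-empty and contain no '/'")
    parts = key.split("/")
    # Reject empty or relative segments so a key cannot escape its bucket.
    if not key or any(p in ("", ".", "..") for p in parts):
        raise ValueError("key has empty or relative path segments")
    return posixpath.join(root, bucket, *parts)


def hdfs_path_to_s3(path, root=ROOT):
    """Inverse mapping: recover (bucket, key) from a path under `root`."""
    prefix = root.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError("path is not under the object-store root")
    bucket, _, key = path[len(prefix):].partition("/")
    return bucket, key
```

For example, `s3_to_hdfs_path("photos", "2012/04/cat.jpg")` yields `/s3/photos/2012/04/cat.jpg`. This sketch is stricter than S3 itself, which accepts keys such as `a//b`; a real front-end would need an escaping scheme for keys that are not valid file paths.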
Of course, HDFS will be just one of many technologies CloudStack integrates with to implement an S3-compatible object store. CloudStack will continue to work with other scalable storage solutions such as SwiftStack, NexentaStor, and Gluster, as well as commercial solutions from NetApp, EMC, Scale Computing, and Caringo. Many of these vendors have incorporated technologies similar to Apache Hadoop in their products. By deploying CloudStack with one of these object store technologies, we can all benefit from the best of both Amazon-style and Google-style clouds!
