Wednesday, July 3, 2013

Big Data on-Demand with Apache Whirr

Originally published on the CloudStack Blog: http://buildacloud.org/blog/271-big-data-on-demand-with-apache-whirr.html

This post is a little more formal than usual, as I wrote it for a tutorial on running Hadoop in the cloud, but I thought it was useful enough to post here for everyone's benefit (hopefully).

When CloudStack graduated from the Apache Incubator in March 2013 it joined Hadoop as a Top-Level Project (TLP) within the Apache Software Foundation (ASF). This made the ASF the only open source foundation hosting both a cloud platform and a big data solution. Moreover, a closer look at the projects that make up the ASF shows that approximately 30% of the Apache Incubator and 10% of the TLPs are "Big Data" related. Projects such as HBase, Hive, Pig and Mahout started as sub-projects of the Hadoop TLP. Ambari, Kafka, Falcon and Mesos are part of the Incubator and closely tied to the Hadoop ecosystem.

To complement CloudStack, API wrappers such as Libcloud, deltacloud and jclouds are also part of the ASF. To connect CloudStack and Hadoop, two interesting projects are also in the ASF: Apache Whirr, a TLP, and Provisionr, currently in incubation. Both Whirr and Provisionr aim to provide an abstraction layer for defining Hadoop-based big data infrastructure and instantiating that infrastructure on clouds, including Apache CloudStack based clouds. This co-existence of CloudStack and the entire Hadoop ecosystem under the same open source foundation means that the same governance, processes and development principles apply to both projects, bringing great synergy that promises an even better complementarity.

In this tutorial we introduce Apache Whirr, an application that can be used to define, provision and configure big data solutions on CloudStack based clouds. Whirr automatically starts instances in the cloud and bootstraps Hadoop on them. It can also add packages such as Hive, HBase and YARN for map-reduce jobs.

Whirr [1] is a "set of libraries for running cloud services", and specifically big data services. Whirr is based on jclouds [2], a Java-based abstraction layer that provides a common interface to a large set of cloud services and providers such as Amazon EC2, Rackspace servers and CloudStack. As such, all cloud providers supported in jclouds are supported in Whirr. The core contributors of Whirr include four developers from Cloudera, the well-known Hadoop distributor. Whirr can also be used as a command line tool, making it straightforward for users to define and provision Hadoop clusters in the cloud.

As an Apache project, Whirr comes as a source tarball that can be downloaded from one of the Apache mirrors [3]. As with CloudStack, Whirr community members can also host packages; Cloudera hosts Whirr packages to ease installation. For instance, on Ubuntu and Debian based systems you can add the Cloudera repository by creating /etc/apt/sources.list.d/cloudera.list with the following contents:

deb [arch=amd64] http://archive.cloudera.com/cdh4/-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/-cdh4 contrib

With this repository in place, one can install whirr with:

$sudo apt-get install whirr

The whirr command will now be available. Developers can use the latest version of Whirr by cloning the software repository, writing new code and submitting patches, the same way they would for CloudStack. To clone the git repository of Whirr do:

$git clone git://git.apache.org/whirr.git

They can then build their own version of Whirr using Maven:

$mvn install

The whirr binary will be located under the bin directory. Adding it to one's path with:

$export PATH=$PATH:/path/to/whirr/bin

will make the whirr command available in the user's environment. A successful installation can be checked by simply entering:

$whirr --help

With whirr installed, one now needs to specify the credentials of the cloud that will be used to create the Hadoop infrastructure. A ~/.whirr/credentials file was created during the installation phase. The type of provider (e.g. cloudstack), the endpoint of the cloud, and the access and secret keys need to be entered in this credentials file like so:

PROVIDER=cloudstack
IDENTITY=
CREDENTIAL=
ENDPOINT=

For instance, on Exoscale [4], a CloudStack based cloud in Switzerland, the endpoint would be https://api.exoscale.ch/compute.
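As a concrete sketch, the file could be populated from the shell as shown below. The IDENTITY and CREDENTIAL values here are placeholders, not real keys; substitute your own API key and secret key from your cloud provider.

```shell
# Hypothetical example: populate ~/.whirr/credentials for a
# CloudStack based cloud. The key values below are placeholders.
mkdir -p ~/.whirr
cat > ~/.whirr/credentials <<'EOF'
PROVIDER=cloudstack
IDENTITY=your-api-key
CREDENTIAL=your-secret-key
ENDPOINT=https://api.exoscale.ch/compute
EOF
```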

Now that the CloudStack endpoint and keys have been configured, the Hadoop cluster that we want to instantiate needs to be defined. This is done in a properties file (here, hadoop.properties) using a set of Whirr specific configuration variables [5]. Below is the content of the file with explanations in-line:

------------------------------------------------------
# Set the name of your hadoop cluster
whirr.cluster-name=hadoop
# Change the name of cluster admin user
whirr.cluster-user=${sys:user.name}
# Change the number of machines in the cluster here
# Below we define one hadoop namenode and 3 hadoop datanodes
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
# Specify which distribution of hadoop you want to use
# Here we choose to use the Cloudera distribution
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
# Use a specific instance type.
# Specify the uuid of the CloudStack service offering to use
# for the instances of your hadoop cluster
whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8
# If you use ssh key pairs to access instances in the cloud,
# specify them like so
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_exoscale
whirr.public-key-file=${whirr.private-key-file}.pub
# Specify the template to use for the instances
# This is the uuid of the CloudStack template
whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2
------------------------------------------------------

To launch this Hadoop cluster use the whirr command line:

$whirr launch-cluster --config hadoop.properties

The following example output shows the instances being started and bootstrapped. At the end of the provisioning, whirr returns the ssh commands that can be used to access the hadoop instances.

------------------------------------------------------
Running on provider cloudstack using identity mnH5EbKcKeJd456456345634563456345654634563456345
Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
Starting 3 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(9d5c46f8-003d-4368-aabf-9402af7f8321)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(6727950e-ea43-488d-8d5a-6f3ef3018b0f)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(6a643851-2034-4e82-b735-2de3f125c437)
<< success executing InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725): {output=This function does nothing. It just needs to exist so Statements.call("retry_helpers") doesn't call something which doesn't exist
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Get:2 http://security.ubuntu.com precise-security Release [49.6 kB]
Hit http://ch.archive.ubuntu.com precise Release.gpg
Get:3 http://ch.archive.ubuntu.com precise-updates Release.gpg [198 B]
Get:4 http://ch.archive.ubuntu.com precise-backports Release.gpg [198 B]
Hit http://ch.archive.ubuntu.com precise Release
..../snip/.....
You can log into instances using the following ssh commands:
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.xx.yy.zz
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.zz.zz.rr
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.tt.yy.uu
[hadoop-namenode+hadoop-jobtracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no sebastiengoasguen@185.ii.oo.pp
------------------------------------------------------

To destroy the cluster from your client do:

$whirr destroy-cluster --config hadoop.properties

Using the ssh commands that Whirr printed, log in to the namenode and browse the Hadoop file system that was created:

$ hadoop fs -ls /
Found 5 items
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /hadoop
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /hbase
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /mnt
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /tmp
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /user

Create a directory for your input data:

$ hadoop fs -mkdir input
$ hadoop fs -ls /user/sebastiengoasguen
Found 1 items
drwxr-xr-x   - sebastiengoasguen supergroup          0 2013-06-21 20:15 /user/sebastiengoasguen/input

Create a test input file and put it in the Hadoop file system:

$ cat foobar
this is a test to count the words
$ hadoop fs -put ./foobar input
$ hadoop fs -ls /user/sebastiengoasguen/input
Found 1 items
-rw-r--r--   3 sebastiengoasguen supergroup         34 2013-06-21 20:17 /user/sebastiengoasguen/input/foobar

Define the map-reduce environment. Note that this default Cloudera distribution installation uses MRv1; to use YARN one would have to edit the hadoop.properties file.

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

Start the map-reduce job:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
13/06/21 20:19:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/21 20:20:00 INFO input.FileInputFormat: Total input paths to process : 1
13/06/21 20:20:00 INFO mapred.JobClient: Running job: job_201306212011_0001
13/06/21 20:20:01 INFO mapred.JobClient:  map 0% reduce 0%
13/06/21 20:20:11 INFO mapred.JobClient:  map 100% reduce 0%
13/06/21 20:20:17 INFO mapred.JobClient:  map 100% reduce 33%
13/06/21 20:20:18 INFO mapred.JobClient:  map 100% reduce 100%
13/06/21 20:20:21 INFO mapred.JobClient: Job complete: job_201306212011_0001
13/06/21 20:20:22 INFO mapred.JobClient: Counters: 32
13/06/21 20:20:22 INFO mapred.JobClient:   File System Counters
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes read=133
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes written=766347
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of write operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes read=157
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes written=50
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of write operations=3
13/06/21 20:20:22 INFO mapred.JobClient:   Job Counters
13/06/21 20:20:22 INFO mapred.JobClient:     Launched map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Launched reduce tasks=3
13/06/21 20:20:22 INFO mapred.JobClient:     Data-local map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=10956
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=15446
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:   Map-Reduce Framework
13/06/21 20:20:22 INFO mapred.JobClient:     Map input records=1
13/06/21 20:20:22 INFO mapred.JobClient:     Map output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Map output bytes=66
13/06/21 20:20:22 INFO mapred.JobClient:     Input split bytes=123
13/06/21 20:20:22 INFO mapred.JobClient:     Combine input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Combine output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input groups=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce shuffle bytes=109
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Spilled Records=16
13/06/21 20:20:22 INFO mapred.JobClient:     CPU time spent (ms)=1880
13/06/21 20:20:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=469413888
13/06/21 20:20:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5744541696
13/06/21 20:20:22 INFO mapred.JobClient:     Total committed heap usage (bytes)=207687680

And you can finally check the output:

$ hadoop fs -cat output/part-* | head
this 1
to  1
the  1
a  1
count 1
is  1
test 1
words 1
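These counts can be sanity-checked locally with standard Unix tools, no Hadoop required, since word count is conceptually just a split, a sort (so identical words are adjacent, like the shuffle phase) and a tally (like the reduce phase):

```shell
# Reproduce the word count locally: split on spaces, sort so
# identical words are adjacent, then tally with uniq -c.
# Each of the 8 distinct words in the test sentence appears once.
echo "this is a test to count the words" | tr ' ' '\n' | sort | uniq -c
```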

Of course this is a trivial example of a map-reduce job, and you will want to run more serious workloads. To help benchmark your cluster, Hadoop ships with an examples jar.

To benchmark your Hadoop cluster you can use the TeraSort tools available in the Hadoop distribution. Generate roughly 100 MB of input data with TeraGen (1,000,000 rows of 100 bytes each):

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teragen 1000000 output3
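A quick back-of-the-envelope check of the data size (pure shell arithmetic, no Hadoop needed): TeraGen writes fixed 100-byte rows, so the first argument, the row count, determines the dataset size directly.

```shell
# TeraGen writes 100-byte rows, so dataset size = rows * 100 bytes.
rows=1000000
bytes=$((rows * 100))
echo "$bytes bytes (~$((bytes / 1000000)) MB)"   # 100000000 bytes (~100 MB)
```

To generate a larger benchmark dataset, scale the row count accordingly (e.g. 10,000,000,000 rows for the canonical 1 TB terasort run).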

Sort it with TeraSort:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort output3 output4

And then validate the results with TeraValidate:

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teravalidate output4 outvalidate

Performance of map-reduce jobs run in cloud based Hadoop clusters will depend heavily on the Hadoop configuration, on the template and service offering being used, and of course on the underlying hardware of the cloud. Hadoop was not designed to run in the cloud, and some of its assumptions do not fit the cloud model; see [6] for more information. Deploying Hadoop in the cloud is nevertheless a viable solution for on-demand map-reduce applications. Development work is currently under way within the Google Summer of Code program to provide CloudStack with an Amazon Elastic MapReduce (EMR) compatible service. This service will be based either on Whirr or on StackMate [7], a new Amazon CloudFormation compatible interface.

[1] http://whirr.apache.org
[2] http://jclouds.incubator.apache.org
[3] http://www.apache.org/dyn/closer.cgi/whirr/
[4] http://exoscale.ch
[5] http://whirr.apache.org/docs/0.8.2/configuration-guide.html
[6] http://wiki.apache.org/hadoop/Virtual%20Hadoop
[7] https://github.com/chiradeep/stackmate




