            Getting Started With Hadoop On Demand (HOD)
            ===========================================

1. Pre-requisites:
==================

Hardware:
HOD requires a minimum of 3 nodes configured through a resource manager.

Software:
The following components are assumed to be installed before using HOD:
* Torque:
  (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
  Currently HOD supports Torque out of the box. We assume that you are
  familiar with configuring Torque. You can get information about this
  from the following link: 
  http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
* Python (http://www.python.org/)
  HOD requires Python version 2.5.1.
    
The following components can be optionally installed for getting better
functionality from HOD:
* Twisted Python: This can be used for improving the scalability of HOD
  (http://twistedmatrix.com/trac/)
* Hadoop: HOD can automatically distribute Hadoop to all nodes in the 
  cluster. However, it can also use a pre-installed version of Hadoop,
  if it is available on all nodes in the cluster.
  (http://hadoop.apache.org/core)
  HOD currently supports Hadoop 0.15 and above.

NOTE: HOD requires these components to be installed at the same
location on all nodes in the cluster. Configuration is also simpler
if they are installed at the same location on the submit nodes.

2. Resource Manager Configuration Pre-requisites:
=================================================

For using HOD with Torque:
* Install Torque components: pbs_server on a head node, pbs_moms on all
  compute nodes, and PBS client tools on all compute nodes and submit 
  nodes.
* Create a queue for submitting jobs on the pbs_server.
* Assign a name to all nodes in the cluster by setting a 'node
  property' on each of them. This can be done using the 'qmgr'
  command. For example (node-name is the name of a compute node):
  qmgr -c "set node node-name properties=cluster-name"
* Ensure that jobs can be submitted to the nodes, for example using
  the 'qsub' command (a verification sketch follows this list):
  echo "sleep 30" | qsub -l nodes=3
* More information about setting up Torque can be found by referring
  to the documentation under:
http://www.clusterresources.com/pages/products/torque-resource-manager.php
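
You can verify the Torque setup with the standard PBS client tools.
A minimal sketch, assuming the tools are on your PATH:

  # List the queues configured on the pbs_server:
  $ qstat -q

  # List all nodes with their state and node properties; the property
  # set above (cluster-name) should appear against each node:
  $ pbsnodes -a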

3. Setting up HOD:
==================

* HOD is available under the 'contrib' section of Hadoop under the root
  directory 'hod'.
* Distribute the files under this directory to all the nodes in the
  cluster. Note that the location where the files are copied should be
  the same on all the nodes.
* On the node from where you want to run hod, edit the file hodrc 
  which can be found in the <install dir>/conf directory. This file
  contains the minimal set of values required for running hod.
* Specify values suitable to your environment for the following
  variables defined in the configuration file. Note that some of these
  variables are defined in more than one place in the file. A
  substitution sketch follows this list.

  * ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
    1.5.x
  * ${CLUSTER_NAME}: Name of the cluster which is specified in the 
    'node property' as mentioned in resource manager configuration.
  * ${HADOOP_HOME}: Location of Hadoop installation on the compute and
    submit nodes.
  * ${RM_QUEUE}: Queue configured for submitting jobs in the resource
    manager configuration.
  * ${RM_HOME}: Location of the resource manager installation on the
    compute and submit nodes.
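
As an example, these variables can be filled in with a single 'sed'
pass over the file. This is a minimal sketch; the paths are
hypothetical, and it assumes GNU sed (replace <install dir> with your
actual install location):

  $ sed -i -e 's|${JAVA_HOME}|/usr/java/jdk1.5.0|g' \
           -e 's|${CLUSTER_NAME}|cluster-name|g' \
           -e 's|${HADOOP_HOME}|/opt/hadoop|g' \
           -e 's|${RM_QUEUE}|batch|g' \
           -e 's|${RM_HOME}|/usr/local/torque|g' \
           "<install dir>/conf/hodrc"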

* The following environment variables *may* need to be set depending on 
  your environment. These variables must be defined where you run the 
  HOD client, and also be specified in the HOD configuration file as the 
  value of the key resource_manager.env-vars. Multiple variables can be
  specified as a comma separated list of key=value pairs.

  * HOD_PYTHON_HOME: If Python is installed in a non-default location
    on the compute nodes or submit nodes, this variable must be
    defined to point to the Python executable in that location.
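
For example, to point HOD at a Python installed in a non-standard
location (the path below is hypothetical), export the variable where
the HOD client runs:

  $ export HOD_PYTHON_HOME=/usr/local/python-2.5.1/bin/python

and list it in hodrc as the value of the key resource_manager.env-vars:

  env-vars = HOD_PYTHON_HOME=/usr/local/python-2.5.1/bin/python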


NOTE: 

You can also review other configuration options in the file and
modify them to suit your needs. Refer to the file config.txt for 
information about the HOD configuration.


4. Running HOD:
===============

4.1 Overview:
-------------

A typical HOD session involves at least three steps: allocate, run
Hadoop jobs, deallocate.

4.1.1 Operation allocate
------------------------

The allocate operation is used to allocate a set of nodes and install and
provision Hadoop on them. It has the following syntax:

  hod -c config_file -t hadoop_tarball_location -o "allocate \
                                                cluster_dir number_of_nodes"

The hadoop_tarball_location must be a location on a shared file system
accessible from all nodes in the cluster. Note that the cluster_dir must
exist before running the command. If the command completes successfully,
cluster_dir/hadoop-site.xml will be generated and will contain information
about the allocated cluster's JobTracker and NameNode.
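
For example, the cluster directory used in the example below can be
created ahead of time with:

  $ mkdir -p ~/hadoop-cluster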

For example, the following command uses a hodrc file in ~/hod-config/hodrc and
allocates Hadoop (provided by the tarball ~/share/hadoop.tar.gz) on 10 nodes,
storing the generated Hadoop configuration in a directory named
~/hadoop-cluster:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o "allocate \
                                                        ~/hadoop-cluster 10"

HOD also supports an environment variable called HOD_CONF_DIR. If this is
defined, HOD will look for a default hodrc file at $HOD_CONF_DIR/hodrc.
Defining this allows the above command to also be run as follows:

  $ export HOD_CONF_DIR=~/hod-config
  $ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10" 

4.1.2 Running Hadoop jobs using the allocated cluster
-----------------------------------------------------

Now, one can run Hadoop jobs using the allocated cluster in the usual manner:

  hadoop --config cluster_dir hadoop_command hadoop_command_args

Continuing our example, the following command will run a wordcount example on
the allocated cluster:

  $ hadoop --config ~/hadoop-cluster jar \
       /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output 
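
Once the job completes, its output can be inspected through the same
configuration directory. A sketch, using the paths from the wordcount
run above (part-00000 is the typical name of the first output file):

  $ hadoop --config ~/hadoop-cluster dfs -ls /path/to/output
  $ hadoop --config ~/hadoop-cluster dfs -cat /path/to/output/part-00000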

4.1.3 Operation deallocate
--------------------------

The deallocate operation is used to release an allocated cluster. When
finished with a cluster, deallocate must be run so that the nodes become free
for others to use. The deallocate operation has the following syntax:

  hod -o "deallocate cluster_dir"

Continuing our example, the following command will deallocate the cluster:

  $ hod -o "deallocate ~/hadoop-cluster" 

4.2 Command Line Options
------------------------

This section covers the major command line options available via the hod
command:

--help
Prints out the help message to see the basic options.

--verbose-help
All configuration options provided in the hodrc file can also be passed
on the command line, using the syntax --section_name.option_name[=value].
When provided this way, a value given on the command line overrides the
one in hodrc. The --verbose-help option lists all the options available
in the hodrc file, and is also a convenient way to see what each
configuration option means.
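
As a sketch, the queue defined in hodrc could be overridden for a
single allocation as follows ('test-queue' is a hypothetical queue
name, and this assumes your hodrc defines the queue under the
resource_manager section):

  $ hod --resource_manager.queue=test-queue -o "allocate \
                                              ~/hadoop-cluster 10"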

-c config_file
Provides the configuration file to use. Can be used with all other options of
HOD. Alternatively, the HOD_CONF_DIR environment variable can be defined to
specify a directory that contains a file named hodrc, alleviating the need to
specify the configuration file in each HOD command.

-b 1|2|3|4
Enables the given debug level. Can be used with all other options of HOD. 4 is
most verbose.

-o "help"
Lists the operations available in the operation mode.

-o "allocate cluster_dir number_of_nodes"
Allocates a cluster on the given number of cluster nodes, and stores the
allocation information in cluster_dir for use with subsequent hadoop
commands. Note that the cluster_dir must exist before running the command.

-o "list"
Lists the clusters allocated by this user. Information provided includes the
Torque job id corresponding to the cluster, the cluster directory where the
allocation information is stored, and whether the Map/Reduce daemon is still
active or not.

-o "info cluster_dir"
Lists information about the cluster whose allocation information is stored in
the specified cluster directory.

-o "deallocate cluster_dir"
Deallocates the cluster whose allocation information is stored in the
specified cluster directory.

-t hadoop_tarball
Provisions Hadoop from the given tar.gz file. This option is only applicable
to the allocate operation. For better distribution performance it is
recommended that the Hadoop tarball contain only the libraries and binaries,
and not the source or documentation. 
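
For example, a slimmed-down tarball can be built by excluding the
source and documentation trees. This sketch assumes GNU tar and a
Hadoop install unpacked in the current directory as hadoop-0.20.0:

  $ tar -czf ~/share/hadoop.tar.gz \
        --exclude='hadoop-0.20.0/src' \
        --exclude='hadoop-0.20.0/docs' \
        hadoop-0.20.0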

-Mkey1=value1 -Mkey2=value2
Provides configuration parameters for the provisioned Map/Reduce daemons
(JobTracker and TaskTrackers). A hadoop-site.xml is generated with these
values on the cluster nodes.

-Hkey1=value1 -Hkey2=value2
Provides configuration parameters for the provisioned HDFS daemons (NameNode
and DataNodes). A hadoop-site.xml is generated with these values on the
cluster nodes.

-Ckey1=value1 -Ckey2=value2
Provides configuration parameters for the client from where jobs can be
submitted. A hadoop-site.xml is generated with these values on the submit
node.
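
For example, the following allocation sketch sets one Map/Reduce
parameter and one HDFS parameter for the provisioned daemons; the
values are illustrative, and -C parameters are passed the same way:

  $ hod -t ~/share/hadoop.tar.gz \
        -Mmapred.reduce.tasks=2 -Hdfs.block.size=134217728 \
        -o "allocate ~/hadoop-cluster 10"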