From Beocat

Revision as of 14:36, 24 August 2016

Hadoop

Hadoop is a "Big Data" distributed processing service. It is primarily used for very large data sets (greater than 1 TB).

Hadoop does not integrate well with SGE (or, for that matter, any other HPC scheduling system), so we have created a separate Cloudera Hadoop cluster to accommodate the increased usage of Hadoop on campus.

To use Hadoop:

  • Log in to Beocat
  • From there, log in to the Hadoop headnode, named 'gremlin00': ssh gremlin00
  • Copy files into or out of the Hadoop filesystem. Use hdfs dfs -put and hdfs dfs -get to copy files. Note:
    1. The Hadoop filesystem is smaller than the Beocat filesystem and is not backed up.
    2. Please copy data back out of Hadoop as soon as you are done using it.
    3. Data which remains untouched may be deleted with no prior notice.
    4. We make no guarantees about the viability of data within HDFS, nor about how long the data will remain available.
    5. If you have data from your runs that you need to keep please copy it out of HDFS as soon as possible.
    6. The HDFS volume may disappear at any time, and when it comes back your data may not.
  • Run your Hadoop job: yarn jar path/to/file.jar
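
The steps above can be sketched as a single stage-in, run, stage-out cycle, intended to be run on the headnode after logging in. This is only an illustration under stated assumptions: the local paths, the HDFS directory, and the input/output arguments passed to the jar are hypothetical, and what a particular jar expects on its command line depends on that jar. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of one copy-in / run / copy-out cycle on the Hadoop headnode.
# All paths and the jar arguments below are hypothetical placeholders.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

run() {
    # Print the command, and execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

hadoop_cycle() {
    local_in="$1"; hdfs_dir="$2"; jar="$3"; local_out="$4"
    run hdfs dfs -mkdir -p "$hdfs_dir/input" &&
    run hdfs dfs -put "$local_in" "$hdfs_dir/input" &&
    # Assumes the jar takes an input and an output directory as arguments.
    run yarn jar "$jar" "$hdfs_dir/input" "$hdfs_dir/output" &&
    # HDFS is not backed up: copy results out right away, then clean up.
    run hdfs dfs -get "$hdfs_dir/output" "$local_out" &&
    run hdfs dfs -rm -r "$hdfs_dir"
}

hadoop_cycle "${HOME:-.}/data/input.txt" "/user/${USER:-username}/myjob" \
    "path/to/file.jar" "${HOME:-.}/results"
```

Because the commands are chained with &&, the HDFS copy is only removed after the results have been copied out successfully.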

(Some) Best Practices

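  • Block size is set to 64 MB on the HDFS filesystem
    • As such, please keep the files stored there at least that size. If you need smaller files, we recommend using HAR files (https://hadoop.apache.org/docs/r2.6.4/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html)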
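  • Multiple users running jobs at the same time can be problematic, as they can slow each other down. If you can, try to run jobs when the cluster isn't already running someone else's jobs.
    • You can check the status of the Hadoop cluster from any Beocat host with elinks http://gremlin00.beocat:8088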
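
As a rough aid for the block-size recommendation, the following sketch lists local files smaller than 64 MB before they are uploaded. The CHECK_DIR variable, the small.har archive name, and the HDFS paths in the printed hint are hypothetical; the hint only shows the general shape of a hadoop archive invocation, which operates on files that are already in HDFS.

```shell
#!/bin/sh
# Sketch: find local files smaller than the 64 MB HDFS block size.
BLOCK_BYTES=$((64 * 1024 * 1024))   # 67108864 bytes

small_files() {
    # -size -Nc matches files strictly smaller than N bytes.
    find "$1" -type f -size "-${BLOCK_BYTES}c"
}

CHECK_DIR="${CHECK_DIR:-.}"   # hypothetical: directory you plan to upload
count=$(small_files "$CHECK_DIR" | wc -l | tr -d ' ')
if [ "$count" -gt 0 ]; then
    echo "$count file(s) under $CHECK_DIR are smaller than one block."
    echo "Once uploaded, they could be bundled into a HAR file, for example:"
    echo "  hadoop archive -archiveName small.har -p /user/${USER:-username}/myjob input /user/${USER:-username}/myjob/archives"
fi
```

The size is given in bytes rather than as 64M because find rounds file sizes up to whole units before comparing, which would miss files just under the block size. To see whether other jobs are currently running, yarn application -list on the headnode shows the active applications.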