|
|
(7 intermediate revisions by 2 users not shown) |
Line 1: |
Line 1: |
| == Hadoop == | | == Hadoop == |
| Hadoop does not integrate well with SGE (or, for that matter, any other HPC scheduling system). So we have created our own separate Cloudera Hadoop cluster to accommodate the increased usage of Hadoop on campus. | | [http://hadoop.apache.org/ Hadoop] is a "Big Data" distributed processing service. It is primarily used for very large data sets (greater than 1 TB). |
|
| |
|
| To use Hadoop:
| | Our previous Hadoop cluster has died, and we are evaluating other options. One such option is [https://github.com/LLNL/magpie/tree/master/doc magpie]. We will let the list know if and when we have something functional. |
| * Log in to Beocat
| |
| * From there, log in to the Hadoop head node, named 'theia': <tt>ssh theia</tt>
| |
| * Copy files into or out of the Hadoop filesystem with <tt>hadoop fs -put</tt> and <tt>hadoop fs -get</tt>. Note that the Hadoop filesystem is smaller than the Beocat filesystem and is not backed up. Please copy data back out of Hadoop as soon as you are done using it. '''Data which remains untouched may be deleted with no prior notice.'''
| |
| * Run your Hadoop job: <tt>hadoop jar path/to/file.jar</tt>
| |
Latest revision as of 14:50, 1 May 2018
Hadoop
Hadoop is a "Big Data" distributed processing service. It is primarily used for very large data sets (greater than 1 TB).
Our previous Hadoop cluster has died, and we are evaluating other options. One such option is magpie. We will let the list know if and when we have something functional.