Revision as of 15:44, 24 March 2020

Big Data course on Beocat

The Pittsburgh Supercomputing Center hosts 2-day remote Big Data workshops several times each year. The information provided here allows individual users to go through the videos at their own pace and perform the exercises on our local Beocat supercomputer. Each exercise uses data and results tailored to the individual user, which allows instructors to measure the progress of students assigned to take this course.

Use the Agenda website below to access the slides, starting with the Welcome slides, which don't have an associated video. The '>' sign at the start of lines below represents the command-line prompt on Beocat, and '>>>' represents the prompt you'll get when you start pyspark or python.

Agenda: https://www.psc.edu/hpc-workshop-series/big-data

Videos: https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E

Welcome

ssh into Beocat from your computer and copy the workshop data to your home directory.

 > cp -rp ~daveturner/workshops/bigdata_workshop .
 > cd bigdata_workshop

PDF versions of the slides are available for each section, as are directories containing the data for each set of exercises. You'll need to copy the PDF files to your local computer for viewing.

Go through the Welcome slides from the Agenda website link or PDF file Big_Data_Welcome.pdf. Much of this information is specific to the Bridges supercomputer at PSC so just scan over these slides.

Intro to Big Data

[web link is bad] Watch the video 'Intro to Big Data - Big Data Video 1' (slides are A_Brief_History_of_Big_Data.pdf)

Hadoop

Watch the video 'Hadoop - Big Data Video 2' (slides are Hadoop2019.pdf). We do not have Hadoop on Beocat, so the commands covered will not work locally.

Intro to Spark and Spark sections combined

The link below shows how to load the Spark and Python modules on Beocat, set up the Python virtual environment, and run Spark code interactively or through the Slurm scheduler.

https://support.beocat.ksu.edu/BeocatDocs/index.php/Installed_software#Spark

Watch the video 'Spark - Big Data Video 3' (slides are Intro_To_Spark.pdf)

Pause the video and do exercises 1-5 around the 43-minute mark. Try them yourself before the answers are covered. You can do the demos and exercises interactively by requesting an Elf core, or you can submit the job using a script (see ~/bigdata_workshop/Shakespeare/sb.shakespeare as an example).
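The working batch example is sb.shakespeare in the Shakespeare directory. As a rough sketch of what such a Slurm batch script might look like — using the same modules, virtual environment, and resource request shown in the interactive instructions on this page (the job name, time limit, and spark-submit target are assumptions, not a copy of sb.shakespeare):

```shell
#!/bin/bash
#SBATCH --job-name=spark-wordcount   # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --mem=10G
#SBATCH --constraint=elves

# Load the same modules used for interactive runs
module purge
module load Spark
module load Python

# Activate the virtual environment from the Spark setup page
source ~/.virtualenvs/spark-test/bin/activate

# Submit a Spark script; the path here is an assumption --
# see sb.shakespeare for the actual script it runs
spark-submit ~/bigdata_workshop/Shakespeare/shakespeare.py
```

You would submit it with `sbatch` rather than running it directly.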

Request 1 core on an Elf node for interactive use, then load the modules:

 > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
 
 > module purge
 > module load Spark
 > module load Python
 
 > source ~/.virtualenvs/spark-test/bin/activate
 
 > pyspark
 >>>
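
A classic first exercise at this prompt is a word count. Since Spark itself only runs on the cluster, the same flatMap / map / reduceByKey pipeline can be sketched in plain Python with a couple of hypothetical lines of text (the commented RDD call shows where the real pyspark session would get its input):

```python
from collections import Counter

# Hypothetical input lines; in pyspark you would instead start from
# an RDD, e.g. lines = sc.textFile(...) with a path to the data
lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into individual words
words = [w for line in lines for w in line.lower().split()]

# map + reduceByKey: count occurrences of each word
counts = Counter(words)

print(counts["to"])  # 2
print(counts["be"])  # 2
```

In pyspark the counting step becomes `words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, but the logic is the same.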

Email your solutions to exercises 1-5 to Dan, along with a description of how well you did on your own. Also include your solutions to homework assignments 1-3 around the 103-minute mark if you want to impress him. Dave's answers are in ~/bigdata_workshop/Shakespeare/shakespeare.py.