| = Big Data course on Beocat =
| | This course is now available here: http://people.beocat.ksu.edu/~dan/education/bigdata/ |
| | |
| The Pittsburgh Supercomputing Center hosts 2-day remote Big Data workshops
| |
| several times each year. The information provided here will allow individual
| |
| users to go through the videos at their own pace and perform the exercises
| |
| on our local Beocat supercomputer. Each exercise has data and results
| tailored to the individual user, which lets instructors measure the progress of
| students assigned to take this course interactively.
| |
| | |
| Use the Agenda link below to access the slides, starting with the Welcome slides,
| which don't have an associated video. In the commands below, a leading '>'
| represents the command line prompt on Beocat, and '>>>' represents the prompt
| you'll get when you start pyspark or python.
| |
| | |
| Agenda: https://www.psc.edu/hpc-workshop-series/big-data
| |
| | |
| Videos: https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E
| |
| | |
| == Welcome ==
| |
| | |
| ssh into Beocat from your computer and copy the workshop data to your
| |
| home directory.
| |
| | |
| > cp -rp ~daveturner/workshops/bigdata_workshop .
| |
| > cd bigdata_workshop
| |
| | |
| PDF versions of the slides are available for each section,
| |
| as are directories containing the data for each set of exercises.
| |
| You can copy the PDF files to your local computer for viewing or click
| |
| on the web link for each section.
| |
| | |
| Follow along with the Welcome slides from the Agenda website link or PDF file
| |
| Big_Data_Welcome.pdf as you listen to the video. Much of this information is specific to
| |
| the Bridges supercomputer at PSC, so just skim these slides.
| |
| | |
| Welcome slides: https://www.psc.edu/images/xsedetraining/BigData/Big_Data_Welcome.pdf
| |
| | |
| == Intro to Big Data ==
| |
| | |
| Video: https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=1
| |
| Slides: https://www.psc.edu/images/xsedetraining/BigData/A_Brief_History_of_Big_Data.pdf
| |
| | |
| Watch the video 'Intro to Big Data - Big Data Video 1' and follow along with the slides.
| |
| | |
| == Hadoop ==
| |
| | |
| Hadoop slides: https://www.psc.edu/images/xsedetraining/BigData/Hadoop2019.pdf
| |
| | |
| Watch the video 'Hadoop - Big Data Video 2' (slides are <B>Hadoop2019.pdf</B>).
| Hadoop is not installed on Beocat, so the commands covered in the video will not work locally.
| |
| | |
| == Intro to Spark and Spark sections combined ==
| |
| | |
| The link below shows how to load the Spark and Python modules on Beocat,
| |
| set up the Python virtual environment, and run Spark code interactively
| |
| or through the Slurm scheduler.
| |
| | |
| https://support.beocat.ksu.edu/BeocatDocs/index.php/Installed_software#Spark
| |
| | |
| Watch the video 'Spark - Big Data Video 3' (slides are <B>Intro_To_Spark.pdf</B>)
| |
| | |
| Pause the video around the 43-minute mark and do exercises 1-5.
| |
| Try these yourself before they cover the answers.
| |
| You can do the demos and exercises interactively by requesting a core on an Elf node,
| or you can submit them as a batch job using a script
| (see ~/bigdata_workshop/Shakespeare/sb.shakespeare for an example).
| |
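When going the job-script route, keep in mind that only the interactive pyspark shell predefines the SparkContext as `sc`; a script has to create its own. The following is a minimal sketch of such a stand-alone script, not the contents of shakespeare.py or sb.shakespeare. The file name top_words.py, the helper names, and the tokenizing rule are all assumptions made for illustration.

```python
import re
import sys


def tokenize(line):
    """Lowercase a line and return its words (runs of letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def top_words(sc, path, n=5):
    """Return the n most frequent (word, count) pairs in the text file at path."""
    return (sc.textFile(path)
              .flatMap(tokenize)                 # one record per word
              .map(lambda word: (word, 1))       # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b)   # sum the counts per word
              .takeOrdered(n, key=lambda kv: -kv[1]))  # largest counts first


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: spark-submit top_words.py <text-file>")
    else:
        # Imported here so the helpers above can be tested without Spark.
        from pyspark import SparkContext

        sc = SparkContext.getOrCreate()  # a script must create its own context
        for word, count in top_words(sc, sys.argv[1]):
            print(word, count)
        sc.stop()
```

Inside the interactive shell, the same pipeline can be tried by calling `top_words` with the predefined `sc` and the path to any text file.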
| | |
| Request 1 core on an Elf node for interactive use, then load the modules
| |
| | |
| > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
| |
|
| |
| > module purge
| |
| > module load Spark
| |
| > module load Python
| |
|
| |
| > source ~/.virtualenvs/spark-test/bin/activate
| |
|
| |
| > pyspark
| |
| >>>
| |
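If pyspark fails to start, or `import pyspark` fails inside python, it can help to confirm that the virtual environment is active and that the package is visible to the interpreter. Below is a small sketch that uses only the Python standard library. The function names are made up for this example, and the module list is an assumption about what the workshop needs.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when this interpreter runs inside a venv or virtualenv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def missing_modules(names):
    """Return the top-level module names from names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    # "pyspark" is an assumed requirement; add other top-level names as needed.
    print("missing modules:", missing_modules(["pyspark"]))
```

A result of `False` for the first line usually means the `source .../activate` step was skipped in the current shell.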
| | |
| Email your solutions to exercises 1-5 to Dan along with a description of
| |
| how well you did on your own. To impress him, also include your solutions to
| homework assignments 1-3 from around the 103-minute mark.
| |
| Dave's answers are in ~/bigdata_workshop/Shakespeare/shakespeare.py.
| |
| | |
| == Machine Learning: Recommender System for Spark ==
| |
| | |
| If you want to run demos and exercises interactively,
| |
| request 1 core on an Elf node for interactive use, then load the modules
| |
| and activate your Python virtual environment.
| |
| | |
| > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
| |
|
| |
| > module purge
| |
| > module load Spark
| |
| > module load Python
| |
|
| |
| > source ~/.virtualenvs/spark-test/bin/activate
| |
| | |
| Watch the video 'Machine Learning Recommender System With Spark - Big Data Video 4'
| |
| (slides are <B>A_Recommender_System.pdf</B>)
| |
| Do the 3 exercises at 1:06 in the video and email Dan your answers and
| |
| a summary of how you did on your own.
| |
| | |
| Demos and exercises can be run on the node you're on using spark-submit
| | |
| > spark-submit recommender.py
| |
| | |
| You can also start pyspark and use it interactively
| |
| | |
| > pyspark
| |
| >>>
| |
| | |
| The recommender.py script can be run using the job script sb.recommender
| |
| | |
| > sbatch sb.recommender
| |
| | |
| == Deep Learning with TensorFlow ==
| |
| | |
| Watch the video 'Tensorflow - Big Data Video 5' (slides are <B>Deep_Learning.pdf</B>).
| PSC has a version of TensorFlow that runs on GPUs. The version on
| Beocat is newer, but runs on CPUs instead of GPUs.
| |
| | |
| You can do the demos on Beocat if you want. You will see a warning that the
| MNIST data will be deprecated in the future.
| |
| | |
| > module purge
| |
| > module load TensorFlow
| |
|
| |
| > source ~/.virtualenvs/spark-test/bin/activate
| |
|
| |
| > python
| |
| >>>
| |
| | |
| == Bridges ==
| |
| | |
| Watch the video 'A Big Data Platform - Big Data Video 6'
| |