= Big Data course on Beocat =

This course is now available here: http://people.beocat.ksu.edu/~dan/education/bigdata/

The Pittsburgh Supercomputing Center hosts 2-day remote Big Data workshops
several times each year.  The information provided here will allow individual
users to go through the videos at their own pace and perform the exercises
on our local Beocat supercomputer.  Each exercise will have data and results
tailored to each individual to allow instructors to measure the progress of
students assigned to take this course interactively.

Use the Agenda website below to access the slides, starting with the Welcome slides
that don't have an associated video.  The '>' sign at the start of lines below
represents the command line prompt on Beocat, and '>>>' represents the prompt
you'll get when you start pyspark or python.

Agenda: https://www.psc.edu/current-workshop
* [[File: KSU-BigData-ch0-agenda.pdf]]

Videos:  https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E

== Welcome ==

Use ssh to log in to Beocat from your computer, then copy the workshop data to your
home directory.

  > cp -rp ~daveturner/workshops/bigdata_workshop .
  > cd bigdata_workshop

PDF versions of the slides are available for each section,
as are directories containing the data for each set of exercises.
You can copy the PDF files to your local computer for viewing or click
on the web link for each section.

Follow along with the Welcome slides from the Agenda website link or PDF file
Big_Data_Welcome.pdf as you listen to the video.  Much of this information is specific to
the Bridges supercomputer at PSC, so just scan over these slides.

Welcome slides:

{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Big_Data_Welcome.pdf
|width=750
|height=600
}}

== Intro to Big Data ==

Video:  https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=1

Slides:

{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/A_Brief_History_of_Big_Data.pdf
|width=750
|height=600
}}

Watch the video 'Intro to Big Data - Big Data Video 1' and follow along with the slides.

== Hadoop ==

Video:  https://www.youtube.com/watch?v=WpxBFQr-ccw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=2

Slides:

{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Hadoop2020.pdf
|width=750
|height=600
}}

Watch the video 'Hadoop - Big Data Video 2'.
We do not have Hadoop on Beocat, so the commands they cover will not work locally.

== Intro to Spark and Spark sections combined ==

Video:  https://www.youtube.com/watch?v=iONkwqP2fEk&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=3

Slides:

{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Intro_To_Spark.pdf
|width=750
|height=600
}}

Watch the video 'Spark - Big Data Video 3'.

The link below shows how to load the Spark and Python modules on Beocat,
set up the Python virtual environment, and run Spark code interactively
or through the Slurm scheduler.

https://support.beocat.ksu.edu/BeocatDocs/index.php/Installed_software#Spark

Pause the video around the 43-minute mark and do exercises 1-5.
Try these yourself before they cover the answers.
You can do the demos and exercises interactively by requesting an Elf core,
or you can submit the job using a script
(see ~/bigdata_workshop/Shakespeare/sb.shakespeare as an example).

Request 1 core on an Elf node for interactive use, then load the modules,
activate your Python virtual environment, and start pyspark.

  > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
  
  > module purge
  > module load Spark
  > module load Python
  
  > source ~/.virtualenvs/spark-test/bin/activate
  
  > pyspark
  >>>
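
Once the '>>>' prompt appears, a common first Spark job is a word count. The sketch below is plain Python, not the workshop's solution: it mirrors the three RDD steps commonly chained for this (flatMap, map, reduceByKey) so the logic can be checked without a Spark session. The function name word_count and the sample lines are hypothetical.

```python
from collections import defaultdict

def word_count(lines):
    """Count words the way a flatMap -> map -> reduceByKey chain would."""
    # flatMap: split every line into words and flatten into one sequence
    words = [w for line in lines for w in line.lower().split()]
    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: add up the counts that share the same word
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

# Hypothetical sample input standing in for a text file
sample = ["to be or not to be", "that is the question"]
counts = word_count(sample)
# Sort by descending count, then alphabetically, for a stable top list
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
print(top)
```

In pyspark the same chain would typically run against an RDD, for example one created with sc.textFile. No file names from the Shakespeare directory are assumed here.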

Email your solutions to exercises 1-5 to Dan along with a description of
how well you did on your own.  Also include your solutions to
homework assignments 1-3 (around the 103-minute mark) if you want to impress him.
Dave's answers are in ~/bigdata_workshop/Shakespeare/shakespeare.py.

== Machine Learning: Recommender System for Spark ==

Video: https://www.youtube.com/watch?v=2rvW13YSNmM&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=4

Slides:

{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/A_Recommender_System.pdf
|width=750
|height=600
}}

Watch the video 'Machine Learning Recommender System With Spark - Big Data Video 4'.

If you want to run the demos and exercises interactively,
request 1 core on an Elf node, then load the modules
and activate your Python virtual environment.

  > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
  
  > module purge
  > module load Spark
  > module load Python
  
  > source ~/.virtualenvs/spark-test/bin/activate

Do the 3 exercises at 1:06 in the video and email Dan your answers and
a summary of how you did on your own.

Demos and exercises can be run on the node you're on using pyspark-submit:

  > pyspark-submit recommender.py

You can also start pyspark and use it interactively:

  > pyspark
  >>>

The recommender.py script can also be run through the scheduler using the job script sb.recommender:

  > sbatch sb.recommender
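
Judging a recommender usually comes down to comparing predicted ratings against known ones. The sketch below is plain Python under stated assumptions: ratings are taken to be (user, item, rating) tuples and predictions are keyed by (user, item). Neither is confirmed to match recommender.py, and the names mean_squared_error, actual and predicted are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference over (user, item) pairs present in both."""
    errors = [
        (rating - predicted[(user, item)]) ** 2
        for user, item, rating in actual
        if (user, item) in predicted
    ]
    if not errors:
        raise ValueError("no overlapping (user, item) pairs to compare")
    return sum(errors) / len(errors)

# Hypothetical held-out ratings: (user id, item id, rating)
actual = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 5.0)]
# Hypothetical model output keyed by (user id, item id)
predicted = {(1, 10): 3.0, (1, 20): 2.0, (2, 10): 3.0}

mse = mean_squared_error(actual, predicted)
print(f"MSE = {mse:.3f}")
```

A lower value means the predictions sit closer to the known ratings, which gives a simple way to compare runs with different model settings.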

== Deep Learning with TensorFlow ==

Video: https://www.youtube.com/watch?v=bC1mzhoabRE&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=5

Slides:

{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Deep_Learning.pdf
|width=750
|height=600
}}

Watch the video 'Tensorflow - Big Data Video 5'.
Beocat has versions of TensorFlow that work on both CPUs and GPUs.
Use <B>module spider TensorFlow</B> to see a list of available versions.

You can do the demos on Beocat if you want.  You may see a warning that the
mnist data will be deprecated in the future.

  > module purge
  > module load TensorFlow
  
  > source ~/.virtualenvs/spark-test/bin/activate
  
  > python
  >>>
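
Because both CPU and GPU builds are available, it can help to check what the loaded module actually provides before starting a demo. The sketch below is an illustration, not part of the workshop material; the function name tensorflow_status is hypothetical, and the GPU query assumes a TensorFlow release that provides tf.config.list_physical_devices (2.1 or later).

```python
import importlib.util

def tensorflow_status():
    """Describe the TensorFlow installation visible to this Python."""
    if importlib.util.find_spec("tensorflow") is None:
        return "TensorFlow not found; load the module first"
    import tensorflow as tf
    try:
        # Assumes TensorFlow 2.1+, where this call lists visible GPUs
        gpus = tf.config.list_physical_devices("GPU")
    except AttributeError:
        # Older releases expose a different API; skip the GPU count here
        return f"TensorFlow {tf.__version__}, GPU count not checked"
    return f"TensorFlow {tf.__version__}, {len(gpus)} GPU(s) visible"

print(tensorflow_status())
```

A count of 0 GPUs is expected on a CPU-only node even when a GPU-capable build is loaded.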

== Bridges ==

Video: https://www.youtube.com/watch?v=gfqlNW-zILo&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=6

Slides:

{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Goodbye.pdf
|width=750
|height=600
}}

Watch the video 'A Big Data Platform - Big Data Video 6'.  This will provide an overview of
the Bridges system at Pittsburgh Supercomputing Center.