= Big Data course on Beocat =
 
The Pittsburgh Supercomputing Center hosts 2-day remote Big Data workshops
several times each year.  The information provided here will allow individual
users to go through the videos at their own pace and perform the exercises
on our local Beocat supercomputer.  Each exercise will have data and results
tailored to each individual to allow instructors to measure the progress of
students assigned to take this course interactively.
 
Use the Agenda website below to access the slides, starting with the Welcome slides,
which don't have an associated video.  The '>' sign at the start of lines below
represents the command-line prompt on Beocat, and '>>>' represents the prompt
you'll get when you start pyspark or python.
 
Agenda:  https://www.psc.edu/current-workshop
* [[File: KSU-BigData-ch0-agenda.pdf]]
 
Videos:  https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E
 
== Welcome ==
 
ssh into Beocat from your computer and copy the workshop data to your
home directory.
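A typical connection from a terminal might look like the line below; 'username' is a
placeholder for your own Beocat account, and headnode.beocat.ksu.edu is the usual Beocat
login host (check the Beocat documentation if it has changed).

  > ssh username@headnode.beocat.ksu.edu

Once logged in, copy the workshop data: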
 
  > cp -rp ~daveturner/workshops/bigdata_workshop .
  > cd bigdata_workshop
 
PDF versions of the slides are available for each section
as are directories containing the data for each set of exercises.
You can copy the PDF files to your local computer for viewing or click
on the web link for each section.
 
Follow along with the Welcome slides from the Agenda website link or the PDF file
Big_Data_Welcome.pdf as you listen to the video.  Much of this information is specific to
the Bridges supercomputer at PSC, so just scan over these slides.
 
Welcome slides:
 
{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Big_Data_Welcome.pdf
|width=750
|height=600
}}
 
== Intro to Big Data ==
 
Video:  https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=1
 
Slides:
 
{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/A_Brief_History_of_Big_Data.pdf
|width=750
|height=600
}}
 
Watch the video 'Intro to Big Data - Big Data Video 1' and follow along with the slides.
 
== Hadoop ==
 
Video:  https://www.youtube.com/watch?v=WpxBFQr-ccw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=2
 
Slides:
 
{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Hadoop2020.pdf
|width=750
|height=600
}}
 
Watch the video 'Hadoop - Big Data Video 2'.
We do not have Hadoop on Beocat, so the commands covered there will not work locally.
 
== Intro to Spark and Spark sections combined ==
 
Video:  https://www.youtube.com/watch?v=iONkwqP2fEk&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=3
 
Slides:
 
{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Intro_To_Spark.pdf
|width=750
|height=600
}}
 
Watch the video 'Spark - Big Data Video 3'.
 
The link below shows how to load the Spark and Python modules on Beocat,
set up the Python virtual environment, and run Spark code interactively
or through the Slurm scheduler.
 
https://support.beocat.ksu.edu/BeocatDocs/index.php/Installed_software#Spark
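The steps below assume you have already created a Python virtual environment named
spark-test under ~/.virtualenvs, which is what the activate commands on this page expect.
If you have not set it up yet, a minimal sketch (after loading the Python module) might
look like the following; see the Beocat Spark documentation linked above for the exact
procedure.

  > module load Python
  > mkdir -p ~/.virtualenvs
  > python -m venv ~/.virtualenvs/spark-test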
 
Pause the video around the 43-minute mark and do exercises 1-5.
Try them yourself before the answers are covered in the video.
You can run the demos and exercises interactively by requesting an Elf core,
or you can submit the job through Slurm using a script
(see ~/bigdata_workshop/Shakespeare/sb.shakespeare as an example; a generic sketch of such a script is shown below).
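A Slurm job script is just a shell script with #SBATCH directives followed by the same
module, activate, and run commands used interactively.  The sketch below is only an
illustration and is not the contents of sb.shakespeare: the resource requests are guesses,
shakespeare.py stands in for whatever script you want to run, and the standard Spark
launcher is spark-submit (Beocat may provide a local wrapper such as the pyspark-submit
command used later on this page).

  #!/bin/bash
  #SBATCH --job-name=shakespeare
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --time=1:00:00
  #SBATCH --mem=10G
  #SBATCH --constraint=elves
  
  module purge
  module load Spark
  module load Python
  source ~/.virtualenvs/spark-test/bin/activate
  
  # Run the exercise script on the allocated core
  spark-submit shakespeare.py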
 
Request 1 core on an Elf node for interactive use, then load the modules:
 
  > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
 
  > module purge
  > module load Spark
  > module load Python
 
  > source ~/.virtualenvs/spark-test/bin/activate
 
  > pyspark
  >>>
 
Email your solutions to exercises 1-5 to Dan along with a description of
how well you did on your own.  Also include your solutions to
homework assignments 1-3, found around the 103-minute mark, if you want to impress him.
Dave's answers are in ~/bigdata_workshop/Shakespeare/shakespeare.py.
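If you want to sanity-check your approach before looking at Dave's script, a minimal
word-count sketch for the pyspark shell is shown below.  It relies on the SparkContext
that pyspark provides as sc, and 'shakespeare.txt' is only a placeholder for the text
file in the workshop directory.

  # Paste into pyspark, where the SparkContext is available as sc.
  # "shakespeare.txt" is a placeholder file name.
  lines = sc.textFile("shakespeare.txt")
  counts = (lines.flatMap(lambda line: line.lower().split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
  # Print the ten most frequent words
  print(counts.takeOrdered(10, key=lambda pair: -pair[1]))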
 
== Machine Learning: Recommender System for Spark ==
 
Video: https://www.youtube.com/watch?v=2rvW13YSNmM&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=4
 
Slides:
 
{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/A_Recommender_System.pdf
|width=750
|height=600
}}
 
Watch the video 'Machine Learning Recommender System With Spark - Big Data Video 4'.
 
If you want to run demos and exercises interactively,
request 1 core on an Elf node, then load the modules
and activate your Python virtual environment:
 
  > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
 
  > module purge
  > module load Spark
  > module load Python
 
  > source ~/.virtualenvs/spark-test/bin/activate
 
Do the 3 exercises at the 1:06 mark in the video and email Dan your answers and
a summary of how you did on your own.
 
Demos and exercises can be run on the node you're on using pyspark-submit:
 
  > pyspark-submit recommender.py
 
You can also start pyspark and use it interactively:
 
  > pyspark
  >>>
 
The recommender.py script can also be run through Slurm using the job script sb.recommender:
 
  > sbatch sb.recommender
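For reference only, here is a minimal, self-contained ALS sketch using pyspark.ml.
The tiny ratings table is made up for illustration; this is not the workshop's
recommender.py.  It can be saved to a file and run the same way as recommender.py above.

  # Minimal ALS (alternating least squares) recommender sketch - illustration only.
  from pyspark.sql import SparkSession
  from pyspark.ml.recommendation import ALS
  
  spark = SparkSession.builder.appName("als-sketch").getOrCreate()
  
  # Made-up (user, item, rating) triples
  ratings = spark.createDataFrame(
      [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 0, 5.0), (2, 2, 1.0)],
      ["userId", "movieId", "rating"])
  
  als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
            rank=5, maxIter=5, coldStartStrategy="drop")
  model = als.fit(ratings)
  
  # Top 2 recommendations for every user
  model.recommendForAllUsers(2).show(truncate=False)
  spark.stop()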
 
== Deep Learning with TensorFlow ==
 
Video: https://www.youtube.com/watch?v=bC1mzhoabRE&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=5
 
Slides:
 
{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Deep_Learning.pdf
|width=750
|height=600
}}
 
Watch the video 'Tensorflow - Big Data Video 5'.
Beocat has versions of TensorFlow that work on both CPUs and GPUs.
Use <B>module spider TensorFlow</B> to see a list of available versions.
 
You can do the demos on Beocat if you want.  You may see a warning that the
mnist data loader will be deprecated in the future.
 
  > module purge
  > module load TensorFlow
 
  > source ~/.virtualenvs/spark-test/bin/activate
 
  > python
  >>>
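If you would rather avoid the deprecated mnist loader entirely, a minimal tf.keras
sketch is shown below.  This is not the workshop's demo script: it assumes the
TensorFlow module provides TensorFlow 2.x and that the node can download the MNIST
data the first time it runs.

  # Minimal MNIST classifier sketch using tf.keras (assumes TensorFlow 2.x).
  import tensorflow as tf
  
  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
  x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]
  
  model = tf.keras.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation="relu"),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(10, activation="softmax"),
  ])
  model.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
  
  model.fit(x_train, y_train, epochs=2)
  model.evaluate(x_test, y_test)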
 
== Bridges ==
 
Video: https://www.youtube.com/watch?v=gfqlNW-zILo&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E&index=6
 
Slides:
 
{{#widget:PDF
|url=https://www.psc.edu/images/xsedetraining/BigData/Goodbye.pdf
|width=750
|height=600
}}
 
Watch the video 'A Big Data Platform - Big Data Video 6'.  This provides an overview of
the Bridges system at the Pittsburgh Supercomputing Center.
