| = Big Data course on Beocat =
| | This course is now available here: http://people.beocat.ksu.edu/~dan/education/bigdata/ |
| | |
| The Pittsburgh Supercomputing Center hosts 2-day remote Big Data workshops
| |
| several times each year. The information provided here will allow individual
| |
| users to go through the videos at their own pace and perform the exercises
| |
| on our local Beocat supercomputer. Each exercise will have data and results
| |
| tailored to each individual to allow instructors to measure the progress of
| |
| students assigned to take this course interactively.
| |
| | |
| Use the Agenda website below to access the slides, starting with the Welcome slides,
| |
| which don't have an associated video. The '>' sign at the start of lines below
| |
| represents the command line prompt on Beocat, and '>>>' represents the prompt
| |
| you'll get when you start pyspark or python.
| |
| | |
| Agenda: https://www.psc.edu/hpc-workshop-series/big-data
| |
| | |
| Videos: https://www.youtube.com/watch?v=NpapUmGHXyw&list=PLdkRteUOw2X-YKqommnuGWqNfEEUG6P2E
| |
| | |
| == Welcome ==
| |
| | |
| ssh into Beocat from your computer and copy the workshop data to your
| |
| home directory.
| |
| | |
| > cp -rp ~daveturner/workshops/bigdata_workshop .
| |
| > cd bigdata_workshop
| |
| | |
| PDF versions of the slides are available for each section
| |
| as are directories containing the data for each set of exercises.
| |
| You'll need to copy the PDF files to your local computer for viewing.
| |
| | |
| Go through the Welcome slides from the Agenda website link or PDF file
| |
| Big_Data_Welcome.pdf. Much of this information is specific to
| |
| the Bridges supercomputer at PSC, so just skim these slides.
| |
| | |
| == Intro to Big Data ==
| |
| | |
| [web link is bad]
| |
| Watch the video 'Intro to Big Data - Big Data Video 1'
| |
| (slides are A_Brief_History_of_Big_Data.pdf)
| |
| | |
| == Hadoop ==
| |
| | |
| Watch the video 'Hadoop - Big Data Video 2' (slides are <B>Hadoop2019.pdf</B>)
| |
| Hadoop is not installed on Beocat, so the commands covered in the video will not work locally.
| |
| | |
| == Intro to Spark and Spark sections combined ==
| |
| | |
| The link below shows how to load the Spark and Python modules on Beocat,
| |
| set up the Python virtual environment, and run Spark code interactively
| |
| or through the Slurm scheduler.
| |
| | |
| https://support.beocat.ksu.edu/BeocatDocs/index.php/Installed_software#Spark
| |
| | |
| Watch the video 'Spark - Big Data Video 3' (slides are <B>Intro_To_Spark.pdf</B>)
| |
| | |
| Pause the video around the 43-minute mark and do exercises 1-5.
| |
| Try these yourself before they cover the answers.
| |
| You can do the demos and exercises interactively by requesting an Elf core,
| |
| or you can submit the job using a script
| |
| (see ~/bigdata_workshop/Shakespeare/sb.shakespeare as an example).
| |
| | |
| Request 1 core on an Elf node for interactive use, then load the modules:
| |
| | |
| > srun -J srun -N 1 -n 1 -t 24:00:00 --mem=10G -C elves --pty bash
| |
|
| |
| > module purge
| |
| > module load Spark
| |
| > module load Python
| |
|
| |
| > source ~/.virtualenvs/spark-test/bin/activate
| |
|
| |
| > pyspark
| |
| >>>
| |
| | |
| Email your solutions to exercises 1-5 to Dan along with a description of
| |
| how well you did on your own. Also include your solutions to
| |
| homework assignments 1-3 (around the 103-minute mark) if you want to impress him.
| |
| Dave's answers are in ~/bigdata_workshop/Shakespeare/shakespeare.py.
| |