Introduction to Spark 3 with Python: Lab Setup Instructions (Linux: Your Environment)
Below are the standard requirements for this course. If you have any questions or issues, please contact us.
Important Note: Student lab files are required on each computer used for the course. The links for these are not in this lab setup, and you should receive them separately.
Other notes:
- It’s a good idea to keep downloaded software install files on the machines during the class in case of problems that require a re-install.
- Cloning a setup is generally not a problem. If it is, we’ll mention it in the software section (for example, much of the IBM/RAD-WAS software can be problematic in this regard).
Linux: Hardware and classroom setup
Each student and the instructor shall have a work environment that fulfills the listed requirements.
- RAM: 8GB recommended
- Disk Space: Free disk space for software installs (5GB is sufficient)
- Operating System: Linux
- We assume that you know how to set up and administer your Linux system.
- We can briefly review a setup once it's done, but we do not have the resources to set up or troubleshoot your Linux installations.
- Note that the setup is relatively standard, with standard software packages.
- Any relatively recent Linux system you are comfortable with should work.
- It must have the required software and equivalent environment setup.
- Again, we do not have the resources to provide setup or troubleshooting support for other Linux variants. We'll do our best to help if you have questions or problems, but we may not have the expertise.
- When installing, consider the following choices.
- Root Password:
- You can use any password you like, as long as whoever needs it (e.g. the instructor or system manager) knows what it is.
- For example, set to password123 if you need something easily accessible
- User Creation:
- Make sure to create a student user that is easily used - tailor this to your environment
- e.g. Create a user student with password of password123
- You can use a different user/password as long as students/instructor are aware of what it is, and can use it where needed.
- Note: For specific environments (e.g. running as a virtual machine under another environment) you may need additional setup.
- We assume that you know what is needed for your environment; there are too many possible environments for us to support them.
- Recommended: Internet access
- It's best to provide internet access to the student machines.
- If this is not feasible for your environment, please contact us to ensure that everything works.
- Required: Adobe Acrobat Reader
- Required: Either the Firefox browser (https://www.mozilla.org/en-US/firefox/new/) or the Chrome browser (https://www.google.com/chrome/).
- Required: An editor for editing lab files (e.g. Java files, or Maven POM files).
- If NOT using an IDE, then this should be as capable as possible for your environment
- For example vim is a more capable editor than nano or vi
- If using an IDE (e.g. Eclipse or IntelliJ), which one you use is not important, as long as it's easily available.
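The RAM and disk figures in the list above can be checked with a short script before any software is installed. The following is a minimal sketch, not part of the official setup: the function names are hypothetical, it assumes a Linux system where os.sysconf exposes SC_PAGE_SIZE and SC_PHYS_PAGES, and it measures free space on the file system holding the current user's home directory. It counts 1 GB as 10^9 bytes, so a machine sold as "8GB" is not flagged because of memory reserved by the kernel.

```python
import os
import shutil

GB = 10 ** 9               # decimal gigabytes (assumption, see lead-in)
MIN_FREE_DISK_GB = 5       # from the requirements list
MIN_RAM_GB = 8             # recommended, not strictly required


def check_resources(free_disk_bytes, ram_bytes):
    """Return a list of warnings; an empty list means both figures are met."""
    warnings = []
    if free_disk_bytes < MIN_FREE_DISK_GB * GB:
        warnings.append(
            f"free disk {free_disk_bytes / GB:.1f} GB is below {MIN_FREE_DISK_GB} GB"
        )
    if ram_bytes < MIN_RAM_GB * GB:
        warnings.append(
            f"RAM {ram_bytes / GB:.1f} GB is below the recommended {MIN_RAM_GB} GB"
        )
    return warnings


def read_system_figures(path="~"):
    """Read free disk space under `path` and total physical RAM (Linux)."""
    free = shutil.disk_usage(os.path.expanduser(path)).free
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return free, ram


if __name__ == "__main__":
    try:
        free, ram = read_system_figures()
    except (OSError, ValueError) as err:
        print("could not read system figures:", err)
    else:
        for line in check_resources(free, ram) or ["resources look sufficient"]:
            print(line)
```

Run it as the student user so that the disk check applies to the location where the lab files will be extracted.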
Lab Files: Each student and instructor must have lab files installed (links to these files are generally sent separately via e-mail).
- Extract the lab files to a location conveniently accessible to the student (generally the student’s home directory - e.g. /home/student)
- Make sure that students/instructor know where they are and can freely access them.
Other instructor requirements for the classroom
- Capability to display presentation slides or code examples.
- For virtual environments: Generally some type of screen sharing capability.
- For physical in-person classes:
- Projector or large screen TV capable of 1280x800 or higher resolution. Instructor must be able to use this to project slides.
- Whiteboard (preferred) or flip charts with markers.
Install Java Development Kit – JDK 11 (11.0.x)
- Note that any JDK 11 version should work fine. Other close (later) Java versions may work, but have not been tested. Please contact us if you have an issue with using Java 11.
- Removing existing Java and installing Java 11:
- Many recent versions of Linux come pre-installed with Java 11. If you already have Java 11 installed, skip this step and go on to "Find Java Install location".
Check whether you have Java 11 by opening a terminal window and typing the following. If the output indicates that Java 11 is installed (some variation of the sample below), skip this step.
$ java -version
openjdk version "11.0.13" 2021-10-19 LTS
Otherwise, uninstall the existing Java installation and install the latest version of Java 11. On our Linux version, we did this as follows.
$ sudo yum -y remove 'java*' #(quoted so the shell passes the pattern to yum unexpanded)
$ sudo yum -y install java-11-openjdk-devel
$ sudo alternatives --config java #(select the Java 11 option, usually option '2', then hit enter to save)
$ sudo alternatives --config javac #(select the Java 11 option, usually option '2', then hit enter to save)
- Continue here whether or not you had to install Java 11.
- Find Java Install location.
- It can be found as follows (with sample output from our system)
$ readlink -f $(which java)
/usr/lib/jvm/java-11-openjdk-11.0.13.0.8-1.el8_4.x86_64/bin/java
- On our installation, it was under /usr/lib/jvm/java-11-openjdk-nnnn (nnnn depends on version).
- Edit/save student user's .bash_profile to set the JAVA_HOME environment variable so that it points to your Java install. For example, in our install, it looked like this.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.13.0.8-1.el8_4.x86_64
- Open a terminal window, and test the install, as follows - with sample (first line only) of expected output
$ java -version
openjdk version "11.0.13" 2021-10-19 LTS
$ javac -version
javac 11.0.13
- If this all works, you are done.
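The Java checks above (resolving the real location of the java binary, deriving JAVA_HOME from it, and reading the version) can also be sketched in Python. This is an illustrative helper under stated assumptions, not part of the lab files: the function names are hypothetical, and it assumes the usual layout in which the binary sits at bin/java under the JDK directory, as in the sample output above.

```python
import os
import re
import shutil
import subprocess


def java_home_from_binary(java_path):
    """Strip the trailing bin/java from a fully resolved java path."""
    return os.path.dirname(os.path.dirname(java_path))


def parse_java_major(version_output):
    """Extract the major version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (e.g. "1.8.0_312").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def main():
    java = shutil.which("java")
    if java is None:
        print("java not found on PATH")
        return
    # Same idea as: readlink -f $(which java)
    resolved = os.path.realpath(java)
    print("Suggested JAVA_HOME:", java_home_from_binary(resolved))
    print("Current JAVA_HOME:  ", os.environ.get("JAVA_HOME", "(not set)"))
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    print("Java major version: ", parse_java_major(result.stderr))


if __name__ == "__main__":
    main()
```

If the suggested and current JAVA_HOME values differ, or the major version is not 11, revisit the alternatives selection and the .bash_profile entry.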
Install Python 3.9
- We've only tested the labs with Python 3.9, so you MUST install this.
- You can keep other Python versions on your machines if you need to.
- On our Linux install, we installed Python 3.9, as follows from a terminal window
$ su
# yum install python39
# alternatives --set python /usr/bin/python3.9
- Open a terminal window, and test the install, as follows - with sample (first line only) of expected output
$ python -V
Python 3.9.2
- If this all works, you are done.
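Because the labs were tested only with Python 3.9, it can help to confirm which interpreter actually runs when a script is started. The sketch below is one way to do that; the file and function names are hypothetical. Save it as, say, check_python.py and run it with the same python command the students will use.

```python
import sys

REQUIRED = (3, 9)  # the only version the labs were tested with


def is_required_python(version_info):
    """True when the major.minor pair matches the tested version."""
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if is_required_python(sys.version_info):
        print(f"Python {found}: OK")
    else:
        print(f"Python {found}: expected {REQUIRED[0]}.{REQUIRED[1]}.x")
```

Only the major and minor numbers are compared, so any 3.9.x patch release passes.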
PySpark Environment Setup
- Edit/save student user's .bash_profile to set the following environment variables
- Important Note: The settings below assume that the lab setup was extracted to $HOME (which should point to the student home directory). If the lab setup was extracted elsewhere, set SPARK_LABS to the matching location for your environment.
- Note: If your Python installation is in a different location, modify the PYSPARK_PYTHON value to match it. These instructions assume that you've installed Python 3.9 according to our installation instructions for our environment, which may not match yours exactly.
export PYSPARK_PYTHON=/usr/bin/python3.9
export SPARK_LABS=$HOME/spark-labs-python
export SPARK_HOME=$SPARK_LABS/spark
export KAFKA_HOME=$SPARK_LABS/kafka
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$KAFKA_HOME/bin
- This will set up the environment appropriately to run the Spark labs.
- Test the setup by opening a terminal window (logged in as the student user) and running pyspark. We illustrate this below, with sample output.
$ pyspark
Python 3.9.2 (default, May 20 2021, 18:04:00)
[GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
... Much logging omitted ...
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.0.1
/_/
Using Python version 3.9.2 (default, May 20 2021 18:04:00)
SparkSession available as 'spark'.
>>>
- If you see the above, then you're all done. If you see errors and don't get to the pyspark prompt ( >>> ), then the setup has a problem. Recheck the environment variables in .bash_profile and the Java and Python installs.
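When pyspark does not start, a common first step is to confirm that the variables from .bash_profile are set and point at real paths. The following is a minimal sketch of such a check, assuming the variable names used in this section. The function name is hypothetical, and the `exists` parameter is there only so the logic can be exercised without touching the file system.

```python
import os

REQUIRED_VARS = ("PYSPARK_PYTHON", "SPARK_LABS", "SPARK_HOME", "KAFKA_HOME")


def find_env_problems(env, exists=os.path.exists):
    """Return human-readable problems found in the given environment mapping."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not exists(value):
            problems.append(f"{name} points to a missing path: {value}")
    # pyspark is found through PATH, so $SPARK_HOME/bin must be listed there.
    path_entries = env.get("PATH", "").split(os.pathsep)
    spark_home = env.get("SPARK_HOME")
    if spark_home and os.path.join(spark_home, "bin") not in path_entries:
        problems.append("PATH does not include $SPARK_HOME/bin")
    return problems


if __name__ == "__main__":
    problems = find_env_problems(os.environ)
    print("\n".join(problems) if problems else "Environment variables look consistent")
```

Run it from a fresh terminal logged in as the student user, so that the values reflect what .bash_profile actually exports.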