Installing Spark on Linux
Here is how to install Apache Spark on a Debian/Ubuntu-based Linux system (the commands below use apt).
Prerequisites
Install Java
First, make sure Java is installed. To install OpenJDK, run:
sudo apt update
sudo apt install openjdk-21-jdk -y
You can check if it was installed correctly by running this command, which should print Java's version:
java -version
Install Python tools
You also need to install the packages for the venv module and pip:
sudo apt install python3-pip python3.13-venv -y
Download & Install Spark
Go to the downloads page and get the latest Spark release: https://spark.apache.org/downloads.html
Unpack Spark and move it into the /opt directory:
tar -xzf spark-<version>-bin-hadoop3.tgz
sudo mv spark-<version>-bin-hadoop3 /opt/spark
Sanity check
To check if Spark works, you can run the spark-shell command:
/opt/spark/bin/spark-shell
It should start a Spark shell where you can write Scala code to use Spark.
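As a quick smoke test inside the shell, you can run a tiny job against the SparkSession that spark-shell creates for you (it is exposed as spark). The following one-liner builds a 1000-row range and counts it, so it should return 1000 (the same line also works later in the PySpark shell):
spark.range(1000).count()
Type :quit (or press Ctrl+D) to leave the shell.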
Configure the PATH variable
When you type spark-shell, your shell has to know where the Spark binaries live. Rather than typing the full path each time, we set SPARK_HOME in your start-up file (~/.bashrc for Bash users, ~/.zshrc for Z shell) and extend PATH so that spark-shell, spark-submit, and the other Spark commands work from anywhere.
cat <<'EOF' >> ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
EOF
source ~/.bashrc
Set up a Python environment
To run PySpark from Jupyter, we first need to create a virtual environment:
python3 -m venv venv
source venv/bin/activate
Inside the virtual environment, install PySpark and JupyterLab:
pip install pyspark==4.0.0 jupyterlab
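The pinned version here is just an example; it is generally safest to keep the pip-installed PySpark at the same version as the Spark release you unpacked into /opt/spark, otherwise the two can disagree. A quick check, run from a Python prompt inside the activated environment:
import pyspark
print(pyspark.__version__)  # compare with the version of the release under /opt/spark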
Then start PySpark with JupyterLab as the driver:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=lab
pyspark
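JupyterLab should open in your browser (or print a URL to the terminal). In a new notebook, a small cell like the following confirms the whole chain works; this is a minimal sketch, and the app name and toy rows are arbitrary placeholders:
from pyspark.sql import SparkSession

# Reuse the session created by the pyspark launcher if one exists,
# otherwise start a new local session.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# Tiny throwaway DataFrame, just to exercise the installation end to end.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()

print(spark.version)  # the Spark version the notebook is actually using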