Installing Spark on Linux

Here is how to install Apache Spark on Linux. The commands below use apt, so they assume a Debian- or Ubuntu-based distribution.

Prerequisites

Install Java

Spark runs on the JVM, so first make sure Java is installed:

sudo apt update

sudo apt install openjdk-21-jdk -y

You can check if it was installed correctly by running this command, which should print Java's version:

java -version  


Install Python tools

You also need the packages that provide the venv module and pip:

sudo apt install python3-pip python3.13-venv -y
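As a quick check, both tools should now report their versions:

python3 --version

python3 -m pip --version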




Download & Install Spark

Go to this link, and download the latest Spark release: https://spark.apache.org/downloads.html


Unpack the archive and move it into the /opt directory:

tar -xzf spark-<version>-bin-hadoop3.tgz

sudo mv spark-<version>-bin-hadoop3 /opt/spark
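
If you prefer to stay in the terminal, a direct download along these lines should also work. The mirror URL is an assumption; copy the exact link shown on the downloads page if it differs (older releases move to the archive mirror):

# substitute <version> with the release you picked on the downloads page
wget https://dlcdn.apache.org/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz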


Sanity check

To check if Spark works, you can run the spark-shell command:

/opt/spark/bin/spark-shell

It starts an interactive Scala shell with a SparkSession already created for you, where you can run Spark code.
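For a slightly more end-to-end check, you can also run one of the example programs that ship with Spark; SparkPi computes an approximation of pi and then exits:

/opt/spark/bin/run-example SparkPi 10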


Configure the PATH variable

When you type spark-shell, your shell needs to know where the Spark binaries live. Rather than typing full paths each time, add SPARK_HOME to your shell start-up file (~/.bashrc for Bash, ~/.zshrc for Zsh) and extend PATH so that subcommands such as spark-submit work from anywhere. For Bash:

cat <<'EOF' >> ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
EOF

source ~/.bashrc
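
After reloading, the Spark commands should resolve without their full paths; for example, this should print the Spark version:

spark-submit --version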




Set up a Python environment

To use PySpark from JupyterLab, first create a virtual environment:

python3 -m venv venv

source venv/bin/activate

Inside the virtual environment, install PySpark and JupyterLab (pin pyspark to the same version as the Spark release you downloaded):

pip install pyspark==4.0.0 jupyterlab


Then configure PySpark to use JupyterLab as its driver front end and launch it:

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS=lab

pyspark
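
This opens JupyterLab in your browser. A minimal notebook cell to confirm everything is wired up might look like the sketch below; the app name, rows, and column names are just an illustration:

from pyspark.sql import SparkSession

# `spark` is normally created for you by pyspark's startup script;
# getOrCreate() reuses it if it exists, or builds a new session otherwise
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# example rows and column names, purely illustrative
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.show()
print(spark.version)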