# Deploying Spark on Kubernetes

This post details how to deploy Spark on a Kubernetes cluster.

*Dependencies:*

- Docker v18.06.1-ce
- Minikube v0.29.0
- Spark v2.2.1
- Hadoop v2.7.3

## Minikube

[Minikube](https://kubernetes.io/docs/setup/minikube/) is a tool used to run a single-node Kubernetes cluster locally.

Follow the official [Install Minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/) guide to install it along with a [Hypervisor](https://kubernetes.io/docs/tasks/tools/install-minikube/#install-a-hypervisor) (like [VirtualBox](https://www.virtualbox.org/wiki/Downloads) or [HyperKit](https://github.com/moby/hyperkit)) to manage virtual machines, and [Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) to deploy and manage apps on Kubernetes.
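
With everything installed, it's worth confirming the tools are on your PATH before moving on (the version output will of course differ from machine to machine):

```sh
$ minikube version
$ kubectl version --client
```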

By default, the Minikube VM is configured to use 1GB of memory and 2 CPU cores. This is [not sufficient](https://spark.apache.org/docs/2.3.1/hardware-provisioning.html) for Spark jobs, so be sure to increase the memory in your Docker [client](https://docs.docker.com/docker-for-mac/#advanced) (for HyperKit) or directly in VirtualBox. Then, when you start Minikube, pass the memory and CPU options to it:

```sh
$ minikube start --vm-driver=hyperkit --memory 8192 --cpus 4

# or

$ minikube start --memory 8192 --cpus 4
```
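
Once the VM is up, a quick sanity check confirms that the single-node cluster is reachable (the exact output depends on your versions):

```sh
$ minikube status
$ kubectl get nodes
```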

## Docker

Next, let's build a custom Docker image for Spark [2.2.1](https://spark.apache.org/releases/spark-release-2-2-1.html), designed for Spark [Standalone mode](https://spark.apache.org/docs/latest/spark-standalone.html).

*Dockerfile*:

```dockerfile
# base image
FROM java:openjdk-8-jdk

# define spark and hadoop versions
ENV HADOOP_VERSION 2.7.3
ENV SPARK_VERSION 2.2.1

# download and install hadoop
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
        tar -zx hadoop-${HADOOP_VERSION}/lib/native && \
    ln -s hadoop-${HADOOP_VERSION} hadoop && \
    echo Hadoop ${HADOOP_VERSION} native libraries installed in /opt/hadoop/lib/native

# download and install spark
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx && \
    ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark && \
    echo Spark ${SPARK_VERSION} installed in /opt

# add scripts and update spark default config
ADD common.sh spark-master spark-worker /
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin
```

You can find the above *Dockerfile* along with the Spark config file and scripts in the [spark-kubernetes](foo) repo on GitHub.
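
The `common.sh`, `spark-master`, and `spark-worker` scripts added by the *Dockerfile* live in that repo as well. As a rough sketch only (they are two separate scripts in the repo, and the real versions differ in the details), they run Spark's standalone daemons in the foreground so the containers stay up:

```sh
# /spark-master (sketch): run the standalone master in the foreground,
# exposing the cluster port (7077) and the web UI (8080)
/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master \
    --host $(hostname) --port 7077 --webui-port 8080

# /spark-worker (sketch): register with the master through the `spark-master` Service
/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
    --webui-port 8081 spark://spark-master:7077
```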

Build the image:

```sh
$ eval $(minikube docker-env)
$ docker build -t spark-hadoop:2.2.1 .
```

> If you don't want to spend the time building the image locally, feel free to use my pre-built Spark image from [Docker Hub](https://hub.docker.com/) - `mjhea0/spark-hadoop:2.2.1`.
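
If you go that route, pull it into the Minikube Docker daemon and re-tag it so it matches the `spark-hadoop:2.2.1` image name used in the manifests below (adjust the tag if you build your own image):

```sh
$ eval $(minikube docker-env)
$ docker pull mjhea0/spark-hadoop:2.2.1
$ docker tag mjhea0/spark-hadoop:2.2.1 spark-hadoop:2.2.1
```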
|
View:

```sh
$ docker image ls spark-hadoop

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
spark-hadoop        2.2.1               3ebc80d468bb        3 minutes ago       875MB
```

## Spark Master

*spark-master-deployment.yaml*:

```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: spark-hadoop:2.2.1
          command: ["/spark-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
```

*spark-master-service.yaml*:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - name: webui
      port: 8080
      targetPort: 8080
    - name: spark
      port: 7077
      targetPort: 7077
  selector:
    component: spark-master
```

Create the Spark master Deployment and the Service:

```sh
$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
```

Verify:

```sh
$ kubectl get deployments

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
spark-master   1         1         1            1           11s


$ kubectl get pods

NAME                            READY     STATUS    RESTARTS   AGE
spark-master-698c46ff7d-vxv7r   1/1       Running   0          41s
```
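
You can also tail the master's logs to make sure the standalone master actually came up (the exact log lines will vary, but you should see it start and bind its ports):

```sh
$ kubectl logs deployment/spark-master
```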

## Spark Workers

*spark-worker-deployment.yaml*:

```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: spark-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: spark-hadoop:2.2.1
          command: ["/spark-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m
```

Create the Spark worker Deployment:

```sh
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
```

Verify:

```sh
$ kubectl get deployments

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
spark-master   1         1         1            1           1m
spark-worker   2         2         2            2           3s


$ kubectl get pods

NAME                            READY     STATUS    RESTARTS   AGE
spark-master-698c46ff7d-vxv7r   1/1       Running   0          1m
spark-worker-c49766f54-r5p9t    1/1       Running   0          21s
spark-worker-c49766f54-rh4bc    1/1       Running   0          21s
```
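
To confirm that both workers registered with the master, grep the master's logs (the exact message format may differ slightly across Spark versions):

```sh
$ kubectl logs deployment/spark-master | grep "Registering worker"
```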

## Ingress

Did you notice that we exposed the Spark web UI on port 8080? In order to access it outside the cluster, let's configure an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) object.

*minikube-ingress.yaml*:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: minikube-ingress
  annotations:
spec:
  rules:
  - host: spark-kubernetes
    http:
      paths:
      - path: /
        backend:
          serviceName: spark-master
          servicePort: 8080
```

Enable the Ingress [addon](https://github.com/kubernetes/minikube/tree/master/deploy/addons/ingress):

```sh
$ minikube addons enable ingress
```
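
It can take a minute for the NGINX Ingress controller to come up; you can watch for it in the `kube-system` namespace (the pod name will differ on your machine):

```sh
$ kubectl get pods -n kube-system | grep ingress
```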

Create the Ingress object:

```sh
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
```
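
Verify that the Ingress was created:

```sh
$ kubectl get ingress
```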

Next, you need to update your */etc/hosts* file to route requests from the host we defined, `spark-kubernetes`, to the Minikube instance.

Add an entry to */etc/hosts*:

```sh
$ echo "$(minikube ip) spark-kubernetes" | sudo tee -a /etc/hosts
```

Test it out in the browser at [http://spark-kubernetes/](http://spark-kubernetes/):
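
If you prefer the command line, a plain `curl` against the hostname we just mapped should return the web UI's HTML:

```sh
$ curl http://spark-kubernetes/
```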

TODO: add image

## Test

To test, run the PySpark shell from the master container (substituting the name of your own master pod):

```sh
$ kubectl exec spark-master-698c46ff7d-vxv7r -it pyspark
```

Then run the following code after the PySpark prompt appears:

```python
# the PySpark shell already provides a SparkContext as `sc`
words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
seq = words.split()
data = sc.parallelize(seq)
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
sc.stop()
```

You should see:

```sh
{'brown': 2, 'lazy': 2, 'over': 2, 'fox': 2, 'dog': 2, 'quick': 2, 'the': 4, 'jumps': 2}
```

TODO: add video