
Commit 42e85ea

init
0 parents  commit 42e85ea

16 files changed: +503 −0 lines changed

.gitignore

Whitespace-only changes.

README.md

Lines changed: 283 additions & 0 deletions
# Deploying Spark on Kubernetes

This post details how to deploy Spark on a Kubernetes cluster.

*Dependencies:*

- Docker v18.06.1-ce
- Minikube v0.29.0
- Spark v2.2.1
- Hadoop v2.7.3

## Minikube

[Minikube](https://kubernetes.io/docs/setup/minikube/) is a tool used to run a single-node Kubernetes cluster locally.

Follow the official [Install Minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/) guide to install it along with a [Hypervisor](https://kubernetes.io/docs/tasks/tools/install-minikube/#install-a-hypervisor) (like [VirtualBox](https://www.virtualbox.org/wiki/Downloads) or [HyperKit](https://github.com/moby/hyperkit)), to manage virtual machines, and [Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), to deploy and manage apps on Kubernetes.

By default, the Minikube VM is configured to use 1GB of memory and 2 CPU cores. This is [not sufficient](https://spark.apache.org/docs/2.3.1/hardware-provisioning.html) for Spark jobs, so be sure to increase the memory in your Docker [client](https://docs.docker.com/docker-for-mac/#advanced) (for HyperKit) or directly in VirtualBox. Then, when you start Minikube, pass the memory and CPU options to it:

```sh
$ minikube start --vm-driver=hyperkit --memory 8192 --cpus 4
```

or, with the default VirtualBox driver:

```sh
$ minikube start --memory 8192 --cpus 4
```
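Before moving on, it's worth confirming that the cluster is actually up and that `kubectl` is talking to it. A quick sanity check (the exact output varies by version):

```sh
$ minikube status
$ kubectl cluster-info
$ kubectl get nodes
```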
## Docker

Next, let's build a custom Docker image for Spark [2.2.1](https://spark.apache.org/releases/spark-release-2-2-1.html), designed for Spark [Standalone mode](https://spark.apache.org/docs/latest/spark-standalone.html).

*Dockerfile*:

```dockerfile
# base image
FROM java:openjdk-8-jdk

# define spark and hadoop versions
ENV HADOOP_VERSION 2.7.3
ENV SPARK_VERSION 2.2.1

# download and install hadoop
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
        tar -zx hadoop-${HADOOP_VERSION}/lib/native && \
    ln -s hadoop-${HADOOP_VERSION} hadoop && \
    echo Hadoop ${HADOOP_VERSION} native libraries installed in /opt/hadoop/lib/native

# download and install spark
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx && \
    ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark && \
    echo Spark ${SPARK_VERSION} installed in /opt

# add scripts and update spark default config
ADD common.sh spark-master spark-worker /
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin
```
You can find the above *Dockerfile* along with the Spark config file and scripts in the [spark-kubernetes](foo) repo on GitHub.

Build the image. The `eval $(minikube docker-env)` line points your local Docker client at the Docker daemon inside the Minikube VM, so the image is built where the cluster can actually find it, without pushing to a registry:

```sh
$ eval $(minikube docker-env)
$ docker build -t spark-hadoop:2.2.1 .
```

> If you don't want to spend the time building the image locally, feel free to use my pre-built Spark image from [Docker Hub](https://hub.docker.com/) - `mjhea0/spark-hadoop:2.2.1`.

View:

```sh
$ docker image ls spark-hadoop

REPOSITORY     TAG      IMAGE ID       CREATED          SIZE
spark-hadoop   2.2.1    3ebc80d468bb   3 minutes ago    875MB
```
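Optionally, give the freshly built image a quick smoke test before deploying anything, for example by asking Spark for its version from a throwaway container (a minimal check; any Spark command on the image's `PATH` would do):

```sh
$ docker run --rm spark-hadoop:2.2.1 spark-submit --version
```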
## Spark Master

*spark-master-deployment.yaml*:

```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: spark-hadoop:2.2.1
          command: ["/spark-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
```
*spark-master-service.yaml*:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - name: webui
      port: 8080
      targetPort: 8080
    - name: spark
      port: 7077
      targetPort: 7077
  selector:
    component: spark-master
```
Create the Spark master Deployment and start the Service:

```sh
$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
```

Verify:

```sh
$ kubectl get deployments

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
spark-master   1         1         1            1           11s


$ kubectl get pods

NAME                            READY     STATUS    RESTARTS   AGE
spark-master-698c46ff7d-vxv7r   1/1       Running   0          41s
```
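If something looks off, the master's logs are the first place to check; the standalone master should report the URL it's listening on. A quick sketch (`deployment/spark-master` just picks one of the Deployment's pods):

```sh
$ kubectl logs deployment/spark-master
# look for a line announcing the master URL, e.g. spark://spark-master:7077
```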
## Spark Workers

*spark-worker-deployment.yaml*:

```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: spark-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: spark-hadoop:2.2.1
          command: ["/spark-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m
```
Create the Spark worker Deployment:

```sh
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
```

Verify:

```sh
$ kubectl get deployments

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
spark-master   1         1         1            1           1m
spark-worker   2         2         2            2           3s


$ kubectl get pods

NAME                            READY     STATUS    RESTARTS   AGE
spark-master-698c46ff7d-vxv7r   1/1       Running   0          1m
spark-worker-c49766f54-r5p9t    1/1       Running   0          21s
spark-worker-c49766f54-rh4bc    1/1       Running   0          21s
```
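Each worker should also have registered itself with the master over port 7077. One way to confirm this is to search the master's logs for the standalone master's registration message (a sketch; the worker IPs and resources will differ on your cluster):

```sh
$ kubectl logs deployment/spark-master | grep "Registering worker"
```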
## Ingress

Did you notice that we exposed the Spark web UI on port 8080? In order to access it outside the cluster, let's configure an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) object.

*minikube-ingress.yaml*:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: minikube-ingress
  annotations:
spec:
  rules:
    - host: spark-kubernetes
      http:
        paths:
          - path: /
            backend:
              serviceName: spark-master
              servicePort: 8080
```
Enable the Ingress [addon](https://github.com/kubernetes/minikube/tree/master/deploy/addons/ingress):

```sh
$ minikube addons enable ingress
```

Create the Ingress object:

```sh
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
```

Next, you need to update your */etc/hosts* file to route requests from the host we defined, `spark-kubernetes`, to the Minikube instance.

Add an entry to */etc/hosts*:

```sh
$ echo "$(minikube ip) spark-kubernetes" | sudo tee -a /etc/hosts
```

Test it out in the browser at [http://spark-kubernetes/](http://spark-kubernetes/):

TODO: add image
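You can also check the Ingress from the command line before opening the browser. A minimal sketch that just asks for the HTTP status of the Spark web UI through the Ingress (assuming the */etc/hosts* entry above is in place, it should print `200`):

```sh
$ curl -s -o /dev/null -w "%{http_code}\n" http://spark-kubernetes/
200
```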
## Test

To test, run the PySpark shell from the master container (substitute your own master pod name from `kubectl get pods`):

```sh
$ kubectl exec spark-master-698c46ff7d-vxv7r -it pyspark
```

Then run the following code after the PySpark prompt appears. Note that the shell already creates a SparkContext for you and exposes it as `sc`, so there's no need to create one:

```python
words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
seq = words.split()
data = sc.parallelize(seq)
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
```

You should see:

```sh
{'brown': 2, 'lazy': 2, 'over': 2, 'fox': 2, 'dog': 2, 'quick': 2, 'the': 4, 'jumps': 2}
```

While the shell is open, the application should also show up under "Running Applications" in the Spark web UI at [http://spark-kubernetes/](http://spark-kubernetes/).
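Beyond the interactive shell, the same cluster can also run batch jobs via `spark-submit`. A rough sketch, assuming you've saved a standalone version of the word count above as `wordcount.py` (a hypothetical file; unlike the shell, a script has to create its own SparkContext) and again using your own master pod name:

```sh
# copy the hypothetical script into the master pod, then submit it to the standalone master
$ kubectl cp wordcount.py spark-master-698c46ff7d-vxv7r:/tmp/wordcount.py
$ kubectl exec spark-master-698c46ff7d-vxv7r -it -- \
    spark-submit --master spark://spark-master:7077 /tmp/wordcount.py
```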
TODO: add video

create.sh

Lines changed: 6 additions & 0 deletions
#!/bin/bash

kubectl create -f ./kubernetes/spark-master-deployment.yaml
kubectl create -f ./kubernetes/spark-master-service.yaml
kubectl create -f ./kubernetes/spark-worker-deployment.yaml
kubectl apply -f ./kubernetes/minikube-ingress.yaml

delete.sh

Lines changed: 6 additions & 0 deletions
#!/bin/bash

kubectl delete -f ./kubernetes/spark-master-deployment.yaml
kubectl delete -f ./kubernetes/spark-master-service.yaml
kubectl delete -f ./kubernetes/spark-worker-deployment.yaml
kubectl delete -f ./kubernetes/minikube-ingress.yaml

docker/Dockerfile

Lines changed: 27 additions & 0 deletions
# base image
FROM java:openjdk-8-jdk

# define spark and hadoop versions
ENV HADOOP_VERSION 2.7.3
ENV SPARK_VERSION 2.2.1

# download and install hadoop
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
        tar -zx hadoop-${HADOOP_VERSION}/lib/native && \
    ln -s hadoop-${HADOOP_VERSION} hadoop && \
    echo Hadoop ${HADOOP_VERSION} native libraries installed in /opt/hadoop/lib/native

# download and install spark
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx && \
    ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark && \
    echo Spark ${SPARK_VERSION} installed in /opt

# add scripts and update spark default config
ADD common.sh spark-master spark-worker /
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin

docker/common.sh

Lines changed: 4 additions & 0 deletions
#!/bin/bash

# unset the service-link variable that Kubernetes injects for the spark-master Service;
# it conflicts with Spark's own SPARK_MASTER_PORT setting
unset SPARK_MASTER_PORT

docker/spark-defaults.conf

Lines changed: 3 additions & 0 deletions
spark.master spark://spark-master:7077
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark

docker/spark-master

Lines changed: 7 additions & 0 deletions
#!/bin/bash

. /common.sh

# map the spark-master hostname to this pod's IP so the master can bind to it
echo "$(hostname -i) spark-master" >> /etc/hosts

/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080

docker/spark-worker

Lines changed: 16 additions & 0 deletions
#!/bin/bash

. /common.sh

# if the spark-master Service DNS entry isn't resolvable yet, exit;
# Kubernetes restarts the pod and we try again
if ! getent hosts spark-master; then
  sleep 5
  exit 0
fi

/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077 --webui-port 8081
