# Deploying Spark on Kubernetes

This post details how to deploy Spark on a Kubernetes cluster.

*Dependencies:*

- Docker v18.06.1-ce
- Minikube v0.29.0
- Spark v2.2.1
- Hadoop v2.7.3

## Minikube

[Minikube](https://kubernetes.io/docs/setup/minikube/) is a tool used to run a single-node Kubernetes cluster locally.

Follow the official [Install Minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/) guide to install it along with a [Hypervisor](https://kubernetes.io/docs/tasks/tools/install-minikube/#install-a-hypervisor) (like [VirtualBox](https://www.virtualbox.org/wiki/Downloads) or [HyperKit](https://github.com/moby/hyperkit)) to manage virtual machines, and [Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) to deploy and manage apps on Kubernetes.

By default, the Minikube VM is configured to use 1GB of memory and 2 CPU cores. This is [not sufficient](https://spark.apache.org/docs/2.3.1/hardware-provisioning.html) for Spark jobs, so be sure to increase the memory in your Docker [client](https://docs.docker.com/docker-for-mac/#advanced) (for HyperKit) or directly in VirtualBox. Then, when you start Minikube, pass the memory and CPU options to it:

```sh
$ minikube start --vm-driver=hyperkit --memory 8192 --cpus 4

or

$ minikube start --memory 8192 --cpus 4
```
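
Once the cluster is up, confirm that `kubectl` is pointed at it before moving on:

```sh
$ minikube status
$ kubectl cluster-info
```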

## Docker

Next, let's build a custom Docker image for Spark [2.2.1](https://spark.apache.org/releases/spark-release-2-2-1.html), designed for Spark [Standalone mode](https://spark.apache.org/docs/latest/spark-standalone.html).

*Dockerfile*:

```dockerfile
# base image
FROM java:openjdk-8-jdk

# define spark and hadoop versions
ENV HADOOP_VERSION 2.7.3
ENV SPARK_VERSION 2.2.1

# download and install hadoop
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
        tar -zx hadoop-${HADOOP_VERSION}/lib/native && \
    ln -s hadoop-${HADOOP_VERSION} hadoop && \
    echo Hadoop ${HADOOP_VERSION} native libraries installed in /opt/hadoop/lib/native

# download and install spark
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx && \
    ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark && \
    echo Spark ${SPARK_VERSION} installed in /opt

# add scripts and update spark default config
ADD common.sh spark-master spark-worker /
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin
```

You can find the above *Dockerfile* along with the Spark config file and scripts in the [spark-kubernetes](foo) repo on GitHub.

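The `spark-master` and `spark-worker` scripts referenced by the `ADD` instruction live in that repo as well. As a rough sketch (not the exact script from the repo), the master script only needs to source the shared helpers and run Spark's standalone `Master` class in the foreground so the container stays alive:

```sh
#!/bin/bash
# sketch of a possible /spark-master entrypoint

. /common.sh

# run the standalone master in the foreground on the expected ports
/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master \
    --host $(hostname) \
    --port 7077 \
    --webui-port 8080
```
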
Build the image. Running `eval $(minikube docker-env)` first points your local Docker client at the Docker daemon inside the Minikube VM, so the image is built where the cluster can actually use it:

```sh
$ eval $(minikube docker-env)
$ docker build -t spark-hadoop:2.2.1 .
```

> If you don't want to spend the time building the image locally, feel free to use my pre-built Spark image from [Docker Hub](https://hub.docker.com/) - `mjhea0/spark-hadoop:2.2.1`.

View:

```sh
$ docker image ls spark-hadoop

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
spark-hadoop        2.2.1               3ebc80d468bb        3 minutes ago       875MB
```

## Spark Master

*spark-master-deployment.yaml*:

```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: spark-hadoop:2.2.1
          command: ["/spark-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
```

*spark-master-service.yaml*:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - name: webui
      port: 8080
      targetPort: 8080
    - name: spark
      port: 7077
      targetPort: 7077
  selector:
    component: spark-master
```

Create the Spark master Deployment and start the Service:

```sh
$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
```

Verify:

```sh
$ kubectl get deployments

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
spark-master   1         1         1            1           11s


$ kubectl get pods

NAME                             READY     STATUS    RESTARTS   AGE
spark-master-698c46ff7d-vxv7r    1/1       Running   0          41s
```
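
You can also tail the master's logs to confirm the standalone master came up cleanly; substitute your own pod name. You should see a line indicating that the master is listening on port 7077 and that the web UI started on port 8080:

```sh
$ kubectl logs spark-master-698c46ff7d-vxv7r
```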

## Spark Workers

*spark-worker-deployment.yaml*:

```yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: spark-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: spark-hadoop:2.2.1
          command: ["/spark-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m
```

Create the Spark worker Deployment:

```sh
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
```

Verify:

```sh
$ kubectl get deployments
NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
spark-master   1         1         1            1           1m
spark-worker   2         2         2            2           3s


$ kubectl get pods

NAME                             READY     STATUS    RESTARTS   AGE
spark-master-698c46ff7d-vxv7r    1/1       Running   0          1m
spark-worker-c49766f54-r5p9t     1/1       Running   0          21s
spark-worker-c49766f54-rh4bc     1/1       Running   0          21s
```
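
The workers reach the master through the Service's DNS name, `spark-master`, on port 7077. To confirm they actually registered, grep the master's logs (again, substitute your pod name); each worker should show up in a "Registering worker" line:

```sh
$ kubectl logs spark-master-698c46ff7d-vxv7r | grep -i "registering worker"
```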

## Ingress

Did you notice that we exposed the Spark web UI on port 8080? In order to access it outside the cluster, let's configure an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) object.

*minikube-ingress.yaml*:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: minikube-ingress
  annotations:
spec:
  rules:
    - host: spark-kubernetes
      http:
        paths:
          - path: /
            backend:
              serviceName: spark-master
              servicePort: 8080
```

Enable the Ingress [addon](https://github.com/kubernetes/minikube/tree/master/deploy/addons/ingress):

```sh
$ minikube addons enable ingress
```
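
The addon runs an NGINX Ingress controller in the `kube-system` namespace; give it a minute to start and check that its pod is running:

```sh
$ kubectl get pods -n kube-system | grep ingress
```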

Create the Ingress object:

```sh
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
```
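
Verify that it was created:

```sh
$ kubectl get ingress
```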

Next, you need to update your */etc/hosts* file to route requests from the host we defined, `spark-kubernetes`, to the Minikube instance.

Add an entry to */etc/hosts*:

```sh
$ echo "$(minikube ip) spark-kubernetes" | sudo tee -a /etc/hosts
```
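
Before switching to the browser, you can sanity-check the route from the command line; once the Ingress controller is ready, this should return the HTML of the Spark master web UI:

```sh
$ curl http://spark-kubernetes/
```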

Test it out in the browser at [http://spark-kubernetes/](http://spark-kubernetes/):

TODO: add image

## Test

To test, run the PySpark shell from the master container (substitute the name of your master pod):

```sh
$ kubectl exec spark-master-698c46ff7d-r4tq5 -it pyspark
```

Then run the following code after the PySpark prompt appears:

```python
words = 'the quick brown fox jumps over the\
        lazy dog the quick brown fox jumps over the lazy dog'
# the pyspark shell already provides a SparkContext as `sc`
seq = words.split()
data = sc.parallelize(seq)
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
sc.stop()
```

You should see:

```sh
{'brown': 2, 'lazy': 2, 'over': 2, 'fox': 2, 'dog': 2, 'quick': 2, 'the': 4, 'jumps': 2}
```
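
To go a step further, you can submit one of the bundled examples to the standalone cluster from the master container. The pod name and the examples jar path below are assumptions (the jar ships with the Spark 2.2.1 / Hadoop 2.7 distribution unpacked to `/opt/spark`), so adjust them to match your setup:

```sh
$ kubectl exec spark-master-698c46ff7d-vxv7r -it -- \
    spark-submit \
    --master spark://spark-master:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_2.11-2.2.1.jar 10
```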

TODO: add video