Sunday, January 27, 2019

Submitting TensorFlow jobs in Kubeflow

Submitting TensorFlow jobs

A Custom Resource Definition (CRD) lets you define custom Kubernetes objects with their own name and schema. This is what we are going to use to submit TensorFlow jobs to our cluster.
Luckily, the Kubeflow Core installation step already created the CRD so we can immediately submit models as ksonnet components by using the generate/apply pair of commands.
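If you want to confirm the CRD is in place before generating anything, you can list the cluster's CRDs (the exact CRD name varies between Kubeflow releases, so the grep below is deliberately loose):
kubectl get crd | grep -i tfjob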
The job we are going to deploy is tf-cnn, a convolutional neural network (CNN) example shipped with Kubeflow (GKE users should replace cdk with gke in the commands that follow):
ks generate tf-cnn kubeflow-test --name=cdk-tf-cnn --namespace=kf-tutorial
ks apply cdk -c kubeflow-test
We can check that a resource of type "tfjob" was indeed submitted into the "kf-tutorial" namespace:
kubectl get tfjobs --namespace=kf-tutorial
This should return the following (the job name will be gke-tf-cnn on GKE):
NAME         AGE
cdk-tf-cnn   1m
You can also find the components of the TensorFlow job in the "Jobs" section of your Kubernetes Dashboard: on GKE these are the Parameter Server and Worker components, and CDK has a Master component in addition to those two.
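The worker pod name is generated at deploy time, so it will differ on your cluster; list the pods in the namespace to find the one to inspect:
kubectl get pods --namespace=kf-tutorial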
Once all pods have been deployed, we can verify the CNN job is running properly by inspecting the logs of the worker pod. The following command shows the output from our CDK deployment:
kubectl logs --namespace=kf-tutorial -f cdk-tf-cnn-worker-rptp-0-wjdph
The end of the log should show us our job:
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| TensorFlow:  1.5
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Model:       resnet50
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Mode:        training
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| SingleSess:  False
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Batch size:  32 global
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| 32 per device
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Devices:     ['/job:worker/task:0/cpu:0']
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Data format: NHWC
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Optimizer:   sgd
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Variables:   parameter_server
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Sync:        True
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| ==========
INFO|2017-12-19T01:12:17|/opt/launcher.py|27| Generating model
INFO|2017-12-19T01:12:21|/opt/launcher.py|27| 2017-12-19 01:12:21.230800: I tensorflow/core/distributed_runtime/master_session.cc:1008] Start master session 8ba56f373a0872fb with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
INFO|2017-12-19T01:12:22|/opt/launcher.py|27| Running warm up
There it is! Congratulations, you have successfully launched Kubeflow on top of either CDK on AWS or GKE (or both!).
You can check its parameters using the ks show command:
ks show cdk -c kubeflow-test
The above will return the following on CDK, and be similar on GKE:
---
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: cdk-tf-cnn
  namespace: kf-tutorial
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - args:
          - python
          - tf_cnn_benchmarks.py
          - --batch_size=32
          - --model=resnet50
          - --variable_update=parameter_server
          - --flush_stdout=true
          - --num_gpus=1
          - --local_parameter_device=cpu
          - --device=cpu
          - --data_format=NHWC
          image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
          name: tensorflow
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: MASTER
  - replicas: 1
    template:
      spec:
        containers:
        - args:
          - python
          - tf_cnn_benchmarks.py
          - --batch_size=32
          - --model=resnet50
          - --variable_update=parameter_server
          - --flush_stdout=true
          - --num_gpus=1
          - --local_parameter_device=cpu
          - --device=cpu
          - --data_format=NHWC
          image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
          name: tensorflow
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: WORKER
  - replicas: 1
    template:
      spec:
        containers:
        - args:
          - python
          - tf_cnn_benchmarks.py
          - --batch_size=32
          - --model=resnet50
          - --variable_update=parameter_server
          - --flush_stdout=true
          - --num_gpus=1
          - --local_parameter_device=cpu
          - --device=cpu
          - --data_format=NHWC
          image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
          name: tensorflow
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: PS
  tfImage: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
As you can see, no GPUs are being used by default (the --device=cpu argument indicates this, and the CPU version of the Docker image is used accordingly). In a follow-up tutorial, we will build on this guide to add GPU-accelerated TensorFlow workers to your cluster and expose them via the CRD interface.
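If you want to tweak the benchmark before re-applying it, ksonnet lets you inspect and override the component's parameters. The parameter names are specific to the tf-cnn prototype, so list them first rather than trusting the example below:
# Show the parameters exposed by the component
ks param list kubeflow-test
# Override one (the parameter name here is illustrative; pick one from the list above)
ks param set kubeflow-test batch_size 64
# Re-apply the component to the environment (use gke instead of cdk on GKE)
ks apply cdk -c kubeflow-test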
In order to clean up the Kubeflow deployment on the cluster, delete the resources you created. On CDK, use ksonnet's delete command:
ks delete cdk -c kubeflow-test
On GKE, the equivalent cleanup is to delete the whole namespace (which removes everything in it):
kubectl delete ns kf-tutorial
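Either way, you can confirm the job is gone by re-running the earlier listing command, which should now return no resources (or, on GKE, report that the namespace no longer exists):
kubectl get tfjobs --namespace=kf-tutorial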
Congratulations! You're ready to rock 'n' roll using Kubeflow on CDK and GKE!

Wednesday, January 02, 2019

Creating your S3 Helm Repo

This post assumes you have a minimal level of working knowledge of Kubernetes...press on!

What is Helm?

Helm is a tool for templating and applying Kubernetes resources into a Kubernetes cluster. It advertises itself as the “npm of k8s”, which is a description I have found thoroughly unhelpful. Instead, read this article — it explains it perfectly.

Why do you want your own Helm Repository?

From this point onwards, I’m going to assume you’re familiar with what a “helm chart” is. If you’re not, read the linked article.
You've got some options when you want to deploy your applications with Helm: either a helm chart per application, or a helm chart for a group of applications. For example, you either have your auth-service-helm-chart or you have your java-applications-helm-chart. What are the pros and cons of each?

A chart per application

Pros: You can implement logic specific to an app or service within your chart.
Cons: If you've got microservices (and chances are in k8s you do), you're going to end up with a lot of disparate charts all over the place: lots of repetition, and difficult to manage at scale. Creating new applications also requires more effort, since you've got to wire up a chart correctly each time.

A chart for many applications

Pros: One chart is easier to manage. Your charts are all in one place (a chart repo).
Cons: You’re going to need to be very careful that specific applications don’t bleed into the shared chart’s logic. Everything needs to be generic.

I don’t like repetition

As such, I opted for the shared chart approach. This created a new problem — where the hell do we host a chart? Enter the helm s3 plugin.

Install the Plugin

To configure a local helm CLI to use this plugin, run the following command:
helm plugin install https://github.com/hypnoglow/helm-s3.git
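You can confirm the plugin registered correctly before going any further:
helm plugin list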

Wire up the S3 Bucket

A Helm repository needs to have an index.yaml at its root. You can use the helm CLI to initialise it, but it's easier just to wire this up with a spot of Terraform: instead of running two commands or depending on the helm CLI being correctly configured, you run one command and rely exclusively on Terraform.
resource "aws_s3_bucket" "helm_central" {
   bucket = "my-helm-central-bucket"
   acl    = "private"
}
resource "aws_s3_bucket_object" "object" {
   bucket = "${aws_s3_bucket.helm_central.bucket}"
   key    = "charts/index.yaml"
   source = "/path/to/my/files/index.yaml"
}
This Terraform requires a file called index.yaml in a local directory. The file I used looks like this:
apiVersion: v1
entries: {}
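With both the Terraform and the index.yaml in place, applying it is the single command referred to above (terraform init is only needed the first time, and your AWS credentials must already be configured):
terraform init
terraform apply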
Note, this will create a key “charts” where your bundled charts will go. Also, your S3 bucket will not be accessible from the internet and you’ll need to regulate access through IAM roles.
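For completeness, the helm-CLI route mentioned above is the plugin's init command, which writes the empty index.yaml into the bucket for you, assuming the bucket already exists and you have access to it:
helm s3 init s3://my-helm-central-bucket/charts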

Let Helm know about your new Bucket

Now you've got a bucket, you need to inform your local Helm CLI that the S3 bucket exists and is a usable Helm repository. To do this, use the s3 plugin:
helm repo add my-charts s3://my-helm-central-bucket/charts
Note: wherever you're running the helm command from will need appropriate IAM access to this S3 bucket, whether that's read, write, or both.
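A quick sanity check that the repository was registered:
helm repo list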

Try it and see what happens

Pull down an existing packaged chart and push it to your new repository.
# This will download the tar.gz from your stable central repository.
helm fetch stable/rabbitmq
# This will push that new tar.gz into your private repository.
helm s3 push rabbitmq-.tgz my-charts
If that is successful, congratulations! You’ve just wired up your very own chart repository.
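To actually consume a chart from your new repository, the flow is the usual Helm one (the syntax below is Helm 2's, and the release name is just an example):
# Refresh the local cache of repository indexes
helm repo update
# Search your private repository for the chart you pushed
helm search my-charts/
# Install the chart from your own repository
helm install my-charts/rabbitmq --name my-rabbit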
