Do a search for “kubernetes” on Google and you get back 31 million results. Do a search for “nomad scheduler” and you get back 200,000 results. Despite all the hype around Kubernetes, I often wonder whether folks stop and think about what Kubernetes is at heart. The same goes for HashiCorp’s Nomad.
If you peel away all the layers and the hype, the core problem that Kubernetes and Nomad solve is this: automatically scheduling jobs onto a pool of machines. For the rest of this post, I’m going to refer to schedulers generically, rather than talk about Kubernetes or Nomad (or Google’s Borg, Facebook’s Twine, or Apache Mesos) specifically.
What is a job?
The foundation of a scheduler is the “job”. A job is one or more related tasks that, together, accomplish a unit of work. Let’s use an analogy from the real world: the job of a hair stylist.
We can describe the tasks that make up the job of a hair stylist as:
- cutting a customer’s hair
- coloring a customer’s hair
- styling a customer’s hair
- sweeping up hair remnants after a cut
- cleaning up chemicals and other products after coloring a customer’s hair
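In code, a job along these lines might be modeled as a named group of tasks. This is a minimal Python sketch: the `Job` and `Task` names and their fields are made up for illustration, and are not the data model of Kubernetes, Nomad, or any other scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One unit of work within a job (hypothetical type for illustration)."""
    name: str


@dataclass
class Job:
    """A named group of related tasks that are scheduled together."""
    name: str
    tasks: list[Task] = field(default_factory=list)


# The hair stylist job from the list above, expressed as data.
hair_stylist = Job(
    name="hair-stylist",
    tasks=[
        Task("cut"),
        Task("color"),
        Task("style"),
        Task("sweep-up"),
        Task("clean-up-chemicals"),
    ],
)
```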
What is a scheduler?
Let’s continue our hair stylist analogy to talk about the scheduler itself. A hair salon has a number of stylists; these are the workers. Let’s imagine that our salon is popular, so it has a person who greets customers at the door and assigns them to workers. This greeter is our scheduler.
But how does she assign the work?
She can put the customers in a queue and assign work to stylists on a first-come, first-served basis. This is a simple way to schedule work based on a FIFO (first in, first out) queue. In combination with a FIFO queue, our scheduler has to figure out which workers to assign work to. A simple way to do this would be to use a round-robin selection of workers. Starting with the stylist after the last one she assigned work to, she checks whether that stylist is available. If he is, then she assigns him the work; if not, she moves on to the next stylist. If no stylists are available, then the current work is left in the queue until a worker is available.
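The greeter’s routine can be sketched in a few lines of Python. This is a toy model under my own assumptions: the `RoundRobinScheduler` class, its method names, and the stylist names are all hypothetical, and real schedulers track far more state than a busy/free flag.

```python
from collections import deque


class RoundRobinScheduler:
    """FIFO queue of customers plus round-robin selection of workers."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.busy = set()      # workers currently serving a customer
        self.queue = deque()   # waiting customers, oldest first
        self.last = -1         # index of the last worker given work

    def submit(self, customer):
        self.queue.append(customer)

    def finish(self, worker):
        self.busy.discard(worker)

    def _next_free_worker(self):
        # Start with the worker after the last one assigned, wrapping around.
        n = len(self.workers)
        for step in range(1, n + 1):
            i = (self.last + step) % n
            if self.workers[i] not in self.busy:
                self.last = i
                return self.workers[i]
        return None  # everyone is busy

    def dispatch(self):
        """Assign queued customers in arrival order until workers run out."""
        assignments = []
        while self.queue:
            worker = self._next_free_worker()
            if worker is None:
                break  # the rest of the queue keeps waiting
            self.busy.add(worker)
            assignments.append((self.queue.popleft(), worker))
        return assignments


salon = RoundRobinScheduler(["Ana", "Ben", "Cho"])
for customer in ["c1", "c2", "c3", "c4"]:
    salon.submit(customer)

first = salon.dispatch()   # three stylists, so c4 has to wait
salon.finish("Ben")
second = salon.dispatch()  # c4 goes to Ben, the only free stylist
```

Note that the customer at the head of the queue is only removed once a free stylist has been found, so nobody loses their place in line while the salon is full.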
As we all know, life is not always so clear-cut, and using a FIFO queue and round-robin scheduling may not be robust enough to handle our real-world needs.
What happens if you have a high-priority job? Let’s say a customer is getting married the next day, and it’s critical that her hair look amazing for the wedding. Everyone else in the salon that day is just getting maintenance work done on their hair, so the bride (or groom) can be classified as high priority. What happens in this situation?
One way to handle this scenario is to give each job a priority. For simplicity, let’s say the salon owner has deemed that work for weddings and job interviews is to be given priority and is classified as P0. Everything else is considered “routine” and is given a lower priority.
If the salon is not busy and there is an available worker when our bride walks in for her appointment, the scheduler can simply assign her to an available stylist. If, however, the salon is busy when our bride walks in, the scheduler must evict a lower-priority customer from a stylist to free that stylist for the higher-priority customer.
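Here is a rough Python sketch of priority scheduling with eviction. The `PriorityScheduler` class and its methods are hypothetical, and so is the numeric encoding: I’m assuming that a lower number means higher priority, with `P0 = 0` and an assumed `ROUTINE = 1` for everything else. Real schedulers apply much richer rules when choosing whom to evict.

```python
import heapq
from itertools import count

P0 = 0       # weddings and job interviews
ROUTINE = 1  # assumed value for routine work; lower number = higher priority


class PriorityScheduler:
    def __init__(self, workers):
        # worker -> (priority, customer), or None when the worker is idle
        self.assigned = {w: None for w in workers}
        self.waiting = []        # min-heap of (priority, arrival order, customer)
        self._arrival = count()  # tie-breaker so equal priorities stay FIFO

    def _enqueue(self, priority, customer):
        heapq.heappush(self.waiting, (priority, next(self._arrival), customer))

    def arrive(self, customer, priority):
        """Place a customer; return their worker, or None if they must wait."""
        # 1. Prefer an idle worker.
        for worker, current in self.assigned.items():
            if current is None:
                self.assigned[worker] = (priority, customer)
                return worker
        if not self.assigned:
            self._enqueue(priority, customer)
            return None
        # 2. Otherwise evict the least urgent customer, if strictly less urgent.
        victim = max(self.assigned, key=lambda w: self.assigned[w][0])
        victim_priority, victim_customer = self.assigned[victim]
        if victim_priority > priority:
            self._enqueue(victim_priority, victim_customer)  # back in line
            self.assigned[victim] = (priority, customer)
            return victim
        # 3. Nobody can be evicted: wait in line.
        self._enqueue(priority, customer)
        return None

    def finish(self, worker):
        """Free a worker and hand them the most urgent waiting customer."""
        self.assigned[worker] = None
        if self.waiting:
            priority, _, customer = heapq.heappop(self.waiting)
            self.assigned[worker] = (priority, customer)
            return customer
        return None


salon = PriorityScheduler(["Ana", "Ben"])
salon.arrive("trim-1", ROUTINE)
salon.arrive("trim-2", ROUTINE)
bride_stylist = salon.arrive("bride", P0)       # salon is full: someone is evicted
late_routine = salon.arrive("trim-3", ROUTINE)  # nobody less urgent: must wait
resumed = salon.finish("Ben")                   # evicted customer is served next
```

In this sketch an evicted customer simply rejoins the waiting line, which is the salon equivalent of a scheduler putting a preempted job back into its pending queue.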
I have simplified this analogy, but it represents the essence of scheduling that systems like Kubernetes and Nomad use to schedule jobs.
What are the benefits of using a scheduler?
Let’s say our salon is small and has only four stylists. The stylists get along well and are able to work out among themselves who should attend to the next customer (i.e., one stylist isn’t trying to take all the work). In such a situation, a scheduler might be overkill. Moreover, it would add the cost of a salary for the scheduler without adding much benefit.
Alternatively, let’s say our salon is large: it has twenty stylists working at any given time, and it has a large enough customer base that the stylists are always busy. A scheduler in this situation makes sense. To attend to customers in a timely fashion and handle the high customer traffic, the scheduler can take over the tasks unrelated to styling, which frees up the stylists to focus exclusively on customers’ styling needs. The scheduler takes on the task of assigning work, scheduling future appointments, and collecting payment.
In the world of technology, the same considerations apply. In a small startup, for example, using a scheduler may be overkill, a premature optimization perhaps. Early on, the company may only have a handful of employees and just as few customers. Using a scheduler is not free; it requires care and feeding. And if the engineering staff is spending time dealing with the scheduler, then they aren’t spending time on building the features that will help grow the company’s customer base.
As a startup grows from a few customers to hundreds or even thousands, and its engineering staff grows along with them, the equation changes. Now, a scheduler might make sense. The company might employ a small infrastructure team that builds out and operates the scheduler across a pool of machines. The scheduler helps save the company money, because more than one job can run on a machine, thus reducing the total number of machines needed to operate the business. Another benefit, one that wasn’t discussed in our salon example, is failure recovery. When a job dies unexpectedly, schedulers can attempt to restart the job. For example, if the machine on which a job is running dies (maybe it was accidentally unplugged), the scheduler can restart the job on another machine.
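Both ideas, packing several jobs onto one machine and restarting jobs when a machine dies, fit in a small Python sketch. Everything here is an assumption for illustration: the `Cluster` class, the machine and job names, and the capacity units are hypothetical, and the “first fit” placement rule is just the simplest one I could pick, not a description of how Kubernetes or Nomad place work.

```python
class Cluster:
    """Packs jobs onto machines by free capacity; reschedules after failures."""

    def __init__(self, machines):
        # machines: name -> capacity in arbitrary units (say, CPU cores)
        self.capacity = dict(machines)
        self.placements = {}  # job name -> (machine name, size)

    def free(self, machine):
        used = sum(size for m, size in self.placements.values() if m == machine)
        return self.capacity[machine] - used

    def schedule(self, job, size):
        """First fit: place the job on the first machine with enough room."""
        for machine in self.capacity:
            if self.free(machine) >= size:
                self.placements[job] = (machine, size)
                return machine
        return None  # no room anywhere; a real scheduler would keep it pending

    def machine_failed(self, machine):
        """Drop a dead machine and try to restart its jobs elsewhere."""
        del self.capacity[machine]
        orphans = [(j, s) for j, (m, s) in self.placements.items() if m == machine]
        for job, _ in orphans:
            del self.placements[job]
        return {job: self.schedule(job, size) for job, size in orphans}


cluster = Cluster({"m1": 4, "m2": 4, "m3": 4})
web = cluster.schedule("web", 2)  # shares m1 with "api" below
api = cluster.schedule("api", 2)
db = cluster.schedule("db", 3)    # does not fit on m1, so it lands on m2

moved = cluster.machine_failed("m1")  # someone unplugged m1
```

Three jobs end up on two machines rather than three, and when `m1` disappears its jobs are placed again on whatever machine still has room.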
While talk of schedulers can become mired in marketing hype and industry buzzwords, it’s important to keep in mind the fundamental problem that schedulers like Kubernetes and Nomad solve. They are systems to efficiently assign work to workers. And, in the case of failure, they can help recovery efforts by automatically restarting failed jobs, thus minimizing customer impact.