April 28, 2026 · Royalty · 10 min read

Understanding How System Orchestration Actually Works (By Building One): Part 1

In this part of the series, we build a Task Orchestrator.

system design · infra · automation

The First Phase

I did a little research on system orchestration. By definition, orchestration is an architectural pattern that controls the flow of data across multiple components in a system. While "automation" handles a single task (like a script installing an app), "orchestration" manages the entire lifecycle and interaction of multiple moving parts.

What is System Orchestration?

Think of it like a conductor in an orchestra. The individual musicians (scripts/automation) know how to play their instruments, but the conductor ensures they all start at the right time, play at the right volume and stop if something goes wrong.

What orchestration involves depends on the context of use. In the context of a PaaS like Heroku, it involves:

  • Provisioning: automatically spinning up containers for user code.
  • Configuration: managing environments and secrets.
  • Healing: restarting a process if it crashes.
  • Networking: routing custom domains to the correct internal container.

In this first phase, we are going to build a Task Orchestrator. This is the most basic form of orchestration: taking a set of instructions and executing them in a specific order.

The Goal

Our goal for Part 1 is to create a program that can:

  1. Accept a list of tasks (e.g. via configuration files like YAML or JSON).
  2. Execute these tasks sequentially.
  3. Capture and log the output of each step.

Why start here?

Before we can get into complex distributed systems, we need to understand how to manage a single process. If we can't reliably run a build script on one machine, we certainly can't manage hundreds of them across a cluster. In the context of a PaaS (Platform as a Service), the task orchestrator is the "brain" that decides what runs, where it should live, and how to ensure it stays alive.

You can find the source code for this part [here](https://github.com/coderoyalty/orchestration-system/tree/mini-ci-engine).

First, we'd have to define a task model.

type Task = {
  id: string; // unique identifier for the task
  command: string; // command to execute
  retries?: number; // number of times to retry the command if it fails (optional)
  dependsOn?: string[]; // ids of other tasks this task depends on
};

Core Components of a Task Orchestrator

The core components include:

  1. The State Store: We need a place to store the "desired state" (e.g. "execute this command, which depends on these other commands; retry twice if the command fails"). This could be a Redis server (recommended) or a Postgres table, but for our use case we'll use a simple in-memory store.

  2. The Scheduler: This component looks at the tasks in the store, then decides which task to handle next.

  3. The Executor: It takes instructions from the Scheduler and interacts with the OS (or a container API like Docker, if the task involves building a container) to execute the task.

The State Store

For the state store, we'll use a Map to hold the state. We'll represent a task's state as TaskState, which is simply the metadata of a task.

type TaskStatus = "pending" | "failed" | "running" | "success";

The TaskStatus type represents every status of the command execution. A pending status means the command is yet to be executed. When retries > 0, pending could also mean the command was executed earlier but failed; to capture this, we'll add an attempts field to TaskState, which lets us track how many times a command has been executed.

type TaskState = {
  status: TaskStatus;
  attempts: number;
  error?: Error; // the error encountered when a Task fails.
};

Here's the implementation of the store:
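A minimal in-memory sketch, built around the methods the other components rely on (init, get, set, and all). One design choice to note: set merges a partial update into the existing state, so a later call like set(id, { status: "success" }) doesn't erase the attempts counter, which the retry logic depends on.

```typescript
type TaskStatus = "pending" | "failed" | "running" | "success";

type TaskState = {
  status: TaskStatus;
  attempts: number;
  error?: Error;
};

type Task = {
  id: string;
  command: string;
  retries?: number;
  dependsOn?: string[];
};

export class StateStore {
  private states = new Map<string, TaskState>();

  // Seed every task as "pending" with zero attempts.
  init(tasks: Task[]) {
    for (const task of tasks) {
      this.states.set(task.id, { status: "pending", attempts: 0 });
    }
  }

  get(id: string): TaskState {
    const state = this.states.get(id);
    if (!state) throw new Error(`Unknown task: ${id}`);
    return state;
  }

  // Merge a partial update so that set(id, { status: "success" })
  // keeps the existing attempts counter instead of wiping it.
  set(id: string, patch: Partial<TaskState>) {
    this.states.set(id, { ...this.get(id), ...patch });
  }

  all(): Map<string, TaskState> {
    return this.states;
  }
}
```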

Next, we'll define the Executor.

The Executor

import { exec } from "node:child_process";

export class Executor {
  run(task: Task): Promise<void> {
    return new Promise((resolve, reject) => {
      // named "child" to avoid shadowing the global process object
      const child = exec(task.command);

      child.stdout?.on("data", (data) => {
        console.log(`[${task.id}] ${data}`);
      });

      child.stderr?.on("data", (data) => {
        console.error(`[${task.id}] ${data}`);
      });

      child.on("exit", (code) => {
        if (code === 0) resolve();
        else reject(new Error(`Exit code ${code}`));
      });
    });
  }
}

The executor is simply an abstraction over Node.js's exec function, making it easy for other components to run commands without making the low-level calls directly.
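As a quick sanity check, here's a self-contained sketch of the executor in action. It assumes the system shell understands exit (true for both sh and cmd.exe), and the ok/boom task ids are just made up for the demo:

```typescript
import { exec } from "node:child_process";

type Task = {
  id: string;
  command: string;
  retries?: number;
  dependsOn?: string[];
};

class Executor {
  run(task: Task): Promise<void> {
    return new Promise((resolve, reject) => {
      const child = exec(task.command);
      child.stdout?.on("data", (data) => console.log(`[${task.id}] ${data}`));
      child.stderr?.on("data", (data) => console.error(`[${task.id}] ${data}`));
      child.on("exit", (code) => {
        if (code === 0) resolve();
        else reject(new Error(`Exit code ${code}`));
      });
    });
  }
}

async function demo() {
  const executor = new Executor();

  // exit code 0 -> the promise resolves
  await executor.run({ id: "ok", command: "exit 0" });

  // non-zero exit code -> the promise rejects
  try {
    await executor.run({ id: "boom", command: "exit 3" });
  } catch (err) {
    console.error((err as Error).message); // "Exit code 3"
  }
}

demo();
```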

The Scheduler

This component of the system is responsible for providing the next scheduled task(s) to execute. Its responsibility involves maintaining the tasks and their states within the system. It interacts with the state store, as it needs to reference the state of each task. For this project, we need the scheduler to pass the runnable tasks to the Orchestrator.

To fetch the available executable tasks, we loop through the tasks; if a task is not pending (meaning it's probably running, already executed successfully, or failed at execution), we skip it. Another thing we have to check is the task's dependencies. A task can depend on other tasks, meaning it has to wait for the successful execution of the task(s) it depends on. If all of its dependencies ran successfully, we can call it a runnable task. Below is how this is expressed in code.

export class Scheduler {
  constructor(
    private tasks: Task[],
    private state: StateStore,
  ) {}

  getRunnableTasks(): Task[] {
    const runnable: Task[] = [];

    for (const task of this.tasks) {
      const taskState = this.state.get(task.id);
      if (taskState.status !== "pending") continue;

      const deps = task.dependsOn || [];

      const allDepsDone = deps.every((depId) => {
        const depState = this.state.get(depId);

        return depState.status === "success";
      });

      if (allDepsDone) {
        runnable.push(task);
      }
    }

    return runnable;
  }
}
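To see the filtering in action, here's a small self-contained sketch (the lint/build task ids and the trimmed-down StateStore are hypothetical stand-ins): build depends on lint, so it only becomes runnable once lint is marked successful.

```typescript
type TaskStatus = "pending" | "failed" | "running" | "success";
type TaskState = { status: TaskStatus; attempts: number; error?: Error };
type Task = { id: string; command: string; retries?: number; dependsOn?: string[] };

// Minimal stand-in for the StateStore the Scheduler talks to.
class StateStore {
  private states = new Map<string, TaskState>();
  init(tasks: Task[]) {
    for (const t of tasks) this.states.set(t.id, { status: "pending", attempts: 0 });
  }
  get(id: string): TaskState {
    return this.states.get(id)!;
  }
  set(id: string, patch: Partial<TaskState>) {
    this.states.set(id, { ...this.get(id), ...patch });
  }
}

class Scheduler {
  constructor(
    private tasks: Task[],
    private state: StateStore,
  ) {}

  getRunnableTasks(): Task[] {
    return this.tasks.filter((task) => {
      if (this.state.get(task.id).status !== "pending") return false;
      // runnable only when every dependency has succeeded
      return (task.dependsOn ?? []).every(
        (dep) => this.state.get(dep).status === "success",
      );
    });
  }
}

const tasks: Task[] = [
  { id: "lint", command: "npm run lint" },
  { id: "build", command: "npm run build", dependsOn: ["lint"] },
];
const store = new StateStore();
store.init(tasks);
const scheduler = new Scheduler(tasks, store);

console.log(scheduler.getRunnableTasks().map((t) => t.id)); // [ 'lint' ]
store.set("lint", { status: "success" });
console.log(scheduler.getRunnableTasks().map((t) => t.id)); // [ 'build' ]
```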

The Orchestrator: Where the puzzle comes together

The orchestrator ties these components together. How the tasks are executed is a separate design decision: it could be event-driven, where events are produced and listened for, and tasks are executed based on them. For the sake of simplicity, we'll run the orchestrator in an infinite loop that breaks when there are no more tasks to execute.

export class Orchestrator {
  constructor(
    private scheduler: Scheduler,
    private executor: Executor,
    private state: StateStore,
  ) {}

  async run() {
    while (true) {
      const runnableTasks = this.scheduler.getRunnableTasks();
      if (runnableTasks.length === 0) {
        if (this.isCompleted()) break;
        await this.sleep(100);
        continue;
      }

      // wait for tasks to run completely
      await Promise.all(runnableTasks.map((task) => this.executeTask(task)));
    }
  }

  private async executeTask(task: Task) {
    // implemented below
  }

  private sleep(ms: number) {
    return new Promise((res) => setTimeout(res, ms));
  }
}

In the while loop, we fetch the runnable tasks from the Scheduler. If there's nothing to run and every task has finished (which we check with isCompleted()), we exit the loop. If some tasks are still running, we sleep briefly (a simple, not-ideal form of polling) and continue the loop. If there are tasks to execute, we offload them to the Executor and run them in parallel using Promise.all.

To check whether all the tasks have finished executing:

private isCompleted() {
    const all = this.state.all().values();

    for (const state of all) {
      if (state.status === "pending" || state.status === "running") {
        return false;
      }
    }

    return true;
  }

For the task execution:

  1. Update the task metadata before executing.
  2. If the task fails at execution, retry it based on the task's retries setting.

private async executeTask(task: Task) {
    const current = this.state.get(task.id);
    this.state.set(task.id, {
      status: "running",
      attempts: current.attempts + 1,
    });

    try {
      await this.executor.run(task);

      this.state.set(task.id, { status: "success" });
    } catch (err: any) {
      const retries = task.retries ?? 0;

      if (current.attempts < retries) {
        this.state.set(task.id, { status: "pending" });
      } else {
        this.state.set(task.id, { status: "failed", error: err });
      }
    }
  }
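To watch the retry logic in isolation, here's a sketch with a hypothetical FlakyExecutor that fails a fixed number of times before succeeding. It assumes the state store merges partial updates (otherwise the attempts counter would be lost each time the status is reset to pending):

```typescript
type TaskStatus = "pending" | "failed" | "running" | "success";
type TaskState = { status: TaskStatus; attempts: number; error?: Error };
type Task = { id: string; command: string; retries?: number; dependsOn?: string[] };

class StateStore {
  private states = new Map<string, TaskState>();
  init(tasks: Task[]) {
    for (const t of tasks) this.states.set(t.id, { status: "pending", attempts: 0 });
  }
  get(id: string): TaskState {
    return this.states.get(id)!;
  }
  // Merging keeps attempts intact across partial updates.
  set(id: string, patch: Partial<TaskState>) {
    this.states.set(id, { ...this.get(id), ...patch });
  }
}

// Hypothetical executor that fails a fixed number of times, standing in
// for the real child-process Executor.
class FlakyExecutor {
  constructor(private failuresLeft: number) {}
  async run(_task: Task): Promise<void> {
    if (this.failuresLeft-- > 0) throw new Error("simulated failure");
  }
}

// Same retry logic as executeTask above, extracted for the demo.
async function executeTask(task: Task, state: StateStore, executor: FlakyExecutor) {
  const current = state.get(task.id);
  state.set(task.id, { status: "running", attempts: current.attempts + 1 });

  try {
    await executor.run(task);
    state.set(task.id, { status: "success" });
  } catch (err: any) {
    const retries = task.retries ?? 0;
    if (current.attempts < retries) {
      state.set(task.id, { status: "pending" }); // scheduler will pick it up again
    } else {
      state.set(task.id, { status: "failed", error: err });
    }
  }
}

async function demo() {
  const task: Task = { id: "flaky", command: "simulated", retries: 2 };
  const store = new StateStore();
  store.init([task]);
  const executor = new FlakyExecutor(2); // fails twice, then succeeds

  // Re-run while the task keeps coming back as "pending".
  while (store.get(task.id).status === "pending") {
    await executeTask(task, store, executor);
  }
  console.log(store.get(task.id)); // { status: 'success', attempts: 3 }
}

demo();
```

With retries: 2, the task runs up to three times in total: the original attempt plus two retries.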

And to test this out, we'll have this:

async function main() {
  const tasks: Task[] = [
    { command: "scoop list", id: "scoop-command" },
    {
      command: "git status",
      id: "git-status",
      dependsOn: ["scoop-command"],
    },
  ];

  const store = new StateStore();
  store.init(tasks);
  const scheduler = new Scheduler(tasks, store);
  const executor = new Executor();

  const orchestrator = new Orchestrator(scheduler, executor, store);

  await orchestrator.run();
}

main()
  .then(() => console.log("All tasks processed"))
  .catch((err) => console.error(err));

We're pretty much done, except there's one problem with dependency support: cyclic dependencies.

Take a set of tasks like this:

const tasks: Task[] = [
  { command: "scoop list", id: "scoop-command", dependsOn: ["git-status"] },
  {
    command: "git status",
    id: "git-status",
    dependsOn: ["scoop-command"],
  },
];

There's a cyclic pattern of dependency here. The first task won't execute because it depends on the second, and the second task hasn't executed yet, but it also depends on the first. The current system does not know how to resolve or reject this, so it'll be stuck in an infinite loop. We can solve this by first checking whether the tasks have a cycle.

A DFS Solution: detecting cycles in dependencies

The Depth-First Search (DFS) algorithm is a fundamental algorithm used to traverse or search tree and graph data structures. It's commonly used to find paths and cycles in graphs. It works by starting at a "root" (or any given node) and exploring as deeply as possible along each branch before backtracking to explore unvisited paths.

          (Root)
            /  \
          (A)  (B)
         /   \    \
       (C)   (D)  (E)

DFS follows a simple "deep dive" strategy:

  • Explore: Visit a node, mark it as visited, then move on to the next unvisited neighbor.
  • Backtrack: When you reach a node with no unvisited neighbors (a dead end), go back to the previous node to check for other paths.

There are two ways of implementing it:

  • Recursively
  • Iteratively

How it solves our problem

DFS is commonly used for cycle detection. We want to reject a set of tasks with cyclic dependencies. When we pick a task, we look into its dependencies (and their own dependencies) to confirm that none of them depend back on the "root" task.

Graphically:

                (A: git-status)
                    /       \
         (B: git-init)   (C: git-branches)
                /
        (A: git-status)

From the diagram, the "root," Task A, depends on Task B, which in turn depends on Task A again. But Task A has already been marked as visited, hence a conflict: we've detected a cycle.

Let's implement this flow. Firstly, we need to define a data structure that works with the algorithm.

type Graph = Map<string, string[]>;

We also need to track the state of every node:

const state = new Map<string, "unvisited" | "visiting" | "visited">();

The implementation:

export function hasCycle(graph: Graph) {
  const state = new Map<string, "unvisited" | "visiting" | "visited">();

  for (const node of graph.keys()) {
    state.set(node, "unvisited");
  }

  function dfs(node: string) {
    const currentState = state.get(node)!;

    if (currentState === "visiting") return true;
    if (currentState === "visited") return false;

    state.set(node, "visiting");

    const neighbors = graph.get(node) || [];

    for (const neighbor of neighbors) {
      if (!state.has(neighbor)) {
        state.set(neighbor, "unvisited");
      }

      if (dfs(neighbor)) {
        return true;
      }
    }

    state.set(node, "visited");
    return false;
  }

  for (const node of graph.keys()) {
    if (dfs(node)) {
      return true;
    }
  }

  return false;
}

The algorithm is implemented within the hasCycle function. The function holds the state, and the inner dfs function accesses it via JavaScript closures.

It's also possible to define dfs(...) as a standalone function and pass the state through a function argument.
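For completeness, here's what the iterative variant mentioned earlier could look like. The function name hasCycleIterative is made up; the trick is pushing each node twice, an "enter" frame when we first reach it and an "exit" frame once its whole subtree has been explored, so the explicit stack can mimic the recursion (and avoid call-stack limits on very deep dependency chains):

```typescript
type Graph = Map<string, string[]>;

export function hasCycleIterative(graph: Graph): boolean {
  const state = new Map<string, "unvisited" | "visiting" | "visited">();
  for (const node of graph.keys()) state.set(node, "unvisited");

  for (const start of graph.keys()) {
    if (state.get(start) !== "unvisited") continue;

    // Each node is pushed as "enter" when first reached, and an "exit"
    // frame is scheduled so we know when its subtree is fully explored.
    const stack: Array<[string, "enter" | "exit"]> = [[start, "enter"]];

    while (stack.length > 0) {
      const [node, phase] = stack.pop()!;

      if (phase === "exit") {
        state.set(node, "visited"); // subtree fully explored
        continue;
      }
      if (state.get(node) === "visited") continue;

      state.set(node, "visiting");
      stack.push([node, "exit"]); // schedule the post-order step

      for (const neighbor of graph.get(node) ?? []) {
        const s = state.get(neighbor) ?? "unvisited";
        if (s === "visiting") return true; // back edge: cycle found
        if (s === "unvisited") stack.push([neighbor, "enter"]);
      }
    }
  }

  return false;
}
```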

To put the function to use:

export function validateTasks(tasks: Task[]) {
  const graph = new Map<string, string[]>();

  for (const task of tasks) {
    graph.set(task.id, task.dependsOn || []);
  }

  if (hasCycle(graph)) {
    throw new Error("Cycle detected in task dependencies.");
  }
}
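To see the guard in action, here's a condensed, self-contained sketch that feeds the cyclic pair from earlier into validateTasks; instead of hanging the orchestrator, it now fails fast:

```typescript
type Task = { id: string; command: string; retries?: number; dependsOn?: string[] };
type Graph = Map<string, string[]>;

function hasCycle(graph: Graph): boolean {
  const state = new Map<string, "unvisited" | "visiting" | "visited">();
  for (const node of graph.keys()) state.set(node, "unvisited");

  function dfs(node: string): boolean {
    const current = state.get(node) ?? "unvisited";
    if (current === "visiting") return true; // back edge: cycle
    if (current === "visited") return false;
    state.set(node, "visiting");
    for (const neighbor of graph.get(node) ?? []) {
      if (dfs(neighbor)) return true;
    }
    state.set(node, "visited");
    return false;
  }

  for (const node of graph.keys()) {
    if (dfs(node)) return true;
  }
  return false;
}

function validateTasks(tasks: Task[]) {
  const graph: Graph = new Map();
  for (const task of tasks) graph.set(task.id, task.dependsOn || []);
  if (hasCycle(graph)) throw new Error("Cycle detected in task dependencies.");
}

// The cyclic pair from earlier: each task depends on the other.
const cyclicTasks: Task[] = [
  { command: "scoop list", id: "scoop-command", dependsOn: ["git-status"] },
  { command: "git status", id: "git-status", dependsOn: ["scoop-command"] },
];

try {
  validateTasks(cyclicTasks);
} catch (err) {
  console.error((err as Error).message); // "Cycle detected in task dependencies."
}
```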

Then, we'll validate the tasks before initializing the store with them.

const store = new StateStore();
validateTasks(tasks);
store.init(tasks);
const scheduler = new Scheduler(tasks, store);
const executor = new Executor();
const orchestrator = new Orchestrator(scheduler, executor, store);
await orchestrator.run();