Build your infrastructure with Terraform, Nomad and Consul on Scaleway

Terraform is a cloud-agnostic automation tool to safely and efficiently manage your infrastructure as the configuration evolves. Terraform ships with Scaleway support since v0.7.0, which makes it a great tool to version and continuously develop your Scaleway infrastructure with ease.

In this blog post I'll show off Terraform's new capabilities by setting up a small web app using Consul, Nomad and fabio.

  • Consul is a tool for service discovery and configuration. Consul is distributed, highly available, and extremely scalable.

  • Nomad is a distributed scheduler that lets you run arbitrary tasks, ensures that services keep running, and takes care of dynamically re-allocating work when instances become unavailable.

  • fabio is a fast, modern, zero-conf load balancing HTTP router for deploying applications managed by consul.

Requirements

I assume you have Terraform >= v0.7.0 installed and a Scaleway account. If not, you can sign up in seconds here.

First, set the following environment variables to allow Terraform to interact with the Scaleway APIs:

export SCALEWAY_ACCESS_KEY=<your-access-key>  
export SCALEWAY_ORGANIZATION=<your-organization-key>  

Exporting environment variables allows you to run Terraform without specifying any credentials in the configuration file:

# main.tf
provider "scaleway" {}  
Getting started

I use a jump host to keep my Consul cluster from being publicly accessible. I set up a security group that blocks external traffic on the Nomad ports: 4646, 4647 and 4648 for HTTP, RPC and Serf.

The following configuration creates a security group that allows internal traffic and drops inbound traffic on the Nomad ports:

# modules/security_group/main.tf
resource "scaleway_security_group" "cluster" {  
  name        = "cluster"
  description = "cluster-sg"
}

resource "scaleway_security_group_rule" "accept-internal" {  
  security_group = "${scaleway_security_group.cluster.id}"

  action    = "accept"
  direction = "inbound"

  # NOTE this is just a guess - might not work for you.
  ip_range = "10.1.0.0/16"
  protocol = "TCP"
  port     = "${element(var.nomad_ports, count.index)}"
  count    = "${length(var.nomad_ports)}"
}

resource "scaleway_security_group_rule" "drop-external" {  
  security_group = "${scaleway_security_group.cluster.id}"

  action    = "drop"
  direction = "inbound"
  ip_range  = "0.0.0.0/0"
  protocol  = "TCP"

  port  = "${element(var.nomad_ports, count.index)}"
  count = "${length(var.nomad_ports)}"

  depends_on = ["scaleway_security_group_rule.accept-internal"]
}
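
The rules above reference a nomad_ports variable, and main.tf later consumes a module.security_group.id output. Neither file is shown in the post, so here is a minimal sketch of what they might look like (an assumption, adjust to your layout):

# modules/security_group/variables.tf (assumed, not shown in the post)
variable "nomad_ports" {
  default = ["4646", "4647", "4648"]
}

# modules/security_group/outputs.tf (assumed)
output "id" {
  value = "${scaleway_security_group.cluster.id}"
}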

Now that our network is no longer publicly accessible, we can start setting up the jump host:

# modules/jump_host/main.tf
resource "scaleway_server" "jump_host" {  
  name                = "jump_host"
  image               = "${var.image}"
  type                = "${var.type}"
  dynamic_ip_required = true

  tags = ["jump_host"]

  security_group = "${var.security_group}"
}

resource "scaleway_ip" "jump_host" {  
  server = "${scaleway_server.jump_host.id}"
}

output "public_ip" {  
  value = "${scaleway_ip.jump_host.ip}"
}

As you can see above, I use the scaleway_ip resource to request a public IP which can outlive the instance it's attached to. This way you can re-create the jump host without losing the public IP you are using.
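
The module also expects a few input variables (image, type, security_group) which are not shown in the post. A minimal sketch of modules/jump_host/variables.tf, assuming the C1 ARM commercial type (adjust the defaults to your needs):

# modules/jump_host/variables.tf (assumed, not part of the post)
variable "image" {
  description = "UUID of the image to boot"
}

variable "type" {
  description = "commercial server type, e.g. C1"
  default     = "C1"
}

variable "security_group" {
  description = "ID of the security group to attach to the server"
}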

We'll be using our jump host module like this:

# main.tf
provider "scaleway" {}

module "security_group" {  
  source = "./modules/security_group"
}

module "jump_host" {  
  source = "./modules/jump_host"

  security_group = "${module.security_group.id}"
}

Next, let's setup our Consul cluster.

Setting up Consul

I start by configuring the core of our setup, the scaleway_server resource. To slightly simplify the configuration, I use Ubuntu 16.04 and systemd.
Note that the server image & commercial type have been moved into separate files to keep things simple. Then, I modify the install.sh script to install the official Consul ARM binary:

# modules/consul/main.tf
resource "scaleway_server" "server" {  
  count               = "${var.server_count}"
  name                = "consul-${count.index + 1}"
  image               = "${var.image}"
  type                = "${var.type}"
  dynamic_ip_required = false

  tags = ["consul"]

  connection {
    type         = "ssh"
    user         = "root"
    host         = "${self.private_ip}"
    bastion_host = "${var.bastion_host}"
    bastion_user = "root"
    agent        = true
  }

  provisioner "local-exec" {
    command = "curl -L -o /tmp/consul_0.6.4_amd64.zip https://releases.hashicorp.com/consul/0.6.4/consul_0.6.4_linux_amd64.zip"
  }

  provisioner "local-exec" {
    command = "curl -L -o /tmp/consul_0.6.4_arm.zip https://releases.hashicorp.com/consul/0.6.4/consul_0.6.4_linux_arm.zip"
  }

  provisioner "file" {
    source      = "/tmp/consul_0.6.4_amd64.zip"
    destination = "/tmp/consul_0.6.4_amd64.zip"
  }

  provisioner "file" {
    source      = "/tmp/consul_0.6.4_arm.zip"
    destination = "/tmp/consul_0.6.4_arm.zip"
  }

  provisioner "file" {
    source      = "${path.module}/scripts/rhel_system.service"
    destination = "/tmp/consul.service"
  }

  provisioner "remote-exec" {
    inline = [
      "echo ${var.server_count} > /tmp/consul-server-count",
      "echo ${scaleway_server.server.0.private_ip} > /tmp/consul-server-addr",
    ]
  }

  provisioner "remote-exec" {
    scripts = [
      "${path.module}/scripts/install.sh",
      "${path.module}/scripts/service.sh",
    ]
  }
}

To accommodate the bundled installation scripts, I package the entire setup into one Terraform module called consul.

Since the Consul hosts are not publicly accessible, they cannot reach releases.hashicorp.com directly. This is easily fixed by downloading the necessary archives locally first, and then copying them over via scp.
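
The install.sh script itself is not shown in the post; below is a minimal sketch of what its architecture-aware part might look like, assuming the archives have already been copied to /tmp as above (the real script lives in the repository):

#!/bin/bash
# scripts/install.sh - sketch only, see the repository for the real script
set -e

# pick the consul archive matching this machine's architecture
case "$(uname -m)" in
  arm*) ARCHIVE=/tmp/consul_0.6.4_arm.zip ;;
  *)    ARCHIVE=/tmp/consul_0.6.4_amd64.zip ;;
esac

apt-get update -q
apt-get install -yq unzip

unzip -o "$ARCHIVE" -d /usr/local/bin
chmod +x /usr/local/bin/consul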

I use the connection attribute to tell Terraform to connect to the Consul instances through the jump host I created previously.
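
In plain SSH terms the bastion connection amounts to something like the following, using the example IPs that show up later in this post:

# roughly what Terraform does for each provisioner connection
ssh -o ProxyCommand="ssh -W %h:%p root@212.47.227.252" root@10.1.40.120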

Then, I use my new Consul module in the main.tf configuration file:

# main.tf
provider "scaleway" {}

module "security_group" {  
  source = "./modules/security_group"
}

module "jump_host" {  
  source = "./modules/jump_host"

  security_group = "${module.security_group.id}"
}

module "consul" {  
  source = "./modules/consul"

  security_group = "${module.security_group.id}"
  bastion_host   = "${module.jump_host.public_ip}"
}

To start and run the Consul cluster, I have to execute the following Terraform commands:

$ terraform get       # tell terraform to lookup referenced consul module 
$ terraform apply

Once the apply has finished, I verify that the cluster is composed of two servers. To do so, I first retrieve the jump host's public IP to connect to.

$ terraform output -module=jump_host
  public_ip = 212.47.227.252

Next, I lookup the private IPs of the Consul servers, and query the member list:

$ terraform show | grep private_ip
  private_ip = 10.1.40.120
  private_ip = 10.1.17.22
$ ssh root@212.47.227.252 'consul members -rpc-addr=10.1.40.120:8400'
Node      Address           Status  Type    Build  Protocol  DC  
consul-1  10.1.40.120:8301  alive   server  0.6.4  2         dc1  
consul-2  10.1.17.22:8301   alive   server  0.6.4  2         dc1

Setting up Nomad

The Nomad setup is very similar to the Consul one. The two notable differences are:

  • The Nomad configuration file is generated inline
  • Nomad uses the existing Consul cluster to bootstrap itself
resource "scaleway_server" "server" {  
  count               = "${var.server_count}"
  name                = "nomad-${count.index + 1}"
  image               = "${var.image}"
  type                = "${var.type}"
  dynamic_ip_required = true
  tags                = ["cluster"]

  provisioner "file" {
    source      = "${path.module}/scripts/rhel_system.service"
    destination = "/tmp/nomad.service"
  }

  provisioner "file" {
    source      = "${path.module}/scripts/nomad_v0.4_linux_arm"
    destination = "/usr/local/bin/nomad"
  }

  provisioner "remote-exec" {
    inline = <<CMD
cat > /tmp/server.hcl <<EOF  
datacenter = "dc1"

bind_addr = "${self.private_ip}"

advertise {  
  # We need to specify our host's IP because we can't
  # advertise 0.0.0.0 to other nodes in our cluster.
  serf = "${self.private_ip}:4648"
  rpc = "${self.private_ip}:4647"
  http= "${self.private_ip}:4646"
}

# connect to consul for cluster management
consul {  
  address = "${var.consul_cluster_ip}:8500"
}

# every node will be running as server as well as client…
server {  
  enabled = true
  # … but only one node bootstraps the cluster
  bootstrap_expect = ${element(split(",", "1,0"), signum(count.index))}
}

client {  
  enabled = true

  # enable raw_exec driver. explanation will follow
  options = {
    "driver.raw_exec.enable" = "1"
  }
}
EOF  
CMD  
  }

  provisioner "remote-exec" {
    scripts = [
      "${path.module}/scripts/install.sh",
      "${path.module}/scripts/service.sh"
    ]
  }
}
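
A quick note on the bootstrap_expect line: signum(count.index) evaluates to 0 for the first server and 1 for every other one, so element(split(",", "1,0"), …) yields 1 on nomad-1 and 0 on the remaining servers, meaning only the first node bootstraps the cluster. Like the other modules, this one relies on a handful of variables which are not shown; a minimal sketch of modules/nomad/variables.tf (assumed):

# modules/nomad/variables.tf (assumed, not shown in the post)
variable "server_count" {
  default = 2
}

variable "image" {}
variable "type" {}

variable "consul_cluster_ip" {
  description = "private IP of a Consul server to register with"
}

variable "security_group" {
  description = "ID of the security group to attach"
}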

Once again, I have extracted this part into a module so the main.tf Terraform configuration file hides the entire setup complexity:

# main.tf
provider "scaleway" {}

module "security_group" {  
  source = "./modules/security_group"
}

module "consul" {  
  source = "./modules/consul"

  security_group = "${module.security_group.id}"
}

module "nomad" {  
  source = "./modules/nomad"

  consul_cluster_ip = "${module.consul.server_ip}"
  security_group    = "${module.security_group.id}"
}
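
The nomad module consumes module.consul.server_ip, so the consul module has to expose the address of one of its servers as an output. A minimal sketch of what that output could look like (assumed, not shown in the post):

# modules/consul/outputs.tf (assumed)
output "server_ip" {
  value = "${scaleway_server.server.0.private_ip}"
}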

To start the Nomad cluster, I use the following Terraform commands:

$ terraform get    # tell terraform to lookup referenced consul module 
$ terraform plan   # we should see two new nomad servers
$ terraform apply

Once the operation has completed, I verify the setup. I should see two Nomad servers, one of which is marked as leader:

$ ssh root@163.172.160.218 'nomad server-members -address=http://10.1.38.33:4646'
Name            Address     Port  Status  Leader  Protocol  Build  Datacenter  Region  
nomad-1.global  10.1.36.94  4648  alive   false   2         0.4.0  dc1         global  
nomad-2.global  10.1.38.33  4648  alive   true    2         0.4.0  dc1         global  

I also check that Nomad registered with the Consul cluster:

$ ssh root@212.47.227.252 'curl -s 10.1.42.50:8500/v1/catalog/services' | jq 'keys'
[
  "consul",
  "nomad",
  "nomad-client"
]

The output above looks good as I can see consul, nomad and nomad-client listed all together.

Before we proceed, we need to verify that the ARM binary of Nomad properly reports resources; otherwise we can't schedule jobs in our cluster:

$ ssh root@163.172.171.232 'nomad node-status -self -address=http://10.1.38.151:4646'
ID     = c8c8a567  
Name   = nomad-1  
Class  = <none>  
DC     = dc1  
Drain  = false  
Status = ready  
Uptime = 4m55s

Allocated Resources  
CPU        Memory       Disk        IOPS  
0/5332000  0 B/2.0 GiB  0 B/16 EiB  0/0

Allocation Resource Utilization  
CPU        Memory  
0/5332000  0 B/2.0 GiB

Host Resource Utilization  
CPU             Memory           Disk  
844805/5332000  354 MiB/2.0 GiB  749 MiB/46 GiB  

Everything is looking great! It's time to run some software!

Running fabio

fabio is a reverse proxy open sourced by eBay which supports Consul out of the box. It is a great tool because it takes care of routing traffic internally to the node that runs a specific service.

I first need to schedule fabio on every Nomad node. This can be done easily using Nomad's system job type.
Note that you have to modify the nomad/fabio.nomad file slightly, entering your cluster IP:

# nomad/fabio.nomad
job "fabio" {  
  datacenters = ["dc1"]

  # run on every nomad node
  type = "system"

  update {
    stagger = "5s"
    max_parallel = 1
  }

  group "fabio" {
    task "fabio" {
      driver = "raw_exec"

      config {
        command = "fabio_v1.2_linux_arm"
        # replace 10.1.42.50 with an IP from your cluster!
        args = ["-proxy.addr=:80", "-registry.consul.addr", "10.1.42.50:8500", "-ui.addr=:9999"]
      }

      artifact {
        source = "https://github.com/nicolai86/scaleway-terraform-demo/raw/master/binaries/fabio_v1.2_linux_arm"

        options {
          checksum = "md5:9cea33d5531a4948706f53b0e16283d5"
        }
      }

      resources {
        cpu = 20
        memory = 64
        network {
          mbits = 1

          port "http" {
            static = 80
          }
          port "ui" {
            static = 9999
          }
        }
      }
    }
  }
}

I'm not running Consul locally on every Nomad node, so I have to specify -registry.consul.addr. Please replace the IP 10.1.42.50 with any IP from your Consul cluster. We could avoid this manual step, e.g. by using envconsul or by installing Consul on every Nomad node and using DNS.
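
For illustration, if a local Consul agent were running on every Nomad node, the fabio config stanza could simply point at the loopback address instead of a hard-coded cluster IP (a sketch, not part of this demo):

config {
  command = "fabio_v1.2_linux_arm"
  # with a local Consul agent there is no cluster IP to hard-code
  args = ["-proxy.addr=:80", "-registry.consul.addr", "127.0.0.1:8500", "-ui.addr=:9999"]
}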

Anyway, let's use Nomad to run fabio on every node:

$ nomad run -address=http://163.172.171.232:4646 nomad/fabio.nomad
==> Monitoring evaluation "7116f4b8"
    Evaluation triggered by job "fabio"
    Allocation "1ae8159c" created: node "c8c8a567", group "fabio"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7116f4b8" finished with status "complete"

fabio should have registered itself with Consul; let's check:

$ ssh root@163.172.157.49 'curl -s 163.172.157.49:8500/v1/catalog/services' | jq 'keys'
[
  "consul",
  "nomad",
  "nomad-client",
  "fabio"
]

Great, everything is fine! I can now run some applications on my fresh cluster.

Is go 1.7 out yet?

I've taken the "is go 1.2 out yet" API and adjusted it to check for Go 1.7 instead of Go 1.2.

To allow fabio to pick up the app, I need to include a service definition. I use one of my domains for this, randschau.eu.

service {  
  name = "isgo17outyet"
  tags = ["urlprefix-isgo17outyet.randschau.eu/"]
  port = "http"
  check {
    type = "http"
    name = "health"
    interval = "15s"
    timeout = "5s"
    path = "/"
  }
}

See the nomad directory for the complete isgo17outyet.nomad file.
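
For orientation, here is a rough skeleton of how such a job could be structured, with the service stanza embedded in the task; treat it as a sketch only, the file in the repository is the authoritative version (binary retrieval via the artifact stanza is omitted):

# nomad/isgo17outyet.nomad - rough skeleton, see the repository for the real file
job "isgo17outyet" {
  datacenters = ["dc1"]
  type        = "service"

  group "isgo17outyet" {
    task "isgo17outyet" {
      driver = "raw_exec"

      config {
        # placeholder command - the real job fetches an ARM binary via an artifact stanza
        command = "isgo17outyet"
      }

      service {
        name = "isgo17outyet"
        tags = ["urlprefix-isgo17outyet.randschau.eu/"]
        port = "http"
        check {
          type     = "http"
          name     = "health"
          interval = "15s"
          timeout  = "5s"
          path     = "/"
        }
      }

      resources {
        cpu    = 20
        memory = 64
        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }
}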

$ nomad run -address=http://163.172.171.232:4646 -verbose nomad/isgo17outyet.nomad
==> Monitoring evaluation "4c8ca49f-32a3-db17-92a0-d19f1ca8e3e8"
    Evaluation triggered by job "isgo17outyet"
    Allocation "29036d82-7cc6-6742-8db9-176fe83a7a3a" created: node "c8c8a567-a59b-b322-6428-7d6dc66af8d9", group "isgo17outyet"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4c8ca49f-32a3-db17-92a0-d19f1ca8e3e8" finished with status "complete"

I verify that the app is actually healthy by checking the Consul health checks:

$ ssh root@163.172.157.49 'curl 163.172.157.49:8500/v1/health/checks/isgo17outyet' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   298  100   298    0     0   2967      0 --:--:-- --:--:-- --:--:--  2980  
[
  {
    "Node": "consul-1",
    "CheckID": "66d7110ca9d9d844415a26592fbb872888895021",
    "Name": "health",
    "Status": "passing",
    "Notes": "",
    "Output": "",
    "ServiceID": "_nomad-executor-29036d82-7cc6-6742-8db9-176fe83a7a3a-isgo17outyet-isgo17outyet-urlprefix-isgo17outyet.randschau.eu/",
    "ServiceName": "isgo17outyet",
    "CreateIndex": 187,
    "ModifyIndex": 188
  }
]

Now, I can set up DNS routing to any Nomad node to access our new API. For now we'll just verify it's working
by setting the appropriate request header:

$  curl -v -H 'Host: isgo17outyet.randschau.eu' 163.172.171.232
*   Trying 163.172.171.232...
* Connected to 163.172.171.232 (163.172.171.232) port 9999 (#0)
> GET / HTTP/1.1
> Host: isgo17outyet.randschau.eu
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK  
< Content-Length: 120  
< Content-Type: text/html; charset=utf-8  
< Date: Sat, 23 Jul 2016 11:46:17 GMT  
<

<!DOCTYPE html><html><body><center>  
  <h2>Is Go 1.7 out yet?</h2>
  <h1>

    No.

  </h1>
</center></body></html>  
* Connection #0 to host 163.172.171.232 left intact

Closing thoughts

This was just a short demo of how to use the new Scaleway provider with Terraform.

We've skipped over using scaleway_volume & scaleway_volume_attachment to persist data across servers, and will leave this for another time.
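
For completeness, a minimal sketch of what that could look like (untested, names are placeholders):

# a minimal sketch - not part of this demo
resource "scaleway_volume" "data" {
  name       = "data"
  size_in_gb = 50
  type       = "l_ssd"
}

resource "scaleway_volume_attachment" "data" {
  server = "${scaleway_server.jump_host.id}"
  volume = "${scaleway_volume.data.id}"
}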

The source code is available online on GitHub. Comments, feature requests and contributions are always welcome!

That's it for now. Happy hacking : )

Raphael Randschau