DNS - Conditional forwarding in AWS VPC to external custom DNS

What do you do when your application runs on EC2 inside a VPC and needs to resolve a private host name managed by a custom DNS server outside the VPC, for example one sitting in your data centre?

One solution is a 'serverless' DNS setup: run DNSMasq on the instances themselves. This avoids standing up a dedicated DNS service, which costs money and needs its availability managed. Another benefit is resolution speed, because resolving locally minimises network calls. But how do you install and configure DNSMasq on each instance in a dynamic environment where AWS Auto Scaling scales in and out automatically?

There are plenty of solutions out there; here is my implementation using Puppet.

  1. Add the forwarding domains as DNSMasq forwarding rules to Puppet (as Hiera data or as values in manifests). Make sure the default rule uses the VPC-provided DNS: either the VPC CIDR base address plus 2, or the link-local address reserved for VPC DNS (169.254.169.253). Below are the Hiera values I used to enable automatic parameter lookup for the DNSMasq module.
    Only private domains are forwarded. VPC DNS resolves all other name queries.

    dnsmasq::configs_hash:
      forwarding-domains:
        ensure: present
        content: |
          # link-local address for VPC DNS, used for default queries
          server=169.254.169.253
          server=/domain1.local/10.168.10.5
          server=/domain1.local/10.168.11.5
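
    The "base address plus 2" rule is plain integer arithmetic on the VPC CIDR. Here is a minimal sketch in POSIX shell; the function name vpc_dns_ip is my own for illustration and is not part of any AWS tooling.

    ```shell
    # Print the VPC DNS resolver address: the network base address of the CIDR plus two.
    # The body runs in a subshell (parentheses) so IFS and the variables do not leak.
    vpc_dns_ip() (
        cidr=$1
        prefix=${cidr#*/}
        IFS=.
        set -- ${cidr%/*}          # split the dotted quad into $1..$4
        addr=$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
        mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
        n=$(( (addr & mask) + 2 ))
        echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
    )
    ```

    For example, `vpc_dns_ip 10.0.0.0/16` prints 10.0.0.2. If you would rather not compute anything, 169.254.169.253 works regardless of the CIDR, which is why I used it in the Hiera data.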
  2. Automate the DNSMasq installation by using Puppet to bootstrap the instance via EC2 user data. Below is an extract from a CloudFormation template that configures the launch configuration for an Auto Scaling group of Splunk search head nodes.
    The UserData script temporarily points the name server addresses at the external DNS servers to initialise the process. It then clones the Puppet module splunk_instance_bootstrap, and the install.sh script applies the configuration via puppet apply.
    DNSMasq is included in the bootstrap module to get full name resolution working at bootstrap time; more details later.

    SearchHeadLaunchConfig:
      Type: AWS::AutoScaling::LaunchConfiguration
      Properties:
        IamInstanceProfile: !Ref SplunkIamInstanceProfile
        KeyName: !Ref KeyName
        ImageId: !Ref AmiId
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash -v
    
            mv /etc/resolv.conf /etc/resolv.conf.orig
            echo nameserver 10.168.10.5 >> /etc/resolv.conf
            echo nameserver 10.168.11.5 >> /etc/resolv.conf
    
            export PUPPET_AWS_PROXY=${HttpsProxy}
            export https_proxy=${HttpsProxy}
            export http_proxy=${HttpsProxy}
            export no_proxy="169.254.169.254,localhost,127.0.0.1,.domain1.local"
    
            yum update -y --security
    
            function exit_reason
            {
                /opt/aws/bin/cfn-signal \
                --reason="$1" \
                --resource=SearchHeadASG \
                --stack="${AWS::StackName}" \
                --region="${AWS::Region}" \
                --exit-code=$2 \
                --https-proxy=${HttpsProxy}
                exit $2
            }
            /opt/aws/bin/cfn-init -c defaultOrder -s "${AWS::StackName}" \
                --resource=SearchHeadLaunchConfig \
                --region="${AWS::Region}" \
                --https-proxy=${HttpsProxy} \
            || exit_reason 'Failed to run cfn-init' 1
    
            INSTANCEID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
            CERTNAME=${!INSTANCEID}.${SplunkRole}.splunk-nprd.aws.domain1.local
    
            git clone --depth 1 \
            ssh://git@stash.aws.domain1.local:7999/mspk/splunk_instance_bootstrap.git \
            /etc/puppetlabs/code/modules/splunk_instance_bootstrap
    
            /etc/puppetlabs/code/modules/splunk_instance_bootstrap/install.sh ${PuppetEnvironment} ${!CERTNAME} || \
            exit_reason 'install bootstrap failed' 1
    
            # restore VPC dhcp settings to point dns to 127.0.0.1, local dnsmasq will do domain forwarding
            rm -f /etc/resolv.conf && mv /etc/resolv.conf.orig /etc/resolv.conf
            exit_reason "Splunk Search Head setup complete" 0
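
    One thing the extract does not handle is the failure path: when install.sh fails, exit_reason exits before /etc/resolv.conf is restored. A rough sketch of how the swap and restore could be wrapped in two helpers is shown here. The function names are hypothetical, and the file path is a parameter so the helpers can be tried against a scratch file first.

    ```shell
    # Back up a resolver file and point it at the given bootstrap name servers.
    use_bootstrap_dns() (
        resolv_file=$1
        shift
        cp -p "$resolv_file" "$resolv_file.orig" || return 1
        : > "$resolv_file"
        for ns in "$@"; do
            echo "nameserver $ns" >> "$resolv_file"
        done
    )

    # Put the original resolver file back if a backup exists; otherwise do nothing.
    restore_dns() {
        if [ -f "$1.orig" ]; then
            mv -f "$1.orig" "$1"
        fi
    }
    ```

    In the UserData script this could be used as `use_bootstrap_dns /etc/resolv.conf 10.168.10.5 10.168.11.5` followed by `trap 'restore_dns /etc/resolv.conf' EXIT`, so the restore also runs when the script exits early.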
  3. The bootstrap module configures the DNSMasq rules and the Puppet agent configuration files, and enables the Puppet agent service. From then on, the instance is managed by the Puppet Master. Here is an extract of the main class.

    class splunk_instance_bootstrap(
      $puppetserver_hostname,
      $dns_zone,
      ) {
    
      dnsmasq::conf { 'amazon-dns':
        content => 'server=169.254.169.253',
      }
    
      dnsmasq::conf { 'domain1-1':
        content => 'server=/domain1.local/10.168.10.5',
      }
    
      dnsmasq::conf { 'domain1-2':
        content => 'server=/domain1.local/10.168.11.5',
      }
    
      Puppetconf::Main {
        conf_path => '/etc/puppetlabs/puppet/puppet.conf',
        tag       => 'puppetconf',
      }
    
      puppetconf::main { 'certname':
        value   => $agent_certname,
      }
    
      puppetconf::main { 'server':
        value   => $puppetserver_hostname,
      }
    
      service { 'puppet':
        ensure     => running,
        enable     => true,
        hasrestart => true,
        hasstatus  => true,
      }
    
      # setup ordering
      Puppetconf::Main <| tag == 'puppetconf' |> -> Service['puppet']
      Ini_setting <| tag == 'puppetconf' |> -> Service['puppet']
      File <| tag == 'puppetconf' |> -> Service['puppet']
    
    }

    It may appear redundant to repeat the DNS rules in the bootstrap module, but DNS must be working before the first Puppet agent run, when the latest catalog is downloaded. Puppet applies the catalog in a single transaction, and the first run usually includes other packages and gems depending on the node's role. Out of the box, there is no simple way to say which resources across the whole catalog are applied first. This follows from Puppet's declarative model.
    Without DNS working before the first run, downloading the packages and gems required for the node's role from internal or external repositories will fail, and so will the overall agent run, including the DNSMasq configuration.
    To simulate imperative behaviour and control the ordering of resources, Puppet has run stages. However, they are considered an advanced use case and complicate your automation. This is an area where competitors like Ansible and Chef have an advantage, and a reason why I prefer Ansible.
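
    Since everything in the first agent run depends on name resolution, it can help to confirm that lookups work before the agent starts. The following is only a sketch of such a gate, not part of my bootstrap module: wait_for_dns and the RESOLVE_CMD override are names I made up for illustration, and the default lookup assumes getent is available on the instance.

    ```shell
    # Retry a lookup until it succeeds or the attempts run out.
    # Arguments: host [attempts] [delay-in-seconds].
    # RESOLVE_CMD is a hypothetical override hook; it defaults to `getent hosts`.
    wait_for_dns() (
        host=$1
        attempts=${2:-10}
        delay=${3:-3}
        i=1
        while [ "$i" -le "$attempts" ]; do
            if ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
                return 0
            fi
            i=$((i + 1))
            if [ "$i" -le "$attempts" ]; then
                sleep "$delay"
            fi
        done
        return 1
    )
    ```

    A call such as `wait_for_dns stash.aws.domain1.local 10 3 || exit_reason 'DNS not ready' 1` placed before the agent is enabled would fail the bootstrap early instead of leaving a half-applied catalog.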

  4. The Puppet Master is configured to autosign CSRs from agents whose certname ends in splunk.aws.domain1.local. This can be improved by including the instance ID in the CSR and implementing a signing policy that calls the EC2 API to check the instance ID.
    The registration process is initiated automatically by the agent on first contact with the master.

    # autosign certs with cn *.splunk.aws.domain1.local
    file { '/etc/puppetlabs/puppet/autosign.conf':
        content => '*.splunk.aws.domain1.local',
        tag     => 'puppetconf',
    }
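
    The improvement mentioned above maps onto Puppet's policy-based autosigning, where the autosign setting points at an executable that receives the certname as its first argument and the CSR on stdin, and signs when the executable exits 0. Below is a rough sketch of the decision logic, not something I run in production. It assumes the instance ID is the first label of the certname (as in the CERTNAME built in step 2), that the AWS CLI is installed on the master, and that the function names and the EC2_STATE_CMD hook are hypothetical. It checks the certname only and does not inspect the CSR itself.

    ```shell
    # Print the EC2 instance ID carried in the first label of a certname such as
    # i-0abc123.searchhead.splunk.aws.domain1.local; fail if it does not look like one.
    instance_id_from_certname() {
        case ${1%%.*} in
            i-*[!0-9a-f]*) return 1 ;;
            i-?*) echo "${1%%.*}" ;;
            *) return 1 ;;
        esac
    }

    # Default lookup: ask the EC2 API (via the AWS CLI) for the instance state.
    ec2_instance_state() {
        aws ec2 describe-instances --instance-ids "$1" \
            --query 'Reservations[].Instances[].State.Name' --output text
    }

    # Sign only if the certname has the expected suffix and names a running instance.
    # EC2_STATE_CMD is a hypothetical hook so the lookup can be stubbed out.
    autosign_policy() (
        certname=$1
        suffix=$2
        case $certname in
            *."$suffix") ;;
            *) return 1 ;;
        esac
        id=$(instance_id_from_certname "$certname") || return 1
        state=$(${EC2_STATE_CMD:-ec2_instance_state} "$id") || return 1
        [ "$state" = "running" ]
    )
    ```

    A policy executable would source these functions and end with `autosign_policy "$1" splunk.aws.domain1.local`, so its exit status tells the master whether to sign.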
  5. From here on, the DNS settings on the Splunk instance, not to mention the OS and Splunk settings, are managed by the Puppet agent according to the Hiera values on the Puppet Master, as seen in Step 1.
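
Once an instance has finished bootstrapping, a quick sanity check is to confirm that its resolver file sends queries to the local DNSMasq first. Here is a small sketch of such a check; the function names are hypothetical, and it assumes 127.0.0.1 is the address DNSMasq listens on, as in the UserData comment in step 2.

```shell
# Print the first nameserver entry from a resolv.conf-style file.
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "$1"
}

# Succeed only if the file (default /etc/resolv.conf) lists 127.0.0.1 first.
uses_local_dnsmasq() {
    [ "$(first_nameserver "${1:-/etc/resolv.conf}")" = "127.0.0.1" ]
}
```

Running `uses_local_dnsmasq || echo "resolver is not pointing at DNSMasq"` on a node gives a cheap signal that the DHCP settings were restored after the bootstrap.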