Back Up Splunk Hot Buckets

There is strong consensus that backing up Splunk hot buckets using storage snapshots is reliable, but there are few, if any, complete examples of how to set this up. Here is an implementation using LVM and EBS, automated with CloudFormation and Puppet.

Backing up data that has been indexed by Splunk is critical to most organisations. Unfortunately, backing up the most recently indexed data inside the hot buckets directly is not possible and, according to Splunk's own documentation, not recommended. It is, however, possible to take snapshots of the hot bucket storage, and the key is to back up those snapshots.

We used the Amazon Linux AMI with an XFS filesystem in this project. LVM snapshots work with XFS freeze to ensure the filesystem is consistent at the moment the snapshot is taken, so when a backup of the LVM snapshot volume is mounted, there is no risk of data loss or corruption.
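
To make the mechanism concrete, here is a minimal sketch of one manual backup cycle, assuming the volume group and logical volume names used later in the Hiera data (splunkhotvg and splunkhotlv); the snapshot size and mount point are illustrative, and the automated job further down does the same thing via Ansible.

# manual equivalent of one backup cycle (illustrative names and sizes)
lvcreate --snapshot --size 10G --name splunkhotlv_snapshot /dev/splunkhotvg/splunkhotlv   # XFS is frozen briefly while the snapshot is created
mount -o ro,nouuid /dev/splunkhotvg/splunkhotlv_snapshot /mnt/splunkhot_snapshot          # nouuid is needed because the snapshot shares the origin's XFS UUID
# ... back up from /mnt/splunkhot_snapshot, or snapshot the underlying EBS volume ...
umount /mnt/splunkhot_snapshot
lvremove -y /dev/splunkhotvg/splunkhotlv_snapshot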

Here is the inspiration for this implementation and an explanation of how the process works.

The context is Splunk on AWS EC2 using EBS volumes for the indexer cluster. As storage is cheap, primary and replicated buckets are backed up.

EBS Volumes

Here is an excerpt from the CloudFormation template showing the disk configuration for Splunk storage. The template makes heavy use of parameters for reusability. The CloudFormation resource is actually an Auto Scaling launch configuration, which specifies the configuration for each member of the Auto Scaling group.

With regard to the various EBS volume size parameters, there is some extra logic the Splunk engineers need to apply: a reserve must be calculated to cater for LVM snapshot growth. In the extreme case that the logical volume is extremely busy, or the snapshot is left running longer than required, every block on the logical volume might be overwritten, so the snapshot volume needs to be the same size as the source logical volume. In other words, the EBS volume needs to be 200% of the storage Splunk actually requires. So if Splunk needs 100GB for hot buckets, the EBS volume required is 200GB, plus a little extra for overheads like LVM metadata.
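
As a worked example (the numbers are purely illustrative):

Splunk hot bucket storage required          100 GB
LVM snapshot reserve (worst case, 100%)     100 GB
LVM metadata and general headroom (~2%)       4 GB
                                           -------
HotBucketsEbsVolumeSize                     204 GB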

Notice that there is an option to provide an EBS snapshot ID for recovery purposes. When provided, the EBS snapshot dictates the volume size and already contains the LVM snapshot.

# splunk_index_cluster_template.yaml

IndexerLaunchConfig:
  Type: 'AWS::AutoScaling::LaunchConfiguration'
  Properties:
    BlockDeviceMappings:
      - DeviceName: /dev/xvda
        Ebs:
          VolumeSize: '40'
          VolumeType: gp2
      - DeviceName: /dev/xvdf
        Ebs:
          VolumeSize: !Ref SplunkEbsVolumeSize
          VolumeType: io1
          Iops: 1000
          Encrypted: true
          DeleteOnTermination: true
      - DeviceName: /dev/xvdg
        Ebs:
          VolumeSize: !If
            - UseHotBucketsSnapshot
            - !Ref AWS::NoValue
            - !Ref HotBucketsEbsVolumeSize
          VolumeType: io1
          Iops: 1000
          Encrypted: true
          DeleteOnTermination: true
          SnapshotId: !If
            - UseHotBucketsSnapshot
            - !Ref HotBucketsSnapshotId
            - !Ref AWS::NoValue
      - DeviceName: /dev/xvdh
        Ebs:
          VolumeSize: !If
            - UseColdBucketsSnapshot
            - !Ref AWS::NoValue
            - !Ref ColdBucketsEbsVolumeSize
          VolumeType: st1
          Encrypted: true
          DeleteOnTermination: true
          SnapshotId: !If
            - UseColdBucketsSnapshot
            - !Ref ColdBucketsSnapshotId
            - !Ref AWS::NoValue
      - DeviceName: /dev/xvdi
        Ebs:
          VolumeSize: !Ref FrozenEbsVolumeSize
          VolumeType: st1
          Encrypted: true
          DeleteOnTermination: true
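
The excerpt references conditions and snapshot ID parameters that are defined elsewhere in the template. A minimal sketch of what they might look like is below; only the names come from the excerpt above, the exact definitions are an assumption.

# splunk_index_cluster_template.yaml (conditions and parameters sketch)

Parameters:
  HotBucketsSnapshotId:
    Type: String
    Default: ''
    Description: Optional EBS snapshot ID to restore the hot bucket volume from
  ColdBucketsSnapshotId:
    Type: String
    Default: ''
    Description: Optional EBS snapshot ID to restore the cold bucket volume from

Conditions:
  UseHotBucketsSnapshot: !Not [!Equals [!Ref HotBucketsSnapshotId, '']]
  UseColdBucketsSnapshot: !Not [!Equals [!Ref ColdBucketsSnapshotId, '']]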

LVM Configurations

Here is an excerpt from the Puppet Hiera data for configuring LVM on the node. The Hiera keys align with the various module classes to enable automatic class parameter lookup, which simplifies the Puppet code.

As per Puppet best practice, we use the Roles and Profiles pattern to separate Puppet code from the actual configuration data, keeping the Hiera data simple.

# roles/indexer_cluster.yaml

profile::operating_system::storage::splunk_paths:
  splunkhotdb:
    path: /opt/splunk/db
  splunkcolddb:
    path: /opt/splunk/colddb
  splunkfrozendb:
    path: /opt/splunk/frozendb

lvm::volume_groups:
  splunkhotvg:
    createonly: true
    physical_volumes:
      - /dev/xvdg
    logical_volumes:
      splunkhotlv:
        mountpath: /opt/splunk/db
        mountpath_require: true
        tag: splunk_lv
  splunkcoldvg:
    createonly: true
    physical_volumes:
      - /dev/xvdh
    logical_volumes:
      splunkcoldlv:
        mountpath: /opt/splunk/colddb
        mountpath_require: true
        tag: splunk_lv
  splunkfrozenvg:
    createonly: true
    physical_volumes:
      - /dev/xvdi
    logical_volumes:
      splunkfrozenlv:
        mountpath: /opt/splunk/frozendb
        mountpath_require: true
        tag: splunk_lv
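
The profile class that consumes splunk_paths is not shown here. A simplified sketch of what it could look like follows, assuming the mount point directories just need to exist before the lvm module mounts the logical volumes; the ownership and other resource details are assumptions.

# profile/manifests/operating_system/storage.pp

class profile::operating_system::storage (
  Hash $splunk_paths = {},
) {
  # Hiera's automatic parameter lookup populates $splunk_paths from
  # profile::operating_system::storage::splunk_paths above.
  $splunk_paths.each |String $name, Hash $config| {
    file { $config['path']:
      ensure => directory,
      owner  => 'splunk',
      group  => 'splunk',
    }
  }

  # lvm::volume_groups is likewise picked up by automatic parameter lookup.
  include lvm
}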

Snapshots And Backups

There are many ways of scheduling jobs in the cloud world, from cool tools to AWS' ever-growing list of services. In this project, Rundeck is used to automate as well as document the various housekeeping tasks and system functions. This of course includes scheduling the LVM and EBS snapshots.

The benefits of using Rundeck include:

  1. jobs can be scheduled, manually rerun, or executed in response to an event. Triggering a job via the API is a powerful construct; in this project it allows Splunk alert actions to trigger a Rundeck job for corrective action (see the example after this list).
  2. all tasks and jobs are centralised, and the above feedback loop makes for a powerful 'one stop shop' for administering the environment
  3. a granular security architecture allows fine-grained, role-based access control
  4. RBAC allows tasks to be delegated, enabling self-service according to organisational fit
  5. centralised logging allows for comprehensive auditing, using Splunk of course! :)
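
As an example of the first point, triggering the backup job through the Rundeck API is a single authenticated POST. The URL, job ID, token, and option values below are all placeholders.

# trigger the backup job via the Rundeck API (all values are placeholders)
curl -X POST \
  -H 'X-Rundeck-Auth-Token: <api-token>' \
  -H 'Content-Type: application/json' \
  -d '{"options": {"volume_group": "splunkhotvg", "logical_volume": "splunkhotlv", "snapshot_size": "10G", "ebs_volume_id": "vol-0123456789abcdef0", "region": "ap-southeast-2"}}' \
  'https://rundeck.example.com/api/40/job/<job-id>/run'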

The Backup Job

Here is the configuration for the backup job. The job uses Ansible to execute the LVM and EBS snapshots via the inline playbook written into the job spec. This setup aligns with the DevOps principle of versioning all artefacts and produces highly readable code as documentation. Rundeck can filter which nodes the job runs against using regex, so we can easily target nodes of a certain type, such as only indexers or search heads.

# backup_splunk_volume.yaml

- defaultTab: output
  description: |+
    Backup a splunk volume using LVM and EBS snapshots
  name: backup_splunk_volume
  options:
    - name: volume_group
      description: Volume group name
    - name: logical_volume
      description: Logical volume name to snapshot
    - name: snapshot_size
      description: Snapshot volume size
    - name: ebs_volume_id
      description: EBS volume ID to snapshot
    - name: region
      description: AWS region of the EBS volume
  nodefilters:
    filter: '"Index cluster .*|Search.*|Index Cluster Master.*|HeavyForwarder.*|.*API Collector.*" '
  sequence:
    commands:
      - configuration:
          ansible-become: 'false'
          ansible-disable-limit: 'true'
          ansible-playbook-inline: |
            ---
            - name: backup splunk volume
              hosts: localhost
              connection: local
              gather_facts: false
              tasks:
                - name: Create LVM snapshot
                  lvol:
                    vg: ${option.volume_group}
                    lv: ${option.logical_volume}
                    snapshot: "${option.logical_volume}_snapshot"
                    size: ${option.snapshot_size}
                  register: lvm_output
                - debug: msg="{{ lvm_output }}"
                - name: Create EBS snapshot
                  ec2_snapshot:
                    region: ${option.region}
                    description: Splunk volume backup taken by Rundeck
                    volume_id: ${option.ebs_volume_id}
                    snapshot_tags:
                      Application: Splunk
                      Owner: 123123
                  register: ebs_output
                - debug: msg="{{ ebs_output }}"
                - name: Delete snapshot volume
                  lvol:
                    vg: ${option.volume_group}
                    lv: ${option.logical_volume}
                    snapshot: "${option.logical_volume}_snapshot"
                    state: absent
                  register: lvm_delete_output
                - debug: msg="{{ lvm_delete_output }}"
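
For recovery, the EBS snapshot ID created by this job is fed back into the CloudFormation stack via the SnapshotId property shown earlier, and the restored volume still contains the LVM snapshot taken at backup time. A rough sketch of bringing the consistent copy online on the replacement instance follows; the device, volume, and mount names are illustrative.

# recovery sketch: /dev/xvdg on the new instance was created from the EBS snapshot
vgchange -ay splunkhotvg                                                 # activate the volume group found on the restored volume
mkdir -p /mnt/restore
mount -o ro,nouuid /dev/splunkhotvg/splunkhotlv_snapshot /mnt/restore    # the XFS-consistent point-in-time copy
# or, to roll the origin logical volume back to the consistent snapshot state:
# lvconvert --merge /dev/splunkhotvg/splunkhotlv_snapshot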