Sunday, 29 May 2016

Making gluster play nicely with others

These days hyperconverged strategies are everywhere. But when you think about it, sharing the finite resources within a physical host requires an effective means of prioritisation and enforcement. Luckily, the Linux kernel already provides an infrastructure for this in the shape of cgroups, and the interface to these controls is now simplified with systemd integration.

So lets look at how you could use these capabilities to make Gluster a better neighbour in a collocated or hyperconverged  model. 

First some common systemd terms, we should to be familiar with;
slice : a slice is a concept that systemd uses to group together resources into a hierarchy. Resource constraints can then be applied to the slice, which defines 
  • how different slices may compete with each other for resources (e.g. weighting)
  • how resources within a slice are controlled (e.g. cpu capping)
unit : a systemd unit is a resource definition for controlling a specific system service
NB. More information about control groups with systemd can be found here

In this article, I'm keeping things simple by implementing a cpu cap on glusterfs processes. Hopefully, the two terms above are big clues, but conceptually it breaks down into two main steps;
  1. define a slice which implements a CPU limit
  2. ensure gluster's systemd unit(s) start within the correct slice.
So let's look at how this is done.

Defining a slice

Slice definitions can be found under /lib/systemd/system, but systemd provides a neat feature where /etc/systemd/system can be used provide local "tweaks". This override directory is where we'll place a slice definition. Create a file called glusterfs.slice, containing;

[Slice]
CPUQuota=200%

CPUQuota is our means of applying a cpu limit on all resources running within the slice. A value of 200% defines a 2 cores/execution threads limit.

Updating glusterd


Next step is to give gluster a nudge so that it shows up in the right slice. If you're using RHEL7 or Centos7, cpu accounting may be off by default (you can check in /etc/systemd/system.conf). This is OK, it just means we have an extra parameter to define. Follow these steps to change the way glusterd is managed by systemd

# cd /etc/systemd/system
# mkdir glusterd.service.d
# echo -e "[Service]\nCPUAccounting=true\nSlice=glusterfs.slice" > glusterd.service.d/override.conf

glusterd is responsible for starting the brick and self heal processes, so by ensuring glusterd starts in our cpu limited slice, we capture all of glusterd's child processes too. Now the potentially bad news...this 'nudge' requires a stop/start of gluster services. If your doing this on a live system you'll need to consider quorum, self heal etc etc. However, with the settings above in place, you can get gluster into the right slice by;

# systemctl daemon-reload
# systemctl stop glusterd
# killall glusterfsd && killall glusterfs
# systemctl daemon-reload
# systemctl start glusterd


You can see where gluster is within the control group hierarchy by looking at it's runtime settings

# systemctl show glusterd | grep slice
Slice=glusterfs.slice
ControlGroup=/glusterfs.slice/glusterd.service
Wants=glusterfs.slice
After=rpcbind.service glusterfs.slice systemd-journald.socket network.target basic.target

or use the systemd-cgls command to see the whole control group hierarchy

├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 19
├─glusterfs.slice
│ └─glusterd.service
│   ├─ 867 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
│   ├─1231 /usr/sbin/glusterfsd -s server-1 --volfile-id repl.server-1.bricks-brick-repl -p /var/lib/glusterd/vols/repl/run/server-1-bricks-brick-repl.pid 

 │   └─1305 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log
├─user.slice
│ └─user-0.slice
│   └─session-1.scope
│     ├─2075 sshd: root@pts/0  
│     ├─2078 -bash
│     ├─2146 systemd-cgls
│     └─2147 less
└─system.slice

At this point gluster is exactly where we want it! 

Time for some more systemd coolness ;) The resource constraints that are applied by the slice are dynamic, so if you need more cpu, you're one command away from getting it;

# systemctl set-property glusterfs.slice CPUQuota=350%

Try the 'systemd-cgtop' command to show the cpu usage across the complete control group hierarchy.

Now if jumping straight into applying resource constraints to gluster is a little daunting, why not test this approach with a tool like 'stress'. Stress is designed to simply consume components of the system - cpu, memory, disk. Here's an example .service file which uses stress to consume 4 cores

[Unit]
Description=CPU soak task

[Service]
Type=simple
CPUAccounting=true
ExecStart=/usr/bin/stress -c 4
Slice=glusterfs.slice

[Install]
WantedBy=multi-user.target

Now you can tweak the service, and the slice with different thresholds before you move on to bigger things! Use stress to avoid stress :)

And now the obligatory warning. Introducing any form of resource constraint may resort in unexpected outcomes especially in hyperconverged/collocated systems - so adequate testing is key.

With that said...happy hacking :)




4 comments:

  1. Excellent article and something I've been meaning to dive in to, so thanks for doing this.

    My only suggestion is for systemd drop-ins I suggest naming them for what their purpose is. "glusterd.service.d/99-cpu.conf" instead of "glusterd.service.d/override.conf", for instance. I find it's much easier to track and it leaves it possible for multiple config management modules to drop-in changes to a service. Using the numeric prefix allows you to prioritize drop-ins where a higher numbered file might override the settings in a lower numbered drop-in.

    ReplyDelete
  2. We found this weblog very great and we wanna thank you for that. We hope you keep up the great work!
    affordable storage

    ReplyDelete
  3. Good post! Thanks for sharing this information. I appreciate it. It is very beneficial for visitors.Visit: affordable storage

    ReplyDelete