Load balancing in Munki – thinking out loud.

This post is equal parts thinking out loud, and committing an idea to writing so I don’t forget. Please chime in if you have input. BTW: None of this code should be trusted as working – this is sketching only.

The Munki system I manage as my day job recently hit a milepost: when I released an Office 2016 update, it saturated the server.

Munki Web Admin ground to a halt (my first indication), and when I finally got logged in the server load was hovering between 110 and 120! Not good.

Short term, I threw a couple more cores at the VM, doubled up the RAM, and called it a day. But that will only take me so far, and I am approaching the saturation point for the 10Gb link from this VM to campus. I need a solution that is a bit more sustainable.

On the surface, Munki makes it easy to spread updates out by way of catalogs. Divide your fleet into X groups, then create Production catalogs 1 through X, assigning them in turn to machines in your fleet. Then decide how long you want to take to roll out the software, and add the software to each of the X catalogs at intervals over the duration of the rollout.

There are a few problems with this approach:

  1. The Munki Admin (Me) is lazy, and shouldn’t be relied on to “release” software to each of these catalogs in turn over the duration of the rollout.
  2. If you have a bunch of machines (read: manifests) in your fleet, you don’t want to have to go back through all of them to re-assign the catalogs.
  3. Will it work? Sure, but it’s a royal hack for a solution.

But what if we can functionally take this approach, and automate it in some fashion?

For starters, let’s consider mod_rewrite in Apache. As it turns out, you can set a RewriteMap value to point at a script, and dynamically generate a value, silently redirecting a request for /catalogs/production to a changing value – say /catalogs/production[0-9].

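One catch: RewriteMap isn’t allowed in a .htaccess file, it has to live in the server or virtual host config. So, in the vhost that serves the repo, we’ll place something like the following (using [PT] for an internal rewrite, since [P] would hand the request to mod_proxy):

RewriteEngine   on
RewriteMap      lb    prg:/usr/local/cgi-bin/lb.py
RewriteRule     ^/catalogs/(production)$ /catalogs/${lb:$1}   [PT,L]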

And the script would look like:

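#!/usr/bin/env python3
##
## Randomly append a numeric value to the catalog name for each request.
## Apache starts a prg: map program once and sends it one lookup key per
## line on stdin, so loop forever and flush one answer line per key.
##

import random
import sys

for line in iter(sys.stdin.readline, ""):
    key = line.strip()  # "production", as captured by the RewriteRule
    sys.stdout.write(key + str(random.randint(0, 9)) + "\n")
    sys.stdout.flush()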

By my reckoning, if the rewrite works, the percentage of client check-ins that get offered a given piece of software will match the percentage of catalogs that include it. I understand the math isn’t exact here, because we are pseudo-randomizing the catalog per request rather than per client: when you consider the number of times a client will check in over the course of a day (between 12 and 24 times), an active client has a high probability of being offered production0 at some point during that first day.
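
To put rough numbers on that, here is the arithmetic as a quick sketch. It assumes every check-in draws one of the catalogs independently and uniformly at random, which is what the script above does; the function name is just something I made up.

```python
def chance_offered(catalogs_with_item, total_catalogs, checkins):
    """Probability that at least one of `checkins` random draws lands on a
    catalog that contains the item."""
    miss = 1 - catalogs_with_item / total_catalogs
    return 1 - miss ** checkins

# The update is in 1 of 10 catalogs; a client checks in 12 or 24 times a day.
for checkins in (12, 24):
    print(checkins, round(chance_offered(1, 10, checkins), 3))
# prints:
# 12 0.718
# 24 0.92
```

So even with the software in a single catalog, somewhere between 7 and 9 out of 10 active clients would see it within the first day, which is not much of a throttle.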

Maybe we roll the production catalog offered – stepping from 0 through 9 sequentially… I’m still thinking about that.
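
A rough sketch of what the sequential version might look like, again sketching only. `rotate` is a hypothetical helper, and I’m assuming the lookup key is the bare catalog name as in the rewrite rule above.

```python
import itertools

def rotate(keys, buckets=10):
    """Yield key + suffix, stepping the suffix 0..buckets-1 and wrapping."""
    suffixes = itertools.cycle(range(buckets))
    for key in keys:
        yield key.strip() + str(next(suffixes))

# Stand-in for three lookups arriving from Apache, with only 2 buckets.
print(list(rotate(["production\n"] * 3, buckets=2)))
# prints: ['production0', 'production1', 'production0']
```

In the real map program the keys would come from iter(sys.stdin.readline, ""), with one flushed line written back per key. Note that this steps through the catalogs across all requests hitting the server, not per client, so a given client still lands on an effectively arbitrary catalog at each check-in.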

Now that the load balance piece is solved, we have to generate the 10 production catalogs and inject the software into each of them in turn over the duration of the rollout.

So let’s say we want to roll something out over the course of 48 hours, and we have 10 pseudo catalogs. Every 4.8 hours (48 / 10), we will run a function that reads in the pkginfo .plist for the software being load-balanced, makes use of a counter that we’ll set in the XML to work out which catalog is next, and appends the item to the end of that makecatalogs-generated catalog.
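
A minimal sketch of that release step, assuming the catalogs are plain plist arrays of pkginfo dicts (which is what makecatalogs writes). `release_times` and `append_to_catalog` are hypothetical helpers, the start date is arbitrary, and I’m leaving out the counter-in-the-pkginfo part for now.

```python
import plistlib
from datetime import datetime, timedelta

def release_times(start, hours=48, catalogs=10):
    """Return one release time per catalog, evenly spaced over the rollout."""
    step = timedelta(hours=hours / catalogs)  # 48 / 10 = 4.8 hours
    return [start + i * step for i in range(catalogs)]

def append_to_catalog(catalog_path, pkginfo_path):
    """Append the pkginfo dict to the end of an existing catalog plist."""
    with open(pkginfo_path, "rb") as f:
        pkginfo = plistlib.load(f)
    with open(catalog_path, "rb") as f:
        catalog = plistlib.load(f)
    catalog.append(pkginfo)
    with open(catalog_path, "wb") as f:
        plistlib.dump(catalog, f)

# Arbitrary example start time; production0 is released at times[0], etc.
times = release_times(datetime(2016, 1, 1, 9, 0))
print(times[1] - times[0])  # prints: 4:48:00
```

Something like cron or launchd would then call append_to_catalog for the next catalog at each of those times.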

Here’s a problem – if a prior version of softwareX is listed earlier in the catalog because it’s in the “master” catalog that makecatalogs generated, it will override what we add at the end. Hmmmm…

To be continued…
