Well, I’m trained as a scientist and engineer, so I keep a notebook. This is something I have done religiously since I was in grad school, much to my wife’s dismay.
Since 1991 I have loved the National brand Chemistry Notebook (number 43-571), but National was bought a few years ago and the new owners cut a stupid corner by reducing the notebook from 128 pages to 120. Worse yet, this notebook has become rather expensive to buy, costing upward of $10 per book. The pages are still numbered for me, but the reduction from 128 to 120 remains an irritant.
So, when I recently changed jobs and, at the same time, ran out of notebooks I decided to switch to the Clairefontaine 9542C. This is a smaller notebook with paper that is slightly more opaque and quadrille ruled 5×5 to the inch.
Oddly, despite the fact that it is made in France and described with metric dimensions (14.8 cm x 21 cm), the ruling is specified as 5×5 to the inch. I agree that this is a convenient grid size for technical notebooks, but is there no metric ruling that matches? A 0.5 cm grid comes to mind, since it would end up very close to 5×5 to the inch: 5 x 0.5 cm is 2.5 cm, and an inch is 2.54 cm. Perhaps it is marketed as a 0.5 cm square grid in Europe but as 5×5 to the inch in the US?
Anyway, I needed to buy some more of these notebooks. Normally I pick them up from a stationery store near my apartment, but that is inconvenient and expensive.
I tried looking for them on Amazon (amazon.com, to be precise). While I can find them, it’s hard to tell which product is being sold because Amazon’s product information for these Clairefontaine notebooks is dreadful. And they’re expensive.
After being frustrated by the unusually low quality of Amazon’s offerings I tried searching Google for “clairefontaine 9542c”. To my surprise, I found an amazon.de page near the top of the organic results. Even more of a surprise was the fact that it was offering five of these lovely notebooks for about 10 euros, or only a little bit more than I was paying for one in the US.
Not reading German I decided to try amazon.co.uk. There I found these notebooks, again better described, priced at ten pounds for a package of five. I ordered two packages. Even with shipping to the US these notebooks come out at about half the price that I pay for them in the US.
It’s been a while since I’ve written about my toy data center. I started with two Intel NUCs and shortly thereafter expanded to four. Each of the first pair has a 240 GB SSD; each of the second pair sports a 480 GB SSD.
For my development work I want a simple way to display the data in an object instance without having to modify the __str__(self) method every time I add, delete, or rename members. Here’s a technique I’ve adopted that relies on the fact that every object stores all of its members in a dictionary called self.__dict__. Making a string representation of the object is just a matter of returning a string representation of __dict__. This can be achieved in several ways. One of them is simply str(self.__dict__) and the other uses the JSON serializer json.dumps(), which lets you prettyprint the result.
Here’s a little Python demonstrator program:
#!/usr/bin/python
""" demo - demonstrate a simple technique to display text representations
    of Python objects using the __dict__ member and a json serializer.
    $Id: demo.py,v 1.3 2015/07/18 13:07:15 marc Exp marc $
"""

import json

class something(object):
    """ This is just a demonstration class. """

    def __init__(self, id, name):
        self.id = id
        self.name = name

    def rename(self, name):
        self.name = name

    def __str__(self):
        return json.dumps(self.__dict__, indent=2, separators=(',', ': '))
        # return str(self.__dict__)

def main():
    o1 = something(1, "first object")
    o2 = something(2, "second object")
    print str(o1)
    print str(o2)
    o1.rename("dba third object")
    print str(o1)

if __name__ == '__main__':
    main()
Nice and easy for testing and debugging. Once I’m ready for production and no longer want the JSON representations, I can introduce a DEBUG flag so that the non-DEBUG behavior of __str__(self) is appropriate for production use.
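A minimal sketch of that idea, with DEBUG as a hypothetical module-level flag and a made-up terse production string:

import json

DEBUG = False    # hypothetical switch; set True during development

class something(object):
    """ The demonstration class again, with a production-friendly __str__. """

    def __init__(self, id, name):
        self.id = id
        self.name = name

    def __str__(self):
        if DEBUG:
            # the full JSON dump of __dict__, as above
            return json.dumps(self.__dict__, indent=2, separators=(',', ': '))
        # terse representation for production use
        return "something(id=%s, name=%s)" % (self.id, self.name)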
[update]
What’s wrong with this? If I have a member that is itself an object, then the json.dumps() call fails. Ideally Python would call __str__() on a member if __str__() was called on the object.
On reading some more goodies, it’s clear that what I should be using is repr() and not str().
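One way to work around the nested-member problem, sketched here rather than offered as the definitive fix, is to hand json.dumps() a default function that falls back to a member's __dict__; the inner and outer classes below are made up for illustration:

import json

class inner(object):
    """ A made-up member class. """
    def __init__(self, label):
        self.label = label

class outer(object):
    """ A made-up container class whose child member is itself an object. """
    def __init__(self, id, child):
        self.id = id
        self.child = child

    def __repr__(self):
        # default= tells json.dumps what to do with members it cannot
        # serialize directly: fall back to their __dict__
        return json.dumps(self.__dict__, indent=2,
                          default=lambda member: member.__dict__)

print repr(outer(1, inner("nested")))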
The TV in the kitchen has long had a Mac Mini attached to one of its inputs. We used it to watch Youtube videos, listen to music from iTunes and Google Music, to browse the web, to show photographs from our trips, and so on.
Sadly, the little Mini passed away earlier this year, refusing to power up. When we priced out replacement machines we discovered that the new Minis were a lot more expensive, even if at the same time more capable.
Given that we were not planning to store lots of data on the machine, we decided to leverage the lessons we had learned from building our little collection of NUC servers and design and build a small desktop on one of the NUC engines. We conducted some research and selected a machine sporting an i3 processor. The parts list we ended up with was:
That brought the total expense to $297.73, substantially cheaper than the more highly configured i5-based servers that we described in a previous post.
We ordered the parts from Amazon and they arrived a few days later.
The next step was to get the BIOS patches needed for the machine and an install image.
The new BIOS image came from the Intel site. Note that the BIOS for the DYE line is different from that in the i5-based WYK line that we used for the servers. The BIOS patch that we downloaded is named gk0054.bio and we found it on an Intel page (easier to find with a search engine than with the Intel site navigation tools, but easy either way).
The Ubuntu desktop image is on the Ubuntu site … they ask you for a donation (give one if you can afford it, please).
The by-now-familiar steps to create an installable image on a USB flash drive are:
Where /dev/disk2 and /dev/rdisk2 are identified from examination of the output of the diskutil list call.
That done, we recorded the MAC address from the NUC packaging and updated our DHCP and DNS configurations so that the machine would get its host name and IP address from our infrastructure.
A couple of important differences between building a desktop and a server:
We added the WiFi and Bluetooth network card to the machine. We did not use the WiFi capability, since we were installing the machine in a location with good hard-wired Ethernet connectivity, but we did plan to use a Bluetooth keyboard and mouse on the machine.
The desktop install image for Ubuntu 14.04 is big, about 1/3 larger than the server image. The first device we used for the install was the same 1G drive that I had used for my initial server installs, before I got the network install working. What we didn’t realize, and dd did not tell us, is that the image was too big for the 1G drive. When we tried to do the install the first time we got a cryptic error message from the BIOS. It took us a while, stumbling around in the dark, to realize that the install image was too big for the drive we were using. After we rebuilt the install image on a 32G drive we had in a drawer, the install proceeded without error.
After the installation completed we had trouble getting the Bluetooth keyboard and mouse to work well. The machine ultimately paired with the keyboard, but we could not get input to it.
We then thought back on some of the information we’d seen for our earlier NUC research and verified that the machine actually has an integrated antenna. We opened up the case and found the antenna wires, which we connected to the wireless card as shown in this picture:
Shortly after, we were logged on to the machine. We installed Chrome, connected up to a Google Music library, and within a few minutes were playing music as background to a photo slide show.
The only remaining problem is that the Apple Wireless Trackpad that we’re using seems to regularly stop talking to the machine. The pointer freezes and we’re left using the tab key to navigate the fields of the active window.
The content of “cat /proc/cpuinfo” is actually four copies of this, with small variations in the core id (ranging between 0 and 1), the processor (ranging between 0 and 3), and the apicid (ranging from 0 to 3).
In order to add this information to my sysinfo.py I wrote a new module, cpuinfo.py, modeled on the df.py module that I used to add filesystem information.
""" Parse the content of /proc/cpuinfo and create JSON objects for each cpu
Written by Marc Donner
$Id: cpuinfo.py,v 1.7 2014/11/06 18:25:30 marc Exp marc $
"""
import subprocess
import json
import re
def main():
"""Main routine"""
print CPUInfo().to_json()
return
# Utility routine ...
#
# The /proc/cpuinfo content is a set of (attribute, value records)
# the separator between attribute and value is "/t+: "
#
# When there are multiple CPUs, there's a blank line between sets
# of lines.
#
class CPUInfo(object):
""" An object with key data from the content of the /proc/cpuinfo file """
def __init__(self):
self.cpus = {}
self.populated = False
def to_json(self):
""" Display the object as a JSON string (prettyprinted) """
if not self.populated:
self.populate()
return json.dumps(self.cpus, sort_keys=True, indent=2)
def get_array(self):
""" return the array of cpus """
if not self.populated:
self.populate()
return self.cpus["processors"]
def populate(self):
""" get the content of /proc/cpuinfo and populate the arrays """
self.cpus["processors"] = []
cpu = {}
cpu["processor"] = {}
text = str(subprocess.check_output(["cat", "/proc/cpuinfo"])).rstrip()
lines = text.split('\n')
# Use re.split because there's a varying number of tabs :-(
array = [re.split('\t+: ', x) for x in lines]
# cpuinfo is structured as n blocks of data, one per logical processor
# o each block has the processor id (0, 1, ...) as its first row.
# o each block ends with a blank row
# o some of the rows have attributes but no values
# (e.g. power_management)
for row in range(0, len(array[:])):
# New processor detected - attach this one to the output, then
if len(lines[row]) == 0:
# create a new processor
self.cpus["processors"].append(cpu)
cpu = {}
cpu["processor"] = {}
if len(array[row]) == 2:
(attribute, value) = array[row]
attribute = attribute.replace(" ", "_")
cpu["processor"][attribute] = value
self.cpus["processors"].append(cpu)
self.populated = True
if __name__ == '__main__':
main()
The state machine implicit in the main loop of populate() is plausibly efficient, though there remains something about it that annoys me. I need to think about edge cases and failure modes to see whether I can make it better.
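One possible simplification, sketched as a standalone function rather than a tested drop-in replacement for populate(): split the text into per-processor blocks at the blank lines first, then split each block into (attribute, value) pairs.

import re

def read_cpuinfo(path="/proc/cpuinfo"):
    """ Build the same {"processors": [...]} structure that populate() does,
        but by splitting the file into per-processor blocks first. """
    cpus = {"processors": []}
    text = open(path).read().rstrip()
    for block in text.split("\n\n"):
        cpu = {"processor": {}}
        for line in block.split("\n"):
            fields = re.split("\t+: ", line)
            if len(fields) == 2:
                attribute, value = fields
                cpu["processor"][attribute.replace(" ", "_")] = value
        cpus["processors"].append(cpu)
    return cpus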
The result is an augmented json object including info on the logical processors:
I am tempted to augment the module with a configuration capability that would let me set sysinfo up to restrict the set of data from /proc/cpuinfo that I actually include in the sysinfo structure. Do I need “fpu” and “fpu_exception” or “clflush_size” for the things that I will be using the sysinfo stuff for? I’m skeptical. If I make it a configurable filter I can always incorporate data elements later, after I decide they’re interesting.
Decisions, decisions.
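For what it’s worth, here is a minimal sketch of such a filter; KEEP is a hypothetical configuration set and filter_cpu() is not part of the current cpuinfo.py:

# hypothetical whitelist of /proc/cpuinfo attributes worth keeping
KEEP = set(["processor", "model_name", "cpu_MHz", "cache_size",
            "core_id", "apicid", "siblings", "cpu_cores"])

def filter_cpu(cpu, keep=KEEP):
    """ Return a copy of one processor record restricted to the keys in keep. """
    attributes = cpu["processor"]
    return {"processor": dict((key, value)
                              for key, value in attributes.items()
                              if key in keep)}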
Moreover, the multiple repetition of the CPU information is annoying. The four attributes that vary are processor, core id, apicid, and initial apicid. The values are structured thus (initial apicid seems never to differ from apicid):
processor    core id    apicid
    0           0          0
    1           1          2
    2           0          1
    3           1          3
It would be much more sensible to reduce the size and complexity of the processors section by consolidating the common parts and displaying the variant sections in some sensible subsidiary fashion.
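Here is a sketch of one way to do that consolidation; consolidate() is a hypothetical helper, not something in cpuinfo.py today. It keeps a single copy of the attributes that every logical processor shares and a small per-processor list of the ones that vary.

def consolidate(processors):
    """ Split a list of per-processor dicts (as returned by
        CPUInfo.get_array()) into a 'common' block and a 'per_processor'
        list holding only the attributes whose values differ. """
    records = [p["processor"] for p in processors]
    keys = set(records[0])
    varying = set(key for key in keys
                  if len(set(record.get(key) for record in records)) > 1)
    common = dict((key, records[0][key]) for key in keys - varying)
    per_processor = [dict((key, record.get(key)) for key in varying)
                     for record in records]
    return {"common": common, "per_processor": per_processor}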
So I’m adding more capabilities to my sysinfo.py program. The next thing that I want to do is get a JSON result from df. This is a function whose description, from the man page, says “report file system disk space usage”.
Here is a sample of the output of df for one of my systems:
So I started by writing a little Python program that used the subprocess.check_output() method to capture the output of df.
This went through various iterations and ended up with this single line of python code, which requires eleven lines of comments to explain it:
#
# this next line of code is pretty tense ... let me explain what
# it does:
# subprocess.check_output(["df"]) runs the df command and returns
# the output as a string
# rstrip() trims off the last whitespace character, which is a '\n'
# split('\n') breaks the string at the newline characters ... the
# result is an array of strings
# the list comprehension then applies shlex.split() to each string,
# breaking each into tokens
# when we're done, we have a two-dimensional array with rows of
# tokens and we're ready to make objects out of them
#
df_array = [shlex.split(x) for x in
            subprocess.check_output(["df"]).rstrip().split('\n')]
My original df.py code constructed the JSON result manually, a painfully finicky process. After I got it running I remembered a lesson I learned from my dear friend the late David Nochlin, namely that I should construct an object and then use a rendering library to create the JSON serialization.
So I did some digging around and discovered that the Python json library includes a fairly sensible serialization method that supports prettyprinting of the result. The result was a much cleaner piece of code:
# df.py
#
# parse the output of df and create JSON objects for each filesystem.
#
# $Id: df.py,v 1.5 2014/09/03 00:41:31 marc Exp $
#
# now let's parse the output of df to get filesystem information
#
# Filesystem 1K-blocks Used Available Use% Mounted on
# /dev/mapper/flapjack-root 959088096 3799548 906569700 1% /
# udev 1011376 4 1011372 1% /dev
# tmpfs 204092 288 203804 1% /run
# none 5120 0 5120 0% /run/lock
# none 1020452 0 1020452 0% /run/shm
# /dev/sda1 233191 50734 170016 23% /boot

import subprocess
import shlex
import json

def main():
    """Main routine - call the df utility and return a json structure."""
    # this next line of code is pretty tense ... let me explain what
    # it does:
    # subprocess.check_output(["df"]) runs the df command and returns
    # the output as a string
    # rstrip() trims off the last whitespace character, which is a '\n'
    # split('\n') breaks the string at the newline characters ... the
    # result is an array of strings
    # the list comprehension then applies shlex.split() to each string,
    # breaking each into tokens
    # when we're done, we have a two-dimensional array with rows of
    # tokens and we're ready to make objects out of them
    df_array = [shlex.split(x) for x in
                subprocess.check_output(["df"]).rstrip().split('\n')]
    df_num_lines = df_array[:].__len__()
    df_json = {}
    df_json["filesystems"] = []
    for row in range(1, df_num_lines):
        df_json["filesystems"].append(df_to_json(df_array[row]))
    print json.dumps(df_json, sort_keys=True, indent=2)
    return

def df_to_json(tokenList):
    """Take a list of tokens from df and return a python object."""
    # If df's output format changes, we'll be in trouble, of course.
    # the 0 token is the name of the filesystem
    # the 1 token is the size of the filesystem in 1K blocks
    # the 2 token is the amount used of the filesystem
    # the 5 token is the mount point
    result = {}
    fsName = tokenList[0]
    fsSize = tokenList[1]
    fsUsed = tokenList[2]
    fsMountPoint = tokenList[5]
    result["filesystem"] = {}
    result["filesystem"]["name"] = fsName
    result["filesystem"]["size"] = fsSize
    result["filesystem"]["used"] = fsUsed
    result["filesystem"]["mount_point"] = fsMountPoint
    return result

if __name__ == '__main__':
    main()
which, in turn, produces a rather nice df output in JSON.
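For example, the /dev/sda1 line from the sample in the comments above comes out as an entry along these lines:

{
  "filesystem": {
    "mount_point": "/boot",
    "name": "/dev/sda1",
    "size": "233191",
    "used": "50734"
  }
}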
Now I have four machines. Keeping them in sync is the challenge. Worse yet, knowing whether they are in sync or out of sync is a challenge.
So the first step is to make a tool to inventory each machine. In order to use the inventory utility in a scalable way, I want to design it to produce machine-readable results so that I can easily incorporate them into whatever I need.
What I want is a representation that is both friendly to humans and to computers. This suggests a self-describing text representation like XML or JSON. After a little thought I picked JSON.
What sorts of things do I want to know about the machine? Well, let’s start with the hardware and the operating system software, plus things like the quantity of RAM and other system resources. Some of that information is available from uname and the rest is available from the sysinfo(2) function.
To get the information from the sysinfo(2) function I had to do several things:
Install sysinfo on each machine
sudo apt-get install sysinfo
Write a little program to call sysinfo(2) and report out the results
getSysinfo.c
Of course this program, getSysinfo.c, is quick and dirty: the error handling is almost nonexistent, and I ought to have generalized the mechanism to work from a data structure that pairs each flag with its attribute name rather than the clumsy sequence of if statements.
/*
 * getSysinfo.c
 *
 * $Id: getSysinfo.c,v 1.4 2014/08/31 17:29:43 marc Exp $
 *
 * Started 2014-08-31 by Marc Donner
 *
 * Using the sysinfo(2) call to report on system information
 *
 */

#include <stdio.h>        /* for printf */
#include <stdlib.h>       /* for exit */
#include <unistd.h>       /* for getopt */
#include <sys/sysinfo.h>  /* for sysinfo */

int showHelp();

int main(int argc, char **argv) {

    /* Call the sysinfo(2) system call with a pointer to a structure */
    /* and then display the results */

    struct sysinfo toDisplay;
    int rc;

    if ( rc = sysinfo(&toDisplay) ) {
        printf(" rc: %d\n", rc);
        exit(rc);
    }

    int c;
    int opt_a = 0;
    int opt_b = 0;
    int opt_f = 0;
    int opt_g = 0;
    int opt_h = 0;
    int opt_m = 0;
    int opt_r = 0;
    int opt_s = 0;
    int opt_u = 0;
    int opt_w = 0;
    int opt_help = 0;
    int opt_none = 1;

    while ( (c = getopt(argc, argv, "abfghmrsuw?")) != -1) {
        opt_none = 0;
        switch (c) {
            case 'a':
                opt_a = 1;
                break;
            case 'b':
                opt_b = 1;
                break;
            case 'f':
                opt_f = 1;
                break;
            case 'g':
                opt_g = 1;
                break;
            case 'h':
                opt_h = 1;
                break;
            case 'm':
                opt_m = 1;
                break;
            case 'r':
                opt_r = 1;
                break;
            case 's':
                opt_s = 1;
                break;
            case 'u':
                opt_u = 1;
                break;
            case 'w':
                opt_w = 1;
                break;
            case '?':
                opt_help = 1;
                break;
        }
    }

    if ( opt_none || opt_help ) {
        showHelp();
        return 100;
    } else {
        if ( opt_u || opt_a ) { printf(" \"uptime\": %lu\n", toDisplay.uptime); }
        if ( opt_r || opt_a ) { printf(" \"totalram\": %lu\n", toDisplay.totalram); }
        if ( opt_f || opt_a ) { printf(" \"freeram\": %lu\n", toDisplay.freeram); }
        if ( opt_b || opt_a ) { printf(" \"bufferram\": %lu\n", toDisplay.bufferram); }
        if ( opt_s || opt_a ) { printf(" \"sharedram\": %lu\n", toDisplay.sharedram); }
        if ( opt_w || opt_a ) { printf(" \"totalswap\": %lu\n", toDisplay.totalswap); }
        if ( opt_g || opt_a ) { printf(" \"freeswap\": %lu\n", toDisplay.freeswap); }
        if ( opt_h || opt_a ) { printf(" \"totalhigh\": %lu\n", toDisplay.totalhigh); }
        if ( opt_m || opt_a ) { printf(" \"mem_unit\": %d\n", toDisplay.mem_unit); }
        return 0;
    }
}

int showHelp() {
    printf( "Syntax: getSysinfo [options]\n" );
    printf( "\nDisplay results from the sysinfo(2) result structure\n\n" );
    printf( "Options:\n" );
    printf( " -b : bufferram\n" );
    printf( " -f : freeram\n" );
    printf( " -g : freeswap\n" );
    printf( " -h : totalhigh\n" );
    printf( " -m : mem_unit\n" );
    printf( " -r : totalram\n" );
    printf( " -s : sharedram\n" );
    printf( " -u : uptime\n" );
    printf( " -w : totalswap\n\n" );
    printf( "getSysinfo also accepts arbitrary combinations of permitted options." );
    return 100;
}
And with this in place, the Python program sysinfo.py that pulls together the various other bits and pieces becomes possible:
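A rough sketch of the idea (not the original sysinfo.py, and assuming the getSysinfo binary above sits in the current directory):

#!/usr/bin/python
""" sysinfo sketch - combine uname and getSysinfo output into one JSON
    object.  This is an illustrative sketch, not the original sysinfo.py.
"""

import json
import subprocess

def main():
    info = {}
    # one line of kernel and hardware identification
    info["uname"] = subprocess.check_output(["uname", "-a"]).rstrip()
    # getSysinfo -a emits lines like '  "uptime": 12345'; wrapping them in
    # braces and joining with commas yields a parseable JSON object
    raw = subprocess.check_output(["./getSysinfo", "-a"]).rstrip()
    info["sysinfo"] = json.loads("{" + ",".join(raw.split("\n")) + "}")
    print json.dumps(info, sort_keys=True, indent=2)

if __name__ == '__main__':
    main()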
Notice the little trick with the Makefile variables HOST, HOSTS, SSH_FILES, PUSH_HOSTS, and PUSH_FILES that lets one host push to the others for distributing the code but lets it call on all of the hosts when gathering data.
With all of this machinery in place and distributed to all of the UNIX machines in my little network, I was now able to type ‘make ssh’ and get the resulting output:
Well, my nice DNS service with two secondaries and a primary is all well and good, but my logs are now scattered across three machines. If I want to play with the stats or diagnose a problem or see when something went wrong, I now have to grep around on three different machines.
Obviously I could consolidate the logs using syslog. That’s what it’s designed for, so why don’t I do that? Let’s see what I have to do to make that work properly:
Set up rsyslogd on flapjack to properly stash the DNS messages
Set up DNS on flapjack to log to syslog
Set up the rsyslogd service on flapjack to receive syslog messages over the network
Set up rsyslog on waffle to forward dns log messages to flapjack
Set up rsyslog on pancake to forward dns log messages to flapjack
Set up the DNS secondary configurations to use syslog instead of local logs
Distribute the updates and restart the secondaries
Test everything
A side benefit of using syslog to accumulate my dns logs is that they’ll now be timestamped so I can do more sophisticated data analysis if I ever get a Round Tuit.
Here’s the architecture of the setup I’m going to pursue:
So the first step is to set up the primary DNS server on flapjack to write to syslog. This has several parts:
Declare a “facility” in syslog that DNS can write to. For historical reasons (Hi, Eric!) syslog has a limited number of separate facilities that can accumulate logs. The configuration file links sources to facilities, allowing the configuration master to do various clever filtering of the log messages that come in.
Tell DNS to log to the “facility”
Restart both bind9 and rsyslogd to get everything working.
The logging for Bind9 is specified in a file called /etc/bind/named.conf.local. The default setup involves appending log records to a file named /var/log/named/query.log.
We’ll keep using that file for our logs going forward, since some other housekeeping knows about that location and no one else is intent on interfering with it.
I have decided to use the facility named local6 for DNS.
In order to make the rsyslogd daemon on flapjack listen to messages from DNS, I have to declare the facility active.
The syslog service on flapjack is provided by a server called rsyslogd. It’s an alternative to the other two mainstream syslog products – syslog-ng and sysklogd. I picked rsyslogd because it comes as the standard logging service on Ubuntu 12.04 and 14.04, the distros I am using in my house. You might call me lazy, you might call me pragmatic, but don’t call me late for happy hour.
In order to make rsyslogd do what I need, I have to take control of the management of two configuration files: /etc/rsyslog.conf and /etc/rsyslog.d/50-default.conf. As is my wont, I do this by creating a project directory ~/projects/r/rsyslog/ with a Makefile and the editable versions of the two files under RCS control. Here’s the Makefile:
Actually, this Makefile ends up in ~/projects/r/rsyslog/flapjack, since waffle and pancake will end up with different rsyslogd configurations and I separate the different control directories this way.
In order to log using syslog I need to define a facility, local6, in the 50-default.conf file. The new assertion looks like this:
local6.* -/var/log/named/query.log
With a restart of each of the appropriate daemons, we’re off to the races and the new logs appear in the log file. I needed to change the ownership of /var/log/named/query.log from bind to syslog so that the new writer would be able to write, but that was the work of a moment.
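As a quick sanity check that the local6 plumbing works, independent of bind9, a couple of lines of Python (purely a hypothetical test, not part of the setup) can write a message to the facility; it should then show up in /var/log/named/query.log:

import syslog

# open the log with the local6 facility, the one the bind9 channel uses
syslog.openlog("local6-test", 0, syslog.LOG_LOCAL6)
syslog.syslog(syslog.LOG_INFO, "test message for the local6 facility")
syslog.closelog()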
Now comes the task of making the logs from the two secondary DNS servers go across the network to flapjack. This involved a lot of little bits and pieces.
First of all, I had to tell the rsyslogd daemon on flapjack to listen on the rsyslog UDP port. I could have turned on the more reliable TCP logging facility or the even more reliable queueing facility, but let’s get real. These are DNS query logs we’re talking about. I don’t really care if some of them fall on the floor. And anyway, the traffic levels on donner.lan are so low that I’d be very surprised if the loss rate were significant.
To turn on UDP listening on flapjack all I had to do was uncomment two lines in the /etc/rsyslog.conf file:
One more restart of rsyslogd on flapjack and we’re good to go.
The next step is to make the DNS name service on waffle and pancake send their logs to the local6 facility. In addition, I had to set up rsyslog on waffle and pancake with a local6 facility, though this time the facility has to know to send the logs across to flapjack by UDP rather than writing locally.
The change to the named.conf.local file for waffle and pancake’s DNS secondary service was identical to the change to flapjack’s primary service, so kudos to the designers of bind9 and syslogd for good modularization.
To make waffle and pancake forward their logs over to flapjack required that the /etc/rsyslog.d/50-default.conf file define local6 in this way:
local6.* @syslog
Notice that the @ tells rsyslogd to forward the local6 logs via UDP. I could have put the IP address of flapjack right after the @, or I could have put in flapjack itself. Instead, I created a DNS entry for a service host named syslog … it happens to have the same IP address as flapjack, but it gives me a level of indirection should I ever want to relocate the syslog service to another host.
With a restart of rsyslogd and bind9 on both waffle and pancake, we are up and running. All DNS logs are now consolidated on a single host, namely flapjack.
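A similar hypothetical check, run from waffle or pancake, can exercise the UDP forwarding path directly by sending one message to local6 on the syslog service host:

import logging
import logging.handlers

# point at the 'syslog' service alias (currently flapjack) over UDP port 514,
# using the same local6 facility that the bind9 logs use
handler = logging.handlers.SysLogHandler(
    address=("syslog", 514),
    facility=logging.handlers.SysLogHandler.LOG_LOCAL6)
logger = logging.getLogger("local6-test")
logger.addHandler(handler)
logger.warning("test message forwarded to the central syslog host")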
Well, I now have four different UNIX machines and I’ve been doing sysadmin tasks on all of them. As a result I now have four home directories that are out of sync.
How annoying.
Ultimately I plan to create a file server on one of my machines and provide the same home directory on all of them, but I haven’t done that yet, so I need some temporary crutches to tide me over until I get the file server built. In particular, I need to find out what is where.
The first thing I did was establish trust among the machines, making flapjack, the oldest, into the ‘master’ trusted by the others. This I did by creating an SSH private key using ssh-keygen on the master and putting the matching public key in .ssh/authorized_keys on the other machines.
Then I decided to automate the discovery of what directories were on which machine. This is made easier because of my personal trick for organizing files, namely to have a set of top level subdirectories named org/, people/, and projects/ in my home directory. Each of these has twenty-six subdirectories named a through z, with appropriately named subdirectories under them. This I find helps me put related things together. It is not an alternative to search but rather a complement.
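A throwaway helper along these lines (hypothetical, not part of any of the tools described here) will lay down that skeleton in a fresh home directory:

import os
import string

# create org/, people/, and projects/, each with a/ through z/ beneath it
for top in ("org", "people", "projects"):
    for letter in string.ascii_lowercase:
        path = os.path.join(os.path.expanduser("~"), top, letter)
        if not os.path.isdir(path):
            os.makedirs(path)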
Anyway, the result is that I could build a Makefile that automates reaching out to all of my machines and gathering information. Here’s the Makefile:
# $Id: Makefile,v 1.7 2014/07/04 18:57:44 marc Exp marc $

FORCE = force
HOSTS = flapjack frenchtoast pancake waffle
FILES = Makefile

checkin: ${FORCE}
	ci -l ${FILES}

uname: ${FORCE}
	for h in ${HOSTS}; \
	do ssh $$h uname -a \
	    | sed -e 's/^/'$$h': /'; \
	done

host_find: ${FORCE}
	echo > host_find.txt
	for h in ${HOSTS}; \
	do ssh $$h find -print \
	    | sed -e 's/^/'$$h': /' \
	    >> host_find.txt; done

clusters.txt: host_find.txt
	sed -e 's|\(/[^/]*/[a-z]/[^/]*\)/.*$$|\1|' host_find.txt \
	    | uniq -c \
	    | grep -v '^ *1 ' \
	    > clusters.txt

force:
Ideally, of course, I’d get the list of host names in the variable HOSTS from my configuration database, but having neglected to build one yet, I am just listing my machines by name there.
The first important target host_find does an ssh to all of the machines, including itself, and runs find, prefixing the host name on each line so that I can determine which files exist on which machine. This creates a file named host_find.txt which I can probably dispense with now that the machinery is working.
The second important target, clusters.txt, passes the host_find.txt output through a sed script. This sed script does a rather careful substitution of patterns like /org/z/zodiac/blah-blah-blah with /org/z/zodiac. The pipe through uniq -c then counts up the number of identical path prefixes. That’s fine, but there are lots of subdirectories like /org/f that are empty and I don’t want them cluttering up my result, so the grep -v '^ *1 ' pipe segment excludes the lines with a count of 1.
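Rendered in Python for comparison (a hypothetical equivalent, not what the Makefile actually runs), the same computation looks like this:

import re
from itertools import groupby

PREFIX = re.compile(r'(/[^/]*/[a-z]/[^/]*)/.*$')

def clusters(lines):
    """ Truncate each 'host: ./top/letter/name/...' line to its prefix,
        count consecutive repeats (like uniq -c), and keep only the
        prefixes that occur more than once (like grep -v '^ *1 '). """
    truncated = [PREFIX.sub(r'\1', line.rstrip('\n')) for line in lines]
    for prefix, group in groupby(truncated):
        count = len(list(group))
        if count > 1:
            yield count, prefix

if __name__ == '__main__':
    with open('host_find.txt') as handle:
        for count, prefix in clusters(handle):
            print '%7d %s' % (count, prefix)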
The result of running that tonight is the following report:
And … voila! I have a map that I can use to figure out how to consolidate the many scattered parts of my home directory.
[2014-07-04 – updated the Makefile so that it is more friendly to web browsers.]
[2014-07-29 – a friend of mine critiqued my Makefile code and pointed out that gmake has powerful iteration functions of its own, eliminating the need for me to incorporate shell code in my targets. The result is quite elegant, I must say!]
#
# Find out what files exist on all of the hosts on donner.lan
# Started in June 2014 by Marc Donner
#
# $Id: Makefile,v 1.12 2014/07/30 02:07:07 marc Exp $
#

FORCE = force

# This ought to be the result of a call to the CMDB
HOSTS = flapjack frenchtoast pancake waffle
FILES = Makefile host_find.txt clusters.txt

#
# This provides us with the ISO 8601 date (YYYY-MM-DD)
#
DATE := $(shell /bin/date +"%Y-%m-%d")

help: ${FORCE}
	cat Makefile

checkin: ${FORCE}
	ci -l ${FILES}

# A finger exercise to ensure that we can see the base info on the hosts
HOSTS_UNAME := $(HOSTS:%=.%_uname.txt)

uname: ${HOSTS_UNAME}
	cat ${HOSTS_UNAME}

.%_uname.txt: ${FORCE}
	ssh $* uname -a | sed -e 's/^/:'$*': /' > $@

HOSTS_UPTIME := $(HOSTS:%=.%_uptime.txt)

uptime: ${HOSTS_UPTIME}
	cat ${HOSTS_UPTIME}

.%_uptime.txt: ${FORCE}
	ssh $* uptime | sed -e 's/^/:'$*': /' > $@

# Another finger exercise to verify the location of the ssh landing
# point home directory
HOSTS_PWD := $(HOSTS:%=.%_pwd.txt)

pwd: ${HOSTS_PWD}
	cat ${HOSTS_PWD}

.%_pwd.txt: ${FORCE}
	ssh $* pwd | sed -e 's/^/:'$*': /' > $@

# Run find on all of the ${HOSTS} and prefix mark all of the results,
# accumulating them all in host_find.txt
HOSTS_FIND := $(HOSTS:%=.%_find.txt)

find: ${HOSTS_FIND}

.%_find.txt: ${FORCE}
	echo '# ' ${DATE} > $@
	ssh $* find -print | sed -e 's/^/:'$*': /' >> $@

# Get rid of the empty directories and report the number of files in each
# non-empty directory
clusters.txt: ${HOSTS_FIND}
	cat ${HOSTS_FIND} \
	    | sed -e 's|\(/[^/]*/[a-z]/[^/]*\)/.*$$|\1|' \
	    | uniq -c \
	    | grep -v '^ *1 ' \
	    | sort -t ':' -k 3 \
	    > clusters.txt

force: