Disclaimer: don't get the wrong idea about what you've found here.

What appears below are my personal notes, which I wish were part of my long-term memory but don't always seem to fit. I strive for accuracy and clarity and appreciate feedback. If applying any of this information anywhere, confirm for yourself the correctness of your work, as what you see below might very well be, albeit unintentionally, incorrect or misleading. These notes are here as an easy reference for myself.

Information worthy of a more formal presentation will appear elsewhere than this "Scratch" area. - ksb


Linux Cluster Notes

The following are my Linux cluster notes, gathered while taking over ownership of a small (6 node) cluster running RedHat ES 4. The structure: one gateway (or access) node, with 5 nodes behind it.

Table of Contents
  1. Networking (Multi-homed)
  2. up2date
  3. DNS, bind, named, chroot
  4. NTP (Network Time Protocol)
  5. SSH
  6. sudo
  7. System Services
  8. locate / updatedb

References
  • None yet...

Networking (Multi-homed)

Each of the nodes has 2 network interfaces. The only one that really needs both is the access node, but the rest had them already, so I'm using them. On each, one interface is for the internal 10.X.X.X network and the other is the external interface to the Internet. For all but the access node I leave the external interface down. Here is what the 3 relevant files look like on each internal node:

$ cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=clust-control
NOZEROCONF=yes # No need for a 169.254.0.0/16 route

$ cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
BOOTPROTO=static
BROADCAST=10.1.255.255
HWADDR=00:12:79:D6:92:41
IPADDR=10.1.2.12
NETMASK=255.255.0.0
NETWORK=10.1.0.0
GATEWAY=10.1.2.10
ONBOOT=yes
TYPE=Ethernet

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=static
BROADCAST=131.243.2.255
HWADDR=00:12:79:D6:92:40
IPADDR=131.243.2.107
NETMASK=255.255.255.0
NETWORK=131.243.2.0
GATEWAY=131.243.2.1
ONBOOT=no
TYPE=Ethernet
The access node looks slightly different in that both interfaces have ONBOOT=yes, and there is a single GATEWAY=131.243.2.1 setting in /etc/sysconfig/network rather than in the ifcfg-eth? files.
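
For example, the access node's /etc/sysconfig/network would look something like this sketch (the hostname is assumed, taken from the NTP section below):

$ cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=splts-access
GATEWAY=131.243.2.1
NOZEROCONF=yes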

To bring an interface up or down use the ifup eth1 or ifdown eth1 commands, checking which interfaces are up with the ifconfig command (which, with the -a option, will show down interfaces as well). I found that it is better to reset the entire network using service network restart rather than using ifdown eth1, because otherwise the default route gets lost.

# service network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Setting network parameters:                                [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]

up2date

I'm using RHN's (Red Hat Network's) up2date program to keep these machines, well, up-to-date with the RHN. The first time it is run, you are prompted to enter account information to connect to the RHN for updates; this is very straightforward. Details of the configuration can be seen and set using the up2date-config command, and the settings are kept in /etc/sysconfig/rhn/up2date.

When checking for updates, I run up2date as shown below. (First, though, on the nodes where the external network interface is down, I bring it up so the machine can connect to the RHN.) Check what could be updated by using the --dry-run option:

# up2date --nox -u --dry-run

Fetching Obsoletes list for channel: rhel-i386-es-4...

Fetching rpm headers...
########################################

Name                                    Version        Rel     
----------------------------------------------------------
...
Either the list of RPMs that could be upgraded follows, or the line:
All packages are currently up to date
Kernel upgrades are disabled by default (as can be seen using up2date-config); to force a kernel upgrade, add the -f option to the above up2date command.
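
Putting that together, a run that actually applies all updates, including kernel packages, would look something like this:

# up2date --nox -u -f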

To see the entire list of RPMs, use the --showall option, which you will probably want to pipe through grep or less:

# up2date --nox --showall | grep subversion
subversion-1.1.4-2.ent.i386
subversion-devel-1.1.4-2.ent.i386
subversion-perl-1.1.4-2.ent.i386

DNS, bind, named, chroot

The access node is the name server for the rest of the nodes in the cluster. As such I installed the bind-chroot RPM (which brings in the bind RPM). This makes the named service (aka bind) run in a chroot (aka jail) as the named user under /var/named/chroot. This is a security precaution: an exploit in bind will be limited to what the named user can do inside that jail.

The files used by named all live in the jail (under /var/named/chroot). Since I was moving an existing bind config from one machine to another, I just started up named (via /etc/init.d/named {start|stop|status|restart}) and watched /var/log/messages for complaints about missing files (via tail /var/log/messages). Eventually I ended up dropping all the zone files in /var/named/chroot/var/named. After that, I enabled the named service, as described in the Services section.
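
For reference, a zone declaration in the chroot'd config (/var/named/chroot/etc/named.conf) looks something like this sketch; the zone name matches the search domain used below, and the zone file name here is just an example:

zone "splts-cluster" IN {
        type master;
        file "splts-cluster.zone";  // relative to named's working directory, i.e. /var/named/chroot/var/named
};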

In order for each of the cluster nodes to use the DNS server on the access node, their /etc/resolv.conf files need to point them to it:

$ cat /etc/resolv.conf 
search splts-cluster
nameserver 10.1.2.10

NTP (Network Time Protocol)

The access node is also the NTP server for the rest of the nodes in the cluster. I've added the following into /etc/ntp.conf on the access node:

server chronos1.lbl.gov
restrict chronos1.lbl.gov mask 255.255.255.255 nomodify notrap noquery
restrict 10.1.2.0 mask 255.255.255.0 nomodify notrap
restrict 10.1.3.0 mask 255.255.255.0 nomodify notrap
The first two lines specify who this access node should use for NTP services, and the last two allow members of those networks to contact this machine for NTP service, but not to modify the server's configuration (nomodify) or use the control-message trap service (notrap).

On each of the cluster nodes:

# tail -2 /etc/ntp.conf 
server splts-access.splts-cluster prefer
restrict splts-access.splts-cluster mask 255.255.255.255 nomodify notrap noquery

# cat /etc/ntp/ntpservers 
splts-access.splts-cluster

# cat /etc/ntp/step-tickers 
splts-access.splts-cluster

A handy command to see what is up with NTP on a given node is ntpstat:

# ntpstat 
synchronised to NTP server (10.1.2.10) at stratum 3 
   time correct to within 91 ms
   polling server every 1024 s
This shows that the machine is synchronized to the given server and how accurate its clock is with respect to that server's clock. NOTE: this command (or rather NTP itself) can take a few minutes to synchronize with the server, so don't be impatient if it comes back saying unsynchronised after getting what you think should be a correct configuration.
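
For more detail than ntpstat gives, the ntpq -p command lists the peers the daemon knows about, along with stratum, polling interval, delay and offset; the peer marked with a leading "*" is the one currently selected for synchronization:

# ntpq -p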

SSH

A common cluster security model is one where, once you have authenticated yourself on the access node, you are trusted on the rest of the cluster as that user. This is done using ssh keys, but to tighten up the access node, I add the following in /etc/ssh/sshd_config:

Protocol 2
PermitRootLogin no
DenyUsers cluster
The first line disables the ssh v1 protocol (which can be verified via telnet node 22 and looking at the first line printed by the server: it should say "2.0", not "1.99"). The next line disables ssh logins by root (always a good idea). The last one is only on the access node and disables logins by the cluster user.
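
The key setup itself is just standard ssh public-key authentication. A minimal sketch for the cluster user, assuming its home directory is shared (or copied) across the nodes:

$ ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa    # passphrase-less key pair
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys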

sudo

In order to allow users of a given group to run sudo su - cluster and become the cluster user without being prompted for a password, add the following lines to /etc/sudoers using the visudo command:

Defaults logfile=/var/log/sudo.log

# Let members of the cluster group sudo su - cluster without a password
%cluster   ALL = NOPASSWD: /bin/su - cluster
That first line just turns on sudo logging to /var/log/sudo.log. The last two lines enable password-free sudo su - cluster for members of the cluster group.
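
With that in place, a member of the cluster group becomes the cluster user without a password prompt:

$ sudo su - cluster
$ whoami
cluster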

System Services

To check which system services are on for which run levels, use the chkconfig command. For example, chkconfig --list lists all the services and their states. Or for a specific service like NTP:

# chkconfig --list ntpd
ntpd            0:off   1:off   2:off   3:off   4:off   5:off   6:off
# chkconfig --level 35 ntpd on
# chkconfig --list ntpd
ntpd            0:off   1:off   2:off   3:on    4:off   5:on    6:off
This checked that NTP was off for all run levels, then set it to be on for run levels 3 and 5, and then confirmed it. This really just manipulates symlinks under the /etc/rc.d/rc?.d dirs (checking, creating & removing them). To manually start, stop, check the status of, etc. a given service, you can run the real script in the /etc/init.d/ dir directly:
# /etc/init.d/named
Usage: /etc/init.d/named {start|stop|status|restart|condrestart|reload|probe}
# /etc/init.d/named status
number of zones: 10
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is OFF
server is up and running
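
Equivalently, the service command (used above for service network restart) runs those same init scripts:

# service named status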

To determine the default run level, look in the file /etc/inittab for the line that starts with "id:":

# egrep "^id:" /etc/inittab
id:5:initdefault:
This is usually 3 (network with no X) or 5 (network with X). To see the current run level use either the runlevel or who -r commands:
$ runlevel 
N 3
$ who -r
         run-level 3  Apr  5 11:52                   last=S
The who -r form is nice since it also shows you the time of the last run level change and the previous run level.

locate / updatedb

Apparently the generation of the "updatedb" database is not turned on by default on RedHat. To enable it, edit /etc/updatedb.conf and set DAILY_UPDATE=yes. Then run updatedb as root to generate the database initially.
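
Once the database exists, locate does a fast filename lookup, for example (using files from the Networking section above):

$ locate ifcfg-eth
/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/sysconfig/network-scripts/ifcfg-eth1
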
Keith S. Beattie is responsible for this document, located at http://dst.lbl.gov/~ksb/Scratch/Linux_cluster_notes.html, which is subject to LBNL's Privacy & Security Notice, Copyright Status and Disclaimers.

Last Modified: Monday, 25-Feb-2013 16:57:57 PST