Thursday, February 25, 2010

Panda Athena jobs crash with ls/rm/bash segfault

During January we had a number of issues getting our Panda analysis queue back on-line.
15.5.1 jobs kept crashing with a "rm segfault".
After a lot of tracking we managed to narrow it down to a very specific LD_PRELOAD and LD_LIBRARY_PATH combination that causes the core binaries to segfault.

This behavior only occurs within an 8-character window of LD_LIBRARY_PATH lengths, so the length of the experiment software path decides whether a site will see it.
The same problem occurs in CentOS and RHEL5 as well, so it's an upstream issue.

The ticket is in to Red Hat...

https://bugzilla.redhat.com/show_bug.cgi?id=563759

For the moment we have just modified our local config to append an extra, otherwise useless entry to LD_LIBRARY_PATH. Unfortunately CMT setup cleans the path of any entries which do not exist or do not contain libraries, so we created an extra directory in the software area and put an empty dummy.so file in it.
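The workaround can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `pad_library_path` and the directory passed to it are hypothetical, and our real change lives in the site's shell setup rather than in Python.

```python
import os


def pad_library_path(current: str, dummy_dir: str) -> str:
    """Append dummy_dir to a colon-separated library path.

    The directory is created if needed and gets an empty dummy.so, so that
    a cleanup pass which drops missing or library-less entries keeps it.
    """
    os.makedirs(dummy_dir, exist_ok=True)
    dummy_lib = os.path.join(dummy_dir, "dummy.so")
    # An empty file is enough here: it only has to exist, not be loadable.
    open(dummy_lib, "a").close()

    entries = [entry for entry in current.split(":") if entry]
    if dummy_dir not in entries:
        entries.append(dummy_dir)
    return ":".join(entries)
```

The returned string would then be exported as LD_LIBRARY_PATH before the job starts. Calling the function a second time with its own output leaves the path unchanged.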

Sunday, May 24, 2009

Reusable cfengine; modularising cfengine policy files

Cfengine is a great installation, configuration and maintenance tool for running a fleet of machines. It allows you to easily create groups (classes) of machines via a number of methods, then associate actions with each of these classes, giving fine-grained control over each class of host.

However, most howtos and examples of cfengine tend to skip over one of its most useful features: the import function. These examples tend to have all actions defined in a single file, which can get pretty unwieldy.

Once you have moved beyond simply installing a system with kickstart or Debian preseed and want to completely set up a machine for a purpose, you tend to restructure your policy files, placing all the actions associated with one component in one area. These component setups tend to be quite self-contained, not interfering with the setup of the rest of the system. Moving the actions for a component into a single file and importing it makes the main policy file more readable, and also creates an effective “policy module”. Further, if you stick to actions most likely to work across multiple OS versions, such as using package commands and APIs wherever possible, you end up with reusable content come upgrade time.

Here is a quick example of how we have used this functionality...

cfagent.conf
classes:

ds = ( ags1 ags2 ags3 )
pe1950 = ( agc9 agc10 agc11 agc12 agc13 agc14 ds )
dellopenmanage = ( pe1950 )


cf.main
import:

dellopenmanage:: action/sl4/dell_openmanage.cf
dellopenmanage:: action/sl4/dell_openmanage_frontpanel.cf

This setup installs the Dell OpenManage software, and sets the LCD front panel of the host to read: ATLAS: hostname

dell_openmanage.cf
copy:

# All yum repos MUST be firstpass for first time boot installation
# processes to work on the first run
firstpass::

any::
$(sl4_files)/dell_openmanage/dell.repo mode=0644 dest=/etc/yum.repos.d/dell.repo server=$(policyhost) type=sum

packages:

any::
srvadmin-all action=install

shellcommands:

any::

# Disable the web admin service and the seemingly useless shrsvc
"/sbin/chkconfig --level 123456 dsm_om_connsvc off"
"/sbin/chkconfig --level 123456 dsm_om_shrsvc off"
# Enable the dataeng, this seems to actually do stuff
"/sbin/chkconfig --level 345 dataeng on"
# Start the dataeng or all omsa commands will fail
"/etc/init.d/dataeng start"


dell_openmanage_frontpanel.cf
shellcommands:

any::

# Set the frontpanel LCD to the hostname
"/usr/bin/omconfig chassis frontpanel lcdindex=1 config=custom text='ATLAS: $(host)'" umask=022


Using this kind of layout you can completely modularise your cfengine policies, and end up with a single core policy file which is readable. More importantly, if you want to set up a new host class with a subset of the available actions, you just add the modules you want to the class.
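To check which hosts actually end up importing the Dell modules, the nested class definitions from cfagent.conf can be expanded by hand. Here is a minimal Python sketch of that expansion. It is not part of cfengine, and the `expand` helper is a hypothetical name used only for illustration.

```python
# Class groups copied from the cfagent.conf excerpt in this post.
CLASSES = {
    "ds": ["ags1", "ags2", "ags3"],
    "pe1950": ["agc9", "agc10", "agc11", "agc12", "agc13", "agc14", "ds"],
    "dellopenmanage": ["pe1950"],
}


def expand(name, classes, seen=None):
    """Return the set of hosts in class `name`, following nested classes."""
    seen = set() if seen is None else seen
    if name in seen:
        # Guard against circular class definitions.
        return set()
    seen.add(name)

    hosts = set()
    for member in classes.get(name, []):
        if member in classes:
            # The member is itself a class, so expand it recursively.
            hosts |= expand(member, classes, seen)
        else:
            hosts.add(member)
    return hosts


if __name__ == "__main__":
    print(sorted(expand("dellopenmanage", CLASSES)))
```

With the definitions above, dellopenmanage resolves to the six agc hosts plus the three ags hosts, so adding a host to ds is enough for it to pick up both Dell modules.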

Wednesday, December 3, 2008

GroundWork: Enabling TLS certificate login

By default, GroundWork supports authentication by password, with a native or LDAP backend. However, using our existing grid certificates would be much easier. Here's how. (We'll assume you're already using SSL to connect, which means your /usr/local/groundwork/apache2/conf/extra/http-ssl.conf is set up and ready to go.)

  1. Edit /usr/local/groundwork/nagios/etc/htpasswd.users, adding lines for your DN a la:
    /C=AU/O=APACGrid/OU=The University of Melbourne/CN=Tom Fifield:xxj31ZMTZzkVA

  2. Edit /usr/local/groundwork/apache2/conf/httpd.conf, following the directions to "Uncomment to disable Guava Single Sign On", and then paste the following into each of the many Directory sections involved:
    SSLRequireSSL
    SSLVerifyClient require
    SSLVerifyDepth 5
    SSLCACertificatePath /etc/grid-security/certificates/
    SSLOptions +FakeBasicAuth
    Order allow,deny
    Allow from all
    AuthUserFile /usr/local/groundwork/nagios/etc/htpasswd.users
    AuthType Basic
    AuthName "Nagios: YOUR CERTIFICATE MUST BE REGISTERED"
    Require valid-user

  3. Use groundwork to tell nagios to let us in:
    Log in to groundwork using your normal username and password, go to Control and then Nagios CGI configuration. Next, append the DNs you added to the htpasswd file to the necessary permissions sections. Save and restart.
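With several certificates to register, the lines for step 1 can be generated rather than typed. With mod_ssl's FakeBasicAuth option the certificate DN is used as the username and the password is always the word "password", whose crypt hash is the xxj31ZMTZzkVA string shown in step 1. The sketch below only formats the lines. The function name `htpasswd_entries` is hypothetical and nothing here is part of GroundWork.

```python
# crypt() hash of the fixed word "password", as expected by mod_ssl's
# FakeBasicAuth option (the same value used in step 1 above).
FAKE_BASIC_AUTH_HASH = "xxj31ZMTZzkVA"


def htpasswd_entries(dns):
    """Return one htpasswd.users line per certificate DN."""
    lines = []
    for dn in dns:
        dn = dn.strip()
        if not dn.startswith("/"):
            raise ValueError("expected a slash-separated DN, got: %r" % dn)
        lines.append(f"{dn}:{FAKE_BASIC_AUTH_HASH}")
    return lines


if __name__ == "__main__":
    for line in htpasswd_entries(
        ["/C=AU/O=APACGrid/OU=The University of Melbourne/CN=Tom Fifield"]
    ):
        print(line)
```

The output can be appended to /usr/local/groundwork/nagios/etc/htpasswd.users, and the same DNs are the ones to add to the Nagios CGI permissions in step 3.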

Friday, September 12, 2008

Laying down the GroundWork

Without a doubt, Nagios is a great way to monitor hosts and services on the grid. But those of us who've ever edited the convoluted configuration files by hand know that the syntax errors and the overload of falsely triggered alert emails are enough to send you on an office-destroying rampage. Thankfully, there are several good solutions out there in the form of frameworks.

At Australia-ATLAS we use Groundwork.

Far more fully featured than 'configuration generators' like NagiosAdmin (German), Lilac (alpha) and ignoramus (lacking), Groundwork wraps nagios entirely and is very stable.

Groundwork uses a MySQL backend to manage all of the configuration before it is committed (the standard .cfg files are eventually fed to nagios), which makes the interface smooth to use. Existing users of nagios take heart - loading your painstakingly produced .cfg files into Groundwork is well supported through the 'Load' functionality. In this way, scripts written to automatically generate workernode host instances can still be used - though this can also be done using Groundwork's 'clone host' tool.

Groundwork also takes care of all the mundane things, like running the nagios daemon itself and managing users, roles and add-on packages.


All that sounds like a bit of an ad. Why would a time-starved grid admin move to groundwork?

Configuration is easier.

You no longer need to remember anything: all of the options you have are in a drop-down box or multi-select list. That also means no more typos! Finding any host, service, command or profile is a two-click operation. You tend to use groups more because they're so much simpler to create - instead of adding them to every host in a file, you just select them from a list.

There are a couple of cases where the improved robustness of Groundwork can be a little annoying. For example, when you update a service check, you need to remember to deploy it to hosts/hostgroups, otherwise you can commit changes and wonder why nothing has changed.

However, these are small in comparison with the improved productivity you gain.

So why not give Groundwork a try - you can get it at http://www.groundworkopensource.com/.

Thursday, September 11, 2008

Cfengine; fixes syslog-ng's wagon good

We just finished implementing syslog-ng to send all logs from the nodes in the TIER2 to a single logging server.

It seems simple at first. Unfortunately, most of the grid services do not log via the standard logging interface and write their own log files instead. It gets worse: syslog-ng will not start if a log file it is supposed to track does not exist.

After a bit of pain we realised that cfengine could detect the presence of the gLite log files, rewrite the syslog-ng server config and restart syslog-ng.

A couple of shell scripts later and we are away. For each log file they produce something like:

classes:

s_var_log_gridftp_session = ( FileExists("/var/log/gridftp-session.log") )

editfiles:
# /var/log/gridftp-session.log
###################################
s_var_log_gridftp_session::
{ /etc/syslog-ng.conf
DefineClasses "newsyslog_ng"
BeginGroupIfNoLineContaining "# s_var_log_gridftp_session v15"
DeleteLinesContaining "s_var_log_gridftp_session"
Append "# s_var_log_gridftp_session v15"
Append "source s_var_log_gridftp_session { file ('/var/log/gridftp-session.log' follow_freq(30) log_prefix('log_gridftp_session: ')); };"
Append "log { source(s_var_log_gridftp_session); destination(d_stunnel); };"
EndGroup
}
!s_var_log_gridftp_session::
{ /etc/syslog-ng.conf
DefineClasses "newsyslog_ng"
DeleteLinesContaining "s_var_log_gridftp_session"
}

shellcommands:

any::

newsyslog_ng::
"/sbin/service syslog stop" umask=022
"/sbin/chkconfig --level 2345 syslog off"
"/sbin/chkconfig --add syslog-ng"
"/sbin/chkconfig --add syslog-ng-stunnel"
"/etc/init.d/syslog-ng-stunnel restart" umask=022
"/sbin/service syslog-ng restart" umask=022

So central logging is a go, and there is only one master config file for all nodes. Even better, if you start a new service, its logs get added automatically.
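Our generator scripts are shell, but the idea fits in a short Python sketch. The naming rules here are assumptions inferred from the single gridftp example above: the class name is the path without its extension, with non-alphanumeric characters turned into underscores and an "s" prefix, and the log prefix is "log_" plus the file's base name. The function names are hypothetical.

```python
import os
import re


def _sanitize(text):
    """Replace every non-alphanumeric character with an underscore."""
    return re.sub(r"[^A-Za-z0-9]", "_", text)


def class_name(log_path):
    """Derive the cfengine class / syslog-ng source name for a log file."""
    stem, _ext = os.path.splitext(log_path)
    return "s" + _sanitize(stem)


def classes_line(log_path):
    """Line for the classes: section that tests whether the log file exists."""
    return f'{class_name(log_path)} = ( FileExists("{log_path}") )'


def editfiles_stanza(log_path, version=15):
    """editfiles: stanza that adds or removes the syslog-ng source."""
    name = class_name(log_path)
    base = os.path.splitext(os.path.basename(log_path))[0]
    prefix = "log_" + _sanitize(base)
    marker = f"# {name} v{version}"
    return "\n".join([
        f"{name}::",
        "{ /etc/syslog-ng.conf",
        'DefineClasses "newsyslog_ng"',
        f'BeginGroupIfNoLineContaining "{marker}"',
        f'DeleteLinesContaining "{name}"',
        f'Append "{marker}"',
        f'Append "source {name} {{ file (\'{log_path}\' follow_freq(30) '
        f'log_prefix(\'{prefix}: \')); }};"',
        f'Append "log {{ source({name}); destination(d_stunnel); }};"',
        "EndGroup",
        "}",
        f"!{name}::",
        "{ /etc/syslog-ng.conf",
        'DefineClasses "newsyslog_ng"',
        f'DeleteLinesContaining "{name}"',
        "}",
    ])


if __name__ == "__main__":
    path = "/var/log/gridftp-session.log"
    print(classes_line(path))
    print(editfiles_stanza(path))
```

Bumping the version number changes the marker line, so BeginGroupIfNoLineContaining no longer matches and the stanza is rewritten on the next run. That is what makes it possible to roll out a changed source definition to every node.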