Data Center Works Inc


Cores, Cores Everywhere, and Not a Drop to Drink

Our accountant's company once had a problem with core files: /var/core was filling up regularly, so they'd delete them all, and then find out too late that they'd needed to act on one of them.

The underlying problem was that several of the programs they ran created cores with wild abandon. Another, more important, program created a core when it ran into a particular situation; if that happened, they needed to re-run the program with corrected data the next day.

This was like the predicament of the ancient mariner, surrounded by all the salt water he could want but without a drop of fresh water to drink. The system administrators at our accountant's company were surrounded by cores, but they had no way of telling the uninteresting ones from the ones that indicated a real problem.

Filtering out the Extraordinary from the Routine

So we sat down and analyzed several weeks of cores with mdb, the Solaris Modular Debugger, and found one pattern: the cores that we didn't care about were always unhandled Java exceptions, from a particular suite of programs run by a particular user.

This meant that if we automatically analyzed all the cores nightly and created text files containing the analyses, we could filter out any that came from java, were run by that particular user (uid 110), and exited with a SIGABRT. Once we'd thrown those away, we could read the others to see if we needed to schedule a rerun, or if we had found a new and different problem.

Fortunately, there is a script, MDeBug, that will do an interactive analysis of application or system core files using mdb(1). With it, we can write a cron script to analyze all the day's core files, and then another to identify the cores that are interesting to the system administrator.

Analyzing with MDeBug

MDeBug is a shell script written by Gopinath Rao that runs the mdb debugger to analyze either operating system or application dumps. It's available at http://developers.sun.com/solaris/articles/mdebug/mdebug.html, and looks like this when run:

$ su root -c '/usr/local/bin/mdebug'

               Welcome to the MDeBug Session
               ******************************

Select one of the following:
         1. Run MDeBug against a Kernel Crash dump
         2. Run MDeBug against an Application core
         3. Exit
Enter your selection:2
Enter the binary name which generated the core:/bin/ksh
Enter the core file name:/var/core/core.ksh.856
Done!



When run on a core file from ksh, it reports:

  ******************************************************************************
  Application core Dump Analysis Output                     MDeBug Rev 1.0
  Sun Aug  5 16:26:39 EDT 2007                   Files: /bin/ksh  /var/core/core.ksh.856
  ******************************************************************************


                ** Core file status **
                ------------------------
debugging core file of ksh (32-bit) from froggy
file: /usr/bin/ksh
initial argv: -ksh
threading model: multi-threaded
status: process terminated by SIGSEGV (Segmentation Fault)


                ** Thread stack($c) **
                ----------------------
libc.so.1`kill+8(52de8, b, 358, 0, 80, 0)
job_walk+0x1bc(257b8, b, 58244, 52de8, 3f35c, 52db0)
b_kill+0x23c(3, 25400, 55f68, 1b694, 0, b)
sh_exec+0x71c(53400, 53000, 0, 55f68, 4234, 53000)
0x29e58(581c8, 59298, 59298, 53800, 1, 53400)
main+0xa30(20000000, ffbff284, ffbff284, 53000, 53000, 3f400)
_start+0x108(0, 0, 0, 0, 0, 0)


                ** Shared objects **
                ----------------------
    BASE    LIMIT     SIZE NAME
   10000    42000    32000 /usr/bin/ksh
ff280000 ff354000    d4000 /lib/libc.so.1
ff398000 ff39c000     4000 /platform/sun4u/lib/libc_psr.so.1
ff3b0000 ff3dc000    2c000 /lib/ld.so.1


                Thread stack for MT app
                ------------------------
stack pointer for thread 1: ffbfebc0
[ ffbfebc0 libc.so.1`kill+8() ]
  ffbfec20 job_walk+0x1bc()
  ffbfec80 b_kill+0x23c()
  ffbfece8 sh_exec+0x71c()
  ffbfef28 0x29e58()
  ffbfefb8 main+0xa30()
  ffbff0f0 _start+0x108()

Scripting the Analysis

The information we need is in the first part of the report: the program name and the reason it exited (SIGSEGV), but not the uid. Fortunately, one of the options to the Solaris coreadm program sets the name of the core file from a printf-like string. In particular, it allows us to include %u, the uid the program was running under when it dumped core, so we set the coreadm options to include it, along with the gid, pid, and so on:

 # coreadm -g /var/core/core_%n_%f_%u_%g_%t_%p \
       -i /var/core/core_%n_%f_%u_%g_%t_%p \
       -e log -e global -e global-setid -e process -e proc-setid
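
Running coreadm with no arguments prints the current configuration, so a quick check shows the new patterns took effect. The exact layout varies a little between Solaris releases, but the first lines look something like this:

 # coreadm
     global core file pattern: /var/core/core_%n_%f_%u_%g_%t_%p
       init core file pattern: /var/core/core_%n_%f_%u_%g_%t_%p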



The next thing we needed was a policy about how long to keep the cores and the analyses. After discussing it with the data center manager, we agreed to write a script that would analyze any new cores each night, delete cores and analyses more than a week old, and leave anything that had been gzipped alone so it could be kept indefinitely.

This turned into a program to find new cores to analyze and old ones to delete, with any gzipped files saved forever.

#!/bin/sh
#
# core_cron_daemon -- for all new files in /var/core that match 
#	a pattern set via coreadm, analyze them, leave the results 
#	for the system administration team, and then a week later delete them.
#	If you need to save a core, just gzip it and it won't 
#	match the pattern. 
#	The pattern is core_<hostname>_java_110_102_1181276668_2441
#	which is             host      prog uid gid timestamp  pid
#	Requires a modified mdebug script and /bin/mdb, and has to
#	be run by root's crontab in order to read /var/core.
#
ProgName=`basename $0`
BIN=/usr/local/bin

main() {
	cd /var/core

	# Find today's core files
	find . -name 'core_*' -mtime -1 -print | grep -v '\.gz' |\
	while read file junk; do
		file=`basename $file`
		analyze $file >analysis_of_$file
		mv $file analyzed_$file
	done

	# Throw away last week's core files and reports
	find . \( -name 'analyzed_*' -o -name 'analysis*' \) -mtime +7 -print |\
	grep -v '\.gz' |\
	while read file junk; do
		rm $file
	done

}

analyze() {
	name=$1

	app=`echo $name | nawk -F_ '{print $3}'`
	path=`which $app`
	set '' $path
	case "$2" in
	"no") path="-" ;; # Insert a placeholder
	*) ;;
	esac

	$BIN/mdebug 2 $path $name
}

main "$@"

When this script is run, it creates analysis_of_* files containing the output of MDeBug and renames the core files to analyzed_* so they won't be analyzed twice. A week later it cleans them up, unless they've been gzipped: a gzipped core is one the system administrators want to keep around without wasting space, so the script leaves it alone.
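
For example, if a particular core looks worth keeping for a later post-mortem, compressing it is all that's needed; the file name below is a made-up illustration following the coreadm pattern above:

 # cd /var/core
 # gzip analyzed_core_froggy_java_110_102_1181276668_2441

Because of the .gz suffix, both find-and-grep pipelines in core_cron_daemon will skip it from then on, and the same trick works for the matching analysis_of_* file.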

Modifying MDeBug

As the comment in core_cron_daemon says, we need to modify mdebug so the script can run it non-interactively. We added a check to see whether any parameters were passed and, if so, took the option, program name, and core file from the command line. The code for option 2 (application core files) looks like this:

else
        # Take parameters from the command-line
        case "$1" in
        2) #application analysis
          bin=$2
          cor=$3
          appcore_analysis

          thr_model=`echo "::status" | mdb $bin $cor | grep thread | cut -f2 -d":"`
          if [ "$thr_model" = " multi-threaded" ]
          then
        mdb $bin $cor 2>/dev/null <<EOA
        =nn"Thread stack for MT app"
        ="------------------------"
          ::walk thread | ::findstack
EOA
          fi
        ;;
        3) exit 0 ;;
        *) echo "$1 not implemented, halting. " ;;
        esac
fi
# End of the script -- 02/21/2002 --grao
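
With this change in place, a core can also be analyzed by hand, the same way core_cron_daemon's analyze() invokes the script: the first argument selects application-core analysis, followed by the binary and the core file. Using the ksh example from earlier (the output file name is arbitrary):

 # /usr/local/bin/mdebug 2 /bin/ksh /var/core/core.ksh.856 > /tmp/ksh_analysis

The report it produces is the same one shown above.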



We then run this in cron every morning, after all the nightly processing is done and the backups have started.
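
The crontab entry itself is nothing special. Something along these lines works; the 06:00 start time is only an example, and the path assumes core_cron_daemon was installed next to mdebug in /usr/local/bin. As the comment at the top of the script notes, it has to go in root's crontab so it can read /var/core:

 # analyze and clean up core files every morning at 6:00
 0 6 * * * /usr/local/bin/core_cron_daemon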

Daily Checks

The system administrators will have the analyses sitting waiting for them when they arrive in the morning. After a bit of experimentation, we found that all the cores from java and uid 110 were the unhandled exceptions, and the system administrators could find all the important core files with just

$ ls /var/core | grep -v java_110

If there are any cores they're interested in, they can then look at the analysis_* files and decide whether they need to do anything. Usually they see nothing; occasionally there's a case where they need to schedule a rerun; and once in a long while they find a new and different core, from a problem they might not have caught otherwise.
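
A small loop can make the morning check even quicker. This is just a sketch, assuming the file layout created by core_cron_daemon; it prints the one-line exit status from every analysis that isn't one of the routine java/uid-110 cores:

 cd /var/core
 for f in `ls analysis_of_* 2>/dev/null | grep -v java_110`; do
         echo "==== $f"
         grep 'process terminated by' $f
 done

Anything that reports a SIGSEGV or some other unexpected signal is worth opening in full.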

The last case is the interesting one: if you're not looking for cores, you and your customers will often miss important problems. Running a nightly core analysis can catch a nasty problem before it gets worse, so we recommend it. To make it easy, we've provided both the modified mdebug script and the core_cron_daemon script on our web site.