Libprof, or How to Profile the Unprofilable

On a project some time ago, we happened to be working with an engineer from HotSOS: we were fixing the customer's code while he was fixing the customer's Oracle performance with his (somewhat brilliant) profiler, http://www.hotsos.com/profiler.html

At the time, we thought this was really cool, and wondered what else should be profiled. This year we found out: shared libraries.

An Application that Did Nothing, Slowly

One of our customers has an application that spends all its time doing I/O. Slowly.

So we set out to profile it, and found that it was spending all its time in a low-level read function that did little except call system libraries. We then turned to the system-call profiling option of truss, and found that the program spent a negligible amount of time in read and write.

With profiles from the program and from the system both saying “not my fault”, we had the computer version of the old scene where the hardware and software salesmen point fingers at each other and say “blame him”.

Pretending It Was Perfmeter

We don't want to say anything more about the real customer program, but while doing a different experiment we found that the gnome-perfmeter program has almost exactly the same behavior. For a minute, let's pretend we are concerned about the performance of perfmeter.

Running truss -c -p <pid>, waiting several minutes and then typing ^C gave us the system-call usage, subject to a reasonably small margin of error in the timing:


$  truss -c -p 950
^C
syscall               seconds   calls  errors
read                     .003     147      15
write                    .007     121
fstat                    .002      88
ioctl                    .004     189
getmsg                   .001      44
putmsg                   .006      44
readv                    .000      20
sysinfo                  .000       3
pollsys                  .016     176
--------  ------   ----
sys totals:              .043     832     15
usr time:                .067
elapsed:              219.270


This means that the I/O system calls, read and write, together add up to only 0.010 seconds of system time, out of the 0.043 seconds for all the system calls. And that's out of a total of just 0.110 seconds of user plus system time. Not much!

The program had been running for 219.27 seconds, so it had spent 219.270 - 0.110 = 219.16 of them doing nothing whatsoever.

We're not sure what it wasn't doing, but it was doing a lot of it.

Profiling the C library

The next place to look was the C library. However, it is a shared library, not a .a archive, so we couldn't just compile and link statically and run gprof. We needed a shared-library profiler.

In a previous life, we'd developed much of apptrace(1), so shared libraries are old friends. It was easy to write a profiler library, which we called libprof, and to create interposers for libc.
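
To show the idea, here is a minimal sketch of a single interposer, for poll, assuming the usual LD_PRELOAD and dlsym(RTLD_NEXT) interposition mechanism. It is an illustration only, not the real libprof source: the accounting is pared down to one counter and one summary line printed at exit.

/*
 * Sketch of one libc interposer: time every call to poll() and
 * report a one-line profile when the target process exits.
 */
#define _GNU_SOURCE             /* for RTLD_NEXT on Linux */
#include <dlfcn.h>
#include <poll.h>
#include <stdio.h>
#include <time.h>

static double        poll_seconds;  /* wall-clock time spent in poll() */
static unsigned long poll_calls;

int poll(struct pollfd *fds, nfds_t nfds, int timeout)
{
    static int (*real_poll)(struct pollfd *, nfds_t, int);
    struct timespec t0, t1;
    int rc;

    if (real_poll == NULL)      /* look up the real libc poll() once */
        real_poll = (int (*)(struct pollfd *, nfds_t, int))
                        dlsym(RTLD_NEXT, "poll");

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rc = real_poll(fds, nfds, timeout);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    poll_calls++;
    poll_seconds += (t1.tv_sec - t0.tv_sec)
                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return rc;
}

/* gcc-style destructor: print the profile at process exit */
static void __attribute__((destructor)) poll_report(void)
{
    fprintf(stderr, "poll: %lu calls, %.6f s, %.8f s/call\n",
            poll_calls, poll_seconds,
            poll_calls ? poll_seconds / poll_calls : 0.0);
}

Compiled as a shared object (for example, cc -shared -fPIC -o poll_prof.so poll_prof.c -ldl) and preloaded into the target with LD_PRELOAD, it reports its summary when the process exits; libprof does the equivalent for the rest of libc.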

For our gnome-perfmeter example, libprof reports:

$ libprof /bin/gnome-perfmeter
^C
Resource usage profile for process 3757, gnome-perfmeter
Name                        Duration     Pct.        Calls Sec/Call
----                        ------------ -----       ----- ----------
Entire program              35.90309000 100.0           1 35.90309000
Non-library time             3.45257372  9.62           1 3.45257372
----
poll                        29.03500100 80.87         186 0.15610216
fopen                        0.69696600  1.94         110 0.00633605
select                       0.56633600  1.58          20 0.02831680
strchr                       0.44610100  1.24      392223 0.00000114
access                       0.28468800  0.79        1337 0.00021293
...


This indicates that the poll library call is the bottleneck among library calls: 186 calls averaging 0.156 seconds each, 80.87% of the whole run. That is far larger than anything reported by truss, because truss -c counts only the CPU time spent inside each system call, while libprof measures wall-clock duration, so the time the program sits blocked in poll shows up here. That makes poll the likely bottleneck function in gnome-perfmeter.

Diagnosing the Problem

It was exactly the same in the customer's program, except that it called flock(2) right after poll returned, and then waited for all the other copies of itself to release their locks before doing any work. When its work was done, it unlocked the file and returned to the poll.

This turned out to be unnecessary: after some discussion and code-reading, we discovered that it needed the lock in only one case, when poll returned ready to read from a particular pipe. Once we moved the flock down past the read and into the switch case that handled the pipe, the program flew. The pipe-reading case was still slow, but it no longer blocked all the others.
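
In outline, the change looked something like the sketch below. This is our reconstruction, not the customer's code; handle_pipe(), handle_other(), PIPE_IDX, and lockfd are invented names for illustration.

/*
 * Hypothetical reconstruction of the fix: take the lock only in
 * the one case that needs it, after the read, instead of on every
 * wakeup from poll().
 */
#include <poll.h>
#include <sys/file.h>
#include <unistd.h>

#define PIPE_IDX 0                       /* slot of the pipe in fds[] */

extern void handle_pipe(char *buf, ssize_t len);         /* invented */
extern void handle_other(struct pollfd *fds, nfds_t nfds);

void event_loop(struct pollfd *fds, nfds_t nfds, int lockfd)
{
    char buf[4096];

    for (;;) {
        if (poll(fds, nfds, -1) <= 0)
            continue;

        /* Before the fix, flock(lockfd, LOCK_EX) sat here, so every
         * wakeup waited on every other copy of the program. */

        if (fds[PIPE_IDX].revents & POLLIN) {
            ssize_t len = read(fds[PIPE_IDX].fd, buf, sizeof buf);
            flock(lockfd, LOCK_EX);      /* only this case needs the lock */
            handle_pipe(buf, len);
            flock(lockfd, LOCK_UN);
        } else {
            handle_other(fds, nfds);     /* everything else runs unlocked */
        }
    }
}

The other cases now never touch the lock, which is why everything except the pipe case sped up.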


The libprof script and libraries are available on our tools page.