256 lines
14 KiB
Plaintext
256 lines
14 KiB
Plaintext
Developer notes regarding trace(1), by David van Moolenbroek.
|
|
|
|
|
|
OVERALL CODE STRUCTURE
|
|
|
|
The general tracing engine is in trace.c. It passes IPC-level system call
|
|
enter and leave events off to call.c, which handles IPC-level system call
|
|
printing and passes off system calls to be interpreted by a service-specific
|
|
system call handler whenever possible. All the service-specific code is in the
|
|
service/ subdirectory, grouped by destination service. IOCTLs are a special
|
|
case, which are handled in ioctl.c and passed on to driver-type-grouped IOCTL
|
|
handlers in the ioctl/ subdirectory (this grouping is not strict). Some of the
|
|
generated output goes through the formatting code in format.c, and all of it
|
|
ends up in output.c. The remaining source files contain support code.
|
|
|
|
|
|
ADDING A SYSTEM CALL HANDLER
|
|
|
|
In principle, every system call stops the traced process twice: once when the
|
|
system call is started (the call-enter event) and once when the system call
|
|
returns (the call-leave event). The tracer uses the call-enter event to print
|
|
the request being made, and the call-leave event to print the result of the
|
|
call. The output format is supposed to mimic largely what the system call
|
|
looks like from a C program, although with additional information where that
|
|
makes sense. The general output format for system calls is:
|
|
|
|
name(parameters) = result
|
|
|
|
..where "name" is the name of the system call, "parameters" is a list of system
|
|
call parameters, and "result" is the result of the system call. If possible,
|
|
the part up to and including the equals sign is printed from the call-enter
|
|
event, and the result is printed from the call-leave event. However, many
|
|
system calls actually pass a pointer to a block of memory that is filled with
|
|
meaningful content as part of the system call. For that reason, it is also
|
|
possible that the call-enter event stops printing somewhere inside the
|
|
parameters block, and the call-leave event prints the rest of the parameters,
|
|
as well as the equals sign and the result after it. The place in the printed
|
|
system call where the call-enter printer stops and the call-leave printer is
|
|
supposed to pick up again, is referred to as the "call split".
|
|
|
|
The tracer has to a handler structure for every system call that can be made by
|
|
a user program to any of the the MINIX3 services. This handler structure
|
|
provides three elements: the name of the system call, an "out" function that
|
|
handles printing of the call-enter part of the system call, and an "in"
|
|
function that handles printing of the call-leave part of the system call. The
|
|
"out" function is expected to print zero or more call parameters, and then
|
|
return a call type, which indicates whether all parameters have been printed
|
|
yet, or not. In fact, there are three call types, shown here with an example
|
|
which has a "|" pipe symbol added to indicate the call split:
|
|
|
|
CT_DONE: write(5, "foo", 3) = |3
|
|
CT_NOTDONE: read(5, |"foo", 1024) = 3
|
|
CT_NORETURN: execve("foo", ["foo"], []")| = -1 [ENOENT]
|
|
|
|
The CT_DONE call type indicates that the handler is done printing all the
|
|
parameters during the call-enter event, and the call split will be after the
|
|
equals sign. The CT_NOTDONE call type indicates that the handler is not done
|
|
printing all parameters yet, thus yielding a call split in the middle of the
|
|
parameters block (or even right after the opening parenthesis). The no-return
|
|
(CT_NORETURN) call type is used for a small number of functions that do not
|
|
return on success. Currently, these are the exit(), execve(), and sigreturn()
|
|
system calls. For these calls, no result will be printed at all, unless such
|
|
a call fails, in which case a failure result is printed after all. The call
|
|
split is such that the entire parameters block is printed upon entering the
|
|
call, but the equals sign and result are printed only if the call does return.
|
|
|
|
Now more about the handler structure for the system call. First of all, each
|
|
system call has a name, which must be a static string. It may be supplied
|
|
either as a string, or as a function that returns a name string. The latter is
|
|
for cases where one message-level system call is used to implement multiple
|
|
C-level system calls (such as setitimer() and getitimer() both going through
|
|
PM_ITIMER). The name function has the following prototype:
|
|
|
|
const char *svc_syscall_name(const message *m_out);
|
|
|
|
..where "m_out" is a local copy of the request message, which the name function
|
|
can use to decide what string to return for the system call. As a sidenote,
|
|
in the future, the system call name will be used to implement call filtering.
|
|
|
|
An "out" printer function has the following prototype:
|
|
|
|
int svc_syscall_out(struct trace_proc *proc, const message *m_out);
|
|
|
|
Here, "proc" is a pointer to the process structure containing information about
|
|
the process making the system call; proc->pid returns the process PID, but the
|
|
function should not access any other fields of this structure directly.
|
|
Instead, many of the output primitive and helper functions (which are all
|
|
prefixed with "put_") take this pointer as part of the call. "m_out" is a
|
|
local copy of the request message, and the printer may access its fields as it
|
|
sees fit.
|
|
|
|
The printer function should simply print parameters. The call name and the
|
|
opening parenthesis are printed by the main output routine.
|
|
|
|
All simple call parameters should be printed using the put_field() and
|
|
put_value() functions. The former prints a parameter or field name as flat
|
|
text; the latter is a printf-like interface to the former. By default, call
|
|
paramaters are simply printed as "value", but if printing all names is enabled,
|
|
call parameters are printed as "name=value". Thus, all parameters should be
|
|
given a name, even if this name does not show up by default. Either way, these
|
|
two functions take care of deciding whether to print the name, as well as of
|
|
printing separators between the parameters. More about printing more complex
|
|
parameters (such as structures) in a bit.
|
|
|
|
The out printer function must return one of the three CT_ call type values. If
|
|
it returns CT_DONE, the main output routine will immediately print the closing
|
|
parenthesis and equals sign. If it returns CF_NORETURN, a closing parenthesis
|
|
will be printed. If it return CF_NOTDONE, only a parameter field separator
|
|
(that is, a comma and a space) will be printed--after all, it can be assumed
|
|
that more parameters will be printed later.
|
|
|
|
An "in" printer function has the following prototype:
|
|
|
|
void svc_syscall_in(struct trace_proc *proc, const message *m_out,
|
|
const message *m_in, int failed);
|
|
|
|
Again, "proc" is the traced process of which its current system call has now
|
|
returned. "m_out" is again the request message, guaranteed to be unchanged
|
|
since the "out" call. "m_in" is the reply message from the service. "failed"
|
|
is either 0 to indicate that the call appears to have succeeded, or PF_FAILED
|
|
to indicate that the call definitely failed. If PF_FAILED is set, the call
|
|
has failed either at the IPC level or at the system call level (or for another,
|
|
less common reason). In that case, the contents of "m_in" may be garbage and
|
|
"m_in" must not be used at all.
|
|
|
|
For CF_NOTDONE type calls, the in printer function should first print the
|
|
remaining parameters. Here especially, it is important to consider that the
|
|
entire call may fail. In that case, the parameters of which the contents were
|
|
still going to be printed may also contain garbage, since they were never
|
|
filled. The expected behavior is to print such parameters as pointer or "&.."
|
|
or something else to indicate that their actual contents are not valid.
|
|
|
|
Either way, once a CF_NOTDONE type call function is done printing the remaining
|
|
parameters, it must call put_equals(proc) to print the closing parenthesis of
|
|
the call and the equals sign. CF_NORETURN calls must also use put_equals(proc)
|
|
to print the equals sign.
|
|
|
|
Then comes the result part. If the call failed, the in printer function *must*
|
|
use put_result(proc) to print the failure result. This call not only takes
|
|
care of converting negative error codes from m_in->m_type into "-1 [ECODE]" but
|
|
also prints appropriate failure codes for IPC-level and other exceptional
|
|
failures. Only if the system call did not fail, may the in printer function
|
|
choose to not call put_result(proc), which on success simply prints
|
|
m_in->m_type as an integer. Similarly, if the system call succeeded, the in
|
|
printer function may print extended results after the primary result, generally
|
|
in parentheses. For example, getpid() and getppid() share the same system call
|
|
and thus the tracer prints both return values, one as the primary result of the
|
|
actual call and one in parentheses with a clarifying name as extended result:
|
|
|
|
getpid() = 3 (ppid=1)
|
|
|
|
It should now be clear that printing extended results makes no sense if the
|
|
system call failed.
|
|
|
|
Besidse put_equals and put_result, the following more or less generic support
|
|
functions are available to print the various parts of the requests and replies.
|
|
|
|
put_field - output a parameter, structure field, and so on; this function
|
|
should be used for just about every actual value
|
|
put_value - printf-like version of put_field
|
|
put_text - output plain text; for call handlers, this should be used only to
|
|
to add things right after a put_field call, never on its own
|
|
put_fmt - printf-like version of put_text, should generally not be used
|
|
from call handlers at all
|
|
put_open - open a nested block of fields, surrounded by parentheses,
|
|
brackets, or something like that; this is used for structures,
|
|
arrays, and any other similar nontrivial case of nesting
|
|
put_close - close a previously opened block of fields; the nesting depth is
|
|
actually tracked (to keep per-level separators etc), so each
|
|
put_open call must have a corresponding put_close call
|
|
put_open_struct - perform several tasks necessary to start printing the
|
|
fields of a structure; note that this function may fail!
|
|
put_close_struct - end successful printing of a structure
|
|
put_ptr - print a pointer in the traced process
|
|
put_buf - print a buffer or string
|
|
put_flags - print a bitwise flags field
|
|
put_tail - helper function for printing the continuation part of an array
|
|
|
|
Many of these support functions take a flags field which takes PF_-prefixed
|
|
flags to modify the output they generate. The value of 'failed' in the in
|
|
printer function may actually be passed (bitwise-OR'ed in) as the PF_FAILED
|
|
flag to these support functions, and they will do the right thing. For
|
|
example, a call to put_open_struct with the PF_FAILED flag will end up simply
|
|
printing the pointer to the structure, and not allow printing of the contents
|
|
of the structure.
|
|
|
|
The above support functions are documented (at a basic level) within the code,
|
|
but in many cases, it may be useful to look up how they are used in practice by
|
|
the existing handlers. The same goes for various less clear cases; while there
|
|
is basic support for printing structures, support for printing arrays must be
|
|
coded fully by hand, as has been done for many places. A serious attempt has
|
|
been made to make the output consistent across the board (mainly thanks to the
|
|
output format of strace, on which the output of this tracer has been based,
|
|
sometimes very strictly and sometimes more loosely, but that aside) so it is
|
|
always advisable to follow the ways of the existing handlers. Also keep in
|
|
mind that there are already printer functions for several generic structures,
|
|
and these should be used whenever possible (e.g., see the put_fd() comment).
|
|
|
|
Finally, the default_out and default_in functions may be used as printer
|
|
functions for call with no parameters, and for functions which need no more
|
|
than put_result() to print their system call result, respectively.
|
|
|
|
|
|
INTERNALS: MULTIPROCESS OUTPUT AND PREEMPTION
|
|
|
|
Things get interesting when multiple processes are traced at once. Due to the
|
|
nature of process scheduling, system calls may end up being preempted between
|
|
the call-enter and call-leave phases. This means that the output of a system
|
|
call has to be suspended to give way to an event from another traced process.
|
|
Such preemption may occur with literally all calls; not just "blocking" calls.
|
|
|
|
The tracer goes through some lengths to aid the user in following the output in
|
|
the light of preemtion. The most important aspect is that the output of the
|
|
call-enter phase is recorded, so that in the case of preemption, the call-leave
|
|
phase can start by replaying the record. As a result, the user gets to see the
|
|
whole system call on a single line, instead of just the second half. Such
|
|
system call resumptions are marked with a "*" in their prefix, to show that
|
|
the call was not just entered. The output therefore looks like this:
|
|
|
|
2| syscall() = <..>
|
|
3| othercall() = 0
|
|
2|*syscall() = 0
|
|
|
|
Signals that arrive during a call will cause a resumption of the call as well.
|
|
As a result, a call may be resumed multiple times:
|
|
|
|
2| syscall() = <..>
|
|
3| othercall() = 0
|
|
2|*syscall() = ** SIGUSR1 ** ** SIGUSR2 ** <..>
|
|
3| othercall() = -1 [EBUSY]
|
|
2|*syscall() = ** SIGHUP ** <..>
|
|
3| othercall() = 0
|
|
2|*syscall() = 0
|
|
|
|
This entire scenario shows one single system call from process 2.
|
|
|
|
In the current implementation, the output that should be recorded and/or cause
|
|
the "<..>" preemption marker, as well as the cases where the recorded text must
|
|
be replayed, are marked by the code explicitly. Replay takes place in three
|
|
cases: upon the call-leave event (obviously), upon receiving a signal (as shown
|
|
above), and when it is required that a suspended no-return call is shown as
|
|
completed before continuing with other output. The last case applies to exit()
|
|
and execve(), and both are documented in the code quite extensively. Generally
|
|
speaking, in all output lines where no recording or replay actions are
|
|
performed, the recording will not be replayed but also not removed. This
|
|
allows for intermediate lines for that process in the output. Practically
|
|
speaking, future support for job control could even print when a process get
|
|
stopped and continued, for that process, while preempting the output for the
|
|
ongoing system call for that same process.
|
|
|
|
It is possible that the output of the call-enter phase exhausts the recording
|
|
buffer for its process. In this case, a new, shorter text is generated upon
|
|
process resumption. There are many other aspects to proper output formatting
|
|
in the light of preemption, but most of them should be documented as part of
|
|
the code reasonably well.
|