| Developer notes regarding trace(1), by David van Moolenbroek.
 | |
| 
 | |
| 
 | |
| OVERALL CODE STRUCTURE
 | |
| 
 | |
| The general tracing engine is in trace.c.  It passes IPC-level system call
 | |
| enter and leave events off to call.c, which handles IPC-level system call
 | |
| printing and passes off system calls to be interpreted by a service-specific
 | |
| system call handler whenever possible.  All the service-specific code is in the
 | |
| service/ subdirectory, grouped by destination service.  IOCTLs are a special
 | |
| case, which are handled in ioctl.c and passed on to driver-type-grouped IOCTL
 | |
| handlers in the ioctl/ subdirectory (this grouping is not strict).  Some of the
 | |
| generated output goes through the formatting code in format.c, and all of it
 | |
| ends up in output.c.  The remaining source files contain support code.
 | |
| 
 | |
| 
 | |
| ADDING A SYSTEM CALL HANDLER
 | |
| 
 | |
| In principle, every system call stops the traced process twice: once when the
 | |
| system call is started (the call-enter event) and once when the system call
 | |
| returns (the call-leave event).  The tracer uses the call-enter event to print
 | |
| the request being made, and the call-leave event to print the result of the
 | |
| call.  The output format is supposed to mimic largely what the system call
 | |
| looks like from a C program, although with additional information where that
 | |
| makes sense.  The general output format for system calls is:
 | |
| 
 | |
|   name(parameters) = result
 | |
| 
 | |
| ..where "name" is the name of the system call, "parameters" is a list of system
 | |
| call parameters, and "result" is the result of the system call.  If possible,
 | |
| the part up to and including the equals sign is printed from the call-enter
 | |
| event, and the result is printed from the call-leave event.  However, many
 | |
| system calls actually pass a pointer to a block of memory that is filled with
 | |
| meaningful content as part of the system call.  For that reason, it is also
 | |
| possible that the call-enter event stops printing somewhere inside the
 | |
| parameters block, and the call-leave event prints the rest of the parameters,
 | |
| as well as the equals sign and the result after it.  The place in the printed
 | |
| system call where the call-enter printer stops and the call-leave printer is
 | |
| supposed to pick up again, is referred to as the "call split".
 | |
| 
 | |
| The tracer needs a handler structure for every system call that can be made by
 | |
| a user program to any of the MINIX3 services.  This handler structure
 | |
| provides three elements: the name of the system call, an "out" function that
 | |
| handles printing of the call-enter part of the system call, and an "in"
 | |
| function that handles printing of the call-leave part of the system call.  The
 | |
| "out" function is expected to print zero or more call parameters, and then
 | |
| return a call type, which indicates whether all parameters have been printed
 | |
| yet, or not.  In fact, there are three call types, shown here with an example
 | |
| which has a "|" pipe symbol added to indicate the call split:
 | |
| 
 | |
|   CT_DONE:       write(5, "foo", 3) = |3
 | |
|   CT_NOTDONE:    read(5, |"foo", 1024) = 3
 | |
|   CT_NORETURN:   execve("foo", ["foo"], [])| = -1 [ENOENT]
 | |
| 
 | |
| The CT_DONE call type indicates that the handler is done printing all the
 | |
| parameters during the call-enter event, and the call split will be after the
 | |
| equals sign.  The CT_NOTDONE call type indicates that the handler is not done
 | |
| printing all parameters yet, thus yielding a call split in the middle of the
 | |
| parameters block (or even right after the opening parenthesis).  The no-return
 | |
| (CT_NORETURN) call type is used for a small number of functions that do not
 | |
| return on success.  Currently, these are the exit(), execve(), and sigreturn()
 | |
| system calls.  For these calls, no result will be printed at all, unless such
 | |
| a call fails, in which case a failure result is printed after all.  The call
 | |
| split is such that the entire parameters block is printed upon entering the
 | |
| call, but the equals sign and result are printed only if the call does return.
 | |
| 
 | |
| Now more about the handler structure for the system call.  First of all, each
 | |
| system call has a name, which must be a static string.  It may be supplied
 | |
| either as a string, or as a function that returns a name string.  The latter is
 | |
| for cases where one message-level system call is used to implement multiple
 | |
| C-level system calls (such as setitimer() and getitimer() both going through
 | |
| PM_ITIMER).  The name function has the following prototype:
 | |
| 
 | |
|   const char *svc_syscall_name(const message *m_out);
 | |
| 
 | |
| ..where "m_out" is a local copy of the request message, which the name function
 | |
| can use to decide what string to return for the system call.  As a sidenote,
 | |
| in the future, the system call name will be used to implement call filtering.
 | |
| 
 | |
| An "out" printer function has the following prototype:
 | |
| 
 | |
|   int svc_syscall_out(struct trace_proc *proc, const message *m_out);
 | |
| 
 | |
| Here, "proc" is a pointer to the process structure containing information about
 | |
| the process making the system call; proc->pid returns the process PID, but the
 | |
| function should not access any other fields of this structure directly.
 | |
| Instead, many of the output primitive and helper functions (which are all
 | |
| prefixed with "put_") take this pointer as part of the call.  "m_out" is a
 | |
| local copy of the request message, and the printer may access its fields as it
 | |
| sees fit.
 | |
| 
 | |
| The printer function should simply print parameters.  The call name and the
 | |
| opening parenthesis are printed by the main output routine.
 | |
| 
 | |
| All simple call parameters should be printed using the put_field() and
 | |
| put_value() functions.  The former prints a parameter or field name as flat
 | |
| text; the latter is a printf-like interface to the former.  By default, call
 | |
| parameters are simply printed as "value", but if printing all names is enabled,
 | |
| call parameters are printed as "name=value".  Thus, all parameters should be
 | |
| given a name, even if this name does not show up by default.  Either way, these
 | |
| two functions take care of deciding whether to print the name, as well as of
 | |
| printing separators between the parameters.  More about printing more complex
 | |
| parameters (such as structures) in a bit.
 | |
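The naming and separator behavior described above can be illustrated with a
small stand-alone sketch.  It does not use the tracer's real put_field(); the
demo_out structure and the demo_ functions are hypothetical stand-ins that
only mimic the described behavior: every parameter is given a name, the name
is shown only if name printing is enabled, and separators are added
automatically.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the tracer's per-process output state. */
struct demo_out {
	char buf[128];		/* output text collected so far */
	int nfields;		/* number of fields printed so far */
	int print_names;	/* nonzero: print fields as "name=value" */
};

/* Append one field, preceded by ", " unless it is the first field. */
void
demo_put_field(struct demo_out *o, const char *name, const char *text)
{
	size_t len = strlen(o->buf);

	assert(len < sizeof(o->buf));
	snprintf(o->buf + len, sizeof(o->buf) - len, "%s%s%s%s",
	    (o->nfields++ > 0) ? ", " : "",
	    o->print_names ? name : "",
	    o->print_names ? "=" : "",
	    text);
}

/* Print the parameters of a write()-like call into a fresh buffer. */
const char *
demo_write_params(struct demo_out *o, int print_names)
{
	memset(o, 0, sizeof(*o));
	o->print_names = print_names;
	demo_put_field(o, "fd", "5");
	demo_put_field(o, "buf", "\"foo\"");
	demo_put_field(o, "len", "3");
	return o->buf;
}
```

Because the caller always supplies a name, the same handler code serves both
output modes without any conditionals of its own.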
| 
 | |
| The out printer function must return one of the three CT_ call type values.  If
 | |
| it returns CT_DONE, the main output routine will immediately print the closing
 | |
| parenthesis and equals sign.  If it returns CT_NORETURN, a closing parenthesis
 | |
| will be printed.  If it returns CT_NOTDONE, only a parameter field separator
 | |
| (that is, a comma and a space) will be printed--after all, it can be assumed
 | |
| that more parameters will be printed later.
 | |
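As a compact summary of the paragraph above, the following sketch maps each
call type to the text that the main output routine adds once the out printer
function returns.  The function name and the numeric values of the CT_
constants are assumptions made for this example only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical numeric values; the real CT_ constants are in the tracer. */
enum { CT_NOTDONE, CT_DONE, CT_NORETURN };

/* Text added to the call-enter output after the "out" printer returns. */
const char *
demo_enter_suffix(int call_type)
{
	switch (call_type) {
	case CT_DONE:		/* all parameters printed; result follows */
		return ") = ";
	case CT_NORETURN:	/* "= result" is printed only on failure */
		return ")";
	case CT_NOTDONE:	/* the "in" printer prints more parameters */
		return ", ";
	}
	assert(!"unknown call type");
	return "";
}
```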
| 
 | |
| An "in" printer function has the following prototype:
 | |
| 
 | |
|   void svc_syscall_in(struct trace_proc *proc, const message *m_out,
 | |
|           const message *m_in, int failed);
 | |
| 
 | |
| Again, "proc" is the traced process whose current system call has now
 | |
| returned.  "m_out" is again the request message, guaranteed to be unchanged
 | |
| since the "out" call.  "m_in" is the reply message from the service.  "failed"
 | |
| is either 0 to indicate that the call appears to have succeeded, or PF_FAILED
 | |
| to indicate that the call definitely failed.  If PF_FAILED is set, the call
 | |
| has failed either at the IPC level or at the system call level (or for another,
 | |
| less common reason).  In that case, the contents of "m_in" may be garbage and
 | |
| "m_in" must not be used at all.
 | |
| 
 | |
| For CT_NOTDONE type calls, the in printer function should first print the
 | |
| remaining parameters.  Here especially, it is important to consider that the
 | |
| entire call may fail.  In that case, the parameters of which the contents were
 | |
| still going to be printed may also contain garbage, since they were never
 | |
| filled.  The expected behavior is to print such parameters as a pointer, "&..",
 | |
| or something else to indicate that their actual contents are not valid.
 | |
| 
 | |
| Either way, once a CT_NOTDONE type call function is done printing the remaining
 | |
| parameters, it must call put_equals(proc) to print the closing parenthesis of
 | |
| the call and the equals sign.  CT_NORETURN calls must also use put_equals(proc)
 | |
| to print the equals sign.
 | |
| 
 | |
| Then comes the result part.  If the call failed, the in printer function *must*
 | |
| use put_result(proc) to print the failure result.  This call not only takes
 | |
| care of converting negative error codes from m_in->m_type into "-1 [ECODE]" but
 | |
| also prints appropriate failure codes for IPC-level and other exceptional
 | |
| failures.  Only if the system call did not fail, may the in printer function
 | |
| choose to not call put_result(proc), which on success simply prints
 | |
| m_in->m_type as an integer.  Similarly, if the system call succeeded, the in
 | |
| printer function may print extended results after the primary result, generally
 | |
| in parentheses.  For example, getpid() and getppid() share the same system call
 | |
| and thus the tracer prints both return values, one as the primary result of the
 | |
| actual call and one in parentheses with a clarifying name as extended result:
 | |
| 
 | |
|   getpid() = 3 (ppid=1)
 | |
| 
 | |
| It should now be clear that printing extended results makes no sense if the
 | |
| system call failed.
 | |
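To tie the out and in printer functions together, here is a self-contained
sketch of a read()-like handler pair.  Everything in it is a simplified
stand-in: the message layout, the contents of struct trace_proc, the constant
values, and the bodies and exact signatures of put_field, put_equals and
put_result are assumptions for illustration, not the tracer's real
definitions.  In particular, separators are added lazily by the stand-in
put_field, and the buffer contents are replaced by a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define CT_NOTDONE	0	/* hypothetical value */
#define PF_FAILED	0x01	/* hypothetical value */

/* Hypothetical, heavily simplified stand-ins for the tracer's types. */
typedef struct {
	int m_type;		/* reply: result or negative error code */
	int fd;
	unsigned long buf;	/* buffer address in the traced process */
	int len;
} message;

struct trace_proc {
	char out[128];		/* output collected so far */
	int nfields;		/* parameters printed so far */
	int result;		/* call result, set by the main routine */
	int failed;
};

static void
emit(struct trace_proc *proc, const char *fmt, ...)
{
	size_t len = strlen(proc->out);
	va_list ap;

	assert(len < sizeof(proc->out));
	va_start(ap, fmt);
	vsnprintf(proc->out + len, sizeof(proc->out) - len, fmt, ap);
	va_end(ap);
}

static void
put_field(struct trace_proc *proc, const char *name, const char *text)
{
	(void)name;	/* shown only if name printing is enabled */
	emit(proc, "%s%s", (proc->nfields++ > 0) ? ", " : "", text);
}

static void
put_equals(struct trace_proc *proc)
{
	emit(proc, ") = ");
}

static void
put_result(struct trace_proc *proc)
{
	if (proc->failed)
		emit(proc, "-1 [%s]",
		    (proc->result == -EBADF) ? "EBADF" : "E..");
	else
		emit(proc, "%d", proc->result);
}

/* "out" printer: only the descriptor is meaningful at call-enter time. */
int
demo_read_out(struct trace_proc *proc, const message *m_out)
{
	char tmp[32];

	snprintf(tmp, sizeof(tmp), "%d", m_out->fd);
	put_field(proc, "fd", tmp);
	return CT_NOTDONE;
}

/* "in" printer: finish the parameters, then print the result. */
void
demo_read_in(struct trace_proc *proc, const message *m_out,
	const message *m_in, int failed)
{
	char tmp[32];

	if (failed)	/* buffer never filled; m_in must not be used */
		snprintf(tmp, sizeof(tmp), "&0x%lx", m_out->buf);
	else		/* placeholder for the actual buffer contents */
		snprintf(tmp, sizeof(tmp), "<%d bytes>", m_in->m_type);
	put_field(proc, "buf", tmp);
	snprintf(tmp, sizeof(tmp), "%d", m_out->len);
	put_field(proc, "len", tmp);
	put_equals(proc);
	put_result(proc);
}

/* Stand-in for the main routine: run both phases of one call. */
const char *
demo_trace_read(struct trace_proc *proc, const message *m_out,
	const message *m_in, int failed)
{
	memset(proc, 0, sizeof(*proc));
	emit(proc, "read(");
	(void)demo_read_out(proc, m_out);
	proc->failed = failed;
	proc->result = m_in->m_type;	/* call-level result or error code */
	demo_read_in(proc, m_out, m_in, failed);
	return proc->out;
}
```

Note how the in printer never touches "m_in" on failure, and how the failure
result is left entirely to put_result.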
| 
 | |
| Besides put_equals and put_result, the following more or less generic support
 | |
| functions are available to print the various parts of the requests and replies.
 | |
| 
 | |
|   put_field - output a parameter, structure field, and so on; this function
 | |
|               should be used for just about every actual value
 | |
|   put_value - printf-like version of put_field
 | |
|   put_text  - output plain text; for call handlers, this should be used only to
 | |
|               add things right after a put_field call, never on its own
 | |
|   put_fmt   - printf-like version of put_text, should generally not be used
 | |
|               from call handlers at all
 | |
|   put_open  - open a nested block of fields, surrounded by parentheses,
 | |
|               brackets, or something like that; this is used for structures,
 | |
|               arrays, and any other similar nontrivial case of nesting
 | |
|   put_close - close a previously opened block of fields; the nesting depth is
 | |
|               actually tracked (to keep per-level separators etc), so each
 | |
|               put_open call must have a corresponding put_close call
 | |
|   put_open_struct  - perform several tasks necessary to start printing the
 | |
|                      fields of a structure; note that this function may fail!
 | |
|   put_close_struct - end successful printing of a structure
 | |
|   put_ptr   - print a pointer in the traced process
 | |
|   put_buf   - print a buffer or string
 | |
|   put_flags - print a bitwise flags field
 | |
|   put_tail  - helper function for printing the continuation part of an array
 | |
| 
 | |
| Many of these support functions take a flags field which takes PF_-prefixed
 | |
| flags to modify the output they generate.  The value of 'failed' in the in
 | |
| printer function may actually be passed (bitwise-OR'ed in) as the PF_FAILED
 | |
| flag to these support functions, and they will do the right thing.  For
 | |
| example, a call to put_open_struct with the PF_FAILED flag will end up simply
 | |
| printing the pointer to the structure, and not allow printing of the contents
 | |
| of the structure.
 | |
| 
 | |
| The above support functions are documented (at a basic level) within the code,
 | |
| but in many cases, it may be useful to look up how they are used in practice by
 | |
| the existing handlers.  The same goes for various less clear cases; while there
 | |
| is basic support for printing structures, support for printing arrays must be
 | |
| coded fully by hand, as has been done for many places.  A serious attempt has
 | |
| been made to make the output consistent across the board (mainly thanks to the
 | |
| output format of strace, on which the output of this tracer has been based,
 | |
| sometimes very strictly and sometimes more loosely, but that aside) so it is
 | |
| always advisable to follow the ways of the existing handlers.  Also keep in
 | |
| mind that there are already printer functions for several generic structures,
 | |
| and these should be used whenever possible (e.g., see the put_fd() comment).
 | |
| 
 | |
| Finally, the default_out and default_in functions may be used as printer
 | |
| functions for calls with no parameters, and for functions which need no more
 | |
| than put_result() to print their system call result, respectively.
 | |
| 
 | |
| 
 | |
| ADDING AN IOCTL HANDLER
 | |
| 
 | |
| There are many IOCTL requests, and many have their own associated data types.
 | |
| Like with system calls, the idea is to provide an actual implementation for any
 | |
| IOCTLs that can actually occur in the wild.  This consists of printing the full
 | |
| IOCTL name, as well as its argument.  First something about how handling IOCTLs
 | |
| is grouped into files in the ioctl subdirectory, then about the actual
 | |
| procedure by which the IOCTLs are handled.
 | |
| 
 | |
| Grouping of IOCTL handling in the ioctl subdirectory is currently based on the
 | |
| IOCTLs' associated device type.  This is not a performance optimization: for
 | |
| any given IOCTL, there is no way for the main IOCTL code (in ioctl.c) to know
 | |
| which group, if any, contains a handler for the IOCTL, so it simply queries all
 | |
| groups.  The grouping is there only to keep down the size of individual source
 | |
| files, and as such not even strict: for example, networking IOCTLs are
 | |
| technically a subset of character IOCTLs, and kept separate only because there
 | |
| are so many of them.  The point here is mainly that the separation is not at
 | |
| all set in stone.  However, the svrctl group is an exception: svrctl(2)
 | |
| requests are very much like IOCTLs, and thus also treated as such, but they are
 | |
| in a different namespace.  Thus, their handlers are in a separate file.
 | |
| 
 | |
| As per the ioctl_table structure, each group has a function to return the name
 | |
| of an IOCTL it knows (typically <group>_ioctl_name), and a function to handle
 | |
| IOCTL arguments (typically <group>_ioctl_arg).  Whenever an IOCTL system call
 | |
| is made, each group's name function is queried.  This function has the
 | |
| following prototype:
 | |
| 
 | |
|   const char *group_ioctl_name(unsigned long req);
 | |
| 
 | |
| The "req" parameter contains the IOCTL request code.  The function is to return
 | |
| a static non-NULL string if it knows the name for the request code, or NULL
 | |
| otherwise.  If the function returns a non-NULL string, that name will be used
 | |
| for the IOCTL.  In addition, if the IOCTL has an argument at all, i.e. it is
 | |
| not of the basic _IO() type, that group (and only that group!) will be queried
 | |
| about the IOCTL argument, by calling the group's IOCTL argument function.  The
 | |
| IOCTL argument function has the following prototype:
 | |
| 
 | |
|   int group_ioctl_arg(struct trace_proc *proc, unsigned long req, void *ptr,
 | |
|           int dir);
 | |
| 
 | |
| For a single IOCTL, this function may be called up to three times.  The first
 | |
| time, "ptr" will be NULL, and based on the same IOCTL request code "req", the
 | |
| function must return any bitwise combination of two flags: IF_OUT and IF_IN.
 | |
| 
 | |
| The returned flags determine whether and how the IOCTL's argument will be
 | |
| printed: before and/or after performing the IOCTL system call.  These two flags
 | |
| effectively correspond to the "write" and "read" argument directions of IOCTLs:
 | |
| IF_OUT indicates that the argument should be printed before the IOCTL request,
 | |
| and this is to be used only for IOCTLs of type _IOW() and _IOWR().  IF_IN
 | |
| indicates that the argument should be printed after the IOCTL request (but if
 | |
| it was successful only), and is to be used only for IOCTLs of type _IOR() and
 | |
| _IOWR().
 | |
| 
 | |
| The returned flag combination determines how the IOCTL is formatted.  The
 | |
| following possible return values result in the following output formats, again
 | |
| with the "|" indicating the call split, "out" being the IOCTL argument contents
 | |
| printed before the IOCTL call, and "in" being the IOCTL argument printed after
 | |
| the IOCTL call:
 | |
| 
 | |
|   0:             ioctl(3, IOCFOO, &0xaddress) = |0
 | |
|   IF_OUT:        ioctl(3, IOCFOO, {out}) = |0
 | |
|   IF_OUT|IF_IN:  ioctl(3, IOCFOO, {out}) = |0 {in}
 | |
|   IF_IN:         ioctl(3, IOCFOO, |{in}) = 0
 | |
| 
 | |
| Both IF_ flags are optional, mainly because it is not always needed to print
 | |
| both sides for an _IOWR() request.  However, using the wrong flag (e.g., IF_OUT
 | |
| for an _IOR() request, which simply makes no sense) will trigger an assert.
 | |
| Also, the function should basically never return 0 for an IOCTL it recognizes.
 | |
| Again, for IOCTLs of type _IO(), which have no argument, the argument function
 | |
| is not called at all.
 | |
| 
 | |
| Now the important part.  For each flag that is returned on the initial call to
 | |
| the argument function, the argument function will be called again, this time to
 | |
| perform actual printing of the argument.  For these subsequent calls, "ptr"
 | |
| will point to the argument data which has been copied to the local address
 | |
| space, and "dir" will contain one of the returned flags (that is, either IF_OUT
 | |
| or IF_IN) to indicate whether the function is called before or after the IOCTL
 | |
| call.  As should now be obvious, if the first call returned IF_OUT | IF_IN, the
 | |
| function will be called again with "dir" set to IF_OUT, and if the IOCTL call
 | |
| did not fail, once more (for the third time), now with "dir" set to IF_IN.
 | |
| 
 | |
| For these calls with an actual "ptr" value and a direction, the function should
 | |
| indeed print the argument as appropriate, using "proc" as process pointer for
 | |
| use in calls to the printing functions.  The general approach is to print non-
 | |
| structure arguments as single values with no field name, and structure
 | |
| arguments by printing their fields with their field names.  The main code (in
 | |
| ioctl.c) ensures that the output is enclosed in curly brackets, thus making the
 | |
| output look like a structure anyway.
 | |
| 
 | |
| For these subsequent calls, the argument function's return value should be
 | |
| IF_ALL if all parts of the IOCTL argument have been printed, or 0 otherwise.
 | |
| In the latter case, the main code will add a final ".." field to indicate to
 | |
| the user that not all parts of the argument have been printed, very much like
 | |
| the "all" parameter of put_close_struct.
 | |
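The three-call protocol of the argument function can be sketched as follows.
The request codes, the demo_foo structure, the flag values, and the output
helpers are all hypothetical; only the calling convention (a first call with
a NULL "ptr" returning IF_ flags, later calls returning IF_ALL or 0) follows
the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical values; the real IF_ flags are defined by the tracer. */
#define IF_OUT		0x1
#define IF_IN		0x2
#define IF_ALL		0x4

/* Hypothetical request codes and argument structure. */
#define DEMO_IOCSFOO	0x8001UL	/* stands for an _IOW() request */
#define DEMO_IOCGFOO	0x4001UL	/* stands for an _IOR() request */
struct demo_foo { int speed; int flags; };

/* Hypothetical stand-in for the tracer's process structure. */
struct trace_proc { char out[64]; int nfields; };

/* Stand-in for put_value(): append "name=value" with a separator. */
static void
demo_put_int(struct trace_proc *proc, const char *name, int value)
{
	size_t len = strlen(proc->out);

	snprintf(proc->out + len, sizeof(proc->out) - len, "%s%s=%d",
	    (proc->nfields++ > 0) ? ", " : "", name, value);
}

int
demo_ioctl_arg(struct trace_proc *proc, unsigned long req, void *ptr, int dir)
{
	const struct demo_foo *foo = ptr;

	if (req != DEMO_IOCSFOO && req != DEMO_IOCGFOO)
		return 0;		/* not one of our requests */
	if (ptr == NULL)		/* first call: report directions */
		return (req == DEMO_IOCSFOO) ? IF_OUT : IF_IN;

	assert(dir == IF_OUT || dir == IF_IN);
	demo_put_int(proc, "speed", foo->speed);
	if (req == DEMO_IOCGFOO)
		return 0;		/* "flags" not printed: partial */
	demo_put_int(proc, "flags", foo->flags);
	return IF_ALL;
}

/* Stand-in for the main code: add the brackets and the ".." field. */
const char *
demo_print_arg(struct trace_proc *proc, unsigned long req, void *ptr, int dir)
{
	int all;

	memset(proc, 0, sizeof(*proc));
	strcpy(proc->out, "{");
	all = demo_ioctl_arg(proc, req, ptr, dir);
	strcat(proc->out, all ? "}" : ", ..}");
	return proc->out;
}
```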
| 
 | |
| If no name can be found for the IOCTL request code, the argument will simply be
 | |
| printed as a pointer.  The same happens in error cases, for example if copying
 | |
| in the IOCTL data resulted in an error.
 | |
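The name lookup described earlier in this section, where each group is
queried in turn and only the matching group is asked about the argument, may
look roughly like the sketch below.  The table layout, its field name, the
group functions, and the request codes are assumptions for this example; the
real ioctl_table structure may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request codes, for illustration only. */
#define DEMO_IOCFOO	0x1001UL
#define DEMO_IOCBAR	0x2001UL

/* Hypothetical, reduced form of a per-group table entry. */
struct demo_ioctl_table {
	const char *(*get_name)(unsigned long req);
};

const char *
demo_char_ioctl_name(unsigned long req)
{
	return (req == DEMO_IOCFOO) ? "IOCFOO" : NULL;
}

const char *
demo_net_ioctl_name(unsigned long req)
{
	return (req == DEMO_IOCBAR) ? "IOCBAR" : NULL;
}

static const struct demo_ioctl_table demo_groups[] = {
	{ demo_char_ioctl_name },
	{ demo_net_ioctl_name },
};

/*
 * Query every group in turn.  Return the index of the first group that
 * knows the request code and store its name in *namep, or return -1 (with
 * *namep set to NULL) so that the caller can fall back to printing the
 * argument as a pointer.
 */
int
demo_find_group(unsigned long req, const char **namep)
{
	size_t i;

	assert(namep != NULL);
	for (i = 0; i < sizeof(demo_groups) / sizeof(demo_groups[0]); i++) {
		if ((*namep = demo_groups[i].get_name(req)) != NULL)
			return (int)i;
	}
	return -1;
}
```

Returning the group index lets the caller direct the argument query to that
group only, in line with the "only that group" rule.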
| 
 | |
| There is no support for dealing with multiple IOCTLs with the exact same
 | |
| request code--something that should not, but sadly does, occur in practice.
 | |
| For now, the preferred approach would be to implement only support for the
 | |
| IOCTL that is most likely to be found in practice, and possibly to put a horse
 | |
| head in the bed of whoever introduced the duplicate request code.
 | |
| 
 | |
| 
 | |
| INTERNALS: MULTIPROCESS OUTPUT AND PREEMPTION
 | |
| 
 | |
| Things get interesting when multiple processes are traced at once.  Due to the
 | |
| nature of process scheduling, system calls may end up being preempted between
 | |
| the call-enter and call-leave phases.  This means that the output of a system
 | |
| call has to be suspended to give way to an event from another traced process.
 | |
| Such preemption may occur with literally all calls; not just "blocking" calls.
 | |
| 
 | |
| The tracer goes to some lengths to aid the user in following the output in
 | |
| the light of preemption.  The most important aspect is that the output of the
 | |
| call-enter phase is recorded, so that in the case of preemption, the call-leave
 | |
| phase can start by replaying the record.  As a result, the user gets to see the
 | |
| whole system call on a single line, instead of just the second half.  Such
 | |
| system call resumptions are marked with a "*" in their prefix, to show that
 | |
| the call was not just entered.  The output therefore looks like this:
 | |
| 
 | |
|       2| syscall() = <..>
 | |
|       3| othercall() = 0
 | |
|       2|*syscall() = 0
 | |
| 
 | |
| Signals that arrive during a call will cause a resumption of the call as well.
 | |
| As a result, a call may be resumed multiple times:
 | |
| 
 | |
|       2| syscall() = <..>
 | |
|       3| othercall() = 0
 | |
|       2|*syscall() = ** SIGUSR1 ** ** SIGUSR2 ** <..>
 | |
|       3| othercall() = -1 [EBUSY]
 | |
|       2|*syscall() = ** SIGHUP ** <..>
 | |
|       3| othercall() = 0
 | |
|       2|*syscall() = 0
 | |
| 
 | |
| This entire scenario shows one single system call from process 2.
 | |
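The record-and-replay idea can be modeled in a few lines.  The sketch below
is not the tracer's implementation: the structures and function names are
hypothetical, signals and buffer exhaustion are ignored, and all output goes
into one string.  It only shows how recording the call-enter text allows a
preempted call to be shown in full, with a "*" in the prefix, when it
resumes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-process state: the recorded call-enter text. */
struct demo_proc {
	int pid;
	char record[64];
};

/* Hypothetical output state: all text, and the owner of the open line. */
struct demo_out {
	char text[256];
	const struct demo_proc *cur;
};

static void
demo_emit(struct demo_out *o, const char *s)
{
	size_t len = strlen(o->text);

	assert(len + strlen(s) < sizeof(o->text));
	strcpy(o->text + len, s);
}

/* Start a line for "p", first suspending any other process's open line. */
static void
demo_start(struct demo_out *o, const struct demo_proc *p, int resumed)
{
	char prefix[16];

	if (o->cur != NULL)
		demo_emit(o, "<..>\n");
	snprintf(prefix, sizeof(prefix), "%7d|%c", p->pid,
	    resumed ? '*' : ' ');
	demo_emit(o, prefix);
	o->cur = p;
}

/* Call-enter event: record the text, then print it. */
void
demo_enter(struct demo_out *o, struct demo_proc *p, const char *text)
{
	snprintf(p->record, sizeof(p->record), "%s", text);
	demo_start(o, p, 0);
	demo_emit(o, p->record);
}

/* Call-leave event: replay the record if the call was preempted. */
void
demo_leave(struct demo_out *o, struct demo_proc *p, const char *result)
{
	if (o->cur != p) {
		demo_start(o, p, 1);
		demo_emit(o, p->record);
	}
	demo_emit(o, result);
	demo_emit(o, "\n");
	o->cur = NULL;
}
```

Entering a call for process 2, then entering and leaving a call for process
3, and finally leaving the call of process 2 reproduces the three-line
example shown earlier in this section.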
| 
 | |
| In the current implementation, the output that should be recorded and/or cause
 | |
| the "<..>" preemption marker, as well as the cases where the recorded text must
 | |
| be replayed, are marked by the code explicitly.  Replay takes place in three
 | |
| cases: upon the call-leave event (obviously), upon receiving a signal (as shown
 | |
| above), and when it is required that a suspended no-return call is shown as
 | |
| completed before continuing with other output.  The last case applies to exit()
 | |
| and execve(), and both are documented in the code quite extensively.  Generally
 | |
| speaking, in all output lines where no recording or replay actions are
 | |
| performed, the recording will not be replayed but also not removed.  This
 | |
| allows for intermediate lines for that process in the output.  Practically
 | |
| speaking, future support for job control could even print when a process gets
 | |
| stopped and continued, for that process, while preempting the output for the
 | |
| ongoing system call for that same process.
 | |
| 
 | |
| It is possible that the output of the call-enter phase exhausts the recording
 | |
| buffer for its process.  In this case, a new, shorter text is generated upon
 | |
| process resumption.  There are many other aspects to proper output formatting
 | |
| in the light of preemption, but most of them should be documented as part of
 | |
| the code reasonably well.
 |