author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-06-09 10:06:18 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-06-09 10:06:18 -0700 |
commit | d1e521adad250ab8c979861c857fa2b1542c9741 (patch) | |
tree | 8c83de6b27c8a1aae0952cb86a25be7f27a51f9b /Documentation/trace | |
parent | 595a56ac1b0d5f0a16a89589ef55ffd35c1967a2 (diff) | |
parent | 388d8bdb87e01bcea6d0b2bf797b5f6d7b2401fb (diff) |
Merge tag 'trace-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"No new features this release. Mostly clean ups, restructuring and
documentation.
- Have ftrace_bug() show ftrace errors before the WARN, as the WARN
will reboot the box before the error messages are printed if
panic_on_warn is set.
- Have traceoff_on_warn disable tracing sooner (before prints)
- Write a message to the trace buffer that it's being disabled when
disable_trace_on_warning() is set.
- Separate out synthetic events from histogram code to let it be used
by other parts of the kernel.
- More documentation on histogram design.
- Other small fixes and clean ups"
* tag 'trace-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove obsolete PREEMPTIRQ_EVENTS kconfig option
tracing/doc: Fix ascii-art in histogram-design.rst
tracing: Add a trace print when traceoff_on_warning is triggered
ftrace,bug: Improve traceoff_on_warn
selftests/ftrace: Distinguish between hist and synthetic event checks
tracing: Move synthetic events to a separate file
tracing: Fix events.rst section numbering
tracing/doc: Fix typos in histogram-design.rst
tracing: Add hist_debug trace event files for histogram debugging
tracing: Add histogram-design document
tracing: Check state.disabled in synth event trace functions
tracing/probe: reverse arguments to list_add
tools/bootconfig: Add a summary of test cases and return error
ftrace: show debugging information when panic_on_warn set
Diffstat (limited to 'Documentation/trace')
-rw-r--r-- | Documentation/trace/histogram-design.rst | 2115 |
1 files changed, 2115 insertions, 0 deletions
diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
new file mode 100644
index 000000000000..eef840043da9
--- /dev/null
+++ b/Documentation/trace/histogram-design.rst
@@ -0,0 +1,2115 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Histogram Design Notes
+======================
+
+:Author: Tom Zanussi <zanussi@kernel.org>
+
+This document attempts to provide a description of how the ftrace
+histograms work and how the individual pieces map to the data
+structures used to implement them in trace_events_hist.c and
+tracing_map.c.
+
+Note: All the ftrace histogram command examples assume the working
+      directory is the ftrace /tracing directory. For example::
+
+  # cd /sys/kernel/debug/tracing
+
+Also, the histogram output displayed for those commands will
+generally be truncated - only enough to make the point is displayed.
+
+'hist_debug' trace event files
+==============================
+
+If the kernel is compiled with CONFIG_HIST_TRIGGERS_DEBUG set, an
+event file named 'hist_debug' will appear in each event's
+subdirectory. This file can be read at any time and will display some
+of the hist trigger internals described in this document. Specific
+examples and output will be described in test cases below.
+
+Basic histograms
+================
+
+First, basic histograms. Below is pretty much the simplest thing you
+can do with histograms - create one with a single key on a single
+event and cat the output::
+
+  # echo 'hist:keys=pid' >> events/sched/sched_waking/trigger
+
+  # cat events/sched/sched_waking/hist
+
+  { pid:      18249 } hitcount:          1
+  { pid:      13399 } hitcount:          1
+  { pid:      17973 } hitcount:          1
+  { pid:      12572 } hitcount:          1
+  ...
+  { pid:         10 } hitcount:        921
+  { pid:      18255 } hitcount:       1444
+  { pid:      25526 } hitcount:       2055
+  { pid:       5257 } hitcount:       2055
+  { pid:      27367 } hitcount:       2055
+  { pid:       1728 } hitcount:       2161
+
+  Totals:
+      Hits: 21305
+      Entries: 183
+      Dropped: 0
+
+What this does is create a histogram on the sched_waking event using
+pid as a key and with a single value, hitcount, which even if not
+explicitly specified, exists for every histogram regardless.
+
+The hitcount value is a per-bucket value that's automatically
+incremented on every hit for the given key, which in this case is the
+pid.
+
+So in this histogram, there's a separate bucket for each pid, and each
+bucket contains a value for that bucket, counting the number of times
+sched_waking was called for that pid.
+
+Each histogram is represented by a hist_data struct.
+
+To keep track of each key and value field in the histogram, hist_data
+keeps an array of these fields named fields[]. The fields[] array is
+an array containing struct hist_field representations of each
+histogram val and key in the histogram (variables are also included
+here, but are discussed later). So for the above histogram we have one
+key and one value; in this case the one value is the hitcount value,
+which all histograms have, regardless of whether they define that
+value or not, which the above histogram does not.
+
+Each struct hist_field contains a pointer to the ftrace_event_field
+from the event's trace_event_file along with various bits related to
+that such as the size, offset, type, and a hist_field_fn_t function,
+which is used to grab the field's data from the ftrace event buffer
+(in most cases - some hist_fields such as hitcount don't directly map
+to an event field in the trace buffer - in these cases the function
+implementation gets its value from somewhere else). The flags field
+indicates which type of field it is - key, value, variable, variable
+reference, etc., with value being the default.
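The relationship between a field descriptor and its accessor function can be illustrated with a small userspace sketch. This is not the kernel code and not part of the patch: every `sketch_` name, the record layout, and the two-argument accessor signature are assumptions made for illustration only, and the real struct hist_field and hist_field_fn_t in trace_events_hist.c carry more members and arguments. It only shows the idea that a value field such as hitcount ignores the record, while a key field such as pid is read from the record using its size and offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct hist_field. */
struct sketch_hist_field;

/* Hypothetical accessor type, loosely modelled on hist_field_fn_t. */
typedef uint64_t (*sketch_field_fn_t)(const struct sketch_hist_field *f,
				      const void *rec);

#define SKETCH_FL_KEY 0x1u	/* assumed flag bit; 0 means a plain value */

struct sketch_hist_field {
	sketch_field_fn_t fn;	/* how to obtain this field's value */
	size_t size;		/* bytes the field occupies in the record */
	size_t offset;		/* byte offset of the field in the record */
	unsigned int flags;	/* SKETCH_FL_KEY or 0 */
};

/* Made-up event record layout, standing in for a sched_waking record. */
struct sketch_waking_rec {
	uint32_t pid;
	uint32_t prio;
};

/* hitcount does not map to an event field: every hit contributes 1. */
static uint64_t sketch_fn_hitcount(const struct sketch_hist_field *f,
				   const void *rec)
{
	(void)f;
	(void)rec;
	return 1;
}

/* Read an unsigned 4- or 8-byte field at f->offset from the record. */
static uint64_t sketch_fn_event_field(const struct sketch_hist_field *f,
				      const void *rec)
{
	const unsigned char *p = (const unsigned char *)rec + f->offset;
	uint32_t v32;
	uint64_t v64;

	assert(f->size == sizeof(v32) || f->size == sizeof(v64));
	if (f->size == sizeof(v32)) {
		memcpy(&v32, p, sizeof(v32));
		return v32;
	}
	memcpy(&v64, p, sizeof(v64));
	return v64;
}

/* Values come first, then keys, mirroring the fields[] ordering above. */
static const struct sketch_hist_field sketch_fields[] = {
	{ sketch_fn_hitcount, sizeof(uint64_t), 0, 0 },
	{ sketch_fn_event_field, sizeof(uint32_t),
	  offsetof(struct sketch_waking_rec, pid), SKETCH_FL_KEY },
};
```

Calling `sketch_fields[1].fn(&sketch_fields[1], &rec)` on a record yields that record's pid, while the hitcount accessor yields 1 regardless of the record contents.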
+ +The other important hist_data data structure in addition to the +fields[] array is the tracing_map instance created for the histogram, +which is held in the .map member. The tracing_map implements the +lock-free hash table used to implement histograms (see +kernel/trace/tracing_map.h for much more discussion about the +low-level data structures implementing the tracing_map). For the +purposes of this discussion, the tracing_map contains a number of +buckets, each bucket corresponding to a particular tracing_map_elt +object hashed by a given histogram key. + +Below is a diagram the first part of which describes the hist_data and +associated key and value fields for the histogram described above. As +you can see, there are two fields in the fields array, one val field +for the hitcount and one key field for the pid key. + +Below that is a diagram of a run-time snapshot of what the tracing_map +might look like for a given run. It attempts to show the +relationships between the hist_data fields and the tracing_map +elements for a couple hypothetical keys and values.:: + + +------------------+ + | hist_data | + +------------------+ +----------------+ + | .fields[] |---->| val = hitcount |----------------------------+ + +----------------+ +----------------+ | + | .map | | .size | | + +----------------+ +--------------+ | + | .offset | | + +--------------+ | + | .fn() | | + +--------------+ | + . | + . | + . | + +----------------+ <--- n_vals | + | key = pid |----------------------------|--+ + +----------------+ | | + | .size | | | + +--------------+ | | + | .offset | | | + +--------------+ | | + | .fn() | | | + +----------------+ <--- n_fields | | + | unused | | | + +----------------+ | | + | | | | + +--------------+ | | + | | | | + +--------------+ | | + | | | | + +--------------+ | | + n_keys = n_fields - n_vals | | + +The hist_data n_vals and n_fields delineate the extent of the fields[] | | +array and separate keys from values for the rest of the code. 
| | + +Below is a run-time representation of the tracing_map part of the | | +histogram, with pointers from various parts of the fields[] array | | +to corresponding parts of the tracing_map. | | + +The tracing_map consists of an array of tracing_map_entrys and a set | | +of preallocated tracing_map_elts (abbreviated below as map_entry and | | +map_elt). The total number of map_entrys in the hist_data.map array = | | +map->max_elts (actually map->map_size but only max_elts of those are | | +used. This is a property required by the map_insert() algorithm). | | + +If a map_entry is unused, meaning no key has yet hashed into it, its | | +.key value is 0 and its .val pointer is NULL. Once a map_entry has | | +been claimed, the .key value contains the key's hash value and the | | +.val member points to a map_elt containing the full key and an entry | | +for each key or value in the map_elt.fields[] array. There is an | | +entry in the map_elt.fields[] array corresponding to each hist_field | | +in the histogram, and this is where the continually aggregated sums | | +corresponding to each histogram value are kept. | | + +The diagram attempts to show the relationship between the | | +hist_data.fields[] and the map_elt.fields[] with the links drawn | | +between diagrams:: + + +-----------+ | | + | hist_data | | | + +-----------+ | | + | .fields | | | + +---------+ +-----------+ | | + | .map |---->| map_entry | | | + +---------+ +-----------+ | | + | .key |---> 0 | | + +---------+ | | + | .val |---> NULL | | + +-----------+ | | + | map_entry | | | + +-----------+ | | + | .key |---> pid = 999 | | + +---------+ +-----------+ | | + | .val |--->| map_elt | | | + +---------+ +-----------+ | | + . | .key |---> full key * | | + . +---------+ +---------------+ | | + . 
| .fields |--->| .sum (val) |<-+ | + +-----------+ +---------+ | 2345 | | | + | map_entry | +---------------+ | | + +-----------+ | .offset (key) |<----+ + | .key |---> 0 | 0 | | | + +---------+ +---------------+ | | + | .val |---> NULL . | | + +-----------+ . | | + | map_entry | . | | + +-----------+ +---------------+ | | + | .key | | .sum (val) or | | | + +---------+ +---------+ | .offset (key) | | | + | .val |--->| map_elt | +---------------+ | | + +-----------+ +---------+ | .sum (val) or | | | + | map_entry | | .offset (key) | | | + +-----------+ +---------------+ | | + | .key |---> pid = 4444 | | + +---------+ +-----------+ | | + | .val | | map_elt | | | + +---------+ +-----------+ | | + | .key |---> full key * | | + +---------+ +---------------+ | | + | .fields |--->| .sum (val) |<-+ | + +---------+ | 65523 | | + +---------------+ | + | .offset (key) |<----+ + | 0 | + +---------------+ + . + . + . + +---------------+ + | .sum (val) or | + | .offset (key) | + +---------------+ + | .sum (val) or | + | .offset (key) | + +---------------+ + +Abbreviations used in the diagrams:: + + hist_data = struct hist_trigger_data + hist_data.fields = struct hist_field + fn = hist_field_fn_t + map_entry = struct tracing_map_entry + map_elt = struct tracing_map_elt + map_elt.fields = struct tracing_map_field + +Whenever a new event occurs and it has a hist trigger associated with +it, event_hist_trigger() is called. event_hist_trigger() first deals +with the key: for each subkey in the key (in the above example, there +is just one subkey corresponding to pid), the hist_field that +represents that subkey is retrieved from hist_data.fields[] and the +hist_field_fn_t fn() associated with that field, along with the +field's size and offset, is used to grab that subkey's data from the +current trace record. + +Once the complete key has been retrieved, it's used to look that key +up in the tracing_map. 
If there's no tracing_map_elt associated with
+that key, an empty one is claimed and inserted in the map for the new
+key. In either case, the tracing_map_elt associated with that key is
+returned.
+
+Once a tracing_map_elt is available, hist_trigger_elt_update() is
+called. As the name implies, this updates the element, which basically
+means updating the element's fields. There's a tracing_map_field
+associated with each key and value in the histogram, and each of these
+corresponds to the key and value hist_fields created when the histogram
+was created. hist_trigger_elt_update() goes through each value
+hist_field and, as for the keys, uses the hist_field's fn() and size
+and offset to grab the field's value from the current trace record.
+Once it has that value, it simply adds that value to that field's
+continually-updated tracing_map_field.sum member. Some hist_field
+fn()s, such as for the hitcount, don't actually grab anything from the
+trace record (the hitcount fn() just increments the counter sum by 1),
+but the idea is the same.
+
+Once all the values have been updated, hist_trigger_elt_update() is
+done and returns. Note that there are also tracing_map_fields for
+each subkey in the key, but hist_trigger_elt_update() doesn't look at
+them or update anything - those exist only for sorting, which can
+happen later.
+
+Basic histogram test
+--------------------
+
+This is a good example to try. It produces 3 value fields and 2 key
+fields in the output::
+
+  # echo 'hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+To see the debug data, cat the kmem/kmalloc's 'hist_debug' file. It
+will show the trigger info of the histogram it corresponds to, along
+with the address of the hist_data associated with the histogram, which
+will become useful in later examples.
It then displays the number of
+total hist_fields associated with the histogram along with a count of
+how many of those correspond to keys and how many correspond to values.
+
+It then goes on to display details for each field, including the
+field's flags and the position of each field in the hist_data's
+fields[] array, which is useful information for verifying that things
+internally appear correct or not, and which again will become even
+more useful in further examples::
+
+  # cat events/kmem/kmalloc/hist_debug
+
+  # event histogram
+  #
+  # trigger info: hist:keys=common_pid,call_site.sym:vals=hitcount,bytes_req,bytes_alloc:sort=hitcount:size=2048 [active]
+  #
+
+  hist_data: 000000005e48c9a5
+
+  n_vals: 3
+  n_keys: 2
+  n_fields: 5
+
+  val fields:
+
+    hist_data->fields[0]:
+      flags:
+        VAL: HIST_FIELD_FL_HITCOUNT
+      type: u64
+      size: 8
+      is_signed: 0
+
+    hist_data->fields[1]:
+      flags:
+        VAL: normal u64 value
+      ftrace_event_field name: bytes_req
+      type: size_t
+      size: 8
+      is_signed: 0
+
+    hist_data->fields[2]:
+      flags:
+        VAL: normal u64 value
+      ftrace_event_field name: bytes_alloc
+      type: size_t
+      size: 8
+      is_signed: 0
+
+  key fields:
+
+    hist_data->fields[3]:
+      flags:
+        HIST_FIELD_FL_KEY
+      ftrace_event_field name: common_pid
+      type: int
+      size: 8
+      is_signed: 1
+
+    hist_data->fields[4]:
+      flags:
+        HIST_FIELD_FL_KEY
+      ftrace_event_field name: call_site
+      type: unsigned long
+      size: 8
+      is_signed: 0
+
+The command below can be used to clean things up for the next test::
+
+  # echo '!hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+Variables
+=========
+
+Variables allow data to be saved by one hist trigger and retrieved by
+another hist trigger.
For example, a trigger +on the sched_waking event can capture a timestamp for a particular +pid, and later a sched_switch event that switches to that pid event +can grab the timestamp and use it to calculate a time delta between +the two events:: + + # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> + events/sched/sched_waking/trigger + + # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> + events/sched/sched_switch/trigger + +In terms of the histogram data structures, variables are implemented +as another type of hist_field and for a given hist trigger are added +to the hist_data.fields[] array just after all the val fields. To +distinguish them from the existing key and val fields, they're given a +new flag type, HIST_FIELD_FL_VAR (abbreviated FL_VAR) and they also +make use of a new .var.idx field member in struct hist_field, which +maps them to an index in a new map_elt.vars[] array added to the +map_elt specifically designed to store and retrieve variable values. +The diagram below shows those new elements and adds a new variable +entry, ts0, corresponding to the ts0 variable in the sched_waking +trigger above. + +sched_waking histogram +----------------------:: + + +------------------+ + | hist_data |<-------------------------------------------------------+ + +------------------+ +-------------------+ | + | .fields[] |-->| val = hitcount | | + +----------------+ +-------------------+ | + | .map | | .size | | + +----------------+ +-----------------+ | + | .offset | | + +-----------------+ | + | .fn() | | + +-----------------+ | + | .flags | | + +-----------------+ | + | .var.idx | | + +-------------------+ | + | var = ts0 | | + +-------------------+ | + | .size | | + +-----------------+ | + | .offset | | + +-----------------+ | + | .fn() | | + +-----------------+ | + | .flags & FL_VAR | | + +-----------------+ | + | .var.idx |----------------------------+-+ | + +-----------------+ | | | + . | | | + . | | | + . 
| | | + +-------------------+ <--- n_vals | | | + | key = pid | | | | + +-------------------+ | | | + | .size | | | | + +-----------------+ | | | + | .offset | | | | + +-----------------+ | | | + | .fn() | | | | + +-----------------+ | | | + | .flags & FL_KEY | | | | + +-----------------+ | | | + | .var.idx | | | | + +-------------------+ <--- n_fields | | | + | unused | | | | + +-------------------+ | | | + | | | | | + +-----------------+ | | | + | | | | | + +-----------------+ | | | + | | | | | + +-----------------+ | | | + | | | | | + +-----------------+ | | | + | | | | | + +-----------------+ | | | + n_keys = n_fields - n_vals | | | + | | | + +This is very similar to the basic case. In the above diagram, we can | | | +see a new .flags member has been added to the struct hist_field | | | +struct, and a new entry added to hist_data.fields representing the ts0 | | | +variable. For a normal val hist_field, .flags is just 0 (modulo | | | +modifier flags), but if the value is defined as a variable, the .flags | | | +contains a set FL_VAR bit. | | | + +As you can see, the ts0 entry's .var.idx member contains the index | | | +into the tracing_map_elts' .vars[] array containing variable values. | | | +This idx is used whenever the value of the variable is set or read. | | | +The map_elt.vars idx assigned to the given variable is assigned and | | | +saved in .var.idx by create_tracing_map_fields() after it calls | | | +tracing_map_add_var(). | | | + +Below is a representation of the histogram at run-time, which | | | +populates the map, along with correspondence to the above hist_data and | | | +hist_field data structures. | | | + +The diagram attempts to show the relationship between the | | | +hist_data.fields[] and the map_elt.fields[] and map_elt.vars[] with | | | +the links drawn between diagrams. 
For each of the map_elts, you can | | | +see that the .fields[] members point to the .sum or .offset of a key | | | +or val and the .vars[] members point to the value of a variable. The | | | +arrows between the two diagrams show the linkages between those | | | +tracing_map members and the field definitions in the corresponding | | | +hist_data fields[] members.:: + + +-----------+ | | | + | hist_data | | | | + +-----------+ | | | + | .fields | | | | + +---------+ +-----------+ | | | + | .map |---->| map_entry | | | | + +---------+ +-----------+ | | | + | .key |---> 0 | | | + +---------+ | | | + | .val |---> NULL | | | + +-----------+ | | | + | map_entry | | | | + +-----------+ | | | + | .key |---> pid = 999 | | | + +---------+ +-----------+ | | | + | .val |--->| map_elt | | | | + +---------+ +-----------+ | | | + . | .key |---> full key * | | | + . +---------+ +---------------+ | | | + . | .fields |--->| .sum (val) | | | | + . +---------+ | 2345 | | | | + . +--| .vars | +---------------+ | | | + . | +---------+ | .offset (key) | | | | + . | | 0 | | | | + . | +---------------+ | | | + . | . | | | + . | . | | | + . | . | | | + . | +---------------+ | | | + . | | .sum (val) or | | | | + . | | .offset (key) | | | | + . | +---------------+ | | | + . | | .sum (val) or | | | | + . | | .offset (key) | | | | + . | +---------------+ | | | + . | | | | + . +---------------->+---------------+ | | | + . | ts0 |<--+ | | + . | 113345679876 | | | | + . +---------------+ | | | + . | unused | | | | + . | | | | | + . +---------------+ | | | + . . | | | + . . | | | + . . | | | + . +---------------+ | | | + . | unused | | | | + . | | | | | + . +---------------+ | | | + . | unused | | | | + . | | | | | + . +---------------+ | | | + . | | | + +-----------+ | | | + | map_entry | | | | + +-----------+ | | | + | .key |---> pid = 4444 | | | + +---------+ +-----------+ | | | + | .val |--->| map_elt | | | | + +---------+ +-----------+ | | | + . | .key |---> full key * | | | + . 
+---------+ +---------------+ | | | + . | .fields |--->| .sum (val) | | | | + +---------+ | 2345 | | | | + +--| .vars | +---------------+ | | | + | +---------+ | .offset (key) | | | | + | | 0 | | | | + | +---------------+ | | | + | . | | | + | . | | | + | . | | | + | +---------------+ | | | + | | .sum (val) or | | | | + | | .offset (key) | | | | + | +---------------+ | | | + | | .sum (val) or | | | | + | | .offset (key) | | | | + | +---------------+ | | | + | | | | + | +---------------+ | | | + +---------------->| ts0 |<--+ | | + | 213499240729 | | | + +---------------+ | | + | unused | | | + | | | | + +---------------+ | | + . | | + . | | + . | | + +---------------+ | | + | unused | | | + | | | | + +---------------+ | | + | unused | | | + | | | | + +---------------+ | | + +For each used map entry, there's a map_elt pointing to an array of | | +.vars containing the current value of the variables associated with | | +that histogram entry. So in the above, the timestamp associated with | | +pid 999 is 113345679876, and the timestamp variable in the same | | +.var.idx for pid 4444 is 213499240729. | | + +sched_switch histogram | | +---------------------- | | + +The sched_switch histogram paired with the above sched_waking | | +histogram is shown below. The most important aspect of the | | +sched_switch histogram is that it references a variable on the | | +sched_waking histogram above. | | + +The histogram diagram is very similar to the others so far displayed, | | +but it adds variable references. You can see the normal hitcount and | | +key fields along with a new wakeup_lat variable implemented in the | | +same way as the sched_waking ts0 variable, but in addition there's an | | +entry with the new FL_VAR_REF (short for HIST_FIELD_FL_VAR_REF) flag. | | + +Associated with the new var ref field are a couple of new hist_field | | +members, var.hist_data and var_ref_idx. 
For a variable reference, the | | +var.hist_data goes with the var.idx, which together uniquely identify | | +a particular variable on a particular histogram. The var_ref_idx is | | +just the index into the var_ref_vals[] array that caches the values of | | +each variable whenever a hist trigger is updated. Those resulting | | +values are then finally accessed by other code such as trace action | | +code that uses the var_ref_idx values to assign param values. | | + +The diagram below describes the situation for the sched_switch | | +histogram referred to before:: + + # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> | | + events/sched/sched_switch/trigger | | + | | + +------------------+ | | + | hist_data | | | + +------------------+ +-----------------------+ | | + | .fields[] |-->| val = hitcount | | | + +----------------+ +-----------------------+ | | + | .map | | .size | | | + +----------------+ +---------------------+ | | + +--| .var_refs[] | | .offset | | | + | +----------------+ +---------------------+ | | + | | .fn() | | | + | var_ref_vals[] +---------------------+ | | + | +-------------+ | .flags | | | + | | $ts0 |<---+ +---------------------+ | | + | +-------------+ | | .var.idx | | | + | | | | +---------------------+ | | + | +-------------+ | | .var.hist_data | | | + | | | | +---------------------+ | | + | +-------------+ | | .var_ref_idx | | | + | | | | +-----------------------+ | | + | +-------------+ | | var = wakeup_lat | | | + | . | +-----------------------+ | | + | . | | .size | | | + | . | +---------------------+ | | + | +-------------+ | | .offset | | | + | | | | +---------------------+ | | + | +-------------+ | | .fn() | | | + | | | | +---------------------+ | | + | +-------------+ | | .flags & FL_VAR | | | + | | +---------------------+ | | + | | | .var.idx | | | + | | +---------------------+ | | + | | | .var.hist_data | | | + | | +---------------------+ | | + | | | .var_ref_idx | | | + | | +---------------------+ | | + | | . 
| | + | | . | | + | | . | | + | | +-----------------------+ <--- n_vals | | + | | | key = pid | | | + | | +-----------------------+ | | + | | | .size | | | + | | +---------------------+ | | + | | | .offset | | | + | | +---------------------+ | | + | | | .fn() | | | + | | +---------------------+ | | + | | | .flags | | | + | | +---------------------+ | | + | | | .var.idx | | | + | | +-----------------------+ <--- n_fields | | + | | | unused | | | + | | +-----------------------+ | | + | | | | | | + | | +---------------------+ | | + | | | | | | + | | +---------------------+ | | + | | | | | | + | | +---------------------+ | | + | | | | | | + | | +---------------------+ | | + | | | | | | + | | +---------------------+ | | + | | n_keys = n_fields - n_vals | | + | | | | + | | | | + | | +-----------------------+ | | + +---------------------->| var_ref = $ts0 | | | + | +-----------------------+ | | + | | .size | | | + | +---------------------+ | | + | | .offset | | | + | +---------------------+ | | + | | .fn() | | | + | +---------------------+ | | + | | .flags & FL_VAR_REF | | | + | +---------------------+ | | + | | .var.idx |--------------------------+ | + | +---------------------+ | + |