summaryrefslogtreecommitdiffstats
path: root/Documentation/trace
diff options
context:
space:
mode:
authorChangbin Du <changbin.du@intel.com>2018-02-17 13:39:38 +0800
committerJonathan Corbet <corbet@lwn.net>2018-03-07 10:23:10 -0700
commit1f198e22bc3a0af747a7cf7b444de49ff76b6869 (patch)
tree4bd4e057bfd0c1df535f680cf8cf1afc5ec519d2 /Documentation/trace
parent8fa4e720e8d919271cdf0da3c0856333246398a4 (diff)
trace doc: convert trace/ftrace.txt to rst format
This converts the plain text documentation to reStructuredText format and add it into Sphinx TOC tree. No essential content change. Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/trace')
-rw-r--r--Documentation/trace/ftrace.rst3332
-rw-r--r--Documentation/trace/ftrace.txt3220
-rw-r--r--Documentation/trace/index.rst1
3 files changed, 3333 insertions, 3220 deletions
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
new file mode 100644
index 000000000000..636aa9bf5674
--- /dev/null
+++ b/Documentation/trace/ftrace.rst
@@ -0,0 +1,3332 @@
+========================
+ftrace - Function Tracer
+========================
+
+Copyright 2008 Red Hat Inc.
+
+:Author: Steven Rostedt <srostedt@redhat.com>
+:License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+:Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
+ John Kacur, and David Teigland.
+
+- Written for: 2.6.28-rc2
+- Updated for: 3.10
+- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
+- Converted to rst format - Changbin Du <changbin.du@intel.com>
+
+Introduction
+------------
+
+Ftrace is an internal tracer designed to help out developers and
+designers of systems to find what is going on inside the kernel.
+It can be used for debugging or analyzing latencies and
+performance issues that take place outside of user-space.
+
+Although ftrace is typically considered the function tracer, it
+is really a frame work of several assorted tracing utilities.
+There's latency tracing to examine what occurs between interrupts
+disabled and enabled, as well as for preemption and from a time
+a task is woken to the task is actually scheduled in.
+
+One of the most common uses of ftrace is the event tracing.
+Through out the kernel is hundreds of static event points that
+can be enabled via the tracefs file system to see what is
+going on in certain parts of the kernel.
+
+See events.txt for more information.
+
+
+Implementation Details
+----------------------
+
+See :doc:`ftrace-design` for details for arch porters and such.
+
+
+The File System
+---------------
+
+Ftrace uses the tracefs file system to hold the control files as
+well as the files to display output.
+
+When tracefs is configured into the kernel (which selecting any ftrace
+option will do) the directory /sys/kernel/tracing will be created. To mount
+this directory, you can add to your /etc/fstab file::
+
+ tracefs /sys/kernel/tracing tracefs defaults 0 0
+
+Or you can mount it at run time with::
+
+ mount -t tracefs nodev /sys/kernel/tracing
+
+For quicker access to that directory you may want to make a soft link to
+it::
+
+ ln -s /sys/kernel/tracing /tracing
+
+.. attention::
+
+ Before 4.1, all ftrace tracing control files were within the debugfs
+ file system, which is typically located at /sys/kernel/debug/tracing.
+ For backward compatibility, when mounting the debugfs file system,
+ the tracefs file system will be automatically mounted at:
+
+ /sys/kernel/debug/tracing
+
+ All files located in the tracefs file system will be located in that
+ debugfs file system directory as well.
+
+.. attention::
+
+ Any selected ftrace option will also create the tracefs file system.
+ The rest of the document will assume that you are in the ftrace directory
+ (cd /sys/kernel/tracing) and will only concentrate on the files within that
+ directory and not distract from the content with the extended
+ "/sys/kernel/tracing" path name.
+
+That's it! (assuming that you have ftrace configured into your kernel)
+
+After mounting tracefs you will have access to the control and output files
+of ftrace. Here is a list of some of the key files:
+
+
+ Note: all time values are in microseconds.
+
+ current_tracer:
+
+ This is used to set or display the current tracer
+ that is configured.
+
+ available_tracers:
+
+ This holds the different types of tracers that
+ have been compiled into the kernel. The
+ tracers listed here can be configured by
+ echoing their name into current_tracer.
+
+ tracing_on:
+
+ This sets or displays whether writing to the trace
+ ring buffer is enabled. Echo 0 into this file to disable
+ the tracer or 1 to enable it. Note, this only disables
+ writing to the ring buffer, the tracing overhead may
+ still be occurring.
+
+ The kernel function tracing_off() can be used within the
+ kernel to disable writing to the ring buffer, which will
+ set this file to "0". User space can re-enable tracing by
+ echoing "1" into the file.
+
+ Note, the function and event trigger "traceoff" will also
+ set this file to zero and stop tracing. Which can also
+ be re-enabled by user space using this file.
+
+ trace:
+
+ This file holds the output of the trace in a human
+ readable format (described below). Note, tracing is temporarily
+ disabled while this file is being read (opened).
+
+ trace_pipe:
+
+ The output is the same as the "trace" file but this
+ file is meant to be streamed with live tracing.
+ Reads from this file will block until new data is
+ retrieved. Unlike the "trace" file, this file is a
+ consumer. This means reading from this file causes
+ sequential reads to display more current data. Once
+ data is read from this file, it is consumed, and
+ will not be read again with a sequential read. The
+ "trace" file is static, and if the tracer is not
+ adding more data, it will display the same
+ information every time it is read. This file will not
+ disable tracing while being read.
+
+ trace_options:
+
+ This file lets the user control the amount of data
+ that is displayed in one of the above output
+ files. Options also exist to modify how a tracer
+ or events work (stack traces, timestamps, etc).
+
+ options:
+
+ This is a directory that has a file for every available
+ trace option (also in trace_options). Options may also be set
+ or cleared by writing a "1" or "0" respectively into the
+ corresponding file with the option name.
+
+ tracing_max_latency:
+
+ Some of the tracers record the max latency.
+ For example, the maximum time that interrupts are disabled.
+ The maximum time is saved in this file. The max trace will also be
+ stored, and displayed by "trace". A new max trace will only be
+ recorded if the latency is greater than the value in this file
+ (in microseconds).
+
+ By echoing in a time into this file, no latency will be recorded
+ unless it is greater than the time in this file.
+
+ tracing_thresh:
+
+ Some latency tracers will record a trace whenever the
+ latency is greater than the number in this file.
+ Only active when the file contains a number greater than 0.
+ (in microseconds)
+
+ buffer_size_kb:
+
+ This sets or displays the number of kilobytes each CPU
+ buffer holds. By default, the trace buffers are the same size
+ for each CPU. The displayed number is the size of the
+ CPU buffer and not total size of all buffers. The
+ trace buffers are allocated in pages (blocks of memory
+ that the kernel uses for allocation, usually 4 KB in size).
+ If the last page allocated has room for more bytes
+ than requested, the rest of the page will be used,
+ making the actual allocation bigger than requested or shown.
+ ( Note, the size may not be a multiple of the page size
+ due to buffer management meta-data. )
+
+ Buffer sizes for individual CPUs may vary
+ (see "per_cpu/cpu0/buffer_size_kb" below), and if they do
+ this file will show "X".
+
+ buffer_total_size_kb:
+
+ This displays the total combined size of all the trace buffers.
+
+ free_buffer:
+
+ If a process is performing tracing, and the ring buffer should be
+ shrunk "freed" when the process is finished, even if it were to be
+ killed by a signal, this file can be used for that purpose. On close
+ of this file, the ring buffer will be resized to its minimum size.
+ Having a process that is tracing also open this file, when the process
+ exits its file descriptor for this file will be closed, and in doing so,
+ the ring buffer will be "freed".
+
+ It may also stop tracing if disable_on_free option is set.
+
+ tracing_cpumask:
+
+ This is a mask that lets the user only trace on specified CPUs.
+ The format is a hex string representing the CPUs.
+
+ set_ftrace_filter:
+
+ When dynamic ftrace is configured in (see the
+ section below "dynamic ftrace"), the code is dynamically
+ modified (code text rewrite) to disable calling of the
+ function profiler (mcount). This lets tracing be configured
+ in with practically no overhead in performance. This also
+ has a side effect of enabling or disabling specific functions
+ to be traced. Echoing names of functions into this file
+ will limit the trace to only those functions.
+
+ The functions listed in "available_filter_functions" are what
+ can be written into this file.
+
+ This interface also allows for commands to be used. See the
+ "Filter commands" section for more details.
+
+ set_ftrace_notrace:
+
+ This has an effect opposite to that of
+ set_ftrace_filter. Any function that is added here will not
+ be traced. If a function exists in both set_ftrace_filter
+ and set_ftrace_notrace, the function will _not_ be traced.
+
+ set_ftrace_pid:
+
+ Have the function tracer only trace the threads whose PID are
+ listed in this file.
+
+ If the "function-fork" option is set, then when a task whose
+ PID is listed in this file forks, the child's PID will
+ automatically be added to this file, and the child will be
+ traced by the function tracer as well. This option will also
+ cause PIDs of tasks that exit to be removed from the file.
+
+ set_event_pid:
+
+ Have the events only trace a task with a PID listed in this file.
+ Note, sched_switch and sched_wake_up will also trace events
+ listed in this file.
+
+ To have the PIDs of children of tasks with their PID in this file
+ added on fork, enable the "event-fork" option. That option will also
+ cause the PIDs of tasks to be removed from this file when the task
+ exits.
+
+ set_graph_function:
+
+ Functions listed in this file will cause the function graph
+ tracer to only trace these functions and the functions that
+ they call. (See the section "dynamic ftrace" for more details).
+
+ set_graph_notrace:
+
+ Similar to set_graph_function, but will disable function graph
+ tracing when the function is hit until it exits the function.
+ This makes it possible to ignore tracing functions that are called
+ by a specific function.
+
+ available_filter_functions:
+
+ This lists the functions that ftrace has processed and can trace.
+ These are the function names that you can pass to
+ "set_ftrace_filter" or "set_ftrace_notrace".
+ (See the section "dynamic ftrace" below for more details.)
+
+ dyn_ftrace_total_info:
+
+ This file is for debugging purposes. The number of functions that
+ have been converted to nops and are available to be traced.
+
+ enabled_functions:
+
+ This file is more for debugging ftrace, but can also be useful
+ in seeing if any function has a callback attached to it.
+ Not only does the trace infrastructure use ftrace function
+ trace utility, but other subsystems might too. This file
+ displays all functions that have a callback attached to them
+ as well as the number of callbacks that have been attached.
+ Note, a callback may also call multiple functions which will
+ not be listed in this count.
+
+ If the callback registered to be traced by a function with
+ the "save regs" attribute (thus even more overhead), a 'R'
+ will be displayed on the same line as the function that
+ is returning registers.
+
+ If the callback registered to be traced by a function with
+ the "ip modify" attribute (thus the regs->ip can be changed),
+ an 'I' will be displayed on the same line as the function that
+ can be overridden.
+
+ If the architecture supports it, it will also show what callback
+ is being directly called by the function. If the count is greater
+ than 1 it most likely will be ftrace_ops_list_func().
+
+ If the callback of the function jumps to a trampoline that is
+ specific to a the callback and not the standard trampoline,
+ its address will be printed as well as the function that the
+ trampoline calls.
+
+ function_profile_enabled:
+
+ When set it will enable all functions with either the function
+ tracer, or if configured, the function graph tracer. It will
+ keep a histogram of the number of functions that were called
+ and if the function graph tracer was configured, it will also keep
+ track of the time spent in those functions. The histogram
+ content can be displayed in the files:
+
+ trace_stats/function<cpu> ( function0, function1, etc).
+
+ trace_stats:
+
+ A directory that holds different tracing stats.
+
+ kprobe_events:
+
+ Enable dynamic trace points. See kprobetrace.txt.
+
+ kprobe_profile:
+
+ Dynamic trace points stats. See kprobetrace.txt.
+
+ max_graph_depth:
+
+ Used with the function graph tracer. This is the max depth
+ it will trace into a function. Setting this to a value of
+ one will show only the first kernel function that is called
+ from user space.
+
+ printk_formats:
+
+ This is for tools that read the raw format files. If an event in
+ the ring buffer references a string, only a pointer to the string
+ is recorded into the buffer and not the string itself. This prevents
+ tools from knowing what that string was. This file displays the string
+ and address for the string allowing tools to map the pointers to what
+ the strings were.
+
+ saved_cmdlines:
+
+ Only the pid of the task is recorded in a trace event unless
+ the event specifically saves the task comm as well. Ftrace
+ makes a cache of pid mappings to comms to try to display
+ comms for events. If a pid for a comm is not listed, then
+ "<...>" is displayed in the output.
+
+ If the option "record-cmd" is set to "0", then comms of tasks
+ will not be saved during recording. By default, it is enabled.
+
+ saved_cmdlines_size:
+
+ By default, 128 comms are saved (see "saved_cmdlines" above). To
+ increase or decrease the amount of comms that are cached, echo
+ in a the number of comms to cache, into this file.
+
+ saved_tgids:
+
+ If the option "record-tgid" is set, on each scheduling context switch
+ the Task Group ID of a task is saved in a table mapping the PID of
+ the thread to its TGID. By default, the "record-tgid" option is
+ disabled.
+
+ snapshot:
+
+ This displays the "snapshot" buffer and also lets the user
+ take a snapshot of the current running trace.
+ See the "Snapshot" section below for more details.
+
+ stack_max_size:
+
+ When the stack tracer is activated, this will display the
+ maximum stack size it has encountered.
+ See the "Stack Trace" section below.
+
+ stack_trace:
+
+ This displays the stack back trace of the largest stack
+ that was encountered when the stack tracer is activated.
+ See the "Stack Trace" section below.
+
+ stack_trace_filter:
+
+ This is similar to "set_ftrace_filter" but it limits what
+ functions the stack tracer will check.
+
+ trace_clock:
+
+ Whenever an event is recorded into the ring buffer, a
+ "timestamp" is added. This stamp comes from a specified
+ clock. By default, ftrace uses the "local" clock. This
+ clock is very fast and strictly per cpu, but on some
+ systems it may not be monotonic with respect to other
+ CPUs. In other words, the local clocks may not be in sync
+ with local clocks on other CPUs.
+
+ Usual clocks for tracing::
+
+ # cat trace_clock
+ [local] global counter x86-tsc
+
+ The clock with the square brackets around it is the one in effect.
+
+ local:
+ Default clock, but may not be in sync across CPUs
+
+ global:
+ This clock is in sync with all CPUs but may
+ be a bit slower than the local clock.
+
+ counter:
+ This is not a clock at all, but literally an atomic
+ counter. It counts up one by one, but is in sync
+ with all CPUs. This is useful when you need to
+ know exactly the order events occurred with respect to
+ each other on different CPUs.
+
+ uptime:
+ This uses the jiffies counter and the time stamp
+ is relative to the time since boot up.
+
+ perf:
+ This makes ftrace use the same clock that perf uses.
+ Eventually perf will be able to read ftrace buffers
+ and this will help out in interleaving the data.
+
+ x86-tsc:
+ Architectures may define their own clocks. For
+ example, x86 uses its own TSC cycle clock here.
+
+ ppc-tb:
+ This uses the powerpc timebase register value.
+ This is in sync across CPUs and can also be used
+ to correlate events across hypervisor/guest if
+ tb_offset is known.
+
+ mono:
+ This uses the fast monotonic clock (CLOCK_MONOTONIC)
+ which is monotonic and is subject to NTP rate adjustments.
+
+ mono_raw:
+ This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
+ which is montonic but is not subject to any rate adjustments
+ and ticks at the same rate as the hardware clocksource.
+
+ boot:
+ This is the boot clock (CLOCK_BOOTTIME) and is based on the
+ fast monotonic clock, but also accounts for time spent in
+ suspend. Since the clock access is designed for use in
+ tracing in the suspend path, some side effects are possible
+ if clock is accessed after the suspend time is accounted before
+ the fast mono clock is updated. In this case, the clock update
+ appears to happen slightly sooner than it normally would have.
+ Also on 32-bit systems, it's possible that the 64-bit boot offset
+ sees a partial update. These effects are rare and post
+ processing should be able to handle them. See comments in the
+ ktime_get_boot_fast_ns() function for more information.
+
+ To set a clock, simply echo the clock name into this file::
+
+ echo global > trace_clock
+
+ trace_marker:
+
+ This is a very useful file for synchronizing user space
+ with events happening in the kernel. Writing strings into
+ this file will be written into the ftrace buffer.
+
+ It is useful in applications to open this file at the start
+ of the application and just reference the file descriptor
+ for the file::
+
+ void trace_write(const char *fmt, ...)
+ {
+ va_list ap;
+ char buf[256];
+ int n;
+
+ if (trace_fd < 0)
+ return;
+
+ va_start(ap, fmt);
+ n = vsnprintf(buf, 256, fmt, ap);
+ va_end(ap);
+
+ write(trace_fd, buf, n);
+ }
+
+ start::
+
+ trace_fd = open("trace_marker", WR_ONLY);
+
+ trace_marker_raw:
+
+ This is similar to trace_marker above, but is meant for for binary data
+ to be written to it, where a tool can be used to parse the data
+ from trace_pipe_raw.
+
+ uprobe_events:
+
+ Add dynamic tracepoints in programs.
+ See uprobetracer.txt
+
+ uprobe_profile:
+
+ Uprobe statistics. See uprobetrace.txt
+
+ instances:
+
+ This is a way to make multiple trace buffers where different
+ events can be recorded in different buffers.
+ See "Instances" section below.
+
+ events:
+
+ This is the trace event directory. It holds event tracepoints
+ (also known as static tracepoints) that have been compiled
+ into the kernel. It shows what event tracepoints exist
+ and how they are grouped by system. There are "enable"
+ files at various levels that can enable the tracepoints
+ when a "1" is written to them.
+
+ See events.txt for more information.
+
+ set_event:
+
+ By echoing in the event into this file, will enable that event.
+
+ See events.txt for more information.
+
+ available_events:
+
+ A list of events that can be enabled in tracing.
+
+ See events.txt for more information.
+
+ hwlat_detector:
+
+ Directory for the Hardware Latency Detector.
+ See "Hardware Latency Detector" section below.
+
+ per_cpu:
+
+ This is a directory that contains the trace per_cpu information.
+
+ per_cpu/cpu0/buffer_size_kb:
+
+ The ftrace buffer is defined per_cpu. That is, there's a separate
+ buffer for each CPU to allow writes to be done atomically,
+ and free from cache bouncing. These buffers may have different
+ size buffers. This file is similar to the buffer_size_kb
+ file, but it only displays or sets the buffer size for the
+ specific CPU. (here cpu0).
+
+ per_cpu/cpu0/trace:
+
+ This is similar to the "trace" file, but it will only display
+ the data specific for the CPU. If written to, it only clears
+ the specific CPU buffer.
+
+ per_cpu/cpu0/trace_pipe
+
+ This is similar to the "trace_pipe" file, and is a consuming
+ read, but it will only display (and consume) the data specific
+ for the CPU.
+
+ per_cpu/cpu0/trace_pipe_raw
+
+ For tools that can parse the ftrace ring buffer binary format,
+ the trace_pipe_raw file can be used to extract the data
+ from the ring buffer directly. With the use of the splice()
+ system call, the buffer data can be quickly transferred to
+ a file or to the network where a server is collecting the
+ data.
+
+ Like trace_pipe, this is a consuming reader, where multiple
+ reads will always produce different data.
+
+ per_cpu/cpu0/snapshot:
+
+ This is similar to the main "snapshot" file, but will only
+ snapshot the current CPU (if supported). It only displays
+ the content of the snapshot for a given CPU, and if
+ written to, only clears this CPU buffer.
+
+ per_cpu/cpu0/snapshot_raw:
+
+ Similar to the trace_pipe_raw, but will read the binary format
+ from the snapshot buffer for the given CPU.
+
+ per_cpu/cpu0/stats:
+
+ This displays certain stats about the ring buffer:
+
+ entries:
+ The number of events that are still in the buffer.
+
+ overrun:
+ The number of lost events due to overwriting when
+ the buffer was full.
+
+ commit overrun:
+ Should always be zero.
+ This gets set if so many events happened within a nested
+ event (ring buffer is re-entrant), that it fills the
+ buffer and starts dropping events.
+
+ bytes:
+ Bytes actually read (not overwritten).
+
+ oldest event ts:
+ The oldest timestamp in the buffer
+
+ now ts:
+ The current timestamp
+
+ dropped events:
+ Events lost due to overwrite option being off.
+
+ read events:
+ The number of events read.
+
+The Tracers
+-----------
+
+Here is the list of current tracers that may be configured.
+
+ "function"
+
+ Function call tracer to trace all kernel functions.
+
+ "function_graph"
+
+ Similar to the function tracer except that the
+ function tracer probes the functions on their entry
+ whereas the function graph tracer traces on both entry
+ and exit of the functions. It then provides the ability
+ to draw a graph of function calls similar to C code
+ source.
+
+ "blk"
+
+ The block tracer. The tracer used by the blktrace user
+ application.
+
+ "hwlat"
+
+ The Hardware Latency tracer is used to detect if the hardware
+ produces any latency. See "Hardware Latency Detector" section
+ below.
+
+ "irqsoff"
+
+ Traces the areas that disable interrupts and saves
+ the trace with the longest max latency.
+ See tracing_max_latency. When a new max is recorded,
+ it replaces the old trace. It is best to view this
+ trace with the latency-format option enabled, which
+ happens automatically when the tracer is selected.
+
+ "preemptoff"
+
+ Similar to irqsoff but traces and records the amount of
+ time for which preemption is disabled.
+
+ "preemptirqsoff"
+
+ Similar to irqsoff and preemptoff, but traces and
+ records the largest time for which irqs and/or preemption
+ is disabled.
+
+ "wakeup"
+
+ Traces and records the max latency that it takes for
+ the highest priority task to get scheduled after
+ it has been woken up.
+ Traces all tasks as an average developer would expect.
+
+ "wakeup_rt"
+
+ Traces and records the max latency that it takes for just
+ RT tasks (as the current "wakeup" does). This is useful
+ for those interested in wake up timings of RT tasks.
+
+ "wakeup_dl"
+
+ Traces and records the max latency that it takes for
+ a SCHED_DEADLINE task to be woken (as the "wakeup" and
+ "wakeup_rt" does).
+
+ "mmiotrace"
+
+ A special tracer that is used to trace binary module.
+ It will trace all the calls that a module makes to the
+ hardware. Everything it writes and reads from the I/O
+ as well.
+
+ "branch"
+
+ This tracer can be configured when tracing likely/unlikely
+ calls within the kernel. It will trace when a likely and
+ unlikely branch is hit and if it was correct in its prediction
+ of being correct.
+
+ "nop"
+
+ This is the "trace nothing" tracer. To remove all
+ tracers from tracing simply echo "nop" into
+ current_tracer.
+
+
+Examples of using the tracer
+----------------------------
+
+Here are typical examples of using the tracers when controlling
+them only with the tracefs interface (without using any
+user-land utilities).
+
+Output format:
+--------------
+
+Here is an example of the output format of the file "trace"::
+
+ # tracer: function
+ #
+ # entries-in-buffer/entries-written: 140080/250280 #P:4
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath
+ bash-1977 [000] .... 17284.993653: __close_fd <-sys_close
+ bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd
+ sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
+ bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
+ bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
+ bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
+ bash-1977 [000] .... 17284.993657: filp_close <-__close_fd
+ bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close
+ sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath
+ ....
+
+A header is printed with the tracer name that is represented by
+the trace. In this case the tracer is "function". Then it shows the
+number of events in the buffer as well as the total number of entries
+that were written. The difference is the number of entries that were
+lost due to the buffer filling up (250280 - 140080 = 110200 events
+lost).
+
+The header explains the content of the events. Task name "bash", the task
+PID "1977", the CPU that it was running on "000", the latency format
+(explained below), the timestamp in <secs>.<usecs> format, the
+function name that was traced "sys_close" and the parent function that
+called this function "system_call_fastpath". The timestamp is the time
+at which the function was entered.
+
+Latency trace format
+--------------------
+
+When the latency-format option is enabled or when one of the latency
+tracers is set, the trace file gives somewhat more information to see
+why a latency happened. Here is a typical trace::
+
+ # tracer: irqsoff
+ #
+ # irqsoff latency trace v1.1.5 on 3.8.0-test+
+ # --------------------------------------------------------------------
+ # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+ # -----------------
+ # | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
+ # -----------------
+ # => started at: __lock_task_sighand
+ # => ended at: _raw_spin_unlock_irqrestore
+ #
+ #
+ # _------=> CPU#
+ # / _-----=> irqs-off
+ # | / _----=> need-resched
+ # || / _---=> hardirq/softirq
+ # ||| / _--=> preempt-depth
+ # |||| / delay
+ # cmd pid ||||| time | caller
+ # \ / ||||| \ | /
+ ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand
+ ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
+ ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
+ ps-6143 2d..1 306us : <stack trace>
+ => trace_hardirqs_on_caller
+ => trace_hardirqs_on
+ => _raw_spin_unlock_irqrestore
+ => do_task_stat
+ => proc_tgid_stat
+ => proc_single_show
+ => seq_read
+ => vfs_read
+ => sys_read
+ => system_call_fastpath
+
+
+This shows that the current tracer is "irqsoff" tracing the time
+for which interrupts were disabled. It gives the trace version (which
+never changes) and the version of the kernel upon which this was executed on
+(3.8). Then it displays the max latency in microseconds (259 us). The number
+of trace entries displayed and the total number (both are four: #4/4).
+VP, KP, SP, and HP are always zero and are reserved for later use.
+#P is the number of online CPUs (#P:4).
+
+The task is the process that was running when the latency
+occurred. (ps pid: 6143).
+
+The start and stop (the functions in which the interrupts were
+disabled and enabled respectively) that caused the latencies:
+
+ - __lock_task_sighand is where the interrupts were disabled.
+ - _raw_spin_unlock_irqrestore is where they were enabled again.
+
+The next lines after the header are the trace itself. The header
+explains which is which.
+
+ cmd: The name of the process in the trace.
+
+ pid: The PID of that process.
+
+ CPU#: The CPU which the process was running on.
+
+ irqs-off: 'd' interrupts are disabled. '.' otherwise.
+ .. caution:: If the architecture does not support a way to
+ read the irq flags variable, an 'X' will always
+ be printed here.
+
+ need-resched:
+ - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
+ - 'n' only TIF_NEED_RESCHED is set,
+ - 'p' only PREEMPT_NEED_RESCHED is set,
+ - '.' otherwise.
+
+ hardirq/softirq:
+ - 'Z' - NMI occurred inside a hardirq
+ - 'z' - NMI is running
+ - 'H' - hard irq occurred inside a softirq.
+ - 'h' - hard irq is running
+ - 's' - soft irq is running
+ - '.' - normal context.
+
+ preempt-depth: The level of preempt_disabled
+
+The above is mostly meaningful for kernel developers.
+
+ time:
+ When the latency-format option is enabled, the trace file
+ output includes a timestamp relative to the start of the
+ trace. This differs from the output when latency-format
+ is disabled, which includes an absolute timestamp.
+
+ delay:
+ This is just to help catch your eye a bit better. And
+ needs to be fixed to be only relative to the same CPU.
+ The marks are determined by the difference between this
+ current trace and the next trace.
+
+ - '$' - greater than 1 second
+ - '@' - greater than 100 milisecond
+ - '*' - greater than 10 milisecond
+ - '#' - greater than 1000 microsecond
+ - '!' - greater than 100 microsecond
+ - '+' - greater than 10 microsecond
+ - ' ' - less than or equal to 10 microsecond.
+
+ The rest is the same as the 'trace' file.
+
+ Note, the latency tracers will usually end with a back trace
+ to easily find where the latency occurred.
+
+trace_options
+-------------
+
+The trace_options file (or the options directory) is used to control
+what gets printed in the trace output, or manipulate the tracers.
+To see what is available, simply cat the file::
+
+ cat trace_options
+ print-parent
+ nosym-offset
+ nosym-addr
+ noverbose
+ noraw
+ nohex
+ nobin
+ noblock
+ trace_printk
+ annotate
+ nouserstacktrace
+ nosym-userobj
+ noprintk-msg-only
+ context-info
+ nolatency-format
+ record-cmd
+ norecord-tgid
+ overwrite
+ nodisable_on_free
+ irq-info
+ markers
+ noevent-fork
+ function-trace
+ nofunction-fork
+ nodisplay-graph
+ nostacktrace
+ nobranch
+
+To disable one of the options, echo in the option prepended with
+"no"::
+
+ echo noprint-parent > trace_options
+
+To enable an option, leave off the "no"::
+
+ echo sym-offset > trace_options
+
+Here are the available options:
+
+ print-parent
+ On function traces, display the calling (parent)
+ function as well as the function being traced.
+ ::
+
+ print-parent:
+ bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul
+
+ noprint-parent:
+ bash-4000 [01] 1477.606694: simple_strtoul
+
+
+ sym-offset
+ Display not only the function name, but also the
+ offset in the function. For example, instead of
+ seeing just "ktime_get", you will see
+ "ktime_get+0xb/0x20".
+ ::
+
+ sym-offset:
+ bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
+
+ sym-addr
+ This will also display the function address as well
+ as the function name.
+ ::
+
+ sym-addr:
+ bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
+
+ verbose
+ This deals with the trace file when the
+ latency-format option is enabled.
+ ::
+
+ bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
+ (+0.000ms): simple_strtoul (kstrtoul)
+
+ raw
+ This will display raw numbers. This option is best for
+ use with user applications that can translate the raw
+ numbers better than having it done in the kernel.
+
+ hex
+ Similar to raw, but the numbers will be in a hexadecimal format.
+
+ bin
+ This will print out the formats in raw binary.
+
+ block
+ When set, reading trace_pipe will not block when polled.
+
+ trace_printk
+ Can disable trace_printk() from writing into the buffer.
+
+ annotate
+ It is sometimes confusing when the CPU buffers are full
+ and one CPU buffer had a lot of events recently, thus
+ a shorter time frame, were another CPU may have only had
+ a few events, which lets it have older events. When
+ the trace is reported, it shows the oldest events first,
+ and it may look like only one CPU ran (the one with the
+ oldest events). When the annotate option is set, it will
+ display when a new CPU buffer started::
+
+ <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
+ <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
+ <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
+ ##### CPU 2 buffer started ####
+ <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
+ <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
+ <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
+
+ userstacktrace
+ This option changes the trace. It records a
+ stacktrace of the current user space thread after
+ each trace event.
+
+ sym-userobj
+ when user stacktrace are enabled, look up which
+ object the address belongs to, and print a
+ relative address. This is especially useful when
+ ASLR is on, otherwise you don't get a chance to
+ resolve the address to object/file/line after
+ the app is no longer running
+
+ The lookup is performed when you read
+ trace,trace_pipe. Example::
+
+ a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
+ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
+
+
+ printk-msg-only
+ When set, trace_printk()s will only show the format
+ and not their parameters (if trace_bprintk() or
+ trace_bputs() was used to save the trace_printk()).
+
+ co