trace doc: convert trace/ftrace.txt to rst format

This converts the plain text documentation to reStructuredText format and adds it to the Sphinx TOC tree. No essential content change.

Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
author: Changbin Du <changbin.du@intel.com> 2018-02-17 13:39:38 +0800
committer: Jonathan Corbet <corbet@lwn.net> 2018-03-07 10:23:10 -0700
commit: 1f198e22bc3a0af747a7cf7b444de49ff76b6869 (patch)
tree: 4bd4e057bfd0c1df535f680cf8cf1afc5ec519d2 /Documentation/trace
parent: 8fa4e720e8d919271cdf0da3c0856333246398a4 (diff)
3 files changed, 3333 insertions, 3220 deletions
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
new file mode 100644
index 000000000000..636aa9bf5674
--- /dev/null
+++ b/Documentation/trace/ftrace.rst
@@ -0,0 +1,3332 @@
+========================
+ftrace - Function Tracer
+========================
+
+Copyright 2008 Red Hat Inc.
+
+:Author:   Steven Rostedt <srostedt@redhat.com>
+:License:  The GNU Free Documentation License, Version 1.2
+          (dual licensed under the GPL v2)
+:Original Reviewers:  Elias Oltmanns, Randy Dunlap, Andrew Morton,
+		      John Kacur, and David Teigland.
+
+- Written for: 2.6.28-rc2
+- Updated for: 3.10
+- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
+- Converted to rst format - Changbin Du <changbin.du@intel.com>
+
+Introduction
+------------
+
+Ftrace is an internal tracer designed to help out developers and
+designers of systems to find what is going on inside the kernel.
+It can be used for debugging or analyzing latencies and
+performance issues that take place outside of user-space.
+
+Although ftrace is typically considered the function tracer, it
+is really a framework of several assorted tracing utilities.
+There's latency tracing to examine what occurs between interrupts
+disabled and enabled, as well as for preemption and from the time
+a task is woken to the time the task is actually scheduled in.
+
+One of the most common uses of ftrace is event tracing.
+Throughout the kernel are hundreds of static event points that
+can be enabled via the tracefs file system to see what is
+going on in certain parts of the kernel.
+
+See events.txt for more information.
+
+
+Implementation Details
+----------------------
+
+See :doc:`ftrace-design` for details for arch porters and such.
+
+
+The File System
+---------------
+
+Ftrace uses the tracefs file system to hold the control files as
+well as the files to display output.
+
+When tracefs is configured into the kernel (which selecting any ftrace
+option will do) the directory /sys/kernel/tracing will be created. To mount
+this directory, you can add to your /etc/fstab file::
+
+ tracefs       /sys/kernel/tracing       tracefs defaults        0       0
+
+Or you can mount it at run time with::
+
+ mount -t tracefs nodev /sys/kernel/tracing
+
+For quicker access to that directory you may want to make a soft link to
+it::
+
+ ln -s /sys/kernel/tracing /tracing
+
+.. attention::
+
+  Before 4.1, all ftrace tracing control files were within the debugfs
+  file system, which is typically located at /sys/kernel/debug/tracing.
+  For backward compatibility, when mounting the debugfs file system,
+  the tracefs file system will be automatically mounted at:
+
+  /sys/kernel/debug/tracing
+
+  All files located in the tracefs file system will be located in that
+  debugfs file system directory as well.
+
+.. attention::
+
+  Any selected ftrace option will also create the tracefs file system.
+  The rest of the document will assume that you are in the ftrace directory
+  (cd /sys/kernel/tracing) and will only concentrate on the files within that
+  directory and not distract from the content with the extended
+  "/sys/kernel/tracing" path name.
+
+That's it! (assuming that you have ftrace configured into your kernel)
+
+After mounting tracefs you will have access to the control and output files
+of ftrace. Here is a list of some of the key files:
+
+
+ Note: all time values are in microseconds.
+
+  current_tracer:
+
+	This is used to set or display the current tracer
+	that is configured.
+
+  available_tracers:
+
+	This holds the different types of tracers that
+	have been compiled into the kernel. The
+	tracers listed here can be configured by
+	echoing their name into current_tracer.
+
+  tracing_on:
+
+	This sets or displays whether writing to the trace
+	ring buffer is enabled. Echo 0 into this file to disable
+	the tracer or 1 to enable it. Note, this only disables
+	writing to the ring buffer, the tracing overhead may
+	still be occurring.
+
+	The kernel function tracing_off() can be used within the
+	kernel to disable writing to the ring buffer, which will
+	set this file to "0". User space can re-enable tracing by
+	echoing "1" into the file.
+
+	Note, the function and event trigger "traceoff" will also
+	set this file to zero and stop tracing. Tracing can then
+	be re-enabled by user space using this file.
+
+  trace:
+
+	This file holds the output of the trace in a human
+	readable format (described below). Note, tracing is temporarily
+	disabled while this file is being read (opened).
+
+  trace_pipe:
+
+	The output is the same as the "trace" file but this
+	file is meant to be streamed with live tracing.
+	Reads from this file will block until new data is
+	retrieved.  Unlike the "trace" file, this file is a
+	consumer. This means reading from this file causes
+	sequential reads to display more current data. Once
+	data is read from this file, it is consumed, and
+	will not be read again with a sequential read. The
+	"trace" file is static, and if the tracer is not
+	adding more data, it will display the same
+	information every time it is read. This file will not
+	disable tracing while being read.
+
+  trace_options:
+
+	This file lets the user control the amount of data
+	that is displayed in one of the above output
+	files. Options also exist to modify how a tracer
+	or events work (stack traces, timestamps, etc).
+
+  options:
+
+	This is a directory that has a file for every available
+	trace option (also in trace_options). Options may also be set
+	or cleared by writing a "1" or "0" respectively into the
+	corresponding file with the option name.
+
+  tracing_max_latency:
+
+	Some of the tracers record the max latency.
+	For example, the maximum time that interrupts are disabled.
+	The maximum time is saved in this file. The max trace will also be
+	stored,	and displayed by "trace". A new max trace will only be
+	recorded if the latency is greater than the value in this file
+	(in microseconds).
+
+	By echoing a time into this file, no latency will be recorded
+	unless it is greater than the time in this file.
+
+  tracing_thresh:
+
+	Some latency tracers will record a trace whenever the
+	latency is greater than the number in this file.
+	Only active when the file contains a number greater than 0.
+	(in microseconds)
+
+  buffer_size_kb:
+
+	This sets or displays the number of kilobytes each CPU
+	buffer holds. By default, the trace buffers are the same size
+	for each CPU. The displayed number is the size of the
+	CPU buffer and not total size of all buffers. The
+	trace buffers are allocated in pages (blocks of memory
+	that the kernel uses for allocation, usually 4 KB in size).
+	If the last page allocated has room for more bytes
+	than requested, the rest of the page will be used,
+	making the actual allocation bigger than requested or shown.
+	( Note, the size may not be a multiple of the page size
+	due to buffer management meta-data. )
+
+	Buffer sizes for individual CPUs may vary
+	(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
+	this file will show "X".
+
+  buffer_total_size_kb:
+
+	This displays the total combined size of all the trace buffers.
+
+  free_buffer:
+
+	If a process is performing tracing, and the ring buffer should be
+	shrunk ("freed") when the process is finished, even if it were to be
+	killed by a signal, this file can be used for that purpose. On close
+	of this file, the ring buffer will be resized to its minimum size.
+	If a tracing process also keeps this file open, then when the process
+	exits, its file descriptor for this file will be closed, and in doing so,
+	the ring buffer will be "freed".
+
+	It may also stop tracing if disable_on_free option is set.
+
+  tracing_cpumask:
+
+	This is a mask that lets the user only trace on specified CPUs.
+	The format is a hex string representing the CPUs.
+
+  set_ftrace_filter:
+
+	When dynamic ftrace is configured in (see the
+	section below "dynamic ftrace"), the code is dynamically
+	modified (code text rewrite) to disable calling of the
+	function profiler (mcount). This lets tracing be configured
+	in with practically no overhead in performance.  This also
+	has a side effect of enabling or disabling specific functions
+	to be traced. Echoing names of functions into this file
+	will limit the trace to only those functions.
+
+	The functions listed in "available_filter_functions" are what
+	can be written into this file.
+
+	This interface also allows for commands to be used. See the
+	"Filter commands" section for more details.
+
+  set_ftrace_notrace:
+
+	This has an effect opposite to that of
+	set_ftrace_filter. Any function that is added here will not
+	be traced. If a function exists in both set_ftrace_filter
+	and set_ftrace_notrace, the function will _not_ be traced.
+
+  set_ftrace_pid:
+
+	Have the function tracer only trace the threads whose PIDs are
+	listed in this file.
+
+	If the "function-fork" option is set, then when a task whose
+	PID is listed in this file forks, the child's PID will
+	automatically be added to this file, and the child will be
+	traced by the function tracer as well. This option will also
+	cause PIDs of tasks that exit to be removed from the file.
+
+  set_event_pid:
+
+	Have the events only trace a task with a PID listed in this file.
+	Note, sched_switch and sched_wake_up will also trace events
+	listed in this file.
+
+	To have the PIDs of children of tasks with their PID in this file
+	added on fork, enable the "event-fork" option. That option will also
+	cause the PIDs of tasks to be removed from this file when the task
+	exits.
+
+  set_graph_function:
+
+	Functions listed in this file will cause the function graph
+	tracer to only trace these functions and the functions that
+	they call. (See the section "dynamic ftrace" for more details).
+
+  set_graph_notrace:
+
+	Similar to set_graph_function, but will disable function graph
+	tracing when the function is hit until it exits the function.
+	This makes it possible to ignore tracing functions that are called
+	by a specific function.
+
+  available_filter_functions:
+
+	This lists the functions that ftrace has processed and can trace.
+	These are the function names that you can pass to
+	"set_ftrace_filter" or "set_ftrace_notrace".
+	(See the section "dynamic ftrace" below for more details.)
+
+  dyn_ftrace_total_info:
+
+	This file is for debugging purposes. It shows the number of functions
+	that have been converted to nops and are available to be traced.
+
+  enabled_functions:
+
+	This file is more for debugging ftrace, but can also be useful
+	in seeing if any function has a callback attached to it.
+	Not only does the trace infrastructure use ftrace function
+	trace utility, but other subsystems might too. This file
+	displays all functions that have a callback attached to them
+	as well as the number of callbacks that have been attached.
+	Note, a callback may also call multiple functions which will
+	not be listed in this count.
+
+	If the callback registered to be traced by a function with
+	the "save regs" attribute (thus even more overhead), a 'R'
+	will be displayed on the same line as the function that
+	is returning registers.
+
+	If the callback registered to be traced by a function with
+	the "ip modify" attribute (thus the regs->ip can be changed),
+	an 'I' will be displayed on the same line as the function that
+	can be overridden.
+
+	If the architecture supports it, it will also show what callback
+	is being directly called by the function. If the count is greater
+	than 1 it most likely will be ftrace_ops_list_func().
+
+	If the callback of the function jumps to a trampoline that is
+	specific to the callback and not the standard trampoline,
+	its address will be printed as well as the function that the
+	trampoline calls.
+
+  function_profile_enabled:
+
+	When set it will enable all functions with either the function
+	tracer, or if configured, the function graph tracer. It will
+	keep a histogram of the number of functions that were called
+	and if the function graph tracer was configured, it will also keep
+	track of the time spent in those functions. The histogram
+	content can be displayed in the files:
+
+	trace_stats/function<cpu> ( function0, function1, etc).
+
+  trace_stats:
+
+	A directory that holds different tracing stats.
+
+  kprobe_events:
+
+	Enable dynamic trace points. See kprobetrace.txt.
+
+  kprobe_profile:
+
+	Dynamic trace points stats. See kprobetrace.txt.
+
+  max_graph_depth:
+
+	Used with the function graph tracer. This is the max depth
+	it will trace into a function. Setting this to a value of
+	one will show only the first kernel function that is called
+	from user space.
+
+  printk_formats:
+
+	This is for tools that read the raw format files. If an event in
+	the ring buffer references a string, only a pointer to the string
+	is recorded into the buffer and not the string itself. This prevents
+	tools from knowing what that string was. This file displays the string
+	and address for	the string allowing tools to map the pointers to what
+	the strings were.
+
+  saved_cmdlines:
+
+	Only the pid of the task is recorded in a trace event unless
+	the event specifically saves the task comm as well. Ftrace
+	makes a cache of pid mappings to comms to try to display
+	comms for events. If a pid for a comm is not listed, then
+	"<...>" is displayed in the output.
+
+	If the option "record-cmd" is set to "0", then comms of tasks
+	will not be saved during recording. By default, it is enabled.
+
+  saved_cmdlines_size:
+
+	By default, 128 comms are saved (see "saved_cmdlines" above). To
+	increase or decrease the number of comms that are cached, echo
+	the number of comms to cache into this file.
+
+  saved_tgids:
+
+	If the option "record-tgid" is set, on each scheduling context switch
+	the Thread Group ID of a task is saved in a table mapping the PID of
+	the thread to its TGID. By default, the "record-tgid" option is
+	disabled.
+
+  snapshot:
+
+	This displays the "snapshot" buffer and also lets the user
+	take a snapshot of the current running trace.
+	See the "Snapshot" section below for more details.
+
+  stack_max_size:
+
+	When the stack tracer is activated, this will display the
+	maximum stack size it has encountered.
+	See the "Stack Trace" section below.
+
+  stack_trace:
+
+	This displays the stack back trace of the largest stack
+	that was encountered when the stack tracer is activated.
+	See the "Stack Trace" section below.
+
+  stack_trace_filter:
+
+	This is similar to "set_ftrace_filter" but it limits what
+	functions the stack tracer will check.
+
+  trace_clock:
+
+	Whenever an event is recorded into the ring buffer, a
+	"timestamp" is added. This stamp comes from a specified
+	clock. By default, ftrace uses the "local" clock. This
+	clock is very fast and strictly per cpu, but on some
+	systems it may not be monotonic with respect to other
+	CPUs. In other words, the local clocks may not be in sync
+	with local clocks on other CPUs.
+
+	Usual clocks for tracing::
+
+	  # cat trace_clock
+	  [local] global counter x86-tsc
+
+	The clock with the square brackets around it is the one in effect.
+
+	local:
+		Default clock, but may not be in sync across CPUs
+
+	global:
+		This clock is in sync with all CPUs but may
+		be a bit slower than the local clock.
+
+	counter:
+		This is not a clock at all, but literally an atomic
+		counter. It counts up one by one, but is in sync
+		with all CPUs. This is useful when you need to
+		know exactly the order events occurred with respect to
+		each other on different CPUs.
+
+	uptime:
+		This uses the jiffies counter and the time stamp
+		is relative to the time since boot up.
+
+	perf:
+		This makes ftrace use the same clock that perf uses.
+		Eventually perf will be able to read ftrace buffers
+		and this will help out in interleaving the data.
+
+	x86-tsc:
+		Architectures may define their own clocks. For
+		example, x86 uses its own TSC cycle clock here.
+
+	ppc-tb:
+		This uses the powerpc timebase register value.
+		This is in sync across CPUs and can also be used
+		to correlate events across hypervisor/guest if
+		tb_offset is known.
+
+	mono:
+		This uses the fast monotonic clock (CLOCK_MONOTONIC)
+		which is monotonic and is subject to NTP rate adjustments.
+
+	mono_raw:
+		This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
+		which is monotonic but is not subject to any rate adjustments
+		and ticks at the same rate as the hardware clocksource.
+
+	boot:
+		This is the boot clock (CLOCK_BOOTTIME) and is based on the
+		fast monotonic clock, but also accounts for time spent in
+		suspend. Since the clock access is designed for use in
+		tracing in the suspend path, some side effects are possible
+		if clock is accessed after the suspend time is accounted before
+		the fast mono clock is updated. In this case, the clock update
+		appears to happen slightly sooner than it normally would have.
+		Also on 32-bit systems, it's possible that the 64-bit boot offset
+		sees a partial update. These effects are rare and post
+		processing should be able to handle them. See comments in the
+		ktime_get_boot_fast_ns() function for more information.
+
+	To set a clock, simply echo the clock name into this file::
+
+	  echo global > trace_clock
+
+  trace_marker:
+
+	This is a very useful file for synchronizing user space
+	with events happening in the kernel. Writing strings into
+	this file will be written into the ftrace buffer.
+
+	It is useful in applications to open this file at the start
+	of the application and just reference the file descriptor
+	for the file::
+
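+		#include <fcntl.h>
+		#include <stdarg.h>
+		#include <stdio.h>
+		#include <unistd.h>
+
+		static int trace_fd = -1;
+
+		void trace_write(const char *fmt, ...)
+		{
+			va_list ap;
+			char buf[256];
+			int n;
+
+			if (trace_fd < 0)
+				return;
+
+			va_start(ap, fmt);
+			n = vsnprintf(buf, sizeof(buf), fmt, ap);
+			va_end(ap);
+
+			/* vsnprintf() returns the untruncated length */
+			if (n < 0)
+				return;
+			if (n >= (int)sizeof(buf))
+				n = sizeof(buf) - 1;
+
+			write(trace_fd, buf, n);
+		}
+
+	start::
+
+		trace_fd = open("trace_marker", O_WRONLY);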
+
+  trace_marker_raw:
+
+	This is similar to trace_marker above, but is meant for binary data
+	to be written to it, where a tool can be used to parse the data
+	from trace_pipe_raw.
+
+  uprobe_events:
+
+	Add dynamic tracepoints in programs.
+	See uprobetracer.txt
+
+  uprobe_profile:
+
+	Uprobe statistics. See uprobetracer.txt
+
+  instances:
+
+	This is a way to make multiple trace buffers where different
+	events can be recorded in different buffers.
+	See "Instances" section below.
+
+  events:
+
+	This is the trace event directory. It holds event tracepoints
+	(also known as static tracepoints) that have been compiled
+	into the kernel. It shows what event tracepoints exist
+	and how they are grouped by system. There are "enable"
+	files at various levels that can enable the tracepoints
+	when a "1" is written to them.
+
+	See events.txt for more information.
+
+  set_event:
+
+	Echoing an event into this file will enable that event.
+
+	See events.txt for more information.
+
+  available_events:
+
+	A list of events that can be enabled in tracing.
+
+	See events.txt for more information.
+
+  hwlat_detector:
+
+	Directory for the Hardware Latency Detector.
+	See "Hardware Latency Detector" section below.
+
+  per_cpu:
+
+	This is a directory that contains the trace per_cpu information.
+
+  per_cpu/cpu0/buffer_size_kb:
+
+	The ftrace buffer is defined per_cpu. That is, there's a separate
+	buffer for each CPU to allow writes to be done atomically,
+	and free from cache bouncing. These buffers may have different
+	sizes. This file is similar to the buffer_size_kb
+	file, but it only displays or sets the buffer size for the
+	specific CPU (here cpu0).
+
+  per_cpu/cpu0/trace:
+
+	This is similar to the "trace" file, but it will only display
+	the data specific for the CPU. If written to, it only clears
+	the specific CPU buffer.
+
+  per_cpu/cpu0/trace_pipe:
+
+	This is similar to the "trace_pipe" file, and is a consuming
+	read, but it will only display (and consume) the data specific
+	for the CPU.
+
+  per_cpu/cpu0/trace_pipe_raw:
+
+	For tools that can parse the ftrace ring buffer binary format,
+	the trace_pipe_raw file can be used to extract the data
+	from the ring buffer directly. With the use of the splice()
+	system call, the buffer data can be quickly transferred to
+	a file or to the network where a server is collecting the
+	data.
+
+	Like trace_pipe, this is a consuming reader, where multiple
+	reads will always produce different data.
+
+  per_cpu/cpu0/snapshot:
+
+	This is similar to the main "snapshot" file, but will only
+	snapshot the current CPU (if supported). It only displays
+	the content of the snapshot for a given CPU, and if
+	written to, only clears this CPU buffer.
+
+  per_cpu/cpu0/snapshot_raw:
+
+	Similar to the trace_pipe_raw, but will read the binary format
+	from the snapshot buffer for the given CPU.
+
+  per_cpu/cpu0/stats:
+
+	This displays certain stats about the ring buffer:
+
+	entries:
+		The number of events that are still in the buffer.
+
+	overrun:
+		The number of lost events due to overwriting when
+		the buffer was full.
+
+	commit overrun:
+		Should always be zero.
+		This gets set if so many events happened within a nested
+		event (ring buffer is re-entrant), that it fills the
+		buffer and starts dropping events.
+
+	bytes:
+		Bytes actually read (not overwritten).
+
+	oldest event ts:
+		The oldest timestamp in the buffer
+
+	now ts:
+		The current timestamp
+
+	dropped events:
+		Events lost due to overwrite option being off.
+
+	read events:
+		The number of events read.
+
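The tracing_cpumask entry above describes the mask as a hex string representing the CPUs. A minimal sketch of that encoding, assuming bit N of the mask stands for CPU N and that a plain hex string without any digit grouping is used; the helper names `cpus_to_mask` and `mask_to_cpus` are hypothetical and not part of ftrace:

```python
def cpus_to_mask(cpus):
    """Return a hex string with bit N set for every CPU number N in `cpus`."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Return the sorted CPU numbers whose bits are set in a hex mask string."""
    # Assumption: the string is plain hex, possibly with surrounding whitespace.
    mask = int(mask_hex.strip(), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]
```

For example, `cpus_to_mask([0, 2])` yields the string that would be echoed into tracing_cpumask to trace only on CPUs 0 and 2.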
+The Tracers
+-----------
+
+Here is the list of current tracers that may be configured.
+
+  "function"
+
+	Function call tracer to trace all kernel functions.
+
+  "function_graph"
+
+	Similar to the function tracer except that the
+	function tracer probes the functions on their entry
+	whereas the function graph tracer traces on both entry
+	and exit of the functions. It then provides the ability
+	to draw a graph of function calls similar to C code
+	source.
+
+  "blk"
+
+	The block tracer. The tracer used by the blktrace user
+	application.
+
+  "hwlat"
+
+	The Hardware Latency tracer is used to detect if the hardware
+	produces any latency. See "Hardware Latency Detector" section
+	below.
+
+  "irqsoff"
+
+	Traces the areas that disable interrupts and saves
+	the trace with the longest max latency.
+	See tracing_max_latency. When a new max is recorded,
+	it replaces the old trace. It is best to view this
+	trace with the latency-format option enabled, which
+	happens automatically when the tracer is selected.
+
+  "preemptoff"
+
+	Similar to irqsoff but traces and records the amount of
+	time for which preemption is disabled.
+
+  "preemptirqsoff"
+
+	Similar to irqsoff and preemptoff, but traces and
+	records the largest time for which irqs and/or preemption
+	is disabled.
+
+  "wakeup"
+
+	Traces and records the max latency that it takes for
+	the highest priority task to get scheduled after
+	it has been woken up.
+	Traces all tasks as an average developer would expect.
+
+  "wakeup_rt"
+
+	Traces and records the max latency that it takes for just
+	RT tasks (as the current "wakeup" does). This is useful
+	for those interested in wake up timings of RT tasks.
+
+  "wakeup_dl"
+
+	Traces and records the max latency that it takes for
+	a SCHED_DEADLINE task to be woken (as the "wakeup" and
+	"wakeup_rt" do).
+
+  "mmiotrace"
+
+	A special tracer that is used to trace binary modules.
+	It will trace all the calls that a module makes to the
+	hardware, as well as everything it writes to and reads
+	from the I/O.
+
+  "branch"
+
+	This tracer can be configured when tracing likely/unlikely
+	calls within the kernel. It will trace when a likely or
+	unlikely branch is hit and whether its prediction
+	was correct.
+
+  "nop"
+
+	This is the "trace nothing" tracer. To remove all
+	tracers from tracing simply echo "nop" into
+	current_tracer.
+
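As described above, a tracer is selected by echoing its name into current_tracer, and available_tracers lists the names that are valid. A small sketch of that check-then-write sequence follows. The function name `select_tracer` and its `tracing_dir` parameter are hypothetical; on a real system the directory would be the mounted tracefs directory and writing to it normally requires root.

```python
import os


def select_tracer(tracing_dir, name):
    """Write `name` into current_tracer if available_tracers lists it."""
    with open(os.path.join(tracing_dir, "available_tracers")) as f:
        available = f.read().split()
    if name not in available:
        raise ValueError(f"{name!r} is not one of {available}")
    # Equivalent to: echo <name> > current_tracer
    with open(os.path.join(tracing_dir, "current_tracer"), "w") as f:
        f.write(name)
    return available
```

Calling `select_tracer("/sys/kernel/tracing", "nop")` would correspond to the "trace nothing" case described for the "nop" tracer.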
+
+Examples of using the tracer
+----------------------------
+
+Here are typical examples of using the tracers when controlling
+them only with the tracefs interface (without using any
+user-land utilities).
+
+Output format:
+--------------
+
+Here is an example of the output format of the file "trace"::
+
+  # tracer: function
+  #
+  # entries-in-buffer/entries-written: 140080/250280   #P:4
+  #
+  #                              _-----=> irqs-off
+  #                             / _----=> need-resched
+  #                            | / _---=> hardirq/softirq
+  #                            || / _--=> preempt-depth
+  #                            ||| /     delay
+  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+  #              | |       |   ||||       |         |
+              bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
+              bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
+              bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
+              sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
+              bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
+              bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
+              bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
+              bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
+              bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
+              sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
+              ....
+
+A header is printed with the tracer name that is represented by
+the trace. In this case the tracer is "function". Then it shows the
+number of events in the buffer as well as the total number of entries
+that were written. The difference is the number of entries that were
+lost due to the buffer filling up (250280 - 140080 = 110200 events
+lost).
+
+The header explains the content of the events. Task name "bash", the task
+PID "1977", the CPU that it was running on "000", the latency format
+(explained below), the timestamp in <secs>.<usecs> format, the
+function name that was traced "sys_close" and the parent function that
+called this function "system_call_fastpath". The timestamp is the time
+at which the function was entered.
+
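The header and event layout described above can be picked apart with two regular expressions. This is only a sketch based on the sample output shown in this section, not a complete parser for every ftrace format; the names `lost_entries` and `parse_line` are hypothetical.

```python
import re

# Matches the "entries-in-buffer/entries-written" header line.
HEADER_RE = re.compile(
    r"entries-in-buffer/entries-written:\s*(\d+)/(\d+)\s+#P:(\d+)")

# Matches an event line: TASK-PID [CPU] flags secs.usecs: function <-parent
LINE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S{4})\s+"
    r"(?P<secs>\d+)\.(?P<usecs>\d+):\s+(?P<func>\S+)"
    r"(?:\s+<-(?P<parent>\S+))?")


def lost_entries(header_line):
    """Return entries written minus entries still in the buffer."""
    in_buffer, written, _cpus = map(int, HEADER_RE.search(header_line).groups())
    return written - in_buffer


def parse_line(line):
    """Split one function-trace line into its fields, or return None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    event = m.groupdict()
    for key in ("pid", "cpu", "secs", "usecs"):
        event[key] = int(event[key])
    return event
```

Applied to the sample header above, `lost_entries` gives the same 110200 lost events that the text computes by hand.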
+Latency trace format
+--------------------
+
+When the latency-format option is enabled or when one of the latency
+tracers is set, the trace file gives somewhat more information to see
+why a latency happened. Here is a typical trace::
+
+  # tracer: irqsoff
+  #
+  # irqsoff latency trace v1.1.5 on 3.8.0-test+
+  # --------------------------------------------------------------------
+  # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+  #    -----------------
+  #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
+  #    -----------------
+  #  => started at: __lock_task_sighand
+  #  => ended at:   _raw_spin_unlock_irqrestore
+  #
+  #
+  #                  _------=> CPU#            
+  #                 / _-----=> irqs-off        
+  #                | / _----=> need-resched    
+  #                || / _---=> hardirq/softirq 
+  #                ||| / _--=> preempt-depth   
+  #                |||| /     delay             
+  #  cmd     pid   ||||| time  |   caller      
+  #     \   /      |||||  \    |   /           
+        ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
+        ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
+        ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
+        ps-6143    2d..1  306us : <stack trace>
+   => trace_hardirqs_on_caller
+   => trace_hardirqs_on
+   => _raw_spin_unlock_irqrestore
+   => do_task_stat
+   => proc_tgid_stat
+   => proc_single_show
+   => seq_read
+   => vfs_read
+   => sys_read
+   => system_call_fastpath
+
+
+This shows that the current tracer is "irqsoff" tracing the time
+for which interrupts were disabled. It gives the trace version (which
+never changes) and the version of the kernel on which this was executed
+(3.8). Then it displays the max latency in microseconds (259 us), followed
+by the number of trace entries displayed and the total number (both are
+four: #4/4). VP, KP, SP, and HP are always zero and are reserved for later
+use. #P is the number of online CPUs (#P:4).
+
+The task is the process that was running when the latency
+occurred. (ps pid: 6143).
+
+The start and stop (the functions in which the interrupts were
+disabled and enabled respectively) that caused the latencies:
+
+  - __lock_task_sighand is where the interrupts were disabled.
+  - _raw_spin_unlock_irqrestore is where they were enabled again.
+
+The next lines after the header are the trace itself. The header
+explains which is which.
+
+  cmd: The name of the process in the trace.
+
+  pid: The PID of that process.
+
+  CPU#: The CPU which the process was running on.
+
+  irqs-off: 'd' interrupts are disabled. '.' otherwise.
+	.. caution:: If the architecture does not support a way to
+		read the irq flags variable, an 'X' will always
+		be printed here.
+
+  need-resched:
+	- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
+	- 'n' only TIF_NEED_RESCHED is set,
+	- 'p' only PREEMPT_NEED_RESCHED is set,
+	- '.' otherwise.
+
+  hardirq/softirq:
+	- 'Z' - NMI occurred inside a hardirq
+	- 'z' - NMI is running
+	- 'H' - hard irq occurred inside a softirq.
+	- 'h' - hard irq is running
+	- 's' - soft irq is running
+	- '.' - normal context.
+
+  preempt-depth: The level of preempt_disabled
+
+The above is mostly meaningful for kernel developers.
+
+  time:
+	When the latency-format option is enabled, the trace file
+	output includes a timestamp relative to the start of the
+	trace. This differs from the output when latency-format
+	is disabled, which includes an absolute timestamp.
+
+  delay:
+	This is just to help catch your eye a bit better. (It still
+	needs to be fixed to be only relative to the same CPU.)
+	The marks are determined by the difference between this
+	current trace and the next trace.
+
+	  - '$' - greater than 1 second
+	  - '@' - greater than 100 milliseconds
+	  - '*' - greater than 10 milliseconds
+	  - '#' - greater than 1000 microseconds
+	  - '!' - greater than 100 microseconds
+	  - '+' - greater than 10 microseconds
+	  - ' ' - less than or equal to 10 microseconds.
+
+  The rest is the same as the 'trace' file.
+
+  Note, the latency tracers will usually end with a back trace
+  to easily find where the latency occurred.
+
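The flag characters and delay marks listed above are simple lookup tables, which the following sketch spells out. It assumes the field layout of the sample trace (a CPU number followed by four flag characters, as in "2d..1") and, based on that sample, that a "." in the preempt-depth position means a depth of zero. The names `decode_flags` and `delay_mark` are hypothetical.

```python
IRQS_OFF = {
    "d": "interrupts disabled",
    ".": "interrupts enabled",
    "X": "irq flags not readable on this architecture",
}
NEED_RESCHED = {
    "N": "TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED set",
    "n": "only TIF_NEED_RESCHED set",
    "p": "only PREEMPT_NEED_RESCHED set",
    ".": "neither set",
}
CONTEXT = {
    "Z": "NMI occurred inside a hardirq",
    "z": "NMI is running",
    "H": "hard irq occurred inside a softirq",
    "h": "hard irq is running",
    "s": "soft irq is running",
    ".": "normal context",
}
# (threshold in microseconds, mark), checked from largest to smallest.
DELAY_MARKS = [
    (1_000_000, "$"), (100_000, "@"), (10_000, "*"),
    (1_000, "#"), (100, "!"), (10, "+"),
]


def decode_flags(field):
    """Decode a latency-format field such as '2d..1'."""
    cpu, flags = field[:-4], field[-4:]
    irq, resched, context, depth = flags
    return {
        "cpu": int(cpu),
        "irqs_off": IRQS_OFF[irq],
        "need_resched": NEED_RESCHED[resched],
        "context": CONTEXT[context],
        # Assumption: "." stands for zero, anything else is a hex digit.
        "preempt_depth": 0 if depth == "." else int(depth, 16),
    }


def delay_mark(delta_us):
    """Return the mark for a gap of `delta_us` microseconds to the next entry."""
    for threshold, mark in DELAY_MARKS:
        if delta_us > threshold:
            return mark
    return " "
```

The marks a real kernel prints may not always match this table exactly, since the text notes that the delay calculation is not yet restricted to entries from the same CPU.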
+trace_options
+-------------
+
+The trace_options file (or the options directory) is used to control
+what gets printed in the trace output, or manipulate the tracers.
+To see what is available, simply cat the file::
+
+  cat trace_options
+	print-parent
+	nosym-offset
+	nosym-addr
+	noverbose
+	noraw
+	nohex
+	nobin
+	noblock
+	trace_printk
+	annotate
+	nouserstacktrace
+	nosym-userobj
+	noprintk-msg-only
+	context-info
+	nolatency-format
+	record-cmd
+	norecord-tgid
+	overwrite
+	nodisable_on_free
+	irq-info
+	markers
+	noevent-fork
+	function-trace
+	nofunction-fork
+	nodisplay-graph
+	nostacktrace
+	nobranch
+
+To disable one of the options, echo in the option prepended with
+"no"::
+
+  echo noprint-parent > trace_options
+
+To enable an option, leave off the "no"::
+
+  echo sym-offset > trace_options
+
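The "no" prefix convention described above can be captured in two small helpers: one that turns the output of cat trace_options into a name-to-state mapping, and one that builds the string to echo back. This is a sketch that assumes no option name itself begins with "no", which holds for the listing shown above; the function names are hypothetical.

```python
def parse_trace_options(text):
    """Map each option name to True (enabled) or False ('no' prefix)."""
    states = {}
    for word in text.split():
        if word.startswith("no"):
            states[word[2:]] = False
        else:
            states[word] = True
    return states


def option_command(name, enable):
    """Return the string to echo into trace_options for the wanted state."""
    return name if enable else "no" + name
```

For instance, `option_command("print-parent", False)` returns the same "noprint-parent" string used in the echo example above.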
+Here are the available options:
+
+  print-parent
+	On function traces, display the calling (parent)
+	function as well as the function being traced.
+	::
+
+	  print-parent:
+	   bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul
+
+	  noprint-parent:
+	   bash-4000  [01]  1477.606694: simple_strtoul
+
+
+  sym-offset
+	Display not only the function name, but also the
+	offset in the function. For example, instead of
+	seeing just "ktime_get", you will see
+	"ktime_get+0xb/0x20".
+	::
+
+	  sym-offset:
+	   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
+
+  sym-addr
+	This will also display the function address as well
+	as the function name.
+	::
+
+	  sym-addr:
+	   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
+
+  verbose
+	This deals with the trace file when the
+	latency-format option is enabled.
+	::
+
+	    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
+	    (+0.000ms): simple_strtoul (kstrtoul)
+
+  raw
+	This will display raw numbers. This option is best for
+	use with user applications that can translate the raw
+	numbers better than having it done in the kernel.
+
+  hex
+	Similar to raw, but the numbers will be in a hexadecimal format.
+
+  bin
+	This will print out the formats in raw binary.
+
+  block
+	When set, reading trace_pipe will not block when polled.
+
+  trace_printk
+	Can disable trace_printk() from writing into the buffer.
+
+  annotate
+	It is sometimes confusing when the CPU buffers are full
+	and one CPU buffer had a lot of events recently, thus
+	a shorter time frame, whereas another CPU may have only had
+	a few events, which lets it have older events. When
+	the trace is reported, it shows the oldest events first,
+	and it may look like only one CPU ran (the one with the
+	oldest events). When the annotate option is set, it will
+	display when a new CPU buffer started::
+
+			  <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
+			  <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
+			  <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
+		##### CPU 2 buffer started ####
+			  <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
+			  <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
+			  <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
+
+  userstacktrace
+	This option changes the trace. It records a
+	stacktrace of the current user space thread after
+	each trace event.
+
+  sym-userobj
+	When user stacktraces are enabled, look up which
+	object the address belongs to, and print a
+	relative address. This is especially useful when
+	ASLR is on, otherwise you don't get a chance to
+	resolve the address to object/file/line after
+	the app is no longer running.
+
+	The lookup is performed when you read
+	trace or trace_pipe. Example::
+
+		  a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
+		  x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
+
+
+  printk-msg-only
+	When set, trace_printk()s will only show the format
+	and not their parameters (if trace_bprintk() or
+	trace_bputs() was used to save the trace_printk()).
+
+  co