path: root/GUIDE.md
Diffstat (limited to 'GUIDE.md')
-rw-r--r-- GUIDE.md 71
1 files changed, 71 insertions, 0 deletions
diff --git a/GUIDE.md b/GUIDE.md
index 8022f292..907ab382 100644
--- a/GUIDE.md
+++ b/GUIDE.md
@@ -18,6 +18,7 @@ translatable to any command line shell environment.
* [Replacements](#replacements)
* [Configuration file](#configuration-file)
* [File encoding](#file-encoding)
+* [Binary data](#binary-data)
* [Common options](#common-options)
@@ -680,6 +681,76 @@ $ rg '\w(?-u:\w)\w'
```
+### Binary data
+
+In addition to skipping hidden files and files in your `.gitignore` by default,
+ripgrep also attempts to skip binary files. ripgrep does this by default
+because binary files (like PDFs or images) are typically not things you want to
+search when searching for regex matches. Moreover, if content in a binary file
+did match, then it's possible for undesirable binary data to be printed to your
+terminal and wreak havoc.
+
+Unfortunately, unlike skipping hidden files and respecting your `.gitignore`
+rules, a file cannot as easily be classified as binary. In order to figure out
+whether a file is binary, the most effective heuristic that balances
+correctness with performance is to simply look for `NUL` bytes. At that point,
+the determination is simple: a file is considered "binary" if and only if it
+contains a `NUL` byte somewhere in its contents.
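
The heuristic above can be sketched in a few lines. This is a minimal illustration, not ripgrep's implementation: the helper name `contains_nul` and the chunk size are assumptions made for the example.

```python
import io


def contains_nul(stream, chunk_size=64 * 1024):
    """Return True if the binary stream contains a NUL byte anywhere.

    Hypothetical helper for illustration only. The stream is read in
    chunks so that large inputs are never held in memory all at once.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of input reached without seeing a NUL byte.
            return False
        if b"\x00" in chunk:
            return True


if __name__ == "__main__":
    plain = io.BytesIO(b"plain text\n")
    # A NUL byte as the very last byte still marks the input as binary.
    late_nul = io.BytesIO(b"a" * 200_000 + b"\x00")
    print(contains_nul(plain))     # False
    print(contains_nul(late_nul))  # True
```

Note that the second input is only classified after every byte has been read, which is the cost the next paragraph describes.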
+
+The issue is that while most binary files will have a `NUL` byte toward the
+beginning of their contents, this is not necessarily true. The `NUL` byte might
+be the very last byte in a large file, but that file is still considered
+binary. While this leads to a fair amount of complexity inside ripgrep's
+implementation, it also results in some unintuitive user experiences.
+
+At a high level, ripgrep operates in three different modes with respect to
+binary files:
+
+1. The default mode is to attempt to remove binary files from a search
+ completely. This is meant to mirror how ripgrep removes hidden files and
+ files in your `.gitignore` automatically. That is, as soon as a file is
+ detected as binary, searching stops. If a match was already printed (because
+ it was detected long before a `NUL` byte), then ripgrep will print a warning
+ message indicating that the search stopped prematurely. This default mode
+ **only applies to files searched by ripgrep as a result of recursive
+ directory traversal**, which is consistent with ripgrep's other automatic
+ filtering. For example, `rg foo .file` will search `.file` even though it
+ is hidden. Similarly, `rg foo binary-file` will search `binary-file` in "binary"
+ mode automatically.
+2. Binary mode is similar to the default mode, except it will not always
+ stop searching after it sees a `NUL` byte. Namely, in this mode, ripgrep
+ will continue searching a file that is known to be binary until the first
+ of two conditions is met: 1) the end of the file has been reached or 2) a
+ match is or has been seen. This means that in binary mode, if ripgrep
+ reports no matches, then there are no matches in the file. When a match does
+ occur, ripgrep prints a message similar to the one it prints when in its default
+ mode indicating that the search has stopped prematurely. This mode can be
+ forcefully enabled for all files with the `--binary` flag. The purpose of
+ binary mode is to provide a way to discover matches in all files, but to
+ avoid having binary data dumped into your terminal.
+3. Text mode completely disables all binary detection and searches all files
+ as if they were text. This is useful when searching a file that is
+ predominantly text but contains a `NUL` byte, or if you are specifically
+ trying to search binary data. This mode can be enabled with the `-a/--text`
+ flag. Note that when using this mode on very large binary files, it is
+ possible for ripgrep to use a lot of memory.
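
The three modes described above can be modeled with a toy line-by-line search. This is a simplified sketch under stated assumptions, not ripgrep's actual algorithm: the function `search`, the mode names as strings, and the note messages are all hypothetical, and real ripgrep output differs.

```python
import re


def search(data, pattern, mode="default"):
    """Toy model of the default, binary, and text modes.

    `data` and `pattern` are bytes. Returns (matching_lines, note), where
    `note` is a hypothetical message, or None if nothing was cut short.
    """
    regex = re.compile(pattern)
    matches = []
    binary = False
    for line in data.split(b"\n"):
        # Text mode never performs binary detection.
        if mode != "text" and b"\x00" in line:
            binary = True
        if binary and mode == "default":
            # Default mode: stop as soon as the file is known to be binary,
            # and only warn if something was already reported.
            note = "search stopped early at NUL byte" if matches else None
            return matches, note
        hit = regex.search(line) is not None
        if binary and mode == "binary":
            # Binary mode: keep going until a match is or has been seen,
            # but never report lines from the binary part of the file.
            if hit or matches:
                return matches, "binary file matches"
            continue
        if hit:
            matches.append(line)
    return matches, None


if __name__ == "__main__":
    data = b"bar\x00\nbaz\nfoo\n"
    print(search(data, rb"foo", "default"))  # ([], None)
    print(search(data, rb"foo", "binary"))   # ([], 'binary file matches')
    print(search(data, rb"foo", "text"))     # ([b'foo'], None)
```

In this model, an empty result with no note in binary mode means the whole input was scanned without finding a match, mirroring the guarantee stated for binary mode above.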
+
+Unfortunately, there is one additional complexity in ripgrep that can make it
+difficult to reason about binary files. That is, the way binary detection works
+depends on the way that ripgrep searches your files. Specifically:
+
+* When ripgrep uses memory maps, then binary detection is only performed on the
+ first few kilobytes of the file in addition to every matching line.
+* When ripgrep doesn't use memory maps, then binary detection is performed on
+ all bytes searched.
+
+This means that whether a file is detected as binary or not can change based
+on the internal search strategy used by ripgrep. If you prefer to keep
+ripgrep's binary file detection consistent, then you can disable memory maps
+via the `--no-mmap` flag. (The cost will be a small performance regression when
+searching very large files on some platforms.)
+
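To make the difference between the two strategies concrete, here is a small model of each detection rule. It is only a sketch: the function names are hypothetical, and the 8 KiB prefix is an assumed stand-in for "the first few kilobytes", not a value taken from ripgrep.

```python
def detected_with_mmap(data, matching_lines, prefix_len=8 * 1024):
    """Model of detection with memory maps.

    Only an assumed-size prefix of the file and the lines that matched
    are inspected for NUL bytes.
    """
    if b"\x00" in data[:prefix_len]:
        return True
    return any(b"\x00" in line for line in matching_lines)


def detected_without_mmap(data):
    """Model of detection without memory maps: every byte is inspected."""
    return b"\x00" in data


if __name__ == "__main__":
    # A match near the start, then a NUL byte far beyond the prefix.
    data = b"foo\n" + b"x" * 100_000 + b"\n\x00tail\n"
    matching_lines = [b"foo"]
    print(detected_with_mmap(data, matching_lines))  # False
    print(detected_without_mmap(data))               # True
```

The same input is classified differently by the two models, which is the inconsistency that disabling memory maps avoids.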
+
### Common options
ripgrep has a lot of flags. Too many to keep in your head at once. This section