author: Andrew Gallant <jamslam@gmail.com> 2019-04-08 19:28:38 -0400
committer: Andrew Gallant <jamslam@gmail.com> 2019-04-14 19:29:27 -0400
commit: a7d26c8f144a4957b75f71087a66692d0b25759a
tree: 4888ac5ea66643ac919d4e12c60cc51992bef11a /GUIDE.md
parent: bd222ae93fa0cabe7d51ba8db40ece99579bdaed
binary: rejigger ripgrep's handling of binary files
This commit attempts to surface binary filtering in a slightly more user-friendly way. Namely, before, ripgrep would silently stop searching a file if it detected a NUL byte, even if it had previously printed a match. This can lead to the user quite reasonably assuming that there are no more matches, since a partial search is fairly unintuitive. (ripgrep has this behavior by default because it really wants to NOT search binary files at all, just like it doesn't search gitignored or hidden files.)

With this commit, if a match has already been printed and ripgrep detects a NUL byte, then it will print a warning message indicating that the search stopped prematurely.

Moreover, this commit adds a new flag, --binary, which causes ripgrep to stop filtering binary files, but in a way that still avoids dumping binary data into terminals. That is, the --binary flag makes ripgrep behave more like grep's default behavior.

For files explicitly specified in a search, e.g., `rg foo some-file`, no binary filtering is applied (just like no gitignore and no hidden file filtering is applied). Instead, ripgrep behaves as if you gave the --binary flag for all explicitly given files.

This was a fairly invasive change, and potentially increases the UX complexity of ripgrep around binary files. (Before, there were two binary modes, whereas now there are three.) However, ripgrep is now a bit louder with warning messages when binary file detection might otherwise be hiding potential matches, so hopefully this is a net improvement.

Finally, the `-uuu` convenience now maps to `--no-ignore --hidden --binary`, since this is closer to the actual intent of the `--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a consequence, `rg -uuu foo` should now search roughly the same number of bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the same number of bytes as `grep -ra foo`. (The "roughly" weasel word is used because grep's and ripgrep's binary file detection might differ somewhat---perhaps based on buffer sizes---which can impact exactly what is and isn't searched.)

See the numerous tests in tests/binary.rs for intended behavior.

Fixes #306, Fixes #855
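
As a reading aid for the three modes described above, here is a minimal Python sketch of the decision logic. It is not ripgrep's implementation (ripgrep is written in Rust and detects `NUL` bytes per buffer, not per line). The names `BinaryMode`, `mode_for`, `search` and `WARNING` are hypothetical, and the warning string is a placeholder rather than ripgrep's actual message.

```python
from enum import Enum
from typing import List, Optional, Tuple

# Placeholder text; NOT the message ripgrep actually prints.
WARNING = "binary data found; search stopped early"


class BinaryMode(Enum):
    DEFAULT = "default"  # files found by recursive directory traversal
    BINARY = "binary"    # --binary, or a file named explicitly
    TEXT = "text"        # -a/--text


def mode_for(explicit: bool, binary_flag: bool = False,
             text_flag: bool = False) -> BinaryMode:
    """Pick a mode from the (assumed) inputs: flags and how the file was found."""
    if text_flag:
        return BinaryMode.TEXT
    if binary_flag or explicit:
        return BinaryMode.BINARY
    return BinaryMode.DEFAULT


def search(data: bytes, needle: bytes,
           mode: BinaryMode) -> Tuple[List[bytes], Optional[str]]:
    """Return (printed_lines, warning) for one file's contents."""
    printed: List[bytes] = []
    seen_nul = False
    for line in data.split(b"\n"):
        if mode is not BinaryMode.TEXT and not seen_nul and b"\x00" in line:
            seen_nul = True
            # Default mode always stops at the NUL byte; binary mode stops
            # only if a match has already been seen. Warn if output exists.
            if mode is BinaryMode.DEFAULT or printed:
                return printed, (WARNING if printed else None)
        if needle in line:
            if seen_nul:
                # Binary mode: report the match without printing binary data.
                return printed, WARNING
            printed.append(line)
    return printed, None
```

With `b"\x00header\nfoo late\n"` as input and `b"foo"` as the needle, the sketch returns no lines and no warning in default mode, no lines plus the warning in binary mode, and the line `foo late` in text mode. That is the property the commit message is after: in binary mode, an empty result with no warning means the file has no matches.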
Diffstat (limited to 'GUIDE.md')
-rw-r--r-- GUIDE.md 71
1 files changed, 71 insertions, 0 deletions
diff --git a/GUIDE.md b/GUIDE.md
index 8022f292..907ab382 100644
--- a/GUIDE.md
+++ b/GUIDE.md
@@ -18,6 +18,7 @@ translatable to any command line shell environment.
* [Replacements](#replacements)
* [Configuration file](#configuration-file)
* [File encoding](#file-encoding)
+* [Binary data](#binary-data)
* [Common options](#common-options)
@@ -680,6 +681,76 @@ $ rg '\w(?-u:\w)\w'
```
+### Binary data
+
+In addition to skipping hidden files and files in your `.gitignore` by default,
+ripgrep also attempts to skip binary files. ripgrep does this by default
+because binary files (like PDFs or images) are typically not things you want to
+search when searching for regex matches. Moreover, if content in a binary file
+did match, then it's possible for undesirable binary data to be printed to your
+terminal and wreak havoc.
+
+Unfortunately, unlike skipping hidden files and respecting your `.gitignore`
+rules, a file cannot as easily be classified as binary. In order to figure out
+whether a file is binary, the most effective heuristic that balances
+correctness with performance is to simply look for `NUL` bytes. At that point,
+the determination is simple: a file is considered "binary" if and only if it
+contains a `NUL` byte somewhere in its contents.
+
+The issue is that while most binary files will have a `NUL` byte toward the
+beginning of their contents, this is not necessarily true. The `NUL` byte might
+be the very last byte in a large file, but that file is still considered
+binary. While this leads to a fair amount of complexity inside ripgrep's
+implementation, it also results in some unintuitive user experiences.
+
+At a high level, ripgrep operates in three different modes with respect to
+binary files:
+
+1. The default mode is to attempt to remove binary files from a search
+ completely. This is meant to mirror how ripgrep removes hidden files and
+ files in your `.gitignore` automatically. That is, as soon as a file is
+ detected as binary, searching stops. If a match was already printed (because
+ it was detected long before a `NUL` byte), then ripgrep will print a warning
+ message indicating that the search stopped prematurely. This default mode
+ **only applies to files searched by ripgrep as a result of recursive
+ directory traversal**, which is consistent with ripgrep's other automatic
+ filtering. For example, `rg foo .file` will search `.file` even though it
+ is hidden. Similarly, `rg foo binary-file` searches `binary-file` in "binary"
+ mode automatically.
+2. Binary mode is similar to the default mode, except it will not always
+ stop searching after it sees a `NUL` byte. Namely, in this mode, ripgrep
+ will continue searching a file that is known to be binary until the first
+ of two conditions is met: 1) the end of the file has been reached or 2) a
+ match is or has been seen. This means that in binary mode, if ripgrep
+ reports no matches, then there are no matches in the file. When a match does
+ occur, ripgrep prints a message similar to one it prints when in its default
+ mode indicating that the search has stopped prematurely. This mode can be
+ forcefully enabled for all files with the `--binary` flag. The purpose of
+ binary mode is to provide a way to discover matches in all files, but to
+ avoid having binary data dumped into your terminal.
+3. Text mode completely disables all binary detection and searches all files
+ as if they were text. This is useful when searching a file that is
+ predominantly text but contains a `NUL` byte, or if you are specifically
+ trying to search binary data. This mode can be enabled with the `-a/--text`
+ flag. Note that when using this mode on very large binary files, it is
+ possible for ripgrep to use a lot of memory.
+
+Unfortunately, there is one additional complexity in ripgrep that can make it
+difficult to reason about binary files. That is, the way binary detection works
+depends on the way that ripgrep searches your files. Specifically:
+
+* When ripgrep uses memory maps, then binary detection is only performed on the
+ first few kilobytes of the file in addition to every matching line.
+* When ripgrep doesn't use memory maps, then binary detection is performed on
+ all bytes searched.
+
+This means that whether a file is detected as binary or not can change based
+on the internal search strategy used by ripgrep. If you prefer to keep
+ripgrep's binary file detection consistent, then you can disable memory maps
+via the `--no-mmap` flag. (The cost will be a small performance regression when
+searching very large files on some platforms.)
+
+
### Common options
ripgrep has a lot of flags. Too many to keep in your head at once. This section