binary: rejigger ripgrep's handling of binary files

This commit attempts to surface binary filtering in a slightly more
user-friendly way. Namely, before, ripgrep would silently stop
searching a file if it detected a NUL byte, even if it had previously
printed a match. This can lead to the user quite reasonably assuming
that there are no more matches, since a partial search is fairly
unintuitive. (ripgrep has this behavior by default because it really
wants to NOT search binary files at all, just like it doesn't search
gitignored or hidden files.)

With this commit, if a match has already been printed and ripgrep
detects a NUL byte, then it will print a warning message indicating
that the search stopped prematurely.

Moreover, this commit adds a new flag, --binary, which causes ripgrep
to stop filtering binary files, but in a way that still avoids dumping
binary data into terminals. That is, the --binary flag makes ripgrep
behave more like grep's default behavior.

For files explicitly specified in a search, e.g., `rg foo some-file`,
no binary filtering is applied (just like no gitignore and no hidden
file filtering is applied). Instead, ripgrep behaves as if you gave
the --binary flag for all explicitly given files.

This was a fairly invasive change, and potentially increases the UX
complexity of ripgrep around binary files. (Before, there were two
binary modes, whereas now there are three.) However, ripgrep is now a
bit louder with warning messages when binary file detection might
otherwise be hiding potential matches, so hopefully this is a net
improvement.

Finally, the `-uuu` convenience now maps to `--no-ignore --hidden
--binary`, since this is closer to the actual intent of the
`--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a
consequence, `rg -uuu foo` should now search roughly the same number of
bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the
same number of bytes as `grep -ra foo`. (The "roughly" weasel word is
used because grep's and ripgrep's binary file detection might differ
somewhat---perhaps based on buffer sizes---which can impact exactly
what is and isn't searched.)

See the numerous tests in tests/binary.rs for intended behavior.

Fixes #306, Fixes #855
author: Andrew Gallant <jamslam@gmail.com> 2019-04-08 19:28:38 -0400
committer: Andrew Gallant <jamslam@gmail.com> 2019-04-14 19:29:27 -0400
commit: a7d26c8f144a4957b75f71087a66692d0b25759a (patch)
tree: 4888ac5ea66643ac919d4e12c60cc51992bef11a
parent: bd222ae93fa0cabe7d51ba8db40ece99579bdaed (diff)
22 files changed, 1323 insertions, 70 deletions
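
The three binary modes described in the commit message can be modeled with a small standalone sketch. This is not ripgrep's implementation: the names `BinaryMode`, `Outcome`, and `search` are hypothetical, the needle is a plain byte string rather than a regex, and NUL detection here happens line by line, whereas ripgrep detects NUL bytes per buffer (or only in the first few kilobytes plus matching lines when memory maps are used). The message wording is also simplified relative to the strings in the diff below.

```rust
/// Hypothetical model of the three modes; not a ripgrep type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BinaryMode {
    /// Default for files found by directory traversal: stop at the first NUL.
    Quit,
    /// `--binary` (and explicitly named files): keep searching past a NUL,
    /// but stop at the first match seen once the file is known to be binary.
    Convert,
    /// `-a/--text`: no NUL handling at all.
    Text,
}

/// What a search would show the user (hypothetical type).
#[derive(Debug, PartialEq)]
struct Outcome {
    /// Matching lines that were printed in full.
    printed: Vec<String>,
    /// Warning or summary line emitted at the end of the search, if any.
    message: Option<String>,
}

/// Literal substring test; an empty needle matches every line.
fn contains(line: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || line.windows(needle.len()).any(|w| w == needle)
}

/// Simplified, line-granular search (assumption: real detection is
/// buffer-granular, so exact results can differ from ripgrep's).
fn search(haystack: &[u8], needle: &[u8], mode: BinaryMode) -> Outcome {
    let mut printed = Vec::new();
    let mut message = None;
    let mut binary_offset: Option<usize> = None;
    let mut pos = 0;

    for line in haystack.split(|&b| b == b'\n') {
        let start = pos;
        pos += line.len() + 1;

        // Look for the first NUL byte unless text mode disables detection.
        if mode != BinaryMode::Text && binary_offset.is_none() {
            if let Some(i) = line.iter().position(|&b| b == 0) {
                binary_offset = Some(start + i);
                if mode == BinaryMode::Quit {
                    // Only warn if the user already saw a match.
                    if !printed.is_empty() {
                        message = Some(format!(
                            "WARNING: stopped searching binary file after \
                             match (found NUL byte around offset {})",
                            start + i
                        ));
                    }
                    break;
                }
            }
        }

        if !contains(line, needle) {
            continue;
        }
        if let Some(offset) = binary_offset {
            // Convert mode: report the match without printing binary data.
            message = Some(format!(
                "Binary file matches (found NUL byte around offset {})",
                offset
            ));
            break;
        }
        printed.push(String::from_utf8_lossy(line).into_owned());
    }
    Outcome { printed, message }
}
```

For the input `b"foo one\nbar\x00baz\nfoo two\n"` and the needle `b"foo"`, this model prints `foo one` in all three modes. `Quit` then stops at the NUL byte (offset 11) with the warning, `Convert` keeps going and stops at `foo two` with the "Binary file matches" message, and `Text` prints both matching lines with no message.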
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0e9f381d..fad6dc11 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,11 @@ TODO.
   error (e.g., regex syntax error). One exception to this is if ripgrep is run
   with `-q/--quiet`. In that case, if an error occurs and a match is found,
   then ripgrep will exit with a `0` exit status code.
+* Supplying the `-u/--unrestricted` flag three times is now equivalent to
+  supplying `--no-ignore --hidden --binary`. Previously, `-uuu` was equivalent
+  to `--no-ignore --hidden --text`. The difference is that `--binary` disables
+  binary file filtering without potentially dumping binary data into your
+  terminal. That is, `rg -uuu foo` should now be equivalent to `grep -r foo`.
 * The `avx-accel` feature of ripgrep has been removed since it is no longer
   necessary. All uses of AVX in ripgrep are now enabled automatically via
   runtime CPU feature detection. The `simd-accel` feature does remain
@@ -25,6 +30,8 @@ Performance improvements:
 
 Feature enhancements:
 
+* [FEATURE #855](https://github.com/BurntSushi/ripgrep/issues/855):
+  Add `--binary` flag for disabling binary file filtering.
 * [FEATURE #1099](https://github.com/BurntSushi/ripgrep/pull/1099):
   Add support for Brotli and Zstd to the `-z/--search-zip` flag.
 * [FEATURE #1138](https://github.com/BurntSushi/ripgrep/pull/1138):
@@ -36,6 +43,9 @@ Feature enhancements:
 
 Bug fixes:
 
+* [BUG #306](https://github.com/BurntSushi/ripgrep/issues/306),
+  [BUG #855](https://github.com/BurntSushi/ripgrep/issues/855):
+  Improve the user experience for ripgrep's binary file filtering.
 * [BUG #373](https://github.com/BurntSushi/ripgrep/issues/373),
   [BUG #1098](https://github.com/BurntSushi/ripgrep/issues/1098):
   `**` is now accepted as valid syntax anywhere in a glob.
diff --git a/GUIDE.md b/GUIDE.md
index 8022f292..907ab382 100644
--- a/GUIDE.md
+++ b/GUIDE.md
@@ -18,6 +18,7 @@ translatable to any command line shell environment.
 * [Replacements](#replacements)
 * [Configuration file](#configuration-file)
 * [File encoding](#file-encoding)
+* [Binary data](#binary-data)
 * [Common options](#common-options)
 
 
@@ -680,6 +681,76 @@ $ rg '\w(?-u:\w)\w'
 ```
 
 
+### Binary data
+
+In addition to skipping hidden files and files in your `.gitignore` by default,
+ripgrep also attempts to skip binary files. ripgrep does this by default
+because binary files (like PDFs or images) are typically not things you want to
+search when searching for regex matches. Moreover, if content in a binary file
+did match, then it's possible for undesirable binary data to be printed to your
+terminal and wreak havoc.
+
+Unfortunately, unlike skipping hidden files and respecting your `.gitignore`
+rules, a file cannot as easily be classified as binary. In order to figure out
+whether a file is binary, the most effective heuristic that balances
+correctness with performance is to simply look for `NUL` bytes. At that point,
+the determination is simple: a file is considered "binary" if and only if it
+contains a `NUL` byte somewhere in its contents.
+
+The issue is that while most binary files will have a `NUL` byte toward the
+beginning of its contents, this is not necessarily true. The `NUL` byte might
+be the very last byte in a large file, but that file is still considered
+binary. While this leads to a fair amount of complexity inside ripgrep's
+implementation, it also results in some unintuitive user experiences.
+
+At a high level, ripgrep operates in three different modes with respect to
+binary files:
+
+1. The default mode is to attempt to remove binary files from a search
+   completely. This is meant to mirror how ripgrep removes hidden files and
+   files in your `.gitignore` automatically. That is, as soon as a file is
+   detected as binary, searching stops. If a match was already printed (because
+   it was detected long before a `NUL` byte), then ripgrep will print a warning
+   message indicating that the search stopped prematurely. This default mode
+   **only applies to files searched by ripgrep as a result of recursive
+   directory traversal**, which is consistent with ripgrep's other automatic
+   filtering. For example, `rg foo .file` will search `.file` even though it
+   is hidden. Similarly, `rg foo binary-file` search `binary-file` in "binary"
+   mode automatically.
+2. Binary mode is similar to the default mode, except it will not always
+   stop searching after it sees a `NUL` byte. Namely, in this mode, ripgrep
+   will continue searching a file that is known to be binary until the first
+   of two conditions is met: 1) the end of the file has been reached or 2) a
+   match is or has been seen. This means that in binary mode, if ripgrep
+   reports no matches, then there are no matches in the file. When a match does
+   occur, ripgrep prints a message similar to one it prints when in its default
+   mode indicating that the search has stopped prematurely. This mode can be
+   forcefully enabled for all files with the `--binary` flag. The purpose of
+   binary mode is to provide a way to discover matches in all files, but to
+   avoid having binary data dumped into your terminal.
+3. Text mode completely disables all binary detection and searches all files
+   as if they were text. This is useful when searching a file that is
+   predominantly text but contains a `NUL` byte, or if you are specifically
+   trying to search binary data. This mode can be enabled with the `-a/--text`
+   flag. Note that when using this mode on very large binary files, it is
+   possible for ripgrep to use a lot of memory.
+
+Unfortunately, there is one additional complexity in ripgrep that can make it
+difficult to reason about binary files. That is, the way binary detection works
+depends on the way that ripgrep searches your files. Specifically:
+
+* When ripgrep uses memory maps, then binary detection is only performed on the
+  first few kilobytes of the file in addition to every matching line.
+* When ripgrep doesn't use memory maps, then binary detection is performed on
+  all bytes searched.
+
+This means that whether a file is detected as binary or not can change based
+on the internal search strategy used by ripgrep. If you prefer to keep
+ripgrep's binary file detection consistent, then you can disable memory maps
+via the `--no-mmap` flag. (The cost will be a small performance regression when
+searching very large files on some platforms.)
+
+
 ### Common options
 
 ripgrep has a lot of flags. Too many to keep in your head at once. This section
diff --git a/complete/_rg b/complete/_rg
index c4a983ac..882a38d6 100644
--- a/complete/_rg
+++ b/complete/_rg
@@ -227,6 +227,8 @@ _rg() {
 
     + '(text)' # Binary-search options
     {-a,--text}'[search binary files as if they were text]'
+    "--binary[search binary files, don't print binary data]"
+    $no"--no-binary[don't search binary files]"
     $no"(--null-data)--no-text[don't search binary files as if they were text]"
 
     + '(threads)' # Thread-count options
diff --git a/doc/rg.1.txt.tpl b/doc/rg.1.txt.tpl
index 1c542b6b..d40fb359 100644
--- a/doc/rg.1.txt.tpl
+++ b/doc/rg.1.txt.tpl
@@ -41,6 +41,9 @@ configuration file. The file can specify one shell argument per line. Lines
 starting with *#* are ignored. For more details, see the man page or the
 *README*.
 
+Tip: to disable all smart filtering and make ripgrep behave a bit more like
+classical grep, use *rg -uuu*.
+
 
 REGEX SYNTAX
 ------------
diff --git a/grep-printer/src/standard.rs b/grep-printer/src/standard.rs
index 6ead1db6..068f96a4 100644
--- a/grep-printer/src/standard.rs
+++ b/grep-printer/src/standard.rs
@@ -5,6 +5,7 @@ use std::path::Path;
 use std::sync::Arc;
 use std::time::Instant;
 
+use bstr::BStr;
 use grep_matcher::{Match, Matcher};
 use grep_searcher::{
     LineStep, Searcher,
@@ -743,6 +744,11 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
             stats.add_matches(self.standard.matches.len() as u64);
             stats.add_matched_lines(mat.lines().count() as u64);
         }
+        if searcher.binary_detection().convert_byte().is_some() {
+            if self.binary_byte_offset.is_some() {
+                return Ok(false);
+            }
+        }
 
         StandardImpl::from_match(searcher, self, mat).sink()?;
         Ok(!self.should_quit())
@@ -764,6 +770,12 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
             self.record_matches(ctx.bytes())?;
             self.replace(ctx.bytes())?;
         }
+        if searcher.binary_detection().convert_byte().is_some() {
+            if self.binary_byte_offset.is_some() {
+                return Ok(false);
+            }
+        }
+
         StandardImpl::from_context(searcher, self, ctx).sink()?;
         Ok(!self.should_quit())
     }
@@ -776,6 +788,15 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
         Ok(true)
     }
 
+    fn binary_data(
+        &mut self,
+        _searcher: &Searcher,
+        binary_byte_offset: u64,
+    ) -> Result<bool, io::Error> {
+        self.binary_byte_offset = Some(binary_byte_offset);
+        Ok(true)
+    }
+
     fn begin(
         &mut self,
         _searcher: &Searcher,
@@ -793,10 +814,12 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
 
     fn finish(
         &mut self,
-        _searcher: &Searcher,
+        searcher: &Searcher,
         finish: &SinkFinish,
     ) -> Result<(), io::Error> {
-        self.binary_byte_offset = finish.binary_byte_offset();
+        if let Some(offset) = self.binary_byte_offset {
+            StandardImpl::new(searcher, self).write_binary_message(offset)?;
+        }
         if let Some(stats) = self.stats.as_mut() {
             stats.add_elapsed(self.start_time.elapsed());
             stats.add_searches(1);
@@ -1314,6 +1337,38 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
         Ok(())
     }
 
+    fn write_binary_message(&self, offset: u64) -> io::Result<()> {
+        if self.sink.match_count == 0 {
+            return Ok(());
+        }
+
+        let bin = self.searcher.binary_detection();
+        if let Some(byte) = bin.quit_byte() {
+            self.write(b"WARNING: stopped searching binary file ")?;
+            if let Some(path) = self.path() {
+                self.write_spec(self.config().colors.path(), path.as_bytes())?;
+                self.write(b" ")?;
+            }
+            let remainder = format!(
+                "after match (found {:?} byte around offset {})\n",
+                BStr::new(&[byte]), offset,
+            );
+            self.write(remainder.as_bytes())?;
+        } else if let Some(byte) = bin.convert_byte() {
+            self.write(b"Binary file ")?;
+            if let Some(path) = self.path() {
+                self.write_spec(self.config().colors.path(), path.as_bytes())?;
+                self.write(b" ")?;
+            }
+            let remainder = format!(
+                "matches (found {:?} byte around offset {})\n",
+                BStr::new(&[byte]), offset,
+            );
+            self.write(remainder.as_bytes())?;
+        }
+        Ok(())
+    }
+
     fn write_context_separator(&self) -> io::Result<()> {
         if let Some(ref sep) = *self.config().separator_context {
             self.write(sep)?;
diff --git a/grep-printer/src/summary.rs b/grep-printer/src/summary.rs
index deb7e609..a1c7785e 100644
--- a/grep-printer/src/summary.rs
+++ b/grep-printer/src/summary.rs
@@ -636,6 +636,34 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for SummarySink<'p, 's, M, W> {
             stats.add_bytes_searched(finish.byte_count());
             stats.add_bytes_printed(self.summary.wtr.borrow().count());
         }
+        // If our binary detection method says to quit after seeing binary
+        // data, then we shouldn't print any results at all, even if we've
+        // found a match before detecting binary data. The intent here is to
+        // keep BinaryDetection::quit as a form of filter. Otherwise, we can
+        // present a matching file with a smaller number of matches than
+        // there might be, which can be quite misleading.
+        //
+        // If our binary detection method is to convert binary data, then we
+        // don't quit and therefore search the entire contents of the file.
+        //
+        // There is an unfortunate inconsistency here. Namely, when using
+        // Quiet or PathWithMatch, then the printer can quit after the first
+        // match seen, which could be long before seeing binary data. This
+        // means that using PathWithMatch can print a path where as using
+        // Count might not print it at all because of binary data.
+        //
+        // It's not possible to fix this without also potentially significantly
+        // impacting the performance of Quiet or PathWithMatch, so we accept
+        // the bug.
+        if self.binary_byte_offset.is_some()
+            && searcher.binary_detection().quit_byte().is_some()
+        {
+            // Squash the match count. The statistics reported will still
+            // contain the match count, but the "official" match count should
+            // be zero.
+            self.match_count = 0;
+            return Ok(());
+        }
 
         let show_count =
             !self.summary.config.exclude_zero
diff --git a/grep-searcher/src/line_buffer.rs b/grep-searcher/src/line_buffer.rs
index c2e54a9e..cc7dd578 100644
--- a/grep-searcher/src/line_buffer.rs
+++ b/grep-searcher/src/line_buffer.rs
@@ -317,6 +317,14 @@ pub struct LineBuffer {
 }
 
 impl LineBuffer {
+    /// Set the binary detection method used on this line buffer.
+    ///
+    /// This permits dynamically changing the binary detection strategy on
+    /// an existing line buffer without needing to create a new one.
+    pub fn set_binary_detection(&mut self, binary: BinaryDetection) {
+        self.config.binary = binary;
+    }
+
     /// Reset this buffer, such that it can be used with a new reader.
     fn clear(&mut self) {
         self.pos = 0;
diff --git a/grep-searcher/src/searcher/core.rs b/grep-searcher/src/searcher/core.rs
index ff2cd18d..dd621bba 100644
--- a/grep-searcher/src/searcher/core.rs
+++ b/grep-searcher/src/searcher/core.rs
@@ -90,6 +90,13 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
         self.sink_matched(buf, range)
     }
 
+    pub fn binary_data(
+        &mut self,
+        binary_byte_offset: u64,
+    ) -> Result<bool, S::Error> {
+        self.sink.binary_data(&self.searcher, binary_byte_offset)
+    }
+
     pub fn begin(&mut self) -> Result<bool, S::Error> {
         self.sink.begin(&self.searcher)
     }
@@ -141,19 +148,28 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
         consumed
     }
 
-    pub fn detect_binary(&mut self, buf: &[u8], range: &Range) -> bool {
+    pub fn detect_binary(
+        &mut self,
+        buf: &[u8],
+        range: &Range,
+    ) -> Result<bool, S::Error> {
         if self.binary_byte_offset.is_some() {
-            return true;
+            return Ok(self.config.binary.quit_byte().is_some());
         }
         let binary_byte = match self.config.binary.0 {
             BinaryDetection::Quit(b) => b,
-            _ => return false,
+            BinaryDetection::Convert(b) => b,
+            _ => return Ok(false),
         };
         if let Some(i) = B(&buf[*range]).find_byte(binary_byte) {
-            self.binary_byte_offset = Some(range.start() + i);
-            true
+            let offset = range.start() + i;
+            self.binary_byte_offset = Some(offset);
+            if !self.binary_data(offset as u64)? {
+                return Ok(true);
+            }
+            Ok(self.config.binary.quit_byte().is_some())
         } else {
-            false
+            Ok(false)
         }
     }
 
@@ -416,7 +432,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
         buf: &[u8],
         range: &Range,
     ) -> Result<bool, S::Error> {
-        if self.binary && self.detect_binary(buf, range) {
+        if self.binary && self.detect_binary(buf, range)? {
             return Ok(false);
         }
         if !self.sink_break_context(range.start())? {
@@ -448,7 +464,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
         buf: &[u8],
         range: &Range,
     ) -> Result<bool, S::Error> {
-        if self.binary && self.detect_binary(buf, range) {
+        if self.binary && self.detect_binary(buf, range)? {
             return Ok(false);
         }
         self.count_lines(buf, range.start());
@@ -478,7 +494,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
     ) -> Result<bool, S::Error> {
         assert!(self.after_context_left >= 1);
 
-        if self.binary && self.detect_binary(buf, range) {
+        if self.binary && self.detect_binary(buf, range)? {
             return Ok(false);
         }
         self.count_lines(buf, range.start());
@@ -507,7 +523,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
         buf: &[u8],
         range: &Range,
     ) -> Result<bool, S::Error> {
-        if self.binary && self.detect_binary(buf, range) {
+        if self.binary && self.detect_binary(buf, range)? {
             return Ok(false);
         }
         self.count_lines(buf, range.start());
diff --git a/grep-searcher/src/searcher/glue.rs b/grep-searcher/src/searcher/glue.rs
index 3a5d4291..4f362dab 100644
--- a/grep-searcher/src/searcher/glue.rs
+++ b/grep-searcher/src/searcher/glue.rs
@@ -51,6 +51,7 @@ where M: Matcher,
     fn fill(&mut self) -> Result<bool, S::Error> {
         assert!(self.rdr.buffer()[self.core.pos()..].is_empty());
 
+        let already_binary = self.rdr.binary_byte_offset().is_some();
         let old_buf_len = self.rdr.buffer().len();
         let consumed = self.core.roll(self.rdr.buffer());
         self.rdr.consume(consumed);
@@ -58,7 +59,14 @@ where M: Matcher,
             Err(err) => return Err(S::Error::error_io(err)),
             Ok(didread) => didread,
         };
-        if !didread || self.rdr.binary_byte_offset().is_some() {
+        if !already_binary {
+            if let Some(offset) = self.rdr.binary_byte_offset() {
+                if !self.core.binary_data(offset)? {
+                    return Ok(false);
+                }
+            }
+        }
+        if !didread || self.should_binary_quit() {
             return Ok(false);
         }
         // If rolling the buffer didn't result in consuming anything and if
@@ -71,6 +79,11 @@ where M: Matcher,
         }
         Ok(true)
     }
+
+    fn should_binary_quit(&self) -> bool {
+        self.rdr.binary_byte_offset().is_some()
+        && self.config.binary.quit_byte().is_some()
+    }
 }
 
 #[derive(Debug)]
@@ -103,7 +116,7 @@ impl<'s, M: Matcher, S: Sink> SliceByLine<'s, M, S> {
                 DEFAULT_BUFFER_CAPACITY,
             );
             let binary_range = Range::new(0, binary_upto);
-            if !self.core.detect_binary(self.slice, &binary_range) {
+            if !self.core.detect_binary(self.slice, &binary_range)? {
                 while
                     !self.slice[self.core.pos()..].is_empty()
                     && self.core.match_by_line(self.slice)?
@@ -155,7 +168,7 @@ impl<'s, M: Matcher, S: Sink> MultiLine<'s, M, S> {
                 DEFAULT_BUFFER_CAPACITY,
             );
             let binary_range = Range::new(0, binary_upto);
-            if !self.core.detect_binary(self.slice, &binary_range) {
+            if !self.core.detect_binary(self.slice, &binary_range)? {
                 let mut keepgoing = true;
                 while !self.slice[self.core.pos()..].is_empty() && keepgoing {
                     keepgoing = self.sink()?;
diff --git a/grep-searcher/src/searcher/mod.rs b/grep-searcher/src/searcher/mod.rs
index 729b491b..e20e04a3 100644
--- a/grep-searcher/src/searcher/mod.rs
+++ b/grep-searcher/src/searcher/mod.rs
@@ -75,25 +75,41 @@ impl BinaryDetection {
         BinaryDetection(line_buffer::BinaryDetection::Quit(binary_byte))
     }
 
-    // TODO(burntsushi): Figure out how to make binary conversion work. This
-    // permits implementing GNU grep's default behavior, which is to zap NUL
-    // bytes but still execute a search (if a match is detected, then GNU grep
-    // stops and reports that a match was found but doesn't print the matching
-    // line itself).
-    //
-    // This behavior is pretty simple to implement using the line buffer (and
-    // in fact, it is already implemented and tested), since there's a fixed
-    // size buffer that we can easily write to. The issue arises when searching
-    // a `&[u8]` (whether on the heap or via a memory map), since this isn't
-    // something we can easily write to.
-
-    /// The given byte is searched in all contents read by the line buffer. If
-    /// it occurs, then it is replaced by the line terminator. The line buffer
-    /// guarantees that this byte will never be observable by callers.
-    #[allow(dead_code)]
-    fn convert(binary_byte: u8) -> BinaryDetection {
+    /// Binary detection is performed by looking for the given byte, and
+    /// replacing it with the line terminator configured on the searcher.
+    /// (If the searcher is configured to use `CRLF` as the line terminator,
+    /// then this byte is replaced by just `LF`.)
+    ///
+    /// When searching is performed using a fixed size buffer, then the
+    /// contents of that buffer are always searched for the presence of this
+    /// byte and replaced with the line terminator. In effect, the caller is
+    /// guaranteed to never observe this byte while searching.
+    ///
+    /// When searching is performed with the entire contents mapped into
+    /// memory, then this setting has no effect and is ignored.
+    pub fn convert(binary_byte: u8) -> BinaryDetection {
         BinaryDetection(line_buffer::BinaryDetection::Convert(binary_byte))
     }
+
+    /// If this binary detection uses the "quit" strategy, then this returns
+    /// the byte that will cause a search to quit. In any other case, this
+    /// returns `None`.
+    pub fn quit_byte(&self) -> Option<u8> {
+        match self.0 {
+            line_buffer::BinaryDetection::Quit(b) => Some(b),
+            _ => None,
+        }
+    }
+
+    /// If this binary detection uses the "convert" strategy, then this returns
+    /// the byte that will be replaced by the line terminator. In any other
+    /// case, this returns `None`.
+    pub fn convert_byte(&self) -> Option<u8> {
+        match self.0 {
+            line_buffer::BinaryDetection::Convert(b) => Some(b),
+            _ => None,
+        }
+    }
 }
 
 /// An encoding to use when searching.
@@ -739,6 +755,12 @@ impl Searcher {
         }
     }
 
+    /// Set the binary detection method used on this searcher.
+    pub fn set_binary_detection(&mut self, detection: BinaryDetection) {
+        self.config.binary = detection.clone();
+        self.line_buffer.borrow_mut().set_binary_detection(detection.0);
+    }
+
     /// Check that the searcher's configuration and the matcher are consistent
     /// with each other.
     fn check_config<M: Matcher>(&self, matcher: M) -> Result<(), ConfigError> {
@@ -778,6 +800,12 @@ impl Searcher {
         self.config.line_term
     }
 
+    /// Returns the type of binary detection configured on this searcher.
+    #[inline]
+    pub fn binary_detection(&self) -> &BinaryDetection {
+        &self.config.binary
+    }
+
     /// Returns true if and only if this searcher is configured to invert its
     /// search results. That is, matching lines are lines that do **not** match
     /// the searcher's matcher.
diff --git a/grep-searcher/src/sink.rs b/grep-searcher/src/sink.rs
index bf2316f7..63a8ae24 100644
--- a/grep-searcher/src/sink.rs
+++ b/grep-searcher/src/sink.rs
@@ -167,6 +167,28 @@ pub trait Sink {
         Ok(true)
     }
 
+    /// This method is called whenever binary detection is enabled and binary
+    /// data is found. If binary data is found, then this is called at least
+    /// once for the first occurrence with the absolute byte offset at which
+    /// the binary data begins.
+    ///
+    /// If this returns `true`, then searching continues. If this returns
+    /// `false`, then searching is stopped immediately and `finish` is called.
+    ///
+    /// If this returns an error, then searching is stopped immediately,
+    /// `finish` is not called and the error is bubbled back up to the caller
+    /// of the searcher.
+    ///
+    /// By default, it does nothing and returns `true`.
+    #[inline]
+    fn binary_data(
+        &mut self,
+        _searcher: &Searcher,
+        _binary_byte_offset: u64,
+    ) -> Result<bool, Self::Error> {
+        Ok(true)
+    }
+
     /// This method is called when a search has begun, before any search is
     /// executed. By default, this does nothing.
     ///
@@ -229,6 +251,15 @@ impl<'a, S: Sink> Sink for &'a mut S {
     }
 
     #[inline]
+    fn binary_data(
+        &mut self,
+        searcher: &Searcher,
+        binary_byte_offset: u64,
+    ) -> Result<bool, S::Error> {
+        (**self).binary_data(searcher, binary_byte_offset)
+    }
+
+    #[inline]
     fn begin(
         &mut self,
         searcher: &Searcher,
@@ -276,6 +307,15 @@ impl<S: Sink + ?Sized> Sink for Box<S> {
     }
 
     #[inline]
+    fn binary_data(
+        &mut self,
+        searcher: &Searcher,
+        binary_byte_offset: u64,
+    ) -> Result<bool, S::Error> {
+        (**self).binary_data(searcher, binary_byte_offset)
+    }
+
+    #[inline]
     fn begin(
         &mut self,
         searcher: &Searcher,
diff --git a/src/app.rs b/src/app.rs
index 66eaedb4..d062699f 100644
--- a/src/app.rs
+++ b/src/app.rs
@@ -27,6 +27,9 @@ configuration file. The file can specify one shell argument per line. Lines
 starting with '#' are ignored. For more details, see the man page or the
 README.
 
+Tip: to disable all smart filtering and make ripgrep behave a bit more like
+classical grep, use 'rg -uuu'.
+
 Project home page: https://github.com/BurntSushi/ripgrep
 
 Use -h for short descriptions and --help for more details.";
@@ -545,6 +548,7 @@ pub fn all_args_and_flags() -> Vec<RGArg> {
     // "positive" flag.
     flag_after_context(&mut args);
     flag_before_context(&mut args);
+    flag_binary(&mut args);
     flag_block_buffered(&mut args);
     flag_byte_offset(&mut args);
     flag_case_sensitive(&mut args);
@@ -691,6 +695,55 @@ This overrides the --context flag.
     args.push(arg);
 }
 
+fn flag_binary(args: &mut Vec<RGArg>) {
+    const SHORT: &str = "Search binary files.";
+    const LONG: &str = long!("\
+Enabling this flag will cause ripgrep to search binary files. By default,
+ripgrep attempts to automatically skip binary files in order to improve the
+relevance of results and make the search faster.
+
+Binary files are heuristically detected based on whether they contain a NUL
+byte or not. By default (without this flag set), once a NUL byte is seen,
+ripgrep will stop searching the file. Usually, NUL bytes occur in the beginning
+of most binary files. If a NUL byte occurs after a match, then ripgrep will
+still stop searching the rest of the file, but a warning will be printed.
+
+In contrast, when this flag is provided, ripgrep will continue searching a file
+even if a NUL byte is found. In particular, if a NUL byte is found then ripgrep
+will continue searching until either a match is found or the end of the file is
+reached, whichever comes sooner. If a match is found, then ripgrep will stop
+and print a warning saying that the search stopped prematurely.
+
+If you want ripgrep to search a file without any special NUL byte handling at
+all (and potentially print binary data to stdout), then you should use the
+'-a/--text' flag.
+
+The '--binary' flag is a flag for controlling ripgrep's automatic filtering
+mechanism. As such, it does not need to be used when searching a file
+explicitly or when searching stdin. That is, it is only applicable when
+recursively searching a directory.
+
+Note that when the '-u/--unrestricted' flag is provided for a third time, then
+this flag is automatically enabled.
+
+This flag can be disabled with '--no-binary'. It overrides the '-a/--text'
+flag.
+");
+    let arg = RGArg::switch("binary")
+        .help(SHORT).long_help(LONG)
+        .overrides("no-binary")
+        .overrides("text")
+        .overrides("no-text");
+    args.push(arg);
+
+    let arg = RGArg::switch("no-binary")
+        .hidden()
+        .overrides("binary")
+        .overrides("text")
+        .overrides("no-text");
+    args.push(arg);
+}
+
 fn flag_block_buffered(args: &mut Vec<RGArg>) {
     const SHORT: &str = "Force block buffering.";
     const LONG: &str = long!("\
@@ -1874,7 +1927,7 @@ fn flag_pre(args: &mut Vec<RGArg>) {
 For each input FILE, search the standard output of COMMAND FILE rather than the
 contents of FILE. This option expects the COMMAND program to either be an
 absolute path or to be available in your PATH. Either an empty string COMMAND
-or the `--no-pre` flag will disable this behavior.
+or the '--no-pre' flag will disable this behavior.
 
     WARNING: When this flag is set, ripgrep will unconditionally