Add support for additional text encodings.

This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and
Shift_JIS. (Courtesy of the `encoding_rs` crate.)

Specifically, this feature enables ripgrep to search files that are
encoded in an encoding other than UTF-8. The list of available encodings
is tied directly to what the `encoding_rs` crate supports, which is in
turn tied to the Encoding Standard. The full list of available encodings
can be found here:
https://encoding.spec.whatwg.org/#concept-encoding-get

This pull request also introduces the notion that text encodings can be
automatically detected on a best-effort basis. Currently, the only
support for this is checking for a UTF-16 BOM. In all other cases, a
text encoding of `auto` (the default) implies a UTF-8 or ASCII
compatible source encoding. When a text encoding is otherwise specified,
it is unconditionally used for all files searched.

Since ripgrep's regex engine is fundamentally built on top of UTF-8,
this feature works by transcoding the files to be searched from their
source encoding to UTF-8. This transcoding only happens when:

1. `auto` is specified and a non-UTF-8 encoding is detected.
2. A specific encoding is given by end users (including UTF-8).

When transcoding occurs, errors are handled by automatically inserting
the Unicode replacement character. In this case, ripgrep's output is
guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they
are printed).

In all other cases, the source text is searched directly, which implies
an assumption that it is at least ASCII compatible, but where UTF-8 is
most useful. In this scenario, encoding errors are not detected, and
ripgrep's output will match the input exactly, byte-for-byte.

This design may not be optimal in all cases, but it has some advantages:

1. The happy path ("UTF-8 everywhere") remains happy. I have not been
   able to witness any performance regressions.
2. In the non-UTF-8 path, implementation complexity is kept relatively
   low. The cost here is transcoding itself. A potentially superior
   implementation might build decoding of any encoding into the regex
   engine itself. In particular, the fundamental problem with
   transcoding everything first is that literal optimizations are nearly
   negated.

Future work should entail improving the user experience. For example, we
might want to auto-detect more text encodings. A more elaborate UX might
permit end users to specify multiple text encodings, although this seems
hard to pull off in an ergonomic way.

Fixes #1
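
The transcoding rules in the message above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from this commit: `SourceEncoding`, `sniff_utf16_bom`, `decode_utf16_lossy` and `to_searchable` are hypothetical names, only UTF-8 and UTF-16 are modeled, and the real implementation delegates decoding to `encoding_rs`.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the `Option<&'static Encoding>` option in the
/// diff; `None` plays the role of `auto`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SourceEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Best-effort detection: only a UTF-16 BOM is recognized.
pub fn sniff_utf16_bom(buf: &[u8]) -> Option<SourceEncoding> {
    match buf {
        [0xFF, 0xFE, ..] => Some(SourceEncoding::Utf16Le),
        [0xFE, 0xFF, ..] => Some(SourceEncoding::Utf16Be),
        _ => None,
    }
}

/// Decode UTF-16 bytes, replacing malformed input with U+FFFD.
fn decode_utf16_lossy(bytes: &[u8], little_endian: bool) -> String {
    let units = bytes.chunks(2).map(|pair| match pair {
        [a, b] if little_endian => u16::from_le_bytes([*a, *b]),
        [a, b] => u16::from_be_bytes([*a, *b]),
        // A trailing odd byte is not a full code unit. Map it to a lone
        // leading surrogate, which always decodes to U+FFFD at end of input.
        _ => 0xD800,
    });
    char::decode_utf16(units)
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

/// Return the bytes that would be handed to the regex engine.
///
/// An explicit encoding always wins. In `auto` mode (`None`) the buffer is
/// transcoded only if a UTF-16 BOM is found; otherwise it is borrowed as-is,
/// so encoding errors go undetected and output matches input byte-for-byte.
pub fn to_searchable(buf: &[u8], requested: Option<SourceEncoding>) -> Cow<'_, [u8]> {
    match requested.or_else(|| sniff_utf16_bom(buf)) {
        None => Cow::Borrowed(buf),
        Some(SourceEncoding::Utf8) => {
            Cow::Owned(String::from_utf8_lossy(buf).into_owned().into_bytes())
        }
        Some(SourceEncoding::Utf16Le) => {
            let body = buf.strip_prefix(b"\xFF\xFE").unwrap_or(buf);
            Cow::Owned(decode_utf16_lossy(body, true).into_bytes())
        }
        Some(SourceEncoding::Utf16Be) => {
            let body = buf.strip_prefix(b"\xFE\xFF").unwrap_or(buf);
            Cow::Owned(decode_utf16_lossy(body, false).into_bytes())
        }
    }
}
```

Returning a `Cow` makes the two paths visible in the type: the UTF-8 happy path borrows the input and allocates nothing, while any transcoded result is an owned buffer that is valid UTF-8.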
author: Andrew Gallant <jamslam@gmail.com> 2017-03-08 20:22:48 -0500
committer: Andrew Gallant <jamslam@gmail.com> 2017-03-12 19:54:48 -0400
commit: 8bbe58d623db78a32b04eabff9a69667ad23ff7b (patch)
tree: f37d62299c50366c0eb8e619cc043f9feb4ba573 /src/worker.rs
parent: b3fd0df94bbf928ea00cf9a10bd007f4b236d85b (diff)
1 file changed, 25 insertions, 3 deletions
diff --git a/src/worker.rs b/src/worker.rs
index 60dde722..51b7f64c 100644
--- a/src/worker.rs
+++ b/src/worker.rs
@@ -2,11 +2,13 @@ use std::fs::File;
 use std::io;
 use std::path::Path;
 
+use encoding_rs::Encoding;
 use grep::Grep;
 use ignore::DirEntry;
 use memmap::{Mmap, Protection};
 use termcolor::WriteColor;
 
+use decoder::DecodeReader;
 use pathutil::strip_prefix;
 use printer::Printer;
 use search_buffer::BufferSearcher;
@@ -27,6 +29,7 @@ pub struct WorkerBuilder {
 #[derive(Clone, Debug)]
 struct Options {
     mmap: bool,
+    encoding: Option<&'static Encoding>,
     after_context: usize,
     before_context: usize,
     count: bool,
@@ -45,6 +48,7 @@ impl Default for Options {
     fn default() -> Options {
         Options {
             mmap: false,
+            encoding: None,
             after_context: 0,
             before_context: 0,
             count: false,
@@ -80,6 +84,7 @@ impl WorkerBuilder {
         Worker {
             grep: self.grep,
             inpbuf: inpbuf,
+            decodebuf: vec![0; 8 * (1<<10)],
             opts: self.opts,
         }
     }
@@ -106,6 +111,15 @@ impl WorkerBuilder {
         self
     }
 
+    /// Set the encoding to use to read each file.
+    ///
+    /// If the encoding is `None` (the default), then the encoding is
+    /// automatically detected on a best-effort per-file basis.
+    pub fn encoding(mut self, enc: Option<&'static Encoding>) -> Self {
+        self.opts.encoding = enc;
+        self
+    }
+
     /// If enabled, searching will print the path instead of each match.
     ///
     /// Disabled by default.
@@ -181,8 +195,9 @@ impl WorkerBuilder {
 /// Worker is responsible for executing searches on file paths, while choosing
 /// streaming search or memory map search as appropriate.
 pub struct Worker {
-    inpbuf: InputBuffer,
     grep: Grep,
+    inpbuf: InputBuffer,
+    decodebuf: Vec<u8>,
     opts: Options,
 }
 
@@ -241,6 +256,8 @@ impl Worker {
         path: &Path,
         rdr: R,
     ) -> Result<u64> {
+        let rdr = DecodeReader::new(
+            rdr, &mut self.decodebuf, self.opts.encoding);
         let searcher = Searcher::new(
             &mut self.inpbuf, printer, &self.grep, path, rdr);
         searcher
@@ -274,8 +291,13 @@ impl Worker {
             return self.search(printer, path, file);
         }
         let mmap = try!(Mmap::open(file, Protection::Read));
-        let searcher = BufferSearcher::new(
-            printer, &self.grep, path, unsafe { mmap.as_slice() });
+        let buf = unsafe { mmap.as_slice() };
+        if buf.len() >= 3 && Encoding::for_bom(buf).is_some() {
+            // If we have a UTF-16 bom in our memory map, then we need to fall
+            // back to the stream reader, which will do transcoding.
+            return self.search(printer, path, file);
+        }
+        let searcher = BufferSearcher::new(printer, &self.grep, path, buf);
         Ok(searcher
             .count(self.opts.count)
             .files_with_matches(self.opts.files_with_matches)
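
The hunk at `@@ -241,6 +256,8 @@` wraps the input reader in a `DecodeReader` that borrows the worker's 8 KiB `decodebuf`. The real type lives in `src/decoder.rs`, which this diff does not show, so the following is only a sketch of the general shape: a `Read` adapter that pulls raw bytes into a borrowed scratch buffer and emits UTF-8. The name `Latin1Reader` is hypothetical, and the sketch assumes a strict ISO-8859-1 mapping (byte value equals code point), chosen because it needs no decoder state; `encoding_rs` follows the Encoding Standard, where the `latin1` label refers to windows-1252.

```rust
use std::io::{self, Read};

/// Hypothetical, simplified analogue of a decoding reader: transcodes
/// ISO-8859-1 bytes from `inner` to UTF-8 on the fly.
pub struct Latin1Reader<'a, R> {
    inner: R,
    scratch: &'a mut [u8], // raw bytes read from `inner`; must be non-empty
    pos: usize,            // next unread byte in `scratch`
    len: usize,            // number of valid bytes in `scratch`
    pending: Option<u8>,   // second byte of a 2-byte sequence not yet emitted
}

impl<'a, R: Read> Latin1Reader<'a, R> {
    pub fn new(inner: R, scratch: &'a mut [u8]) -> Self {
        Latin1Reader { inner, scratch, pos: 0, len: 0, pending: None }
    }
}

impl<'a, R: Read> Read for Latin1Reader<'a, R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let mut written = 0;
        while written < out.len() {
            // Finish a code point that was split by a full output buffer.
            if let Some(b) = self.pending.take() {
                out[written] = b;
                written += 1;
                continue;
            }
            if self.pos == self.len {
                // Return what we already have rather than read again.
                if written > 0 {
                    break;
                }
                self.len = self.inner.read(&mut *self.scratch)?;
                self.pos = 0;
                if self.len == 0 {
                    break; // end of input
                }
            }
            let b = self.scratch[self.pos];
            self.pos += 1;
            if b < 0x80 {
                // ASCII passes through unchanged.
                out[written] = b;
                written += 1;
            } else {
                // U+0080..=U+00FF encode as two UTF-8 bytes.
                out[written] = 0xC0 | (b >> 6);
                written += 1;
                self.pending = Some(0x80 | (b & 0x3F));
            }
        }
        Ok(written)
    }
}
```

Borrowing the scratch buffer instead of allocating one per reader is what lets a worker reuse a single buffer across every file it searches, which appears to be the purpose of the new `decodebuf` field.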