searcher: add option to disable BOM sniffing

This commit adds a new encoding feature where the -E/--encoding flag will now accept a value of 'none'. When given this value, all encoding related machinery is disabled and ripgrep will search the raw bytes of the file, including the BOM if it's present.

Closes #1207, Closes #1208
author: lesnyrumcajs <lesny.rumcajs@gmail.com> 2019-03-04 17:18:45 +0100
committer: Andrew Gallant <jamslam@gmail.com> 2019-04-06 10:35:08 -0400
commit: 5962abc4655a0f07ece6fc6bd45142e8ee1cab0c (patch)
tree: 56b1f051f3e803cd24aa2d980c3edcf765756bda /GUIDE.md
parent: 1604a18db3d896514e1d536781810642de4b31c1 (diff)
1 files changed, 26 insertions, 6 deletions
diff --git a/GUIDE.md b/GUIDE.md
index 0094a7b4..8022f292 100644
--- a/GUIDE.md
+++ b/GUIDE.md
@@ -603,7 +603,7 @@ topic, but we can try to summarize its relevancy to ripgrep:
 * Files are generally just a bundle of bytes. There is no reliable way to know
   their encoding.
 * Either the encoding of the pattern must match the encoding of the files being
-  searched, or a form of transcoding must be performed converts either the
+  searched, or a form of transcoding must be performed that converts either the
   pattern or the file to the same encoding as the other.
 * ripgrep tends to work best on plain text files, and among plain text files,
   the most popular encodings likely consist of ASCII, latin1 or UTF-8. As
@@ -626,12 +626,15 @@ given, which is the default:
   they correspond to a UTF-16 BOM, then ripgrep will transcode the contents of
   the file from UTF-16 to UTF-8, and then execute the search on the transcoded
   version of the file. (This incurs a performance penalty since transcoding
-  is slower than regex searching.)
+  is slower than regex searching.) If the file contains invalid UTF-16, then
+  the Unicode replacement codepoint is substituted in place of invalid code
+  units.
 * To handle other cases, ripgrep provides a `-E/--encoding` flag, which permits
   you to specify an encoding from the
   [Encoding Standard](https://encoding.spec.whatwg.org/#concept-encoding-get).
-  ripgrep will assume *all* files searched are the encoding specified and
-  will perform a transcoding step just like in the UTF-16 case described above.
+  ripgrep will assume *all* files searched are the encoding specified (unless
+  the file has a BOM) and will perform a transcoding step just like in the
+  UTF-16 case described above.
 
 By default, ripgrep will not require its input be valid UTF-8. That is, ripgrep
 can and will search arbitrary bytes. The key here is that if you're searching
@@ -641,9 +644,26 @@ pattern won't find anything. With all that said, this mode of operation is
 important, because it lets you find ASCII or UTF-8 *within* files that are
 otherwise arbitrary bytes.
 
+As a special case, the `-E/--encoding` flag supports the value `none`, which
+will completely disable all encoding related logic, including BOM sniffing.
+When `-E/--encoding` is set to `none`, ripgrep will search the raw bytes of
+the underlying file with no transcoding step. For example, here's how you might
+search the raw UTF-16 encoding of the string `Шерлок`:
+
+```
+$ rg '(?-u)\(\x045\x04@\x04;\x04>\x04:\x04' -E none -a some-utf16-file
+```
+
+Of course, that's just an example meant to show how one can drop down into
+raw bytes. Namely, the simpler command works as you might expect automatically:
+
+```
+$ rg 'Шерлок' some-utf16-file
+```
+
 Finally, it is possible to disable ripgrep's Unicode support from within the
-pattern regular expression. For example, let's say you wanted `.` to match any
-byte rather than any Unicode codepoint. (You might want this while searching a
+regular expression. For example, let's say you wanted `.` to match any byte
+rather than any Unicode codepoint. (You might want this while searching a
 binary file, since `.` by default will not match invalid UTF-8.) You could do
 this by disabling Unicode via a regular expression flag:
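
Note on the `-E none` example in the patch: the byte pattern passed to `rg` is the UTF-16 little-endian encoding of `Шерлок`, written with printable bytes as literals and the rest as `\x..` escapes. The following is a minimal Python sketch of that correspondence, not part of the commit. The helper name `to_pattern` and its list of escaped metacharacters are assumptions made for illustration only.

```python
# Sketch: derive the raw-byte pattern used with `-E none` from the string itself.
needle = "Шерлок"
raw = needle.encode("utf-16-le")  # UTF-16 little-endian code units, no BOM


def to_pattern(data: bytes) -> str:
    """Render bytes as a regex: printable ASCII literally, the rest as \\xNN.

    Hypothetical helper; the metacharacter set below is an assumption.
    """
    meta = set(r"\.+*?()|[]{}^$")
    out = []
    for b in data:
        ch = chr(b)
        if 0x20 < b < 0x7F:
            # Printable ASCII byte: escape it only if it is a regex metacharacter.
            out.append("\\" + ch if ch in meta else ch)
        else:
            # Non-printable byte: emit a hex escape.
            out.append(f"\\x{b:02x}")
    return "".join(out)


# (?-u) disables Unicode mode so that \x04 means the byte 0x04.
pattern = "(?-u)" + to_pattern(raw)
print(pattern)

# A file that starts with a UTF-16LE BOM still contains those raw bytes,
# which is what a search with all encoding logic disabled would see.
bom_file = b"\xff\xfe" + "Meet Шерлок".encode("utf-16-le")
assert raw in bom_file
```

Each Cyrillic letter here lies in the U+0400 block, so every code unit is one printable low byte followed by `0x04`, which is why the pattern alternates between an ASCII character and `\x04`.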