author | lesnyrumcajs <lesny.rumcajs@gmail.com> | 2019-03-04 17:18:45 +0100 |
---|---|---|
committer | Andrew Gallant <jamslam@gmail.com> | 2019-04-06 10:35:08 -0400 |
commit | 5962abc4655a0f07ece6fc6bd45142e8ee1cab0c (patch) | |
tree | 56b1f051f3e803cd24aa2d980c3edcf765756bda /GUIDE.md | |
parent | 1604a18db3d896514e1d536781810642de4b31c1 (diff) |
searcher: add option to disable BOM sniffing
This commit adds a new encoding feature where the -E/--encoding flag
will now accept a value of 'none'. When given this value, all encoding
related machinery is disabled and ripgrep will search the raw bytes of
the file, including the BOM if it's present.
Closes #1207, Closes #1208
Diffstat (limited to 'GUIDE.md')
-rw-r--r-- | GUIDE.md | 32 |
1 files changed, 26 insertions, 6 deletions
@@ -603,7 +603,7 @@ topic, but we can try to summarize its relevancy to ripgrep:
 * Files are generally just a bundle of bytes. There is no reliable way to know
   their encoding.
 * Either the encoding of the pattern must match the encoding of the files being
-  searched, or a form of transcoding must be performed converts either the
+  searched, or a form of transcoding must be performed that converts either the
   pattern or the file to the same encoding as the other.
 * ripgrep tends to work best on plain text files, and among plain text files,
   the most popular encodings likely consist of ASCII, latin1 or UTF-8. As
@@ -626,12 +626,15 @@ given, which is the default:
   they correspond to a UTF-16 BOM, then ripgrep will transcode the contents of
   the file from UTF-16 to UTF-8, and then execute the search on the transcoded
   version of the file. (This incurs a performance penalty since transcoding
-  is slower than regex searching.)
+  is slower than regex searching.) If the file contains invalid UTF-16, then
+  the Unicode replacement codepoint is substituted in place of invalid code
+  units.
 * To handle other cases, ripgrep provides a `-E/--encoding` flag, which permits
   you to specify an encoding from the
   [Encoding Standard](https://encoding.spec.whatwg.org/#concept-encoding-get).
-  ripgrep will assume *all* files searched are the encoding specified and
-  will perform a transcoding step just like in the UTF-16 case described above.
+  ripgrep will assume *all* files searched are the encoding specified (unless
+  the file has a BOM) and will perform a transcoding step just like in the
+  UTF-16 case described above.
 
 By default, ripgrep will not require its input be valid UTF-8. That is, ripgrep
 can and will search arbitrary bytes. The key here is that if you're searching
@@ -641,9 +644,26 @@ pattern won't find anything. With all that said, this mode of operation is
 important, because it lets you find ASCII or UTF-8 *within* files that are
 otherwise arbitrary bytes.
 
+As a special case, the `-E/--encoding` flag supports the value `none`, which
+will completely disable all encoding related logic, including BOM sniffing.
+When `-E/--encoding` is set to `none`, ripgrep will search the raw bytes of
+the underlying file with no transcoding step. For example, here's how you might
+search the raw UTF-16 encoding of the string `Шерлок`:
+
+```
+$ rg '(?-u)\(\x045\x04@\x04;\x04>\x04:\x04' -E none -a some-utf16-file
+```
+
+Of course, that's just an example meant to show how one can drop down into
+raw bytes. Namely, the simpler command works as you might expect automatically:
+
+```
+$ rg 'Шерлок' some-utf16-file
+```
+
 Finally, it is possible to disable ripgrep's Unicode support from within the
-pattern regular expression. For example, let's say you wanted `.` to match any
-byte rather than any Unicode codepoint. (You might want this while searching a
+regular expression. For example, let's say you wanted `.` to match any byte
+rather than any Unicode codepoint. (You might want this while searching a
 binary file, since `.` by default will not match invalid UTF-8.) You could do
 this by disabling Unicode via a regular expression flag:
 
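The raw-byte pattern in the new GUIDE.md text can look opaque, so here is a rough Python sketch of where it comes from. It is not ripgrep code: the function names (`raw_utf16le_pattern`, `search_raw_utf16le`, `search_transcoded`) are hypothetical, and the sketch assumes the UTF-16 file is little-endian. The first function spells out the UTF-16LE bytes of a string as a byte-oriented regex; the other two loosely model the difference between `-E none` (no BOM sniffing, no transcoding) and the default behaviour (sniff a UTF-16 BOM, transcode with replacement characters, then search).

```python
# Rough model only; these names are hypothetical, not part of ripgrep.
UTF16_LE_BOM = b"\xff\xfe"
UTF16_BE_BOM = b"\xfe\xff"

# Assumption: characters that are escaped when they must match literally.
REGEX_META = set("\\.+*?()|[]{}^$#&-~")


def raw_utf16le_pattern(text: str) -> str:
    """Spell out the UTF-16LE bytes of `text` as a regex over raw bytes."""
    parts = []
    for byte in text.encode("utf-16-le"):
        ch = chr(byte)
        if 0x21 <= byte <= 0x7E:
            # Printable ASCII stays readable; metacharacters get a backslash.
            parts.append("\\" + ch if ch in REGEX_META else ch)
        else:
            # Everything else is written as a two-digit hex escape.
            parts.append(f"\\x{byte:02x}")
    # (?-u) disables Unicode mode so the escapes match single bytes.
    return "(?-u)" + "".join(parts)


def search_raw_utf16le(data: bytes, text: str) -> bool:
    """Model of `-E none`: compare raw bytes, with no sniffing or transcoding."""
    return text.encode("utf-16-le") in data


def search_transcoded(data: bytes, text: str) -> bool:
    """Model of the default: sniff a UTF-16 BOM, transcode lossily, search."""
    if data.startswith(UTF16_LE_BOM):
        haystack = data[2:].decode("utf-16-le", errors="replace")
    elif data.startswith(UTF16_BE_BOM):
        haystack = data[2:].decode("utf-16-be", errors="replace")
    else:
        # Simplification: ripgrep searches the bytes directly in this case.
        haystack = data.decode("utf-8", errors="replace")
    return text in haystack
```

Under these assumptions, `raw_utf16le_pattern("Шерлок")` yields the same pattern as the `rg ... -E none -a` command in the diff, because each of the six letters lies in the U+04xx block and so encodes as one printable ASCII byte followed by `\x04`.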