doc: add a section about --pre to the GUIDE

Fixes #1252
author: Andrew Gallant <jamslam@gmail.com> 2020-05-08 11:44:00 -0400
committer: Andrew Gallant <jamslam@gmail.com> 2020-05-08 23:24:40 -0400
commit: 0eb2501b6e89bf83360eb70afbf1b5d221c92142 (patch)
tree: c4c592e8473c31243e0178d01d11fde604212c72
parent: 184c15882ed1ef9939cea62ac1ad204a93bee189 (diff)
3 files changed, 213 insertions, 3 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4ce6ae7d..bd860458 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,8 @@ TBD
 ===
 Bug fixes:
 
+* [BUG #1252](https://github.com/BurntSushi/ripgrep/issues/1252):
+  Add a section on the `--pre` flag to the GUIDE.
 * [BUG #1339](https://github.com/BurntSushi/ripgrep/issues/1339):
   Improve error message when a pattern with invalid UTF-8 is provided.
 * [BUG #1524](https://github.com/BurntSushi/ripgrep/issues/1524):
diff --git a/GUIDE.md b/GUIDE.md
index 22ad183d..3e3bd583 100644
--- a/GUIDE.md
+++ b/GUIDE.md
@@ -19,6 +19,7 @@ translatable to any command line shell environment.
 * [Configuration file](#configuration-file)
 * [File encoding](#file-encoding)
 * [Binary data](#binary-data)
+* [Preprocessor](#preprocessor)
 * [Common options](#common-options)
 
 
@@ -767,6 +768,212 @@ via the `--no-mmap` flag. (The cost will be a small performance regression when
 searching very large files on some platforms.)
 
 
+### Preprocessor
+
+In ripgrep, a preprocessor is any type of command that can be run to transform
+the input of every file before ripgrep searches it. This makes it possible to
+search virtually any kind of content that can be automatically converted to
+text without having to teach ripgrep how to read said content.
+
+One common example is searching PDFs. PDFs are first and foremost meant to be
+displayed to users. But PDFs often have text streams in them that can be useful
+to search. In our case, we want to search Bruce Watson's excellent
+dissertation,
+[Taxonomies and Toolkits of Regular Language Algorithms](https://burntsushi.net/stuff/1995-watson.pdf).
+After downloading it, let's try searching it:
+
+```
+$ rg 'The Commentz-Walter algorithm' 1995-watson.pdf
+$
+```
+
+Surely, a dissertation on regular language algorithms would mention
+Commentz-Walter. Indeed it does, but our search isn't picking it up because
+PDFs are a binary format, and the text shown in the PDF may not be encoded as
+simple contiguous UTF-8. In fact, even passing the `-a/--text` flag to ripgrep
+will not make our search work.
+
+One way to fix this is to convert the PDF to plain text first. This won't work
+well for all PDFs, but does great in a lot of cases. (Note that the tool we
+use, `pdftotext`, is part of the [poppler](https://poppler.freedesktop.org)
+PDF rendering library.)
+
+```
+$ pdftotext 1995-watson.pdf > 1995-watson.txt
+$ rg 'The Commentz-Walter algorithm' 1995-watson.txt
+316:The Commentz-Walter algorithms : : : : : : : : : : : : : : :
+7165:4.4 The Commentz-Walter algorithms
+10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm
+17218:The Commentz-Walter algorithm (and its variants) displayed more interesting behaviour,
+17249:Aho-Corasick algorithms are used extensively. The Commentz-Walter algorithms are used
+17297: The Commentz-Walter algorithms (CW). In all versions of the CW algorithms, a common program skeleton is used with di erent shift functions. The CW algorithms are
+```
+
+But having to explicitly convert every file can be a pain, especially when you
+have a directory full of PDF files. Instead, we can use ripgrep's preprocessor
+feature to search the PDF. ripgrep's `--pre` flag works by taking a single
+command name and then executing that command for every file that it searches.
+ripgrep passes the file path as the first and only argument to the command and
+also sends the contents of the file to stdin. So let's write a simple shell
+script that wraps `pdftotext` in a way that conforms to this interface:
+
+```
+$ cat preprocess
+#!/bin/sh
+
+exec pdftotext - -
+```
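+
+Before handing the script to ripgrep, it can help to exercise this calling
+convention by hand. What follows is only a sketch: the `upcase` script is a
+hypothetical stand-in that upper-cases its input (so it needs neither
+`pdftotext` nor a PDF), and the temporary directory and `sample.txt` are made
+up for illustration. It also assumes that `mktemp -d` is available:
+
+```sh
+#!/bin/sh
+# Emulate, by hand, the invocation described above: the file path is the
+# only argument and the file contents arrive on stdin.
+set -eu
+
+dir=$(mktemp -d)
+trap 'rm -rf "$dir"' EXIT
+
+# Hypothetical stand-in preprocessor: it ignores "$1" and upper-cases stdin
+# so that the transformation is easy to see.
+cat > "$dir/upcase" <<'EOF'
+#!/bin/sh
+exec tr '[:lower:]' '[:upper:]'
+EOF
+chmod +x "$dir/upcase"
+
+printf 'hello world\n' > "$dir/sample.txt"
+
+# Path as the first argument, contents on stdin.
+out=$("$dir/upcase" "$dir/sample.txt" < "$dir/sample.txt")
+echo "$out"
+```
+
+Running this prints `HELLO WORLD`. Whatever the preprocessor writes to stdout
+is what ripgrep ends up searching in place of the original file contents.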
+
+With `preprocess` in the same directory as `1995-watson.pdf`, we can now use it
+to search the PDF:
+
+```
+$ rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf
+316:The Commentz-Walter algorithms : : : : : : : : : : : : : : :
+7165:4.4 The Commentz-Walter algorithms
+10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm
+17218:The Commentz-Walter algorithm (and its variants) displayed more interesting behaviour,
+17249:Aho-Corasick algorithms are used extensively. The Commentz-Walter algorithms are used
+17297: The Commentz-Walter algorithms (CW). In all versions of the CW algorithms, a common program skeleton is used with di erent shift functions. The CW algorithms are
+```
+
+Note that `preprocess` must be resolvable to a command that ripgrep can
+execute, which on Unix means the script needs its executable bit set. The
+simplest way to make it resolvable is to put your preprocessor command in a
+directory that is in your `PATH` (or equivalent), or otherwise use an
+absolute path.
+
+As a bonus, this turns out to be quite a bit faster than other specialized PDF
+grepping tools:
+
+```
+$ time rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf -c
+6
+
+real    0.697
+user    0.684
+sys     0.007
+maxmem  16 MB
+faults  0
+
+$ time pdfgrep 'The Commentz-Walter algorithm' 1995-watson.pdf -c
+6
+
+real    1.336
+user    1.310
+sys     0.023
+maxmem  16 MB
+faults  0
+```
+
+If you wind up needing to search a lot of PDFs, then ripgrep's parallelism can
+make the speed difference even greater.
+
+#### A more robust preprocessor
+
+One of the problems with the aforementioned preprocessor is that it will fail
+if you try to search a file that isn't a PDF:
+
+```
+$ echo foo > not-a-pdf
+$ rg --pre ./preprocess 'The Commentz-Walter algorithm' not-a-pdf
+not-a-pdf: preprocessor command failed: '"./preprocess" "not-a-pdf"':
+-------------------------------------------------------------------------------
+Syntax Warning: May not be a PDF file (continuing anyway)
+Syntax Error: Couldn't find trailer dictionary
+Syntax Error: Couldn't find trailer dictionary
+Syntax Error: Couldn't read xref table
+```
+
+To fix this, we can make our preprocessor script a bit more robust by only
+running `pdftotext` when we think the input is a non-empty PDF:
+
+```
+$ cat preprocess
+#!/bin/sh
+
+case "$1" in
+*.pdf)
+  # The -s flag ensures that the file is non-empty.
+  if [ -s "$1" ]; then
+    exec pdftotext - -
+  else
+    exec cat
+  fi
+  ;;
+*)
+  exec cat
+  ;;
+esac
+```
+
+We can even extend our preprocessor to search other kinds of files. We can't
+always tell the file type from the file name, so we can use the `file` utility
+to "sniff" the type of the file based on its contents:
+
+```
+$ cat preprocess
+#!/bin/sh
+
+case "$1" in
+*.pdf)
+  # The -s flag ensures that the file is non-empty.
+  if [ -s "$1" ]; then
+    exec pdftotext - -
+  else
+    exec cat
+  fi
+  ;;
+*)
+  case $(file "$1") in
+  *Zstandard*)
+    exec pzstd -cdq
+    ;;
+  *)
+    exec cat
+    ;;
+  esac
+  ;;
+esac
+```
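+
+The branching is the part of such a script that is easiest to get wrong, so
+one way to check it without running ripgrep is to factor the decision into a
+function that only prints the command it would run. This is a minimal sketch
+and not part of ripgrep: the `pick_filter` name is hypothetical, and it covers
+only the file name and size checks, leaving out the `file`-based sniffing:
+
+```sh
+#!/bin/sh
+# pick_filter PATH: print the command that the PDF branch of the
+# preprocessor would exec for PATH. Nothing is actually run.
+pick_filter() {
+  case "$1" in
+  *.pdf)
+    # Same -s check as above: only non-empty PDFs go to pdftotext.
+    if [ -s "$1" ]; then
+      echo 'pdftotext - -'
+    else
+      echo 'cat'
+    fi
+    ;;
+  *)
+    echo 'cat'
+    ;;
+  esac
+}
+```
+
+After sourcing this, `pick_filter not-a-pdf` prints `cat`, and so does
+`pick_filter` on an empty file whose name ends in `.pdf`.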
+
+#### Reducing preprocessor overhead
+
+There is one more problem with the above approach: it requires running a
+preprocessor for every single file that ripgrep searches. If every file needs
+a preprocessor, then this is OK. But if most don't, then this can substantially
+slow down searches because of the overhead of launching new processes. You
+can avoid this by telling ripgrep to only invoke the preprocessor when the file
+path matches a glob. For example, consider the performance difference even when
+searching a repository as small as ripgrep's:
+
+```
+$ time rg --pre pre-rg 'fn is_empty' -c
+crates/globset/src/lib.rs:1
+crates/matcher/src/lib.rs:2
+crates/ignore/src/overrides.rs:1
+crates/ignore/src/gitignore.rs:1
+crates/ignore/src/types.rs:1
+
+real    0.138
+user    0.485
+sys     0.209
+maxmem  7 MB
+faults  0
+
+$ time rg --pre pre-rg --pre-glob '*.pdf' 'fn is_empty' -c
+crates/globset/src/lib.rs:1
+crates/ignore/src/types.rs:1
+crates/ignore/src/gitignore.rs:1
+crates/ignore/src/overrides.rs:1
+crates/matcher/src/lib.rs:2
+
+real    0.008
+user    0.010
+sys     0.002
+maxmem  7 MB
+faults  0
+```
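+
+To avoid typing both flags on every invocation, they can also be kept in
+ripgrep's configuration file, which holds one argument per line. A minimal
+sketch, assuming the `pre-rg` command from above is in your `PATH`; the
+config file location used here is hypothetical, so pick whatever path suits
+your setup:
+
+```sh
+#!/bin/sh
+set -eu
+
+# Hypothetical location for the example config file.
+config="${TMPDIR:-/tmp}/ripgreprc-pre-example"
+
+# One argument per line; lines starting with '#' are comments.
+cat > "$config" <<'EOF'
+# Only run the preprocessor on PDF files.
+--pre=pre-rg
+--pre-glob=*.pdf
+EOF
+
+# ripgrep only reads a config file when this variable points at one.
+export RIPGREP_CONFIG_PATH="$config"
+```
+
+With this in place, a plain `rg 'fn is_empty' -c` in the same shell session
+should behave like the second command above.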
+
+
 ### Common options
 
 ripgrep has a lot of flags. Too many to keep in your head at once. This section
diff --git a/README.md b/README.md
index b58ca90c..b13e27ed 100644
--- a/README.md
+++ b/README.md
@@ -116,9 +116,10 @@ times are unaffected by the presence or absence of `-n`.
   specifically specified with the `-E/--encoding` flag.)
 * ripgrep supports searching files compressed in a common format (brotli,
   bzip2, gzip, lz4, lzma, xz, or zstandard) with the `-z/--search-zip` flag.
-* ripgrep supports arbitrary input preprocessing filters which could be PDF
-  text extraction, less supported decompression, decrypting, automatic encoding
-  detection and so on.
+* ripgrep supports
+  [arbitrary input preprocessing filters](GUIDE.md#preprocessor)
+  which could be PDF text extraction, less supported decompression, decrypting,
+  automatic encoding detection and so on.
 
 In other words, use ripgrep if you like speed, filtering by default, fewer
 bugs and Unicode support.