author Andrew Gallant <jamslam@gmail.com> 2020-05-08 11:44:00 -0400
committer Andrew Gallant <jamslam@gmail.com> 2020-05-08 23:24:40 -0400
commit 0eb2501b6e89bf83360eb70afbf1b5d221c92142
tree c4c592e8473c31243e0178d01d11fde604212c72
parent 184c15882ed1ef9939cea62ac1ad204a93bee189
doc: add a section about --pre to the GUIDE
Fixes #1252
-rw-r--r-- CHANGELOG.md 2
-rw-r--r-- GUIDE.md 207
-rw-r--r-- README.md 7
3 files changed, 213 insertions, 3 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4ce6ae7d..bd860458 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,8 @@ TBD
===
Bug fixes:
+* [BUG #1252](https://github.com/BurntSushi/ripgrep/issues/1252):
+ Add a section on the `--pre` flag to the GUIDE.
* [BUG #1339](https://github.com/BurntSushi/ripgrep/issues/1339):
Improve error message when a pattern with invalid UTF-8 is provided.
* [BUG #1524](https://github.com/BurntSushi/ripgrep/issues/1524):
diff --git a/GUIDE.md b/GUIDE.md
index 22ad183d..3e3bd583 100644
--- a/GUIDE.md
+++ b/GUIDE.md
@@ -19,6 +19,7 @@ translatable to any command line shell environment.
* [Configuration file](#configuration-file)
* [File encoding](#file-encoding)
* [Binary data](#binary-data)
+* [Preprocessor](#preprocessor)
* [Common options](#common-options)
@@ -767,6 +768,212 @@ via the `--no-mmap` flag. (The cost will be a small performance regression when
searching very large files on some platforms.)
+### Preprocessor
+
+In ripgrep, a preprocessor is any type of command that can be run to transform
+the input of every file before ripgrep searches it. This makes it possible to
+search virtually any kind of content that can be automatically converted to
+text without having to teach ripgrep how to read said content.
+
+One common example is searching PDFs. PDFs are first and foremost meant to be
+displayed to users. But PDFs often have text streams in them that can be useful
+to search. In our case, we want to search Bruce Watson's excellent
+dissertation,
+[Taxonomies and Toolkits of Regular Language Algorithms](https://burntsushi.net/stuff/1995-watson.pdf).
+After downloading it, let's try searching it:
+
+```
+$ rg 'The Commentz-Walter algorithm' 1995-watson.pdf
+$
+```
+
+Surely, a dissertation on regular language algorithms would mention
+Commentz-Walter. Indeed it does, but our search isn't picking it up because
+PDFs are a binary format, and the text shown in the PDF may not be encoded as
+simple contiguous UTF-8. In particular, even passing the `-a/--text` flag to
+ripgrep will not make our search work.
+
+One way to fix this is to convert the PDF to plain text first. This won't work
+well for all PDFs, but it works well in many cases. (Note that the tool we
+use, `pdftotext`, is part of the [poppler](https://poppler.freedesktop.org)
+PDF rendering library.)
+
+```
+$ pdftotext 1995-watson.pdf > 1995-watson.txt
+$ rg 'The Commentz-Walter algorithm' 1995-watson.txt
+316:The Commentz-Walter algorithms : : : : : : : : : : : : : : :
+7165:4.4 The Commentz-Walter algorithms
+10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm
+17218:The Commentz-Walter algorithm (and its variants) displayed more interesting behaviour,
+17249:Aho-Corasick algorithms are used extensively. The Commentz-Walter algorithms are used
+17297: The Commentz-Walter algorithms (CW). In all versions of the CW algorithms, a common program skeleton is used with di erent shift functions. The CW algorithms are
+```
+
+But having to explicitly convert every file can be a pain, especially when you
+have a directory full of PDF files. Instead, we can use ripgrep's preprocessor
+feature to search the PDF. ripgrep's `--pre` flag works by taking a single
+command name and then executing that command for every file that it searches.
+ripgrep passes the file path as the first and only argument to the command and
+also sends the contents of the file to stdin. So let's write a simple shell
+script that wraps `pdftotext` in a way that conforms to this interface:
+
+```
+$ cat preprocess
+#!/bin/sh
+
+exec pdftotext - -
+```
+
+With `preprocess` in the same directory as `1995-watson.pdf`, we can now use it
+to search the PDF:
+
+```
+$ rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf
+316:The Commentz-Walter algorithms : : : : : : : : : : : : : : :
+7165:4.4 The Commentz-Walter algorithms
+10062:in input string S , we obtain the Boyer-Moore algorithm. The Commentz-Walter algorithm
+17218:The Commentz-Walter algorithm (and its variants) displayed more interesting behaviour,
+17249:Aho-Corasick algorithms are used extensively. The Commentz-Walter algorithms are used
+17297: The Commentz-Walter algorithms (CW). In all versions of the CW algorithms, a common program skeleton is used with di erent shift functions. The CW algorithms are
+```
+
+Note that `preprocess` must be resolvable to a command that ripgrep can run.
+The simplest way to do this is to put your preprocessor command in a directory
+that is in your `PATH` (or equivalent), or otherwise use an absolute path.
+
+As a bonus, in this case it is roughly twice as fast as `pdfgrep`, a
+specialized PDF grepping tool:
+
+```
+$ time rg --pre ./preprocess 'The Commentz-Walter algorithm' 1995-watson.pdf -c
+6
+
+real 0.697
+user 0.684
+sys 0.007
+maxmem 16 MB
+faults 0
+
+$ time pdfgrep 'The Commentz-Walter algorithm' 1995-watson.pdf -c
+6
+
+real 1.336
+user 1.310
+sys 0.023
+maxmem 16 MB
+faults 0
+```
+
+If you wind up needing to search a lot of PDFs, then ripgrep's parallelism can
+make the speed difference even greater.
+
+#### A more robust preprocessor
+
+One of the problems with the aforementioned preprocessor is that it will fail
+if you try to search a file that isn't a PDF:
+
+```
+$ echo foo > not-a-pdf
+$ rg --pre ./preprocess 'The Commentz-Walter algorithm' not-a-pdf
+not-a-pdf: preprocessor command failed: '"./preprocess" "not-a-pdf"':
+-------------------------------------------------------------------------------
+Syntax Warning: May not be a PDF file (continuing anyway)
+Syntax Error: Couldn't find trailer dictionary
+Syntax Error: Couldn't find trailer dictionary
+Syntax Error: Couldn't read xref table
+```
+
+To fix this, we can make our preprocessor script a bit more robust by only
+running `pdftotext` when we think the input is a non-empty PDF:
+
+```
+$ cat preprocess
+#!/bin/sh
+
+case "$1" in
+*.pdf)
+    # The -s flag ensures that the file is non-empty.
+    if [ -s "$1" ]; then
+        exec pdftotext - -
+    else
+        exec cat
+    fi
+    ;;
+*)
+    exec cat
+    ;;
+esac
+```
+
+We can even extend our preprocessor to search other kinds of files. We can't
+always tell the file type from the file name, so we can use the `file`
+utility to "sniff" the type of the file based on its contents:
+
+```
+$ cat preprocess
+#!/bin/sh
+
+case "$1" in
+*.pdf)
+    # The -s flag ensures that the file is non-empty.
+    if [ -s "$1" ]; then
+        exec pdftotext - -
+    else
+        exec cat
+    fi
+    ;;
+*)
+    case $(file "$1") in
+    *Zstandard*)
+        exec pzstd -cdq
+        ;;
+    *)
+        exec cat
+        ;;
+    esac
+    ;;
+esac
+```
+
+#### Reducing preprocessor overhead
+
+There is one more problem with the above approach: it requires running a
+preprocessor for every single file that ripgrep searches. If every file needs
+a preprocessor, then this is OK. But if most don't, then this can substantially
+slow down searches because of the overhead of launching new processes. You
+can avoid this by telling ripgrep to only invoke the preprocessor when the file
+path matches a glob. For example, consider the performance difference even when
+searching a repository as small as ripgrep's:
+
+```
+$ time rg --pre pre-rg 'fn is_empty' -c
+crates/globset/src/lib.rs:1
+crates/matcher/src/lib.rs:2
+crates/ignore/src/overrides.rs:1
+crates/ignore/src/gitignore.rs:1
+crates/ignore/src/types.rs:1
+
+real 0.138
+user 0.485
+sys 0.209
+maxmem 7 MB
+faults 0
+
+$ time rg --pre pre-rg --pre-glob '*.pdf' 'fn is_empty' -c
+crates/globset/src/lib.rs:1
+crates/ignore/src/types.rs:1
+crates/ignore/src/gitignore.rs:1
+crates/ignore/src/overrides.rs:1
+crates/matcher/src/lib.rs:2
+
+real 0.008
+user 0.010
+sys 0.002
+maxmem 7 MB
+faults 0
+```
+
+
### Common options
ripgrep has a lot of flags. Too many to keep in your head at once. This section
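Note on the `--pre-glob` example at the end of the GUIDE hunk above: the flag pair can be bundled into a small shell wrapper so the glob is never forgotten. This is only a sketch. The function name `rg_pdf` and the `RG_PRE` variable are hypothetical names picked for illustration, and it assumes a preprocessor called `preprocess` is on `PATH`; only `--pre` and `--pre-glob` come from the patch itself.

```sh
#!/bin/sh
# Hypothetical wrapper: rg_pdf and RG_PRE are illustrative names, not part of
# ripgrep. Assumes a preprocessor called "preprocess" is on PATH unless
# RG_PRE points somewhere else.
rg_pdf() {
    # Only paths matching *.pdf are sent through the preprocessor. Every
    # other file is searched directly, so no extra process is launched for it.
    rg --pre "${RG_PRE:-preprocess}" --pre-glob '*.pdf' "$@"
}
```

With the function sourced into a shell, `rg_pdf 'fn is_empty' -c` would run the same search as the second timing in the hunk, with `preprocess` in place of `pre-rg`.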
diff --git a/README.md b/README.md
index b58ca90c..b13e27ed 100644
--- a/README.md
+++ b/README.md
@@ -116,9 +116,10 @@ times are unaffected by the presence or absence of `-n`.
specifically specified with the `-E/--encoding` flag.)
* ripgrep supports searching files compressed in a common format (brotli,
bzip2, gzip, lz4, lzma, xz, or zstandard) with the `-z/--search-zip` flag.
-* ripgrep supports arbitrary input preprocessing filters which could be PDF
- text extraction, less supported decompression, decrypting, automatic encoding
- detection and so on.
+* ripgrep supports
+ [arbitrary input preprocessing filters](GUIDE.md#preprocessor)
+ which could be PDF text extraction, less supported decompression, decrypting,
+ automatic encoding detection and so on.
In other words, use ripgrep if you like speed, filtering by default, fewer
bugs and Unicode support.
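Note on the filter interface that the README entry now links to: per the GUIDE text in this patch, ripgrep passes the file path as the first and only argument to the preprocessor and also sends the file contents on stdin. The sketch below imitates that call for a single file, which may help when debugging a preprocessor outside of ripgrep. `emulate_pre` is a hypothetical helper written for this note, not something ripgrep ships, and it makes no attempt to reproduce ripgrep's error reporting.

```sh
#!/bin/sh
# Hypothetical helper (not part of ripgrep): call a preprocessor the way the
# GUIDE describes --pre doing it for one file.
#   $1 = preprocessor command, $2 = path of the file to convert
emulate_pre() {
    # The path is the only argument, and the same file is attached to stdin.
    # Whatever the preprocessor writes to stdout is the text to be searched.
    "$1" "$2" < "$2"
}
```

For example, `emulate_pre ./preprocess 1995-watson.pdf | head` should show the text that ripgrep would end up searching, assuming `pdftotext` is installed.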