summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorAndrew Gallant <jamslam@gmail.com>2020-02-16 06:32:57 -0500
committerAndrew Gallant <jamslam@gmail.com>2020-02-17 17:16:28 -0500
commitad97e9c93fc0687ba7a96680ddc749c1da664446 (patch)
treede32bc2b344c453c3ad40c1f3906337654f8a70c
parent24f8a3e5ec4729c8e485547ca40157920ae64b9c (diff)
grep-regex: improve inner literal detection
This fixes an interesting performance bug where the inner literal extractor would sometimes choose a sub-optimal literal. For example, consider the regex: \x20+Sherlock Holmes\x20+ (The `\x20` is the ASCII code for a space character, which we use here to just make it clearer. It otherwise does not matter.) Previously, this would see the initial \x20 and then stop collecting literals after the `+` repetition operator. This was because the inner literal detector was adapter from the prefix literal detector, which had to stop here. Namely, while \x20S would be a valid prefix (for example), \x20\x20S would also be a valid prefix. As would \x20\x20\x20S and so on. So the prefix detector would have to stop at the repetition operator. Otherwise, only searching for \x20S could potentially scan farther then the starting position of the next match. However, for inner literals, this calculus no longer makes sense. We can freely search for, e.g., \x20S without missing matches that start with \x20\x20S precisely because we know this is an inner literal which may not correspond to the start of a match. With this fix, the literal that is now detected is \x20Sherlock Holmes\x20 Which is much better. We achieve this by no longer "cutting" literals after seeing a `+` repetition operator. Instead, we permit literals to continue to be extended. The reason why this is important is because using \x20 as the literal to search for is generally bad juju since it is so common. In fact, we should probably add more logic here to either avoid such things or give up entirely on the inner literal optimization if it detected a literal that we think is very common. But we punt on such things here.
-rw-r--r--CHANGELOG.md4
-rw-r--r--grep-regex/src/literal.rs1
2 files changed, 4 insertions, 1 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 70813aaf..49b7a692 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,10 @@ Performance improvements:
* [PERF #1381](https://github.com/BurntSushi/ripgrep/pull/1381):
Directory traversal is sped up with speculative ignore-file existence checks.
+* PERF:
+ Improve inner literal detection to cover more cases more effectively.
+ e.g., ` +Sherlock Holmes +` now has ` Sherlock Holmes ` extracted instead
+ of ` `.
Feature enhancements:
diff --git a/grep-regex/src/literal.rs b/grep-regex/src/literal.rs
index 1563ca05..cc1e8965 100644
--- a/grep-regex/src/literal.rs
+++ b/grep-regex/src/literal.rs
@@ -146,7 +146,6 @@ fn union_required(expr: &Hir, lits: &mut Literals) {
hir::RepetitionKind::ZeroOrMore => lits.cut(),
hir::RepetitionKind::OneOrMore => {
union_required(&x.hir, lits);
- lits.cut();
}
hir::RepetitionKind::Range(ref rng) => {
let (min, max) = match *rng {