regex: tweak inner literal heuristic

Previously, we had logic to skip our own inner literal optimization if the regex itself was already (likely) accelerated. It turns out that the presence of a Unicode word boundary can defeat acceleration to a point. It's likely enough that even if the underlying regex is accelerated, it would be prudent to do our own inner literal optimization if the pattern has a Unicode word boundary. Normally a Unicode word boundary doesn't defeat literal optimizations, since even the slower engines can make use of *prefix* literal optimizations. But a regex can be accelerated via its own inner or suffix literal optimizations, and those require the use of a DFA (or lazy DFA). Since DFAs crap out on haystacks that contain a non-ASCII Unicode scalar value when the regex contains a Unicode word boundary, it follows that an "accelerated" can still wind up being quite slow. (An "accelerated" regex can also slow down because of restrictions on avoiding quadratic behavior, but I believe this happens less frequently and is not as severe as the slow down as a result of Unicode word boundaries. Namely, avoiding quadratic behavior just means giving up on the inner literal optimization for a single search. In which case, the regex engine can still fall back to a normal forward DFA. That will definitely be slower than an inner literal optimization done by ripgrep, but not quite as dramatic as it would be when DFAs can't be used at all.)
author: Andrew Gallant <jamslam@gmail.com> 2023-10-11 16:37:09 -0400
committer: Andrew Gallant <jamslam@gmail.com> 2023-11-20 23:51:53 -0500
commit: c21302b4096f9ce1af095fbe8b14828a90ecfecb (patch)
tree: e6dce87780e05fc4e48b59f304e8daa69a9b09b8 /crates
parent: 8a5b81716af38234c98f684096b4c5e36ce80763 (diff)
1 files changed, 12 insertions, 5 deletions
diff --git a/crates/regex/src/literal.rs b/crates/regex/src/literal.rs
index 831b82cb..e2f158b6 100644
--- a/crates/regex/src/literal.rs
+++ b/crates/regex/src/literal.rs
@@ -65,12 +65,19 @@ impl InnerLiterals {
         // If we believe the regex is already accelerated, then just let
         // the regex engine do its thing. We'll skip the inner literal
         // optimization.
+        //
+        // ... but only if the regex doesn't have any Unicode word boundaries.
+        // If it does, there's enough of a chance of the regex engine falling
+        // back to a slower engine that it's worth trying our own inner literal
+        // optimization.
         if re.is_accelerated() {
-            log::trace!(
-                "skipping inner literal extraction, \
-                 existing regex is believed to already be accelerated",
-            );
-            return InnerLiterals::none();
+            if !chir.hir().properties().look_set().contains_word_unicode() {
+                log::trace!(
+                    "skipping inner literal extraction, \
+                     existing regex is believed to already be accelerated",
+                );
+                return InnerLiterals::none();
+            }
         }
         // In this case, we pretty much know that the regex engine will handle
         // it as best as possible, even if it isn't reported as accelerated.
author	Andrew Gallant <jamslam@gmail.com>	2023-10-11 16:37:09 -0400
committer	Andrew Gallant <jamslam@gmail.com>	2023-11-20 23:51:53 -0500
commit	c21302b4096f9ce1af095fbe8b14828a90ecfecb (patch)
tree	e6dce87780e05fc4e48b59f304e8daa69a9b09b8 /crates
parent	8a5b81716af38234c98f684096b4c5e36ce80763 (diff)