path: root/grep-regex
Age | Commit message | Author
2019-04-15 | grep-regex: release 0.1.3 (tag: grep-regex-0.1.3) | Andrew Gallant
2019-04-15 | grep-matcher: release 0.1.2 (tag: grep-matcher-0.1.2) | Andrew Gallant
2019-04-14 | regex: fix HIR analysis bug | Andrew Gallant
An alternate can be empty at this point, so we must handle it. We didn't before because the regex engine actually disallows empty alternates. However, this code runs before the regex compiler rejects the regex.
2019-04-07 | regex: make multi-literal searcher faster | Andrew Gallant
This makes the case of searching for a dictionary of a very large number of literals much faster (~10x or so). In particular, we achieve this by short-circuiting the construction of a full regex when we know we have a simple alternation of literals. Building the regex for a large dictionary (>100,000 literals) turns out to be quite slow, even if it internally will dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex is quite slow. So when the -F/--fixed-strings flag is set, we short-circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than 2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible that this optimization could be implemented entirely in the regex engine. Unfortunately, the regex engine's internals are just not amenable to this at all, so it would require a larger refactoring effort. For now, it's good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will be slower, because of the aforementioned missing optimization. Moreover, passing flags like `-i` or `-S` will cause ripgrep to abandon this optimization and fall back to something potentially much slower. Again, this fix really needs to happen inside the regex engine, although we might be able to special-case -i when the input literals are pure ASCII via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
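The short circuit described in this commit can be sketched roughly as a decision made before any regex is built. This is only an illustration under stated assumptions: the `Strategy`, `Flags`, `escape`, and `choose_strategy` names are hypothetical and are not taken from the grep-regex source, and the actual multi-pattern search (for example via Aho-Corasick) is left out.

```rust
/// Hypothetical search strategy, chosen before any regex is parsed.
#[derive(Debug, PartialEq)]
enum Strategy {
    /// Hand the literals straight to a multi-pattern matcher
    /// (e.g. Aho-Corasick), skipping regex parsing entirely.
    MultiLiteral(Vec<String>),
    /// Build and parse a single regex pattern string.
    Regex(String),
}

/// Hypothetical flags, loosely mirroring -F, -i and -S.
#[derive(Default)]
struct Flags {
    fixed_strings: bool,
    case_insensitive: bool,
    smart_case: bool,
}

/// Escapes regex metacharacters so a literal can be embedded in a pattern.
fn escape(lit: &str) -> String {
    let mut out = String::with_capacity(lit.len());
    for c in lit.chars() {
        if r"\.+*?()|[]{}^$#&-~".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Picks the fast path only for plain -F searches; -i or -S abandon it,
/// matching the limitation described in the commit message.
fn choose_strategy(patterns: &[&str], flags: &Flags) -> Strategy {
    if flags.fixed_strings && !flags.case_insensitive && !flags.smart_case {
        Strategy::MultiLiteral(patterns.iter().map(|p| p.to_string()).collect())
    } else if flags.fixed_strings {
        // Slow path: the literals become one big escaped alternation.
        let alts: Vec<String> = patterns.iter().map(|p| escape(p)).collect();
        Strategy::Regex(alts.join("|"))
    } else {
        // Patterns are already regexes; group each one before joining.
        let alts: Vec<String> = patterns.iter().map(|p| format!("(?:{})", p)).collect();
        Strategy::Regex(alts.join("|"))
    }
}
```

With 100,000+ literals, only the `MultiLiteral` branch avoids the cost of building and parsing one very large alternation.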
2019-04-06 | searcher: add option to disable BOM sniffing | lesnyrumcajs
This commit adds a new encoding feature where the -E/--encoding flag will now accept a value of 'none'. When given this value, all encoding-related machinery is disabled and ripgrep will search the raw bytes of the file, including the BOM if it's present.

Closes #1207, Closes #1208
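A minimal sketch of what disabling BOM sniffing means for the bytes that get searched. The `Encoding` enum and both functions are hypothetical names for illustration, not the grep-searcher API. Transcoding is omitted, and the sketch assumes that the default path drops a detected BOM, which simplifies what the real searcher does.

```rust
/// Hypothetical encoding choice, loosely mirroring -E/--encoding.
enum Encoding {
    /// Default behavior: sniff for a BOM at the start of the file.
    Auto,
    /// `-E none`: no sniffing and no transcoding; search the raw bytes.
    None,
}

/// Returns the encoding name implied by a leading BOM and the BOM's length.
fn sniff_bom(bytes: &[u8]) -> Option<(&'static str, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(("UTF-8", 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}

/// Returns the bytes the searcher would see (transcoding not modeled).
fn searchable_bytes<'a>(bytes: &'a [u8], enc: &Encoding) -> &'a [u8] {
    match enc {
        // Raw mode: the BOM, if any, stays in the haystack.
        Encoding::None => bytes,
        // Assumption: the sniffing path skips past a detected BOM.
        Encoding::Auto => match sniff_bom(bytes) {
            Some((_, len)) => &bytes[len..],
            None => bytes,
        },
    }
}
```

In raw mode a pattern that should match at the very start of the file has to account for the BOM bytes itself.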
2019-04-05 | regex: print out final regex in trace mode | Andrew Gallant
This is useful for debugging to see what regex is actually being run. We put this as a trace since the regex can be quite gnarly. (It is not pretty printed.)
2019-04-05 | regex: fix a perf bug when using -w flag | Andrew Gallant
When looking for an inner literal to speed up searches, if only a prefix is found, then we generally give up doing inner literal optimizations since the regex engine will generally handle it for us. Unfortunately, this decision was being made *before* we wrap the regex in (^|\W)...($|\W) when using the -w/--word-regexp flag, which would then defeat the literal optimizations inside the regex engine. We fix this with a bit of a hack that says, "if we're doing a word regexp, then give me back any literal you find, even if it's a prefix."
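The ordering problem and the fix can be sketched as follows. The names `wrap_word`, `Literal`, and `literal_for_prefilter` are hypothetical and not taken from the grep-regex source. The use of non-capturing groups in the wrapper is also an assumption; the commit message only gives the shape `(^|\W)...($|\W)`.

```rust
/// Wraps a pattern in word boundaries of the shape described above.
fn wrap_word(pattern: &str) -> String {
    format!(r"(?:^|\W)(?:{})(?:$|\W)", pattern)
}

/// Hypothetical result of literal extraction on the *unwrapped* pattern.
#[derive(Debug)]
enum Literal {
    /// The literal is a prefix of every match.
    Prefix(String),
    /// The literal occurs somewhere inside every match.
    Inner(String),
}

/// Decides which literal, if any, to use as a fast pre-filter.
/// Normally a prefix is discarded because the regex engine handles prefixes
/// itself. After wrapping, the literal is no longer a prefix of the final
/// regex, so with -w we keep it.
fn literal_for_prefilter(found: Option<Literal>, word_regexp: bool) -> Option<String> {
    match found {
        Some(Literal::Inner(s)) => Some(s),
        Some(Literal::Prefix(s)) if word_regexp => Some(s),
        _ => None,
    }
}
```

For example, `foo\d+` has the prefix literal `foo`, but once wrapped the regex starts with `(?:^|\W)`, so the engine's own prefix optimization no longer applies.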
2019-02-16 | grep-regex-0.1.2 (tag: grep-regex-0.1.2) | Andrew Gallant
2019-01-26 | regex: make CRLF hack more robust | Andrew Gallant
This commit improves the CRLF hack to be more robust. In particular, in addition to rewriting `$` as `(?:\r??$)`, we now strip `\r` from the end of a match if and only if the regex has an ending line anchor required for a match. This doesn't quite make the hack 100% correct, but should fix most use cases in practice. An example of a regex that will still be incorrect is `foo|bar$`, since the analysis isn't quite sophisticated enough to determine that a `\r` can be safely stripped from any match. Even if we fix that, regexes like `foo\r|bar$` still won't be handled correctly. Alas, more work on this front should really be focused on enabling this in the regex engine itself.

The specific cause of this bug was that grep-searcher was sneakily stripping CRLF from matching lines when it really shouldn't have. We remove that code now, and instead rely on better match semantics provided at a lower level.

Fixes #1095
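The match adjustment described in this commit might look roughly like the sketch below. The `adjust_match` function and its `requires_end_anchor` parameter are hypothetical names; the analysis that decides whether an ending line anchor is required is assumed to happen elsewhere and is not shown.

```rust
/// Trims one trailing `\r` from the match `[start, end)` in `haystack`,
/// but only when the regex is known to require an ending line anchor.
/// `requires_end_anchor` stands in for the result of the regex analysis.
fn adjust_match(
    haystack: &[u8],
    start: usize,
    end: usize,
    requires_end_anchor: bool,
) -> (usize, usize) {
    if requires_end_anchor && end > start && haystack[end - 1] == b'\r' {
        (start, end - 1)
    } else {
        (start, end)
    }
}
```

With the haystack `foo\r\nbar`, a pattern such as `foo$` rewritten to `foo(?:\r??$)` matches `foo\r`, and the trailing `\r` is trimmed. For `foo\r|bar$` no single answer is right: the `\r` belongs to the match when the first branch matched, which is why the commit message calls that case unhandled.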
2019-01-19 | deps: update various dependencies | Andrew Gallant
We also increase the MSRV to 1.32, the current stable release, which sets the stage for migrating to Rust 2018.
2018-09-25 | grep-regex: fix inner literal detection | Andrew Gallant
It seems the inner literal detector fails spectacularly in cases of concatenations that involve groups. The issue here is that if the prefix of a group inside a concatenation can match the empty string, then any literals generated to that point in the concatenation need to be cut such that they are never extended. The detector isn't really built to handle this case, so we just act conservatively and cut literals whenever we see a sub-group. This may make some regexes slower, but the inner literal detector already misses plenty of cases.

Literal detection (including in the regex engine) is a key component that needs to be completely rethought at some point.

Fixes #1064
2018-09-07 | deps: update versions for all crates | Andrew Gallant
I don't think every change here is needed, but this ensures we're using the latest version of every direct dependency.
2018-09-07 | doc: minor touchups to API docs | Andrew Gallant
2018-08-20 | deps: update libripgrep crate versions | Andrew Gallant
This prepares them for an initial 0.1.0 release.
2018-08-20 | libripgrep: initial commit introducing libripgrep | Andrew Gallant
libripgrep is not any one library, but rather, a collection of libraries that roughly separate the following key distinct phases in a grep implementation:

1. Pattern matching (e.g., by a regex engine).
2. Searching a file using a pattern matcher.
3. Printing results.

Ultimately, both (1) and (3) are defined by de-coupled interfaces, of which there may be multiple implementations. Namely, (1) is satisfied by the `Matcher` trait in the `grep-matcher` crate and (3) is satisfied by the `Sink` trait in the `grep2` crate. The searcher (2) ties everything together and finds results using a matcher and reports those results using a `Sink` implementation.

Closes #162
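The three-phase split can be illustrated with a deliberately tiny sketch. The traits below are simplified stand-ins that only borrow the names `Matcher` and `Sink`; their methods, along with `Substring`, `Collect`, and `search`, are hypothetical and far smaller than the real interfaces.

```rust
/// Phase 1: pattern matching. Simplified stand-in, not the real trait.
trait Matcher {
    /// Returns the byte range of the first match in `line`, if any.
    fn find(&self, line: &str) -> Option<(usize, usize)>;
}

/// Phase 3: reporting results. Simplified stand-in, not the real trait.
trait Sink {
    /// Called for each matching line; return false to stop the search.
    fn matched(&mut self, line_number: usize, line: &str) -> bool;
}

/// One possible matcher: plain substring search.
struct Substring(String);

impl Matcher for Substring {
    fn find(&self, line: &str) -> Option<(usize, usize)> {
        line.find(self.0.as_str()).map(|s| (s, s + self.0.len()))
    }
}

/// One possible sink: collect (line number, line) pairs in memory.
struct Collect(Vec<(usize, String)>);

impl Sink for Collect {
    fn matched(&mut self, line_number: usize, line: &str) -> bool {
        self.0.push((line_number, line.to_string()));
        true
    }
}

/// Phase 2: the searcher ties a matcher and a sink together.
fn search<M: Matcher, S: Sink>(matcher: &M, haystack: &str, sink: &mut S) {
    for (i, line) in haystack.lines().enumerate() {
        if matcher.find(line).is_some() && !sink.matched(i + 1, line) {
            break;
        }
    }
}
```

Because the searcher depends only on the two traits, a regex-backed matcher or a printer-backed sink could be swapped in without touching `search`.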