path: root/_posts/2017-08-13-When-not-to-use-a-regex.md
---
layout: post
title: When not to use a regex
tags: [regex]
---

The other day, I saw [Learn regex the easy
way](https://github.com/zeeshanu/learn-regex). This is a great resource, but I
felt the need to pen a post explaining that regexes are usually not the right
approach.

Let's do a little exercise. I googled "URL regex" and here's the first Stack
Overflow result:

```
https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)
```

<p style="text-align: right">
<small><a href="https://stackoverflow.com/a/3809435/1191610">source</a></small>
</p>

This is a bad regex. Here are some valid URLs that this regex fails to match:

- http://x.org
- http://nic.science
- http://名がドメイン.com (warning: this is a parked domain)
- http://example.org/url,with,commas
- https://en.wikipedia.org/wiki/Harry_Potter_(film_series)
- http://127.0.0.1
- http://[::1] (IPv6 loopback)

Here are some invalid URLs the regex is fine with:

- http://exam..ple.org
- http://--example.org

This answer has been revised 9 times on Stack Overflow, and this is the best
they could come up with. Go back and read the regex. Can you tell where each of
these bugs is? How long did it take you? If you received a bug report in your
application because one of these URLs was handled incorrectly, would you
understand this regex well enough to fix it? If your application has a URL
regex, go find it and see how it fares with these tests.
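If you want to run the exercise yourself, here is a minimal sketch in Python. It assumes "match" means the regex must cover the whole string (`re.fullmatch`); with `re.search`, the comma and parenthesis URLs would instead match only a truncated prefix, which is arguably worse. The names `URL_RE` and `regex_accepts` are made up for this example.

```python
import re

# The Stack Overflow regex quoted above, copied verbatim and split over two lines.
URL_RE = re.compile(
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b"
    r"([-a-zA-Z0-9@:%_\+.~#?&//=]*)"
)

def regex_accepts(url):
    """Return True if the regex matches the entire string."""
    return URL_RE.fullmatch(url) is not None

VALID_URLS = [
    "http://x.org",
    "http://nic.science",
    "http://名がドメイン.com",
    "http://example.org/url,with,commas",
    "https://en.wikipedia.org/wiki/Harry_Potter_(film_series)",
    "http://127.0.0.1",
    "http://[::1]",
]

INVALID_URLS = [
    "http://exam..ple.org",
    "http://--example.org",
]

for url in VALID_URLS + INVALID_URLS:
    verdict = "accepted" if regex_accepts(url) else "rejected"
    print(f"{verdict}: {url}")
```

Every URL in the first list comes back rejected and both URLs in the second list come back accepted, which is exactly backwards.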

Complicated regexes are opaque, unmaintainable, and often wrong. The correct
approach to validating a URL is to hand it to a real URL parser:

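```python
from urllib.parse import urlparse

def is_url_valid(url):
    """Return True if url parses and has both a scheme and a host."""
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse raises ValueError on malformed input,
        # such as an unclosed "[" in an IPv6 host.
        return False
    # urlparse accepts almost any string, so also require the parts we need.
    return bool(parts.scheme and parts.netloc)
```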
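One caveat: `urlparse` is lenient. It mostly splits a string into components and raises only on a handful of malformed inputs, so a stricter check has to look at the parsed pieces. Here is a rough sketch of what that might look like, assuming you only want `http` and `https` URLs. The helper name `looks_like_http_url` and the hostname rules (no empty labels, no leading or trailing hyphen) are my simplifications, not a complete implementation of the relevant RFCs.

```python
import ipaddress
from urllib.parse import urlsplit

def looks_like_http_url(url):
    """Hypothetical helper: parse first, then check the parts we care about."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Raised for malformed input, e.g. an unclosed "[" in an IPv6 host.
        return False
    host = parts.hostname
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        ipaddress.ip_address(host)  # accepts literals like 127.0.0.1 and ::1
        return True
    except ValueError:
        pass  # not an IP literal, so treat it as a DNS name
    # Simplified DNS check: every dot-separated label must be non-empty
    # and must not begin or end with a hyphen.
    return all(
        label and not label.startswith("-") and not label.endswith("-")
        for label in host.split(".")
    )
```

Each rule here is a line of ordinary code that you can read, test, and fix on its own. For example, `looks_like_http_url("http://[::1]")` is true and `looks_like_http_url("http://exam..ple.org")` is false.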

A regex is useful for validating *simple* patterns and for *finding* patterns in
text. For anything beyond that it's almost certainly a terrible choice. Say you
want to...

**validate an email address**: try to send an email to it!

**validate password strength requirements**: estimate the complexity with
[zxcvbn](https://github.com/dropbox/zxcvbn)!

**validate a date**: use your standard library, for example
[datetime.datetime.strptime](https://docs.python.org/3.6/library/datetime.html#datetime.datetime.strptime)!

**validate a credit card number**: run the [Luhn
algorithm](https://en.wikipedia.org/wiki/Luhn_algorithm) on it!

**validate a social security number**: alright, use a regex. But don't expect
the number to be assigned to someone until you ask the Social Security
Administration about it!
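A few of these fit in a handful of lines. The sketch below is illustrative only: the function names, the `%Y-%m-%d` date format, and the `NNN-NN-NNNN` shape for social security numbers are assumptions I picked for the example.

```python
import re
from datetime import datetime

def is_valid_date(text, fmt="%Y-%m-%d"):
    """Let strptime decide; it rejects impossible dates such as February 30."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

def passes_luhn(number):
    """Run the Luhn checksum over a card number, ignoring spaces and hyphens."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

# A simple pattern is a fine job for a regex. This checks the shape only.
SSN_SHAPE = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

def has_ssn_shape(text):
    return SSN_SHAPE.fullmatch(text) is not None
```

A date regex like `\d{4}-\d{2}-\d{2}` would happily accept `2017-02-30`, while `is_valid_date("2017-02-30")` returns `False`.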

Get the picture?