Regex Isn't Hard (2023)

75 points by asicsp 21 days ago

28 comments

> e.g. This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This wold match an IP address (albeit not strictly).

I love regular expressions, but one thing I've learned over the years is that the syntax is dense enough that even people who are confident enough to start writing regex tutorials often can't write a regex that matches an IP address.

gwd 20 days ago

So my brother doesn't code for a living, but has done a fair amount of personal coding, and also gotten into the habit of watching live-coding sessions on YouTube. Recently he's gotten involved in my project a bit, and so we've done some pair programming sessions, in part to get him up to speed on the codebase, in part to get him up to speed on more industrial-grade coding practices and workflows.

At some point we needed to do some parsing of some strings, and I suggested a simple regex. But apparently a bunch of the streamers he's been watching basically have this attitude that regexes stink, and you should use basically anything else. So we had a conversation, and compared the clarity of coding up the relatively simple regex I'd made, with how you'd have to do it procedurally; I think the regex was a clear winner.

Obviously regexes aren't the right tool for every job, and they can certainly be done poorly; but in the right place at the right time they're the simplest, most robust, easiest to understand solution to the problem.

TrackerFF 20 days ago

Confession: Regex knowledge is one of those things I've let completely atrophy after integrating LLMs into my workflow. I guess if the day comes that AI/ML models suddenly disappear, or become completely unavailable to me, I'll have to get into the nitty gritty of Regex again...but until that time, it is a "solved problem" for my part.

noxer 20 days ago

> Instead, use a range negation, like [^%] if you know the % character won’t show up. It doesn’t hurt to be a little more explicit.

This is absolutely horrible. Patterns are fairly readable if they follow the syntax logic. Matching "everything but that random character that will not appear" is absurd. Also, the idea that a . (dot) behaves arbitrarily in different languages shows a severe lack of understanding of regex syntax. Ofc you can't write a proper pattern if you don't know which syntax is used. If anything, you would force-override the behavior of the . (dot) with the appropriate flag to ensure it works the same across different compatible regex engines.

latexr 20 days ago

I’m a fan of regular expressions, though I understand why many people wince at the sight. You should avoid showing them to a non-programmer who is interested in learning to code, because they’ll immediately fear programming is intractable.

Even as much as I like regex, I wouldn’t recommend this post. One reason is the code style is too close to regular text:

> a matches a single character, always lowercase a.

That sentence uses “a” three times, two of them as code and once as an indefinite article, but it’s not immediately obvious to the eye. VoiceOver completely fumbles it, especially considering the sentence immediately after.

A more important reason against recommending the article is that I find a bunch of the arguments to be unhelpful. If you’re trying to convince people to give regular expressions a chance, telling them to ignore `.` and use `[^%]` is going to bite them. That’s not super common (important when trying to learn more from other sources) and even an experienced regexer must do a double take to figure out “is there a reason this specific character must not be matched?” Furthermore, no new learner is going to remember that four-character incantation, and neither are they going to understand what’s happening when their code doesn’t work because there was a `%` in their text. People need to learn about `.` (possibly the most common character in regex) if only because they also need to learn to escape it and not ignore it when there is a literal period in the text. Don’t tell people to ignore repetition ranges either; they aren’t difficult to reason about and are certainly simpler to read than the same blob of intractable text multiple times.

BMc2020 20 days ago

Regex is much easier if you don't do it all at once. It's perfectly acceptable to, say, trim all the leading spaces, store the result in a temp variable, trim all the trailing spaces, store the result in a temp variable, remove all the hyphens, etc.

Everyone tries to create the platonic ideal regex that does everything in one line.
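
As a rough illustration of the step-by-step approach, here is a minimal Python sketch. The function name `normalize_code` and the sample input are made up for the example; each step is one small pattern whose intermediate result can be inspected on its own.

```python
import re

def normalize_code(raw: str) -> str:
    """Clean up a value in small, separately testable steps."""
    no_leading = re.sub(r"^\s+", "", raw)          # trim leading whitespace
    no_trailing = re.sub(r"\s+$", "", no_leading)  # trim trailing whitespace
    no_hyphens = re.sub(r"-", "", no_trailing)     # remove all hyphens
    return no_hyphens

print(normalize_code("   555-12-34  "))  # 5551234
```

When one step misbehaves, only that one short pattern needs debugging.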

justlikereddit 21 days ago

Nothing is hard once you've learned to do it intuitively.

The hardest part is remembering how you struggled with it when you started.

nickez 21 days ago

Found an error immediately: "Any lowercase character" doesn't match all Swedish lowercase characters.
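
To make the point concrete, here is a small Python sketch, assuming the article's "any lowercase character" refers to the class `[a-z]`. The sample word is arbitrary, and the alternative shown is just one standard-library option that approximates "a letter" (it also admits a few non-letter alphanumerics), not a definitive fix.

```python
import re

word = "smörgåsbord"

# [a-z] covers only the 26 ASCII lowercase letters, so it stops at the first ö.
ascii_only = re.match(r"[a-z]+", word).group()

# A Unicode-aware approximation of "a letter": a word character that is
# neither a digit nor an underscore. It accepts both cases, so the case
# check is done separately with str.islower().
letters = re.match(r"[^\W\d_]+", word).group()

print(ascii_only)         # sm
print(letters)            # smörgåsbord
print(letters.islower())  # True
```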

goku12 20 days ago

If you take the regex subset that works uniformly across all regex engines (even for just perl-compatible engines), you would probably get nothing done. They all have some minor variations that make it impossible to write a regex for a particular engine without a reference sheet open nearby, even if you have years of experience writing them. And those 'shortcuts' like look-ahead and look-behind are often too useful to be neglected completely.

Crafting regexes is a story of its own. The other commenter has described it. Just to summarize, regexes are fine for simple patterns. But their complexity explodes as soon as you need to handle a lot of corner cases.

mannykannot 20 days ago

Here’s a regex crossword: https://jimbly.github.io/regex-crossword/

See also: Are Regex Crosswords NP-hard? https://cs.stackexchange.com/questions/30143/are-regex-crosswords-np-hard

boricj 20 days ago

In a previous job I've done some stupid tricks with regexes. Inside a MongoDB database I had documents with a version field in string form ("x.y.z") and I needed to exclude documents with a schema too old to process in my queries.

One can construct a regex that matches a number between x and y by enumerating all the digit patterns that fit the criteria. For example, the following pattern matches a number between 1 and 255: ^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$

This can be extended to match a version less than or equal to x.y.z by enumerating all the patterns across the different fields. The following pattern matches any version less than or equal to 2.14.0: ^([0-1]\.\d+\.\d+)|(2\.[0-9]\.\d+|(2\.1[0-3]\.\d+))$

Basically, I wrote a Java method that would generate a regex with all the patterns to match a version greater than or equal to a lower bound, which was then fed to MongoDB queries to exclude documents too old to process based on the version field. It was a stupid solution to a dumb problem, but it worked flawlessly.
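
The enumeration idea can be checked mechanically. The Python sketch below is not the Java generator described above; it is a hand-enumerated variant with hypothetical names (`POSTED`, `AT_MOST_2_14_0`, `at_most`), assuming versions have no leading zeros. One thing it brings out: in the second pattern as posted, `^` binds only to the first alternative and `$` only to the second, and 2.14.0 itself is not covered, so the sketch wraps the whole alternation in one anchored group and adds that case.

```python
import re

# Pattern as posted: the ^ applies only to the first alternative and the $
# only to the second, so an unanchored search can match inside a longer string.
POSTED = re.compile(r"^([0-1]\.\d+\.\d+)|(2\.[0-9]\.\d+|(2\.1[0-3]\.\d+))$")

# Hand-enumerated variant with the whole alternation inside one anchored group.
AT_MOST_2_14_0 = re.compile(
    r"^(?:[01]\.\d+\.\d+"          # any 0.y.z or 1.y.z
    r"|2\.(?:[0-9]|1[0-3])\.\d+"   # 2.0.z through 2.13.z
    r"|2\.14\.0)$"                 # exactly 2.14.0
)

def at_most(version: str, bound=(2, 14, 0)) -> bool:
    """Reference check: compare the parsed fields as a tuple of ints."""
    return tuple(int(part) for part in version.split(".")) <= bound

# Cross-check the regex against the tuple comparison on a small grid.
for major in range(4):
    for minor in range(21):
        for patch in range(4):
            v = f"{major}.{minor}.{patch}"
            assert bool(AT_MOST_2_14_0.search(v)) == at_most(v), v

print(bool(POSTED.search("12.5.0")))          # True
print(bool(AT_MOST_2_14_0.search("12.5.0")))  # False
```

Checking a generated pattern against a plain numeric comparison like this is a cheap way to catch anchoring and off-by-one mistakes before the pattern reaches a query.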

m463 20 days ago

Regexes are powerful, useful and needlessly hard to use.

But not because of the regex idea itself.

It is quoting.

The reason people don't properly learn how to use a regex is because they are insulated from it by whatever language they are using.

It's literally like those surgeons who do heart surgery starting at a vein in your leg.

I use regexes all the time, in emacs, python, perl, bash, sed, awk, grep and more...

and just about every time the regex syntax is mixed with single quotes, double quotes, backslashes, $variable names and more from the "enclosing language or tool".

If I have a parenthesis or $, I'm always wondering if it is part of the enclosing language, or the matching pattern, or the literal. Also, the kind of regex adds to the confusion (basic or extended regex?)

I think it would be nice to have a syntax highlighter that would help with this, independent of language. Green for variable or other language construct, red for regex pattern, white for matching literal.
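
The layering problem is easy to see in Python, where the host language's string quoting sits on top of regex escaping. This is a minimal sketch with a made-up sample string; other tools (shell, sed, emacs) add their own, different layers.

```python
import re

text = r"cost: $5 (approx) in C:\temp"

# Layer 1 is Python string quoting, layer 2 is regex escaping.
# Without a raw string, every regex backslash must itself be doubled.
plain = "\\$\\d+ \\(approx\\)"
raw = r"\$\d+ \(approx\)"
assert plain == raw  # same pattern, two spellings

print(re.search(raw, text).group())  # $5 (approx)

# A literal backslash is one character in the text, two in the regex,
# and four in a non-raw Python string literal.
assert "\\\\temp" == r"\\temp"
print(re.search(r"\\temp", text).group())  # \temp

# When the whole thing is a literal, re.escape builds the regex layer.
literal = "$5 (approx)"
assert re.search(re.escape(literal), text).group() == literal
```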

prmph 20 days ago

I consider myself a reasonably competent senior engineer, and yet with regex this is what I have noticed:

Every time I need to write even the simplest regex, I can't seem to get it right the first time. I always need to struggle with it for a long time. Sometimes even using online tools takes me time to get it right. This happens every.single.time.

It baffles me to no end. I'm a pretty quick learner of pretty much everything I get into. I write the most sophisticated TypeScript code you can imagine; I've written a small toy language; I've written biometric authentication drivers; I've written my own functional UI lib. But, I cannot master regex.

You can give me all the arguments about what is good about regex, but in my experience (which you can't argue with), it is a VERY badly designed API, and nothing will convince me otherwise. Regex is probably the worst thing ever in programming.

alganet 20 days ago

One can think of regex as a very compact notation for writing text operations. It helps a lot.

The popular idea of them being write-only is obviously a joke, but it has some truth to it. On the good side, small code that needs to be rewritten is often better than large code that needs to be maintained.

thomasmg 20 days ago

For me, the main problem of the Regex syntax is the escaping rules: many characters require escaping: \ { } ( ) [ ] | * + ? ^ $ . And the rules are different inside square brackets. I think it would be better if literal text is enclosed in quotes; that way, much less escaping is needed, but it would still be concise (and sometimes, more concise). I tried to formulate a proposal here: https://github.com/thomasmueller/bau-lang/blob/main/RegexV2.md
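
A short Python sketch of the "different rules inside square brackets" point, using Python's `re` dialect as the assumed engine (other engines differ in the details) and arbitrary sample strings:

```python
import re

# Outside a character class, ( ) and . are operators and need a backslash
# to be matched literally.
outside = re.compile(r"\(\d+\.\d+\)")
print(outside.search("f(3.14)").group())  # (3.14)

# Inside square brackets most of them are plain characters: . + * ? ( ) $ |
# need no escape. The characters that mainly keep a special meaning there are
# ] and \, plus ^ when it comes first and - when it sits between two characters.
inside = re.compile(r"[.+*?()$|]")
print(inside.findall("a+b*(c|d)?"))  # ['+', '*', '(', '|', ')', '?']
```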

satisfice 20 days ago

I like the sentiment but I would make some very different choices. For instance, use the . operator, because it is easier to understand than his Rube-Goldberg-logic negation groups alternative.

He’s also strangely worried about portability. If you are really concerned about portability, you are moving between languages and you probably aren’t some novice who should be frightened by complexity.

I don’t think about portability at all, ever. And I do maintain code in Perl, Python, and JavaScript.

But yeah, just as in all programming languages, you can get by with knowing about a 20% subset of all it can do.

evertedsphere 20 days ago

> This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This wold match an IP address (albeit not strictly).

That pattern (once you fix the typo) would not match a whole IP address unless you allowed it to also swallow the character after the last octet, which wouldn't work at, say, the end of a line.
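
This is straightforward to verify. The Python sketch below assumes the stray `]` in the quoted pattern was meant to be a `?`; the stricter `IPV4` pattern is just one common way to write it, offered as an illustration and not as the article's intended fix.

```python
import re

# The quoted pattern with the stray "]" read as "?" (an assumption).
ARTICLE = re.compile(r"([0-9][0-9]?[0-9]?[.])+")

# Every repetition must end in a dot, so a bare address cannot match in full.
print(ARTICLE.fullmatch("192.168.1.1"))                # None
print(ARTICLE.fullmatch("192.168.1.1.") is not None)   # True

# A stricter sketch: four octets in 0-255 separated by exactly three dots.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

print(IPV4.fullmatch("192.168.1.1") is not None)  # True
print(IPV4.fullmatch("256.1.1.1") is not None)    # False
print(IPV4.fullmatch("1.2.3") is not None)        # False
```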

RHSeeger 20 days ago

I tend to use regular expressions more commonly on the command line (looking for content in files, especially log files) than I do in code. But, that being said, I do use them in both cases. They're a tool and can be used well. But, like any other programming, you need to make sure your code is readable. Which (generally) means avoiding any really complex regular expressions.

hamdouni 20 days ago

I'm jumping in here just to say that non-greedy constructions are valuable, and not using them makes expressions harder to write and to understand.
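
For readers who have not met the distinction, here is a small Python sketch on a made-up input line, showing greedy `.*`, non-greedy `.*?`, and the negated-class style the article recommends side by side:

```python
import re

line = 'name="alice" role="admin"'

# Greedy: .* runs to the end of the line, then backs up to the LAST quote.
print(re.findall(r'"(.*)"', line))     # ['alice" role="admin']

# Non-greedy: .*? stops at the first closing quote it can.
print(re.findall(r'"(.*?)"', line))    # ['alice', 'admin']

# Negated class: "anything except a quote" gives the same result here.
print(re.findall(r'"([^"]*)"', line))  # ['alice', 'admin']
```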

lairv 20 days ago

My issue with regexes is that the formal definition of regex I learned at university is clear and simple [0] but then using them in programming languages is always a mess.

[0] https://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory

comrade1234 21 days ago

I mean sure, if it was my full-time job to write regexes I’d probably get pretty good at it. But instead a really complex one comes up maybe once a year for me and so I have to go to some online regex checker and start iteratively building one up, spending hours only to find some condition where it doesn’t work, and then it's back to the checker...

So I don’t think it’s easy, but I do agree that they are very useful.

pyfon 20 days ago

I strongly agree with [^"] etc. over . and .?

Involves much less thinking!

thoroughburro 20 days ago

> NOTE: Some languages, like Rust, have parser combinators which can be as good or better than regex in most of the ways I care about.

What Rust feature is this referring to?

hyperman1 20 days ago

This is both a demo for the beauty and power of regexes, and of their dangers:

* The use of backslash separators quickly makes a mess, as they tend to need escaping wherever regexes are useful.
* The uppercase/lowercase is only right if there are no accented characters, so USA. This is bad in Western Europe in files where they are rare: your program works for a while, then an accent sneaks in and breaks things.
* The exact meaning of all the specials like \( vs ( .
* Ranges work in most regex dialects but not everywhere.
* A simple regex for an int with a specific range is nasty. If you want a full float, good luck.

Regexes are great as an initial filter or quick hack, but you need more in full-size programs.

I'd love to see a better regex syntax, too.

voidUpdate 21 days ago

The text on that AI-generated image at the top is definitely... interesting

poisonborz 21 days ago

This is truly one thing AI solved. Hard to write, easy to test. No one needs to learn this convoluted syntax in the future and we're all better for it.

bazoom42 20 days ago

Honestly, regex syntax is a mess. For example, parentheses are used both for grouping alternatives and for capturing. I think Perl 6 tried (and failed) to fix this. A larger problem is that you have to memorize the metacharacters, since they are basically random.

Regex is still the best solution I know of for its intended domain.

jmpman 20 days ago

I’ve started using LLMs to identify the proper regex for my use cases. I’d like to see such regex creation as an LLM benchmark.