Regex character "$" doesn't mean "end-of-string"

437 points by BerislavLopac, about 1 year ago

53 comments

Karellen, about 1 year ago

> Folks who've worked with regular expressions before might know about ^ meaning "start-of-string" and correspondingly see $ as "end-of-string".

Huh. I always think of them as "start-of-line" and "end-of-line". I mean, a lot of the time when I'm working with regexes, I'm working with text a line at a time so the effect is the same, but that doesn't change how I think of those operators.

Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?

SAI_Peregrinus, about 1 year ago

POSIX regexes and Python regexes are different. In general, you need to reference the regex documentation for your implementation, since the syntax is not universal.

Per POSIX chapter 9 [1]:

9.2 … "The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions limit the processing to lines; that is, zero or more characters followed by a <newline>."

and 9.3.8 … "A <dollar-sign> ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character."

combine to mean that $ may match the end of string OR the end of the line, and it's up to the utility (or mode) to define which. Most of the common utilities (grep, sed, awk, Python, etc.) treat it as end of line by default, since they operate on lines by default.

THERE IS NO SINGLE UNIVERSAL REGULAR EXPRESSION SYNTAX. You cannot reliably read or write regular expressions without knowing which language & options are being used.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

PuffinBlue, about 1 year ago

This seems like the perfect opportunity to introduce those unfamiliar to Robert Elder. He makes cool YouTube [0] and blog content [1] and has a series on regular expressions [2] and does some quite deep dives into the differing behaviour of the different tools that implement the various versions.

His latest on the topic is cool too: https://www.youtube.com/watch?v=ys7yUyyQA-Y

He has quite a lot of content that HN folks might be interested in I think, like the reality and woes of consulting [3]

[0] https://www.youtube.com/@RobertElderSoftware
[1] https://blog.robertelder.org/
[2] https://blog.robertelder.org/regular-expressions/
[3] https://www.youtube.com/watch?v=cK87ktENPrI

xlii, about 1 year ago

Regexp was one of the first things I truly internalized years ago when I was discovering Perl (which still lives in a cozy place in my heart due to a lovely “Camel” book).

Today the most important bit of information is the knowledge that implementations differ, and I have made a habit of pulling up a reference sheet for whatever I work with.

E.g. Emacs Regexp annoyingly doesn’t have word in the form of “\w” but uses “\s_-” (or something; no reference sheet on screen) as a character class (but Emacs has the best documentation and discoverability - a hill I’m willing to die on).

Some utilities require parenthesis escaping and some do not. Sometimes this behavior is configurable and sometimes it’s not.

I lived through the whole confusion, annoyance, denial phase and now I just accept it. The concept is the same everywhere but the flavor changes.

onion2k, about 1 year ago

I can hear thousands of bad hiring managers adding 'How do you match the end of a string in a regex?' to their list of 'Ha! You don't know the trick!' questions designed to catch out candidates.

tyingq, about 1 year ago

Seems odd to leave Perl off the list, given it's regex related.

Here's the explanation for $ in the perlre docs:

    $    Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)

perlgeek, about 1 year ago

Raku (formerly Perl 6) has picked ^ and $ for start-of-string and end-of-string, and has introduced ^^ and $$ for start-of-line and end-of-line. No multiline mode is available or necessary. (There's also \h for horizontal and \v for vertical whitespace.)

That's one of the benefits of a complete rethink/rewrite: you can learn from the fact that the old behavior surprised people.

beardyw, about 1 year ago

Does anyone consider RegEx to be standardised? Moving to a new context is always a relearning exercise in my experience.

danbruc, about 1 year ago

People are confused about strings and lines. A string is a sequence of characters; a line can be two different things. If you consider the newline a line terminator, then a line is a sequence of non-newline characters - possibly zero - plus a newline. If there is no newline at the end, then it is not a [complete] line. That is what POSIX uses. If you consider the newline a line separator, then a line is a sequence of non-newline characters - possibly zero. In either case, the content of the line ends before the newline, either because the newline terminates the line or because it separates the line from the next. [1]

The semantics of ^ and $ are based on lines - whether in single-line or multi-line mode. For string-based semantics - which you could also think of as the entire file if you are dealing with files - use \A and \Z or their equivalents.

[1] Both interpretations have their merits. If you transmit text over a serial connection, it is useful to have a newline as line terminator so that you know when you have received a complete line. If you put text into text files, it might arguably be easier to look at a newline as a line separator because then you cannot have an invalid last line. On the other hand, having line terminators in text files allows you to detect incompletely written lines.
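
For Python specifically, the string-versus-line distinction can be sketched as below. One caveat worth labelling: Python's `\Z` is the strict end-of-string anchor (what Perl and PCRE spell `\z`), so that is the "equivalent" to reach for there. The sample string is made up for illustration.

```python
import re

text = "cat\n"

# `$` without re.MULTILINE still matches just before a single trailing newline.
dollar = re.search(r"cat$", text)

# `\Z` in Python matches only at the very end of the string.
strict = re.search(r"cat\Z", text)

# re.fullmatch anchors both ends of the string without any anchor characters.
full = re.fullmatch(r"cat", text)

print(bool(dollar), bool(strict), bool(full))  # True False False
```

If the whole input has to match, `re.fullmatch` is probably the least surprising choice, since it does not depend on remembering which anchor spelling a given engine uses.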

homakov, about 1 year ago

This led to a few serious bugs in Ruby-based apps. Always use \A\z

https://homakov.blogspot.com/2012/05/saferweb-injects-in-various-ruby.html

https://sakurity.com/blog/2015/02/28/openuri.html

https://sakurity.com/blog/2015/06/04/mongo_ruby_regexp.html

somat, about 1 year ago

Structural regexes as found in the sam editor are an obscure but well engineered regex engine. I am far from an expert but my main takeaway from them is that most regex engines have an implied structure built around "lines" of text. While you can work around this, it is awkward. Structural regexes allow you to explicitly define the structure of a match, that is, you get to tell the engine what a "line" is.

http://man.cat-v.org/plan_9/1/sam

vitiral, about 1 year ago

In Lua it's only the start/end of the string

> A pattern is a sequence of pattern items. A caret '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.

https://www.lua.org/manual/5.3/manual.html#6.4.1

Lua's pattern matching is much simpler than regexes though.

> Unlike several other scripting languages, Lua does not use POSIX regular expressions (regexp) for pattern matching. The main reason for this is size: A typical implementation of POSIX regexp takes more than 4,000 lines of code. This is bigger than all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 500 lines.

https://www.lua.org/pil/20.1.html

librasteve, about 1 year ago

I am surprised that the OP does not include perl5 in their table.

In Raku (aka Perl 6), regexes were reinvented by Larry Wall (the creator of Perl, which made perlRE the de facto regex standard).

Here's what he does with $ (https://docs.raku.org/language/regexes#Start_of_string_and_end_of_string):

* The $ anchor only matches at the end of the string
* The $$ anchor matches at the end of a logical line. That is, before a newline character, or at the end of the string when the last character is not a newline character.

aftbit, about 1 year ago

Wait, in non-multiline mode, it only matches _one_ trailing newline? And not any other whitespace, including \r or \r\n? That is indeed surprising behavior. Why? Why not just make it end of string like the author expected?

    >>> import re
    >>> bool(re.search('abc$', 'abc'))
    True
    >>> bool(re.search('abc$', 'abc\n'))
    True
    >>> bool(re.search('abc$', 'abc\n\n'))
    False
    >>> bool(re.search('abc$', 'abc '))
    False
    >>> bool(re.search('abc$', 'abc\t'))
    False
    >>> bool(re.search('abc$', 'abc\r'))
    False
    >>> bool(re.search('abc$', 'abc\r\n'))
    False

jewel, about 1 year ago

This has security implications! Example exploitable Ruby code:

    unless person_id =~ /^\d+$/
      abort "Bad person ID"
    end
    sql = "select * from people where person_id = #{person_id}"

In addition to injection attacks, this can also bite people when parsing headers, where a bad header is allowed to sneak past a filter.
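
A related trap exists in Python's `re`, although there `^` and `$` are not line anchors unless `re.MULTILINE` is set, so the hole is narrower than in Ruby: only a single trailing newline gets through. A rough sketch follows; `is_person_id_loose` and `is_person_id_strict` are hypothetical names invented for this illustration, not part of any library.

```python
import re

def is_person_id_loose(value: str) -> bool:
    # `$` also matches before a trailing "\n", so "42\n" is accepted.
    return re.match(r"^\d+$", value) is not None

def is_person_id_strict(value: str) -> bool:
    # fullmatch requires the entire string to match the pattern.
    return re.fullmatch(r"\d+", value) is not None

for candidate in ["42", "42\n", "42\n; drop table people"]:
    print(repr(candidate), is_person_id_loose(candidate), is_person_id_strict(candidate))
```

Validation aside, building SQL by string interpolation is the deeper problem in the snippet above; a parameterized query would sidestep it regardless of which anchor is used.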

hans_castorp, about 1 year ago

Fun fact: in Postgres, 'cat\n' matches 'cat$' when the so-called "weird" newline matching is enabled :)

https://www.postgresql.org/docs/current/functions-matching.html#POSIX-EMBEDDED-OPTIONS-TABLE

masswerk, about 1 year ago

As for the good old reference implementation (not "Parameter Efficient Reinforcement Learning"):

    my $string = "cat\n";
    /cat$/s  -> true
    /cat\Z/s -> true
    /cat\z/s -> false

nebulous1, about 1 year ago

The fact that there are so many different peculiarities in different regex systems has always raised the hairs on the back of my neck. As in when a tool accepts a regex and I have to trawl the manual to find out exactly what regex is acceptable to it.

pjc50, about 1 year ago

Special misery case: Visual Studio supports regex search, where '$' matches \n.

The end of line character is usually the standard Windows \r\n.

Yes, that means if you want to really match the end of line you have to match "\r$". So broken.
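
Python's `re` has a similar CRLF wrinkle in multiline mode, since `$` there only knows about `\n`. A small sketch with a made-up two-line string:

```python
import re

crlf_text = "foo\r\nbar\r\n"

# In multiline mode `$` matches before "\n", so the "\r" sits between the word and `$`.
naive = re.findall(r"\w+$", crlf_text, re.MULTILINE)

# Allowing an optional "\r" before the line end (inside a lookahead) recovers both words.
tolerant = re.findall(r"\w+(?=\r?$)", crlf_text, re.MULTILINE)

print(naive)     # []
print(tolerant)  # ['foo', 'bar']
```

Splitting the text with `str.splitlines()` first and matching each line separately is another way around it, at the cost of no longer matching across lines.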

wruza, about 1 year ago

> By default, '$' only matches at the end of the string and immediately before the newline (if any) at the end of the string.

The rationale was probably "it should be easier to match input strings" and now it's harder for everyone.

gorjusborg, about 1 year ago

If you really want to learn regex, you'll have a hard time piecing it all together via blog posts.

Jeffrey Friedl's Mastering Regular Expressions is a good book to read if you want to stop being surprised/lost.

I'll admit I stopped at the dive into DFA/NFA engine details.

m0rissette, about 1 year ago

Why isn’t Perl anywhere on that chart when mentioning regex?

ghusbands, about 1 year ago

> Note: The table of data was gathered from regex101.com, I didn't test using the actual runtimes.

Has anyone confirmed this behaviour directly against the runtimes/languages? Newlines at the end of a string are certainly something that could get lost in transit inside an online service involving multiple runtimes.
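
For the Python row at least, the behaviour is easy to check against the interpreter rather than regex101. A minimal sketch that prints whether each pattern matches each input, with and without `re.MULTILINE`; the `check` helper and the chosen patterns and inputs are hypothetical, not taken from the article.

```python
import re

PATTERNS = [r"cat$", r"cat\Z"]
INPUTS = ["cat", "cat\n", "cat\n\n"]

def check(pattern: str, subject: str, flags: int = 0) -> bool:
    """Return True if `pattern` is found anywhere in `subject`."""
    return re.search(pattern, subject, flags) is not None

for pattern in PATTERNS:
    for subject in INPUTS:
        default = check(pattern, subject)
        multiline = check(pattern, subject, re.MULTILINE)
        print(f"{pattern!r:10} {subject!r:10} default={default} multiline={multiline}")
```

The same loop could be ported to other runtimes to confirm their rows; this only covers CPython's `re`.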

ikiris, about 1 year ago

this is mostly due to the different types of regex and less about it being platform dependent. $ was end of string in pcre which is the "old" perl compatible regex. python has its own which has quirks as mentioned, re2 is another option in go for example, and i think rust has its own version as well iirc.

javier_e06, about 1 year ago

I would hold a code review hostage if any file does not end with an empty new line.

My reasoning would be that if the file is transmitted and gets truncated, nobody would know for sure if it does not end with a newline. Brownie points if the code ends with a comment saying that the file ends there.

The article calls computer languages platforms, but they are computer languages. Bash is not included. Weird. I believe the most common use of regular expressions is grep or egrep with bash or some other shell but, who knows. Maybe I am hanging with the wrong crowd.

weinzierl, about 1 year ago

The table in the article makes this look complicated, but it really isn't. All the cases in the article can be grouped into two families:

- The JS/Go/Rust family, which treats $ like \z and does not support \Z at all
- The Java, .NET, PHP, Python family, which treats $ like \Z and may or may not (Python) support \z.

\Z does away with \n before the end of the string, while \z treats \n as a regular character. For multiline $ the distinction doesn't matter, because \n is the end.

Really the only deviation from the rule is Python's \Z, which is indeed weird.

pksebben, about 1 year ago

Regex would really benefit from a comprehensive industrial standard. It's such a powerful tool that you have to keep relearning whenever you switch contexts.

Scubabear68, about 1 year ago

In 30 years of developing software I don’t think I ever used multi-line regexp even once.

Existing4190, about 1 year ago

The perlre Metacharacters documentation states:

> $ Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)

(/m enables multiline mode)

frou_dh, about 1 year ago

Something I found really surprising about Python's regexp implementation is that it doesn't support the typical character classes like [:alnum:] etc.

It must be some kind of philosophical objection because there's no way something with as much water under the bridge as Python simply hasn't got around to it.
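
As a workaround sketch (not an official equivalent), the usual substitutes in Python's `re` are `\w`, `\d`, `\s` and hand-written sets. For `[:alnum:]`, one common trick is `[^\W_]`, meaning word characters minus the underscore. Note that this is Unicode-aware by default, and `re.ASCII` narrows it. The sample string is made up.

```python
import re

sample = "foo_bar-42 na\u00efve"

# [^\W_] = "not a non-word character and not underscore", i.e. letters and digits.
unicode_alnum = re.findall(r"[^\W_]+", sample)

# With re.ASCII, \W treats non-ASCII letters as non-word characters.
ascii_alnum = re.findall(r"[^\W_]+", sample, re.ASCII)

print(unicode_alnum)  # ['foo', 'bar', '42', 'naïve']
print(ascii_alnum)    # ['foo', 'bar', '42', 'na', 've']
```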

mmh0000, about 1 year ago

> So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.

I would argue this is correct behavior, a "line" isn't a "line" if it doesn't end with \n. [1]

> 3.206 Line - A sequence of zero or more non-<newline> characters plus a terminating <newline> character.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

smlacy, about 1 year ago

It's easy to get the canonical answer:

    $ man pcre2syntax

Where you'll find the following block under ANCHORS AND SIMPLE ASSERTIONS:

    $           end of subject
                  also before newline at end of subject
                  also before internal newline in multiline mode

So all the cases of "newline at/before end of subject" are covered here. Then, the question becomes "what is a subject?" Is it line-by-line? Are newlines included? What if we want multiline matching? That's where re.MULTILINE comes from: it's not "multiline matching" (sort of), it's "what is the subject of the regular expression that we're matching against".

pmarreck, about 1 year ago

The results did not surprise me. The fact that everyone is in agreement that "cat$" matches "cat" and not "cat\n" if multiline is off did not surprise me. \n is implicitly a multiline-contextual character to me. In other words, if you didn't have any \n, you'd just have an array of lines (without linefeeds), same as if you were reading lines from a file one at a time or splitting a binary on \n.

The other results that differ across engines seem to be because people either don't understand regex or because the POSIX description of how to deal with such an input and config was ill-defined.

cpeterso, about 1 year ago

$ is the regex’s “the buck stops here” symbol. Here at the end of the line. :)

AtNightWeCode, about 1 year ago

There are many differences between implementations of regex. To name a few: lookbehind, atomic groups, named capturing groups, recursion, timeouts and my favorite interop problem, Unicode.

Izmaki, about 1 year ago

The new-line character is an actual character "at the end" of the string though so it makes sense that $ would include the new-line character in multi-line matching.

wodenokoto, about 1 year ago

> So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.

A reproducible example would be nice. I don’t understand what it is he cannot do. `re.search('$', 'no new lines')` returns a match.

febeling, about 1 year ago

Seriously, just write one unit test for your regex.
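
A sketch of what such a test might look like in Python, assuming a hypothetical `USERNAME_RE` pattern (both the name and the pattern are invented for illustration). The useful part is including the trailing-newline input, which is the case that tends to get forgotten.

```python
import re
import unittest

# Hypothetical pattern under test; \Z is Python's strict end-of-string anchor.
USERNAME_RE = re.compile(r"\A[a-z0-9_]+\Z")

class UsernameRegexTest(unittest.TestCase):
    def test_accepts_plain_name(self):
        self.assertIsNotNone(USERNAME_RE.search("alice_01"))

    def test_rejects_trailing_newline(self):
        # With `$` in place of `\Z` this assertion would fail.
        self.assertIsNone(USERNAME_RE.search("alice_01\n"))

    def test_rejects_embedded_newline(self):
        self.assertIsNone(USERNAME_RE.search("alice\nbob"))
```

Saved to a file, this can be run with `python -m unittest <file>`.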

silent_cal, about 1 year ago

I think there's a big opportunity to re-write Regex as a SQL-type language. It's too bad I don't feel like trying.

nurtbo, about 1 year ago

Totally get the desire, but it also feels like the last two paragraphs are solvable with something like:

    re.match(pattern, text).group().rstrip("\n")

croes, about 1 year ago

Isn't a string with a newline character automatically multiline?

The new line is just empty but not the first line anymore.

menacingly, about 1 year ago

Of course it’s line. How could it be the end of the string when the matter at hand is defining the string?

ary, about 1 year ago

Was any regex documentation unclear on this? Some libraries have modes that change the semantics of ^ and $ but I’ve always found their use to be rather clear. It’s the grouping and look ahead/behind modifiers that I’ve always found hard to understand (at times).

nunez, about 1 year ago

You can also use (?m) to enable multiline processing on PCRE-compatible regexp engines.
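
Python's `re` accepts the same inline flag, as long as it sits at the very start of the pattern (recent versions reject global flags placed elsewhere). A quick sketch with a made-up string:

```python
import re

text = "cat\ndog\n"

# Without the flag, `$` only matches at the end (or before the final newline).
default = re.findall(r"^\w+$", text)

# (?m) is the inline spelling of re.MULTILINE: ^ and $ now work per line.
inline = re.findall(r"(?m)^\w+$", text)

print(default)  # []
print(inline)   # ['cat', 'dog']
```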

mdavid626, about 1 year ago

Is this a bug?

humanlity, about 1 year ago

Interesting

user2342, about 1 year ago

I'm confused by this blog-post. In the table: what is the reg-ex pattern tested and against which input?

raldi, about 1 year ago

Cmd-F perl

no matches

1letterunixname, about 1 year ago

Ugh. Whenever I hear people talk about regular expressions as a singular language or standard, I die a little inside.

PSA: Regex security is particular to each implementation flavor. Please know the nuances of a particular kind and be unambiguously precise.

k3vinw, about 1 year ago

Another poor soul trying to solve one problem using regex and now they have two… ;)

callwhendone, about 1 year ago

it's end of line right?

michaelcampbell, about 1 year ago

...IN PYTHON

teknopaul, about 1 year ago

TL;DR: $ does not mean end of string in Python.