C Strings and my slow descent to madness

145 points by Decabytes, about 2 years ago

42 comments

nathell, about 2 years ago

> Our last function is strcmp. It looks at two strings and determines whether they are equal to each other or not. If they are it returns 0. If they aren’t it returns 1.

No it doesn’t.

<pre><code>RETURN VALUES
    The strcmp() and strncmp() functions return an integer greater than,
    equal to, or less than 0, according as the string s1 is greater than,
    equal to, or less than the string s2. The comparison is done using
    unsigned characters, so that ‘\200’ is greater than ‘\0’.</code></pre>
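
A minimal sketch of what the man page excerpt implies in practice: only the sign of the result is meaningful, so equality should be tested against 0 rather than assuming the "not equal" value is 1. The helper names `compare_sign` and `strings_equal` are hypothetical and not part of any library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: collapse strcmp's result to -1, 0, or 1.
   strcmp itself only promises a negative, zero, or positive value. */
int compare_sign(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}

/* Hypothetical helper: equality is "strcmp returned 0", nothing more. */
int strings_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Writing `if (strcmp(a, b) == 1)` may appear to work on one platform and fail on another, because the magnitude of a non-zero result is unspecified.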

stephc_int13, about 2 years ago

If you are using C and doing non-trivial work with strings, you should either use a good library to handle strings or build your own.

It is not that difficult in practice.

The old C std lib is, in my opinion, outdated, obsolete and a very bad fit for complex string handling, especially on the memory management side.

In my own framework, the string management module uses a dedicated memory allocator and a "high level" string API with full UTF8 support from the start.

As a general rule, I think that the C std lib is the weakest part of the C language and it should only be used as a fallback.

nneonneo, about 2 years ago

Pop quiz: which of these is safe, given "char buf[80]" and arbitrary user input in argv[1]?

<pre><code>gets(buf);
scanf("%s", buf);
strcpy(buf, argv[1]);
scanf("%80s", buf);
strncpy(buf, argv[1], 80);
snprintf(buf, 80, argv[1]);</code></pre>

----

The delightful answer is none of them. The first three have no bounds checking at all, meaning that they will happily overflow the buffer to an arbitrary extent (gets, at least, will usually trigger a warning on modern compilers). The next two have off-by-one errors: scanf will write a NUL byte out of bounds (and that's exploitable! <a href="https://googleprojectzero.blogspot.com/2014/08/the-poisoned-nul-byte-2014-edition.html" rel="nofollow">https://googleprojectzero.blogspot.com/2014/08/the-poisoned-...</a>) while strncpy will fail to NUL-terminate the string. The last one uses the right buffer length, but treats user input as a format string and can leak memory contents or produce arbitrary memory corruption with the %n format specifier.

C string handling practically invites off-by-one errors and horrible security practices out-of-the-box.
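
For contrast, here is a rough sketch of one way the last call in the quiz could be written so that user input is treated as data and truncation is reported. The helper name `copy_arg` is hypothetical; the only library call it relies on is the standard `snprintf`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy src into dst (capacity dst_size bytes).
   The fixed "%s" format means src is never interpreted as a format string.
   Returns 0 if the whole string fit, -1 if it was truncated or on error. */
int copy_arg(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    int n = snprintf(dst, dst_size, "%s", src);
    return (n >= 0 && (size_t)n < dst_size) ? 0 : -1;
}

/* For scanf, the field width excludes the terminating NUL, so an
   80-byte buffer needs "%79s", not "%80s". */
```

A caller would write something like `copy_arg(buf, sizeof buf, argv[1])` and treat a return of -1 as an error instead of silently continuing with a truncated value.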

marcodiego, about 2 years ago

> If we try to print out some Japanese characters… [] The output isn’t what we expect.

Yes it is. And I bet on a modern Windows version it is too. The terminal has been (probably intentionally) neglected by MS for a long time, but as far as I know this has mostly been fixed on modern Windows versions.

EDIT: The author admits it later in the text: "will be fixed in Windows 11 and Windows Server 2022".

Also, it says "strlen("有り難う")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s." It says nothing about "number of characters".
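
To make the bytes-versus-characters distinction concrete, here is a small sketch that counts code points in a UTF-8 string next to what strlen reports. The helper name `utf8_code_points` is hypothetical, and the sketch assumes the input is already valid UTF-8; it counts code points, not grapheme clusters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a valid UTF-8 string by
   skipping continuation bytes, which all have the bit pattern 10xxxxxx. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Each of the four characters in 有り難う takes three bytes in UTF-8, which is where the 12 from strlen comes from, while this helper would report 4.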

userbinator, about 2 years ago

Well-written C tends to minimise string usage in general, preferring to convert to another format as soon as possible. Allocating, copying, and passing around strings in large quantities is not a good idea for efficiency, but of course some people coming from other HLLs seem to try to do it anyway, which causes many other problems.

zh3, about 2 years ago

I once got called in to fix an SS7 stack suffering from poor performance. It was pretty well written, and it was not obvious at first sight why it was going slow. Most of it was low-level bit fiddling, plus some small strncpy() calls, generally about 8 chars or so.

It didn't take that long to profile (well, with printfs, as no profiler was available) and figure out it was the strncpy() calls causing the problem, but why? Well, there was a handy 8 megabyte buffer used as working memory, and the strings were being copied into it for modification.

From the strncpy() man page:

> If the length of src is less than n, strncpy() pads the remainder of dest with null bytes.

Ah. So every little strncpy was essentially copying the string then zeroing out 7,999,992 bytes. And there were lots of little strncpy's...
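
The padding behaviour quoted from the man page is easy to observe. The following sketch poisons a buffer, calls strncpy, and counts how many NUL bytes were written; the helper name `strncpy_padding` is hypothetical and exists only for this demonstration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: run strncpy(dst, src, n) on a poisoned buffer and
   return how many of the n bytes ended up as NUL. */
size_t strncpy_padding(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memset(dst, 'x', n);          /* poison so overwritten bytes are visible */
    strncpy(dst, src, n);         /* copies src, then pads the rest with NULs */
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++) {
        if (dst[i] == '\0')
            zeros++;
    }
    return zeros;
}
```

With an 8-character source and a 64-byte destination, 56 bytes are zeroed; scale the destination up to 8 megabytes and each call pays for nearly the whole buffer, which matches the slowdown described above.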

simonblack, about 2 years ago

"We're not in Kansas any more, Toto"

Or to paraphrase that: "We're not in Python any more, and C is not Python".

You know what sends me insane? Indentation and lack of fixed types in Python. But I don't have problems with C strings. Because I have grown to love and know C's string foibles, just like the author will certainly not be driven insane by 'Python's shortcomings according to me'.

The world is full of people who complain that something or other is different from what they know, so that 'other' is wrong. That's just being isolationist. Everything has its own advantages, its own disadvantages. Let's accept that and move on, instead of making mountains out of mole-hills.

gavinhoward, about 2 years ago

Okay, I agree that by default, C strings are bad.

But it doesn't have to stay that way. Someone else in the comments mentioned antirez's sds library for dynamic strings. This works, but you could also easily roll your own. All you need is an init function, and perhaps an assert or other check at the end of it that the string has a nul terminator.

At that point, type checking will let you blindly pass those strings (or their char arrays) to any of those C functions without worry.

Edit: I'll also add that I think a string library should distinguish between static strings and string builders (dynamic strings). It makes everything easier.

bluetomcat, about 2 years ago

In well-written C, you don't work with strings the way you do in other HLLs. For example, extracting and copying substrings is unnecessary, unless you want to modify the parent string. Otherwise, a substring is represented by a pointer and a size_t length, and can easily be printed that way via the "%.*s" printf specifier:

<pre><code>const char *s = "Hello World!";
const char *world = s + 6;
size_t world_len = 5;
/* "%.*s" expects the precision as an int, so cast the size_t length. */
printf("%.*s\n", (int)world_len, world);</code></pre>

flohofwoe, about 2 years ago

This is from a C fan: If you are going to do any string-heavy work, please use anything other than C (Python is pretty nice for this sort of stuff, for instance).

And if you need to use C anyway, then please use anything other than the string functions from the standard library. The C stdlib is (mostly) a leftover from the K&R era, when opinions about what makes a good API were very different from today and C was a much 'harsher' language.

C is pretty nice for a lot of things, but working with strings definitely isn't one of them.

_benj, about 2 years ago

With the woes of string.h being known, why not just use an alternative like <a href="https://github.com/antirez/sds">https://github.com/antirez/sds</a>?

I’ve also been having a blast with C because writing C feels like being a god! But the biggest thing that I like about C is that the world is sort of written in it!

Just yesterday I needed to parse some JSON… found a bunch of libraries that do that and just picked one whose API I liked.

benmmurphy, about 2 years ago

`strlcpy` is the function you probably want, but again it is not standard. <a href="https://lwn.net/Articles/507319/" rel="nofollow">https://lwn.net/Articles/507319/</a>

I think the reason people don't want to standardise this kind of function is that it often gives the wrong behaviour. For example, if you are trying to copy a string into a fixed buffer and it's too long, then truncating it is often an error or potentially even a security bug. So these functions generally do the 'wrong' thing even though they are 'safer'. If you are dealing with static buffers, then I think you should be explicitly checking that the source fits in the target and then handling the error case. You could even have a function like `strlcpy` that does `strlen`, checks if it fits, then does the copy or returns an error code. Alternatively, if the string should always fit and you don't want to handle the error case, then the safe thing to do is check at runtime that it fits and abort the program if it doesn't.
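
The "check that it fits, then copy or return an error" idea could look roughly like the sketch below. The name `copy_checked` and its return convention are hypothetical; it is not strlcpy and does not truncate.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy src into dst only if it fits, including the
   terminating NUL. Returns 0 on success, or -1 if src is too long, in
   which case dst is left untouched. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);
    if (len >= dst_size)
        return -1;
    memcpy(dst, src, len + 1);    /* len + 1 copies the NUL as well */
    return 0;
}
```

For the "should always fit" case, a caller could wrap this as `if (copy_checked(buf, sizeof buf, s) != 0) abort();` so that an oversized input stops the program instead of being silently cut short.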

tragomaskhalos, about 2 years ago

K&R contains this beautiful koan-like string copy code:<pre><code> while (*t++ = *s++) ; </code></pre> Honestly the elegance of this thing was one of the hooks that made me fall in love with C. But this was from a now-forgotten age of innocence, as there are so many "nopes" around this line-and-a-half that one would, rightly, be tarred and feathered for ever putting it in a program today.

habibur, about 2 years ago

I don't use null-terminated strings. It's a ptr+len struct everywhere. And when I need to call an API, like fopen, I make a temporary copy of that string plus the null terminator, do my work and then free it.

You can printf non-null-terminated strings too. Check printf("%.*s", length, strptr).
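
A rough sketch of the ptr+len approach described here, with a temporary NUL-terminated copy for APIs that need one. The type `struct str_view` and both function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: a non-owning view of len bytes, not NUL-terminated. */
struct str_view {
    const char *ptr;
    size_t len;
};

/* Make a temporary NUL-terminated copy for APIs such as fopen.
   Returns NULL on allocation failure; the caller frees the result. */
char *str_view_to_cstr(struct str_view v)
{
    assert(v.ptr != NULL);
    char *copy = malloc(v.len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, v.ptr, v.len);
    copy[v.len] = '\0';
    return copy;
}

/* Print a view. "%.*s" takes the precision as an int, so this sketch
   assumes v.len fits in an int. Returns what fprintf returns. */
int str_view_print(FILE *out, struct str_view v)
{
    assert(out != NULL && v.ptr != NULL);
    return fprintf(out, "%.*s", (int)v.len, v.ptr);
}
```

Typical use would be `char *path = str_view_to_cstr(name);`, followed by a NULL check, the `fopen(path, "r")` call, and `free(path)` once the file is open.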

kevin_thibedeau, about 2 years ago

wchar_t is a massive landmine that should never be used since its size varies by platform. The locale of the compiler has to match the end user for L prefixed strings to work correctly. Likewise char16_t and char32_t are just swimming against the easy path at this point. You're much better off sticking to UTF-8 and using the C11 u8 prefix on literals so you can use the regular string API and never have to worry about locale settings.

Dwedit, about 2 years ago

> "But for real if anyone knows how to get this to work on Windows 10 let me know!"

Since the May 2019 update, Windows 10 has supported declaring the code page in a manifest file.

In Visual Studio, you must add "/utf-8" to the compiler command line; this makes it parse the source code as a UTF-8 file and output UTF-8 string literals.

To make console output work, call the Win32 function "SetConsoleOutputCP(65001);"

To get support for opening files with names that aren't in your system codepage:

* Create a manifest file as shown in <a href="https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page" rel="nofollow">https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page</a>
* Add this as an "Additional Manifest File" in Visual Studio project settings for the manifest tool

Additionally, there is an undocumented NTDLL function "RtlInitNlsTables" that sets the code page for the process. It is difficult to use without a lot of example code, but some app locale type tools (used to change locale for a process) make use of this function.

gpderetta, about 2 years ago

The worst part of C strings is that they tend to show up in APIs (especially system calls). This makes interoperability with other languages harder than it should m

mahoho, about 2 years ago

Just a pedantic comment, but 有り難う is arigatou or roughly "thanks", not "hello". Hello would usually be こんにちは or, more confusingly, 今日は

jmclnx, about 2 years ago

Yes, this is something to get used to. The BSDs created strlcpy(3) and wcslcpy(3)

<a href="https://man.openbsd.org/strlcpy.3" rel="nofollow">https://man.openbsd.org/strlcpy.3</a>

<a href="https://man.openbsd.org/wcslcpy.3" rel="nofollow">https://man.openbsd.org/wcslcpy.3</a>

which to me will help with some of these issues. Too bad other operating systems do not have these. On Linux there is libbsd to get these, but I would like to see them added to the stdc.

Instead the C23 standard is messing with realloc(3), which could break some old programs. I have not looked at that in detail yet, so maybe it is a non-issue :)

russellbeattie, about 2 years ago

Literally 25 years ago I was a beginner programmer and tried writing a .dll for Microsoft's Internet Information Server, which was relatively new at the time. (I hadn't so much as seen a Unix-based OS at the time, let alone understood CGI). C strings were mind boggling and frustrated me so much I simply gave up. Happily around the same time, MS introduced Active Server Pages and I was able to use that and never messed with C again. It's amazing the same issues still exist decades later.

coldpie, about 2 years ago

It's unfortunate the author put the arrays-are-pointers thing so early in the doc, as that's a very beginner-to-C mixup and really nothing at all to do with strings. Otherwise, yep. It's pretty bad. C is a great language, but its string handling is definitely garbage. You get used to it pretty quick, and it's not hard to write a handful of sane wrappers or a simple string library for your own use, but the standard library's terrible string functions are an unending source of bugs.

kens, about 2 years ago

It's strange that computer programmers think of themselves as being on the cutting edge of technology, but then we use a language that is over 50 years old. Of course there are going to be lots of problems with C strings since they were designed for a totally different world (no Unicode, no security issues, memory was precious, etc). The hardware is a million times as powerful but the software environment improves at a glacial pace.

Night_Thastus, about 2 years ago

It's important to mention that strncpy (and also strncpy_s) is really not a strcpy replacement; it's not intended for the same uses. The name is a total misnomer. Do not use strncpy that way!

In any case, strcpy_s (which is a good replacement for strcpy) is part of the C11 standard. I'm confused how that isn't considered portable.

bobajeff, about 2 years ago

I've been wondering lately why many people write c in c++ rather than just c. I think this might be the reason.

pixelbeat__, about 2 years ago

String handling in C has many gotchas indeed. Here are some of my notes on the subtleties: <a href="https://www.pixelbeat.org/programming/gcc/string_buffers.html" rel="nofollow">https://www.pixelbeat.org/programming/gcc/string_buffers.htm...</a>

bruce343434, about 2 years ago

Meh. The wchar_t stuff is barely C's fault. You use wide (constant width) characters, then set the terminal encoding to UTF-8 (a variable-length encoding). What did you expect? It's a Windows issue. I can copy-paste all sorts of UTF-8 into "normal" string literals, printf and puts them, and it just works in my terminal.

RE counting characters: this is a whole can of worms. Do you want to count grapheme clusters? Code points? Anything other than just the number of bytes? Use a Unicode library.

The latter part of this article is a bit like those articles that make fun of JavaScript for having floating point numbers behave like, gasp, floating point numbers.

kloch, about 2 years ago

> At first this looks great, but there is a problem. What happens when the source string minus the null terminator is as long as the size of the destination string? The answer is that the destination gets filled with all the characters of the source string with no room left for the null terminator.

The 'n' in strncpy is mainly there to help you avoid overrunning the destination; it does not guarantee that whatever ends up in there is null-terminated.

This is why you should always explicitly set the last byte to zero after using strncpy (and never ever use strcpy).

<pre><code>char dest[16];
strncpy(dest, src, 15);
dest[15] = 0;</code></pre>
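
The pattern above can be wrapped in a small function so the "minus one, then terminate" step is not repeated by hand at every call site. This is only a sketch: the name `copy_truncating` is hypothetical, and it truncates silently, which may or may not be what a given program wants.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper: copy at most dst_size - 1 characters with strncpy,
   then terminate explicitly so dst is always a valid C string. */
void copy_truncating(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return;                       /* no room even for the NUL */
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

Without the explicit terminator, `strncpy(raw, "abcd", 4)` on a 4-byte array leaves no NUL byte anywhere in the array, so passing it to strlen or printf("%s") afterwards would read past the end.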

djha-skin, about 2 years ago

Related and good read about strcpy in the kernel: <a href="https://lwn.net/Articles/905777" rel="nofollow">https://lwn.net/Articles/905777</a>

tmsln, about 2 years ago

Beej's guide to C programming is very helpful: <a href="https://beej.us/guide/bgc/html/split/unicode-wide-characters-and-all-that.html#unicode-wide-characters-and-all-that" rel="nofollow">https://beej.us/guide/bgc/html/split/unicode-wide-characters...</a>

js2, about 2 years ago

Programs which have to deal with C strings beyond the bare minimum that libc provides will generally have a set of routines for making it more ergonomic, e.g.: <a href="https://github.com/git/git/blob/master/strbuf.h">https://github.com/git/git/blob/master/strbuf.h</a>

photochemsyn, about 2 years ago

For initial string input, i.e. from a network/file/terminal stream, using fgetc and/or fgets plus code to verify and sanitize makes the most sense IMO.

This does mean you have to write a lot of C code for what would be simple tasks in other languages, e.g. a correct file open, read-to-dynamically-allocated-memory, and file close with good error checking is a full page (at least) of dense code in C and just two lines in Python.

If you've done a good job sanitizing and verifying all the input to your program, only then does it become relatively safe to use the standard string functions, with caveats for multithreading.

Asking ChatGPT to compare and contrast fgetc and fgets is a good place to start, and then ask how to use fgets to handle errors during stream I/O, and what can go wrong with multithreading etc. Then take a look at the sqlite source code for in-house C-string handling. Here's the take-away comment:

"Because there is no consistency, we will define our own."

<a href="https://github.com/sqlite/sqlite/blob/master/src/util.c">https://github.com/sqlite/sqlite/blob/master/src/util.c</a>
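
As a small illustration of the fgets-plus-verification idea, the sketch below reads one line and reports whether it was complete. Both helper names are hypothetical, and real input handling would need more checks (encoding, embedded NUL bytes, length limits) than this shows.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: strip the trailing newline that fgets keeps.
   Returns 1 if a newline was found (a complete line), 0 otherwise. */
int chomp_line(char *buf)
{
    assert(buf != NULL);
    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';
        return 1;
    }
    return 0;
}

/* Hypothetical helper: read one line into buf (capacity size bytes).
   Returns 1 for a complete line, 0 if the line did not fit in buf or the
   input ended without a newline, and -1 on end of file or read error. */
int read_line(FILE *in, char *buf, int size)
{
    assert(in != NULL && buf != NULL && size > 0);
    if (fgets(buf, size, in) == NULL)
        return -1;
    return chomp_line(buf);
}
```

A return of 0 from `read_line` is the cue to either reject the input or keep reading the rest of the over-long line, instead of treating the fragment as a whole line.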

Tarragon, about 2 years ago

> So how can we handle this case safely? There are a few ways I can think of.

strdup:

> The strdup() function returns a pointer to a new string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3).

dsvr, about 2 years ago

There's a framework for C now at <a href="https://vely.dev" rel="nofollow">https://vely.dev</a> which may help with C strings safety and memory management, among other things.

ranger_danger, about 2 years ago

> By default, Windows PowerShell .lnk shortcut is hardcoded to use the "Consolas" font

Surely this is not the case for Japanese versions of Windows (or users with Japanese set as their display language)?

TheRealPomax, about 2 years ago

How to make C safe: by putting it back in the box and putting the box back on the shelf and then closing the door to the garage. Then using a safe-by-design language.

jeffrallen, about 2 years ago

Programming in Go made me a better C programmer, because now I no longer use C strings, only a buffer/length/capacity struct.

mirkodrummer, about 2 years ago

Any good resource around with all the common C pitfalls and relative solutions?

teddyh, about 2 years ago

> strcmp takes two strings and returns 0 when they are true.

ITYM “equal”, not “true”.

analog31, about 2 years ago

As a cellist, I was about to sympathize when I read the title.

pipeline_peak, about 2 years ago

Handling strings in C was enough for me to choose C++…

FpUser, about 2 years ago

C strings are bad for sure. Consider those raw assembly. Instead of using it directly get some decent string library ASAP and use it exclusively.

infradig, about 2 years ago

I stopped when I read strcmp returns 0 if two strings are equal and 1 if they aren't.
