TechEcho

16 comments

kazinator over 6 years ago

The actions of strtok can easily be coded using strspn and strcspn.

https://groups.google.com/forum/message/raw?msg=comp.lang.c/ZhXAlw6VZsA/_Y5evTIkf6kJ [2001]

https://groups.google.com/forum/message/raw?msg=comp.lang.c/ff0xFqRPH_Y/Cen0mgciXn8J [2011 repost]

strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.

The bag is like a one-character regex class; that is to say, strspn(s, "abcd") is like calculating the length of the token at the front of input s matching the regex [abcd]*, and in the case of strcspn, that becomes [^abcd]*.
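As a rough illustration of the idea above, here is a minimal sketch of a non-destructive tokenizer built from strspn and strcspn. It is not the code from the linked posts; the helper name `next_token` and its signature are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption, not from the linked posts).
 * Finds the next token in *cursor without modifying the input.
 * Returns a pointer to the token start and stores its length in *len,
 * or returns NULL when no tokens remain. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    const char *s = *cursor;

    s += strspn(s, delims);      /* skip leading delimiters: [delims]* */
    if (*s == '\0') {
        *cursor = s;
        return NULL;
    }
    *len = strcspn(s, delims);   /* token length: [^delims]* */
    *cursor = s + *len;          /* resume after this token next time */
    return s;
}
```

A caller would initialize a `const char *` cursor to the start of the input and call `next_token` in a loop until it returns NULL. Because the state lives in the caller's cursor and nothing is written to the input, this variant works on string literals and has no hidden static state.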

jstimpfle over 6 years ago

strtok is one of the silliest parts of the standard library (and there are many bad ones). It's broken. It's not thread-safe (yes, there is strtok_r). It's needlessly hard to use. And it writes zeros to the input array. The latter means it's unfit for most use cases, including non-trivial tokenization where you want, e.g., to split "a+1" into three tokens.

If you program in C, please just write those four obvious lines yourself.
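To make the "a+1" case concrete, here is one possible hand-written scanner. It is only a sketch under assumed token rules (a run of alphanumerics is one token, any other non-space character is a token by itself), and the function names `token_length` and `count_tokens` are hypothetical, not anything the commenter specified.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical scanner: returns the length of the token at the start of s,
 * or 0 at end of input. A run of alphanumerics is one token; any other
 * character is a single-character token. The input is not modified. */
size_t token_length(const char *s)
{
    size_t n = 0;

    if (s[0] == '\0')
        return 0;
    if (!isalnum((unsigned char)s[0]))
        return 1;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Counts the tokens in s, skipping whitespace between them. */
size_t count_tokens(const char *s)
{
    size_t count = 0;

    for (;;) {
        while (isspace((unsigned char)*s))
            s++;
        size_t n = token_length(s);
        if (n == 0)
            break;
        count++;
        s += n;
    }
    return count;
}
```

With these rules, `count_tokens("a+1")` yields 3, because the `+` is kept as a token rather than overwritten. strtok cannot do this, since it destroys each delimiter it finds.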

stochastic_monk over 6 years ago

I recommend ksplit/ksplit_core from Heng Li’s excellent klib kstring.{h,c}[0]. It modifies the string in-place, adding null terminators, and provides a list of offsets into the string. This gives you the flexibility of accessing tokens by index without paying costs of copying or memory allocation.

[0] https://github.com/attractivechaos/klib

lixtra over 6 years ago

I have an obsession with unsafe example code:

<pre><code>strcpy(str,"abc,def,ghi");
token = strtok(str,",");
printf("%s \n",token);
</code></pre>

Even if the author knows how many tokens are returned, I would prefer a check for NULL here, since a good fraction of readers might not read further than this bad example.
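For comparison, a NULL-checked version of that pattern might look like the sketch below. The wrapper function `print_tokens` is a hypothetical name introduced for this example; the strtok calls themselves are standard C.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: prints each token of buf on its own line and
 * returns how many were found. buf must be writable, because strtok
 * overwrites delimiters with '\0'. */
int print_tokens(char *buf, const char *delims)
{
    int count = 0;

    /* strtok returns NULL when no token is left, so test before use. */
    for (char *tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims)) {
        printf("%s\n", tok);
        count++;
    }
    return count;
}
```

Putting the check in the loop condition means an empty string, or one made up only of delimiters, prints nothing instead of passing NULL to printf.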

jfries over 6 years ago

Well, yes, using strtok works if the data happens to be structured in a certain simple way. Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.

graycat over 6 years ago

A lot of experience shows that the string tokenization in Open Object Rexx is darned useful. E.g., for many years, IBM's internal computing was from about 3600 mainframe computers around the world running VM/CMS with a lot of service machines written in Rexx. Rexx is no toy but a powerful, polished, scripting language and really good at handling strings.

A little example of some Rexx code with some string parsing is in https://news.ycombinator.com/item?id=18648999

pasokan over 6 years ago

It used to be that gcc would warn against strtok and recommend strsep instead. I do not know what the status is today.

caf over 6 years ago

Note though that strsep() is not as portable, because it is an extension to standard C.

satyenr over 6 years ago

> Next, strtok is not thread-safe. That's because it uses a static buffer internally. So, you should take care that only one thread in your program calls strtok at a time.

I wonder why strtok() does not use an output parameter similar to scanf(), and return the number of tokens. Something like:

<pre><code>int strtok(char *str, char *delim, char **tokens);
</code></pre>

Granted, it would involve dynamic memory allocation, and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it’s worth eliminating the kind of bugs the current strtok() can introduce?

Does anyone here have the historical perspective?
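One way such an interface could work without dynamic allocation is to have the caller supply the token array and its capacity. The sketch below is purely illustrative: `split_tokens` and its extra `max_tokens` parameter are assumptions made for this example, not an existing library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical function, not part of any standard library. Splits str in
 * place, stores up to max_tokens token pointers in tokens, and returns
 * the number stored. It keeps no static state and allocates nothing;
 * like strtok, it overwrites delimiters with '\0'. */
size_t split_tokens(char *str, const char *delims, char **tokens, size_t max_tokens)
{
    size_t count = 0;

    while (count < max_tokens) {
        str += strspn(str, delims);    /* skip delimiters */
        if (*str == '\0')
            break;
        tokens[count++] = str;
        str += strcspn(str, delims);   /* advance to the end of the token */
        if (*str == '\0')
            break;
        *str++ = '\0';                 /* terminate the token */
    }
    return count;
}
```

The trade-off is that the caller has to pick a capacity up front, and input beyond `max_tokens` is left unsplit. In exchange, the function is reentrant and the tokens can be accessed by index.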

megous over 6 years ago

Another approach, besides library calls and flex, is re2c. It preprocesses the source code and inlines regular expression parsing where you need it. It's very powerful in combination with goto.

saagarjha over 6 years ago

<pre><code>str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
strcpy(str,TESTSTRING);
</code></pre>

str = strdup(TESTSTRING)?

rurban over 6 years ago

AFAIK strtok has restrict on both args since C99. And the safe variants strtok_s and esp. wcstok_s are missing. Strings are Unicode nowadays, not ASCII.

https://en.cppreference.com/w/c/string/byte/strtok

bsenftner over 6 years ago

...And then the application is required to implement variable length characters, a la Unicode, and you start your strings logic all over...

the_clarence over 6 years ago

Problem is that your token string is going to be quite large. Is there a built-in solution for when tokens are just single chars?

setquk over 6 years ago

I just use flex. You don’t have to ship flex as a dependency either.

alexandernst over 6 years ago

How about just using a properly suited language for string manipulation?

String tokenization in C