String tokenization in C

155 points by throwaway2419 over 6 years ago

16 comments

kazinator over 6 years ago
The actions of strtok can easily be coded using strspn and strcspn.

https://groups.google.com/forum/message/raw?msg=comp.lang.c/ZhXAlw6VZsA/_Y5evTIkf6kJ [2001]

https://groups.google.com/forum/message/raw?msg=comp.lang.c/ff0xFqRPH_Y/Cen0mgciXn8J [2011 repost]

strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.

The bag is like a one-character regex class; so that is to say strspn(s, "abcd") is like calculating the length of the token at the front of input s matching the regex [abcd]*, and in the case of strcspn, that becomes [^abcd]*.
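As a rough illustration of the strspn/strcspn idea, here is a minimal sketch of a tokenizer that never writes to its input. The name next_token and its signature are made up for this example; they are not taken from the linked posts or from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the standard library.
 * Finds the next token at *cursor without modifying the input.
 * On success, stores the token length in *len, advances *cursor past
 * the token, and returns a pointer to the token's first character.
 * Returns NULL when no tokens remain. */
static const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    assert(cursor != NULL && *cursor != NULL && delims != NULL && len != NULL);

    /* strspn: length of the leading run of delimiter characters. */
    const char *start = *cursor + strspn(*cursor, delims);
    if (*start == '\0') {
        *cursor = start;
        return NULL;
    }

    /* strcspn: length of the leading run of non-delimiter characters. */
    *len = strcspn(start, delims);
    *cursor = start + *len;
    return start;
}
```

Because the tokens are not NUL-terminated, a caller would print one with something like printf("%.*s\n", (int)len, tok), or copy it out if a separate string is needed.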
jstimpfle over 6 years ago
strtok is one of the silliest parts of the standard library. (And there are many bad ones.) It's broken. It's not thread safe (yes, there is strtok_r). It's needlessly hard to use. And it writes zeros to the input array. The latter means it's unfit for most use cases, including non-trivial tokenization where you want e.g. to split "a+1" into three tokens.

If you program in C please just write those four obvious lines yourself.
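To make the "a+1" case concrete, here is one possible hand-written scanner. It is only a sketch under assumed rules (a token is either a run of alphanumeric characters or a single other character, and whitespace is skipped); struct token and lex_next are hypothetical names, and this is not necessarily the code the commenter had in mind.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token record: a view into the input, no copy, no writes. */
struct token {
    const char *start;
    size_t len;
};

/* Reads one token at *p: a run of alphanumerics, or one other character.
 * Leading whitespace is skipped. Returns 0 at end of input, 1 otherwise. */
static int lex_next(const char **p, struct token *out)
{
    assert(p != NULL && *p != NULL && out != NULL);

    const char *s = *p;
    while (isspace((unsigned char)*s))
        s++;
    if (*s == '\0') {
        *p = s;
        return 0;
    }

    out->start = s;
    if (isalnum((unsigned char)*s)) {
        while (isalnum((unsigned char)*s))
            s++;
    } else {
        s++; /* any other character is a one-character token */
    }
    out->len = (size_t)(s - out->start);
    *p = s;
    return 1;
}
```

Since adjacent tokens such as "a" and "+" share no separator, there is nowhere to write a terminating zero without destroying the next token, which is why this version reports a pointer and a length instead.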
stochastic_monk over 6 years ago
I recommend ksplit/ksplit_core from Heng Li's excellent klib kstring.{h,c}[0]. It modifies the string in-place, adding null terminators, and provides a list of offsets into the string. This gives you the flexibility of accessing tokens by index without paying costs of copying or memory allocation.

[0] https://github.com/attractivechaos/klib
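The general technique (terminate fields in place and record their offsets) can be sketched as follows. This is not klib's code or API; split_offsets is a hypothetical function written for illustration, and details such as how empty fields are treated may differ from ksplit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not klib's ksplit.
 * Replaces every occurrence of delim in s with '\0' and records the
 * offset of each field in offsets[] (at most max entries are stored).
 * Empty fields are kept, so "a,,b" yields three fields.
 * Returns the total number of fields found, which may exceed max. */
static size_t split_offsets(char *s, char delim, size_t *offsets, size_t max)
{
    assert(s != NULL && offsets != NULL);

    size_t count = 0;
    size_t field_start = 0;
    for (size_t i = 0;; i++) {
        if (s[i] == delim || s[i] == '\0') {
            int at_end = (s[i] == '\0');
            if (count < max)
                offsets[count] = field_start;
            count++;
            s[i] = '\0';          /* terminate the field in place */
            if (at_end)
                break;
            field_start = i + 1;
        }
    }
    return count;
}
```

After the call, field k is the ordinary C string s + offsets[k], so fields can be accessed by index with no copying and no allocation beyond the caller's offset array.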
lixtra over 6 years ago
I have an obsession with unsafe example code:

    strcpy(str,"abc,def,ghi");
    token = strtok(str,",");
    printf("%s \n",token);

Even if the author knows how many tokens are returned, I would prefer a check for NULL here, since a good fraction of readers might not read further than this bad example.
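For comparison, a version of the same pattern with the NULL check might look like the sketch below. The wrapper function print_tokens is a hypothetical name added here so the loop can be shown in one self-contained piece.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prints each token on its own line and returns how many were found.
 * strtok writes '\0' into str, so str must be a modifiable buffer,
 * never a string literal. */
static int print_tokens(char *str, const char *delims)
{
    assert(str != NULL && delims != NULL);

    int count = 0;
    /* strtok returns NULL when no token is left; checking it in the loop
     * condition means printf never receives a NULL pointer for %s. */
    for (char *token = strtok(str, delims);
         token != NULL;
         token = strtok(NULL, delims)) {
        printf("%s\n", token);
        count++;
    }
    return count;
}
```

With an input made only of delimiters, or an empty string, the very first strtok call returns NULL, which is exactly the case the unchecked snippet would pass straight to printf.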
jfries over 6 years ago
Well, yes, using strtok works if the data happens to be structured in a certain simple way. Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.
graycat over 6 years ago
A lot of experience shows that the string tokenization in Open Object Rexx is darned useful. E.g., for many years, IBM's internal computing was from about 3600 mainframe computers around the world running VM/CMS with a lot of service machines written in Rexx. Rexx is no toy but a powerful, polished, scripting language and really good at handling strings.

A little example of some Rexx code with some string parsing is in

https://news.ycombinator.com/item?id=18648999
pasokan over 6 years ago
It used to be that gcc would warn against strtok and recommend strsep instead. I do not know what the status is today.
caf over 6 years ago
Note though that strsep() is not as portable, because it is an extension to standard C.
satyenr over 6 years ago
> Next, strtok is not thread-safe. That's because it uses a static buffer internally. So, you should take care that only one thread in your program calls strtok at a time.

I wonder why strtok() does not use an output parameter, similar to scanf(), and return the number of tokens. Something like:

    int strtok(char *str, char *delim, char **tokens);

Granted, it would involve dynamic memory allocation, and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it's worth eliminating the kind of bugs the current strtok() can introduce?

Does anyone here have the historical perspective?
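One way an interface along these lines could look is sketched below. It differs from the signature proposed above: the hypothetical split_tokens takes a caller-supplied array and a capacity, which sidesteps dynamic allocation, and it has no hidden static state. It still writes terminators into the input, as strtok does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical function, not a standard or proposed API.
 * Splits str in place on any character in delim, storing pointers to at
 * most max_tokens tokens in tokens[]. Returns the number stored.
 * Keeps no state between calls, so separate buffers can be split from
 * different threads. */
static size_t split_tokens(char *str, const char *delim,
                           char **tokens, size_t max_tokens)
{
    assert(str != NULL && delim != NULL && tokens != NULL);

    size_t count = 0;
    char *p = str;
    while (count < max_tokens) {
        p += strspn(p, delim);      /* skip delimiters before the token */
        if (*p == '\0')
            break;
        tokens[count++] = p;
        p += strcspn(p, delim);     /* advance to the end of the token */
        if (*p == '\0')
            break;
        *p++ = '\0';                /* terminate the token in place */
    }
    return count;
}
```

The trade-off is that the caller must pick a capacity up front; anything beyond max_tokens is left unsplit in the remainder of the buffer.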
megous over 6 years ago
Another approach, besides library calls and flex, is re2c. It preprocesses the source code and inlines regular expression parsing where you need it. It's very powerful in combination with goto.
saagarjha over 6 years ago

    str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
    strcpy(str,TESTSTRING);

str = strdup(TESTSTRING)?
rurban over 6 years ago
AFAIK strtok has restrict on both args since C99. And the safe variants strtok_s and esp. wcstok_s are missing. Strings are Unicode nowadays, not ASCII.

https://en.cppreference.com/w/c/string/byte/strtok
bsenftner over 6 years ago
...And then the application is required to implement variable length characters, a la Unicode, and you start your strings logic all over...
the_clarence over 6 years ago
Problem is that your token string is going to be quite large. Is there a built-in solution for when tokens are just single chars?
setquk over 6 years ago
I just use flex. You don’t have to ship flex as a dependency either.
alexandernst over 6 years ago
How about just using a language properly suited for string manipulation?