String tokenization in C

155 points by throwaway2419, over 6 years ago

16 comments

kazinator, over 6 years ago
The actions of strtok can easily be coded using strspn and strcspn.

https://groups.google.com/forum/message/raw?msg=comp.lang.c/ZhXAlw6VZsA/_Y5evTIkf6kJ [2001]

https://groups.google.com/forum/message/raw?msg=comp.lang.c/ff0xFqRPH_Y/Cen0mgciXn8J [2011 repost]

strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters *not* in bag.

The bag is like a one-character regex class; that is to say, strspn(s, "abcd") is like calculating the length of the token at the front of input s matching the regex [abcd]* , and in the case of strcspn, that becomes [^abcd]* .
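
A minimal sketch of that idea, not taken from the linked posts: the helper name next_token and its signature are assumptions made for illustration. It skips delimiters with strspn, measures the token with strcspn, and never writes to the input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the next token at or after *cursor without
 * modifying the input. On success it stores the token length in *len,
 * advances *cursor past the token, and returns a pointer to the token's
 * first character. Returns NULL when no tokens remain. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    /* Skip any leading run of delimiter characters. */
    const char *start = *cursor + strspn(*cursor, delims);
    if (*start == '\0') {
        *cursor = start;
        return NULL;
    }
    /* The token is the longest prefix containing no delimiter. */
    *len = strcspn(start, delims);
    *cursor = start + *len;
    return start;
}
```

Because the token is not null-terminated, a caller would print it with a bounded format such as printf("%.*s\n", (int)len, tok).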
jstimpfle, over 6 years ago
strtok is one of the silliest parts of the standard library (and there are many bad ones). It's broken. It's not thread safe (yes, there is strtok_r). It's needlessly hard to use. And it writes zeros to the input array. The latter means it's unfit for most use cases, including non-trivial tokenization where you want e.g. to split "a+1" into three tokens.

If you program in C, please just write those four obvious lines yourself.
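
One possible shape for such a hand-written tokenizer, as a rough sketch rather than the commenter's own code: struct span and lex_next are hypothetical names, and the token rule (a run of alphanumerics, or any single other non-space character) is an assumption chosen so that "a+1" yields three tokens. The input is only read, never written.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token descriptor: a view into the input; nothing is copied. */
struct span {
    const char *ptr;
    size_t len;
};

/* Returns the next token at or after *cursor and advances *cursor past it.
 * A token is a run of alphanumeric characters, or any single other
 * non-space character. A span with len == 0 signals end of input. */
struct span lex_next(const char **cursor)
{
    const char *p = *cursor;
    while (isspace((unsigned char)*p))
        p++;                      /* skip whitespace between tokens */
    const char *start = p;
    if (isalnum((unsigned char)*p)) {
        while (isalnum((unsigned char)*p))
            p++;                  /* identifier or number */
    } else if (*p != '\0') {
        p++;                      /* single-character token such as '+' */
    }
    *cursor = p;
    struct span s = { start, (size_t)(p - start) };
    return s;
}
```

Calling lex_next repeatedly on "a+1" would return spans for a, + and 1, then an empty span.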
stochastic_monk, over 6 years ago
I recommend ksplit/ksplit_core from Heng Li's excellent klib kstring.{h,c}[0]. It modifies the string in-place, adding null terminators, and provides a list of offsets into the string. This gives you the flexibility of accessing tokens by index without paying costs of copying or memory allocation.

[0] https://github.com/attractivechaos/klib
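
To illustrate the general technique (in-place terminators plus an offset list), here is a small sketch. It is not klib's ksplit and does not reproduce its API: split_offsets, its parameters, and the caller-supplied offsets array are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper, only loosely modeled on the idea described above:
 * overwrites each delimiter byte in s with '\0' and records the starting
 * offset of every field in offsets[]. Empty fields are kept. Returns the
 * total number of fields found; at most max_fields offsets are stored. */
size_t split_offsets(char *s, char delim, size_t *offsets, size_t max_fields)
{
    size_t n = 0;
    size_t start = 0;
    for (size_t i = 0;; i++) {
        if (s[i] == delim || s[i] == '\0') {
            int at_end = (s[i] == '\0');
            if (n < max_fields)
                offsets[n] = start;   /* field begins at s + start */
            n++;
            s[i] = '\0';              /* terminate the field in place */
            if (at_end)
                break;
            start = i + 1;
        }
    }
    return n;
}
```

Field k is then available as s + offsets[k], so tokens can be accessed by index without copying or allocating.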
lixtra, over 6 years ago
I have an obsession with unsafe example code:

    strcpy(str,"abc,def,ghi");
    token = strtok(str,",");
    printf("%s \n",token);

Even if the author knows how many tokens are returned, I would prefer a check for NULL here, since a good fraction of readers might not read further than this bad example.
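
A version with the NULL check might look like the sketch below. The wrapper name print_tokens is hypothetical; only the standard strtok and printf calls are relied on.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: prints every token in buf, one per line, and
 * returns how many were printed. Each strtok result is checked for NULL
 * before it is used. buf is modified in place by strtok. */
size_t print_tokens(char *buf, const char *delims)
{
    size_t n = 0;
    for (char *tok = strtok(buf, delims);
         tok != NULL;
         tok = strtok(NULL, delims)) {
        printf("%s\n", tok);   /* tok is non-NULL inside the loop body */
        n++;
    }
    return n;
}
```

With an empty string, or a string made only of delimiters, the first strtok call returns NULL and the loop body never runs.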
jfries, over 6 years ago
Well, yes, using strtok works if the data happens to be structured in a certain simple way. Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.
graycat, over 6 years ago
A lot of experience shows that the string tokenization in Open Object Rexx is darned useful. E.g., for many years, IBM's internal computing was from about 3600 mainframe computers around the world running VM/CMS with a lot of service machines written in Rexx. Rexx is no toy but a powerful, polished scripting language and really good at handling strings.

A little example of some Rexx code with some string parsing is in

https://news.ycombinator.com/item?id=18648999
pasokan, over 6 years ago
It used to be that gcc would warn against strtok and recommend strsep instead. I do not know what the status is today.
caf, over 6 years ago
Note though that strsep() is not as portable, because it is an extension to standard C.
satyenr, over 6 years ago
> Next, strtok is not thread-safe. That's because it uses a static buffer internally. So, you should take care that only one thread in your program calls strtok at a time.

I wonder why strtok() does not use an output parameter similar to scanf() and return the number of tokens. Something like:

    int strtok(char *str, char *delim, char **tokens);

Granted, it would involve dynamic memory allocation, and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it's worth eliminating the kind of bugs the current strtok() can introduce?

Does anyone here have the historical perspective?
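
One way such an interface could avoid dynamic allocation is to let the caller supply the array and its capacity. The sketch below is a variation on the proposed signature, not an existing library function: the name split_tokens and the added max_tokens parameter are assumptions. It keeps no static state, so separate threads can tokenize separate buffers.

```c
#include <string.h>

/* Hypothetical function: writes '\0' over delimiters in str, stores up to
 * max_tokens token pointers in tokens[], and returns the number stored.
 * All state lives in local variables, so there is no hidden static buffer. */
int split_tokens(char *str, const char *delim, char **tokens, int max_tokens)
{
    int n = 0;
    char *p = str;
    while (n < max_tokens) {
        p += strspn(p, delim);     /* skip leading delimiters */
        if (*p == '\0')
            break;                 /* nothing left but delimiters */
        tokens[n++] = p;
        p += strcspn(p, delim);    /* advance to the end of this token */
        if (*p == '\0')
            break;                 /* token ran to the end of the string */
        *p++ = '\0';               /* terminate token, step past delimiter */
    }
    return n;
}
```

If the input holds more tokens than max_tokens, the remainder of the string is left untouched after the last stored token's terminator.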
megous, over 6 years ago
Another approach, besides library calls and flex, is re2c. It preprocesses the source code and inlines regular expression parsing where you need it. It's very powerful in combination with goto.
saagarjha, over 6 years ago

    str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
    strcpy(str,TESTSTRING);

str = strdup(TESTSTRING)?
rurban, over 6 years ago
AFAIK strtok has restrict on both args since C99. And the safe variants strtok_s and esp. wcstok_s are missing. Strings are Unicode nowadays, not ASCII.

https://en.cppreference.com/w/c/string/byte/strtok
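
For wide strings, a sketch using the standard wcstok (swapped in here for the Annex K wcstok_s named above, which many C libraries do not provide) could look like this. The wrapper name count_wide_tokens is hypothetical. Unlike strtok, the three-argument wcstok keeps its position in a caller-supplied pointer.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical wrapper: counts the tokens in a wide-character buffer.
 * wcstok stores its position in 'state' instead of hidden static storage.
 * buf is modified in place. */
size_t count_wide_tokens(wchar_t *buf, const wchar_t *delims)
{
    size_t n = 0;
    wchar_t *state = NULL;
    for (wchar_t *tok = wcstok(buf, delims, &state);
         tok != NULL;
         tok = wcstok(NULL, delims, &state)) {
        n++;   /* tok points at a null-terminated wide token here */
    }
    return n;
}
```

This handles wchar_t strings only; it does not address variable-length encodings such as UTF-8.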
bsenftner, over 6 years ago
...And then the application is required to implement variable length characters, a la Unicode, and you start your strings logic all over...
the_clarence, over 6 years ago
Problem is that your token string is going to be quite large. Is there a built-in solution for when tokens are just single chars?
setquk, over 6 years ago
I just use flex. You don’t have to ship flex as a dependency either.
alexandernst, over 6 years ago
How about just using a language properly suited for string manipulation?