Show HN: A fast ISO8601 date-time parser for Python

55 points | by thomas-st | almost 11 years ago

13 comments

deathanatos | almost 11 years ago
A regex only seems to take ~1 µs:

    In [7]: iso_regex = re.compile('(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}(?:\\.?\\d+))')
    In [8]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
    1000000 loops, best of 3: 1.05 µs per loop

But hey, once it's written in C, why go back?

I'm missing the timezone, but the OP left that out, so I did too. For comparison, dateutil's parse takes ~76 µs for me. Kinda makes me wonder why aniso8601 is so slow. (It's also missing a few other things, depending on whether you count all the non-time forms as valid input.)

That said, cool! I might use this. One of the things that makes dateutil's parse slower is that it'll parse more than just ISO-8601: it parses many things that look like dates, including some very non-intuitive ones that have caused "bugs"¹. Usually in APIs, it's "dates are always ISO-8601", and all I really need is an ISO-8601 parser. While I appreciate the theory behind "be liberal in what you accept", sometimes I'd rather error out than build the expectation that sending garbage — er, stuff that requires a complicated parse algorithm I don't really understand — is okay.

¹ dateutil.parser.parse('') is midnight of the current date. Why, I don't know. Also, dateutil.parser.parse('noon') raises "TypeError: 'NoneType' object is not iterable".
Comment #7871814 not loaded
Comment #7871756 not loaded
Comment #7871757 not loaded
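A minimal sketch of the regex approach deathanatos describes, extended to build a datetime from the captured groups; the helper name, the slightly rearranged pattern, and the lack of timezone handling are my assumptions, not part of the original comment:

    import re
    from datetime import datetime

    # Same shape as the pattern above: date, 'T', time, optional fractional seconds, no timezone.
    ISO_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?')

    def parse_iso(s):
        m = ISO_RE.match(s)
        if m is None:
            raise ValueError('not an ISO-8601 timestamp: %r' % s)
        year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
        # Scale the fractional part to microseconds (pad/truncate to 6 digits).
        micro = int((m.group(7) or '0').ljust(6, '0')[:6])
        return datetime(year, month, day, hour, minute, second, micro)

    print(parse_iso('2014-01-09T21:48:00.921000'))  # 2014-01-09 21:48:00.921000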
birken | almost 11 years ago
Pandas (a data analysis library for Python) has a lot of Cython and C optimizations for datetime string parsing.

They have their own C function that parses ISO-8601 datetime strings: https://github.com/pydata/pandas/blob/2f1a6c412c3d1cbdf566103eabef4997274e4576/pandas/src/datetime/np_datetime_strings.c#L344

They have a version of strptime written in Cython: https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L1473

I'm not saying these are better or worse than your solution (I haven't done any benchmarks, and the pandas functions sometimes cut a few corners), but perhaps there is something useful there for reference anyway. They also don't deal directly in datetime.datetime objects; they use pandas-specific intermediate objects, but those should be simple enough to grok.

Having done some work with dateutil, I will tell you that dateutil.parser.parse is slow, but its main use case shouldn't be converting strings to datetimes when you already know the format. If you know the format, you should use datetime.strptime or some faster variant (like the ones above). There is a nice feature of pandas where, given a list of datetime-y strings of arbitrary format, it will attempt to guess the format using dateutil's lexer (https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L73) combined with trial and error, and then try to use a faster parser than dateutil.parser.parse to convert the array if possible. In the general case this resulted in about a 10x speedup over dateutil.parser.parse if the format was guessable.
Comment #7873131 not loaded
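A small illustration of the point about known formats: when the shape of the string is fixed, strptime skips dateutil's format guessing. The format string below assumes microseconds and no timezone; it is not from birken's comment or from pandas:

    from datetime import datetime
    import dateutil.parser  # pip install python-dateutil

    s = '2014-01-09T21:48:00.921000'

    # Known format: strptime does no guessing.
    dt_fast = datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%f')

    # Unknown format: dateutil works it out itself, which costs time.
    dt_slow = dateutil.parser.parse(s)

    assert dt_fast == dt_slow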
data_scientist | almost 11 years ago
I tried to do a fair comparison between the main date implementations. ciso8601 is really fast, 3.73 µs on my computer (MBA 2013). aniso8601, iso8601, isodate and arrow are all between 45 and 100 µs. The dateutil parser is the slowest (150 µs).

    >>> ds = u'2014-01-09T21:48:00.921000+05:30'
    >>> %timeit ciso8601.parse_datetime(ds)
    100000 loops, best of 3: 3.73 µs per loop
    >>> %timeit dateutil.parser.parse(ds)
    10000 loops, best of 3: 157 µs per loop

A regex[1] can be fast, but the matching is just a small part of the time spent:

    >>> %timeit regex_parse_datetime(ds)
    100000 loops, best of 3: 13 µs per loop
    >>> %timeit match = iso_regex.match(ds)
    100000 loops, best of 3: 2.18 µs per loop

Pandas is also slow. However, it is the fastest for a list of dates: just 0.43 µs per date!

    >>> %timeit pd.to_datetime(ds)
    10000 loops, best of 3: 47.9 µs per loop
    >>> l = [u'2014-01-09T21:{}:{}.921000+05:30'.format(("0"+str(i%60))[-2:], ("0"+str(int(i/60)))[-2:]) for i in xrange(1000)]  # 1000 different dates
    >>> len(set(l)), len(l)
    (1000, 1000)
    >>> %timeit pd.to_datetime(l)
    1000 loops, best of 3: 437 µs per loop

NB: pandas is, however, very slow on ill-formed dates like u'2014-01-09T21:00:0.921000+05:30' (only one digit for the seconds): 230 µs, with no speedup from vectorization.

So if you care about speed and your dates are well formatted, make a vector of dates and use pandas. If you can't use it, go for ciso8601. For thomas-st: it may be possible to speed up parsing of lists of dates the way pandas does. Another nice feature would be caching.

[1]: http://pastebin.com/ppJ4dzBP
userbinator | almost 11 years ago
Extremely simple and straightforward C code too, which is also nice to read. 320 ns (on what processor?) works out, assuming a 2-3 GHz x86 clock, to around 1K instructions, several orders of magnitude fewer than before. But that still works out to a few dozen instructions per character of the string... so I'm inclined to believe it could go an order of magnitude faster if you really wanted it to, but at that point the Python overhead (PyArg_ParseTuple et al.) is going to dominate.

I'm not sure this would be any better than just manually writing out both trivial iterations of the loop:

    for (i = 0; i < 2; i++)
Comment #7871804 not loaded
josephlord | almost 11 years ago
Does it cover all of ISO8601? I'm sure it covers the common cases, so it's a valuable library anyway, but I seem to remember that ISO8601 is quite complicated.
Comment #7871778 not loaded
radikalus | almost 11 years ago
My quick look at this shows that unless you Cython-wrap the call, this is going to be slower than using pandas' to_datetime on anything with an array layout.

I've never really spent much time looking at pandas' to_datetime, but I believe it has to handle a lot of variety in what you pass to it (lists, arrays, Series), which probably causes a bit of a perf hit.

http://dl.dropboxusercontent.com/u/14988785/ciso8601_comparison.html
wanghq | almost 11 years ago
If you control the source data, store it as an epoch timestamp and you can avoid this parsing.

Not quite related: is there any Python library that can handle timezone parsing, like Java's SimpleDateFormat (http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html)? The timezone could be in UTC-offset or short-name format (EST, EDT, ...). I am surprised that I couldn't find one.
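One option that seems to fit the question is dateutil's tzinfos argument, which maps abbreviations to explicit offsets; a rough sketch follows, where the particular offsets are my assumptions for North American EST/EDT (abbreviations like these are ambiguous across regions):

    from dateutil import parser, tz  # pip install python-dateutil

    # Map abbreviations to offsets in seconds east of UTC.
    tzinfos = {
        'EST': tz.tzoffset('EST', -5 * 3600),
        'EDT': tz.tzoffset('EDT', -4 * 3600),
    }

    dt = parser.parse('2014-01-09 21:48:00 EST', tzinfos=tzinfos)
    print(dt)              # 2014-01-09 21:48:00-05:00
    print(dt.utcoffset())  # -1 day, 19:00:00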
btbuilder | almost 11 years ago
While profiling, I noticed the same thing about dateutil.parser.parse a few years ago. We standardized all our interacting systems on UTC, so we have a regex that matches UTC timestamps, and if that fails to match we call dateutil. That way the vast majority of cases are optimized, but we still support other timezones.
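A rough sketch of that fast-path-plus-fallback pattern; the regex, the helper name, and the exact timestamp shape are my assumptions, not btbuilder's actual code (Python 3):

    import re
    from datetime import datetime, timezone
    import dateutil.parser

    # Fast path: one known UTC shape; anything else falls back to dateutil.
    UTC_RE = re.compile(r'^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?Z$')

    def parse_timestamp(s):
        m = UTC_RE.match(s)
        if m:
            y, mo, d, h, mi, sec = (int(g) for g in m.groups()[:6])
            micro = int((m.group(7) or '0').ljust(6, '0')[:6])
            return datetime(y, mo, d, h, mi, sec, micro, tzinfo=timezone.utc)
        # Slower path: arbitrary formats and other timezones.
        return dateutil.parser.parse(s)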
jnks | almost 11 years ago
How many dates are you parsing at a time that optimizing this would make a noticeable difference to users?
Comment #7871618 not loaded
Comment #7873139 not loaded
rlpb | almost 11 years ago
There are other parsers that already exist too. For example, did you try this one? https://pypi.python.org/pypi/iso8601

How do these all compare to each other?
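For reference, usage of that package looks roughly like this (assuming `pip install iso8601`; parse_date is its documented entry point as far as I recall):

    import iso8601

    dt = iso8601.parse_date('2014-01-09T21:48:00.921000+05:30')
    print(dt)  # 2014-01-09 21:48:00.921000+05:30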
daurnimator | almost 11 years ago
I think you actually mean RFC3339. ISO8601 is probably a lot larger than you think.
Comment #7871831 not loaded
jamesaguilar | almost 11 years ago
This seems like the type of thing that's good to FFI out if you're using it a lot. I highly doubt the C version would take this long.
Comment #7871699 not loaded
Sir_Cmpwn | almost 11 years ago
Would it make more sense to modify the core library and send off a patch?
Comment #7873378 not loaded