A regex only seems to take ~1µs.<p><pre><code> In [7]: iso_regex = re.compile('(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}(?:\\.\\d+)?)')
In [8]: %timeit iso_regex.match('2014-01-09T21:48:00.921000')
1000000 loops, best of 3: 1.05 µs per loop
</code></pre>
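Of course the match alone isn't a datetime yet, and assembling one is where most of the remaining time goes. A minimal sketch of that step (my own grouping, with the fraction split into its own group and timezones still ignored):<p><pre><code> from datetime import datetime
import re

# Same pattern, but with the fractional seconds captured separately.
iso_regex = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?')

def regex_parse(s):
    y, mo, d, h, mi, sec, frac = iso_regex.match(s).groups()
    micro = int((frac or '0').ljust(6, '0')[:6])  # pad/trim to microseconds
    return datetime(int(y), int(mo), int(d), int(h), int(mi), int(sec), micro)
</code></pre>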
But hey, once it's written in C, why go back?<p>I'm missing the timezone, but the OP left that out, so I did too. For comparison, dateutil's parse takes ~76 µs for me. Kinda makes me wonder why aniso8601 is so slow. (The regex is also missing a few other things, depending on whether you count all the non-time forms as valid input.)<p>That said, cool! I might use this. One of the things that makes dateutil's parse slower is that it'll parse more than just ISO-8601: it parses many things that look like dates, including some very non-intuitive ones that have caused "bugs"¹. Usually in APIs it's "dates are always ISO-8601", and all I really <i>need</i> is an ISO-8601 parser. While I appreciate the theory behind "be liberal in what you accept", sometimes I'd rather error out than build the expectation that sending garbage (er, stuff that requires a complicated parsing algorithm I don't really understand) is okay.<p>¹dateutil.parser.parse('') is midnight of the current date. Why, I don't know. Also, dateutil.parser.parse('noon') raises "TypeError: 'NoneType' object is not iterable".
Pandas (a data analysis library for Python) has a lot of Cython and C optimizations for datetime string parsing.<p>They have their own C function which parses ISO-8601 datetime strings: <a href="https://github.com/pydata/pandas/blob/2f1a6c412c3d1cbdf566103eabef4997274e4576/pandas/src/datetime/np_datetime_strings.c#L344" rel="nofollow">https://github.com/pydata/pandas/blob/2f1a6c412c3d1cbdf56610...</a><p>They have a version of strptime written in Cython: <a href="https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L1473" rel="nofollow">https://github.com/pydata/pandas/blob/master/pandas/tslib.py...</a><p>I'm not saying these are better or worse than your solution (I haven't done any benchmarks, and the pandas functions sometimes cut a few corners), but perhaps there is something useful there for reference anyway. They also don't deal directly in datetime.datetime objects, they use pandas-specific intermediate objects, but those should be simple enough to grok.<p>Having done some work with dateutil, I will tell you that dateutil.parser.parse is slow, but its main use case shouldn't be converting strings to datetimes when you already know the format. If you know the format, use datetime.strptime or some faster variant like the one above (quick sketch at the end of this comment). There is a nice feature of pandas where, given a list of datetime-y strings in an arbitrary format, it will attempt to guess the format using dateutil's lexer (<a href="https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L73" rel="nofollow">https://github.com/pydata/pandas/blob/master/pandas/tseries/...</a>) combined with trial and error, and then try to use a faster parser instead of dateutil.parser.parse to convert the array if possible. In the general case this resulted in about a 10x speedup over dateutil.parser.parse when the format was guessable.
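To make the strptime point concrete, a minimal sketch (the format string is mine, matching the timestamp used elsewhere in this thread; no timezone handling):<p><pre><code> from datetime import datetime

fmt = '%Y-%m-%dT%H:%M:%S.%f'  # layout known up front
dt = datetime.strptime('2014-01-09T21:48:00.921000', fmt)
</code></pre>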
I tried to do a fair comparison between the main date implementations. ciso8601 is really fast: 3.73 µs on my computer (MBA 2013). aniso8601, iso8601, isodate and arrow are all between 45 and 100 µs. The dateutil parser is the slowest (157 µs).<p><pre><code> >>> ds = u'2014-01-09T21:48:00.921000+05:30'
>>> %timeit ciso8601.parse_datetime(ds)
100000 loops, best of 3: 3.73 µs per loop
>>> %timeit dateutil.parser.parse(ds)
10000 loops, best of 3: 157 µs per loop
</code></pre>
A regex[1] can be fast, but the match itself is only a small part of the time spent; building the datetime from the groups takes most of it.<p><pre><code> >>> %timeit regex_parse_datetime(ds)
100000 loops, best of 3: 13 µs per loop
>>> %timeit match = iso_regex.match(ds)
100000 loops, best of 3: 2.18 µs per loop
</code></pre>
Pandas is also slow on a single date. However, it is the fastest for a list of dates: just 0.44 µs per date!<p><pre><code> >>> %timeit pd.to_datetime(ds)
10000 loops, best of 3: 47.9 µs per loop
>>> l = [u'2014-01-09T21:{:02d}:{:02d}.921000+05:30'.format(i % 60, i // 60)
        for i in xrange(1000)]  # 1000 different dates
>>> len(set(l)), len(l)
(1000, 1000)
>>> %timeit pd.to_datetime(l)
1000 loops, best of 3: 437 µs per loop
</code></pre>
NB: pandas is, however, very slow on ill-formed dates like u'2014-01-09T21:00:0.921000+05:30' (only one digit for the seconds): 230 µs, with no speedup from vectorization.<p>So if you care about speed and your dates are well formatted, make a vector of dates and use pandas. If you can't use it, go for ciso8601. For thomas-st: it may be possible to speed up parsing of lists of dates the way pandas does. Another nice feature would be caching (sketch below).<p>[1]: <a href="http://pastebin.com/ppJ4dzBP" rel="nofollow">http://pastebin.com/ppJ4dzBP</a>
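By caching I mean something like this minimal memoization sketch (names are mine; it only pays off when the same strings repeat, and datetimes are immutable, so handing out cached objects is safe):<p><pre><code> import ciso8601

_cache = {}

def cached_parse(s):
    # Memoize parses; only worthwhile when inputs repeat.
    try:
        return _cache[s]
    except KeyError:
        dt = _cache[s] = ciso8601.parse_datetime(s)
        return dt
</code></pre>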
Extremely simple and straightforward C code too, which is also nice to read. 320 ns (on what processor?) is, assuming a 2-3 GHz x86 clock, around 1K instructions, several orders of magnitude less than what it was before. But that still works out to a few dozen instructions <i>per character</i> of the string, so I'm inclined to believe that it could go an order of magnitude faster if you really wanted it to; at that point, though, the Python overhead (PyArg_ParseTuple et al.) is going to dominate.<p>I'm not sure this would be any better than just manually writing out both trivial iterations of the loop:<p><pre><code> for (i = 0; i < 2; i++)</code></pre>
Does it cover all of ISO 8601? I'm sure it covers the common cases, so it's a valuable library anyway, but I seem to remember that ISO 8601 is quite complicated: there are week dates (2014-W02-4), ordinal dates (2014-009), durations, intervals, and so on.
My quick look at this shows that unless you Cython-wrap the call, this is going to be slower than using pandas' to_datetime on anything with an array layout.<p>I've never really spent much time looking at pandas' to_datetime, but I believe it has to handle a lot of variety in what you pass to it (lists, arrays, Series), which probably causes a bit of a perf hit.<p><a href="http://dl.dropboxusercontent.com/u/14988785/ciso8601_comparison.html" rel="nofollow">http://dl.dropboxusercontent.com/u/14988785/ciso8601_compari...</a>
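For reference, the vectorized call really is a one-liner over any of those shapes (a small sketch; the array contents are made up):<p><pre><code> import numpy as np
import pandas as pd

arr = np.array([u'2014-01-09T21:48:00.921000'] * 1000)
dts = pd.to_datetime(arr)  # lists, ndarrays, and Series all work here
</code></pre>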
If you control the source data, store it as an epoch timestamp and you can avoid this parsing entirely.<p>Not quite related:
Is there any Python library that can handle timezone parsing, like Java's SimpleDateFormat (<a href="http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html" rel="nofollow">http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDat...</a>)? The timezone could be a UTC offset or a short name (EST, EDT, ...). I am surprised that I couldn't find one.
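The closest thing I know of is dateutil, which handles offsets natively but makes you supply the short-name mapping yourself via its tzinfos argument, since the abbreviations are ambiguous (CST alone names several zones). A sketch (the mapping and timestamp are mine):<p><pre><code> import dateutil.parser

# Offsets are seconds east of UTC; you own the mapping because
# abbreviations like CST are ambiguous across regions.
tzinfos = {'EST': -5 * 3600, 'EDT': -4 * 3600}

dt = dateutil.parser.parse('2014-01-09 21:48:00 EST', tzinfos=tzinfos)
</code></pre>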
While profiling, I noticed the same thing about dateutil.parser.parse a few years ago. We standardized all our interacting systems on UTC, so we have a regex that matches UTC timestamps; if that fails to match, we fall back to dateutil. That way the vast majority of cases are optimized, but we still support other timezones.
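The shape of it is roughly this (a minimal sketch of the idea, not the production code; the pattern and names are illustrative):<p><pre><code> import re
from datetime import datetime
import dateutil.parser

# Fast path: the one format our UTC-standardized systems emit (trailing 'Z').
UTC_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?Z$')

def parse_ts(s):
    m = UTC_RE.match(s)
    if m is None:
        return dateutil.parser.parse(s)  # slow path: any other timezone/format
    y, mo, d, h, mi, sec, frac = m.groups()
    micro = int((frac or '0').ljust(6, '0')[:6])  # pad/trim to microseconds
    # Naive datetime, treated as UTC by convention.
    return datetime(int(y), int(mo), int(d), int(h), int(mi), int(sec), micro)
</code></pre>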
Other parsers already exist too. For example, did you try this one? <a href="https://pypi.python.org/pypi/iso8601" rel="nofollow">https://pypi.python.org/pypi/iso8601</a><p>How do these all compare to each other?