
TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.


Ask HN: Extract Any Date Format

1 point by yonz over 1 year ago
I am looking for a way to extract and transform dates from any format into unix time at a 10M+ scale (if it takes 1s/record it will take ~4 months). I am using this simple task to explore how LLM adaptation capabilities can be made performant for scalable extraction.

- Oct26 3:51PM
- Nov 6, 2023 at 9:42:44 AM
- 2023-05-29T06:40:31.249-06:00
- June 3, 2011 at 4:52 AM
- Fr Nov 10, 2023, at 9:42:44 AM
- Friday November 10 at 2:20 AM

... You get the gist.
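For scale, the ~4-month estimate checks out, and at least the ISO-8601 example needs no model at all. A quick stdlib check (assuming Python as the implementation language):

```python
from datetime import datetime

# Sanity-check the scale estimate in the post: 10M records at 1 s each.
seconds = 10_000_000
days = seconds / 86_400  # ~115.7 days, i.e. roughly 4 months

# The ISO-8601 example parses directly with the stdlib; .timestamp()
# on an offset-aware datetime returns unix time as float seconds.
dt = datetime.fromisoformat("2023-05-29T06:40:31.249-06:00")
unix = dt.timestamp()
```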

3 comments

LinuxBender over 1 year ago
I don't know the fastest method, but you might search around on StackExchange [1] and test some of their answers; I see examples of people converting dirty date inputs that would surely be faster than using a heavy LLM, especially if the work is broken up into batches and distributed to multiple machines by core/thread count.

[1] - https://stackoverflow.com/questions/63371125/python-how-to-clean-dirty-date-time-strings
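In the spirit of that thread, dirty strings can often be normalized before a plain strptime. A minimal stdlib sketch (the cleanup regex and format string are illustrative, not taken from the linked answers):

```python
import re
from datetime import datetime, timezone

# Strip the noise that trips up strptime: weekday words ("Fr", "Friday"),
# the filler word "at" (plus any comma before it), then collapse spaces.
_NOISE = re.compile(r"\b(?:Mon|Tue|Wed|Thu|Fri|Fr|Sat|Sun)[a-z]*\b|,?\s*\bat\b", re.I)

def clean(s):
    s = _NOISE.sub(" ", s)
    return re.sub(r"\s+", " ", s).strip()

def parse_dirty(s):
    # "Fr Nov 10, 2023, at 9:42:44 AM" -> "Nov 10, 2023 9:42:44 AM"
    dt = datetime.strptime(clean(s), "%b %d, %Y %I:%M:%S %p")
    return dt.replace(tzinfo=timezone.utc).timestamp()  # assume naive = UTC
```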
Someone over 1 year ago
> I am using this simple task to explore how LLM adaptation capabilities can be made performant for scalable extraction

Divide and conquer:

- Look at the first 50 or so unhandled cases, and pick the most common pattern in them.
- Find a way to handle that pattern with a simple parser (e.g. a JVM DateTimeFormatter, a regex, or whatever works decently in your preferred language).
- Repeat.

That will probably decrease that 10 million to a million, then to 100,000, fairly rapidly. Once you're down to a manageable number, get your LLM to handle those.

(Also: this task can likely be run in parallel easily, so if you have money, you won't need 10M+ seconds.)
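The cascade described above can be sketched in Python (the strptime patterns are guesses matched to the example strings from the post; naive timestamps are assumed to be UTC):

```python
from datetime import datetime, timezone

# Ordered list of handlers for the most common patterns so far; anything
# that falls through all of them is left for the (expensive) LLM pass.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f%z",    # 2023-05-29T06:40:31.249-06:00
    "%b %d, %Y at %I:%M:%S %p",  # Nov 6, 2023 at 9:42:44 AM
    "%B %d, %Y at %I:%M %p",     # June 3, 2011 at 4:52 AM
]

def to_unix(s):
    """Return unix time, or None if no known pattern matches."""
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(s, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:  # assume UTC for naive stamps
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.timestamp()
    return None  # unhandled -> send to the LLM

records = [
    "2023-05-29T06:40:31.249-06:00",
    "Nov 6, 2023 at 9:42:44 AM",
    "June 3, 2011 at 4:52 AM",
    "Friday November 10 at 2:20 AM",  # no handler yet
]
leftover = [r for r in records if to_unix(r) is None]
```

Each "repeat" step is then just inspecting `leftover` and appending one more format.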
xeckr over 1 year ago
I would transform the most commonly occurring formats programmatically. The rest could probably be handled by GPT-3. Alternatively, divide the task across as many cloud VMs hosting something like LLaMa as it takes to meet your time constraints.
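Splitting the work across VMs (or cores) as suggested reduces to partitioning the record range; a small hypothetical helper for that bookkeeping:

```python
def shard(n_records, n_workers):
    """Split [0, n_records) into near-equal (start, stop) ranges,
    one per VM/core; earlier shards absorb the remainder."""
    base, extra = divmod(n_records, n_workers)
    bounds, start = [], 0
    for i in range(n_workers):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds
```

For example, `shard(10_000_000, 8)` yields eight ranges of 1.25M records each, which each worker can stream through its own parser (or model) independently.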