Cheap and DIY solution for log analysis

6 points by unrequited 8 months ago
I’ve got ~5TB of logs. I’m open to ideas for doing log analysis using any of the available AI models, to play around with for learning purposes. I’m on a budget for this but have time to work on something DIY. Kindly suggest any ideas for anomaly detection or similar things to try with these logs. Thanks.

2 comments

speedgoose 8 months ago
A few random thoughts since no one replied:

(rip)grep. Use AI to suggest what to look for, if you want to use AI. Maybe do it in reverse, so you filter out the logs you aren't interested in.

Look at the similarity of each line. Working on raw UTF-8 or ASCII may not be good enough, though it can quickly highlight some interesting lines. Perhaps a nice tokenizer can help, or even language-model embeddings. You can play with old text-similarity algorithms, or cosine similarity and the like.

Play with clustering algorithms like UMAP and HDBSCAN (or whatever the state of the art is; I haven't looked at the field recently).

Feeding a chat/instruct LLM 5TB of logs is technically possible, but that would be a huge waste of resources IMHO. Is it worth it? You could feed it only the unusual lines that survive filtering.

Let's say you have hardware and an LLM that can process 100 tokens/s: 5TB is about 400 years of compute.
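(The 400-year figure checks out: 5TB at roughly 4 bytes per token is about 1.25 trillion tokens, and at 100 tokens/s that is on the order of 10^10 seconds, i.e. ~400 years.) Below is a minimal sketch of the embed-and-cluster idea, assuming the sentence-transformers, umap-learn, and hdbscan packages and a pre-filtered sample file; the file name, model name, and thresholds are illustrative, not part of the comment's suggestion. The reverse-grep step could be as simple as running rg -v -f known_patterns.txt over the logs first.

    # Sketch: embed deduplicated log lines, reduce, cluster, surface outliers.
    # Assumes: pip install sentence-transformers umap-learn hdbscan
    # "sample.log", the model name, and thresholds are illustrative.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # Work on a deduplicated sample, never the raw 5TB.
    with open("sample.log", encoding="utf-8", errors="replace") as f:
        lines = list(dict.fromkeys(ln.strip() for ln in f if ln.strip()))

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly
    embeddings = model.encode(lines, batch_size=256)

    # Reduce to a few dimensions so density-based clustering stays tractable.
    reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=50).fit(reduced)

    # HDBSCAN labels low-density points -1; those are the unusual lines
    # worth reading by hand (or feeding to an LLM).
    outliers = [ln for ln, lbl in zip(lines, clusterer.labels_) if lbl == -1]
    print(f"{len(outliers)} outlier lines out of {len(lines)}")

Exact-duplicate removal alone (the dict.fromkeys step) often shrinks logs by orders of magnitude, since most lines are repeats of a few templates; the HDBSCAN noise label is then a cheap first pass at anomaly detection.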
tonetegeatinst 8 months ago
5TB of log data, even after cleaning it up (which would also take time), is a lot of input for any model.

I think it's probably more feasible to sort by type or category. Maybe do something like Kibana or Graylog so you can better visualize the logs and narrow down what's an IOC (indicator of compromise) and what might just be a random error message. This also lets you look at the types of logs over a time period.

Any ML or AI model would be computationally expensive, and if this isn't something where you have the hardware to self-host, then you also need to upload 5TB of logs.
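For the sort-by-type idea, here is a minimal sketch that approximates the Kibana/Graylog time view with a plain script rather than either tool. It assumes ISO-8601 timestamps at the start of each line and conventional severity keywords; both regexes are illustrative and depend entirely on the actual log format.

    # Sketch: count severity levels per hour, a poor man's Kibana panel.
    # Assumes lines like "2025-01-31T12:00:00 ERROR ..."; adjust the
    # regexes to the real log format before trusting the counts.
    import re
    from collections import Counter

    LEVEL = re.compile(r"\b(DEBUG|INFO|WARN(?:ING)?|ERROR|FATAL|CRITICAL)\b")
    HOUR = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2})")

    counts = Counter()
    with open("sample.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            hour, level = HOUR.match(line), LEVEL.search(line)
            if hour and level:
                counts[(hour.group(1), level.group(1))] += 1

    for (hour, level), n in sorted(counts.items()):
        print(hour, level, n)

Hours where the ERROR or FATAL share jumps relative to the baseline are natural places to dig first, and the lines from those windows are a far smaller set to hand to any model.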