Ask HN: Is KDB a sane choice for a datalake in 2024?

40 points by sonthonax 12 months ago
Pardon the vague question, but KDB is very much institutional knowledge hidden from the outside world. People have built their livelihoods around it and use it as a hammer for all sorts of nails.

It's also extremely expensive and written in a language with origins so obtuse that its progenitor, APL, needed a custom keyboard laden with mathematical symbols.

Within my firm it's very hard to get an outside perspective; the KDB developers are true believers in KDB, but they obviously don't want to be professionally replaced. So I'm asking the more forward-leaning HN.

One nail in my job is KDB as a data lake, and I'm being driven nuts by it. I write code in Rust that prices options. There's a lot of complex code involved in this; I use a mix of numeric simulations to calculate greeks and somewhat lengthy analytical formulas.

The data that I save to KDB is quite raw: I save the market data and derived volatility surfaces, which are themselves complex-ish models needing some carefully unit-tested code to convert into implied vols.

Right now my desk has no proper tooling for backtesting that uses our own data. I'm constantly being asked to do something about it, and I don't know what to do!

I'm 99% sure KDB is the wrong tool for the job, because of three things:

- It's not horizontally scalable. A divide-and-conquer algo on N < {small_number} cores is pointless.

- I'm scared to run queries that return a lot of data. It's non-trivial to get a day's worth of data: the query will often just freeze, and it doesn't even buffer. Even when I'm only trying to fetch what should be a logical partition, the wire format is really inefficient and uncompressed. I feel like I need to do engineering work for trivial things.

- The main thing is that I need to do complex math to convert my raw data, order books and vol surfaces, into data that's useful to backtest.

I have no idea how to do any of this in KDB. My firm is primarily a spot desk, and while I respect my colleagues, their answer is:

> Other firms are really invested in KDB and use KDB for this, just figure it out.

I'm going nuts because I assume those other firms are far larger and have teams of KDB quants doing the actual research. We have some quant traders who know a bit of KDB, but they work on the spot side with far simpler math.

I keep advocating for a Parquet-style data store with Spark/Dask/Arrow/Polars running on top of it: it can be horizontally scaled, and most importantly, with Polars I can write my backtests in Rust and leverage the libraries I've already written.

I get shot down with "we use KDB here". I just don't know how I can deliver a maintainable solution to my traders on the current infrastructure. Bizarrely, for a financial firm, no one in a team of ~100 devs here has ever touched Spark-style tech other than me.

What should I do? Are my concerns overblown? Am I misunderstanding the power of KDB?

15 comments

vessenes 12 months ago
Long-time Kdb/q enthusiast, absolutely NO enterprise deployment experience whatsoever.

This feels like a 'pick your poison' situation. You've been told already you won't be allowed to dump kdb; it's probably embedded in your infra in a bunch of ways, and ripping it out is a no-go.

OK, so, you have data in kdb. What you're doing right now (it sounds like) is using it as literally just a raw data store. That's the worst way to use it; a lot of work went into making it very fast to run summarization/grouping/sorting/etc. right on the kdb servers. Note that this is very unlike how an Apache project works.

Unfortunately, you wrote a Rust library that probably doesn't really distinguish your kdb storage from, say, JSON files, so you are at a crossroads.

Option 1: Get some good data cloning up, clone data over to your preferred generalized data-lake tech, run Rust against it.

Option 2: Go through your Rust code with a fine-tooth comb and figure out where exactly it's doing things that cannot be done semantically in q/k. Start slimming down your Rust lib, or more exactly, rework what queries it's sending and what shape of data it expects.

Option 3: Dump your Rust library and rewrite it in q or k.

Of these, I would be willing to bet that for an 'ideal' developer (meaning a 160+ IQ dev skilled in Rust, vs a 160+ IQ dev skilled in kdb, vs a 160+ IQ dev skilled in, say, Java + Spark), Option 3 is going to be by far the least resource-intensive in terms of deployed hardware, and the fastest / lowest latency.

That said, given where you're at, a principled Rustacean who's looking at coming to grips with kdb realtime, I think I'd recommend you think hard about Option 2. By the end of Option 2 you will probably be like "Yeah, this could be all k, or nearly all," but you're likely going to have some learning to do.

Think of it this way: when you're done, you'll be on the other side of the cabal, and can double your base rate for your next gig. :)
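A minimal sketch of what "rework the queries" can mean in practice: push the aggregation into q and pull back only the result. The trade table and its columns here are hypothetical:

    / server-side VWAP per symbol; only the tiny aggregate crosses the wire
    select vwap: size wavg price, n: count i by sym from trade where date=2024.06.03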
alexpotato 12 months ago
I'm speaking as someone who:

- has worked in finance for both hedge funds and banks

- has managed a project where KDB was mandated by mgmt.

- on the above project, tasked one of the smartest developers I've ever worked with to learn KDB and use it in the application

- is an SRE (and former Operations), so that colors my perspective

Given the above, I'll list out the pros and cons:

PROS

- KDB is pretty fast (on some metrics of fast)

CONS

- VERY few people can write/read good Q (compared to, say, people who know Pandas/sklearn etc.)

- The learning curve is INCREDIBLY steep. Even the most-cited documentation and tutorials have something like this in the intro: "Are you really sure you need KDB? Because Q is REALLY hard to learn."

- As you mention, open-source industry standards have come a LONG way since it made sense to have KDB (e.g. in the late 2000s/early 2010s)

Conclusion:

If you have a lot of in-house expertise, then sure, it probably makes sense. If you are starting from scratch, I would not recommend it.

On that note, this point stood out:

> People have built their livelihoods around it and use it as a hammer for all sorts of nails.

If you work in the industry long enough, you will find a lot of complexity added to systems for three reasons:

1. Some things in finance really do need to be complex due to the math etc.

2. Smart people with quant backgrounds tend to LOVE complex things.

3. Smart, rational people realize that adding complexity is one way to build a fortress around their job. This is particularly true in high-paying firms, where people realize it's their knowledge of the complex systems that keeps them in that high-paying job.

Given that, if you are looking to make a name for yourself at your firm: making things run faster, with fewer issues, is a good way to stand out. Just be careful that you don't eliminate so much complexity that people get mad at you.
steveBK123 12 months ago
Longtime KDB user here. I think you may have some personal misunderstandings, plus some poor engineering at your firm around the tech/data. Time-series data, particularly market data, is exactly the use case the product excels at.

The wire format is compressed.

KDB horizontally scales (even their competitors' comparison pages state this: https://www.influxdata.com/comparison/kdb-vs-tsdb/)

A few things to consider that might help: you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire, and over to your process for analysis. KDB will not excel at this, nor will anything else. KDB shines when you learn to move your code to the data rather than your data to the code.

What does "move the code to the data" mean in practice?

You can do things like use PyKX, which allows you to run your Python and kdb code together, directly on top of the data, in the same process.

You should do as much of the filtering/aggregation/joins/etc. on the KDB side as possible before pulling the results back. You should also define, generate, and use pre-aggregated data where it makes sense for your use case (second / minute / day bars).

Backtesting in KDB is relatively trivial, as you have historical data organized by day and symbol. Any half-decent KDB dev should be able to cook one up, of increasing complexity, for you.

Nick Psaris has a couple of books that cover more advanced topics that may be of use.
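To make "pre-aggregated bars" concrete, here is a minimal q sketch of five-minute bars built server-side; the table and column names are hypothetical:

    / five-minute OHLCV bars, computed next to the data; only the bars come back
    select o:first price, h:max price, l:min price, c:last price, v:sum size
      by sym, bar:5 xbar time.minute
      from trade where date=2024.06.03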
oneplane 12 months ago
As powerful as KDB is, finding people to make use of it is almost not worth it. But as it tends to be entrenched (usually via poor reasoning), you usually are screwed when some company or project uses it. I personally would just quit and work somewhere else or on something else.

It's about as isolated as mainframe engineering: great on paper, great in closed-off circles, but practically dead in the tech community at large.
succint 11 months ago
You definitely should be able to do these calculations in q near-data. In fact, porting your code from Rust to Q might even reveal bugs and/or sub-optimal code. This was many years ago, but I ported some non-trivial image processing code to Q just to learn the language. I was amazed by how everything fit into a page of code, and how seeing all of it together revealed a subtle bug and a couple of ways to optimize better.
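As a small taste of that concision, a moving-average crossover signal over a price vector is a single line of q (the price variable is hypothetical):

    / 1b wherever the 5-tick moving average is above the 20-tick moving average
    sig: (5 mavg price) > 20 mavg price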
sonthonax 12 months ago
I care about my job, enjoy my work, and take pride in being able to deliver things, but I don't know how to deliver value for my traders here.

I'd really appreciate some perspective from seasoned data engineers who might have seen this KDB-as-a-data-lake pattern before, and what they did about it. Not just technically, but how they managed the organisational change away from KDB for the KDB quants.

I also just don't really know where else to ask. There's not really an online KDB community, and you have a lot of KDB devs who are really good at KDB but know barely anything else, which makes me skeptical of their advice.
nextworddev 12 months ago
No. Don’t go with KDB. (Source: built multiple production backtesting systems in prop desks)
rbanffy 12 months ago
> Pardon the vague question, but KDB is very much institutional knowledge hidden from the outside world.

This answers the question for me: unless you are sure the performance lives up to your expectations AND it gives you a competitive advantage (which can be easily lost with the human in the loop), don't even think of it. Get the next-best tech that's easy to use and well documented, and remove the human in the loop to gain the edge.
steveBK123 12 months ago
Also, you mention Rust: https://docs.rs/kdb/latest/kdb/

In the old style of kdb integration, you could compile a .so and load it in to extend the language. Now people use PyKX to load in Python modules. I have a guy doing this on my team to load a Python-wrapped Rust lib.

It looks like you have a few options with Rust, as per the link above.

Note this allows you to do the "move the code to the data" trick I mentioned.
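For reference, the old ".so" route looks like this on the q side, via the 2: dynamic-load operator; the library name, function name, and arity are hypothetical:

    / load the 2-argument C-ABI function price_greeks from libpricer.so
    greeks: `:./libpricer 2: (`price_greeks; 2)
    / callable like any q function, right next to the data
    greeks[surfaces; books]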
pyuser583 12 months ago
If you're using KDB, use KDB. The decision has been made, the license paid.

Work to change your organization, not your technology.

Nuclear bombs used to be controlled by decades-old systems that worked off floppy disks. Why? Because the systems were so important that people worked around the tech.

You're in a similar spot.

Even when using conventional languages and platforms, sometimes the decision has been made and you're stuck with it.

KDB might not be the best fit for a data lake, but plenty of people will sleep better just knowing it's KDB.

Change the people, not the tech.
sneakyavacado 12 months ago
I've intermittently worked with kdb for the past three years and feel broadly the same.

Can you deploy to a host that has a mount of the database and run a local q, or are you forced to query via IPC? Are you on cloud or on-prem?

A fight for horizontal scaling and running a local q against the data might be an easier one to win than a full replacement of the database.
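Concretely, the two modes look like this in q; the mount path, host, and port are hypothetical:

    / local q: load the HDB from the mount and query in-process
    \l /mnt/hdb
    select count i by date from trade

    / remote client: open an IPC handle and ship the query to the server
    h: hopen `:kdbhost:5000
    h "select count i by date from trade"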
gigatexal 12 months ago
Totally out of the loop here, what's KDB?
mrj 12 months ago
> I get shot down with "we use KDB here"

Well, fundamentally a decision has already been made. You got shot down. Unless there's some significant new data that might change that decision, it is what it is. The next step is for you to decide if you can get on board with that, or if it's time to start planning your next move. I don't know anything about KDB, and you might be right. But it sounds like the powers that be don't want to make this change.

Sorry, but it makes no sense to swim against the tide in an organization where you don't have the rank to make the decisions.

> I keep on advocating for some Parquet style data-store...

The worst thing you can do is continue spending your time advocating for change; they've heard you. If you stay on and the current tech does become untenable, you are unlikely to be the hero even in that case. They might just remember you as the detractor.
jmakov 12 months ago
What's wrong with Delta Lake?
michaelg7x 12 months ago
Hi, KDB is used for this kind of thing in probably all the Tier 1 banks, or has been at some point. I'm surprised that you seem to have been given so little help by the KDB guys, as it really matters how you store your data. That's informed by the data itself and the access patterns you're likely to use. When you say you're saving complex-ish models, it makes me think the layout may not be optimal for KDB to process.

KDB is in some respects as dumb as a bag of rocks. There is no execution profiler nor explain plan, no query analysis at all. When running your query over tabular data it simply applies the where-clause constraints in order, passing the boolean vector result from one to the next, which refines the rows still under active consideration. It's for this reason that newbies are always told to put the date constraint first, or they'll try to load the entire history (typically VOD.L) into memory.

KDB really is very fast at processing vector data. Writing nested vectors or dictionaries to individual cells could easily be slowing you down; I've heard of one approach which writes nested dictionaries into vectors with the addition of a column to contain the dictionary keys. Then you get KDB to go faster over the 1-D data, nicely laid out on disk. You really do need to write it down in a way that is sympathetic to the way you will eventually process it.

You can create hashmap indices over column data, but the typical way of writing down equity L1 data is to "partition by date" (write it into a date directory) and "apply the parted attribute" to the symbol column (group by symbol, sort by time ascending). Each of the remaining vectors (time, price, size, exchange, whatnot) is obviously sorted to match, and finding the next or previous trade for a given symbol is O(1) simplicity itself. I've never worked on options data and so can't opine on the problems it presents, but if you've been asked to write this down without any help, then it's pretty "rubbish" of the KDB guys in your firm. You have asked for help, right?

I'm really going on a bit, but just a few more things:

- KDB will compress IPC data, if it wants to. The data needs to exceed some size threshold and you must, I think, be sending it between hosts. It won't bother compressing to localhost, at least according to some wisdom received from one of the guys at Kx, many moons ago. The IPC format itself is more or less a tag-length-value format, and good enough. It evolved to support vectors bigger than INT32_MAX a while ago, but most IPC-interop libraries don't tend to advertise support for the later version that lets you send silly-big amounts of data around, so my guess is you may not want to load data out of KDB a day at a time. Try to do the processing in KDB!

You said you're scared to do queries that return a lot of data, and that it often freezes. Are you sure the problem is at the KDB end? This may sound glib, but you wouldn't be the first person to have been given a VM to do your dev work on that isn't quite up to the job. You can find out the size of the payload you're trying to read by running the same query with the "-22!" system call; it'll tell you how many bytes it's trying to send. Surely there's help to be had from the KDB guys if you reach out?

- I'm confused by the use of the term "data lake": to me this includes unstructured data. I'm not sure I'd ever characterise a KDB HDB as such.

- If your firm has had KDB for ages, there's a good chance it's big enough to be signed up to one of the research groups who maintain test-suites they run over a vendor's latest hardware offering, letting them claim the crown for the fastest greeks or something. If your firm is a member, you may be able to access the test-suites and look at how the data in the options tests is being written and read; there are quite a few, I think.

- KDB can scale horizontally. It can employ a number (I forget whether it's bounded) of slave instances and farm out work. I think I read that the latest version has a better work-stealing algo. It's often about the data, though: if the data for a particular symbol/date tuple is on that one server over there, then you're probably better off doing big historic reads on that one server alone. I doubt very much you're compute-bound, or you'd have told us that your KDB licence limited you to a single core or N (rather than any number of) cores.

- Many years ago I was told never to run KDB on NFS. Except Solaris' NFS. I have no idea whether this is relevant ;)

Good luck, sonthonax
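To make a couple of these points concrete in q (the table, date, and paths are hypothetical):

    / date constraint first: where-clauses apply left to right, so this touches
    / only one date partition; leading with sym would scan the entire history
    select from trade where date=2024.06.03, sym=`VOD.L

    / -22! returns the serialized length in bytes: check before pulling over IPC
    -22! (select from trade where date=2024.06.03)

    / .Q.dpft writes a day of data into a date partition, sorted by sym with the
    / parted attribute applied, i.e. the layout described above
    .Q.dpft[`:/mnt/hdb; 2024.06.03; `sym; `trade]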