Ask HN: Generating reports on time series data without killing performance?

5 points by Kendrick2 over 8 years ago

I have an application which contains approximately 10 million time series records, growing at a rate of 1 million per year. This application generates reports on subsets of the time series data. These reports have poor performance characteristics because, for each record in a time series, another subset of time series data must often be loaded as part of the computations behind the report. Effectively this can lead to N^2 performance. The current workaround is to implement these reports as stored procedures in the Oracle database where the time series data resides. This saves network roundtrips for every time series record.

I've found the stored procedures to be insufficiently flexible to handle complex requirements compared to modern programming languages (requirements come in for new reports on a regular basis). I'd like to generate these reports in application code (C#) but can't see a way around the performance issue. Has anyone dealt with similar challenges, and how did you work around them?
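
A rough sketch, in Python rather than C#/Oracle, of the usual way around the per-record round trip: pull the report's rows and every related series it touches in one or two bulk queries, then join them with an in-memory index. The data below is a toy stand-in for what the bulk queries would return.

    import collections

    # Toy stand-ins for rows returned by two bulk queries: (series_id, ts, value).
    primary_rows = [(1, 0, 10.0), (1, 1, 11.0), (2, 0, 20.0)]
    related_rows = [(1, 0, 0.5), (1, 1, 0.6), (2, 0, 0.9)]

    # Index the related data once in memory...
    related_by_series = collections.defaultdict(dict)
    for series_id, ts, value in related_rows:
        related_by_series[series_id][ts] = value

    # ...then each report row does a dictionary lookup instead of issuing
    # another query per record (the source of the N^2 behaviour).
    report = []
    for series_id, ts, value in primary_rows:
        related_value = related_by_series[series_id].get(ts)
        report.append((series_id, ts, value, related_value))

    print(report)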

7 comments

eschutte2 over 8 years ago
That volume of data should be handled pretty easily by the right indexes and joins. At one place with that order of magnitude of data we kept an offline synced clone of the data with indexes designed for the queries we needed to run (this is basically OLAP). Have you already looked into that?
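
As a concrete illustration of the clone-with-tailored-indexes idea, here is a minimal sketch using SQLite as a stand-in for the offline copy; the table, column, and file names are made up.

    import sqlite3

    clone = sqlite3.connect("report_clone.db")  # offline copy, periodically synced from Oracle
    clone.execute("CREATE TABLE IF NOT EXISTS ts_data (series_id INTEGER, ts INTEGER, value REAL)")
    clone.execute("CREATE TABLE IF NOT EXISTS ts_related (series_id INTEGER, ts INTEGER, factor REAL)")

    # Composite indexes shaped for the report's join, so each lookup is an index seek, not a scan.
    clone.execute("CREATE INDEX IF NOT EXISTS ix_data ON ts_data (series_id, ts)")
    clone.execute("CREATE INDEX IF NOT EXISTS ix_related ON ts_related (series_id, ts)")

    rows = clone.execute("""
        SELECT d.series_id, d.ts, d.value * r.factor
        FROM ts_data d
        JOIN ts_related r ON r.series_id = d.series_id AND r.ts = d.ts
        WHERE d.series_id = ?""", (7,)).fetchall()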
brudgers over 8 years ago

1. 10 million records is not big data.

2. If the rate of data acquisition is 1 million records a year, then 9 million records are the same as last year. There's no reason to hit the database again, because time series data [to be time series data] does not change.

3. This suggests that processing the data from the Oracle database is not a requirement... i.e. it could be moved to another system and processed there. Again, it's small data and doesn't change, so duplication presents neither a storage issue nor a consistency issue.

4. The size and static nature of the data suggest that it might fit into memory on a single, moderately specced PC. An AWS-type approach is also possible.

5. The right data storage format depends on the workload... maybe something column oriented?

Good luck.
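
Point 4 is easy to sanity-check with a back-of-envelope calculation, and a column-per-field layout like the one below is one way to read point 5; the field names and dtypes are assumptions.

    import numpy as np

    n = 10_000_000  # current record count from the question
    series_id = np.zeros(n, dtype=np.int32)         # 4 bytes per row
    ts        = np.zeros(n, dtype="datetime64[s]")  # 8 bytes per row
    value     = np.zeros(n, dtype=np.float64)       # 8 bytes per row

    total_mb = (series_id.nbytes + ts.nbytes + value.nbytes) / 1e6
    print(total_mb)  # ~200 MB: the whole data set fits comfortably in RAM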
NumberCruncher over 8 years ago

When working with time series it is good practice to move the data into memory and vectorise both the data and the data-manipulation logic. Doing so you can take advantage of battle-tested and fast linear algebra libraries like BLAS and LAPACK (or even move the calculation to the GPU). The last time I worked with time series data I used python-pandas. Even without prior Python knowledge it took only a couple of weeks to get productive. It also has a wrapper around SQLAlchemy so you don't have to deal directly with the ORM.
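
A minimal pandas sketch of what "vectorise the data-manipulation logic" looks like; the frame below stands in for whatever a read_sql call would return, and a 30-point rolling mean stands in for the real report logic.

    import numpy as np
    import pandas as pd

    # Toy frame standing in for the time series table.
    df = pd.DataFrame({
        "series_id": np.repeat(np.arange(3), 1000),
        "value": np.random.default_rng(0).normal(size=3000),
    })

    # Vectorised, per-series rolling statistic: no explicit Python loop over rows,
    # so the heavy lifting happens in the compiled numpy/BLAS layer.
    df["rolling_mean"] = (df.groupby("series_id")["value"]
                            .transform(lambda s: s.rolling(30, min_periods=1).mean()))
    print(df.head())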
thorin over 8 years ago

I think you're probably wrong that you can't handle it in stored procedures (packages). You may even be able to do a lot of the processing in a single SQL statement? Try this first, then PL/SQL. Do the bits you can't do in PL/SQL in a Java class stored on the database...
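
A hedged sketch of the kind of single analytic statement this is pointing at; the table, columns, and connection details are made up. The window function does the per-record rollup inside Oracle, so only finished rows cross the network.

    import cx_Oracle  # assuming the cx_Oracle driver; any Oracle client would do

    SQL = """
    SELECT series_id, ts, value,
           AVG(value) OVER (PARTITION BY series_id
                            ORDER BY ts
                            ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS mov_avg_30
    FROM   timeseries
    WHERE  report_key = :rk
    """

    conn = cx_Oracle.connect("user", "password", "host/service")  # placeholder credentials
    rows = conn.cursor().execute(SQL, rk=42).fetchall()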
tmaly over 8 years ago

I am currently on a task of processing 4 million records to produce a report. It takes me about 10 seconds in the code.

You could always use a sliding window type algorithm to keep up with the growth of the data.
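
For what a sliding-window approach can look like, here is a small sketch: each new value updates a running total, so results keep up with growth without rescanning history. The window size is an arbitrary example.

    from collections import deque

    class SlidingMean:
        """Mean of the last `window` values, updated incrementally."""
        def __init__(self, window):
            self.buf = deque(maxlen=window)
            self.total = 0.0

        def add(self, value):
            if len(self.buf) == self.buf.maxlen:
                self.total -= self.buf[0]  # value about to fall out of the window
            self.buf.append(value)
            self.total += value
            return self.total / len(self.buf)

    sm = SlidingMean(3)
    print([sm.add(v) for v in [1, 2, 3, 4, 5]])  # [1.0, 1.5, 2.0, 3.0, 4.0]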
ak39 over 8 years ago

If the requirements allow, consider replicating the data in memory with SQLite's :memory: option. 10 million records will be nothing for it.

Let us know.
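
A minimal sketch of the :memory: replica idea, with made-up table and column names; in practice the rows would be streamed out of Oracle once, e.g. in batches.

    import sqlite3

    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE ts_data (series_id INTEGER, ts INTEGER, value REAL)")

    # Stand-in for a one-off bulk copy from Oracle.
    rows = ((s, t, s + t / 1000.0) for s in range(100) for t in range(1000))
    mem.executemany("INSERT INTO ts_data VALUES (?, ?, ?)", rows)
    mem.execute("CREATE INDEX ix_series_ts ON ts_data (series_id, ts)")

    avg, = mem.execute("SELECT AVG(value) FROM ts_data WHERE series_id = ?", (42,)).fetchone()
    print(avg)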
maigrait over 8 years ago

Are these Java stored procedures?