Ask HN: Lightweight data analytics using SQLite, Bash and DuckDB – too simple?

4 points by hilti over 1 year ago
Over the last 12 months, in parallel to using Google BigQuery, I have built my own processing pipeline using SQLite and DuckDB.

What amazes me is that it works surprisingly well and costs much less than using BigQuery.

Roughly speaking, here's what I do: a SQLite database receives IoT sensor data via a very simple PHP function (I currently use the FlightPHP framework for this). The data is written to a table within the SQLite database (WAL mode activated), and machine states are updated using triggers.

Example of a trigger:

    CREATE TRIGGER message_added AFTER INSERT ON messages
    BEGIN
      INSERT OR REPLACE INTO states VALUES (
        new.id,
        new.status,
        new.time_stamp,
        new.current_present,
        new.voltage_present
      );
    END;

This allows me to query the current status of a machine in real time. To do this, I again use a simple PHP function that provides the data via SSE. In the frontend, a simple JavaScript method (plain vanilla JS) retrieves the JSON data and updates the HTML in real time.

    const source_realtime = new EventSource("https://myapi/sse_realtime_json");
    source_realtime.onmessage = function (event) {
      var json = JSON.parse(event.data);
    };

For a historical analysis - for example over 24 months - I create a CSV export from the SQLite database and convert the CSV files into Parquet format. I use a simple Bash script that I execute regularly via a cron job. Here is an excerpt:

    # Loop through the arrays and export each table to a CSV, then convert it
    # to a Parquet file and load it into the DuckDB database
    for (( i=0; i<${arrayLength}; i++ )); do
      db=${databases[$i]}
      table=${tables[$i]}

      echo "Processing $db - $table"

      # Export the SQLite table to a CSV file
      sqlite3 -header -csv "$db" "SELECT * FROM $table;" > "parquet/$table.csv"

      # Convert the CSV file to a Parquet file using DuckDB
      $duckdb_executable "$duckdb_database" <<EOF
    -- Set configurations
    SET memory_limit='2GB';
    SET threads TO 2;
    SET enable_progress_bar=true;

    COPY (SELECT * FROM read_csv_auto('parquet/$table.csv', header=True))
      TO 'parquet/$table.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

    CREATE TABLE $table AS SELECT * FROM read_parquet('parquet/$table.parquet');
    EOF
    done

Now, finally, my question: am I overlooking something? This little system currently works well for 15 million events per month. No outages, nothing like that. I read so much about fancy data pipelines, reactive frontend dashboards, lambda functions ...

Somehow my system feels "too simple". So I'm sharing it with you in the hope of getting feedback.
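[Editor's note] For reference, here is a minimal, hypothetical sketch of the SQLite side that the trigger above implies. The column names are taken from the trigger; the types, the PRIMARY KEY on states.id (which INSERT OR REPLACE relies on to behave like an upsert), and the placement of the WAL pragma are assumptions, not the author's actual schema.

    -- Hypothetical SQLite schema inferred from the trigger; types are guesses.
    -- WAL mode is set once per database file and persists across connections.
    PRAGMA journal_mode = WAL;

    CREATE TABLE IF NOT EXISTS messages (
        id              INTEGER,   -- machine / sensor identifier (assumed)
        status          TEXT,
        time_stamp      TEXT,      -- assumed ISO-8601 string written by the PHP ingest endpoint
        current_present REAL,
        voltage_present REAL
    );

    -- One row per machine; the AFTER INSERT trigger keeps it current.
    -- INSERT OR REPLACE needs a PRIMARY KEY (or UNIQUE constraint) on id to act as an upsert.
    CREATE TABLE IF NOT EXISTS states (
        id              INTEGER PRIMARY KEY,
        status          TEXT,
        time_stamp      TEXT,
        current_present REAL,
        voltage_present REAL
    );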
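[Editor's note] And a sketch of the kind of historical query the Parquet export enables, run directly in DuckDB. The file path, the column names, and the assumption that time_stamp is stored as an ISO-8601 text timestamp are illustrative, not taken from the post.

    -- Hypothetical 24-month rollup over one exported Parquet file.
    -- Assumes a 'messages' export with 'time_stamp' (ISO-8601 text) and 'status'
    -- columns, both inferred from the trigger in the post.
    SELECT
        date_trunc('month', CAST(time_stamp AS TIMESTAMP)) AS month,
        status,
        count(*) AS events
    FROM read_parquet('parquet/messages.parquet')
    WHERE CAST(time_stamp AS TIMESTAMP) >= current_date - INTERVAL 24 MONTH
    GROUP BY month, status
    ORDER BY month, status;

A query like this could be run from the same DuckDB CLI the export script already invokes, either against the raw Parquet files or against the tables the script creates from them.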

1 comment

perrygeo, over 1 year ago
You've discovered a little secret of the industry - most ultra-scalable big data solutions are complete overkill for many scenarios. I've used a technique similar to yours for 20 years and it's only fallen out of fashion in the last 5 years. Not because the workloads are too big but because the industry has willingly chosen to make it more complex (for reasons). Personal computers with local disk are big enough and powerful enough to handle analytics on even medium-sized workloads (into the tens of billions of rows / 100s of GB scale). There's nothing stopping you from doing the simple approach except for dogma and fashion.

The problem is team dynamics. Who executes that bash script and when? What happens if something goes wrong? How do you run it when you're offline, or traveling, or need to use your laptop for something else? How do you and your team track the progress and view the results? And since this is all running in the cloud, how do you track access, security, and costs? And the bus factor - what if you leave and your knowledge leaves with you? What about all the junior developers that want to use FancyNewThing for their resume, can we incorporate that somehow? You need a complex system to support the dozens of people who want to stick their hands in the pipeline; a perfect reflection of Conway's law. These are the organizational problems that "fancy" cloud data pipelines deal with. If you don't need such things, you can (and should) reduce the complexity by orders of magnitude.