Over the last 12 months, alongside Google BigQuery, I have built my own processing pipeline using SQLite and DuckDB.

What amazes me is that it works surprisingly well and costs much less than BigQuery.

Roughly speaking, here's what I do:
A SQLite database receives IoT sensor data via a very simple PHP function; I currently use the FlightPHP framework for this. The data is written to a messages table within the SQLite database (WAL mode activated), and the machines' states are kept up to date using triggers.

Example of a trigger:

<pre><code>CREATE TRIGGER message_added
AFTER INSERT ON messages
BEGIN
INSERT OR REPLACE INTO states
VALUES
(
new.id,
new.status,
new.time_stamp,
new.current_present,
new.voltage_present);
END;</code></pre>

This allows me to query the current status of a machine in real time. To do this, I again use a simple PHP function that provides the data via SSE (a sketch of the underlying query follows after the JavaScript snippet). In the frontend, a plain vanilla JavaScript handler retrieves the JSON data and updates the HTML in real time.

<pre><code>const source_realtime = new EventSource("https://myapi/sse_realtime_json");
source_realtime.onmessage = function(event) {
  const json = JSON.parse(event.data);
  // Write the latest state into the page; "machine-status" is a
  // hypothetical element id, json.status mirrors the states table
  document.getElementById("machine-status").textContent = json.status;
};
</code></pre>
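On the server side, the SSE endpoint essentially just reads the latest row for a machine from the states table on every tick. A minimal sketch of such a query, assuming the columns written by the trigger above and a placeholder :machine_id parameter of my own invention:

<pre><code>-- Sketch only: the columns come from the trigger above,
-- :machine_id is an assumed placeholder bound by the PHP endpoint
SELECT id, status, time_stamp, current_present, voltage_present
FROM states
WHERE id = :machine_id;</code></pre>

Because the trigger uses INSERT OR REPLACE, states presumably holds exactly one row per machine id, so this lookup stays cheap no matter how many messages have arrived.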
For a historical analysis - for example over 24 months - I create a CSV export from the SQLite database and convert the CSV files into Parquet format.

I use a simple Bash script that I execute regularly via a cron job.

Here is an excerpt (an example query against the resulting Parquet files follows after the script):

<pre><code># Loop through the arrays and export each table to a CSV, then convert it to a Parquet file and load it into the DuckDB database
for (( i=0; i<${arrayLength}; i++ ));
do
db=${databases[$i]}
table=${tables[$i]}

echo "Processing $db - $table"
# Export the SQLite table to a CSV file
sqlite3 -header -csv $db "SELECT * FROM $table;" > parquet/$table.csv
# Convert the CSV file to a Parquet file using DuckDB
$duckdb_executable $duckdb_database <<EOF
-- Set configurations
SET memory_limit='2GB';
SET threads TO 2;
SET enable_progress_bar=true;
COPY (SELECT * FROM read_csv_auto('parquet/$table.csv', header=True)) TO 'parquet/$table.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
-- Load the Parquet file into a DuckDB table
-- (OR REPLACE so that repeated cron runs do not fail on an existing table)
CREATE OR REPLACE TABLE $table AS SELECT * FROM read_parquet('parquet/$table.parquet');
EOF
done</code></pre>
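Once the Parquet files exist, the historical analysis itself is a plain DuckDB query over them. A minimal sketch, assuming a messages table/Parquet file as above and that time_stamp is a Unix epoch in seconds (the monthly aggregation is only an illustration):

<pre><code>-- Sketch only: 'messages' and time_stamp come from the tables above;
-- the epoch-seconds assumption and the aggregation are illustrative
SELECT date_trunc('month', to_timestamp(time_stamp)) AS month,
       count(*) AS events
FROM read_parquet('parquet/messages.parquet')
GROUP BY 1
ORDER BY 1;</code></pre>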
Now finally, my question: am I overlooking something? This little system currently works well for 15 million events per month. No outages, nothing like that.
I read so much about fancy data pipelines, reactive frontend dashboards, lambda functions ...

Somehow my system feels "too simple".
So I'm sharing it with you in the hope of getting feedback.