๐Ÿ•ท๏ธ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 198 (from laksa196)
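The shard number appears to be derived from the URL's root hash (1995484448304381198, shown at the bottom of this page): its value modulo 1000 is 198, matching the calculated shard. A minimal sketch of that mapping, assuming a 1000-shard layout; the modulus and the hash function itself are assumptions inferred from this one example, not confirmed by the tool:

```python
# Sketch: map a precomputed URL root hash to a shard index.
# NUM_SHARDS = 1000 is an assumption inferred from 198 being the
# last three digits of the root hash reported by the inspector.
NUM_SHARDS = 1000

def shard_for(root_hash: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the shard index for a URL's root hash."""
    return root_hash % num_shards

print(shard_for(1995484448304381198))  # 198, the shard shown above
```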

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

โ„น๏ธ Skipped - page is already crawled

📄 INDEXABLE · ✅ CRAWLED (2 days ago) · 🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | `download_http_code = 200` | HTTP 200 |
| Age cutoff | PASS | `download_stamp > now() - 6 MONTH` | 0.1 months ago |
| History drop | PASS | `isNull(history_drop_reason)` | No drop reason |
| Spam/ban | PASS | `fh_dont_index != 1 AND ml_spam_score = 0` | ml_spam_score=0 |
| Canonical | PASS | `meta_canonical IS NULL OR = '' OR = src_unparsed` | Not set |
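The five filters above are row-level conditions on the crawl record. As a hedged illustration, the same chain can be sketched in Python; the field names come from the filter table itself, but the record layout is hypothetical (the inspector does not show the real crawl-store schema), and the 6-MONTH age cutoff is simplified to a precomputed age in months:

```python
# Sketch of the page-info filter chain, using the condition fields from
# the table above. The dict layout is hypothetical.
def passes_filters(rec: dict) -> bool:
    checks = [
        rec["download_http_code"] == 200,                          # HTTP status
        rec["age_months"] <= 6,                                    # age cutoff (6 MONTH)
        rec["history_drop_reason"] is None,                        # history drop
        rec["fh_dont_index"] != 1 and rec["ml_spam_score"] == 0,   # spam/ban
        rec["meta_canonical"] in (None, "", rec["src_unparsed"]),  # canonical
    ]
    return all(checks)

# Values taken from the Page Details and filter tables on this page:
record = {
    "download_http_code": 200,
    "age_months": 0.1,
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "de,hellmar-becker!blog,/2025/01/12/clickhouse-data-cookbook-aggregating-latest-values/ s443",
}
print(passes_filters(record))  # True: all five filters PASS
```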

Page Details

| Property | Value |
|---|---|
| URL | https://blog.hellmar-becker.de/2025/01/12/clickhouse-data-cookbook-aggregating-latest-values/ |
| Last Crawled | 2026-04-10 03:05:49 (2 days ago) |
| First Indexed | 2025-01-13 11:13:43 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | ClickHouse Data Cookbook: Aggregating Latest Values |
| Meta Description | null |
| Meta Canonical | null |
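The "0.1 months ago" in the age filter is just the time since Last Crawled expressed in months and rounded. A quick sketch; the lookup time is assumed to be exactly two days after the crawl (per the "2 days ago" note), and the inspector's actual reference time and rounding rule are not shown:

```python
from datetime import datetime

# Age since last crawl, in months (~30.44 days per month on average).
last_crawled = datetime(2026, 4, 10, 3, 5, 49)  # Last Crawled, from the table
now = datetime(2026, 4, 12, 3, 5, 49)           # assumed lookup time ("2 days ago")

age_days = (now - last_crawled).total_seconds() / 86400
age_months = age_days / 30.44
print(round(age_months, 1))  # 0.1 - well under the 6-month cutoff
```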
Markdown
# ClickHouse Data Cookbook: Aggregating Latest Values

Jan 12, 2025 • [blog](https://blog.hellmar-becker.de/categories/blog/), [clickhouse](https://blog.hellmar-becker.de/categories/clickhouse/), [sql](https://blog.hellmar-becker.de/categories/sql/), [tutorial](https://blog.hellmar-becker.de/categories/tutorial/)

![Data Cookbook Kitchen](https://blog.hellmar-becker.de/assets/2024-01-01-elf.jpg)

Let’s pick up one of my [earlier blogs](https://blog.hellmar-becker.de/2022/08/10/timeseries-data-in-apache-druid-getting-the-latest-value-with-rolled-up-data/). In that blog, I took a very simple data set to show how Druid can aggregate data upon ingestion. Today, I will show how much easier this works with ClickHouse.

You can use any version of ClickHouse for this exercise. For instance, you can run ClickHouse [locally](https://clickhouse.com/docs/en/operations/utilities/clickhouse-local) on your laptop, or check out the [installation docs](https://clickhouse.com/docs/en/install) for other options.

In today’s tutorial, you will

- create a detail table for stock ticker data using schema inference
- create an aggregate table to store the latest price value per ticker symbol and time interval
- create a materialized view to automatically populate the aggregate whenever new rows are added to the detail table
- insert some data and query the aggregate table
- insert some more data and observe how the query result changes.

## Creating the sample data set

First, create a text file with these data:

```
timestamp,symbol,price
2022-08-10 10:01:00,AAAA,25.90
2022-08-10 10:11:00,AAAA,26.90
2022-08-10 11:55:00,AAAA,28.10
```

Save this file as `ticker1.csv`. Likewise, create a file `ticker2.csv` with these data:

```
timestamp,symbol,price
2022-08-10 10:50:00,AAAA,23.90
2022-08-10 11:20:00,AAAA,22.10
```

## Setting up the schema

Create a table using the first file as a template. We use the file just to infer the table schema; we don’t want to insert any data just yet. ClickHouse allows you to use `CREATE AS SELECT` with the `EMPTY` clause to do just that. This will be a [MergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#mergetree) table that stores the detail rows.

```
CREATE TABLE ticker_data
ENGINE = MergeTree
ORDER BY (symbol, timestamp)
EMPTY AS SELECT * FROM file('ticker1.csv')
SETTINGS schema_inference_make_columns_nullable = 0
```

Worth mentioning is the last line, which makes the inferred columns non-nullable. When inferring from a CSV file, ClickHouse by default makes all columns nullable. But columns that participate in the sort key cannot be nullable (this can be overridden, but that is probably not a good idea). Also, since there are no *NULL* values in the data, the best way to manage this is to force all columns to be non-*NULL*.

Now define the aggregate table. This will be an [AggregatingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree#aggregatingmergetree) table, where some of the aggregation magic happens inside the table engine:

```
CREATE TABLE ticker_data_agg (
    `timestampHour` DateTime,
    `symbol` String,
    `price` AggregateFunction(argMax, Float64, Timestamp)
)
ENGINE = AggregatingMergeTree
ORDER BY (symbol, timestampHour)
```

We will aggregate the data by hour - the timestamp will be truncated, as we will see in a moment. Data aggregation is defined in three steps:

1. In the aggregate table **definition**
   - the *dimensions* go into the order key
   - the *measures*, rather than plain values, are defined as `AggregateFunction` types, which store intermediate state information. The `AggregateFunction` constructor is fed information about the aggregate function to be used, and the data types it operates on. In our case, this is `argMax`, which returns the value of its first argument where the second argument has a maximum. The second argument will be the timestamp.
2. When **populating** the aggregate table
   - the *dimensions* go into the group by clause
   - *measures* use [\-State](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/combinators#-state) aggregators, in our case `argMaxState`.
3. **Query** the aggregated measures with [\-Merge](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/combinators#-merge) aggregator functions, here: `argMaxMerge`.

Now let’s define a materialized view that will aggregate the ticker data:

```
CREATE MATERIALIZED VIEW ticker_data_mv TO ticker_data_agg
AS SELECT
    toStartOfHour(timestamp) AS timestampHour,
    symbol,
    argMaxState(price, timestamp) AS price
FROM ticker_data
GROUP BY timestampHour, symbol
```

This will truncate the timestamp to the desired granularity (hour) and perform the aggregation.

## Populating the tables

Insert the first set of data:

```
INSERT INTO ticker_data SELECT * FROM file('ticker1.csv');
```

and query the aggregate at the configured time granularity:

```
:) SELECT timestampHour, symbol, argMaxMerge(price) AS latestPrice FROM ticker_data_agg GROUP BY timestampHour, symbol ORDER BY timestampHour, symbol;

Query id: b832dc41-a3db-4463-87a4-fe829bc120d0

   ┌───────timestampHour─┬─symbol─┬─latestPrice─┐
1. │ 2022-08-10 10:00:00 │ AAAA   │        26.9 │
2. │ 2022-08-10 11:00:00 │ AAAA   │        28.1 │
   └─────────────────────┴────────┴─────────────┘

2 rows in set. Elapsed: 0.012 sec.
```

and at a coarser granularity:

```
:) SELECT toStartOfDay(timestampHour) AS timestampDay, symbol, argMaxMerge(price) AS latestPrice FROM ticker_data_agg GROUP BY timestampDay, symbol ORDER BY timestampDay, symbol;

Query id: bb33cc74-73fe-4d35-8009-136ca73b6579

   ┌────────timestampDay─┬─symbol─┬─latestPrice─┐
1. │ 2022-08-10 00:00:00 │ AAAA   │        28.1 │
   └─────────────────────┴────────┴─────────────┘

1 row in set. Elapsed: 0.008 sec.
```

## Adding more data

Insert the second batch:

```
:) INSERT INTO ticker_data SELECT * FROM file('ticker2.csv');
```

If you query the data now with group by and merge, you get the updated rows:

```
:) SELECT timestampHour, symbol, argMaxMerge(price) AS latestPrice FROM ticker_data_agg GROUP BY timestampHour, symbol ORDER BY timestampHour, symbol;

Query id: 2638d26a-ecdb-484f-9de7-3e8ca26e3724

   ┌───────timestampHour─┬─symbol─┬─latestPrice─┐
1. │ 2022-08-10 10:00:00 │ AAAA   │        23.9 │
2. │ 2022-08-10 11:00:00 │ AAAA   │        28.1 │
   └─────────────────────┴────────┴─────────────┘

2 rows in set. Elapsed: 0.020 sec.
```

Note how the 11am entry has not changed, because the added data had an earlier timestamp than the data already present in the table. Thus, late arriving data is automatically handled correctly.

Also, there are now actually four rows in the aggregate table (at least until a background merge happens). A `SELECT *` statement shows this. It also shows that the aggregated measures are stored in a binary format that is output as garbled characters.

```
:) SELECT * FROM ticker_data_agg;

Query id: 27981039-a429-4399-a98c-fd2620b4b1b8

   ┌───────timestampHour─┬─symbol─┬─price────────┐
1. │ 2022-08-10 10:00:00 │ AAAA   │ fffff?7@8q?b │
2. │ 2022-08-10 11:00:00 │ AAAA   │ ?????6@@x?b  │
   └─────────────────────┴────────┴──────────────┘
   ┌───────timestampHour─┬─symbol─┬─price───────┐
3. │ 2022-08-10 10:00:00 │ AAAA   │ fffff?:@h?b │
4. │ 2022-08-10 11:00:00 │ AAAA   │ ?????<@t??b │
   └─────────────────────┴────────┴─────────────┘

4 rows in set. Elapsed: 0.010 sec.
```

You can (but probably shouldn’t) force a background merge like so:

```
:) OPTIMIZE TABLE ticker_data_agg;

Query id: 294b0d27-d826-4f72-899f-06b9ff80f159

Ok.

0 rows in set. Elapsed: 0.015 sec.

:) SELECT * FROM ticker_data_agg;

Query id: 2f5c65ed-b3ef-4fa1-88cd-76ebdd194072

   ┌───────timestampHour─┬─symbol─┬─price────────┐
1. │ 2022-08-10 10:00:00 │ AAAA   │ fffff?7@8q?b │
2. │ 2022-08-10 11:00:00 │ AAAA   │ ?????<@t??b  │
   └─────────────────────┴────────┴──────────────┘

2 rows in set. Elapsed: 0.006 sec.
```

Now there are only two rows left.

## Conclusion

- ClickHouse’s *AggregatingMergeTree* table engine allows aggregating data on the fly. Data is merged upon query, and compacted asynchronously in the background.
- Each standard aggregation function has modifiers that allow it to store intermediate aggregation state inside the aggregate table.
- Using materialized views, you can populate a detail table and an aggregate table simultaneously.

***

“[This image is taken from Page 377 of Praktisches Kochbuch für die gewöhnliche und feinere Küche](https://www.flickr.com/photos/mhlimages/48051263811/)” by [Medical Heritage Library, Inc.](https://www.flickr.com/photos/mhlimages/) is licensed under [CC BY-NC-SA 2.0](https://creativecommons.org/licenses/by-nc-sa/2.0/).
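The -State / -Merge split described in the post can be mimicked outside ClickHouse: a partial state for `argMax(price, timestamp)` is simply the (timestamp, price) pair with the greatest timestamp seen so far, and merging two states keeps the later one. A minimal Python sketch of that idea (this is an illustration of the concept, not ClickHouse's actual binary state format):

```python
# Mimic argMaxState / argMaxMerge for argMax(price, timestamp).
# A partial state is the (timestamp, price) pair with the max timestamp.
def arg_max_state(rows):
    """Build a partial state from (timestamp, price) rows, like argMaxState."""
    return max(rows, key=lambda r: r[0])

def arg_max_merge(*states):
    """Combine partial states into a final result, like argMaxMerge."""
    return max(states, key=lambda s: s[0])

# Rows for the 10:00 hour from ticker1.csv and ticker2.csv.
# ISO timestamps compare correctly as strings.
batch1 = [("2022-08-10 10:01:00", 25.90), ("2022-08-10 10:11:00", 26.90)]
batch2 = [("2022-08-10 10:50:00", 23.90)]

s1, s2 = arg_max_state(batch1), arg_max_state(batch2)
print(arg_max_merge(s1, s2)[1])  # 23.9, the same latest price the query returns after the second insert
```

Because merging only compares timestamps inside the states, late-arriving rows with earlier timestamps can never displace a newer value, which is exactly why the 11am entry stays at 28.1 in the tutorial.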
Shard: 198 (laksa)
Root Hash: 1995484448304381198
Unparsed URL: de,hellmar-becker!blog,/2025/01/12/clickhouse-data-cookbook-aggregating-latest-values/ s443
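The unparsed URL stores the page in a reversed-host key form: host labels TLD-first joined by commas, subdomains appended after `!`, then the path and an ` s443` suffix for HTTPS on port 443. A sketch of that transformation as inferred from this single example; the real format (public-suffix handling, other schemes, query strings) is an assumption:

```python
from urllib.parse import urlsplit

def to_unparsed(url: str) -> str:
    """Build the reversed-host 'unparsed URL' key, as inferred from one example.

    Assumes the registered domain is the last two host labels (no public
    suffix list) and that 's<port>' encodes the https port.
    """
    parts = urlsplit(url)
    labels = parts.hostname.split(".")[::-1]              # TLD first: ['de', 'hellmar-becker', 'blog']
    host_key = ",".join(labels[:2])                       # registered domain: 'de,hellmar-becker'
    host_key += "".join("!" + sub for sub in labels[2:])  # subdomains: '!blog'
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return f"{host_key},{parts.path} s{port}"

url = "https://blog.hellmar-becker.de/2025/01/12/clickhouse-data-cookbook-aggregating-latest-values/"
print(to_unparsed(url))
# de,hellmar-becker!blog,/2025/01/12/clickhouse-data-cookbook-aggregating-latest-values/ s443
```

Keys in this form sort by domain first, which keeps all pages of a site adjacent in the crawl store.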