🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 169 (from laksa090)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄
INDEXABLE
✅
CRAWLED
1 day ago
🤖
ROBOTS ALLOWED

Page Info Filters

FilterStatusConditionDetails
HTTP statusPASSdownload_http_code = 200HTTP 200
Age cutoffPASSdownload_stamp > now() - 6 MONTH0 months ago
History dropPASSisNull(history_drop_reason)No drop reason
Spam/banPASSfh_dont_index != 1 AND ml_spam_score = 0ml_spam_score=0
CanonicalPASSmeta_canonical IS NULL OR = '' OR = src_unparsedNot set

Page Details

PropertyValue
URLhttps://docs.peerdb.io/bestpractices/clickhouse_datamodeling
Last Crawled2026-04-09 18:55:52 (1 day ago)
First Indexed2024-09-10 06:19:33 (1 year ago)
HTTP Status Code200
Meta TitlePostgres to ClickHouse: Data Modeling Tips - PeerDB Docs: Setup your ETL in minutes with SQL.
Meta Descriptionnull
Meta Canonicalnull
Boilerpipe Text
PeerDB makes it fast and simple to replicate data from Postgres to ClickHouse . A common question from PeerDB users is how to model their data in ClickHouse after the replication process to maximize the benefits of ClickHouse. This question arises because ClickHouse and Postgres differ in data modeling, as each is a purpose-built database highly optimized for its specific workload -Postgres is a transactional (OLTP) database, while ClickHouse is an analytical (OLAP) columnar database. This guide walks you through essential data modeling concepts in ClickHouse for users coming from the Postgres world. ReplacingMergeTree table engine PeerDB maps PostgreSQL tables to ClickHouse using the ReplacingMergeTree engine. ClickHouse performs best with append-only workloads and does not recommend frequent UPDATEs. This is where ReplacingMergeTree is particularly powerful. ReplacingMergeTree supports workloads that involve both data ingestion and modifications. Each table is append-only, with user updates ingested as versioned INSERTs. The ReplacingMergeTree engine manages deduplication (merging) of rows in the background. This is one of the key factors that enables ClickHouse to deliver exceptional real-time ingestion performance. In PeerDB, both INSERTs and UPDATEs from Postgres are captured as new rows with different versions (using _peerdb_version ) in ClickHouse. The ReplacingMergeTree table engine periodically handles deduplication in the background using the Ordering Key (ORDER BY columns), retaining only the row with the latest _peerdb_version . DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the _peerdb_is_deleted column). The snippet below shows the target table definition for the public_goals table in ClickHouse. clickhouse - cloud :) SHOW CREATE TABLE public_goals; CREATE TABLE peerdb .public_goals ( `id` Int64, `owned_user_id` String, `goal_title` String, `goal_data` String, `enabled` Bool, `ts` DateTime64( 6 ), `_peerdb_synced_at` DateTime64( 9 ) DEFAULT now (), `_peerdb_is_deleted` Int8, `_peerdb_version` Int64 ) ENGINE = SharedReplacingMergeTree ( '/clickhouse/tables/{uuid}/{shard}' , '{replica}' , _peerdb_version) PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 You might still see duplicates for rows—how should you handle them? ReplacingMergeTree clears out duplicates asynchronously in the background but doesn’t guarantee the absence of duplicates. So, when you query the data, you might still see duplicates for the same row or primary key but with different versions. This is expected. To remove duplicates, you have a couple of approaches: Use FINAL in your queries ClickHouse has a unique modifier called FINAL , which performs de-duplication (merging of rows) at query time. This de-duplication occurs after filtering (WHERE clause) but before aggregations (GROUP BY). A historical concern has been that FINAL can slow down query performance. While it does impact query performance to some extent, recent releases of ClickHouse have introduced significant improvements to enhance FINAL query performance. So, don’t hesitate to use the FINAL clause and evaluate how your queries perform. Below is an example of how to use the FINAL clause: SELECT owner_user_id, COUNT ( * ) FROM goals FINAL WHERE enabled = true GROUP BY owner_user_id; Use argMax to deduplicate rows at query time In ClickHouse, argMax is a powerful function for deduplicating rows dynamically during query execution. This is particularly useful when you need to retain the most recent or relevant record based on a versioning or timestamp column. For instance, if you’re working with a table like peerdb.public_goals , where id is the primary key and _peerdb_version tracks versions, you can use argMax to select the row with the highest _peerdb_version for each id . This approach allows you to efficiently remove duplicates without altering the underlying data. You can then run your aggregations as a subquery over this deduplicated result set for further analysis. Below query is an example of using argMax SELECT owned_user_id, COUNT ( * ) AS active_goals_count, MAX (ts) AS latest_goal_time FROM ( SELECT id, argMax(owned_user_id, _peerdb_version) AS owned_user_id, argMax(goal_title, _peerdb_version) AS goal_title, argMax(goal_data, _peerdb_version) AS goal_data, argMax( enabled , _peerdb_version) AS enabled , argMax(ts, _peerdb_version) AS ts, argMax(_peerdb_synced_at, _peerdb_version) AS _peerdb_synced_at, argMax(_peerdb_is_deleted, _peerdb_version) AS _peerdb_is_deleted, max (_peerdb_version) AS _peerdb_version FROM peerdb . public_goals WHERE enabled = true GROUP BY id ) AS deduplicated_goals GROUP BY owned_user_id; Use WINDOW FUNCTIONS You can use ClickHouse’s window functions to achieve similar deduplication by selecting the row with the highest _peerdb_version within each id partition. Here’s an example: SELECT owned_user_id, COUNT ( * ) AS active_goals_count, MAX (ts) AS latest_goal_time FROM ( SELECT * , ROW_NUMBER () OVER ( PARTITION BY id ORDER BY _peerdb_version DESC ) AS rn FROM peerdb . public_goals WHERE enabled = true ) AS ranked_goals WHERE rn = 1 GROUP BY owned_user_id; Use Views to simplify deduplication Encapsulate deduplication in a view to make it simple for BI tools to query the most up-to-date data. For example, use a window function in the view to keep only the latest version of each row: CREATE VIEW goals AS SELECT * FROM ( SELECT * , ROW_NUMBER () OVER ( PARTITION BY id ORDER BY _peerdb_version DESC ) AS rn FROM peerdb . public_goals WHERE enabled = true ) WHERE rn = 1 ; SELECT owned_user_id, COUNT ( * ) AS active_goals_count, MAX (ts) AS latest_goal_time FROM goals GROUP BY owned_user_id; Nullable Columns If you’re coming from the Postgres world, one surprising aspect of ClickHouse is that it doesn’t store NULL values for columns unless you explicitly wrap the column types in Nullable . For example, instead of storing NULL for dates, ClickHouse stores 1970-01-01 as the default value, which might be unexpected. This behavior is due to the fact that storing NULLs can impact query performance in ClickHouse, as it’s a columnar database. Hence, ClickHouse requires users to explicitly define Nullable types. In PeerDB, we’ve introduced a setting called PEERDB_NULLABLE , which, when set to true , automatically detects nullable columns in Postgres and marks them as Nullable in ClickHouse during the replication process. This means you don’t need to manually define Nullable types during replication. You can read more about this feature in the following PR . Data Types ClickHouse offers a wide variety of data types, ranging from numbers, text, timestamps, dates, and arrays to the recently introduced JSON type. Many of the data types in Postgres can be natively stored in ClickHouse without much modification. As a reference, here goes the data type matrix we use at PeerDB when replicating data from Postgres to ClickHouse. The Ordering Key What is an Ordering Key? Choosing the right ordering key is crucial for query performance in ClickHouse. Defined by the ORDER BY clause when creating a table, the ordering key functions similarly to a index in Postgres but is optimized for analytics. Unlike Postgres, which uses a B-tree index with entries pointing to each row, ClickHouse uses Sparse Indexing: Data is sorted based on Ordering Key: The ordering key ensures that data on disk is sorted according to the specified columns. This allows for better compression , as correlated values are stored together. Ordering Key also creates a sparse index: The ordering key also creates a sparse index, storing only ranges of columns, with each entry pointing to a group of sorted rows. This keeps the index small, allowing ClickHouse to quickly identify relevant groups of rows using a binary search and execute queries efficiently. You can read more about this here . You can think of ordering keys as similar to BRIN indexes in Postgres, but in ClickHouse, the data is automatically sorted based on the ordering key via asynchronous merging of parts, so you don’t need to handle sorting during data ingestion. Choosing an appropriate Ordering Key When selecting an ordering key, base your choice on the columns most frequently used in your query filters. Prioritize columns that are commonly used in WHERE clauses, and order them in ascending sequence of cardinality —starting with columns that have the fewest distinct values. This approach optimizes data compression and query performance. For a deeper understanding of this topic, refer to the detailed guide here . PRIMARY KEY vs Ordering Key If you observe the table definition of public_goals , it has a PRIMARY KEY . You might be wondering how the PRIMARY KEY differs from the Ordering Key. Let us understand how they differ: PRIMARY KEY , if specified, defines the columns in the sparse index, while the columns in the ORDER BY clause determine how the data is sorted on disk. They are also used for deduplicating data by the ReplacingMergeTree . If the PRIMARY KEY isn’t specified, the Ordering Key automatically becomes the PRIMARY KEY and defines the columns in the sparse index. NOTE: Columns in the PRIMARY KEY should always be prefixed in the Ordering Key. This ensures that the index aligns with the physical data order, maximizing query performance by minimizing unnecessary data scans. An example where PRIMARY KEY could differ from Ordering Key An example where you might have different PRIMARY KEY and ORDER BY columns is when your queries are primarily filtered on customer_id rather than id . In this case, you can define the PRIMARY KEY on just customer_id and the ORDER BY on customer_id, id . This approach ensures a smaller, more efficient sparse index for querying, while data deduplication occurs on id , ensuring no data is lost. NOTE: Unlike in Postgres, where the PRIMARY KEY is a B-tree index that guarantees uniqueness, in ClickHouse, it does not ensure uniqueness. Instead, it defines the columns that should be part of the sparse index. Modifying the Ordering Key Choosing the right ordering key is crucial for query performance in ClickHouse, as it acts as an index when querying data. By default, PeerDB uses the PostgreSQL PRIMARY KEY to define the ordering key in ClickHouse tables, but you can change it using the following methods: Use materialized views You can use materialized views to create a new table with a different ordering key suitable for your workload. Include the primary key columns at the end of the ordering key to ensure proper deduplication, as ReplacingMergeTree uses the ORDER BY clause for deduplication, and including the primary key ensures that no data is lost. CREATE MATERIALIZED VIEW goals_mv ENGINE = ReplacingMergeTree(_peerdb_version) ORDER BY ( enabled , ts, id)  POPULATE AS SELECT * FROM peerdb . public_goals ; NOTE: After creating the materialized view, be sure to follow the steps described in the previous section on handling duplicates to ensure proper deduplication during query time. Predefine target tables with the desired Ordering Key To change the ordering key, you can predefine new tables with your desired Ordering Key and then swap them with the existing tables. Here’s how you can do it: 1. Create a Dummy Mirror: Create a dummy mirror in PeerDB to generate the default tables with the correct metadata columns and data types. 2. Create a New Table with the Desired Ordering Key: Use the table created by PeerDB to define a new table with your desired ordering key. Include the primary key columns at the end of the ordering key to ensure proper deduplication. Here is an example: CREATE TABLE public_events_new AS public_events ENGINE = ReplacingMergeTree(_peerdb_version) ORDER BY (user_id,id); 3. Drop the Old Table: DROP TABLE public_events; 4. Rename the New Table: Rename the new table to the actual table RENAME TABLE public_events_new TO public_events; 5. Start MIRROR to Point to the New Table: Configure the mirror to point to the actual table. PeerDB uses CREATE TABLE IF NOT EXISTS behind the scenes and continues to ingest data into the new table. Handling DELETEs As mentioned, DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the _peerdb_is_deleted column). To exclude deleted rows from your queries, you can create row-level policies in ClickHouse based on the _peerdb_is_deleted column. Here’s an example: CREATE ROW POLICY policy_name ON table_name FOR SELECT USING _peerdb_is_deleted = 0 ; This policy ensures that only rows where _peerdb_is_deleted is 0 are visible when querying the table. How to handle updates for primary keys? To propagate primary key updates into ClickHouse, enable PEERDB_CLICKHOUSE_ENABLE_PRIMARY_UPDATE setting on your mirror. This instructs the PeerDB normalize step to generate a “delete + insert” pair for changed primary keys, using a sign column (in the ReplacingMergeTree ) and a version column to remove the old record before inserting the new one, ensuring that your ClickHouse tables accurately reflect the latest state from the source system. It is highly uncommon for primary keys to be updated in Postgres. Conclusion I hope you enjoyed reading the guide. We aimed to cover the most common data-modeling challenges you might encounter when migrating from PostgreSQL to ClickHouse. In the next blog, I plan to dive into more advanced topics, such as joins, writing efficient SQL queries, and so on. If you want to give PeerDB and ClickHouse a try to start replicating data from Postgres to ClickHouse, please check out the links below or reach out to us directly! Try PeerDB Cloud for Free Try ClickHouse Cloud for Free Talk to the PeerDB team directly
Markdown
[Skip to main content](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#content-area) [PeerDB Docs: Setup your ETL in minutes with SQL. home page![light logo](https://mintcdn.com/peerdb/wtgPI5cQwjI2FO0m/logo/docs-logo-light.png?fit=max&auto=format&n=wtgPI5cQwjI2FO0m&q=85&s=90ac5de385106617a465b7ba4d33a28e)![dark logo](https://mintcdn.com/peerdb/wtgPI5cQwjI2FO0m/logo/docs-logo-dark.png?fit=max&auto=format&n=wtgPI5cQwjI2FO0m&q=85&s=feffd926023e0bcf9c5d4282d426fe12)](https://docs.peerdb.io/) Search... ⌘K Ask AI - [Get Started](https://app.peerdb.cloud/) - [Get Started](https://app.peerdb.cloud/) Search... Navigation Best practices Postgres to ClickHouse: Data Modeling Tips - [Documentation](https://docs.peerdb.io/introduction) - [SQL Reference](https://docs.peerdb.io/sql/reference) - [API Reference](https://docs.peerdb.io/peerdb-api/reference) - [Lua Reference](https://docs.peerdb.io/lua/reference) - [GitHub](https://github.com/PeerDB-io/peerdb) - [Changelog](https://github.com/PeerDB-io/peerdb/releases) - [Community](https://slack.peerdb.io/) ##### Home - [PeerDB](https://docs.peerdb.io/introduction) - [Architecture Overview](https://docs.peerdb.io/architecture) - [Why choose PeerDB over other tools?](https://docs.peerdb.io/why-peerdb) ##### Quickstart - [Quickstart Guide](https://docs.peerdb.io/quickstart/quickstart) - [PeerDB Streams for Kafka Quickstart Guide](https://docs.peerdb.io/quickstart/streams-quickstart) - [SQL Quickstart Guide](https://docs.peerdb.io/quickstart/sql-quickstart) ##### PeerDB Cloud - [PeerDB Cloud Quickstart Guide](https://docs.peerdb.io/peerdb-cloud/cloud-quickstart) - [PeerDB Cloud Security](https://docs.peerdb.io/peerdb-cloud/cloud-security) - [PeerDB Cloud Pricing FAQ](https://docs.peerdb.io/peerdb-cloud/peerdb-cloud-pricing-faq) - [Trust Center](https://trust.peerdb.io/) - [Public IPs For PeerDB Cloud](https://docs.peerdb.io/peerdb-cloud/ip-table) - [AWS Private Link](https://docs.peerdb.io/peerdb-cloud/aws-private-link) ##### Setup - PostgreSQL Setup Guide - Clickhouse Setup Guide - [Snowflake Setup Guide](https://docs.peerdb.io/connect/snowflake) - [BigQuery Setup Guide](https://docs.peerdb.io/connect/bigquery) - [GCS Setup Guide](https://docs.peerdb.io/connect/gcs) - [S3 Setup Guide](https://docs.peerdb.io/connect/s3) - [Elasticsearch Setup Guide](https://docs.peerdb.io/connect/elasticsearch) ##### Queues Setup - [Kafka Setup Guide](https://docs.peerdb.io/connect/kafka) - [Confluent Cloud Setup Guide](https://docs.peerdb.io/connect/confluent-cloud) - [PubSub Setup Guide](https://docs.peerdb.io/connect/pubsub) ##### Mirror - [CDC Setup from Postgres to ClickHouse](https://docs.peerdb.io/mirror/cdc-pg-clickhouse) - [CDC Setup from Supabase to ClickHouse](https://docs.peerdb.io/mirror/cdc-supabase-clickhouse) - [CDC Setup from Neon Postgres to ClickHouse](https://docs.peerdb.io/mirror/cdc-neon-clickhouse) - [CDC Setup from Postgres to Bigquery](https://docs.peerdb.io/mirror/cdc-pg-bq) - [CDC Setup from Postgres to Snowflake](https://docs.peerdb.io/mirror/cdc-pg-sf) - [CDC Setup from Postgres to Postgres](https://docs.peerdb.io/mirror/cdc-pg-pg) - [CDC Setup from Postgres to Kafka](https://docs.peerdb.io/mirror/cdc-pg-kafka) - [CDC Setup from Postgres to Elasticsearch](https://docs.peerdb.io/mirror/cdc-pg-elasticsearch) - [Upgrade](https://docs.peerdb.io/mirror/upgrade) ##### 5-minute Tutorials - [Streaming Query Replication of PostgreSQL To Snowflake](https://docs.peerdb.io/tutorials/realtime-streaming-of-query-results) ##### Features - [Pausing a CDC Mirror](https://docs.peerdb.io/features/pause-mirror) - [Editing a CDC Mirror](https://docs.peerdb.io/features/edit-mirror) - [Resyncing a CDC Mirror](https://docs.peerdb.io/features/resync-mirror) - [Schema Changes Propagation Support](https://docs.peerdb.io/features/schema-changes) - [Replicating partitioned tables](https://docs.peerdb.io/features/replicating-partitioned-tables) - [Feature matrix](https://docs.peerdb.io/features/feature-matrix) - [Supported connectors](https://docs.peerdb.io/features/supported-connectors) - Alerting ##### Data types - [Datatype matrix](https://docs.peerdb.io/datatypes/datatype-matrix) ##### Best practices - [Postgres to ClickHouse: Data Modeling Tips](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling) - [Heartbeat Table For CDC](https://docs.peerdb.io/bestpractices/heartbeat) - [Upgrading the PostgreSQL Peer Of An Ongoing CDC Mirror](https://docs.peerdb.io/bestpractices/postgres-upgrade) - [Generated Columns: Gotchas and Best Practices](https://docs.peerdb.io/bestpractices/generated_columns) - [Postgres to ClickHouse: Handling TOAST Columns](https://docs.peerdb.io/bestpractices/clickhouse_toast_columns) ##### Use-cases - Real-time Change Data Capture from Postgres - Streaming Query Replication - [Query Federation via PostgreSQL](https://docs.peerdb.io/usecases/postgres-compatible-layer-for-bigquery-and-snowflake) ##### Benchmarks - [Benchmarking Postgres Replication: PeerDB vs Airbyte](https://docs.peerdb.io/benchmarks/benchmarking-Postgres-Replication:-PeerDB-vs-Airbyte) ##### Contributing - [Running Flow tests](https://docs.peerdb.io/dev/running-tests) ##### Metrics and Tuning - [Config Tuning for Change Data Capture (CDC)](https://docs.peerdb.io/metrics/important_cdc_configs) - [Native Metrics](https://docs.peerdb.io/metrics/native-metrics) On this page - [ReplacingMergeTree table engine](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#replacingmergetree-table-engine) - [You might still see duplicates for rows—how should you handle them?](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#you-might-still-see-duplicates-for-rows%E2%80%94how-should-you-handle-them) - [Use FINAL in your queries](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-final-in-your-queries) - [Use argMax to deduplicate rows at query time](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-argmax-to-deduplicate-rows-at-query-time) - [Use WINDOW FUNCTIONS](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-window-functions) - [Use Views to simplify deduplication](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-views-to-simplify-deduplication) - [Nullable Columns](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#nullable-columns) - [Data Types](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#data-types) - [The Ordering Key](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#the-ordering-key) - [What is an Ordering Key?](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#what-is-an-ordering-key) - [Choosing an appropriate Ordering Key](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#choosing-an-appropriate-ordering-key) - [PRIMARY KEY vs Ordering Key](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#primary-key-vs-ordering-key) - [Modifying the Ordering Key](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#modifying-the-ordering-key) - [Use materialized views](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-materialized-views) - [Predefine target tables with the desired Ordering Key](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#predefine-target-tables-with-the-desired-ordering-key) - [Handling DELETEs](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#handling-deletes) - [How to handle updates for primary keys?](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#how-to-handle-updates-for-primary-keys) - [Conclusion](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#conclusion) Best practices # Postgres to ClickHouse: Data Modeling Tips [PeerDB](https://www.peerdb.io/) makes it fast and simple to replicate data from [Postgres](https://www.postgresql.org/) to [ClickHouse](https://clickhouse.com/). A common question from PeerDB users is how to model their data in ClickHouse after the replication process to maximize the benefits of ClickHouse. This question arises because ClickHouse and Postgres differ in data modeling, as each is a **purpose-built database** highly optimized for its specific workload -Postgres is a transactional (OLTP) database, while ClickHouse is an analytical (OLAP) columnar database. This guide walks you through essential data modeling concepts in ClickHouse for users coming from the Postgres world. ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#replacingmergetree-table-engine)ReplacingMergeTree table engine PeerDB maps PostgreSQL tables to ClickHouse using the [ReplacingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree) engine. ClickHouse performs best with append-only workloads and [does not recommend](https://clickhouse.com/docs/en/guides/developer/mutations) frequent UPDATEs. This is where ReplacingMergeTree is particularly powerful. `ReplacingMergeTree` supports workloads that involve both data ingestion and modifications. Each table is append-only, with user updates ingested as versioned INSERTs. The ReplacingMergeTree engine manages deduplication (merging) of rows in the background. This is one of the key factors that enables ClickHouse to deliver exceptional real-time ingestion performance. In PeerDB, both INSERTs and UPDATEs from Postgres are captured as new rows with different versions (using `_peerdb_version`) in ClickHouse. The `ReplacingMergeTree` table engine periodically handles deduplication in the background using the Ordering Key (ORDER BY columns), retaining only the row with the latest `_peerdb_version`. DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the `_peerdb_is_deleted` column). The snippet below shows the target table definition for the `public_goals` table in ClickHouse. ``` clickhouse-cloud :) SHOW CREATE TABLE public_goals; CREATE TABLE peerdb.public_goals ( `id` Int64, `owned_user_id` String, `goal_title` String, `goal_data` String, `enabled` Bool, `ts` DateTime64(6), `_peerdb_synced_at` DateTime64(9) DEFAULT now(), `_peerdb_is_deleted` Int8, `_peerdb_version` Int64 ) ENGINE = SharedReplacingMergeTree ('/clickhouse/tables/{uuid}/{shard}', '{replica}', _peerdb_version) PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 ``` ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#you-might-still-see-duplicates-for-rows%E2%80%94how-should-you-handle-them)You might still see duplicates for rows—how should you handle them? ReplacingMergeTree clears out duplicates asynchronously in the background but doesn’t guarantee the absence of duplicates. So, when you query the data, you might still see duplicates for the same row or primary key but with different versions. This is expected. To remove duplicates, you have a couple of approaches: ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-final-in-your-queries)Use FINAL in your queries ClickHouse has a unique modifier called [FINAL](https://clickhouse.com/docs/en/sql-reference/statements/select/from#final-modifier), which performs de-duplication (merging of rows) at query time. This de-duplication occurs after filtering (WHERE clause) but before aggregations (GROUP BY). A historical concern has been that FINAL can slow down query performance. While it does impact query performance to some extent, recent releases of ClickHouse have introduced [significant improvements](https://github.com/ClickHouse/ClickHouse/issues/11722) to enhance FINAL query performance. So, don’t hesitate to use the FINAL clause and evaluate how your queries perform. Below is an example of how to use the FINAL clause: ``` SELECT owner_user_id, COUNT(*) FROM goals FINAL WHERE enabled = true GROUP BY owner_user_id; ``` ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-argmax-to-deduplicate-rows-at-query-time)Use argMax to deduplicate rows at query time In ClickHouse, [argMax](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/argmax) is a powerful function for deduplicating rows dynamically during query execution. This is particularly useful when you need to retain the most recent or relevant record based on a versioning or timestamp column. For instance, if you’re working with a table like `peerdb.public_goals`, where id is the primary key and `_peerdb_version` tracks versions, you can use argMax to select the row with the highest `_peerdb_version` for each `id`. This approach allows you to efficiently remove duplicates without altering the underlying data. You can then run your aggregations as a subquery over this deduplicated result set for further analysis. Below query is an example of using argMax ``` SELECT owned_user_id, COUNT(*) AS active_goals_count, MAX(ts) AS latest_goal_time FROM ( SELECT id, argMax(owned_user_id, _peerdb_version) AS owned_user_id, argMax(goal_title, _peerdb_version) AS goal_title, argMax(goal_data, _peerdb_version) AS goal_data, argMax(enabled, _peerdb_version) AS enabled, argMax(ts, _peerdb_version) AS ts, argMax(_peerdb_synced_at, _peerdb_version) AS _peerdb_synced_at, argMax(_peerdb_is_deleted, _peerdb_version) AS _peerdb_is_deleted, max(_peerdb_version) AS _peerdb_version FROM peerdb.public_goals WHERE enabled = true GROUP BY id ) AS deduplicated_goals GROUP BY owned_user_id; ``` ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-window-functions)Use WINDOW FUNCTIONS You can use ClickHouse’s [window functions](https://clickhouse.com/docs/en/sql-reference/window-functions) to achieve similar deduplication by selecting the row with the highest `_peerdb_version` within each id partition. Here’s an example: ``` SELECT owned_user_id, COUNT(*) AS active_goals_count, MAX(ts) AS latest_goal_time FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _peerdb_version DESC) AS rn FROM peerdb.public_goals WHERE enabled = true ) AS ranked_goals WHERE rn = 1 GROUP BY owned_user_id; ``` ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-views-to-simplify-deduplication)Use Views to simplify deduplication Encapsulate deduplication in a [view](https://clickhouse.com/docs/en/sql-reference/statements/create/view) to make it simple for BI tools to query the most up-to-date data. For example, use a window function in the view to keep only the latest version of each row: ``` CREATE VIEW goals AS SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _peerdb_version DESC) AS rn FROM peerdb.public_goals WHERE enabled = true ) WHERE rn = 1; ``` ``` SELECT owned_user_id, COUNT(*) AS active_goals_count, MAX(ts) AS latest_goal_time FROM goals GROUP BY owned_user_id; ``` ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#nullable-columns)Nullable Columns If you’re coming from the Postgres world, one surprising aspect of ClickHouse is that it doesn’t store NULL values for columns unless you explicitly wrap the column types in [`Nullable`](https://clickhouse.com/docs/en/sql-reference/data-types/nullable). For example, instead of storing NULL for dates, ClickHouse stores `1970-01-01` as the default value, which might be unexpected. This behavior is due to the fact that storing NULLs can [impact](https://clickhouse.com/docs/en/sql-reference/data-types/nullable) query performance in ClickHouse, as it’s a columnar database. Hence, ClickHouse requires users to explicitly define `Nullable` types. In PeerDB, we’ve introduced a setting called `PEERDB_NULLABLE`, which, when set to `true`, automatically detects nullable columns in Postgres and marks them as `Nullable` in ClickHouse during the replication process. This means you don’t need to manually define `Nullable` types during replication. You can read more about this feature in the following [PR](https://github.com/PeerDB-io/peerdb/pull/2001). ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#data-types)**Data Types** ClickHouse offers a wide variety of data types, ranging from numbers, text, timestamps, dates, and arrays to the recently introduced [JSON](https://github.com/ClickHouse/ClickHouse/issues/54864) type. Many of the data types in Postgres can be natively stored in ClickHouse without much modification. As a reference, here goes [the data type matrix](https://docs.peerdb.io/datatypes/datatype-matrix) we use at PeerDB when replicating data from Postgres to ClickHouse. ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#the-ordering-key)The Ordering Key ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#what-is-an-ordering-key)What is an Ordering Key? Choosing the right ordering key is crucial for query performance in ClickHouse. Defined by the `ORDER BY` clause when creating a table, the ordering key functions similarly to a index in Postgres but is optimized for analytics. Unlike Postgres, which uses a B-tree index with entries pointing to each row, ClickHouse uses Sparse Indexing: 1. **Data is sorted based on Ordering Key:** The ordering key ensures that data on disk is sorted according to the specified columns. This allows for better [compression](https://clickhouse.com/docs/en/data-compression/compression-in-clickhouse), as correlated values are stored together. 2. **Ordering Key also creates a sparse index:** The ordering key also creates a sparse index, storing only ranges of columns, with each entry pointing to a group of sorted rows. This keeps the index small, allowing ClickHouse to quickly identify relevant groups of rows using a binary search and execute queries efficiently. You can read more about this [here](https://clickhouse.com/docs/en/migrations/postgresql/designing-schemas#primary-ordering-keys-in-clickhouse). You can think of ordering keys as similar to [BRIN](https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BRIN) indexes in Postgres, but in ClickHouse, the data is automatically sorted based on the ordering key via asynchronous merging of parts, so you don’t need to handle sorting during data ingestion. ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#choosing-an-appropriate-ordering-key)Choosing an appropriate Ordering Key When selecting an ordering key, base your choice on the columns most frequently used in your query filters. **Prioritize columns that are commonly used in WHERE clauses, and order them in ascending sequence of cardinality**—starting with columns that have the fewest distinct values. This approach optimizes data compression and query performance. For a deeper understanding of this topic, refer to the detailed guide [here](https://clickhouse.com/docs/en/data-modeling/schema-design#choosing-an-ordering-key). ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#primary-key-vs-ordering-key)**PRIMARY KEY vs Ordering Key** If you observe the table definition of `public_goals`, it has a `PRIMARY KEY`. You might be wondering how the `PRIMARY KEY` differs from the Ordering Key. Let us understand how they differ: 1. `PRIMARY KEY`, if specified, defines the columns in the sparse index, while the columns in the `ORDER BY` clause determine how the data is sorted on disk. They are also used for deduplicating data by the `ReplacingMergeTree`. 2. If the `PRIMARY KEY` isn’t specified, the Ordering Key automatically becomes the `PRIMARY KEY` and defines the columns in the sparse index. **NOTE:** Columns in the `PRIMARY KEY` should always be prefixed in the Ordering Key. This ensures that the index aligns with the physical data order, maximizing query performance by minimizing unnecessary data scans. **An example where** `PRIMARY KEY` **could differ from Ordering Key** An example where you might have different `PRIMARY KEY` and `ORDER BY` columns is when your queries are primarily filtered on `customer_id` rather than `id`. In this case, you can define the `PRIMARY KEY` on just `customer_id` and the `ORDER BY` on `customer_id, id`. This approach ensures a smaller, more efficient sparse index for querying, while data deduplication occurs on `id`, ensuring no data is lost. **NOTE:** Unlike in Postgres, where the `PRIMARY KEY` is a B-tree index that guarantees uniqueness, in ClickHouse, it does not ensure uniqueness. Instead, it defines the columns that should be part of the sparse index. ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#modifying-the-ordering-key)Modifying the Ordering Key Choosing the right [ordering key](https://clickhouse.com/docs/en/migrations/postgresql/designing-schemas#primary-ordering-keys-in-clickhouse) is crucial for query performance in ClickHouse, as it acts as an index when querying data. By default, PeerDB uses the PostgreSQL `PRIMARY KEY` to define the ordering key in ClickHouse tables, but you can change it using the following methods: ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#use-materialized-views)Use materialized views You can use materialized views to create a new table with a different ordering key suitable for your workload. Include the primary key columns at the end of the ordering key to ensure proper deduplication, as ReplacingMergeTree uses the ORDER BY clause for deduplication, and including the primary key ensures that no data is lost. ``` CREATE MATERIALIZED VIEW goals_mv ENGINE = ReplacingMergeTree(_peerdb_version) ORDER BY (enabled, ts, id)  POPULATE AS SELECT * FROM peerdb.public_goals; ``` **NOTE:** After creating the materialized view, be sure to follow the steps described in the previous section on handling duplicates to ensure proper deduplication during query time. ### [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#predefine-target-tables-with-the-desired-ordering-key)Predefine target tables with the desired Ordering Key To change the ordering key, you can predefine new tables with your desired Ordering Key and then swap them with the existing tables. Here’s how you can do it: **1\. Create a Dummy Mirror:** Create a dummy mirror in PeerDB to generate the default tables with the correct metadata columns and data types. **2\. Create a New Table with the Desired Ordering Key:** Use the table created by PeerDB to define a new table with your desired ordering key. Include the primary key columns at the end of the ordering key to ensure proper deduplication. Here is an example: ``` CREATE TABLE public_events_new AS public_events ENGINE = ReplacingMergeTree(_peerdb_version) ORDER BY (user_id,id); ``` **3\. Drop the Old Table:** ``` DROP TABLE public_events; ``` **4\. Rename the New Table:** Rename the new table to the actual table ``` RENAME TABLE public_events_new TO public_events; ``` **5\. Start MIRROR to Point to the New Table:** Configure the mirror to point to the actual table. PeerDB uses `CREATE TABLE IF NOT EXISTS` behind the scenes and continues to ingest data into the new table. ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#handling-deletes)Handling DELETEs As mentioned, DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the `_peerdb_is_deleted` column). To exclude deleted rows from your queries, you can create row-level policies in ClickHouse based on the `_peerdb_is_deleted` column. Here’s an example: ``` CREATE ROW POLICY policy_name ON table_name FOR SELECT USING _peerdb_is_deleted = 0; ``` This policy ensures that only rows where `_peerdb_is_deleted` is 0 are visible when querying the table. ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#how-to-handle-updates-for-primary-keys)How to handle updates for primary keys? To propagate primary key updates into ClickHouse, enable `PEERDB_CLICKHOUSE_ENABLE_PRIMARY_UPDATE` setting on your mirror. This instructs the PeerDB normalize step to generate a “delete + insert” pair for changed primary keys, using a sign column (in the `ReplacingMergeTree`) and a version column to remove the old record before inserting the new one, ensuring that your ClickHouse tables accurately reflect the latest state from the source system. It is highly uncommon for primary keys to be updated in Postgres. ## [​](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling#conclusion)Conclusion I hope you enjoyed reading the guide. We aimed to cover the most common data-modeling challenges you might encounter when migrating from PostgreSQL to ClickHouse. In the next blog, I plan to dive into more advanced topics, such as joins, writing efficient SQL queries, and so on. If you want to give PeerDB and ClickHouse a try to start replicating data from Postgres to ClickHouse, please check out the links below or reach out to us directly\! 1. [Try PeerDB Cloud for Free](https://auth.peerdb.cloud/signup) 2. [Try ClickHouse Cloud for Free](https://clickhouse.com/docs/en/cloud-quick-start) 3. [Talk to the PeerDB team directly](https://www.peerdb.io/sign-up) [Datatype matrix](https://docs.peerdb.io/datatypes/datatype-matrix) [Heartbeat Table For CDC](https://docs.peerdb.io/bestpractices/heartbeat) ⌘I [github](https://github.com/PeerDB-io) [Powered byThis documentation is built and hosted on Mintlify, a developer documentation platform](https://www.mintlify.com/?utm_campaign=poweredBy&utm_medium=referral&utm_source=peerdb) Assistant Responses are generated using AI and may contain mistakes.
Readable Markdown
[PeerDB](https://www.peerdb.io/) makes it fast and simple to replicate data from [Postgres](https://www.postgresql.org/) to [ClickHouse](https://clickhouse.com/). A common question from PeerDB users is how to model their data in ClickHouse after the replication process to maximize the benefits of ClickHouse. This question arises because ClickHouse and Postgres differ in data modeling, as each is a **purpose-built database** highly optimized for its specific workload -Postgres is a transactional (OLTP) database, while ClickHouse is an analytical (OLAP) columnar database. This guide walks you through essential data modeling concepts in ClickHouse for users coming from the Postgres world. ## ReplacingMergeTree table engine PeerDB maps PostgreSQL tables to ClickHouse using the [ReplacingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree) engine. ClickHouse performs best with append-only workloads and [does not recommend](https://clickhouse.com/docs/en/guides/developer/mutations) frequent UPDATEs. This is where ReplacingMergeTree is particularly powerful. `ReplacingMergeTree` supports workloads that involve both data ingestion and modifications. Each table is append-only, with user updates ingested as versioned INSERTs. The ReplacingMergeTree engine manages deduplication (merging) of rows in the background. This is one of the key factors that enables ClickHouse to deliver exceptional real-time ingestion performance. In PeerDB, both INSERTs and UPDATEs from Postgres are captured as new rows with different versions (using `_peerdb_version`) in ClickHouse. The `ReplacingMergeTree` table engine periodically handles deduplication in the background using the Ordering Key (ORDER BY columns), retaining only the row with the latest `_peerdb_version`. DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the `_peerdb_is_deleted` column). The snippet below shows the target table definition for the `public_goals` table in ClickHouse. ``` clickhouse-cloud :) SHOW CREATE TABLE public_goals; CREATE TABLE peerdb.public_goals ( `id` Int64, `owned_user_id` String, `goal_title` String, `goal_data` String, `enabled` Bool, `ts` DateTime64(6), `_peerdb_synced_at` DateTime64(9) DEFAULT now(), `_peerdb_is_deleted` Int8, `_peerdb_version` Int64 ) ENGINE = SharedReplacingMergeTree ('/clickhouse/tables/{uuid}/{shard}', '{replica}', _peerdb_version) PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 ``` ## You might still see duplicates for rows—how should you handle them? ReplacingMergeTree clears out duplicates asynchronously in the background but doesn’t guarantee the absence of duplicates. So, when you query the data, you might still see duplicates for the same row or primary key but with different versions. This is expected. To remove duplicates, you have a couple of approaches: ### Use FINAL in your queries ClickHouse has a unique modifier called [FINAL](https://clickhouse.com/docs/en/sql-reference/statements/select/from#final-modifier), which performs de-duplication (merging of rows) at query time. This de-duplication occurs after filtering (WHERE clause) but before aggregations (GROUP BY). A historical concern has been that FINAL can slow down query performance. While it does impact query performance to some extent, recent releases of ClickHouse have introduced [significant improvements](https://github.com/ClickHouse/ClickHouse/issues/11722) to enhance FINAL query performance. So, don’t hesitate to use the FINAL clause and evaluate how your queries perform. Below is an example of how to use the FINAL clause: ``` SELECT owner_user_id, COUNT(*) FROM goals FINAL WHERE enabled = true GROUP BY owner_user_id; ``` ### Use argMax to deduplicate rows at query time In ClickHouse, [argMax](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/argmax) is a powerful function for deduplicating rows dynamically during query execution. This is particularly useful when you need to retain the most recent or relevant record based on a versioning or timestamp column. For instance, if you’re working with a table like `peerdb.public_goals`, where id is the primary key and `_peerdb_version` tracks versions, you can use argMax to select the row with the highest `_peerdb_version` for each `id`. This approach allows you to efficiently remove duplicates without altering the underlying data. You can then run your aggregations as a subquery over this deduplicated result set for further analysis. Below query is an example of using argMax ``` SELECT owned_user_id, COUNT(*) AS active_goals_count, MAX(ts) AS latest_goal_time FROM ( SELECT id, argMax(owned_user_id, _peerdb_version) AS owned_user_id, argMax(goal_title, _peerdb_version) AS goal_title, argMax(goal_data, _peerdb_version) AS goal_data, argMax(enabled, _peerdb_version) AS enabled, argMax(ts, _peerdb_version) AS ts, argMax(_peerdb_synced_at, _peerdb_version) AS _peerdb_synced_at, argMax(_peerdb_is_deleted, _peerdb_version) AS _peerdb_is_deleted, max(_peerdb_version) AS _peerdb_version FROM peerdb.public_goals WHERE enabled = true GROUP BY id ) AS deduplicated_goals GROUP BY owned_user_id; ``` ### Use WINDOW FUNCTIONS You can use ClickHouse’s [window functions](https://clickhouse.com/docs/en/sql-reference/window-functions) to achieve similar deduplication by selecting the row with the highest `_peerdb_version` within each id partition. Here’s an example: ``` SELECT owned_user_id, COUNT(*) AS active_goals_count, MAX(ts) AS latest_goal_time FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _peerdb_version DESC) AS rn FROM peerdb.public_goals WHERE enabled = true ) AS ranked_goals WHERE rn = 1 GROUP BY owned_user_id; ``` ### Use Views to simplify deduplication Encapsulate deduplication in a [view](https://clickhouse.com/docs/en/sql-reference/statements/create/view) to make it simple for BI tools to query the most up-to-date data. For example, use a window function in the view to keep only the latest version of each row: ``` CREATE VIEW goals AS SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _peerdb_version DESC) AS rn FROM peerdb.public_goals WHERE enabled = true ) WHERE rn = 1; ``` ``` SELECT owned_user_id, COUNT(*) AS active_goals_count, MAX(ts) AS latest_goal_time FROM goals GROUP BY owned_user_id; ``` ## Nullable Columns If you’re coming from the Postgres world, one surprising aspect of ClickHouse is that it doesn’t store NULL values for columns unless you explicitly wrap the column types in [`Nullable`](https://clickhouse.com/docs/en/sql-reference/data-types/nullable). For example, instead of storing NULL for dates, ClickHouse stores `1970-01-01` as the default value, which might be unexpected. This behavior is due to the fact that storing NULLs can [impact](https://clickhouse.com/docs/en/sql-reference/data-types/nullable) query performance in ClickHouse, as it’s a columnar database. Hence, ClickHouse requires users to explicitly define `Nullable` types. In PeerDB, we’ve introduced a setting called `PEERDB_NULLABLE`, which, when set to `true`, automatically detects nullable columns in Postgres and marks them as `Nullable` in ClickHouse during the replication process. This means you don’t need to manually define `Nullable` types during replication. You can read more about this feature in the following [PR](https://github.com/PeerDB-io/peerdb/pull/2001). ## **Data Types** ClickHouse offers a wide variety of data types, ranging from numbers, text, timestamps, dates, and arrays to the recently introduced [JSON](https://github.com/ClickHouse/ClickHouse/issues/54864) type. Many of the data types in Postgres can be natively stored in ClickHouse without much modification. As a reference, here goes [the data type matrix](https://docs.peerdb.io/datatypes/datatype-matrix) we use at PeerDB when replicating data from Postgres to ClickHouse. ## The Ordering Key ### What is an Ordering Key? Choosing the right ordering key is crucial for query performance in ClickHouse. Defined by the `ORDER BY` clause when creating a table, the ordering key functions similarly to a index in Postgres but is optimized for analytics. Unlike Postgres, which uses a B-tree index with entries pointing to each row, ClickHouse uses Sparse Indexing: 1. **Data is sorted based on Ordering Key:** The ordering key ensures that data on disk is sorted according to the specified columns. This allows for better [compression](https://clickhouse.com/docs/en/data-compression/compression-in-clickhouse), as correlated values are stored together. 2. **Ordering Key also creates a sparse index:** The ordering key also creates a sparse index, storing only ranges of columns, with each entry pointing to a group of sorted rows. This keeps the index small, allowing ClickHouse to quickly identify relevant groups of rows using a binary search and execute queries efficiently. You can read more about this [here](https://clickhouse.com/docs/en/migrations/postgresql/designing-schemas#primary-ordering-keys-in-clickhouse). You can think of ordering keys as similar to [BRIN](https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BRIN) indexes in Postgres, but in ClickHouse, the data is automatically sorted based on the ordering key via asynchronous merging of parts, so you don’t need to handle sorting during data ingestion. ### Choosing an appropriate Ordering Key When selecting an ordering key, base your choice on the columns most frequently used in your query filters. **Prioritize columns that are commonly used in WHERE clauses, and order them in ascending sequence of cardinality**—starting with columns that have the fewest distinct values. This approach optimizes data compression and query performance. For a deeper understanding of this topic, refer to the detailed guide [here](https://clickhouse.com/docs/en/data-modeling/schema-design#choosing-an-ordering-key). ### **PRIMARY KEY vs Ordering Key** If you observe the table definition of `public_goals`, it has a `PRIMARY KEY`. You might be wondering how the `PRIMARY KEY` differs from the Ordering Key. Let us understand how they differ: 1. `PRIMARY KEY`, if specified, defines the columns in the sparse index, while the columns in the `ORDER BY` clause determine how the data is sorted on disk. They are also used for deduplicating data by the `ReplacingMergeTree`. 2. If the `PRIMARY KEY` isn’t specified, the Ordering Key automatically becomes the `PRIMARY KEY` and defines the columns in the sparse index. **NOTE:** Columns in the `PRIMARY KEY` should always be prefixed in the Ordering Key. This ensures that the index aligns with the physical data order, maximizing query performance by minimizing unnecessary data scans. **An example where** `PRIMARY KEY` **could differ from Ordering Key** An example where you might have different `PRIMARY KEY` and `ORDER BY` columns is when your queries are primarily filtered on `customer_id` rather than `id`. In this case, you can define the `PRIMARY KEY` on just `customer_id` and the `ORDER BY` on `customer_id, id`. This approach ensures a smaller, more efficient sparse index for querying, while data deduplication occurs on `id`, ensuring no data is lost. **NOTE:** Unlike in Postgres, where the `PRIMARY KEY` is a B-tree index that guarantees uniqueness, in ClickHouse, it does not ensure uniqueness. Instead, it defines the columns that should be part of the sparse index. ### Modifying the Ordering Key Choosing the right [ordering key](https://clickhouse.com/docs/en/migrations/postgresql/designing-schemas#primary-ordering-keys-in-clickhouse) is crucial for query performance in ClickHouse, as it acts as an index when querying data. By default, PeerDB uses the PostgreSQL `PRIMARY KEY` to define the ordering key in ClickHouse tables, but you can change it using the following methods: ### Use materialized views You can use materialized views to create a new table with a different ordering key suitable for your workload. Include the primary key columns at the end of the ordering key to ensure proper deduplication, as ReplacingMergeTree uses the ORDER BY clause for deduplication, and including the primary key ensures that no data is lost. ``` CREATE MATERIALIZED VIEW goals_mv ENGINE = ReplacingMergeTree(_peerdb_version) ORDER BY (enabled, ts, id)  POPULATE AS SELECT * FROM peerdb.public_goals; ``` **NOTE:** After creating the materialized view, be sure to follow the steps described in the previous section on handling duplicates to ensure proper deduplication during query time. ### Predefine target tables with the desired Ordering Key To change the ordering key, you can predefine new tables with your desired Ordering Key and then swap them with the existing tables. Here’s how you can do it: **1\. Create a Dummy Mirror:** Create a dummy mirror in PeerDB to generate the default tables with the correct metadata columns and data types. **2\. Create a New Table with the Desired Ordering Key:** Use the table created by PeerDB to define a new table with your desired ordering key. Include the primary key columns at the end of the ordering key to ensure proper deduplication. Here is an example: ``` CREATE TABLE public_events_new AS public_events ENGINE = ReplacingMergeTree(_peerdb_version) ORDER BY (user_id,id); ``` **3\. Drop the Old Table:** ``` DROP TABLE public_events; ``` **4\. Rename the New Table:** Rename the new table to the actual table ``` RENAME TABLE public_events_new TO public_events; ``` **5\. Start MIRROR to Point to the New Table:** Configure the mirror to point to the actual table. PeerDB uses `CREATE TABLE IF NOT EXISTS` behind the scenes and continues to ingest data into the new table. ## Handling DELETEs As mentioned, DELETEs from PostgreSQL are propagated as new rows marked as deleted (using the `_peerdb_is_deleted` column). To exclude deleted rows from your queries, you can create row-level policies in ClickHouse based on the `_peerdb_is_deleted` column. Here’s an example: ``` CREATE ROW POLICY policy_name ON table_name FOR SELECT USING _peerdb_is_deleted = 0; ``` This policy ensures that only rows where `_peerdb_is_deleted` is 0 are visible when querying the table. ## How to handle updates for primary keys? To propagate primary key updates into ClickHouse, enable `PEERDB_CLICKHOUSE_ENABLE_PRIMARY_UPDATE` setting on your mirror. This instructs the PeerDB normalize step to generate a “delete + insert” pair for changed primary keys, using a sign column (in the `ReplacingMergeTree`) and a version column to remove the old record before inserting the new one, ensuring that your ClickHouse tables accurately reflect the latest state from the source system. It is highly uncommon for primary keys to be updated in Postgres. ## Conclusion I hope you enjoyed reading the guide. We aimed to cover the most common data-modeling challenges you might encounter when migrating from PostgreSQL to ClickHouse. In the next blog, I plan to dive into more advanced topics, such as joins, writing efficient SQL queries, and so on. If you want to give PeerDB and ClickHouse a try to start replicating data from Postgres to ClickHouse, please check out the links below or reach out to us directly\! 1. [Try PeerDB Cloud for Free](https://auth.peerdb.cloud/signup) 2. [Try ClickHouse Cloud for Free](https://clickhouse.com/docs/en/cloud-quick-start) 3. [Talk to the PeerDB team directly](https://www.peerdb.io/sign-up)
Shard169 (laksa)
Root Hash17086075890740335769
Unparsed URLio,peerdb!docs,/bestpractices/clickhouse_datamodeling s443