How to Choose Between Liquid Clustering and Partitioning with Z-Order in Databricks
The views expressed in this blog are my own and do not represent official guidance from Databricks.
This is one of the most-read posts on the website, so we decided to give it a well-deserved 2026 update. Thank you to Geethu for co-authoring this revision and for raising the technical bar of the article.
Delta Lake, an open source storage format, offers two primary methods for organizing data: liquid clustering and partitioning with Z-order. This blog post will help you navigate the decision-making process between these two approaches. Clustering in Delta Lake enhances query performance by organizing data based on frequently accessed columns, similar to indexing in relational databases. The key difference is that clustering physically sorts the data within the table rather than creating separate index structures.
Understanding the Basics: Liquid Clustering vs. Partitioned Z-Order Tables
Liquid Clustering
Liquid clustering is a newer algorithm for Delta Lake tables, offering several advantages:
Flexibility: You can change clustering columns at any time.
Optimization for Unpartitioned Tables: It works well without partitioning.
Efficiency: It doesn’t re-cluster previously clustered files unless explicitly instructed.
Liquid clustering relies on optimistic concurrency control (OCC) to handle conflicts when multiple writes occur to the same table.
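The flexibility above can be sketched in Databricks SQL. The table and column names here are hypothetical, chosen to match the clickstream example used later in this post:

```sql
-- Create a table with Liquid Clustering on two columns
CREATE TABLE clicks (
  click_date  DATE,
  country     STRING,
  merchant_id BIGINT
)
CLUSTER BY (click_date, country);

-- Clustering keys can be changed at any time; only data written
-- (or re-optimized) afterwards is laid out by the new keys
ALTER TABLE clicks CLUSTER BY (merchant_id);

-- Incrementally clusters new data; files that are already
-- well-clustered are left alone
OPTIMIZE clicks;
```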
Partitioned Z-Order Tables
Partitioning combined with Z-ordering is a traditional approach with its own strengths:
Control: You get greater control over data organization.
Parallel Writes: It supports parallel writes more effectively.
Fine-Grained Optimization: You can optimize specific partitions independently.
However, data engineers must be aware of querying patterns upfront to choose an appropriate partition column.
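For contrast with the Liquid Clustering DDL, here is a minimal sketch of the partition + Z-order approach, again using a hypothetical clicks table:

```sql
-- Partition columns must be chosen up front, based on known query patterns
CREATE TABLE clicks (
  click_date    DATE,
  country       STRING,
  merchant_id   BIGINT,
  advertiser_id BIGINT
)
USING DELTA
PARTITIONED BY (click_date, country);

-- Z-order within partitions on the columns used for selective lookups
-- (Z-order columns must not be partition columns)
OPTIMIZE clicks ZORDER BY (merchant_id, advertiser_id);
```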
Decision Tree
Built in Jan 2026, this decision tree will be continuously updated as technology evolves. As new enhancements emerge, my understanding will grow, and this resource will be refined accordingly. This is a complex topic, but I will do my best to provide at least an intuitive grasp to help you develop a clearer understanding.
Factors to Consider When Choosing
Table Size
Small tables (< 10 TB): If you need fast lookups on exactly two columns, Liquid Clustering on those columns typically delivers comparable performance with simpler maintenance. If your workload involves highly selective lookups across three or more columns, Partition + Z-order may perform better, assuming the partition key has low cardinality. That said, Liquid Clustering can still work for multi-column lookups and is often worth benchmarking with tuned clustering keys.
Medium tables (10 TB to 500 TB): For medium-sized tables, the key decision factor is partition cardinality. If partitioning results in fewer than ~5,000 distinct values (for example, ~1,100 partitions for 3 years of daily data), Partition + Z-order can work well when queries include the partition column. If the number of distinct values exceeds ~5,000, Liquid Clustering is generally preferred to avoid over-partitioning. In practice, benchmark both approaches with representative queries to validate performance.
Large tables (> 500 TB): You should reach out to your Databricks representative and have a discussion.
Note: Liquid Clustering is being actively improved, so this guidance may change.
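One way to apply the ~5,000-value rule of thumb is to measure candidate partition cardinality before committing to a layout. A minimal sketch in Databricks SQL, assuming a hypothetical clicks table:

```sql
-- How many partitions would each candidate scheme actually produce?
SELECT
  COUNT(DISTINCT click_date)          AS date_partitions,
  -- distinct (date, country) combinations, i.e. the real partition count
  -- for a two-column partitioning scheme
  COUNT(DISTINCT click_date, country) AS date_country_partitions
FROM clicks;
```

If the result is well above ~5,000, that is a signal to prefer Liquid Clustering over partitioning on those columns.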
Data Ingestion Pattern
How data is written - batch or streaming - can influence which data organization strategy is most appropriate.
Batch Ingestion: For batch workloads, Liquid Clustering remains a strong default choice. Batch writes naturally organize data efficiently. In recent Databricks Runtime versions, eager clustering can be enabled so that data is well-clustered as it is written, and queries see an optimized layout right away.
Streaming Ingestion: For streaming workloads, the choice depends on your main priority.
Low Latency: If getting data into the table quickly is most important, use Liquid Clustering without eager clustering. This reduces shuffle overhead during ingestion. Data may not be fully optimized immediately, but query performance can improve later through Predictive I/O.
Fast Downstream Lookups: If queries need to be fast as soon as data arrives, Liquid Clustering with eager clustering is recommended. This ensures data is well-clustered on write, and follow-up OPTIMIZE can further improve query performance.
Query Patterns
If users consistently include the partition column in their queries, partitioning can be very effective.
Liquid clustering is more suitable for flexible query patterns, where users may not always filter on the partition column.
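The difference shows up clearly in two hypothetical queries against a table partitioned by click_date:

```sql
-- Includes the partition column: Delta prunes the scan to one partition
SELECT COUNT(*)
FROM clicks
WHERE click_date = DATE '2026-01-15';

-- No partition filter: every partition must be scanned; with Liquid
-- Clustering, multi-column data skipping can still narrow the scan
SELECT *
FROM clicks
WHERE merchant_id = 42;
```

If a large share of your workload looks like the second query, that is a point in favor of Liquid Clustering.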
Data Distribution
If partition sizes are uneven (skewed data), Liquid Clustering is usually the better choice.
Date-based data (e.g., clickstream data) often benefits from partitioning.
For data without a clear partitioning strategy, liquid clustering may be better.
Partition Column Selection
When choosing a partition column:
Select immutable columns (e.g., click date, sale date)
Avoid high-cardinality columns like timestamps
For timestamp data, create a derived date column for partitioning
Aim for fewer than 10,000 distinct partition values.
Each partition should contain at least ~1-10 GB of data.
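The "derived date column" guideline can be implemented with a Delta generated column, so the date is maintained automatically from the timestamp. A sketch, with hypothetical table and column names:

```sql
-- click_date is derived from the high-cardinality timestamp and kept
-- in sync by Delta; it becomes the low-cardinality partition column
CREATE TABLE clicks (
  click_ts    TIMESTAMP,
  click_date  DATE GENERATED ALWAYS AS (CAST(click_ts AS DATE)),
  merchant_id BIGINT
)
USING DELTA
PARTITIONED BY (click_date);
```

A nice side effect: Delta can translate filters on click_ts into partition filters on the generated click_date column, so queries on the raw timestamp still benefit from partition pruning.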
Real-World Example: Amazon Clickstream Data
Let's consider a real-world scenario using Amazon's clickstream data:
The table stores 3 years of data for 10 countries
Partitioning by click date results in roughly 1,100 date values (365 * 3 = 1,095)
10 countries * ~1,100 date partitions = ~11,000 total partitions
This sits right around the upper end of the recommended partition count (~10,000) and still provides good control over the data. Here's how we might structure this table:
Partition by: click_date, country
Z-order by: merchant_id, advertiser_id
Optimizing the Partitioned Table
To maintain optimal performance, you can run a daily optimization job on the newest partition:
OPTIMIZE table_name
WHERE click_date = 'ANY_DATE' AND country = 'CANADA'
ZORDER BY (merchant_id, advertiser_id)
This approach ensures good performance for date-range queries and lookups on Z-ordered columns.
Optimistic Concurrency Control
Delta Lake uses optimistic concurrency control to manage parallel writes. Here's how it works:
Writers check the current version of the Delta table (e.g., version 100).
They attempt to write a new JSON file (e.g., 101.json).
Only one writer can succeed in creating this file.
The "losing" writer checks if there are conflicts with what was previously written.
If no conflicts, it creates the next version (e.g., 102.json).
This approach works well for appends but can be challenging for updates, especially when multiple writers are trying to modify the same files.
Potential Pitfalls and Best Practices
Here are some key considerations and common mistakes to avoid:
Do not add correlated columns to liquid clustering: If two columns are highly correlated, you only need to include one of them as a clustering key. For example, if you have click_date and click_timestamp, cluster only by click_timestamp.
Skip meaningless keys: Avoid clustering on meaningless keys such as UUIDs, which are effectively random, unsortable strings. If possible, leave them out of both liquid clustering and Z-ordering. That said, customers sometimes require quick lookups on these UUID columns; in those cases, you may include them.
Over-Partitioning: A common mistake is creating too many partitions. While partitioning helps with performance, too many partitions result in overhead. A good rule of thumb is to keep partition counts under 10,000. For example, if you're storing three years of daily click data, partitioning by click_date would result in around 1,100 partitions, well within the 10,000-partition guideline. Avoid partitioning on high-cardinality columns (e.g., timestamps); this produces too many partitions and degrades performance. Instead, partition on a date column and ensure each partition holds enough data.
Enable Predictive Optimization: Turn on Predictive Optimization (PO) in your Databricks workspace to automatically manage maintenance for Unity Catalog managed tables. PO identifies tables that can benefit from operations such as OPTIMIZE, VACUUM, and ANALYZE, and schedules these jobs using serverless compute. This eliminates the need to manually schedule OPTIMIZE for compaction or clustering, as the platform triggers operations based on usage patterns, table statistics, and overall table health. For partitioned tables, PO applies compaction and layout improvements within each partition. For liquid-clustered tables, PO integrates with CLUSTER BY AUTO, automatically selecting clustering keys and scheduling incremental clustering jobs. This reduces manual tuning and ensures the table layout evolves with changing query patterns, keeping queries efficient without intervention.
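Opting a table into automatic key selection is a one-line change. A sketch, assuming a Unity Catalog managed table named clicks:

```sql
-- Let Databricks choose and evolve clustering keys based on query patterns
CREATE TABLE clicks (
  click_date  DATE,
  country     STRING,
  merchant_id BIGINT
)
CLUSTER BY AUTO;

-- Or switch an existing liquid-clustered table to automatic key selection
ALTER TABLE clicks CLUSTER BY AUTO;
```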
Schedule Optimization (If Required): With Predictive Optimization (PO) enabled, most maintenance tasks are handled automatically. You only need to run OPTIMIZE manually in the following cases:
For Z-ordered tables: OPTIMIZE does not automatically apply Z-ordering, so manual OPTIMIZE ... ZORDER BY runs are still required if Z-ordering is needed.
For liquid-clustered tables: manual OPTIMIZE is only needed when queries require faster response times immediately after data arrival, or when additional optimization is necessary to improve query performance.
Conclusion
Choosing between liquid clustering and partitioned Z-order tables depends on various factors including table size, write patterns, and query requirements. Always consider your specific use case and be prepared to test both approaches to determine the best fit for your data and query patterns. The right choice will significantly impact your query performance and overall data management efficiency.
Keep This Post Discoverable: Your Engagement Counts!
Your engagement with this blog post is crucial! Without claps, comments, or shares, this valuable content might become lost in the vast sea of online information. Search engines like Google rely on user engagement to determine the relevance and importance of web pages. If you found this information helpful, please take a moment to clap, comment, or share. Your action not only helps others discover this content but also ensures that you’ll be able to find it again in the future when you need it. Don’t let this resource disappear from search results — show your support and help keep quality content accessible!