<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Canadian Data Guy Unfiltered: TL;DR]]></title><description><![CDATA[Life is short, and sometimes we are in a hurry, so we want quick answers.]]></description><link>https://www.canadiandataguy.com/s/tldr</link><image><url>https://substackcdn.com/image/fetch/$s_!n3Eg!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cc7753-f8fb-4300-ac7f-1806e112a06a_1024x1024.png</url><title>Canadian Data Guy Unfiltered: TL;DR</title><link>https://www.canadiandataguy.com/s/tldr</link></image><generator>Substack</generator><lastBuildDate>Mon, 04 May 2026 09:17:15 GMT</lastBuildDate><atom:link href="https://www.canadiandataguy.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Canadian Data Guy]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[canadiandataguy@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[canadiandataguy@substack.com]]></itunes:email><itunes:name><![CDATA[Canadian Data Guy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Canadian Data Guy]]></itunes:author><googleplay:owner><![CDATA[canadiandataguy@substack.com]]></googleplay:owner><googleplay:email><![CDATA[canadiandataguy@substack.com]]></googleplay:email><googleplay:author><![CDATA[Canadian Data Guy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why I Materialize Delta History for Debugging]]></title><description><![CDATA[Just a Quick Tip]]></description><link>https://www.canadiandataguy.com/p/why-i-materialize-delta-history-for</link><guid 
isPermaLink="false">https://www.canadiandataguy.com/p/why-i-materialize-delta-history-for</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Thu, 27 Nov 2025 22:36:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IxOb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I&#8217;m debugging a Delta table with millions of commits, especially one with <strong>heavy ingestion</strong> and lots of Parquet files, I often need to trace a specific record back to:</p><ul><li><p>which <strong>commit</strong> wrote it</p></li><li><p>which <strong>job</strong> wrote it (Job ID, Job Run ID)</p></li><li><p>which <strong>operation</strong> triggered that write</p></li></ul><p><code>DESCRIBE HISTORY</code> gives you this metadata, but on large tables it can be slow, and running it repeatedly while investigating a bug quickly becomes painful.</p><p>The practical workaround is to <strong>dump the entire history once</strong> into a physical table.<br>From there, you can filter, join, and slice it quickly, without re-scanning the Delta log on every query.</p><h3><strong>One-Time Dump of Delta Table History</strong></h3><pre><code><code>CREATE TABLE IF NOT EXISTS databricks_support.default.describe_history__your_table_name AS
SELECT *
FROM (
    DESCRIBE HISTORY your_catalog_name.your_database_name.your_table_name
);
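-- Hypothetical follow-up (a sketch, not from the original post): query the
-- materialized copy instead of re-running DESCRIBE HISTORY. 'WRITE' is only an example filter.
SELECT version, timestamp, operation, job, operationMetrics
FROM databricks_support.default.describe_history__your_table_name
WHERE operation = 'WRITE' ORDER BY version DESC;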
</code></code></pre><p>For deep debugging (record &#8594; parquet file &#8594; commit lineage), this table becomes a fast, queryable audit log.<br>In practice, this works best when run from a <strong>notebook</strong>, where long-running metadata operations are less fragile.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IxOb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IxOb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IxOb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1441462,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/180138751?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IxOb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!IxOb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc11881-0c9f-4741-bdf4-28cb1c59fe00_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I also have a script that can identify which row is written in which Parquet file by which commit; drop me a comment if you need it.</p><p></p>]]></content:encoded></item><item><title><![CDATA[How Do I Think About Setting Spark Shuffle Partitions in 2025?]]></title><description><![CDATA[TLDR: A Quick Guide to setting Spark.Shuffle.Partitions, No Deep Dive Required]]></description><link>https://www.canadiandataguy.com/p/how-do-i-think-about-setting-spark</link><guid isPermaLink="false">https://www.canadiandataguy.com/p/how-do-i-think-about-setting-spark</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Tue, 15 Apr 2025 21:36:00 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!1B0p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 2025, agonizing over Spark shuffle partitions matters less than it used to, thanks to improvements in the Spark ecosystem. In earlier years (say, 2015 to 2019), the default setting of 200 partitions often proved either too high or too low, prompting manual tuning and much deliberation. With Adaptive Query Execution (AQE), many of these decisions are now handled automatically at runtime, so good performance rarely depends on constant human intervention. This guide provides a streamlined decision tree to help you quickly determine whether any manual adjustment is needed, so you can focus on higher-value aspects of your data processing work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1B0p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1B0p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 424w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 848w, 
https://substackcdn.com/image/fetch/$s_!1B0p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1272w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1B0p!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png" width="1200" height="881.8681318681319" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c40b64c5-4592-4329-a649-b2103a6f93e4_3840x2822.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1070,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:714500,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/161416272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc40b64c5-4592-4329-a649-b2103a6f93e4_3840x2822.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1B0p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 424w, 
https://substackcdn.com/image/fetch/$s_!1B0p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 848w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1272w, https://substackcdn.com/image/fetch/$s_!1B0p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93baebc6-75bd-4434-9843-f0f9fbb75d83_3840x2822.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>How to calculate in-memory data size</strong></h2><p>When assessing data size for partitioning in Spark, note that the on-disk size (for example, data stored in S3) does not always reflect the in-memory size. Formats like Parquet or Avro are typically compressed, so the actual memory footprint can be 2 to 8 times larger than the file size on disk. Understanding the in-memory size is essential for properly tuning your shuffle partition settings.</p><p>To accurately gauge this in-memory size, you can run the following Spark commands to trigger a computation and then inspect the Spark UI (specifically the SQL / DataFrame tab) for the 'Shuffle read size':</p><pre><code># Assumes an active SparkSession named `spark` (as in a notebook or spark-shell)
# Read data (example: Parquet file)
df = spark.read.load("examples/src/main/resources/users.parquet")

# Write to the no-op sink (does not write data, but triggers computation)
df.write.format("noop").mode("overwrite").save()</code></pre><p>This approach helps ensure that you're basing your partitioning decisions on the actual memory requirements rather than the compressed on-disk sizes.</p><h2>References</h2><p><a href="https://www.databricks.com/notebooks/gallery/SparkAdaptiveQueryExecution.html">https://www.databricks.com/notebooks/gallery/SparkAdaptiveQueryExecution.html</a></p><p><a href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide">https://www.databricks.com/discover/pages/optimize-data-workloads-guide</a></p>]]></content:encoded></item></channel></rss>