Build an Ethereum ETL Pipeline for Free Using Databricks Free Edition
Build a zero-infrastructure streaming pipeline: Step-by-step Ethereum data ingestion, schema evolution, and Delta storage
The Ethereum blockchain generates one of the richest transactional datasets in the world. Yet analyzing this data directly from a node proves impractical — blocks and transactions arrive as deeply nested structures, making efficient querying nearly impossible without transformation.
Best of all? You can build this entire pipeline without spending a dollar. Databricks Free Edition is a no-cost version of Databricks designed for students, educators, hobbyists, and anyone interested in learning or experimenting with data and AI. Simply sign up for Databricks Free Edition — no credit card required — and you'll get a workspace with serverless compute ready to go.
In this technical walkthrough, we'll build a streaming ETL pipeline in Databricks that:
Ingests raw Ethereum blocks from AWS's public dataset
Extracts and flattens transaction data
Stores everything in queryable Delta Lake tables for analytics
Why Databricks + Auto Loader for Blockchain Data?
Databricks Auto Loader provides a Structured Streaming source called cloudFiles that automatically processes new files as they arrive, with the option of also processing existing files. This makes it ideal for blockchain data because:
Scalable storage: Blockchain data grows continuously — Delta Lake handles petabyte-scale datasets effortlessly
Schema enforcement: Flatten complex nested Ethereum data into clean, queryable tables with automatic schema evolution
Streaming ingestion: Process blocks and transactions in near real-time as they arrive
SQL + ML integration: Run ad-hoc queries or feed data directly into ML models for fraud detection, token analytics, or NFT tracking
Step 1: Load Historical Ethereum Data from AWS S3
AWS provides free access to blockchain datasets through the aws-public-blockchain S3 bucket, with the data already transformed into compressed Parquet files and partitioned by date for efficient querying.
Instead of setting up an Ethereum node, we'll directly pull historical block data that's already preprocessed into Parquet format — completely free.
Here's what we'll accomplish:
Connect anonymously to AWS S3
List available Ethereum block files (stored as Parquet)
Download selected files into a Unity Catalog Volume in Databricks
Verify successful data landing
This foundational step prepares our workspace for efficient block processing using Spark and Delta Lake.
# Databricks notebook source
# === SIMPLE PARAMETERIZATION (ONLY NECESSARY VARIABLES) ===
# Parameterize only what needs to be variable for reusability
dbutils.widgets.text("catalog_name", "blockchain", "Catalog Name")
dbutils.widgets.text("schema_name", "ethereum", "Schema Name")
dbutils.widgets.text("num_files", "20", "Number of Files to Download")
# === CONFIGURATION ===
# Get widget values
CATALOG = dbutils.widgets.get("catalog_name")
SCHEMA = dbutils.widgets.get("schema_name")
NUM_FILES = int(dbutils.widgets.get("num_files"))
# Hard-coded values (no need to parameterize constants)
AWS_BUCKET = "aws-public-blockchain"
S3_PREFIX = "v1.0/eth/blocks/"
# Unity Catalog volume paths for data organization
DATA_VOLUME = f"/Volumes/{CATALOG}/{SCHEMA}/ethereum"
CHECKPOINT_VOLUME = f"/Volumes/{CATALOG}/{SCHEMA}/ethereum_checkpoints"
SCHEMA_VOLUME = f"/Volumes/{CATALOG}/{SCHEMA}/ethereum_schemas"
print(f"🔧 Using Catalog: {CATALOG}, Schema: {SCHEMA}")
print(f"📦 Downloading {NUM_FILES} files from s3://{AWS_BUCKET}/{S3_PREFIX}")
# === UNITY CATALOG SETUP ===
stmts = [
f"CREATE CATALOG IF NOT EXISTS {CATALOG}",
f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}",
f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.ethereum",
f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.ethereum_checkpoints",
f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.ethereum_schemas",
]
for i, s in enumerate(stmts, 1):
    print(f"[{i}/{len(stmts)}] {s}")
    try:
        spark.sql(s)
        print("  ✅ Success")
    except Exception as e:
        print(f"  ❌ Error: {e}")
print(f"\nCreated/verified UC objects. Paths available:")
print(f" Data: {DATA_VOLUME}")
print(f" Checkpoints: {CHECKPOINT_VOLUME}")
print(f" Schemas: {SCHEMA_VOLUME}")
# === DATA DOWNLOAD ===
import os
import boto3
from botocore import UNSIGNED
from botocore.client import Config
print(f"\n📥 Downloading to: {DATA_VOLUME}")
os.makedirs(DATA_VOLUME, exist_ok=True)
# Configure anonymous S3 client (no AWS credentials needed!)
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
# Collect parquet files from S3
keys = []
token = None
while len(keys) < NUM_FILES:
    params = {
        "Bucket": AWS_BUCKET,
        "Prefix": S3_PREFIX,
        "MaxKeys": min(1000, NUM_FILES - len(keys))
    }
    if token:
        params["ContinuationToken"] = token
    resp = s3.list_objects_v2(**params)
    for obj in resp.get("Contents", []) or []:
        if obj["Key"].endswith(".parquet"):
            keys.append(obj["Key"])
            if len(keys) >= NUM_FILES:
                break
    if not resp.get("IsTruncated"):
        break
    token = resp.get("NextContinuationToken")
if not keys:
raise RuntimeError(f"No parquet files found under s3://{AWS_BUCKET}/{S3_PREFIX}")
# Download files with progress tracking
for i, key in enumerate(keys, 1):
    rel_path = key.replace("v1.0/eth/", "")
    dest_path = os.path.join(DATA_VOLUME, rel_path)
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    print(f"[{i}/{len(keys)}] {os.path.basename(key)} ...", end=" ", flush=True)
    s3.download_file(AWS_BUCKET, key, dest_path)
    print("✓")
print("✅ Download complete!")Step 2: Stream Raw Blocks with Autoloader
With Ethereum blockchain data now in our Unity Catalog volume, we can continuously process new blocks as they arrive. In production, you'd use web3.py to poll the Ethereum network and save new blocks as Parquet files. For now, we'll stream the historical Parquet files we downloaded.
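As a rough illustration of that production path, here is a minimal sketch of such a poller. It is not part of this walkthrough: RPC_URL is a placeholder for an Ethereum JSON-RPC endpoint you would supply, it assumes web3 and pandas (with a Parquet engine such as pyarrow) are installed, and the column names are illustrative rather than an exact match for the AWS dataset schema.
import pandas as pd
from web3 import Web3

RPC_URL = "https://your-ethereum-rpc-endpoint"  # placeholder: bring your own node or provider
w3 = Web3(Web3.HTTPProvider(RPC_URL))

def dump_block_to_parquet(block_number: int, out_dir: str) -> str:
    """Fetch one block over JSON-RPC and write a one-row Parquet file for Auto Loader to pick up."""
    block = w3.eth.get_block(block_number)
    row = {
        "number": block["number"],
        "hash": block["hash"].hex(),
        "timestamp": block["timestamp"],
        "gas_used": block["gasUsed"],
        "transaction_count": len(block["transactions"]),
    }
    path = f"{out_dir}/block_{block_number}.parquet"
    pd.DataFrame([row]).to_parquet(path, index=False)
    return path

# Example: land the latest block in the same folder the stream below reads from
# dump_block_to_parquet(w3.eth.block_number, f"{DATA_VOLUME}/blocks")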
Auto Loader's cloudFiles source automatically processes new files as they arrive, making it a natural fit for blockchain data ingestion.
How Auto Loader Works
Configuration options specific to the cloudFiles source are prefixed with cloudFiles so that they live in a separate namespace from other Structured Streaming source options. Key features include:
Automatic file discovery: Watches folders continuously and picks up new files
Schema evolution: Auto Loader can detect schema drift, notify you when the schema changes, and rescue data that would otherwise be ignored or lost
Exactly-once processing: Maintains state in checkpoint location to ensure no data loss or duplication
Schema hints: Let you pin the types of a few important columns without having to declare the entire schema by hand
# === STREAMING READER ===
reader = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")  # Specify Parquet file format
    .option("cloudFiles.schemaLocation", SCHEMA_VOLUME)  # Schema tracking
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # Handle new fields
    .option("cloudFiles.schemaHints", "number BIGINT, baseFeePerGas BIGINT")  # Type hints
    .load(f"dbfs:{DATA_VOLUME}/blocks/")
)
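# Auto Loader tracks the inferred schema (plus our hints) under SCHEMA_VOLUME. Because the
# schema is inferred, a _rescued_data column is also added so values that don't match the
# tracked schema are captured instead of silently dropped. Optional sanity checks
# (a sketch, not required for the pipeline):
reader.printSchema()  # inspect inferred columns and hinted types
# reader.filter("_rescued_data IS NOT NULL")  # peek at any records Auto Loader rescued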
# Display streaming data for monitoring
display(reader)
Step 3: Create Delta Tables for Blockchain Data
Next, we take the streaming DataFrame, attach the file metadata column, and write the result into a Delta table. This provides a foundation for efficient queries and downstream transformations.
# Keep all block-level fields and attach the file metadata column (_metadata)
blocks_df = reader.select(
    "*",
    "_metadata"
)
# Write to Delta using Structured Streaming
blocks_query = (
    blocks_df.writeStream
    .format("delta")  # Delta Lake sink for ACID transactions
    .outputMode("append")  # Append new blocks as they arrive
    .option("checkpointLocation", f"{CHECKPOINT_VOLUME}/blocks/")  # State management
    .trigger(availableNow=True)  # Process all available data, then stop
    .table(f"{CATALOG}.{SCHEMA}.blocks")  # Save as managed Delta table
)
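Because the trigger is availableNow, the stream stops on its own once every file currently in the Volume has been processed. A quick, optional sanity check after it finishes could look like the following sketch (blocks_query is the StreamingQuery returned above):
# Block until the incremental batch finishes, then confirm rows landed in the Delta table
blocks_query.awaitTermination()
row_count = spark.table(f"{CATALOG}.{SCHEMA}.blocks").count()
print(f"Loaded {row_count} block rows into {CATALOG}.{SCHEMA}.blocks")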
Trigger Types: Why trigger(availableNow) Matters
The availableNow trigger consumes all available records as an incremental batch, with the ability to cap batch size through options such as maxBytesPerTrigger. Understanding trigger options is crucial for optimizing your blockchain pipeline.
Available Trigger Types
Databricks Structured Streaming provides multiple triggers to control micro-batch execution:
trigger(once=True) (Deprecated)
Runs the query a single time, processing all currently available data, then stops
Ideal for one-time backfills or finite datasets
In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated; Databricks recommends Trigger.AvailableNow for all incremental batch processing workloads
trigger(availableNow=True) (Recommended for Blockchain Pipelines)
Processes all currently available data immediately
Stops automatically once everything available has been processed, making it ideal for scheduled incremental jobs
With Trigger.AvailableNow, file discovery happens asynchronously from data processing, and data can be processed across multiple micro-batches with rate limiting
Ensures historical and new blocks are captured seamlessly
Handles schema evolution safely without dropping fields
Checkpointing ensures files that land mid-run are picked up on the next run, so no blocks or transactions are missed
trigger(processingTime='10 seconds')
Processes data at fixed time intervals
Useful for controlling costs and reducing API calls
Best for scenarios without strict latency requirements
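For contrast with the availableNow configuration we use below, a fixed-interval variant might look like the following sketch. It reuses blocks_df and the widget-driven names from earlier; treat it as an alternative to the availableNow query, not something to run alongside it against the same table.
# Sketch: continuously running variant that fires a micro-batch every 10 seconds
interval_query = (
    blocks_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{CHECKPOINT_VOLUME}/interval_blocks/")
    .trigger(processingTime="10 seconds")  # fixed-interval micro-batches
    .table(f"{CATALOG}.{SCHEMA}.blocks")
)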
Why We Use trigger(availableNow)
For our Ethereum pipeline, availableNow provides the perfect balance:
Historical data ingestion: Process all existing blocks in one pass
Schema resilience: Handles Ethereum protocol upgrades that add new fields
Resource optimization: Auto Loader by default processes a maximum of 1000 files per micro-batch. You can set cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger to control how many files or how many bytes are processed in each micro-batch
Configuration Example
By combining availableNow with Delta tables and checkpoints, we achieve a robust, scalable streaming solution for blockchain data — while keeping the code reusable for any streaming data source.
# Fine-tune batch processing: rate limits belong on the Auto Loader reader, not the writer
tuned_reader = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", SCHEMA_VOLUME)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.maxFilesPerTrigger", 100)  # Process at most 100 files per micro-batch
    .option("cloudFiles.maxBytesPerTrigger", "1GB")  # Soft limit on data per micro-batch
    .load(f"dbfs:{DATA_VOLUME}/blocks/")
)
optimized_query = (
    tuned_reader.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{CHECKPOINT_VOLUME}/optimized_blocks/")
    .trigger(availableNow=True)
    .table(f"{CATALOG}.{SCHEMA}.blocks")
)
Conclusion
In this blog, we explored how to ingest Ethereum blockchain data into Databricks and store it in Delta Lake, creating a solid foundation for analysis:
Raw Ethereum Parquet files are ingested into a Databricks Volume.
Streaming ingestion is handled with Databricks Auto Loader for reliable file detection.
Schema evolution is configured so new fields are captured automatically as Ethereum evolves.
Data is queryable with Databricks SQL, enabling analytics on both historical and new blockchain data.
This setup provides a foundation that can be extended with Medallion Architecture (Bronze → Silver → Gold) and enrichments such as:
Total ETH transferred per day.
Active wallets and transaction counts per day.
Gas usage trends.
Token transfer analytics.
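As a taste of those enrichments, here is a minimal sketch of a daily activity aggregate over the blocks table we just built. It assumes the timestamp, gas_used, and transaction_count columns from the AWS dataset made it into your inferred schema; adjust the names if your schema differs.
# Sketch: daily block count, gas usage, and transaction volume from the Bronze blocks table.
# Column names follow the AWS public blockchain dataset and may differ in your schema.
daily_activity = spark.sql(f"""
    SELECT
        DATE(timestamp)         AS block_date,
        COUNT(*)                AS blocks,
        SUM(gas_used)           AS total_gas_used,
        SUM(transaction_count)  AS total_transactions
    FROM {CATALOG}.{SCHEMA}.blocks
    GROUP BY DATE(timestamp)
    ORDER BY block_date
""")
display(daily_activity)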
🚀 Blockchain data is massive and fast-moving — but with Databricks + Delta Lake, you now have a scalable and robust way to tame it.
Future Work
1. Add a Custom Streaming Reader
Instead of first dumping Parquet to storage, implement a direct Spark Structured Streaming source that connects to Ethereum nodes (via WebSocket / JSON-RPC).
This allows new blocks and transactions to flow directly into Spark DataFrames, reducing latency and storage overhead.
For example, a Python wrapper around web3.py could push blocks straight into Spark's DataStreamReader.
2. Enrich On-Chain Data with Off-Chain Sources
Pull token metadata, DeFi protocol information, and NFT collection data from external APIs.
Join off-chain data with raw transactions for deeper analytics, like wallet behavior, token performance, and DeFi usage patterns.
In the next blog, we’ll dive into keeping this data up to date continuously using Spark Structured Streaming, so your Delta tables always reflect the latest blocks and transactions in real time.


