Delta Lake transaction logs explained
The foundation for reliable, scalable and auditable data lakes.
As organizations adopt lakehouse architectures to modernize their analytical ecosystems, Delta Lake stands out for bringing ACID guarantees, schema governance and time-travel capabilities to data lakes built on open storage formats.
At the core of Delta Lake's transactional engine is the _delta_log—a highly structured append-only log that tracks every change to the table across versions, applications and time. Understanding this log is essential for anyone architecting or engineering reliable data pipelines using Delta Lake capabilities.
What is _delta_log?
Each Delta table contains a hidden folder called _delta_log. This folder is the control center for all transactions performed on the table. It tracks every operation, change and schema update—effectively acting as a Git-like commit history for your data.
Within _delta_log, you will typically find:
- JSON files: One per commit, capturing metadata and file actions (e.g., 00000000000000000001.json, 00000000000000000002.json, etc.)
- Checkpoint files: Parquet summaries written every N commits to accelerate reads and recovery.
- CRC files: Integrity checks that help verify the log's correctness.
A typical Delta Lake log directory structure. Note: CRC files are not required from Delta Lake 3.x onwards. The directory also contains a _last_checkpoint file that points to the most recent checkpoint.
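To make the layout concrete, here is a minimal Python sketch that lists the contents of a _delta_log directory. The table path /data/events is a hypothetical example and the snippet assumes a local filesystem; object stores would use their own listing APIs.

# Minimal sketch: inspect the _delta_log of a hypothetical Delta table at /data/events.
# Works for local paths; for S3/ADLS/GCS use the object store's listing API instead.
import os

log_dir = "/data/events/_delta_log"

# Commit files (*.json), checkpoint files (*.checkpoint.parquet) and _last_checkpoint.
for name in sorted(os.listdir(log_dir)):
    print(name)

# _last_checkpoint is a small JSON document pointing at the latest checkpoint version.
# It appears only after the first checkpoint has been written.
with open(os.path.join(log_dir, "_last_checkpoint")) as f:
    print(f.read())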
What gets logged: Each commit is a JSON file.
Every time data is written to a Delta table, whether through a batch insert, an update or a merge, Delta records the event by creating a new JSON commit file inside the _delta_log directory. These files describe all the changes made in that specific operation.
Each JSON file represents all the changes made in one atomic transaction. Here is how an illustrative transaction log record is stored within a JSON file:
{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}
{"metaData": {"schemaString": "...", "partitionBy": ["load_partition_date"]}}
{"remove": {"path": "load_partition_date=2024-11-01/part-00312.snappy.parquet", ...}}
{"add": {"path": "load_partition_date=2024-11-01/part-00390.snappy.parquet", ...}}
{"txn": {"appId": "delta-log-deepdive-job", "version": 109}}
Each line of the sample record above describes one of the following actions:
- Protocol: Sets version compatibility rules for readers and writers. For example, it prevents compute engines running incompatible Delta Lake library versions from corrupting tables created with newer protocol versions.
- MetaData: Defines the schema, format, partition columns and table properties used during the commit operation.
- Remove: Marks an old data file as logically deleted. The file is retained until the VACUUM command is executed.
- Add: Registers the newly written file that replaces the old one.
- Txn: Defines the transaction boundary for exactly-once semantics.
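As an illustration, the sketch below reads one commit file as newline-delimited JSON with Spark and shows which actions it carries; the table path and commit number are hypothetical.

# Minimal sketch: inspect the actions recorded in a single commit file.
# The path and commit number below are illustrative, not from a real table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

commit = spark.read.json("/data/events/_delta_log/00000000000000000027.json")

# Each row holds exactly one action (protocol, metaData, add, remove or txn);
# the other top-level columns are null for that row.
commit.printSchema()
commit.show(truncate=False)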
This log-centric design enables reproducibility, auditability, lineage and complete table versioning alongside the data itself, without relying on any external tooling.
How does an update work?
Delta Lake does not update rows in place. Instead, it rewrites entire Parquet files containing matched rows. Unchanged rows in those files are also copied to new files. This design supports immutability, rollback, audit and safe compaction.
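To see this copy-on-write behavior in practice, here is a sketch using the delta-spark Python package that updates a hypothetical table and then checks the file counts reported in its history; the path, column names and values are illustrative assumptions.

# Minimal sketch: run an update and observe copy-on-write in the table history.
# Table path, column names and values are hypothetical examples.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured

dt = DeltaTable.forPath(spark, "/data/posted_transactions")

# Every Parquet file containing a matching row is rewritten; unchanged rows are copied too.
dt.update(condition=F.col("status") == "PENDING",
          set={"status": F.lit("POSTED")})

# operationMetrics (e.g., numAddedFiles, numRemovedFiles) reflects the rewritten files.
dt.history(1).select("version", "operation", "operationMetrics").show(truncate=False)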
Commit handling: optimistic concurrency in action
Delta Lake employs optimistic concurrency control to allow high-throughput, concurrent writes without data loss or corruption:
- Each job attempts to write the next versioned JSON file (e.g., 00000000000000000027.json).
- If another job has already written it, the transaction retries using the next available number.
- Only one transaction can succeed at a given version.
This model enables multi-writer, high-throughput pipelines without complex locking or centralized coordination, while ensuring data integrity at scale, a critical requirement for production-grade data lakes.
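When Delta cannot reconcile two writes automatically (for example, two jobs touching the same files), the delta-spark Python package raises conflict exceptions such as ConcurrentAppendException. The sketch below shows one way an application might retry; the retry and backoff policy is an illustrative assumption, not a prescribed pattern.

# Minimal sketch: retry a write when a concurrent transaction wins the conflict check.
# delta-spark raises ConcurrentAppendException (and related exceptions) for conflicts
# it cannot resolve automatically; the retry/backoff policy here is illustrative.
import time

from delta.exceptions import ConcurrentAppendException


def append_with_retry(df, path, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # simple backoff before re-attempting the commit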
How Delta Lake handles failure recovery
- Data files are written first, but only made visible after a successful JSON commit.
- If a job crashes midway, no partial state is exposed.
- Readers always interact with a consistent, completed version.
This protects consumers from ingesting incomplete or corrupted datasets.
Delta Lake schema evolution and enforcement
Delta Lake provides a flexible and safe approach to schema management:
- Schema evolution: Automatically incorporate new fields (e.g., nullable columns) during write operations.
- Schema enforcement: Validate incoming data against the expected schema to prevent accidental data corruption.
Schema changes are tracked at the metadata level within the transaction log. Schema evolution for a given table can be audited as shown in the example below:
DESCRIBE HISTORY tableName
Both applications and human users gain auditability and the ability to evolve pipelines incrementally, without losing control over data consistency.
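As an illustration of both behaviors, the PySpark sketch below appends a DataFrame that carries a new nullable column to a hypothetical table: without mergeSchema the write is rejected by schema enforcement, and with it the column is added via schema evolution. The path, column names and values are assumptions for the example.

# Minimal sketch: schema enforcement vs. schema evolution on append.
# Table path, column names and sample values are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured

new_rows = spark.createDataFrame(
    [("txn-1001", "POSTED", "mobile")],
    ["transaction_id", "status", "channel"],  # "channel" is a new, nullable column
)

# Without mergeSchema, this append fails schema enforcement if "channel" is not in the table.
(new_rows.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # opts in to schema evolution for this write only
    .save("/data/posted_transactions"))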
Delta Lake time travel and versioned queries
Delta Lake’s time travel capabilities let you access the state of a table at any point in its transactional history, as shown in the examples below:
SELECT * FROM postedTransaction VERSION AS OF 48;
SELECT * FROM postedTransaction TIMESTAMP AS OF '2024-12-01 08:00:00';
This is very useful for:
- Rolling back failed pipelines
- Reproducing ML training sets
- Debugging historical production incidents or regressions
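For example, a rollback after a failed pipeline run might look like the sketch below, which reads an older version through the DataFrame API and then restores the live table to it; the path and version number are illustrative assumptions.

# Minimal sketch: read an older snapshot and roll the live table back to it.
# Table path and version number are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured

old_snapshot = (spark.read.format("delta")
    .option("versionAsOf", 48)
    .load("/data/posted_transactions"))
# old_snapshot can now be inspected or compared against the current state.

# RESTORE rewrites the current table state to match the chosen version.
spark.sql("RESTORE TABLE delta.`/data/posted_transactions` TO VERSION AS OF 48")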
Delta Lake operational insights from _delta_log
A thorough understanding of _delta_log provides enhanced observability and control, as detailed in the table below:
Capability | Description
---|---
Change data capture | Commit log diffs can be used to track changes and infer data insertions, updates and deletions.
Lineage tracking | Data evolution can be traced back to specific files and/or columns.
Failure diagnosis | Data corruptions and anomalies can be identified by tracking retries.
Optimization metadata metrics | Metrics indicate when to trigger OPTIMIZE, ZORDER or VACUUM processes to maintain and optimize the data lake.
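One way to surface these signals is to query the table history programmatically, as in the sketch below; the table path, lookback depth and chosen operations are assumptions, and the keys inside operationMetrics vary by operation type.

# Minimal sketch: mine recent commits for operational signals such as which
# operations ran and what their file-level metrics were. Path is hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured

history = DeltaTable.forPath(spark, "/data/posted_transactions").history(100)

# operationMetrics is a map whose keys differ per operation (e.g., numAddedFiles
# and numRemovedFiles for UPDATE/DELETE, numTargetFiles* for MERGE).
(history
    .filter(F.col("operation").isin("MERGE", "UPDATE", "DELETE", "OPTIMIZE"))
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(truncate=False))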
Best practices for managing _delta_log in production
1. Set appropriate checkpoint intervals. The default is every 10 commits and can be adjusted based on write volume (see the sketch after this list).
2. Avoid large numbers of small files. Use OPTIMIZE and Auto Compaction, optionally organizing data with ZORDER, to manage small files.
OPTIMIZE delta.`/path/to/delta/table` ZORDER BY (attributeName);
3. Use VACUUM to clean up stale files, with specified retention threshold.
VACUUM delta.`/path/to/delta/table` RETAIN 240 HOURS
4. Monitor schema evolution. Audit and alert on unexpected schema changes.
5. Watch log file volume. Too many JSON files between checkpoints may indicate write skew or a checkpoint interval set too high.
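For best practice 1, the checkpoint interval can be tuned through a table property, as in the sketch below; the table path and chosen value are illustrative.

# Minimal sketch: raise the checkpoint interval from the default of 10 commits.
# Table path and the chosen value are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured

spark.sql("""
    ALTER TABLE delta.`/path/to/delta/table`
    SET TBLPROPERTIES ('delta.checkpointInterval' = '25')
""")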
Implications and trade-offs of adopting Delta Lake
Adopting the Delta Lake table format carries implications that are often invisible to data consumer applications, particularly in multi-tenant data lake environments where an enterprise platform team manages and addresses them:
- Metadata volume can grow quickly: High-frequency updates inflate the transaction log in the absence of checkpointing.
- Mitigation: Use OPTIMIZE and perform frequent checkpointing with appropriate retention times.
- _delta_log is a single point of truth: Corruption or manual tampering can break versioning.
- Mitigation: Establish controls to monitor _delta_log integrity and configure appropriate access restrictions.
- Schema enforcement can break data ingestion pipelines: Strict validation may cause write failures.
- Mitigation: Establish data contracts, appropriate schema evolution settings and automated tests for upstream changes.
- Compute engine interoperability varies: Delta Lake works best in Spark-native environments.
- Mitigation: Evaluate Delta Lake UniForm, which exposes Iceberg-compatible metadata for broader open table format engine support (e.g., EMR, Glue, Snowflake).
- Multi-region data lake resiliency implications: _delta_log files may be replicated to other regions before data files, causing inconsistencies for data consumers.
- Mitigation: Implement custom routing for consumers, especially in active-passive setups for lake consumption.
How _delta_log powers the lakehouse
The _delta_log is crucial for powering lakehouse environments because it provides:
- ACID compliance on cloud-native object storage
- Support for both batch and streaming use cases
- Rollback, audit and lineage capabilities
- Compute engine-agnostic read/write capabilities with Delta Lake UniForm
Frequently asked questions
What is _delta_log?
A folder that logs all changes—schema, files, metadata—to support transactions and time travel.
Can we delete _delta_log?
No. Deleting _delta_log breaks the table. Use Delta’s VACUUM command to remove obsolete data files safely; old log entries are cleaned up automatically based on the log retention settings.
How is Delta Lake different from Parquet?
Parquet is a storage format. Delta adds ACID transactions, schema enforcement, versioning and metadata governance on top of it.
What if there are no .CRC files?
They are not required in Delta Lake 3.x onwards.
Bottom line: Mastering Delta’s Transaction Log is essential for building resilient, auditable, trustworthy and scalable data platforms!