Git Under the Hood


I decided to write a short introduction to Git — and more specifically, what’s going on under the hood. Understanding the internal mechanics of Git can help you better grasp its basic behavior and make it easier to understand commands used in everyday workflows.

Did you know that Git is fundamentally a key-value store?

In Git, the key is a SHA-1 (or SHA-256) hash of an object, and the value is the Git object itself.

What is a hash?

SHA stands for Secure Hash Algorithm. It’s a family of cryptographic hash functions designed to take input data of any size and produce a fixed-size string of characters — a "digest" or "hash" — that looks like a random sequence of letters and numbers.

Key Properties of a Hash

  • The same input always produces the same output hash.
  • It’s fast to compute a hash for any given input.
  • Given a hash, it's computationally infeasible to reverse it and recover the original input — doing so would require an impractical amount of time and energy.
  • It's extremely unlikely that two different inputs will produce the same hash (this is called a collision).
  • Even small changes in the input result in completely different hashes.
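
These properties are easy to observe with Python's standard hashlib module. The snippet below is only an illustration: the helper name sha256_hex and the sample inputs are made up for this example and are not part of Git.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


a = sha256_hex(b"hello world")
b = sha256_hex(b"hello world")      # the same input again
c = sha256_hex(b"hello world!")     # one extra character
d = sha256_hex(b"x" * 1_000_000)    # a much larger input

print(a == b)           # True: the same input always gives the same digest
print(len(a), len(d))   # 64 64: the digest size does not depend on input size
print(a)
print(c)                # differs from a in most positions despite the tiny change
```

Comparing the last two printed lines shows the "small change, completely different hash" behavior directly.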

Each Git object's hash is calculated from its content (more precisely, from a short header giving the object's type and size, followed by the content). This guarantees data integrity: the content can’t be modified without changing its hash. It also enables deduplication, where identical content is stored only once.

Historically, Git has used SHA-1 hashes for its internal objects, such as blobs, trees, commits, and tags. SHA-1 is a cryptographic hash function that produces a 160-bit (20-byte) hash, so each object is identified by a 40-character hexadecimal string.
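
As a rough sketch of how an object ID is derived in a SHA-1 repository: Git hashes a header of the form "blob <size>" followed by a NUL byte, then the file content. The function name git_blob_id is invented for this example; the result should match what git hash-object prints for the same bytes.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git would assign to a blob with this content."""
    # Header: object type, a space, the content length in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(git_blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(len(git_blob_id(b"anything")))  # 40 hexadecimal characters = 160 bits
```

The first value is the same ID you get from running echo 'hello world' | git hash-object --stdin, which is a quick way to check the sketch against a real Git installation.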

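However, practical collision attacks against SHA-1 have been demonstrated, so newer versions of Git also support SHA-256, a more secure hashing algorithm. The algorithm is chosen per repository, for example with git init --object-format=sha256.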

Git isn't the only tool that uses hashes — there are many other use cases for verifying data integrity. A common method is to generate a digest (hash) of a file or message and later recheck the digest to confirm that the content hasn't been modified.

For example, when downloading a file or disk image, you’ll often see a checksum (like SHA-256) provided alongside it. After the download, you can compute the digest of the file on your machine and compare it to the provided value. If they match, the file hasn't been altered.
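
A minimal sketch of that check in Python, assuming the publisher gives the checksum as a SHA-256 hex string. The function names and the idea of passing a local path are assumptions for illustration only.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return file_sha256(path) == expected_hex.strip().lower()
```

Usage would look like matches_checksum("image.iso", "<published checksum>"), where both the filename and the checksum come from the download page.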

It’s also important to remember that a hash function is one-way: a digest lets you verify data you already have, but you cannot retrieve the original data from it.

What Are Git Objects?

Git stores all of its data as a set of objects in its internal database. These objects form the foundation of Git’s version control system.
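
To make the key-value idea concrete, here is a small sketch that writes and reads one object using the layout Git uses for "loose" objects in a SHA-1 repository: the header and content are zlib-compressed and saved under a path built from the hash. The function names and the temporary directory are assumptions for this example; real repositories also pack objects into packfiles, which this sketch does not cover.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store content in the loose-object layout and return its hex ID (the key)."""
    data = f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(data).hexdigest()
    # First two hex characters name the directory, the remaining 38 the file.
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return object_id


def read_loose_object(objects_dir: str, object_id: str) -> tuple[str, bytes]:
    """Look up an object by ID and return its type and content (the value)."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:
    key = write_loose_object(objects_dir, "blob", b"hello world\n")
    stored_type, stored_content = read_loose_object(objects_dir, key)
    print(key)
    print(stored_type, stored_content)
```

In a real repository, git cat-file -p <hash> performs the lookup and git cat-file -t <hash> reports the object type.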

The four main Git object types are:

Blob (Binary Large Object)

  • Represents the content of a file.
  • Stores only the raw data — not the filename or metadata.
  • Each version of a file’s content is stored as a separate blob.
  • Think of it as a snapshot of the file's content at a specific point in time.

Tree

  • Represents a directory.
  • Contains pointers to blobs (files) and other trees (subdirectories).
  • Stores:
      • Filenames
      • File modes (permissions)
      • References to blobs and other trees
  • Think of it as a snapshot of a folder’s structure and contents.
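
The sketch below shows roughly how a tree is serialized in a SHA-1 repository: each entry is the mode, a space, the name, a NUL byte, and the referenced object's hash as 20 raw bytes. The function names and the sample filename are invented for this example, and edge cases such as symlinks and submodules are not handled.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object body together with its 'type size' header."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def _sort_key(entry: tuple[str, str, str]) -> bytes:
    # Git orders entries by name bytes, comparing subtrees as if named "name/".
    mode, name, _ = entry
    return (name + "/" if mode == "40000" else name).encode("utf-8")


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex_object_id) entries into a tree object body."""
    body = b""
    for mode, name, object_id in sorted(entries, key=_sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(object_id)  # raw 20-byte SHA-1, not hex text
    return body


hello_blob = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # blob for b"hello world\n"
body = tree_body([("100644", "hello.txt", hello_blob)])
print(len(body))                      # 37 bytes: 17 for "100644 hello.txt\0" plus 20
print(git_object_id("tree", b""))     # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

The second printed value is the well-known ID of the empty tree. Note that the filename appears only here, in the tree entry, and never inside the blob.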

Commit

  • Represents a snapshot of the entire project at a given point in time.
  • Points to a single tree object (the project’s root directory).
  • Contains metadata:
      • Author and committer info, each with its own timestamp
      • Commit message
      • References to parent commits (the project’s history)
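
A commit is itself a small text object. The following sketch assembles one in the layout Git uses (tree line, parent lines, author, committer, a blank line, then the message) and hashes it the same way as any other object. The identity, timestamp, and message are made-up values, and the function names are hypothetical.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object body together with its 'type size' header."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parents: list[str], author: str,
                committer: str, message: str) -> bytes:
    """Build the text of a commit object from its parts."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # none for a root commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


# Hypothetical identity: name, email, Unix timestamp, timezone offset.
who = "Example Dev <dev@example.com> 1700000000 +0000"
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

root = commit_body(empty_tree, [], who, who, "Initial commit\n")
root_id = git_object_id("commit", root)
child = commit_body(empty_tree, [root_id], who, who, "Second commit\n")
child_id = git_object_id("commit", child)

print(root.decode("utf-8"))
print(root_id)
print(child_id)
```

Because the parent's ID is part of the child's content, changing anything in an earlier commit changes the ID of every commit after it.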

Tag

  • A human-readable label or bookmark that points to another Git object, usually a commit.
  • Can be lightweight (just a reference) or annotated (with metadata).
  • Annotated tags include:
      • Tagger name
      • Date
      • Optional message
  • Often used for marking releases (e.g. v1.0.0).

In practice, this object model is why Git tracks content, not directories: a tree exists only to list the blobs and subtrees inside it. If a directory contains no files or subdirectories, Git has nothing to store, and the directory itself doesn't exist in Git's object database. That's why developers often create an empty placeholder file like .gitkeep (a naming convention, not a Git feature) to force Git to track an otherwise empty directory.

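It's also worth noting that Git does not track full file permissions. For a regular file, a tree entry records only whether the file is executable (mode 100755) or not (mode 100644), i.e., the +x bit. Other details, such as the owner, group, or exact read/write bits, are not stored.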
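
As an illustration, the reduction from a full filesystem mode to what Git records for a regular file can be sketched like this. The function name is made up, and the rule shown (look only at the owner's execute bit) reflects my understanding of Git's behavior rather than a documented API.

```python
import stat


def git_mode_for_regular_file(st_mode: int) -> str:
    """Collapse a filesystem mode to the mode string Git stores for a regular file."""
    if not stat.S_ISREG(st_mode):
        raise ValueError("this sketch only handles regular files")
    # Only the owner's execute bit matters; all other permission bits are dropped.
    return "100755" if st_mode & stat.S_IXUSR else "100644"


print(git_mode_for_regular_file(0o100600))  # 100644 (private file)
print(git_mode_for_regular_file(0o100664))  # 100644 (group-writable file)
print(git_mode_for_regular_file(0o100700))  # 100755 (private script)
print(git_mode_for_regular_file(0o100755))  # 100755 (ordinary executable)
```

In real use, the input would come from os.stat(path).st_mode. The first two files differ on disk but look identical to Git, which is why a checkout cannot restore permissions like 600.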

Since a blob represents only the content of a file (without its name or path), you can’t retrieve the filename directly from a blob object. The filename and path information are stored in the tree object that references the blob.

This design means that two files with different names but identical content will share the same hash — and therefore the same blob. This is one of the ways Git achieves storage efficiency through deduplication.
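
A toy model of this behavior, using a dictionary as the object store. The names store, tree, and add_file are invented for the example, and a plain dict stands in for a real tree object.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 blob ID for this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


store: dict[str, bytes] = {}   # object ID -> content (stands in for the object database)
tree: dict[str, str] = {}      # path -> object ID (stands in for tree entries)


def add_file(path: str, content: bytes) -> None:
    """Record a file: store its content once, and map its path to the blob ID."""
    object_id = git_blob_id(content)
    store.setdefault(object_id, content)  # no-op if the content is already stored
    tree[path] = object_id


add_file("LICENSE", b"same bytes\n")
add_file("docs/LICENSE.txt", b"same bytes\n")
add_file("README.md", b"different bytes\n")

print(len(tree))    # 3 paths are tracked
print(len(store))   # 2 blobs are stored, because two paths share one blob
print(tree["LICENSE"] == tree["docs/LICENSE.txt"])  # True
```

The same effect applies across history: if a file is unchanged between two commits, both commits' trees point to the same blob.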
