The git database

Git, the distributed version control system, is a key-value database of immutable objects¹. Each object’s key is a hash generated from its own contents. That’s the core design of git. As demonstrated by its popularity, a lot can be built on top of this base idea.

The database

In any git repository, the data is stored under .git/objects. For example:

mkdir /tmp/git-storage
cd /tmp/git-storage
git init
echo "hi" > README.md
git add .
git commit -m 'Initial commit'

Inspecting the objects via tree .git/objects will render something like this:

.git/objects
├─ 3e
│ └─ 12f71bb120f97a09c13223ec532e8896a5df75
├─ 45
│ └─ b983be36b73c0788dc9cbcb76cbb80fc7bb057
├─ 46
│ └─ a4b264acba7c2e835339e505bf3e91ecf40b39
├─ info
└─ pack

6 directories, 3 files

This is telling us that the database has three objects.

If you follow the same steps, git guarantees that one of those objects will have the same ID: 45b983be36b73c0788dc9cbcb76cbb80fc7bb057.

The objects

There are four object types in git: commits, trees, blobs, and tags. Each time git reads/writes the object database, it’s interacting with one of these.

In the example above, git created three objects:

a commit, 3e12f71bb120f97a09c13223ec532e8896a5df75
a tree, 46a4b264acba7c2e835339e505bf3e91ecf40b39
a blob, 45b983be36b73c0788dc9cbcb76cbb80fc7bb057

The blob object is the raw content: hi (followed by a newline) in this case. The tree object references the blob by its ID and stores its filename (README.md), file mode (permissions), etc.; it’s like a representation of a folder. The commit object points to a tree and contains metadata for navigating the history (parent commits), plus other things (author name, committer name, commit date, etc.); it’s a snapshot of your folder at a given time.

The type and contents of the objects can be inspected:

# print the object type (blob)
git cat-file -t 45b983be36b73c0788dc9cbcb76cbb80fc7bb057

# print the contents
git cat-file blob 45b983be36b73c0788dc9cbcb76cbb80fc7bb057
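
The tree and the commit can be pretty-printed in the same way with the -p flag. This is a minimal sketch rather than part of the original walkthrough: it assumes a throwaway repository created with mktemp and a placeholder identity (Example, example@example.com), so the commit ID it produces will not match the one listed above.

```shell
# Throwaway repository with a placeholder identity (both are assumptions)
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Example'
git config user.email 'example@example.com'
echo "hi" > README.md
git add .
git commit -q -m 'Initial commit'

# Type of the object HEAD resolves to
head_type=$(git cat-file -t HEAD)
echo "$head_type"

# The commit: a tree line, then author, committer and the message
git cat-file -p HEAD

# The tree: one entry per file, with mode, type, blob ID and filename
git cat-file -p 'HEAD^{tree}'

# The blob, addressed by path instead of by ID
blob_content=$(git cat-file -p HEAD:README.md)
echo "$blob_content"
```

HEAD^{tree} and HEAD:README.md are revision expressions that resolve to the tree and blob IDs, which saves copying hashes around by hand.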

The ID for each of these objects is derived from its contents (more precisely, from a short header holding the object’s type and size, followed by the contents). So, if you reproduce the steps I’ve done above, the ID for the blob object will be the same because its contents are the same (hi). The tree ID will match only if the filename and file mode match as well. The commit ID will almost certainly be different because the contents used to generate the hash are different (author, committer, date).
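
The blob ID can be reproduced without git at all. The sketch below assumes the default SHA-1 object format and that sha1sum (GNU coreutils) is available; on macOS, shasum should print the same format. The variable names are only for illustration.

```shell
# Git hashes "<type> <size in bytes>", a NUL byte, and then the contents.
# "hi" plus its trailing newline is 3 bytes, so the hashed stream is "blob 3\0hi\n".
manual_id=$(printf 'blob 3\0hi\n' | sha1sum | cut -d' ' -f1)
echo "$manual_id"

# git hash-object performs the same computation (nothing is written without -w)
git_id=$(printf 'hi\n' | git hash-object --stdin)
echo "$git_id"
```

Both commands should print 45b983be36b73c0788dc9cbcb76cbb80fc7bb057, the blob ID from the example above.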

A final detail is that these objects are immutable: they won’t ever be updated. So what happens when you change the contents of a file?

echo "ola" > README.md
git commit -am 'Update'

More objects will be added to the database: a new blob, a new tree, a new commit. The new commit will have a link to the commit before it, which will be considered its parent — this creates a history that you can navigate back and forward with the commands git provides.
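
To watch the database grow and see the parent link, the two commits can be replayed in a scratch repository. This is a sketch under assumptions: the temporary directory and the placeholder identity (Example, example@example.com) are not from the original steps.

```shell
# Scratch repository with a placeholder identity (both are assumptions)
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Example'
git config user.email 'example@example.com'

echo "hi" > README.md
git add .
git commit -q -m 'Initial commit'
first_commit=$(git rev-parse HEAD)

echo "ola" > README.md
git commit -q -am 'Update'

# List every object with its type and size: 2 blobs, 2 trees, 2 commits
git cat-file --batch-all-objects --batch-check
object_count=$(git cat-file --batch-all-objects --batch-check | wc -l | tr -d ' ')
echo "$object_count"

# The new commit carries a "parent" line pointing at the first commit
git cat-file -p HEAD
parent_commit=$(git rev-parse 'HEAD^')
echo "$parent_commit"
```

The old blob, tree, and commit are still there, untouched; the update only added three new objects next to them.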

The design

The way the git database works comes with a few advantages that are relevant to some of the operations git has to carry out:

Comparing objects: git doesn’t have to read the content of objects; it just compares their IDs, which makes comparisons faster.
Detecting errors: git regenerates the ID of an object and compares it with the ID in the database — if it’s different, there’s an error.
Generating IDs in a distributed system: git doesn’t rely on external information that can change from computer to computer, but on the contents of the objects. By doing so, it can scale to be a distributed system because the same IDs can be regenerated by any computer given the same contents.
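
The first two points can be tried out in a scratch repository. The following is a sketch under assumptions (temporary directory, placeholder identity, and the hypothetical file names a.txt and b.txt), not a description of how git implements these operations internally.

```shell
# Scratch repository with a placeholder identity (both are assumptions)
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name 'Example'
git config user.email 'example@example.com'

# Two files with identical contents
printf 'hi\n' > a.txt
printf 'hi\n' > b.txt
git add .
git commit -q -m 'Two identical files'

# Comparing objects: equal IDs mean equal contents, no need to read the files
id_a=$(git rev-parse HEAD:a.txt)
id_b=$(git rev-parse HEAD:b.txt)
[ "$id_a" = "$id_b" ] && echo "a.txt and b.txt have the same contents"

# Detecting errors: fsck re-hashes the stored objects and checks their links
git fsck --full
fsck_status=$?
echo "fsck exit status: $fsck_status"
```

Because both files hash to the same ID, the database stores their contents only once, as a single blob.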

Using immutable objects with links to parents is a well-known pattern in the industry: history systems are built on top of it, Redux uses the same trick, etc. On top of this simple idea, Linus Torvalds built the mainstream version control system of its generation. It’s a great example of how to leverage concepts to build great systems.

¹ Also known as content-addressable storage.

oandre.gal
