Thursday 11 June 2009

Version control with Git - an overview of the theory.

Git is the version control system that we use at my new company. It is a collection of tools written in C, plus script wrappers provided for ease of use. It's a UNIX-style command line tool that is constantly being developed and improved.

When you visit the Git Project site, you'll learn that git is free, open source software, and that the developers' main focus was to provide a tool with fast, cheap branching and merging. If those don't sound like good enough reasons to use it, consider that the biggest company currently using git is Google, and Google makes sure that git developers keep working on the source code to make it better and better.

Although the basic ideas behind git aren't very hard to understand, the way you use it, especially after you've been using other version control systems, might seem confusing and hard to grasp. That's why it's a good idea to start working with git with an overview of what happens behind the scenes.

A git repository - an archive of how the working directory looked in the past - is similar in structure to a UNIX filesystem. It's a tree that has its root in, well, the root directory, and contains other directories and/or files that contain data. Each node of the tree is represented in the git data model as either a tree (a sub-tree of the root tree) or a blob. Blobs represent the leaf nodes of the tree, i.e. files. They are named by computing the SHA1 hash of the file's size and contents, so each blob's name depends on the data stored in the file. And yes, that means that if two different files have the same content, they'll end up with just one blob representing both of them. That's the difference between git and a file system, and one advantage of this approach is the very compact size of git repositories.
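
To make the blob naming a bit more concrete, here is a minimal Python sketch of how such an id can be computed. The function name git_blob_id and the sample file contents are made up for illustration; the sketch assumes the usual layout of a short "blob <size>" header, a zero byte, and then the raw file contents.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA1 id for a blob holding the given file contents."""
    # The hashed data is a small header ("blob <size>" plus a zero byte)
    # followed by the raw bytes of the file.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


readme = b"hello world\n"
copy_of_readme = b"hello world\n"  # a second file with identical content

print(git_blob_id(readme))
# Identical content gives an identical id, so one blob serves both files.
print(git_blob_id(readme) == git_blob_id(copy_of_readme))  # True
```

Note that the file name never enters the hash, which is why two files with the same content can share a single blob.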

Usually, working with a git repository means repeating a few steps every development cycle (for example, a working day). First, you (or somebody else on your team) create a repository. If you are the only one working on the project, it will be a local repository on your machine, but usually it will be a local copy of a remote repository that many team members can reach. Once the repository is created, or copied from the remote location to your local machine, you work on code arranged in a working tree. After some time your work will reach a significant point - your code will start compiling, or maybe you will manage to fix a bug - and you will decide you want to "save" your progress. You add your changes to the index, which in the git world is something like a staging area: it's the place where all changes go before they are committed to the repository. You might add many changes to the index over a period of work, but in the end you'll want to commit them to the repository. Your changes will still be local only; to make them available to everybody, you have to push them to the remote repository.
Of course, you'll need to update your code once in a while to make sure you're working on the right code base: you fetch changes from the remote repository, have a look at them, and if they seem all right you merge them into your repository. If there are any conflicts, you deal with them at this point.
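
The add, commit and push steps can be pictured with a toy model in Python. This is only a sketch: the dictionaries, lists and function names below are invented for illustration, and real git keeps its objects on disk and does far more bookkeeping. It just shows that each step moves changes one stage further.

```python
# Toy model only: real git stores objects on disk, not in Python dicts.
working_tree = {"main.py": "print('v1')"}  # the files you are editing
index = {}         # staged snapshot, i.e. what the next commit will contain
local_repo = []    # commits in your local repository
remote_repo = []   # commits visible to the whole team


def add(path):
    """Stage the current content of one file (the role of `git add`)."""
    index[path] = working_tree[path]


def commit(message):
    """Record the staged snapshot as a local commit (the role of `git commit`)."""
    local_repo.append({"message": message, "files": dict(index)})


def push():
    """Publish commits the remote does not have yet (the role of `git push`).

    Assumes nobody else has pushed in the meantime.
    """
    remote_repo.extend(local_repo[len(remote_repo):])


add("main.py")
commit("first version")
print(len(local_repo), len(remote_repo))  # 1 0 -> committed, but still local only
push()
print(len(local_repo), len(remote_repo))  # 1 1 -> now reachable for everybody
```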

So far so good, and nothing seems scary or very unfamiliar.

What git is good at (and where it differs) is its branching mechanism. In other version control systems, creating a branch is usually quite a big deal, and you don't do it very often. Branches there are separated from the "main line" of development, often placed under a different directory, and therefore hard to merge and maintain. Git branching is very easy because branches aren't separate entities; they are just trees and blobs stored in commits in the repository. What makes this possible is the notion of a commit's parent - an attribute that points to the commit on which a given commit was based. Parents have their own parents, and this relationship creates the branches of our project's tree. Each branch is just a name pointing at a concrete commit.
This means that a branch is no more expensive than any other commit: you don't create unnecessary copies of files or a new directory structure.
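
The parent relationship and the idea of a branch as a name can be sketched in a few lines of Python. The Commit class, the history helper and the commit messages are hypothetical and exist only for this illustration; real commits also carry a tree, an author, a date and an id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    message: str
    parent: Optional["Commit"] = None  # the commit this one was based on


def history(commit):
    """Follow parent links and return messages from newest to oldest."""
    messages = []
    while commit is not None:
        messages.append(commit.message)
        commit = commit.parent
    return messages


first = Commit("initial import")
second = Commit("add parser", parent=first)

# A branch is only a name that points at a commit.
branches = {"master": second}
branches["experiment"] = branches["master"]  # new branch, no files copied

# Committing on a branch moves that one name forward.
branches["experiment"] = Commit("try new lexer", parent=branches["experiment"])

print(history(branches["master"]))      # ['add parser', 'initial import']
print(history(branches["experiment"]))  # ['try new lexer', 'add parser', 'initial import']
```

Both branches share the two older commits; the only thing that was added for the new branch is one more entry in the dictionary of names.
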
Obviously, that means that merging is also easier. When your project has two branches being developed in parallel, there are two ways of bringing their changes together and replacing the two branches with one that combines their code.
If you decide to merge, git will merge the commits from the two branches and produce a combined version of the code. That's all right, but it means that you'll end up with a commit whose only purpose is to merge both branches. Git offers another, almost sci-fi, way of doing that - rebase. Rebase is almost like traveling in time. Let's say you were developing code on your machine for a long time; you were very busy (or lazy) and you didn't do any updates. After a hard day of work you want to publish your commits, but you see that somebody else, who was also working very hard, has been committing changes to the remote branch used by all developers. You ended up on a different branch without realizing it. But you can use rebase to go back in time to the morning, when the code was unchanged. Then the changes from the remote branch, the one you want to use as the base for your new changes, are applied to this unchanged code. And then your changes are applied on top of the ones made by somebody else during the day. You end up with your work sitting on top of everybody else's, ready to push. And git does all that for you in one simple command - rebase. So, remember, rebase is your friend. A weird, magic-like friend that doesn't talk too much, but a friend.
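
The time travel can be sketched with the same kind of toy commits. Everything here (the Commit class, the rebase function, the messages) is made up for illustration. Real rebase re-applies the actual changes of each commit and may stop on conflicts; this sketch only re-creates the commits with new parents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    message: str
    parent: Optional["Commit"] = None


def history(commit):
    """Follow parent links and return messages from newest to oldest."""
    messages = []
    while commit is not None:
        messages.append(commit.message)
        commit = commit.parent
    return messages


def rebase(tip, old_base, new_base):
    """Replay the commits made after old_base on top of new_base."""
    to_replay = []
    while tip is not None and tip is not old_base:
        to_replay.append(tip)
        tip = tip.parent
    for commit in reversed(to_replay):  # oldest of your commits first
        new_base = Commit(commit.message, parent=new_base)
    return new_base  # the new tip of your branch


morning = Commit("state in the morning")
theirs = Commit("teammate bugfix", parent=morning)  # pushed by somebody else
mine = Commit("my feature", parent=Commit("my refactoring", parent=morning))

rebased = rebase(mine, old_base=morning, new_base=theirs)
print(history(rebased))
# ['my feature', 'my refactoring', 'teammate bugfix', 'state in the morning']
```

The resulting history is a straight line with no merge commit, which is the main visible difference from merging.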

Since git is a command line tool, there's a handful of commands worth remembering. I won't go into them here, since there are many pages where you can learn about them, like:
Everyday Git with 20 commands or so
Git in a Nutshell
gittutorial man page
