The first few weeks of my apprenticeship involved a crash course on source control with Git. I'll be sharing what I learned in a series of blogs, looking at Visual Studio's Git extension, and the Git client SmartGit. In this blog, I start off with a tour of the fundamentals of version control with Git.
Seasoned developers may find this a bit basic – if you want something more substantial, I suggest these posts by Howard, which describe how to use the GitFlow process within a continuous integration environment:
- A step-by-step guide to using GitFlow with TeamCity part 1 – different branching models
- A step-by-step guide to using GitFlow with TeamCity part 2 – a branching model for a release cycle
For everyone still here with me, I'll start with some background!
The two mainstream version control systems for teams working with Visual Studio are Team Foundation Version Control (TFVC), part of Team Foundation Server (TFS), and Subversion (SVN). Recently, a changing attitude to open source tools, resulting in features such as the Visual Studio Tools for Git extension, has been making it easier for teams to choose alternative version control systems.
It's now possible for .NET developers to use Git for source control, either within TFS, or on its own. I'll be describing the second option.
Distributed Source Control
Git is a 'third generation' distributed source control system. The notion of distributed source control takes a bit of getting used to.
Most people are familiar with the idea of having one central place which contains the current 'definitive' version of a set of files, which teams take local copies from, and then update.
Whether you're a developer using TFVC, or a non-technical team using SharePoint to store versioned documents, this system makes intuitive sense.
TFVC uses a single central repository (hosted on a physical server or in the cloud as Visual Studio Online).
A team of developers working on a project can each have a copy of the files in the repository on their personal machine, but in order to create branches, merge branches, roll back to a previous version, or share changes they've made, they need to communicate with the central repository.
With a distributed system such as Git, the fundamental change is that each person working on a project can have their own complete repository, with a full version history for the source code.
The key advantage of this approach is flexibility. If you are using distributed source control, you can still choose to nominate one of the repositories as the central repository. However, team members working on a personal repository have access to the full power of the source control system, without having to connect to the central repository.
A little illustration below:
A repository is created. Let's call it A.
A clone, B, is made. It is a complete, perfect copy of A.
Another clone, C, is created… It too is a complete copy of A, including its entire version history and all its branches.
By agreement between the team using the repositories, repository A is acting as the central repository, AKA the "origin". In the case of Endjin, our central repository lives on GitHub, a web hosting service for Git repositories. However, there's nothing essentially different or special about A. Any of its clones could be nominated as origin in future.
Because A is acting as the central repository, the developers using B and C will tend to frequently copy (pull) any updated versions of the source code from A, and update A (push) with changes that have been made locally.
However, this isn't the only way for teams to collaborate with Git. Any of the repositories can share information with any of the others. For example, the developers using B and C may want to work together on a feature branch, before updating the central repository.
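To make the A, B and C picture a bit more concrete, here is a minimal sketch in Python. It is a toy model, not how Git is implemented: a "repository" is just a list of commit labels, and the `ToyRepo` class and its methods are names I've made up for illustration. What it tries to show is that a clone carries the full history, and that any repository can exchange commits with any other.

```python
# Toy model only: a "repository" here is just an ordered list of commit labels.
# Real Git stores a graph of commit objects; ToyRepo and its methods are
# invented names for illustration.

class ToyRepo:
    def __init__(self, name, history=None):
        self.name = name
        self.history = list(history or [])

    def commit(self, label):
        # Committing is purely local: no other repository is contacted.
        self.history.append(label)

    def clone(self, name):
        # A clone receives the complete history, not just the latest files.
        return ToyRepo(name, self.history)

    def pull(self, other):
        # Copy across any commits the other repository has that we lack.
        for label in other.history:
            if label not in self.history:
                self.history.append(label)

    def push(self, other):
        # Pushing is the same exchange in the opposite direction.
        other.pull(self)


a = ToyRepo("A")
a.commit("initial commit")

b = a.clone("B")
c = a.clone("C")

# B and C collaborate on a feature without involving A at all.
b.commit("feature: part 1")
c.pull(b)
c.commit("feature: part 2")
b.pull(c)

# Only once the feature is ready does anyone update the agreed "origin".
b.push(a)

print(a.history)
```

Nothing in the class marks A as special. It is the "origin" only because the team has agreed to push finished work there.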
Git was invented for open source development, where teams are physically distributed and fluid, and new versions of a project are likely to go off and start a life of their own.
However, its flexibility and support for branching are making it increasingly popular for all types of development.
Versioning with Git
We've looked at how Git differs from a centralised repository in terms of the structure of each repository, and how repositories interact. Git also takes a different approach to storing the history of changes to the code in the repository.
The history of the repository takes the form of a series of 'commits'. A 'commit' can be thought of as a snapshot of the codebase at a particular moment in time.
Each Git commit is referenced by a SHA-1 hash. It is best practice to include a concise message with the commit, describing what's changed.
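If you're curious where those hashes come from, the sketch below shows the idea for the simplest kind of Git object, a file's contents (a 'blob'). Git hashes a short header followed by the raw bytes. Commits are hashed along the same lines, with a "commit" header in front of the commit's own details, so I've stuck to the blob case here. The function name `git_blob_id` is my own.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The id of an empty file is the same in every Git repository.
empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the id is derived from the content, two repositories holding identical content arrive at the same id without ever talking to each other.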
Because each team member has their own Git repository, they can commit locally as often as they like, without fear of affecting the central repository (donut time). Committing is different to just clicking 'Save' – it gives you a set of previous states that you can examine or return to.
I learnt the importance of doing this in week 2, when I started experiencing some mysterious behaviour with my project. I'd been relying on just saving files locally in Visual Studio that day, as I didn't feel I'd reached any big milestones worthy of pushing to our central repository.
If I'd been making a local commit every time I made a significant change, it would have made identifying the problem a lot easier.
Describing a commit as a snapshot of the entire codebase at a point in time is actually an oversimplification.
In fact, Git lets you control exactly which changes are recorded in a commit, using the concept of a 'staging area', into which changes such as new files, file deletions or updates are placed prior to a commit.
It also lets you select individual changes from this area for each commit.
For example, you might have been working on a project, fixing a bug. In the process, you come across a neat solution to another bug, so you add that in too.
As a good Git user, you remember that you should commit these changes, but it makes more sense to do this in two separate commits, one for bug fix 1, and another for bug fix 2.
You can add all the changes to the staging area, and then pick out the ones that should go into each commit. This lets you build a clear, well-structured commit history.
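Here is a rough sketch of the two-bug-fix example as a toy model in Python. The changes are plain strings, and `working_changes`, `staged`, `history`, `stage` and `commit` are all names I've invented. For simplicity it stages one fix's changes at a time rather than staging everything and then picking, which ends up with the same two commits.

```python
# Toy model only: changes are plain strings, and every name used here
# is invented for illustration.

working_changes = [
    "fix null check in login",       # belongs to bug fix 1
    "add test for login",            # belongs to bug fix 1
    "correct rounding in invoice",   # belongs to bug fix 2
]
staged = []
history = []

def stage(change):
    # Move one change from the working set into the staging area.
    working_changes.remove(change)
    staged.append(change)

def commit(message):
    # A commit records whatever is staged at that moment, then empties the area.
    history.append((message, list(staged)))
    staged.clear()

stage("fix null check in login")
stage("add test for login")
commit("Bug fix 1: handle missing user on login")

stage("correct rounding in invoice")
commit("Bug fix 2: round invoice totals correctly")

for message, changes in history:
    print(message, "->", changes)
```

The point is that what goes into a commit is decided by what is staged, not by what happens to have changed on disk.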
In this section, I've been using 'versioning' interchangeably with 'committing', but in fact, Git lets you separate the process of producing named versions of your application from the process of saving a snapshot.
Not every commit will represent a new version of the application. Commits can be assigned a tag stating a version, if they represent a change that you want to publicly flag as significant.
Branching and merging with Git
Branching is an important part of the role of a source control system. In order to keep a working, clean version of the codebase, while also letting a team of developers add new features, refactor code and fix bugs in an organised way, the repository is divided into separate workstreams known as branches.
You can then have branches of branches, branches of branches of branches… and so on. At certain points, the changes in a branch will need to be combined – 'merged' – with another branch.
Merging is the most challenging aspect of working with branches. If merging in TFVC is like arranging a wedding – a worthy ceremony, but time consuming, expensive, and with the potential for family squabbles – then merging in Git is a more casual affair.
In most central source control systems such as TFVC, a branch is a full copy of all (or some) of the original source code, and exists as a separate set of files in the underlying file system. (In the case of TFVC, data is actually stored in SQL Server.)
Merging branches involves the rather painful process of comparing two different copies of all the files contained in the branch.
In Git, rather than existing alongside a main branch in a file system, branches are just labels on commits.
As a result, Git permits merges between branches which are not directly related – something which isn't possible in TFVC. It also allows more than two branches to be merged at once (an 'octopus' merge), and 'cherry picking', where individual commits from one branch are applied to another branch.
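The "branches are just labels" idea can be sketched with a couple of dictionaries. This is a toy model with invented commit ids and helper names (`create_branch`, `commit_on`, `log`), not Git's real storage format, but it shows why creating a branch is so cheap: nothing is copied, a new name simply points at an existing commit.

```python
# Toy model only: commits are dict entries pointing at their parent, and a
# branch is nothing more than a name mapped to a commit id. All ids and names
# are invented for illustration.

commits = {
    "c1": {"parent": None, "message": "initial commit"},
    "c2": {"parent": "c1", "message": "add login page"},
}
branches = {"master": "c2"}

def create_branch(name, start_from):
    # No files are copied: the new label simply points at an existing commit.
    branches[name] = branches[start_from]

def commit_on(branch, commit_id, message):
    # A new commit records the old tip as its parent, then the label moves on.
    commits[commit_id] = {"parent": branches[branch], "message": message}
    branches[branch] = commit_id

def log(branch):
    # Walk the parent links back from the tip of the branch.
    messages, current = [], branches[branch]
    while current is not None:
        messages.append(commits[current]["message"])
        current = commits[current]["parent"]
    return messages

create_branch("feature", "master")
commit_on("feature", "c3", "add password reset")

print(branches)
print(log("feature"))
print(log("master"))
```

After the new commit, only the `feature` label has moved. `master` still points at `c2`, and both branches share the earlier history.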
Version history as an artefact
A commit in Git isn't like a log file entry, which should be fixed for all time as a definitive record of activity on a system. With Git, a commit can be amended, or combined with another commit, at any point. (Strictly speaking, Git replaces the old commit with a new one rather than editing it in place.)
Like the idea of non-centralised source control, this takes some getting used to. My first reaction was that this was a pollution of the very notion of source control!
However, if you think of the series of versions of a source repository as an artefact in itself, it makes more sense. As one experienced developer explained, when you are working on a feature, the commit history can be messy – with hindsight, you would have made commits in a different order, or divided up the changes in a different way.
Despite Git's many tools for viewing branch and version history, if the commits are disorganised, they will be difficult for other people to make sense of. Git lets you use the power of hindsight to re-organise the history of work on an application or feature, before sharing your local commits with other people, making projects more maintainable.
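As a rough picture of what tidying up history means, here is a toy sketch in Python. Each commit is a (message, files touched) pair, and `squash` is a helper I've made up, not a Git command. It combines a run of messy local commits into one new, well-described commit.

```python
# Toy model only: each commit is a (message, files_touched) pair, and
# squash() is a made-up helper, not a Git command.

local_history = [
    ("wip", ["login.py"]),
    ("fix typo", ["login.py"]),
    ("wip again", ["login.py", "reset.py"]),
]

def squash(commits, message):
    # Build one new commit covering the same files; the inputs are not modified.
    files = []
    for _, touched in commits:
        for path in touched:
            if path not in files:
                files.append(path)
    return [(message, files)]

tidy_history = squash(local_history, "Add login and password reset")
print(tidy_history)
```

The original list is left untouched and a new history is built alongside it, which loosely mirrors how Git creates new commits rather than editing old ones.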
There's a good article on ReviewBoard which describes in detail how to maintain a clean commit history using Git.
Git vs TFVC
A few of the key differences between Git and the Team Foundation Version Control system available as part of TFS have cropped up as we looked at repository distribution, branching, and version history. I've put a comparison below, looking at how each system affects the way developers can work with a codebase.
Git: Developers can carry out a wide range of operations on their personal repository, including viewing history.
TFVC: Developers can make changes to their local copy of a codebase, but cannot share changes, make branches, or view history without connecting to the central server.
Git: Work can be shared between any repositories used by a team, without the involvement of a central repository.
TFVC: Changes made to each developer's local working copy can only be shared through the central repository.
Git: Each repository acts as a backup.
TFVC: Disaster recovery measures are needed to protect the codebase in the event of something nasty happening to the central server.
Git: Branching is part of day-to-day work. Branches are just labels on individual commits. It is easy to create and merge branches, even when they are not directly related.
TFVC: Merging branches can be complex, and therefore tends to only happen occasionally. A merge requires a comparison of two alternative versions of a set of files in the repository.
I'm sure I've missed out some big differences here – please comment and let me know if you spot any gaps or inaccuracies!
Next in this series, I'll look at the basic operations you can carry out on a Git repository, and compare the tools that .NET developers can use to work with Git, from the command line, to Visual Studio Tools for Git, to third party clients such as SmartGit.