By Elisenda Gascon, Apprentice Engineer II
Version Control in Databricks

Notebooks provide an interactive and collaborative environment for developing code. As such, in Databricks, notebooks are the main tool for creating workflows. With them, you can develop code using a variety of languages, schedule notebooks to automatically run pipelines, collaborate by sharing notebooks, use real-time co-authoring, and use Git integration for version control.

With Databricks Notebooks, you can apply software engineering best practices, such as using version control to track changes and collaborate on your code. In this post, we will see how to use version control and Git integration with Databricks Notebooks.

Version Control in Databricks Notebooks

By default, Databricks notebooks have version history built into them. When you're working on a notebook, you'll see a tab called “Revision history”. Here, you'll find every version of your notebook that has been (automatically) saved.

Showing a screenshot of the revision history panel of a Databricks notebook.

You can restore an earlier version of a notebook by selecting it and choosing “Restore this revision”:

Showing a screenshot of the revision history panel of a Databricks notebook. The "Restore this revision" link is highlighted.

Databricks has automated version control, which means that version history is always available in your Databricks notebooks without any configuration.

This also means that real-time co-authoring of notebooks is possible. Two people can collaborate on the same notebook at the same time and see each other's changes as they are made. In the following screenshot, my colleague has added a cell to my notebook. I can see that they're viewing the notebook, and I can see their cursor.

Screenshot of a Databricks notebook. At the top of the page, you can see the initials of my colleague who's editing the notebook. You can also see their cursor on the cell they're editing.

Note that every user editing the notebook needs permission to use the cluster attached to it in order to run any of the cells. Otherwise, an error message will appear when they try.

Git integration in Databricks Notebooks

We've seen that version control is set up by default in Databricks notebooks. However, this versioning lives in the Databricks environment being used. If this environment were to be deleted, all the work and version history would be lost.

Let's see how to use Git with Databricks notebooks to implement version control.

The recommended way to use Git integration with Databricks is to use Databricks Repos.

Databricks Repos provides Git integration within your Databricks environment, allowing developers to use Git functionality such as creating or cloning repositories, managing branches, reviewing changes, and committing them. This allows developers to apply software engineering best practices when using Databricks notebooks.

Let's see how to set up Databricks Repos step by step. In this example, we will be setting up version control using GitHub.

Step 1: Create a repo

Azure Databricks supports the following Git providers:

  • GitHub
  • Bitbucket Cloud
  • GitLab
  • Azure DevOps
  • AWS CodeCommit
  • GitHub AE

For this example, we have created a private repository in GitHub called databricks-version-control where we will store our code.

Step 2: Get a Git access token

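If your repository is public and you only need to clone it, you can skip this step. Pushing changes, or working with a private repository like ours, requires a token.

To connect Databricks Repos to your Git provider, you will need to create a personal access token (PAT).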

In GitHub, go to Settings > Developer settings > Personal access tokens and click on “Generate new token”.

Showing a screenshot of GitHub in Settings > Developer settings > Personal access tokens. We are hovering over "Generate new token".

Provide a description for your token and select the scope to define the access for the personal token:

Showing a screenshot of GitHub. We're setting the configuration to generate our PAT.

Once the token has been generated, make sure you copy it and save it somewhere safe: it is only displayed at this stage, and you will need it shortly.
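
Since the token is only shown once, it helps to keep it out of your code and check that it works before moving on. The following is a minimal sketch, not part of the original walkthrough: it assumes you have stored the token in an environment variable (the name `GITHUB_PAT` is a hypothetical choice) and builds a request to GitHub's REST endpoint for the authenticated user without sending it.

```python
import os
import urllib.request

# Hypothetical environment variable name; use whatever suits your setup.
TOKEN_ENV_VAR = "GITHUB_PAT"


def load_token(env=os.environ):
    """Return the PAT from the given environment mapping, or raise if it is missing."""
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set {TOKEN_ENV_VAR} before running this script")
    return token


def build_token_check_request(token):
    """Build (but do not send) a request for the user that owns the token."""
    return urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="GET",
    )
```

Passing the request to `urllib.request.urlopen` should succeed for a valid token and raise an `HTTPError` for one that GitHub rejects.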

Step 3: Activate Git integration within Databricks

In Databricks, click on your user email on the top right of your screen and select User Settings. Under the Git Integration tab, choose the Git provider you want to use (GitHub in our example), add your username or email, and enter the PAT that you generated earlier.

Showing a screenshot of Databricks. We're in the User Settings page, under the Git integration tab. We have set the Git provider to GitHub.
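
If you would rather script this step than use the settings page, the same details can be sent to the Databricks REST API. The sketch below is an assumption-laden illustration rather than the method used in this post: it assumes your workspace exposes the Git credentials endpoint at `/api/2.0/git-credentials` and that `gitHub` is the provider identifier it expects, so check the API reference for your workspace first. The workspace URL and both tokens are placeholders. Note that `databricks_token` is a Databricks access token, which is separate from the GitHub PAT.

```python
import json
import urllib.request


def build_git_credential_request(workspace_url, databricks_token, git_username, git_pat):
    """Build (but do not send) a POST that registers Git credentials for the current user."""
    payload = {
        "git_provider": "gitHub",  # assumed identifier for GitHub
        "git_username": git_username,
        "personal_access_token": git_pat,
    }
    return urllib.request.Request(
        workspace_url.rstrip("/") + "/api/2.0/git-credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request is only constructed here; sending it with `urllib.request.urlopen` is left to you once the placeholders have been replaced.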

Step 4: Add a repo

Now we're ready to link the repo that we created earlier to Databricks Repos.

Go into “Repos” and select “Add Repo”.

Showing a screenshot of Databricks. The Repos menu is expanded. The "Add Repo" button is highlighted.

Enter the URL to your Git repository and create the repo.

Showing a screenshot of the dialog box to add a repo. We have entered the URL to our Git repository and selected the Git provider as GitHub.

Once your repo has been created, you will see it appear on your menu.

Showing a screenshot of the repos menu again. This time, we can see our repo listed.
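
Adding a repo can also be scripted. As a rough sketch, and assuming the Databricks Repos REST API is available at `/api/2.0/repos` in your workspace, the request below carries the same information as the "Add Repo" dialog: the repository URL and the Git provider. The workspace URL, the token, the `my-org` owner, and the `/Repos/...` path are all placeholders for illustration.

```python
import json
import urllib.request


def build_create_repo_request(workspace_url, databricks_token, git_url, repo_path):
    """Build (but do not send) a POST that links a remote Git repository to Databricks."""
    payload = {
        "url": git_url,
        "provider": "gitHub",  # assumed identifier for GitHub
        "path": repo_path,     # where the repo should appear in the workspace
    }
    return urllib.request.Request(
        workspace_url.rstrip("/") + "/api/2.0/repos",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example with placeholder values; nothing is sent over the network here.
request = build_create_repo_request(
    "https://example.cloud.databricks.com",
    "dapi-placeholder",
    "https://github.com/my-org/databricks-version-control",
    "/Repos/someone@example.com/databricks-version-control",
)
```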

Now that your repo is available in Databricks, you can use it like you would in any other IDE. You can create a branch, update it and commit your changes for review.

Step 5: Create a branch

To create a branch, select the “master” branch icon shown next to the repository name. A dialog will open, where you can see whether any changes have been made and create a branch.

Showing a screenshot of the dialog with an option to create a new branch.

Select Create Branch, enter a name, and create the branch.

Showing a screenshot of the dialog to create a branch. We are naming the branch "feature/version-control-demo".

Step 6: Create a notebook

Once in our branch, create a notebook.

Showing a screenshot of Databricks. Under the repos menu, hovering over the repo shows the option to create a new notebook.

Let's add some markdown to our notebook.

Showing a screenshot of the new notebook in Databricks. The notebook has only one cell with a title in markdown that reads "This is a notebook".

Step 7: Review and commit your changes

Now, let's commit these changes. By selecting our branch again, we see the changes we made. This is very similar to the view you get in GitHub or any other Git provider.

Showing a screenshot of the dialog that appears when we select our branch. We can see the changes made to our notebook.

From here, review your changes, add a commit message and description, and commit your changes.
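
For readers more used to the command line, Steps 5 to 7 correspond roughly to the Git commands below, as you would run them in a local clone. This is a sketch under assumptions: the file name and commit message are hypothetical, and the final push is included because the commit shows up in GitHub in the next step, which suggests the Repos dialog pushes as part of committing.

```python
import subprocess


def branch_and_commit_commands(branch, paths, message):
    """Return the git commands that create a branch, stage files, commit, and push."""
    return [
        ["git", "checkout", "-b", branch],
        ["git", "add", "--", *paths],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", "origin", branch],
    ]


def run_commands(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


# Hypothetical file name and message; the branch name is the one used in this post.
commands = branch_and_commit_commands(
    "feature/version-control-demo",
    ["my_notebook.py"],
    "Add demo notebook",
)
```

Calling `run_commands(commands, "path/to/local/clone")` would execute them in order; building the list separately makes it easy to review the commands before anything runs.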

Step 8: Create a PR

Back in GitHub, I can see the changes, then create and review a PR as usual.

Showing a screenshot of GitHub. We can see the changes from our last commit.
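
The PR itself can be opened through GitHub's REST API as well as through the web page. The sketch below builds, without sending, a request to the "create a pull request" endpoint. The owner `my-org`, the token, and the PR title are placeholders; the branch names are the ones used earlier in this post.

```python
import json
import urllib.request


def build_pull_request(owner, repo, token, head, base, title):
    """Build (but do not send) a POST that opens a PR from head into base."""
    payload = {"title": title, "head": head, "base": base}
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


# Placeholder owner, token, and title; nothing is sent over the network here.
pr_request = build_pull_request(
    "my-org",
    "databricks-version-control",
    "ghp-placeholder",
    head="feature/version-control-demo",
    base="master",
    title="Add demo notebook",
)
```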

Source control is now set up.

Conclusion

In this post, we have seen how to use version control in Databricks notebooks and how to implement source control using Git integration with Databricks Repos. Version control allows developers to easily collaborate on their work by sharing and reviewing changes. With Databricks Repos, you can use Git functionality to clone, push to, and pull from a remote Git repository, manage branches, and compare differences before committing your work. After that, anyone with access to the repo can see the changes and perform tasks such as creating pull requests, merging or deleting branches, and resolving merge conflicts.

Elisenda Gascon

Elisenda was an Apprentice Engineer from 2021 to 2023 at endjin after graduating with a mathematics degree from UCL. Her passion for problem solving drove her to pursue a career in software engineering.

During her time at endjin, she helped clients by delivering data analytics solutions, using tools such as Azure Synapse, Databricks notebooks, and Power BI. Through this, she also gained experience as a consultant by delivering presentations, running customer workshops, and managing stakeholders.

Through her contributions to internal projects, she gained experience in web development in ASP.NET, contributed to the documentation of the Z3.Linq library, and formed part of a team to develop a report to explore global trends in wealth & health.

During her training as a software engineer, Elisenda wrote a number of blog posts on a wide range of topics, such as DAX, debugging NuGet packages, and dependency injection. She has also become a Microsoft certified Power BI analyst and obtained the Green Software for Practitioners certification from the Linux Foundation.