By Ed Freeman, Software Engineer I
Import and export notebooks in Databricks

Sometimes you need to import notebooks into, or export them from, a Databricks workspace. This might be because you have a bunch of generic notebooks that are useful across numerous workspaces, or because you have to delete your current workspace for some reason and need to transfer its content to a new one.

Some of this is relatively easy to do manually. You can export workspace directories from the drop-down menu in the workspace view in the UI. This works at any level - at the root or in child directories (provided you have access to the directory in question).

You can export files and directories as .dbc files (Databricks archive). If you swap the .dbc extension to .zip, within the archive you'll see the directory structure you see within the Databricks UI. Exporting the root of a Databricks workspace downloads a file called Databricks.dbc.
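If you'd rather not rename the file, here is a minimal sketch that lists the contents of an exported archive directly. It assumes, as the rename trick suggests, that a .dbc file is an ordinary zip archive; the function name list_dbc_entries is just something I've made up for illustration.

```python
import zipfile


def list_dbc_entries(archive_path):
    """Return the paths stored inside a .dbc archive.

    Assumption: the archive is a standard zip file. zipfile looks at the
    file's contents, not its extension, so no rename is needed.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()
```

Calling list_dbc_entries("Databricks.dbc") on an export of the workspace root should give you the same directory structure you see in the UI, one path per entry.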

You can also import .dbc files in the UI, in the same manner. This is fine for importing the odd file (which doesn't already exist). However, through the UI there is no way to overwrite files/directories; if you try to import a file/directory that already exists, a copy of that artifact will be created.

An alternative solution is to use the Databricks CLI. The databricks workspace utility offers two subcommands, export_dir and import_dir. These recursively export/import a directory and its files from/to a Databricks workspace, and, importantly, include an option to overwrite artifacts that already exist. Individual files are exported in their source format.

How it works

First of all, if you don't have the Databricks CLI installed locally, run pip install databricks-cli.
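If you're scripting any of this, it can be worth checking that the install actually put the databricks executable on your PATH before going further. A small sketch using only the Python standard library (find_databricks_cli is a hypothetical helper name, not part of the CLI):

```python
import shutil


def find_databricks_cli():
    """Return the full path of the `databricks` executable, or None if it is not on PATH."""
    return shutil.which("databricks")
```

If this returns None, the install probably went into a Python environment that isn't active in your current session.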

Next, we need to authenticate to the Databricks CLI. The easiest way to do this is to set the session's environment variables DATABRICKS_HOST and DATABRICKS_TOKEN. Otherwise, you will need to run databricks configure --token and enter your host and token when prompted. The value for the host is the Databricks URL of the region in which your workspace lives (for me, that's https://uksouth.azuredatabricks.net). If you don't know where to get an access token, see the Databricks documentation on personal access tokens.
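If you're driving the CLI from a script rather than an interactive session, one way to handle the environment-variable route is to build the environment in code and hand it to the process that runs the CLI. This is only a sketch: databricks_env is a name I've invented, and the token in the usage note below is a placeholder.

```python
import os


def databricks_env(host, token, base_env=None):
    """Return a copy of the environment with the two CLI variables set.

    The current process environment is left untouched; the returned dict
    is meant to be passed as env=... to subprocess.run.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DATABRICKS_HOST"] = host
    env["DATABRICKS_TOKEN"] = token
    return env
```

For example, databricks_env("https://uksouth.azuredatabricks.net", "<your-token>") gives you an environment you can pass to subprocess.run when invoking any databricks command, without leaving the token set for the rest of your session.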

Now authentication is out of the way, we can address the subject of this blog.

Export

The general template is:

databricks workspace export_dir "<databricks-source-path>" "<local-path-to-export-to>"

To export the workspace root to the temp folder on your C drive, this would be:

databricks workspace export_dir "/" "C:/Temp/"

If you try to export any files that already exist in your local directory, the CLI will skip those files. You can tell the command to overwrite the local files by passing -o to the command.

databricks workspace export_dir "/" "C:/Temp/" -o
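The export commands above can also be wrapped in a script, which is handy if you want to run the same export regularly. A minimal sketch, assuming the databricks executable is on your PATH and authentication is already set up; export_dir_command and run_export are hypothetical helper names.

```python
import subprocess


def export_dir_command(source_path, local_path, overwrite=False):
    """Build the argument list for `databricks workspace export_dir`."""
    command = ["databricks", "workspace", "export_dir", source_path, local_path]
    if overwrite:
        # -o tells the CLI to overwrite local files instead of skipping them.
        command.append("-o")
    return command


def run_export(source_path, local_path, overwrite=False):
    """Run the export and raise CalledProcessError if the CLI exits with an error."""
    subprocess.run(export_dir_command(source_path, local_path, overwrite), check=True)
```

run_export("/", "C:/Temp/", overwrite=True) is then equivalent to the last command shown above. Passing the arguments as a list means you don't have to worry about quoting paths that contain spaces.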

Import

The general template is:

databricks workspace import_dir "<local-path-where-exports-live>" "<databricks-target-path>"

For example, if my directories live in C:/Temp/DatabricksExport/ on my machine, and I want to import them into the root of a Databricks workspace, this is the command:

databricks workspace import_dir "C:/Temp/DatabricksExport" "/"

However, if you're importing any files that already exist, you'll get an error. Get around this error by, again, adding -o to the command.

databricks workspace import_dir "C:/Temp/DatabricksExport" "/" -o
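Putting export and import together gives you the workspace-to-workspace transfer mentioned at the start of this post. The sketch below plans the two commands, each with its own DATABRICKS_HOST and DATABRICKS_TOKEN, and then runs them in order. The function names and the (host, token) pair convention are my own assumptions for illustration, not anything the CLI prescribes.

```python
import os
import subprocess


def migration_steps(old_workspace, new_workspace, local_dir):
    """Plan an export from one workspace followed by an import into another.

    Each workspace is a (host, token) pair. Returns a list of
    (command, credentials) tuples in the order they should run.
    """
    old_host, old_token = old_workspace
    new_host, new_token = new_workspace
    return [
        # Step 1: pull everything under the root of the old workspace.
        (["databricks", "workspace", "export_dir", "/", local_dir, "-o"],
         {"DATABRICKS_HOST": old_host, "DATABRICKS_TOKEN": old_token}),
        # Step 2: push the exported files into the root of the new workspace.
        (["databricks", "workspace", "import_dir", local_dir, "/", "-o"],
         {"DATABRICKS_HOST": new_host, "DATABRICKS_TOKEN": new_token}),
    ]


def run_migration(steps):
    """Run each planned command with its own credentials; stop on the first failure."""
    for command, credentials in steps:
        subprocess.run(command, env={**os.environ, **credentials}, check=True)
```

Both steps pass -o, so local files and existing workspace artifacts are overwritten. Drop it from either command if you would rather the CLI skip or reject anything that already exists.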

In an ideal world

A Databricks notebook can be synced to an ADO/GitHub/Bitbucket repo. However, I don't believe there's currently a way to clone a repo containing a directory of notebooks into a Databricks workspace. It'd be great if Databricks supported this natively. However, using the CLI commands I've shown above, there are certainly ways around this - but we'll leave that as content for another blog!


Ed Freeman

Ed is a Software Engineer helping to deliver projects for clients of all shapes and sizes, providing best of breed technology solutions to industry specific challenges. He focusses primarily on cloud technologies, data analytics and business intelligence, though his Mathematical background has also led to a distinct interest in Data Science, Artificial Intelligence, and other related fields.

He also curates a weekly newsletter, Power BI Weekly, where you can receive all the latest Power BI news, for free.

Ed won the Cloud Apprentice of the Year at the Computing Rising Star Awards 2019.