Data Exploration & Experimentation with Notebooks in Azure

Ian Griffiths 16th October 2020

SQLBits 2020

Microsoft is investing heavily in notebook technologies (such as Jupyter and .NET Interactive) to provide an interactive environment for experimental and exploratory data inspection and analysis.

These kinds of environment are becoming increasingly important for a growing range of activities including data cleaning and normalization, data import, statistical analysis, insight generation, and testing hypotheses. And in some applications, it can make sense to take notebooks that started out as an interactively developed set of ad hoc operations and transform them into part of an automated workflow.

Jupyter notebooks are most commonly authored in Python or R. Microsoft has been working to add support for .NET languages, enabling the use of C# and F# in notebooks. It is also adding .NET support for Spark, which allows Spark clusters to be controlled from .NET, with a view to running custom .NET code inside the cluster as part of the core processing.

Azure's growing support for notebooks enables this approach across a range of scales. You can work with datasets that fit easily in a single machine's memory, but if you need more firepower, Azure's Databricks support lets you spin up a server farm to process your data in parallel, enabling you to perform complex computations across massive datasets.