Data validation in Python: a look into Pandera and Great Expectations
Data validation is a vital step in any data-oriented workstream. This post investigates and compares two popular Python data validation packages: Pandera and Great Expectations.
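As a flavour of the kind of checks such packages cover, here's a minimal Pandera sketch; the column names and rules are illustrative, not taken from the post.

```python
import pandas as pd
import pandera as pa

# Illustrative schema: the columns and checks here are hypothetical
schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, pa.Check.ge(0)),
    "quantity": pa.Column(int, pa.Check.in_range(1, 100)),
    "status": pa.Column(str, pa.Check.isin(["open", "shipped", "cancelled"])),
})

df = pd.DataFrame({
    "order_id": [1, 2],
    "quantity": [3, 150],           # 150 violates the in_range(1, 100) check
    "status": ["open", "shipped"],
})

# Raises a SchemaError describing the failing check and offending rows
schema.validate(df)
```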
How to set up Python, PyEnv & Poetry on Windows
Using multiple versions of Python on Windows can be a frustrating experience, especially if you've experienced how smooth the same workflow is on macOS and Linux/WSL. This post provides a script and guide to getting set up with Python virtual environments and Poetry, a popular dependency management tool.
How To Implement Continuous Deployment of Python Packages with GitHub Actions
This post demonstrates how to use GitHub Actions to automatically publish your Python package updates to PyPI, and explores whether this constitutes Continuous Deployment.
Customizing Lake Databases in Azure Synapse Analytics
Great, so I've configured my Lake Database in Azure Synapse Analytics. But since I'm using parquet-backed files, my column names aren't very user-friendly. I also have calculated columns incorporating business logic that I'd like to query on the fly rather than persist to the backing data. And I want to give specific end-users access to this database, with pre-defined reporting queries to get them up and running as quickly as possible. How can I do this? Enter Custom Objects in Lake Databases: you can now create VIEWs, Stored Procedures and USERs (amongst other objects) in what used to be a read-only database. This article explores the customization options and how they can help you organize your reporting data in Azure Synapse Analytics.
How to create a semantic model using Synapse Analytics Database Templates
In this second blog in the series, we put the newly released Azure Synapse Analytics Database Templates into action by exploring the different methods that are available to create a semantic model.
Continuous Integration with GitHub Actions
This post gives an overview of Continuous Integration and shows how you can implement it with GitHub Actions, with an accompanying example Python project.
How to apply behaviour driven development to data and analytics projects
In this blog we demonstrate how the Gherkin specification can be adapted to enable BDD to be applied to data engineering use cases.
What is the Shared Metadata Model in Azure Synapse Analytics, and why should I use it?
A lesser-known feature of Azure Synapse is the "Shared Metadata Model". Synapse can automatically synchronize tables created via Synapse Spark with objects you can query via the usual SQL Serverless endpoint, without any additional configuration. This article brings attention to that capability, highlighting the benefits and trade-offs versus rolling your own SQL Serverless VIEWs.
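To illustrate the mechanism, a Parquet-backed table created from a Synapse Spark notebook, along the lines of the sketch below, surfaces in the serverless SQL endpoint once the metadata sync completes. The database name, storage path and table are invented for illustration; `spark` is the session the Synapse notebook provides.

```python
# Sketch of the Spark side of the Shared Metadata Model (names are hypothetical)
spark.sql("CREATE DATABASE IF NOT EXISTS reporting")

df = spark.read.parquet("abfss://data@mylake.dfs.core.windows.net/sales/")

# The sync applies to Parquet-backed tables, so be explicit about the format
df.write.mode("overwrite").format("parquet").saveAsTable("reporting.sales")

# After synchronization, the same table is queryable from serverless SQL,
# e.g. SELECT TOP 10 * FROM reporting.dbo.sales
```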
Extract insights from tag lists using Python Pandas and Power BI
We often come across spreadsheets and CSV files containing columns that hold a list of tags or categories. This blog article walks through a simple way of extracting insights from this data using a combination of Pandas for data wrangling and Power BI for analytics and visualisation.
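For context, the core wrangling step usually amounts to splitting the tag string and exploding it into one row per tag, as in this minimal sketch (the data and column names are invented):

```python
import pandas as pd

# Hypothetical data: each row carries a semicolon-separated list of tags
df = pd.DataFrame({
    "title": ["Post A", "Post B"],
    "tags": ["python;pandas", "powerbi;dax;python"],
})

# Split the tag string into a list, then explode to one row per tag
tags = (
    df.assign(tag=df["tags"].str.split(";"))
      .explode("tag")
      .drop(columns="tags")
)

# The long-format table can be loaded into Power BI or aggregated directly
print(tags.groupby("tag").size().sort_values(ascending=False))
```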
Introduction to Containers and Docker
Containers and Docker offer a different approach to developing and deploying applications. This post provides an introduction to containerisation and Docker, with examples of creating containerised applications.
How to test Azure Synapse notebooks
Interactive Spark notebooks are an incredibly powerful tool for data exploration and experimentation. And in Azure Synapse, the time to (business) value is significantly decreased due to tight integration with Pipelines and monitoring tooling. But as with any software process, the need to validate business rules is important, as is ensuring that quality doesn't regress over time - especially so in such a collaborative and productive environment. This post looks at some simple ways to add testing to your Synapse Notebooks.
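As a taste of the approach, one simple pattern is to keep business rules in plain functions and assert on their behaviour in a test cell, so a regression fails the notebook run. The function and rule below are invented for illustration, a sketch rather than the post's actual examples.

```python
# Keep business logic in small, testable functions inside the notebook
def apply_discount(amount: float, customer_type: str) -> float:
    """Hypothetical business rule: trade customers get 10% off."""
    return amount * 0.9 if customer_type == "trade" else amount

# A test cell: if a rule regresses, these assertions fail the notebook run
assert apply_discount(100.0, "trade") == 90.0
assert apply_discount(100.0, "retail") == 100.0
print("All notebook tests passed")
```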
How Azure Synapse unifies your development experience
Modern analytics requires a multi-faceted approach, which can cause integration headaches. Azure Synapse's Swiss army knife approach can remove a lot of friction.
How to use SQL Notebooks to access Azure Synapse SQL Pools & SQL on demand
Wishing Azure Synapse Analytics had support for SQL notebooks? Fear not, it's easy to take advantage of rich interactive notebooks for SQL Pools and SQL on Demand.
Azure Synapse for C# Developers: 5 things you need to know
Did you know that Azure Synapse has great support for .NET and C#? Learning new languages is often a barrier to digital transformation; being able to use existing people, skills, tools and engineering disciplines can be a massive advantage.
Import and export notebooks in Databricks
Sometimes it's necessary to import and export notebooks from a Databricks workspace. This might be because you have some generic notebooks that are useful across numerous workspaces, or because you're having to delete your current workspace for some reason and therefore need to transfer content over to a new one. Importing and exporting can be done either manually or programmatically. In this blog, we outline a way to recursively export/import a directory and its files from/to a Databricks workspace.
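To give a feel for the programmatic route, a recursive export boils down to walking the workspace tree and exporting each notebook it finds. The sketch below calls the Workspace REST API directly; the workspace path, environment variables and error handling are simplified assumptions, not the post's exact implementation.

```python
import base64
import os
import requests

# Assumed environment: DATABRICKS_HOST (e.g. https://adb-....azuredatabricks.net)
# and DATABRICKS_TOKEN holding a personal access token
HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def export_directory(workspace_path: str, local_dir: str) -> None:
    """Recursively export notebooks under workspace_path as source files."""
    listing = requests.get(
        f"{HOST}/api/2.0/workspace/list",
        headers=HEADERS,
        params={"path": workspace_path},
    ).json()
    for obj in listing.get("objects", []):
        if obj["object_type"] == "DIRECTORY":
            export_directory(obj["path"], local_dir)
        elif obj["object_type"] == "NOTEBOOK":
            exported = requests.get(
                f"{HOST}/api/2.0/workspace/export",
                headers=HEADERS,
                params={"path": obj["path"], "format": "SOURCE"},
            ).json()
            target = os.path.join(local_dir, obj["path"].lstrip("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                f.write(base64.b64decode(exported["content"]))

export_directory("/Shared", "./exported-notebooks")
```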
Azure Databricks CLI "Error: JSONDecodeError: Expecting property name enclosed in double quotes:..."
Quite often it's beneficial to work with pre-built CLIs/SDKs to interact with your favourite tools, instead of making requests to the underlying REST API: much of the complexity around constructing requests is abstracted away, and authentication is often easier. The Databricks CLI makes it easier to interact with your Databricks instance, but sometimes you can run into strange errors when constructing the values passed in as arguments. In this blog, we take a look at a JSONDecodeError that can occur when speaking to the Clusters CLI, and a way to avoid it.
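As a hint at the kind of workaround involved, the error typically appears when the shell mangles the quoting of an inline --json argument. One way round it is to build the JSON with json.dumps and hand the CLI a file instead, assuming the (legacy) Databricks CLI's --json-file option. The cluster spec below is invented for illustration.

```python
import json
import subprocess

# Hypothetical cluster spec: json.dump produces the double-quoted JSON the
# CLI expects, independent of the shell's own quoting rules
cluster_spec = {
    "cluster_name": "etl-cluster",
    "spark_version": "9.1.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
}

# Write the spec to a file and pass it via --json-file, avoiding the shell
# stripping or re-quoting characters in an inline --json argument
with open("cluster.json", "w") as f:
    json.dump(cluster_spec, f)

subprocess.run(
    ["databricks", "clusters", "create", "--json-file", "cluster.json"],
    check=True,
)
```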
Using Databricks Notebooks to run an ETL process
Here at endjin we've done a lot of work around data analysis and ETL. As part of this we have done some work with Databricks Notebooks on Microsoft Azure. Notebooks can be used for complex and powerful data analysis using Spark, a "unified analytics engine for big data and machine learning". Spark lets you run data analysis workloads and can be accessed via many APIs, which means you can build up data processes and models using a language you feel comfortable with. Notebooks can also be run as an activity in an ADF pipeline, and combined with Mapping Data Flows to build up a complex ETL process which can be run via ADF.
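To make that concrete, a single extract-transform-load step in a notebook cell might look like the sketch below. The paths and column names are invented; `spark` is the SparkSession that Databricks notebooks provide automatically.

```python
# Minimal ETL sketch for a Databricks notebook cell (hypothetical data)
from pyspark.sql import functions as F

# Extract: read raw CSV files from a mounted data lake path
raw = spark.read.option("header", True).csv("/mnt/raw/sales/*.csv")

# Transform: fix types, derive a date column, drop invalid rows
sales = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_date"))
       .filter(F.col("amount") > 0)
)

# Load: write the curated output as Parquet for downstream ADF activities
sales.write.mode("overwrite").parquet("/mnt/curated/sales")
```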
Using Python inside SQL Server
Do you have a bunch of data in SQL Server that you're pulling over ODBC/JDBC to work with in Python? Using SQL Server's Python integration, you can connect to a SQL Server instance from your preferred IDE and perform the computations on the SQL Server machine itself. No more clunky data transfer. Operationalizing a Python model/script is as easy as calling a stored procedure, so any application that can speak to SQL Server can invoke the Python code and retrieve the results. Easy! This blog provides a few simple examples that make use of this capability, so you can get up and running as quickly as possible.
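To illustrate the "any application can invoke it" point, here's a rough sketch of calling sp_execute_external_script from Python via pyodbc. The connection string, table and embedded script are placeholders, assuming an instance with Machine Learning Services (Python) enabled.

```python
import pyodbc

# Placeholder connection details for a SQL Server instance with
# Machine Learning Services (Python) enabled
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)

# The embedded Python runs *inside* SQL Server; InputDataSet/OutputDataSet
# are the data frames SQL Server passes in and reads back
sql = """
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'OutputDataSet = InputDataSet.head(5)',
    @input_data_1 = N'SELECT TOP 100 * FROM dbo.Sales';
"""

for row in conn.execute(sql).fetchall():
    print(row)
```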