Managing schemas in Azure Synapse SQL Serverless
We're huge fans of SQL Serverless in Azure Synapse Analytics. We've been writing about it since we first started working with the product team back in 2019 - from benchmarking performance, to its part in redefining what a modern data warehouse can be.
The flexibility it provides - in its Pay as you Query cost model, its serverless infrastructure, and its ability to add structured schemas over standard file formats in cloud storage - mean that it's been central to a number of modern data platform architectures that we've designed for our customers.
But despite its compatibility with standard SQL-based tooling, it's not really "a database" in the usual sense of the term - so how do you deal with some of the standard database administration concerns like schema management and versioning? This post looks at how they apply within SQL Serverless, and describes an approach for ensuring that your data and schema remain in sync.
Schema artefact management
SQL Serverless is a core part of Azure Synapse and, at present, the only way that it can be used is by provisioning a Synapse workspace. This is a key thing to understand as it dictates a lot of what is possible from hereon in. Those working with Microsoft databases might be familiar with SQL Server Data Tools for Visual Studio, which provides the ability to manage schemas as source code projects alongside application source code and artefacts.
Unfortunately, SSDT isn't compatible with SQL Serverless - but remember that it's not really a "proper" database, so it's not surprising that this isn't possible. However, Synapse does have it's own IDE in Synapse Studio, and version control integration through either Azure DevOps or GitHub, so managing the schema scripts is relatively straightforward (albeit slightly limited) if you choose to go down these routes.
It's worth noting that the Git integration in Synapse does impose it's own world on what gets stored in the repository. Like all the other artefact types (pipelines, notebooks etc) the actual SQL code is wrapped in a JSON structure containing metadata about the file, which makes it impractical to edit or use this file outside of Synapse Studio.
However, one of the main benefits of Azure Synapse is unification - bringing all of your data capabilities together in once place, with deep integration. So, trying to pull this apart and manage SQL scripts differently is at-odds with the cross-functional, highly-productive nature of working on the end-to-end data problem in a holistic way with one set of tools.
Schema artefact deployment
Synapse acts as both an IDE (Synapse Studio) as well as a runtime, so "deployment" can mean a couple of different things.
Firstly, there's the deployment of the artifacts across environments - i.e. promoting the SQL scripts from one instance of a Synapse workspace to another. Again, in this instance, Synapse comes with it's own out-the-box mechanism for this if you go down the Git/DevOps integration route. The Publish process packages up all your Synapse artefacts (including SQL Scripts) into a single Azure Resource Manager (ARM) template that can be deployed over an existing Synapse workspace to promote the artefacts.
If you don't like that approach, then Synapse also exposes a fairly comprehensive REST API that allows you to read and write the artefacts in a workspace, so you could easily roll your own artefact deployment using this with a bit of PowerShell or programming language of choice.
But, as well as deploying the artefacts into the workspace, you also need to consider how you're going to execute them to apply schema changes. Pushing a SQL Script from one workspace to another doesn't actually run it - in order to effect changes on your SQL Serverless schema, you also need to execute the SQL script.
SQL script execution
As it stands, the Synapse REST API doesn't provide a mechanism to execute the scripts. However, as SQL Serverless presents itself as a "standard" SQL endpoint, standard toolings like
sqlcmd are a totally valid way to execute T-SQL against your serverless pool - either to query data, or to run DDL commands to create your schemas.
However, going back to the earlier point above regarding Git integration - if you have this enabled, your SQL code isn't being managed as pure SQL code. It's stored in your repository inside a blob of JSON that needs pulling apart to get at the actual T-SQL that you want to run. This is relatively straightforward to do if you know PowerShell and/or any other general purpose programming language but adds a layer of friction to an otherwise straightforward process.
This last point is probably the most interesting as it's generally quite a hard thing to do well in any data landscape - how do you manage changes to your schema when you're already "live"? How do you promote these changes through environments, without data loss, or impacting downstream systems? There's no "easy" answer to this and each case, and resulting approach, will be different.
There are some standard patterns in the relational database world that can be used - like scripted migrations, or state-based deployments. But in the world of SQL Serverless, there's some fundamentally different considerations that apply due to this not actually being a relational database system. The biggest difference here is that your schema is only enforced "on read", rather than "on write".
To illustrate what this means - if you create a view over a CSV file in your data lake, the view is created based on the schema of the file at time of creation. There's nothing stopping that file in from being updated in the lake - either deliberately (say, a data ingestion process is expecting new/different fields to be generated), or by accident (someone overwrites/removes/updates the file). If the file schema and the view schema become out of sync, you won't know about this until the view is queried and starts returning errors.
This is the fundamental trade-off in the "rigid, star-schema" data warehouse approach, vs. the flexibility in adding logical views over your data lake. You're prioritising agility and flexibility over things like referential integrity and validation.
A pipeline-based approach to schema deployment
One approach that can be utilised, is to tie the generation of the SQL Serverless schema to the pipelines that generate the data. If the ETL process is responsible for the data that is ultimately stored in the data lake (that SQL Serverless will query over), then it should be the ETL process that is responsible for the schema creation.
In Synapse, the ETL process is most probably going to be a Synapse Pipeline (which is essentially Azure Data Factory). So the approach is based around the pipeline triggering the process of creating/updating the SQL Serverless schema once it's processed the data.
SQL Serverless does support stored procedures, so you can add a Stored Procedure Activity to a pipeline to trigger some DDL in your SQL Serverless instance. Unfortunately you can't create a SQL view from inside a stored procedure directly, but you can if you go down the "dynamic SQL" route - i.e. build up the statements you want to run and use the
EXEC command to execute them.
The example code below shows a stored procedure that dynamically creates a SQL Serverless view over a folder in the data lake based on the specified name. It relies on some simple conventions for where the data is stored etc, but could easily be modified to suit any specific use case.
CREATE OR ALTER PROC CreateSQLServerlessView @Name nvarchar(50) AS BEGIN DECLARE @statement VARCHAR(MAX) SET @statement = N'CREATE OR ALTER VIEW ' + @Name + ' AS SELECT * FROM OPENROWSET( BULK ''https://datalakename.dfs.core.windows.net/' + @Name + '/*.parquet'', FORMAT=''PARQUET'' ) AS [r] ' EXEC (@statement) END GO
With this stored procedure available in the SQL Serverless instance, the Synapse Pipeline can call it whenever a view needs to be updated. Whenever the pipeline is run (and the data potentially modified), the schema is kept in sync in SQL Serverless.
A note on schema inference
The example above relies on automatic schema inference to read column names and data types based the contents of the parquet files. This makes for a very simple approach to the convention-based view creation, however does come with a drawback around performance optimisation. During the schema inference, Synapse will have to make a judgement call on what the column could contain so, for example it would default a text/string field to
varchar(8000) which might be overkill. This could affect query performance, especially if these columns are being used in join expressions.
A more sophisticated approach could be designed around using built in SQL stored procedures to describe the metadata associated with the column types, and using that to generate the create view statement with more specific, application level, defaults (e.g. default all text/string fields to
SQL Serverless inside Azure Synapse Analytics provides a Pay as you Query, flexible approach to data warehousing and a compelling alternative to the traditional star-schema or RDBMS. But it comes with it's own set of new challenges around standard database administration concerns like schema management and versioning. This post shows how some of these areas translate into the serverless world, focusing on the specific functionality inside Synapse.