Detecting Anomalies in IoT Telemetry with Azure Synapse Analytics
SQLBits 2020
This talk explores a real-world predictive maintenance use case, focusing on anomaly detection in IoT device telemetry to aid customer service.
As well as looking at the technology behind Azure Synapse Analytics, it explains how it supports the end-to-end data science process, providing a reference point for getting started with other advanced analytical workloads.
Transcript
Introduction
Hi everyone. My name is James Broome and I'm the Director of Engineering at a consultancy here in the UK called endjin. We are Microsoft Gold Partners for Cloud Platform and DevOps. We're also Microsoft Gold Partners for Data Platform and Data Analytics. We're also Power BI partners, and a corporate sponsor of the .NET Foundation. You can probably tell from that list that we specialize in two things: modernizing .NET applications, and Data & Insights. And it's no surprise that today I'm here to talk to you about that second area of data and insights.
Data & Insights
I'm going to be walking through how we helped one of our customers on their AI journey, moving from more traditional Excel-based batch reporting to cloud-hosted, platform-service-based data and analytics, along with a structured data science process, to help with predictive maintenance. The customer is a major UK telecommunications company, providing phone, broadband, and media services across the country to about 4 million customers. We are going to look at how data being generated by digital TV set-top boxes could be used to predict issues before they start to affect quality of service and ultimately result in lost customers.
Session Structure
So this session is about 40 minutes and I've kept the structure of the talk quite simple: going through slides in the first half, and then switching to a live demo in the second half. We're going to start by digging a little deeper into the problem space, setting some context, and explaining some of the drivers behind this work. Before jumping into the technology, I'm then going to describe the process that we followed, which is the same process we would follow for any data science work, and which is based around structured experimentation. I'm then going to introduce Azure Synapse, take a look at what's inside the box, and explain why it's a natural fit for the data science process that we're following. In the second half of the session, I'm going to demo some of the techniques we used and bring some of this to life inside Synapse, moving from the theoretical to the practical. And finally, I'll show you how the overall solution tied together from a high-level architectural perspective.
Quality of Service
In telecommunications, quality of service is everything; connectivity is now considered an essential utility in most households, and yet it gets harder and harder to deliver as more and more people use more and more devices to access more and more content, all while demanding a faster and more reliable service. Now, customer churn in the telecommunications industry is a huge problem. People are very happy to switch from one provider to another, either to get a better deal or a new device, or if they're not happy with the service they're currently getting, and this is a lot more likely to happen when it comes around to contract renewal time. We were told by the customer that if someone picks up the phone to call customer services to complain about the service, the likelihood of them leaving the provider is around 70% when their current contract expires. So keeping customers happy is essential, and the key to doing that is providing a great service. We were specifically working with the digital TV team, and for them a poor viewing experience can be caused by many factors, such as the quality of the home internet connection, the home network setup, the set-top box device configuration, and the impact of other devices on the network. Being able to quickly identify and resolve issues is crucial to retaining customers in a highly competitive market.
Viewing Platforms Data
Viewing platforms generate vast quantities of telemetry data, including viewing data, device configuration, and information about the home network and the set-top boxes themselves. They were generating data in the region of 10 gigabytes a day, containing hundreds of different types of events. But this data was largely going unused, at least in any useful way. Manual reports were being generated, but it took about six weeks to prepare them. This was all done in Excel, which was struggling with the amount of data and the number of reports needed. And while the reports showed various statistics about the events the boxes were generating, it was really hard to make sense of what was meaningful in the data. It certainly couldn't be used to do anything proactive that might improve customer service. But the hypothesis was that being able to tap into this rich information stream in a more automated and analytical way, focusing on insights to explain what's going on, could help to resolve problems before they really start to impact customers in a big way.
Detecting Anomalies
Okay. When dealing with very complex systems, it can be really difficult to find genuine issues among the usual noise of activity, especially when you're not sure what you're looking for. Now, one effective approach is to detect anomalies in the regular patterns, which can highlight potential issues that require further investigation and root cause analysis.
Now, there are many different anomaly detection models readily available that you can run over your data. But this can be particularly challenging when there are a large number of variables to observe, and that was definitely true in our case: the firmware running on the set-top boxes could generate around 140 different types of events.
Each of those events had various combinations of sub-events with more granular levels of detail and context. And we also knew that an anomaly in a single type of event didn't really tell us the full story. We were more interested in finding combinations of anomalous patterns that could tell us something meaningful, like a spike in a particular firmware update event and an unexpected drop in a viewing pattern.
What's more, television viewing habits are seasonal by nature, meaning there are natural spikes in data for perfectly legitimate reasons. A massive usage spike on a Saturday evening, for example, might not be an anomaly if it falls during football season. And when you start to dig a bit deeper, it becomes clear that seasonality patterns operate at different levels.
Viewing patterns change throughout the day, so what's expected at one hour is different to another. They also change throughout the week, so what's expected on a weekend is different to midweek. And they change throughout the year: what's expected over specific periods like Christmas or a bank holiday weekend is going to vary from the norm.
We also knew that patterns could be influenced by slightly less predictable events, like the British weather. People watch less TV when it's sunny outside, but usage increases when it's cold and wet. So the goal was to be able to quickly and continuously identify genuine anomalies, taking into account seasonality patterns at different levels and across all the different event types, to provide meaningful context for investigation.
Ultimately, we were looking for patterns or combinations of activity that could be explained and then recognized. This would significantly reduce the amount of manual effort spent analyzing masses of data and ultimately result in a more consistent and higher quality service for customers.
Data Science
People started saying a few years back that data was the new oil, and what they meant by that was two things.
Firstly, it holds incredible value; but secondly, it needs to be processed from its raw form in order to realize that value, and data science is the new field that has arisen to capitalize on that value. When people talk about data science, they often don't stop to really think about what it means in that context.
Maybe you break it down and say that the data is the information that holds the insights and the value. Then the science is the process that you need to follow in order to find and extract that value. And when we say science, we mean structured experimentation: starting with hypotheses, defining experiments, taking lab notes, and reporting on findings and observations. Getting this right is the difference between an effective data and analytics function and spending a lot of time and money playing with technology and data without any meaningful results.
So before we look at technology, I want to talk a bit about process.
Data Science Process
So, as with any software or IT workstream, we have a development phase and an operations phase to the data science process. The diagram shown here is our take on the development phase: how to go from raw data to something valuable. And what we're talking about here, which is a word I use a lot, is an experiment.
Data science work isn't the same as, say, building a web application. Fundamentally, there's no guarantee of success. You don't know what you're going to find, or whether what you do find is going to be valuable. The best way to deal with that is to have a structured process that allows for evaluation and iteration, to make sure you're always spending time on the most useful thing.
An experiment always starts with a hypothesis, something that can be proven or disproven. For example: it's possible to detect anomalous patterns in event counts generated by set-top boxes. But it also needs to be testable, and that means it needs to be quantifiable. For example: success is defined as a detected anomaly being highlighted within four hours of a historically known service disruption, or whatever is useful to you.
It might be that we know the number of customers calling the customer service center increases for the first 12 hours following an outage, and the customers that call in the first four hours are the ones most likely to go on to cancel. If we can't do anything in those time windows, the value of those insights is significantly reduced.
So the point is that the definition of success, and the metrics you're going to use to measure it, need to be tied back to the business value that you're going to create. And the best time to do this is before you begin, before you do anything with the data or start to see what it might tell you. So once we've defined our testable hypothesis, we can frame up our experiment based on how much time and money we want to invest in executing it.
This generally means a timeboxing exercise, and we recommend that the shorter the feedback loop, the better. Inside the experiment, we're going to follow an iterative loop: starting with selecting and obtaining the data we need, preparing the data into the right shape and format for processing, identifying and training candidate models we might want to use, and evaluating results against our success criteria and metrics. Now, again, we call this an experiment, so we don't know what's going to happen at any point along the way. We might not be able to access the data we need, we might not be allowed to, we might have to pay to use it, or we might not be able to get our hands on it quickly enough for it to be useful.
If we can get the data, data quality might be an issue: there might be missing data, it might be unstructured, or there could be duplicates, all of which need to be dealt with. We might identify a potential model that would help, but find we don't have enough data for the model to be effective, or the model might not perform well enough for us to want to use it.
We want to capture all of these steps, and our results and recommendations along the way, so we can repeat the experiment if we need to and make the best decisions on what to do next. That might be to stop, walk away, and focus on something else. Or, hopefully, we might be able to prove our hypothesis, develop a model that meets our success criteria, and then want to operationalize it so we can realize its value across the organization.
When we talk about operationalizing a data science process, we're now moving away from the experimental methodology towards a production-quality process: repeatable data preparation, hosting a model in a way that it can be securely and reliably accessed, and integrating it into systems or processes so its value can be realized.
And just like any production system, we're talking about all the non-functional requirements: scalability, security, performance, monitoring, and so on.
Azure Synapse
So, now we've talked about the data science and advanced analytics process, let's have a look at Azure Synapse and see where that fits in. And the first thing to say is that Synapse is a toolset, tied together by Synapse Studio, that absolutely supports, and in many ways encourages, this process.
We have technology and tools for data ingestion and storage, data preparation and data visualization, environments for experimental data wrangling, a platform for hosting common data modeling tools and libraries, and services for automation and productionization of the end-to-end workloads.
Now, if we break it down a bit, we can see that Synapse is actually a collection of different Azure data services, some of which have been around for a while and a couple of which are entirely new. Starting at the top, we have SQL on Demand. This is a new service that allows you to execute T-SQL queries over files stored in Azure Data Lake Storage Gen2.
It's a serverless technology: there's no SQL Server to manage, there's not even a database. But you can connect to a SQL on Demand instance from any client application that supports a SQL Server connection and execute SQL code over your data in its raw form, either CSV or Parquet files in your Data Lake.
Next up are SQL Pools, which are essentially a rebranding of the existing SQL Data Warehouse service. If you've used SQL Data Warehouse in the past, this is basically the same thing, just brought under the Synapse umbrella. There's also a new Spark engine as part of Synapse. It's a managed Spark service, meaning you don't need to manage your own clusters. It's not Databricks; it's a Microsoft-managed, open-source Apache Spark service, the aim being that it's optimized for Synapse workloads to provide deeper integration and better performance.
And on top of Spark, we have notebooks, which I'm going to go into in more detail later on, and the ability to execute Spark jobs. A few existing services are also included: Azure Data Factory has been pulled in as Synapse Pipelines, allowing for repeatable, cloud-hosted ETL processes for data ingestion and movement.
And we have Power BI, which is the actual Power BI service embedded into Azure Synapse. We also have Synapse Data Flows, which are essentially Azure Data Factory data flows for drag-and-drop data transformation. And the last service on the list is Cosmos DB. Now, Cosmos itself isn't part of Synapse, but there's a new integration called Synapse Link, which automatically creates an analytical store of your transactional data in Cosmos DB
so it can be queried inside Synapse. This is really exciting, as it breaks down the barriers between transactional and analytical processing. And down the left-hand side is the glue that holds it all together, which is Synapse Studio: the portal, or unified UI, through which you can easily access all of these services in an integrated fashion to follow the data science process that we just talked about.
Demo
So I'm now going to move away from the slides and switch to a demo, and hopefully bring some of this to life inside Synapse. I'm going to make use of most of the services inside Synapse to give you a full picture of its strength as an end-to-end toolset, showing how we could use the services highlighted above with our device telemetry data to gain valuable insights from the raw event logs.
Okay, so let's start by accessing the Azure portal and seeing that we have a Synapse workspace already deployed. This could be done manually through the portal UI or via an automation pipeline using PowerShell or Azure Resource Manager templates. I'm not going to go into too much detail here, but there's one really important point to make about provisioning,
and that's that Synapse's strength is in the integration of all the underlying services. If you've done any work provisioning Databricks and you've tried to configure the secrets and the necessary configuration to connect to, say, a Data Lake storage account, you know it's not insignificant in terms of complexity, especially if you're trying to automate all of that. Whereas with the Synapse workspace,
you've got a common identity that enables all of these services to talk to each other very easily.
Synapse Studio
So it's really easy to get going just by deploying a workspace. As the workspace is provisioned, we can specify either an existing Data Lake or that we want to create a new one, and all the connectivity is taken care of for us.
Now, once you have your environment deployed, we can use the Synapse Studio link here to take us into our workbench and start working with the tools. You can see that the home screen of Synapse Studio spells out our developer workflow: ingest, explore, analyze, visualize. Each of these links takes you into the relevant place inside the UI.
User Interface
If we open the left-hand menu, we can also see the different UI areas we have inside Studio. The Data tab lets us configure and browse data connections. Develop is where we're going to write our code (T-SQL scripts and Spark notebooks) or design interactive reports in Power BI. Orchestrate is where we're going to configure our pipelines for automation and data movement. Monitor is, as you'd expect, where we can look at the logs of everything that has happened as we start to do things with the services. And finally, Manage is an area for managing settings, like configuring new linked services or runtimes.
So the first thing that we typically want to do is get our data into our Data Lake so we can start to use it inside Synapse, and we can do that with a Synapse pipeline, which can either be a one-time thing or a repeatable data ingestion process. I'm going to click on the Ingest option, which opens up a dialog that will be very familiar if
you've done anything with Azure Data Factory. We can define our pipeline here and configure our source and target datasets. We can decide whether we want to run this on a schedule or just as a one-time copy. We can see that our Data Lake automatically appears as a linked service, and we'd typically use this as our target.
And you can see that if I say "create new connection", there's a whole host of options for connecting to data sources inside Azure and from third-party services. Again, this is just Azure Data Factory integrated into the Synapse UI. I'm going to cancel out of that because, for the purposes of our demo, our ingestion was a one-time thing and we've already loaded the data into our storage account.
Azure Data Lake
Now, if I select the Explore option, I'm taken into the Data section of our workspace. If I select the Linked tab, we can see the Azure Data Lake Storage Gen2 storage account is already there as a linked service, and it contains the data we've already ingested. We can navigate around the Data Lake through the directories and see that, in this case, our data is stored in Parquet files.
And if we right-click and do "Select TOP 100 rows", we get a T-SQL script, pre-populated, just like you might do in SQL Server Management Studio. If we run that, as you'd expect, we see some results, and we can see that in this file we have a timestamp, an event type, and a sub-event type from our device telemetry.
Now, at first glance that might not seem that impressive. But what we've just done is execute a standard T-SQL query over a Parquet file living in a Data Lake, with no SQL Server or SQL database needed. This is serverless T-SQL over your files in the Data Lake, and it's one of the new services that Synapse brings to the table, called SQL on Demand.
I'm going to come back to this later on in the demo and show you how we can use it to great effect. Let's quickly look at a different year and see that we've got some different event types and sub-event types here too. We run that, and we can see different data.
Now, as this is T-SQL, we can use any of the syntax you already know and love. For example, we could order these results by sub-event type.
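Queries like this can also be run from outside Studio over a standard SQL Server connection. As a rough sketch (the endpoint, storage path, credentials, and column names here are illustrative placeholders, not the exact script from the demo), something like the following pyodbc snippet would run the same kind of OPENROWSET query against the serverless endpoint:

```python
# Sketch: an ad-hoc SQL on Demand query over Parquet files in the Data Lake, run from
# any client that supports a SQL Server connection. All names below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"  # serverless (SQL on Demand) endpoint
    "Database=master;"
    "UID=<user>;PWD=<password>;"
    "Authentication=ActiveDirectoryPassword;"            # or another supported auth method
)

query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/telemetry/events/2018/*.parquet',
    FORMAT = 'PARQUET'
) AS events
ORDER BY SubEventType;
"""

for row in conn.cursor().execute(query):
    print(row)
```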
Notebooks
So, if you're from a SQL background and starting to get to grips with more advanced analytics scenarios, SQL on Demand allows you to get going really quickly and really cheaply, exploring data in a familiar way. Now that we have our data in the Data Lake and we've had a look at what's in there, we can move on to doing something with it.
Let's go back to the home screen and now select the Analyze option, which will take us into the Develop section of Synapse Studio.
We can see we've got a few different sub-menu options here: SQL scripts, notebooks, Spark job definitions, and Power BI. Each of these can be used to build out your end-to-end analytics pipeline, and I'm going to go into more detail on all of them as we go through the rest of the demo, starting with notebooks.
Now, if you've ever worked with Jupyter notebooks or Databricks, the concept of notebooks should be very familiar. If you haven't, it's worth saying that the use of notebooks is really a game changer in process and in mindset if you're coming from a more traditional BI background. They provide an interactive way to execute code side by side with text-based markup and even visualizations.
They're a fantastic tool in an experimentation process because they allow you to keep comprehensive lab notes as you go through the data preparation and model development stages, acting as living, breathing, executable documentation.
Spark Notebooks
So I'm going to open a notebook that we already have, to show you how we used this technique to develop our anomaly detection pipeline. Before I walk through what's going on, there are a few important things to highlight. Firstly, this isn't Jupyter and it's not Databricks. It's a Microsoft-managed, open-source Apache Spark service. The reason for Microsoft going down this route is really to give tight integration into the wider Synapse toolset, in terms of things like security.
But also Microsoft have ambitions to improve cluster startup time. Their aim is to bring that down to about 30 seconds compared to a few minutes, which is pretty typical in Databricks at the minute. The other big advantage here is that these notebooks support the .NET for Spark runtime, meaning you can author code in C# and have it run on the Spark cluster.
So if you're from a .NET background, this is another huge win in terms of reducing the learning curve and getting to grips with a load of new concepts and techniques when it comes to advanced analytics. In this case, we used a PySpark-based notebook, meaning we could write Python code and still make use of the parallel processing of Spark, which I'll get to in a bit.
So, as I've already said, if you've used other notebook technologies before, this will look very familiar. If not, what we have is a series of editable cells that can contain either text markup or code. This notebook documents our experimentation with the anomaly detection model that we decided to use.
We start out by declaring that we're going to try to detect anomalies in our data using the Anomaly Detection Toolkit library, as it has built-in support for seasonality. Now, Spark clusters come preloaded with a large number of libraries, but this particular library isn't one of them. You can, however, upload additional libraries or custom code in the Spark pool definition, and that's exactly what we did.
Now, the pool definition is where you define the characteristics of how your Spark cluster is going to run. Do you want to keep it at a fixed size or allow it to auto-scale? Do you want it to be always on, or to automatically shut down after a certain time? Do you want specific packages or libraries to be available? You always need a pool to be able to execute your notebook, but how many pools you need will entirely depend on the specific workloads that you're trying to run.
Python Modeling
So the first thing we do is simply load and display our data to get a sense of what we're looking at inside the notebook. Then we can start to use Python to manipulate our events data into the right structure for the model. In our case, we started off by focusing on a single event type in the data, aggregating the count of events into daily intervals, and filling in missing gaps in the time series where there might not have been any events on a particular day. At that point, we could pass our new data structure in to execute our model and plot the detected anomalies inside the notebook.
It was clear straight away that we were going to need to iterate over our modeling, so we defined our own helper function to use inside the notebook that would wrap up a lot of the setup and configuration we needed. We could then iterate quickly and easily: filtering down the dataset, changing the model parameters to increase or decrease the normal expected ranges,
and grouping by different granularities for different seasonality patterns, from daily counts and hourly counts down to event and sub-event levels. All the while, we were validating our anomaly outputs against a set of well-known anomalies that had been provided by the customer, and we eventually got to the point where we could reliably identify the same anomalies using historical data without introducing too much noise that would cause inefficiencies in analysis.
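To give a flavour of the helper we're describing, here's a minimal sketch assuming the Anomaly Detection Toolkit (adtk) and hypothetical column names: it aggregates a single event type into a regular count series, fills empty intervals with zero, and runs a seasonal detector over it. The real notebook wrapped more configuration, plotting, and validation than this.

```python
# Sketch of the kind of helper used in the notebook: aggregate one event type into a
# regular time series, then run a seasonal anomaly detector over it.
# Column names, frequencies and detector parameters are illustrative assumptions.
import pandas as pd
from adtk.data import validate_series
from adtk.detector import SeasonalAD

def detect_anomalies(events: pd.DataFrame, event_type: str,
                     freq: str = "D", c: float = 3.0) -> pd.Series:
    """Return a boolean Series flagging anomalous intervals for one event type."""
    counts = (
        events[events["EventType"] == event_type]
        .set_index("Timestamp")
        .sort_index()
        .resample(freq)            # daily ("D") or hourly ("H") buckets
        .size()                    # empty buckets in the range become zero counts
        .rename("EventCount")
    )
    series = validate_series(counts)           # checks for a regular DatetimeIndex
    detector = SeasonalAD(c=c, side="both")    # seasonal-decomposition-based detector
    return detector.fit_detect(series)         # True where the count is anomalous

# Example iteration during experimentation (hypothetical event type and parameters):
# anomalies = detect_anomalies(events_df, "FirmwareUpdate", freq="H", c=2.0)
# anomalies[anomalies].head()
```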
Performance Optimisation
So, once we were happy with that, we needed to scale out our detection logic across all combinations of event types and sub-event types and capture those results back into a Parquet file in our Data Lake. What's great about our data is that each type of event was effectively an independent series, so we could easily tell Spark to fan out and parallelize our modeling for each group of data. Understanding how you can chunk up your data to get this parallel processing is really key to getting the most out of Spark in terms of performance optimization. So our experimentation, all captured in a single notebook, proved that we could host and execute a model that we'd tuned to an acceptable level of performance, process different model runs in parallel, and write the results back into our Data Lake for further analysis.
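Here's a rough sketch of what that fan-out could look like, assuming a Spark 3.x pool where groupBy().applyInPandas is available (on older runtimes the equivalent is a grouped-map pandas UDF). The schema, paths, and the per-series detection wrapper are illustrative assumptions rather than the production code.

```python
# Sketch: fan out per-series detection across event type / sub-event type combinations.
# Assumes a Synapse notebook where `spark` is already available. Paths, schema and the
# detect_anomalies_series helper are illustrative stand-ins for the notebook logic.
import pandas as pd
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, LongType, BooleanType)

result_schema = StructType([
    StructField("EventType", StringType()),
    StructField("SubEventType", StringType()),
    StructField("Interval", TimestampType()),
    StructField("EventCount", LongType()),
    StructField("IsAnomaly", BooleanType()),
])

def detect_for_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs on the workers: one independent series per (event type, sub-event type)."""
    counts = (pdf.set_index("Timestamp").sort_index()
                 .resample("D").size()
                 .rename("EventCount"))
    flags = detect_anomalies_series(counts)     # hypothetical wrapper around the adtk model
    out = counts.to_frame()
    out["IsAnomaly"] = flags.fillna(False).astype(bool)
    out["EventType"] = pdf["EventType"].iloc[0]
    out["SubEventType"] = pdf["SubEventType"].iloc[0]
    return out.reset_index().rename(columns={"Timestamp": "Interval"})

events = spark.read.parquet("abfss://telemetry@<account>.dfs.core.windows.net/events/")

anomalies = (events
             .groupBy("EventType", "SubEventType")
             .applyInPandas(detect_for_group, schema=result_schema))

anomalies.write.mode("overwrite").parquet(
    "abfss://telemetry@<account>.dfs.core.windows.net/anomalies/")
```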
The next step was to pull that out into something we could operationalize. Now, one way to do this, given that our code is in Python, is to continue to use Spark notebooks, and I'll show you in a minute that we can use an orchestration pipeline, which, remember, is just Azure Data Factory with a new type of activity, to execute this Synapse notebook according to the trigger that we need.
So we pulled out all the main parts of our experimentation notebook, cleaning up the code where necessary so that this notebook focuses just on the steps needed to select the data, run the model and write out the results.
Now, when it comes to reading and writing files in your Data Lake, there's generally a sweet spot between the number of files you have and the file size in order to get the best query performance, depending on what you're doing from a query perspective. In our case, we knew we were only interested in the last three months' worth of results, but we needed to read in a lot more data than that to feed into the model, as it was essentially being retrained each time it was executed.
So we read in three years' worth of data on a rolling window, executed the model, and wrote the anomaly results for the last month back into the Data Lake. Now, it's worth saying that when we put this notebook together, there wasn't support for parameterization of notebooks. We got around this quite easily by just storing our parameters in a JSON file in the Data Lake.
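As a sketch of that workaround (the file paths, parameter names, and column names are assumptions for illustration), the notebook read a small JSON parameters file from the lake, used it to filter the input down to a rolling window, and wrote only the most recent results back:

```python
# Sketch: notebook "parameters" read from a JSON file in the Data Lake, pre-dating the
# built-in notebook parameterization. Paths and parameter names are illustrative.
from datetime import datetime, timedelta

params = (spark.read
          .option("multiLine", "true")
          .json("abfss://telemetry@<account>.dfs.core.windows.net/config/anomaly-params.json")
          .first()
          .asDict())
# e.g. {"trainingWindowDays": 1095, "outputWindowDays": 30}

now = datetime.utcnow()
training_cutoff = now - timedelta(days=int(params["trainingWindowDays"]))
output_cutoff = now - timedelta(days=int(params["outputWindowDays"]))

events = (spark.read
          .parquet("abfss://telemetry@<account>.dfs.core.windows.net/events/")
          .where(f"Timestamp >= '{training_cutoff:%Y-%m-%d}'"))

# ... run the detection over `events` as before, producing `anomalies` ...

# (anomalies
#  .where(f"Interval >= '{output_cutoff:%Y-%m-%d}'")
#  .write.mode("overwrite")
#  .parquet("abfss://telemetry@<account>.dfs.core.windows.net/anomalies/latest/"))
```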
But in the last couple of weeks, notebook parameterization support has been added to Synapse, so this could now be updated to pass the parameters directly when calling the notebook, say from a Synapse pipeline. Now that we've got an operationalized notebook set up, if we switch into the Orchestrate section of Studio, we can define a pipeline to run this notebook.
Once again, if you've used Azure Data Factory before, this should be familiar, with the addition of the Synapse-specific activities. We add a Notebook activity, we can select the notebook that we want to run, and we can add a new trigger, which could be based on a schedule or a specific event, for example new data arriving in our Data Lake.
Spark Job Definition
Now, a different but equally effective way to orchestrate and operationalize this process is to use a Spark job definition. This allows you to specify a single file to execute on the Spark cluster and, as with notebooks, you can use PySpark or Scala for that. We just set where the file is that we want to execute; you can see here this job definition is pointing to our Data Lake. If we just switch back to our data view, we can navigate to and view this file. It's just another version of the code in our notebook, manually extracted into a Python file and stored in the Data Lake. To orchestrate this job definition, we go back to the pipeline and, instead of using a Notebook activity, we use a Spark job definition activity. This points to our job definition, which in turn references our file, and again we can set up any kind of trigger that we want. Now, this option is really useful if you've already got existing Python or Scala scripts you want to move into Synapse: you can easily migrate them into the Data Lake and start to wire them into orchestration pipelines.
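For reference, the file behind a Spark job definition is just a standalone PySpark script with its own entry point, something like this minimal sketch, where the paths and the run_detection helper are stand-ins for the notebook logic:

```python
# detect_anomalies_job.py: a standalone script referenced by a Synapse Spark job definition.
# Unlike a notebook, it has to create its own SparkSession. Paths and helpers are illustrative.
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("anomaly-detection").getOrCreate()

    events = spark.read.parquet(
        "abfss://telemetry@<account>.dfs.core.windows.net/events/")

    anomalies = run_detection(events)   # hypothetical wrapper around the notebook logic

    anomalies.write.mode("overwrite").parquet(
        "abfss://telemetry@<account>.dfs.core.windows.net/anomalies/")

if __name__ == "__main__":
    main()
```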
Let's trigger this pipeline now to kick off the Spark job. Now that we've operationalized our detection pipeline as a Spark job definition, we want to make sure that it's running successfully. If we switch into the Monitor view, we can see what's going on and dig into the details if anything goes wrong. We can see here that the pipeline I've just triggered is already running, and whilst that runs, I'll open up a previous one to show you what that looks like.
As our pipeline contains a Spark job definition activity, we can also see it under the Spark applications menu. Once the job is finished, we can see how it was executed across the data, and we can use this information to optimize our workload if we're not seeing the performance characteristics we're expecting.
Recap
So at this point, let's pause for a bit of a recap. We've ingested our raw telemetry data into our Data Lake Gen2 storage account, which is linked to our Synapse workspace. We've explored that data using SQL on Demand, which gives us serverless T-SQL over our Data Lake files. We've used notebooks to experiment with a Python anomaly detection model until we were happy with the results.
And we've used the processing power of Spark to fan out our model execution across all the combinations of events and sub-events that our telemetry generates. We've extracted our detection code into a Spark job definition, we've set up an orchestration pipeline that will execute it based on the schedule that we need, and we've run it and proved that it works.
We've also been able to dig into the diagnostics of the Spark job. So at this point, we have our anomaly results being output back into the Data Lake, and the next logical step is to surface that data so we can analyze the anomalies and see what's going on. Let's go back into the Develop area and this time look at the SQL scripts section.
SQL Scripts
We can add T-SQL scripts here that run as SQL on Demand statements, allowing us to execute SQL over our files in the Data Lake without the need to provision a SQL Server or SQL database. It's worth saying that while we were writing our results back to the Data Lake as Parquet files, SQL on Demand also supports querying over Spark tables,
so we could have gone down that route too. But back to our SQL scripts: we can use these to create virtual schemas over our data, and we can then connect to our SQL on Demand instance from any client application that supports a SQL Server connection and query it just like it was a traditional SQL database.
So this first script starts by creating a database, which is basically just a container for our schema objects. The next script adds a view to that database, and in this case a view is basically a schema abstraction over our files. We're using a CREATE VIEW ... AS statement, selecting all the columns in our raw event data with a star wildcard, but also creating a virtual column based on the folder structure in our Data Lake.
You can see we're tokenizing our folder path, which organizes files by year, and exposing that as a new "year" column in our view schema. If we were to run this script, it would create the view in the database, just like in a regular SQL database. And if we navigate back to our Data area, we can see our database definition and, under Views,
our raw events view. Again, we can do our "Select TOP 100 rows", and when we run that, we can see our data just as if it was coming from a SQL database view or table.
Now, we can also see we have our year column available, and if you remember, this is being dynamically retrieved from the folder structure path in the Data Lake. That notion of virtual columns is really important, as it means we now have a way to easily filter our queries. So in this case, we could add a WHERE statement saying WHERE year = 2018.
And what that means is that Synapse knows it only has to look under the 2018 year folder to access that data, which makes for a much more efficient query. So understanding how your data is structured, or rather structuring your data in a way that makes sense for the types of queries you're running, is really important in terms of query optimization.
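To tie that together outside of Studio, here's a hedged sketch of the same pattern from a SQL client: a view exposing a virtual year column via filepath() over the wildcard in the path, and a filtered query that only touches the matching folder. The endpoint, database name, paths, and columns are illustrative placeholders, and the database is assumed to have been created beforehand.

```python
# Sketch: a view over the raw Parquet with a virtual 'year' column, then a partition-pruned
# query against it. Endpoint, database name, paths and columns are illustrative placeholders.
import pyodbc

create_view_sql = """
CREATE VIEW dbo.EventsRaw AS
SELECT r.*, r.filepath(1) AS [year]          -- value matched by the first * in the path
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/telemetry/events/*/*.parquet',
    FORMAT = 'PARQUET'
) AS r;
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=AnomalyDetection;"             # assumed to exist (created by the first script)
    "UID=<user>;PWD=<password>;"
    "Authentication=ActiveDirectoryPassword;",
    autocommit=True,
)

conn.cursor().execute(create_view_sql)

# Filtering on the virtual column lets the engine read only the 2018 folder.
rows = conn.cursor().execute(
    "SELECT TOP 100 * FROM dbo.EventsRaw WHERE [year] = '2018' ORDER BY SubEventType;")
for row in rows:
    print(row)
```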
SQL Views
So, going back to SQL scripts, we can also create views over our anomaly detection output, and we can build views on top of views. We realized we only really wanted to look at data going back a few months, so we created an "anomalies from the last three months" view that filters down the anomalies view, so that every time it runs it returns the output of the model from the last three months.
So, using T-SQL, a query language that we know and love, and without needing to provision any SQL resources, we've been able to create a SQL-based schema that we can easily connect to and query to give us the results we want to look at for analysis. Now, we can immediately get some visualizations over that data inside the query output window, and that can be customized to a certain degree to help with analysis. But in most cases, you're going to want to do something a bit more targeted based on the analysis that you need.
Power BI
So the last piece of the end-to-end solution I want to show you is Power BI, and the key thing here is that this is the actual Power BI service embedded inside Synapse Studio. Now, because the anomaly detection database that we just created in SQL on Demand appears just like any other SQL database, connecting to it
from inside Power BI is really easy. I can create a new Power BI dataset from our anomaly detection database, and at this point I can download a PBIDS file, which is an empty report template with a data source connection pre-configured. All I need to do is publish that report into a Power BI workspace that I've associated with this Synapse account.
Then I can see the report inside Synapse, and I can continue editing it from inside Synapse Studio without needing to jump in and out of Power BI. Now, this report lives inside the Power BI service, so any changes I make are available to share across the organization, and obviously it's interactive, so I can drill down and slice and dice the data, which can also help with the whole experimental and iterative process.
Now, going back to our use case again, I mentioned earlier that what was really useful was being able to show correlations of anomalies across different types of events. When you've got a hundred or so different types of messages, all with different frequencies and scales, it's really hard to make sense of meaningful patterns.
For example, being able to correlate a specific firmware update with a specific error dialogue or system behaviour would help the team understand the root causes of issues faster, allowing them to proactively deal with potential customer service issues before they escalate.
So one of the most useful visualizations for us was a simple correlation of the number of anomalies across different event types, regardless of what "normal" looked like, whether that was 10 messages per hour or a million messages per hour.
We could then focus in on periods that had more anomalous behaviour and start to recognize and ultimately explain the patterns that were being seen. And of course, Power BI isn't the only way to consume the outputs; as we're just making a SQL connection, we could do that from any number of client applications, including your own custom-built applications.
Azure Data Studio
I'm just going to show you a couple of options to illustrate what I mean here. Firstly, I'm going to open Azure Data Studio, which is a great desktop tool for working with data.
It also has support for notebooks, just like the ones we used earlier in Spark, but these notebooks run locally; however, we can make remote connections to our data sources. And interestingly, Data Studio also has support for SQL notebooks, meaning I can include T-SQL code in here as well.
So you can see here I have a code cell that contains a simple SELECT query. If we take a look at the connection details that this notebook is using, we can point it at our SQL on Demand endpoint, select the anomaly detection database, and connect, and we can run our queries just as you'd expect. This is really powerful in a scenario where you've got a mix of skill sets in your team: maybe someone's writing predictive models in Python running on Spark over the Data Lake, and someone else is doing SQL-based analysis over those outputs, and they don't need to care whether the data is actually stored in a Spark table or Parquet files.
And the last thing I want to show you is how this could be integrated into a custom application, just like any other database. If I switch to Visual Studio, I've got a simple .NET console application that's connecting to a SQL on Demand instance using a standard ADO.NET connection, and if I hit F5 to run the application, you'll see the data appearing in the console app.
So the opportunities are pretty endless.
Architectural View
I'm going to step away from the demo now and switch back to the slides to give you a slightly different view on what we've just seen. Here's a higher-level architectural view of the end-to-end process. We've ingested our source telemetry data from an external blob storage location, using an Azure Synapse pipeline, to land it in our Data Lake Gen2 storage account.
We've used the managed Synapse Spark environment to do our data preparation and also to execute our anomaly detection model over the aggregated data, writing the output results back into the Data Lake as Parquet files. We've then used SQL on Demand, the serverless T-SQL query runtime, to query the data outputs directly from our Data Lake
and integrated that into interactive dashboards hosted in the Power BI service. Across the top of this diagram we have Azure DevOps, which we used to provide continuous integration and deployment of the infrastructure, security, and data processing logic. And of course, at the bottom, Azure Monitor provides rich insights into application and operational performance.
So, to wrap up: over the last 40 minutes or so, I've described how we approached a real-world predictive maintenance use case based around detecting anomalies in device telemetry from TV set-top boxes. Hopefully, along the way you've realized that nothing I've shown you has really been specific to this use case. What I've actually shown you is how Azure Synapse provides an integrated, end-to-end data science and analytics platform to support any kind of use case. The tooling is designed to support the data science process: starting with raw data; exploring, transforming, and experimenting; moving into production and operationalizing workloads; and providing monitoring and management support.
If you're familiar with some of the Azure data platform services, you're not starting from scratch. Synapse offers an opinionated view on how these tools should be used together, although it does remain incredibly flexible in terms of using the right tool for your workload. Whether you're coming from a more SQL-oriented background or a Spark background, you'll find an entry point that makes sense for you to start taking advantage of the wider set of integrations.
And finally, you've seen that Synapse offers a new and very compelling service that provides a serverless consumption model for SQL, which allows you to get going very quickly and very cheaply while still supporting the huge ecosystem of standard tooling out there.
So, to summarize: at endjin we're really excited by Azure Synapse. We think it's a great solution for anyone looking to move their data science workloads into the cloud and take advantage of serverless, platform-service-based analytics.
So that's all from me. Thank you very much for listening. If you want to follow up after the session, you can reach me using any of the methods that you can see on the screen here. Thank you.