Reading structured data from SharePoint in Synapse Notebooks

By James Broome, Director of Engineering

In the previous post in this series, I described how to read file-based structured data (i.e. CSV and Excel) from SharePoint into your Azure data pipelines using Synapse Pipelines. The approach relied on the Synapse Managed Identity, which can be used within Web Activities to easily authenticate against the Microsoft Graph API so that metadata about files can be queried.

One of the metadata properties of a SharePoint file is the Download URL, which is a pre-authorised URL that's valid for 1 hour. This means any subsequent requests don't need to be authenticated to use it, so it can be passed into a Copy Activity to retrieve the file and move it into a Data Lake.


That same URL could also be passed into a Notebook activity (via a Parameters cell) if you wanted to use Python/PySpark/Pandas to read the contents of the data. As it's pre-authorised, it can be passed straight into Pandas' read_csv, for example, to create a DataFrame.
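If you take that route, the Notebook side is trivial. Here's a minimal sketch, assuming the Notebook activity passes the pre-authorised URL in through a parameter called download_url - the name is illustrative, use whatever you configure on the activity:

import pandas as pd

# Parameters cell - the Pipeline's Notebook activity overrides this value at runtime
download_url = ""

# Because the URL is pre-authorised, read_csv can fetch it directly with no auth handling
df = pd.read_csv(download_url)
df.head()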

But there are valid reasons why this approach might not work for you. If you have lots of files to query and ingest, you might not want to build lots of iterative logic into your Pipelines. Or it may be that you don't know the specifics of the files you're ingesting until other logic or data is available in your Pipeline (and the Pipeline expression syntax becomes unwieldy).


So, this post describes how to achieve the same result by just using Notebooks. It remains focused on using Azure Synapse Analytics, but in future posts in the series I'll switch to Microsoft Fabric for comparisons.

Authenticating with SharePoint from Azure Synapse Spark Notebooks

The big difference in the Spark-based approach vs. using Pipelines is that we can no longer rely on the Synapse System Assigned Managed Identity. The mssparkutils library has built-in functions to get access tokens (mssparkutils.credentials.getToken()), but the list of audiences is limited and, unfortunately, the MS Graph API is not included in that list.
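You can check this for yourself from inside a Notebook. The following is just a sketch to illustrate the limitation - the "Storage" audience key is an assumption based on the documented list at the time of writing, so check the help() output in your own environment:

# mssparkutils is available by default in Synapse Spark Notebooks
# List the audience keys that getToken() supports - the Graph API isn't among them
mssparkutils.credentials.help()

# Works for a supported audience (e.g. Azure Storage)...
storage_token = mssparkutils.credentials.getToken("Storage")

# ...but there's no equivalent audience key for https://graph.microsoft.com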

So, we can't automatically use the System Assigned Managed Identity for authentication. Instead, we have to use a regular Service Principal, which means that we need a way of storing and accessing the Service Principal details from Spark. mssparkutils helps us again here with its built-in functions for reading secrets from Azure Key Vault. If you've added your Key Vault as a Linked Service in Synapse, there are even built-in functions for accessing secrets via the Linked Service (meaning your Notebook code doesn't need to reference specific Key Vault service names).

In order to authenticate to get a valid token, we need:

  • The Service Principal Client ID / Application ID
  • The Service Principal Secret
  • The Tenant ID

And using all of that, we can construct the request to retrieve an access token:

import requests

sp_client_id = mssparkutils.credentials.getSecretWithLS("KeyVaultLinkedService", "ServicePrincipalClientId")
sp_secret = mssparkutils.credentials.getSecretWithLS("KeyVaultLinkedService", "ServicePrincipalSecret")
tenant_id = mssparkutils.credentials.getSecretWithLS("KeyVaultLinkedService", "TenantId")

# The scope of your application
scope = "https://graph.microsoft.com/.default"

# The endpoint to get a token
token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"

# The data to send in the token request
token_data = {
    "client_id": sp_client_id,
    "client_secret": sp_secret,
    "scope": scope,
    "grant_type": "client_credentials"
}

response = requests.post(token_url, data=token_data)
response.raise_for_status()  # Raise an exception if the request failed
token = response.json()["access_token"]

Get the SharePoint Drive Id for the SharePoint site

Now that we can authenticate, the process is essentially the same as it was for the previous pipeline-based approach, except we're issuing requests to the Graph API in Python code, rather than using Web Activities.

Firstly, we need the SharePoint Drive ID, which we retrieve from the Graph API using the details of the SharePoint site:

sharepoint_tenant_fqdn = "your-domain.sharepoint.com"
sharepoint_site_name = "site-name"
library_name = "library-name" # e.g. Shared%20Documents

web_url = f"https://{sharepoint_tenant_fqdn}/sites/{sharepoint_site_name}/{library_name}"

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json; charset=utf-8",
    "Authorization": f"Bearer {token}"
}

# Get drives for site
list_drives_request_url = f"https://graph.microsoft.com/v1.0/sites/{sharepoint_tenant_fqdn}:/sites/{sharepoint_site_name}:/drives"
list_drives_response = requests.get(list_drives_request_url, headers=headers)
list_drives_response.raise_for_status()  # Raise an exception if the request failed

# Find drive that matches on web URL
drive_id = next((d["id"] for d in list_drives_response.json()["value"] if d["webUrl"] == web_url))

Get Download URL property of the file in SharePoint

Once we have the Drive ID, we can query for a specific file and retrieve the same pre-authorised Download URL:

file_name = "name-of-file"

url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root:/{file_name}?select=@microsoft.graph.downloadUrl"

response = requests.get(url, headers=headers)
response.raise_for_status()

download_url = response.json()["@microsoft.graph.downloadUrl"]
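If you have lots of files to ingest (the scenario mentioned earlier), you don't have to request each one by name - you can list the drive's children and pick up the same pre-authorised Download URL for every file in a single call. A sketch, assuming the files sit in the root of the document library:

# List everything in the root of the drive - file items include the same
# pre-authorised download URL in the response
list_children_url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root/children"

children_response = requests.get(list_children_url, headers=headers)
children_response.raise_for_status()

download_urls = {
    item["name"]: item["@microsoft.graph.downloadUrl"]
    for item in children_response.json()["value"]
    if "@microsoft.graph.downloadUrl" in item  # folders don't have a download URL
}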

Read the contents of the file in SharePoint using Pandas

And that Download URL can be passed directly into a read_csv or read_excel call in Pandas, depending on the specifics of the file that you're accessing. Because it's pre-authorised, you don't need to deal with authentication:

import pandas as pd

df = pd.read_csv(
    download_url
)
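And if the file is an Excel workbook rather than a CSV, the same URL can be handed to read_excel instead - a sketch, assuming the first worksheet is the one you want and that an Excel engine such as openpyxl is available on the Spark pool:

# read_excel accepts the pre-authorised URL in the same way;
# sheet_name=0 reads the first worksheet
df = pd.read_excel(
    download_url,
    sheet_name=0
)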

Summary

Whilst ingesting data from files in SharePoint is a very common requirement in most of the data projects we deliver, there's no out-of-the-box support inside Synapse. This post and the code samples within demonstrate how to authenticate and query files using Python code running in a PySpark Notebook.

In the next few posts, I'll take what we've learned so far and see how it can be applied to write structured data back to SharePoint, as well as translating the same requirements into Microsoft Fabric.

James Broome

Director of Engineering

James has spent 20+ years delivering high quality software solutions addressing global business problems, with teams and clients across 3 continents. As Director of Engineering at endjin, he leads the team in providing technology strategy, data insights and engineering support to organisations of all sizes - from disruptive B2C start-ups, to global financial institutions. He's responsible for the success of our customer-facing project delivery, as well as the capability and growth of our delivery team.