Reading structured data from SharePoint in Synapse Notebooks

In the previous post in this series, I described how to read file-based structured data (i.e. CSV and Excel) from SharePoint into your Azure data pipelines using Synapse Pipelines. The approach relied on the Synapse Managed Identity, which can be used within Web Activities to easily authenticate against the Microsoft Graph API so that metadata about files can be queried.
One of the metadata properties of a SharePoint file is the Download URL: a pre-authorised URL that's valid for 1 hour. Requests that use it don't need any further authentication, so it can be passed into a Copy Activity to retrieve the file and move it into a Data Lake.
That same URL could also be passed into a Notebook activity (via a Parameters cell) if you wanted to use Python/PySpark/Pandas to read the contents of the data. As it's pre-authorised, it can be passed straight into Pandas' read_csv, for example, to create a dataframe.
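For illustration, a minimal sketch of that pattern: the first cell is marked as the Parameters cell in Synapse, and the placeholder download_url value is overridden by the calling pipeline at runtime (the parameter name here is illustrative):
# Parameters cell - value is supplied by the pipeline's Notebook activity
download_url = ""
import pandas as pd
# The URL is pre-authorised, so no auth headers or tokens are needed
df = pd.read_csv(download_url)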
But there are valid reasons why this approach might not work for you - if you have lots of files to query and ingest, you might not want to implement all of that iterative logic in your Pipelines. Or it may be that you don't know the specifics of the files you're ingesting until other logic or data is available in your Pipeline (at which point the Pipeline expression syntax becomes unwieldy).
So, this post describes how to achieve the same result by just using Notebooks. It remains focused on using Azure Synapse Analytics, but in future posts in the series I'll switch to Microsoft Fabric for comparisons.
Authenticating with SharePoint from Azure Synapse Spark Notebooks
The big difference in the Spark-based approach vs. using Pipelines is that we can no longer rely on the Synapse System Assigned Managed Identity. The mssparkutils library has built-in functions to get access tokens (mssparkutils.credentials.getToken()), but the list of supported audiences is limited and, unfortunately, the MS Graph API is not included in that list.
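For illustration, a couple of audiences that do work (the audience keys below are the ones documented for Synapse - check the mssparkutils documentation for the current list):
# Works: Azure Storage and Key Vault are supported audiences
storage_token = mssparkutils.credentials.getToken("Storage")
keyvault_token = mssparkutils.credentials.getToken("Vault")
# ...but there's no audience key for https://graph.microsoft.com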
So, we can't automatically use the System Assigned Managed Identity for authentication. Instead, we have to use a regular Service Principal, which means we need a way of storing and accessing the Service Principal details from Spark. mssparkutils helps us again here with its built-in functions for reading secrets from Azure Key Vault. If you've added your Key Vault as a Linked Service in Synapse, there are even built-in functions for accessing secrets via the Linked Service (meaning your Notebook code doesn't need to reference specific Key Vault service names).
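As a quick sketch of the difference (the Key Vault and secret names here are illustrative):
# Without a Linked Service - the notebook must know the Key Vault name
secret = mssparkutils.credentials.getSecret("my-key-vault", "MySecret")
# With a Linked Service - only the Linked Service name is referenced
secret = mssparkutils.credentials.getSecretWithLS("KeyVaultLinkedService", "MySecret")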
In order to authenticate and retrieve a valid token, we need:
- The Service Principal Client ID / Application ID
- The Service Principal Secret
- The Tenant ID
And using all of that, we can construct the request to retrieve an access token:
import requests
from notebookutils import mssparkutils  # available by default in Synapse notebooks; import shown for clarity

# Read the Service Principal details from Key Vault via the Linked Service
sp_client_id = mssparkutils.credentials.getSecretWithLS("KeyVaultLinkedService", "ServicePrincipalClientId")
sp_secret = mssparkutils.credentials.getSecretWithLS("KeyVaultLinkedService", "ServicePrincipalSecret")
tenant_id = mssparkutils.credentials.getSecretWithLS("KeyVaultLinkedService", "TenantId")

# The scope of your application
scope = "https://graph.microsoft.com/.default"

# The endpoint to get a token
token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"

# The data to send in the token request
token_data = {
    "client_id": sp_client_id,
    "client_secret": sp_secret,
    "scope": scope,
    "grant_type": "client_credentials"
}

response = requests.post(token_url, data=token_data)
response.raise_for_status()  # Raise an exception if the request failed
token = response.json()["access_token"]
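If you'll be calling the Graph API from several cells, it may be worth wrapping this in a small helper that caches the token until it's close to expiry. A sketch - the helper and its caching logic are my own addition, not part of the approach above:
import time

_token_cache = {"value": None, "expires_at": 0.0}

def get_graph_token():
    # Re-use the cached token until a minute before it expires
    if _token_cache["value"] and time.time() < _token_cache["expires_at"] - 60:
        return _token_cache["value"]
    response = requests.post(token_url, data=token_data)
    response.raise_for_status()
    body = response.json()
    _token_cache["value"] = body["access_token"]
    _token_cache["expires_at"] = time.time() + int(body["expires_in"])
    return _token_cache["value"]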
Get the SharePoint Drive ID for the SharePoint site
Now that we can authenticate, the process is essentially the same as it was for the previous pipeline-based approach, except we're issuing requests to the Graph API in Python code, rather than using Web Activities.
Firstly, we need the SharePoint Drive ID, which we retrieve from the Graph API using the details of the SharePoint site:
sharepoint_tenant_fqdn = "your-domain.sharepoint.com"
sharepoint_site_name = "site-name"
library_name = "library-name" # e.g. Shared%20Documents
web_url = f"https://{sharepoint_tenant_fqdn}/sites/{sharepoint_site_name}/{library_name}"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json; charset=utf-8",
    "Authorization": f"Bearer {token}"
}

# Get drives for site
list_drives_request_url = f"https://graph.microsoft.com/v1.0/sites/{sharepoint_tenant_fqdn}:/sites/{sharepoint_site_name}:/drives"
list_drives_response = requests.get(list_drives_request_url, headers=headers)
list_drives_response.raise_for_status()  # Raise an exception if the request failed

# Find the drive that matches on web URL
drive_id = next((d["id"] for d in list_drives_response.json()["value"] if d["webUrl"] == web_url), None)
if drive_id is None:
    raise ValueError(f"No drive found with webUrl {web_url}")
Get Download URL property of the file in SharePoint
Once we have the Drive ID, we can query for a specific file and retrieve the same pre-authorised Download URL:
file_name = "name-of-file"
url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root:/{file_name}?select=@microsoft.graph.downloadUrl"
response = requests.get(url, headers=headers)
response.raise_for_status()
download_url = response.json()["@microsoft.graph.downloadUrl"]
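And if you're dealing with lots of files - one of the motivations mentioned at the start - the same pattern extends to listing a folder's children and collecting a Download URL for each file. A sketch, where folder_path is an illustrative path within the library:
folder_path = "some/folder"  # illustrative

list_children_url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root:/{folder_path}:/children"
children_response = requests.get(list_children_url, headers=headers)
children_response.raise_for_status()

# File items (as opposed to folders) include a pre-authorised download URL
download_urls = {
    item["name"]: item["@microsoft.graph.downloadUrl"]
    for item in children_response.json()["value"]
    if "file" in item
}
For large folders you'd also need to follow the @odata.nextLink paging links, which I've omitted here for brevity.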
Read the contents of the file in SharePoint using Pandas
And that Download URL can be passed directly into a read_csv or read_excel call in Pandas, depending on the specifics of the file that you're accessing. Because it's pre-authorised, you don't need to deal with authentication:
import pandas as pd

df = pd.read_csv(download_url)
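If you then need the data as a Spark dataframe - to write it out as Parquet, say - a quick sketch (the output path is illustrative):
# Convert the Pandas dataframe to a Spark dataframe
spark_df = spark.createDataFrame(df)

# Write it to the Data Lake (path is illustrative)
spark_df.write.mode("overwrite").parquet("abfss://container@youraccount.dfs.core.windows.net/raw/my-file")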
Summary
Whilst ingesting data from files in SharePoint is a very common requirement in most of the data projects we deliver, there's no out-of-the-box support for it inside Synapse. This post and the code samples within demonstrate how to authenticate and query files using Python code running in a PySpark Notebook.
In the next few posts, I'll take what we've learned so far and see how it can be applied to writing structured data back to SharePoint, as well as translating the same requirements into Microsoft Fabric.