By Ian Griffiths, Technical Fellow I
Per-Property Rows from JSON in Spark on Microsoft Fabric

Spark's JSON handling has a tabular mindset. If your JSON could trivially be turned into a CSV file, you'll be fine. But not all JSON data is inherently tabular, and even some fairly basic ways of using JSON end up getting clumsy in Spark. For example, it's not uncommon for a single JSON object to contain multiple records, with the property name acting as a key. Here's some simple JSON with three key/value pairs:

{
    "workerType": "dotnet-isolated",
    "runtimeVersion": "8.0",
    "aotType": "R2R"
}

In the world of dataframes, there are a couple of ways we might want to represent this. I might want this:

Key Value
workerType dotnet-isolated
runtimeVersion 8.0
aotType R2R

But if I import that JSON directly into a Spark dataframe, it chooses the other reasonable option:

aotType runtimeVersion workerType
R2R 8.0 dotnet-isolated

There will be plenty of scenarios in which that's exactly how you'd want it to interpret such a file, but it's not what I happened to want here. If I turn my head on my side, it's kind of OK. And Spark does offer ways to rotate this sort of thing through 90°, but the data I was trying to import was a little more complex. It looked more like this:

{
    "dotnet.version": {
        "nameImplicitContext": ".NET Version",
        "nameExplicitContext": ".NET Version",

        "description": ".NET Runtime version"
    },

    "dotnet.aot" : {
        "nameImplicitContext": "AOT",
        "nameExplicitContext": ".NET AOT",

        "description": "The type of .NET AOT (Ahead of Time) compilation in use"
    }
}

And I had several such JSON files, each with a handful of properties, and the value of each property was an object with the same three properties. I wanted a row for each of these, something like this (with two of the rows here coming from a different file):

key                         nameImplicitContext  nameExplicitContext   description
dotnet.version              .NET Version         .NET Version          .NET Runtime version
dotnet.aot                  AOT                  .NET AOT              The type of .NET ...
azure.functions.sku         SKU                  Azure Functions SKU   The App Service P...
azure.functions.workerType  Worker               Azure Functions W...  The worker proces...

That wasn't quite what I got.

Here's a simplified version of what Spark gave me:

azure.functions.sku   azure.functions.workerType  dotnet.aot            dotnet.version
{The App Service ...  {The worker proce...        null                  null
null                  null                        {The type of .NET...  {.NET Runtime ver...

In reality, the table was much wider and a fair bit taller. (And I mean much wider. Even with the fairly small data set in this application, we were going to end up with 100s of columns.)

Spark created a column for every single property name. This is unsurprising once you realise that Spark expects JSON property names to identify column names. But it's not at all what I want here. In my data, these particular property names don't represent column names; they are effectively primary key values.

Moreover, this can no longer be fixed simply by rotating the entire table through 90°. We could have done that with the simpler earlier example because I had 1 row and lots of columns, which I wanted to translate into 1 column and lots of rows. But now we've got this slightly odd situation where each row has multiple columns with values, and also multiple null columns. (This happened because it has created 1 row per source JSON file. It hasn't understood that I want a row for each of the properties of the top-level object in each file.) If I rotate this, I'll still have a mess.
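In case it's useful, here's roughly how that 90° rotation could have been done for the earlier, simpler example. This is only a sketch (jdf and simpleJsonPath are illustrative names, not from my real pipeline), using Spark SQL's stack function to unpivot the columns into Key/Value rows:

# Load the simple one-row example shown earlier (illustrative names).
jdf = spark.read.option("multiline", "true").json(simpleJsonPath)

# stack emits one output row per (label, value) pair, which turns the
# three columns into the Key/Value table shown above.
rotated = jdf.selectExpr(
    "stack(3, "
    "  'workerType', workerType, "
    "  'runtimeVersion', runtimeVersion, "
    "  'aotType', aotType) as (Key, Value)")
rotated.show()

That works when there's one row and a fixed set of columns to unpivot, but as the rest of this section explains, it doesn't help with the more complex data.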

Per-property rows with Map and explode

The initial import step is going to produce one row per JSON object because apparently that's just how Spark works. But this isn't necessarily a problem. It's easy enough to transform a dataframe so that individual rows expand into multiple rows. One way to achieve this is with the explode function. That works with an array or map type column, and as the docs say, it:

Returns a new row for each element in the given array or map.

So if we can somehow represent all of the properties packaged into a single JSON object with an array or map column, we can then use explode to expand that into a row per property. (As a .NET developer, I think of this as a specialized form of SelectMany. With one of my other language hats on I might think of it as a specialized flatMap. Spark offers flatMap too, but explode is specifically for when you have either an array or map column type.)
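To make that concrete, here's a tiny standalone illustration of explode acting on a map column. The data here is made up purely to show the shape of the output:

import pyspark.sql.functions as F

# One row containing a single map column with two entries.
demo = spark.createDataFrame([({"a": 1, "b": 2},)], "m map<string,int>")

# explode turns that one row into one row per map entry.
demo.select(F.explode("m").alias("key", "value")).show()
# +---+-----+
# |key|value|
# +---+-----+
# |  a|    1|
# |  b|    2|
# +---+-----+   (entry order may vary)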

A map seems like the right column type here, because that is effectively a dictionary, and JSON objects are just dictionaries. A map column could directly represent the information shown in the JSON example above. The map would contain two entries, with the keys being dotnet.version and dotnet.aot.

With Spark, we can't just say "this is a dictionary". It wants to know both the key and the value type up front. That means we can't accept any old JSON. We have to say exactly what the values will look like. But that's fine in this case: my values all have the same shape. So I can express that as a schema in pyspark:

import pyspark.sql.types as T

sch = T.MapType(
    T.StringType(),
    T.StructType([
        T.StructField("nameImplicitContext", T.StringType()),
        T.StructField("nameExplicitContext", T.StringType()),
        T.StructField("description", T.StringType())
    ]))

This represents pretty well the structure I expect in my source JSON. Unfortunately, when I try to use this schema when importing the JSON:

jdf = spark.read.option("multiline","true").load(jsonFilesPath, format='json', schema=sch)

I get this error:

TypeError: schema should be StructType or string

When you load a whole source file as JSON, Spark's load method won't let you use a map as the root of the schema. This is somewhat frustrating, as there doesn't seem to be any fundamental reason why it can't do this.

Importing JSON as MapType

Spark is perfectly capable of processing JSON data as a map; you just have to get a little more indirect. The approach I found was to load the data initially as plain text. Once it was in a dataframe, it was then possible to parse it with the from_json function, which, unlike the load method, is happy to produce a map:

import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.read.text(tagMetadataPath, wholetext=True)
df = df.select(F.from_json("value", schema=sch))

Note: in my application, I have one JSON object per file, which is why I've specified wholetext=True. Typically, Spark assumes that you have one record per line in a file. If that's the case for your data, you would not want that setting.

This produces a single-column table with one row per file. The single column contains a map with one entry for each property in the original JSON:

entries
{dotnet.version -> {.NET Version, .NET Version, .NET Runtime version}, dotnet.aot -> {AOT, .NET AOT, The type of .NET AOT (Ahead of Time) compilation in use}}
{azure.functions.sku -> {SKU, Azure Functions SKU, The App Service Plan type being used by this function}, azure.functions.workerType -> {Worker, Azure Functions Worker Type, The worker process type, e.g. dotnet-isolated or dotnet}}

(The output from the dataframe's show method has made it look like the values for each entry have lost their property names. In fact, the dataframe has preserved the nameImplicitContext, nameExplicitContext, and description property names; it's just that PySpark doesn't display these when you ask it to show you the dataframe.)
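If you want to reassure yourself that those names really are still there, printSchema reports the map's value type in full. The output below is approximate; the column is called entries here because that's the name visible in the show output above (at least on my runtime):

df.printSchema()
# root
#  |-- entries: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: struct (valueContainsNull = true)
#  |    |    |-- nameImplicitContext: string (nullable = true)
#  |    |    |-- nameExplicitContext: string (nullable = true)
#  |    |    |-- description: string (nullable = true)

# show(truncate=False) will also print the full entry values rather than
# cutting them off after 20 characters.
df.show(truncate=False)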

Now that we've got a map, we can use explode so that instead of getting one row per JSON file, we get one row per property. We can replace the final line of the previous snippet:

df = df.select(F.explode(F.from_json("value", schema=sch)))

This produces the following output:

key value
azure.functions.sku {SKU, Azure Funct...
azure.functions.workerType {Worker, Azure Fu...
dotnet.version {.NET Version, .N...
dotnet.aot {AOT, .NET AOT, T...

My input in this example was 2 files, with 2 properties each, and this has successfully expanded them out to 4 rows. And now it's very straightforward to expand this into the kind of table I wanted:

df = df.select("key", "value.nameImplicitContext", "value.nameExplicitContext", "value.description")

That gives me this output:

key                         nameImplicitContext  nameExplicitContext     description
azure.functions.sku         SKU                  Azure Functions SKU     The App Service P...
azure.functions.workerType  Worker               Azure Functions Wor...  The worker proces...
dotnet.version              .NET Version         .NET Version            .NET Runtime version
dotnet.aot                  AOT                  .NET AOT                The type of .NET ...
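For reference, here's the whole thing in one place. This is just the snippets above stitched together, with the same assumed variable names (tagMetadataPath is wherever the JSON files live):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# The expected shape of each property value in the source JSON.
sch = T.MapType(
    T.StringType(),
    T.StructType([
        T.StructField("nameImplicitContext", T.StringType()),
        T.StructField("nameExplicitContext", T.StringType()),
        T.StructField("description", T.StringType())
    ]))

# Read each file as a single text value, parse it as a map, explode the map
# into one row per property, then flatten the struct into ordinary columns.
df = spark.read.text(tagMetadataPath, wholetext=True)
df = df.select(F.explode(F.from_json("value", schema=sch)))
df = df.select("key", "value.nameImplicitContext",
               "value.nameExplicitContext", "value.description")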

Handling arbitrary JSON

In this example, the data was essentially tabular. Every record has a key and three properties. The only reason I ran into difficulties was that my JSON was superficially structured in a way that Spark does not seem inherently to understand. The underlying structure (independent of any particular JSON representation) is something Spark certainly can cope with. The only aspect that was not known until runtime was the top-level property names, and this is reflected in my schema:

T.MapType(
    T.StringType(),
    T.StructType([
        T.StructField("nameImplicitContext", T.StringType()),
        T.StructField("nameExplicitContext", T.StringType()),
        T.StructField("description", T.StringType())
    ]))

The map type gives me the flexibility to have any property name (key) I like, but it still wants me to constrain the value shape. And that was fine for my scenario.

But what if you have data where not all the values have the same shape? As a .NET developer, I might ask how I would model IDictionary<string, JsonObject> in Spark.

This is a fundamentally different sort of problem. This isn't just a case of the underlying structure being mapped to JSON in a way that Spark doesn't understand. The problem here is that the underlying structure is something Spark dataframes basically can't cope with! It is a basic tenet of Spark dataframes that the column type is known up front. (And my understanding is that some physical representations of dataframes rely on this fact. So this is a deep assumption baked into Spark.)

The only way to do this in Spark seems to be a fudge: you can sort of do it with a map of String to String. We could write this:

df = spark.read.text(tagMetadataPath, wholetext=True)

sch = T.MapType(T.StringType(), T.StringType())
df = df.select(F.from_json("value", schema=sch))

If you do this with JSON of the form shown above, it works just fine, and the values are just JSON text. Effectively, we've decoded just the top layer of JSON to get a map with one entry per property, but each of the property values looks like plain text to Spark. We happen to know that the text is in JSON form, but Spark doesn't. That means it's possible for each value to have a completely different type. From Spark's perspective, every property has a string type, so it is satisfied. Of course, this makes the internal structure of those values completely opaque to Spark, so we'd have to do further processing to work with the information in that JSON. Perhaps your data might have some sort of discriminator field enabling you to work out which particular format to expect, so you could filter based on that before parsing out the nested JSON in these maps.
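Here's a sketch of what that two-stage parsing could look like. Everything past the outer map parse is invented for illustration: I'm imagining that the value objects carry a kind discriminator and that the "widget" flavour has a known shape, so get_json_object can do the filtering and a second from_json can do the proper parsing.

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Stage 1: parse only the outer object; each property value stays as raw JSON text.
text_df = spark.read.text(tagMetadataPath, wholetext=True)
outer = text_df.select(
    F.explode(F.from_json("value", T.MapType(T.StringType(), T.StringType())))
     .alias("key", "json"))

# Stage 2: pick out one flavour of value using a hypothetical "kind"
# discriminator, then parse those values with a schema that fits them.
widget_schema = T.StructType([
    T.StructField("kind", T.StringType()),
    T.StructField("size", T.IntegerType())
])

widgets = (outer
    .where(F.get_json_object("json", "$.kind") == "widget")
    .select("key", F.from_json("json", widget_schema).alias("value")))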

Summary

Spark seems to presume that one JSON object will correspond to one row in a dataframe. If that's not what we want, we can work around that by using either a map or an array, and then using explode to expand a row into as many rows as the original JSON represented. By using a map, we can deal with JSON where we don't necessarily know ahead of time what the property names will be. That makes it possible to represent data (e.g., keys) as properties in the JSON. Since this is not the natural way of doing things in Spark, we have to jump through a couple of hoops to make it work, but it can be done.

Ian Griffiths

Technical Fellow I

Ian has worked in various aspects of computing, including computer networking, embedded real-time systems, broadcast television systems, medical imaging, and all forms of cloud computing. Ian is a Technical Fellow at endjin, and 17 times Microsoft MVP in Developer Technologies. He is the author of O'Reilly's Programming C# 12.0, and has written Pluralsight courses on WPF fundamentals (WPF advanced topics, WPF v4) and the TPL. He's a maintainer of Reactive Extensions for .NET, Reaqtor, and endjin's 50+ open source projects. Ian has given over 20 talks while at endjin. Technology brings him joy.