Reading Nested XML Data in SSIS
SSIS is really good at reading in XML data, breaking it apart, and building out your rows. Let’s look at an example of how this all works together. First, here is our XML document.
<people> <person> <name> <first_name>Maddox</first_name> <middle_name>Xavier</middle_name> <last_name>Spears</last_name> </name> <addresses> <address> <line1>885 Whiteman Street</line1> <line2>Pleasantville, NJ 08232</line2> </address> <address> <line1>4229 Williams Mine Road</line1> <line2>Newark, NJ 07102</line2> </address> </addresses> </person> ... </people>
We have a list of people. Each person has a name and a list of associated addresses. The name is broken out into its component parts, and each address is broken into
Let’s create our tables to store this data. We will need a
Person table and an
create table [dbo].[Person] ( [PersonId] bigint not null, [FirstName] varchar(50) null, [MiddleName] varchar(50) null, [LastName] varchar(50) null, [InsertedDate] datetimeoffset not null default sysdatetimeoffset() ); create table [dbo].[Address] ( [PersonId] bigint not null, [Line1] varchar(50) null, [Line2] varchar(50) null, [InsertedDate] datetimeoffset not null default sysdatetimeoffset() );
Step 1: Creating a new batch
This is a personal preference, so you can skip this step if you’d like.
Start by creating a
Batch table. This will store the timestamp and filename for each time we run our SSIS job. Yes, this is overkill for a demo, but it is a good idea to teach good SSIS habits, too. We will also store the
BatchId on each record.
-- Create the new Batch table. create table [dbo].[Batch] ( [BatchId] int not null identity(1, 1), [StartTime] datetimeoffset null, [Filename] varchar(200) null, constraint [PK_dbo.Batch] primary key clustered (BatchId) ); -- Create a new stored procedure for creating a new batch entry. create procedure [dbo].[InsertBatch] @startTime datetimeoffset = null, @filename varchar(200) = null as begin begin transaction; insert into dbo.Batch ( StartTime, Filename ) values ( @startTime, @filename ); select top (1) BatchId, StartTime, Filename from dbo.Batch where BatchId = scope_identity(); commit transaction; end -- Add BatchId column to Person. alter table [dbo].[Person] add [BatchId] int not null; -- Add BatchId column to Address. alter table [dbo].[Address] ass [BatchId] int not null;
Our first SSIS step is to create a new record in this table and save the ID. Add a new Execute SQL task to your workflow. I have already added an ADO.NET connection to my database. My SQL task will use this existing connection.
Next, set the input parameters for the stored procedure -
Finally, we need to capture the result set. Create a user variable called
BatchId, and save the result to that variable. We set the Result Set to “Single Row” on the general page. Here, the
BatchId is the first value returned (0-indexed). We will use it later when we insert our XML data into tables.
Step 2: Delete the existing data
This one is easy. We want to prepare our tables for incoming data. Create a procedure that will wipe away all existing data.
create procedure [dbo].[TruncateTables] as begin begin transaction; delete from [dbo].[Address]; delete from [dbo].[Person]; commit transaction; end;
Call this procedure with an Execute SQL task. This time, there are no parameters or results.
Step 3: Fun with XML files
Now that we have properly prepared our tables, we are ready to read our file and insert new rows. This is done with a Data Flow Task. Point your file connection to the correct place on your drive. Hit the button for “Generate XSD…”.
This will scan your XML and read every node. It will also create appropriate IDs so nodes can be linked. Next, data must be sorted. Every node must be sorted, and this XML has four such nodes -
After all data has been sorted, we are ready to create joins. We need a join to turn four types of nodes into two tables. The
Person join combines the
name node types, and it looks like the following.
And here is the
Address join. This combines the
address nodes. Note that the
addresses node contains the
person_ID value, which we will need to link addresses to people.
Next up, add the
BatchId value to both the
Address sets. This is easily accomplished with a Derived Column Transformation. This is what it looks like for
Address derived column is the same.
Finally, we are ready to store our data. Create two ADO.NET Destinations, one for
Person and one for
Address. This is what the mappings for
Person should look like. Map your in-memory variables to your database columns.
The following is the completed data flow task.
Step 4: The Final Result
Let’s fire up SQL Management Studio and check out the results.
The source code is available on Github. Both the SSIS job and the SQL database project are available for you to review.