DCSIMG
Importing Large Xml Files to SQL Server Using SqlBulkCopy - .NET Geek

.NET Geek

"It is upon the Trunk that a gentleman works" - Confucius

Importing Large Xml Files to SQL Server Using SqlBulkCopy

Say you have a large Xml file that contains relatively tabular data that you want to import into SQL Server. There are several ways to go about this. Let’s look at a couple of options.

  1. Load the file into an XDocument. Extract elements from the DOM using Linq and then use ADO.Net to insert the data into the database.
  2. Load the data into a DataSet using ReadXml and save the data to the database
  3. Read through the data using an Xml reader and save each record to the database

Options number 1 and 2 requires that we load the entire document into memory before processing. We will work under the assumption that the files we receive are too large to load into memory. That leaves with option number 3 which allows us to read the file in a fast forward only mode where we only hold a small portion of the file in memory. Once an item has been processed and the reader moves forward, the previously read data is unavailable to the reader. The main bottleneck now is how do we push the data in a really efficient way into the database. First of all let’s look at some sample data.

<?xml version="1.0" encoding="utf-8"?>
<lab_results>
  <result type="A01" origin="xb102">
    <name>aaa</name>
    <description>sample description1</description>
  </result>
  <result type="A02" origin="xb103">
    <name>bbb</name>
    <description>sample description2</description>
  </result>
</lab_results> 

The files we are about to process contain lab results. Each file will contain somewhere between 5-8 million result elements. Each result element will contain approximately 150 characters. 150 characters * 2 (Unicode) * 5 million = about 1.5GB per file. Obviously we can’t read that into memory all at once. Another detail is that the file contains type and origin codes which needs to be mapped to their appropriate values in the database.

Here’s the sample table structure that will have to hold the data we import.

lab results

LabResultOrigin and LabResultType are standard lookup tables. Here’s some sample data so we can conceptualize the conversions we will need to make later.

LabResultType

TypeId Description
1 A01
2 A02
3 B01

LabResultOrigin

OriginId Description
1 xb102
2 xb103
3 xz101

For example, when we process the Xml file we will need to convert the from the textual result type of “A01” to 1.

Importing the data

When importing using SqlBulkCopy there are two ways to feed it data. One is to use a DataTable which we discarded above because of the size of the files. Another option is to use a DataReader. The problem is that there is no class in .Net that reads xml and implements the IDataReader interface.

That means we will have to roll our own. Looking at the IDataReader interface it seems we will have to implement a boatload of methods. Or do we?
Here’s a neat little trick that was mentioned in the passing in a comment by “jezemine” in this blog post.

  1. Create a class that implements IDataReader.
  2. Press Alt+Shift+F10 to implement the interface members (with a throw NotImplementedException)
  3. Pass your class to SqlBulkCopy. It will break on a NotImplementedException of course.
  4. Go ahead and implement the method it crashed on.
  5. Repeat steps #3 and #4 until all required methods have been implemented.

Only 3 out of more than 20 methods in IDataReader are actually used by SqlBulkCopy. Not too bad.

We will create an XmlDataReader class that implements IDataReader so that we can pass it to SqlBulkCopy. The calling code will look something like the following:

using (XmlTextReader xmlTextReader = new XmlTextReader(message.FileName))
using (SqlBulkCopy sqlBulkCopy = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
{
    SetupBulkCopy(sqlBulkCopy);
 
    var reader = new LabResultDataReader(xmlTextReader);
    sqlBulkCopy.WriteToServer(reader);
}

As you can see, there is no mention of an XmlDataReader, but rather a LabResultDataReader. I decided to split the reader into two distinct pieces. One is a general purpose XmlDataReader that can be used for other xml structures and not only the one in the sample. A special purpose LabResultDataReader is derived from XmlDataReader and this class knows how to handle the structure of the lab result xml file.

In order to tell SqlBulkCopy how to map between the data in the file and the columns in the database we need some plumbing as seen in the call to SetupBulkCopy().

private static void SetupBulkCopy(SqlBulkCopy sqlBulkCopy)
{
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(0, "TypeId"));
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(1, "OriginId"));
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(2, "ResultName"));
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(3, "Description"));
 
    sqlBulkCopy.DestinationTableName = "LabResult";
}

Note that we didn’t include the LabResultId column in the mapping above. The reason for this is that the LabResultId is an Identity column and is generated automatically by SQL Server.

The general purpose XmlDataReader is as follows:

public abstract class XmlDataReader : IDataReader
{
    private readonly string m_rowElementName;
 
    private readonly XmlReader m_xmlReader;
    private readonly int m_fieldCount = -1;
 
    private bool m_disposed;
 
    protected IEnumerator<XElement> m_enumerator;
 
    public abstract object GetValue(int i);
 
    /// <summary>
    /// Initialize the XmlDataStreamer. After initialization call Read() to move the reader forward.
    /// </summary>
    /// <param name="xmlReader">XmlReader used to iterate the data. Will be disposed by when done.</param>
    /// <param name="fieldCount">IDataReader FiledCount.</param>
    /// <param name="rowElementName">Name of the XML element that contains row data</param>
    public XmlDataReader(XmlReader xmlReader, int fieldCount, string rowElementName)
    {
        m_rowElementName = rowElementName;
        m_fieldCount = fieldCount;
        m_xmlReader = xmlReader;
        m_enumerator = GetXmlStream().GetEnumerator();
    }
 
    public bool Read()
    {
        return m_enumerator.MoveNext();
    }
 
    public int FieldCount
    {
        get { return m_fieldCount; }
    }
 
    public XElement CurrentElement
    {
        get { return m_enumerator.Current; }
    }
 
    /// <summary>
    /// http://msdn.microsoft.com/en-us/library/system.xml.linq.xstreamingelement.aspx
    /// </summary>
    /// <param name="m_xmlReader"></param>
    /// <returns></returns>
    private IEnumerable<XElement> GetXmlStream()
    {
        XElement rowElement;
        using (m_xmlReader)
        {
            m_xmlReader.MoveToContent();
 
            while (m_xmlReader.Read())
            {
                if (IsRowElement())
                {
                    rowElement = XElement.ReadFrom(m_xmlReader) as XElement;
                    if (rowElement != null)
                    {
                        yield return rowElement;
                    }
                }
            }
        }
    }
 
    private bool IsRowElement()
    {
        if (m_xmlReader.NodeType != XmlNodeType.Element)
            return false;
 
        return m_xmlReader.Name == m_rowElementName;
    }
 
    protected virtual void Dispose()
    {
        if (m_disposed)
            return;
 
        m_enumerator.Dispose();
        m_disposed = true;
    }
 
 
    #region Members not required by SqlBulkCopy
 
    #region IDataReader Members
 
 
 
    public bool NextResult()
    {
        throw new NotImplementedException();
    }
 
    public int RecordsAffected
    {
        get { throw new NotImplementedException(); }
    }
 
    public string GetDataTypeName(int i)
    {
        throw new NotImplementedException();
    }
 
    // Deleted tons of methods not required...
}

Now let’s have a look at the special purpose LabResult reader.

public class LabResultDataReader : XmlDataReader
{
    private const string XmlTagRow = "result";
 
    private const int FieldCount = 4;
    private const int InvalidItemId = -1;
 
    public LabResultDataReader(XmlReader xmlReader)
        : base(xmlReader, FieldCount, XmlTagRow){ }
 
    public override object GetValue(int i)
    {
        switch (i)
        {
            case 0:
                return CurrentElement.Attribute("type").EnumFromValue<ResultType>();
            case 1:
                return CurrentElement.Attribute("origin").EnumFromValue<Origin>();
            case 2:
                return CurrentElement.Element("name").value;
            case 3:
                return CurrentElement.Element("description").value;
            default:
                throw new InvalidOperationException("Column count mismatch.");
        }
    }
}

There are two main features of the XmlDataReader. The first is that it flattens the Xml hierarchical structure to resemble a row. For each iteration of the enumerator, it extracts the <result> element and all its children. The special purpose reader (in our example the LabResultsDataReader) maps the nested <result> tag to a set of column values. The second significant feature is the streaming nature of the reader. We never hold more than a single <result> element in memory at any given time.

If you look at the code above and wondered what are those .EnumFromValue<> methods on the XmlAttributes, you were right. These are just Extension methods that I added for convenience.

public static T EnumFromValue<T>(this XAttribute attribute)
{
    string value = attribute.Value;
    if (string.IsNullOrEmpty(value))
        return default(T);
 
    try
    {
        T converted = (T)Enum.Parse(typeof(T), value, true);
        return converted;
    }
    catch (Exception)
    {
        return default(T);
    }
}

As you can see, despite that this turned into a not so short post, there’s not that much code required for this to work. Hopefully this will give you enough to assist you in building your own high performance ETL solution.

A few drawbacks of the proposed solution. I have only had to process a very limited set of files. In the case where many files of different schema would have to be parsed and processed, the design of having a designated derived class for each type of file could become a pain point. There is no support for parsing files with a schema that is not known at compile time. This implies that if the schema changes or a new type of file needs to be imported, a new build is needed.

Suggestions, comments, improvements etc. are welcome.

kick it on DotNetKicks.com

Comments

DotNetKicks.com said:

You've been kicked (a good thing) - Trackback from DotNetKicks.com

# June 16, 2009 1:44 AM

linqed said:

Great job!

Just the thing I needed for one of our next projects.

Thanks,

Vlad

# June 17, 2009 12:18 AM

Ralph said:

You can't imagine how helpful this post was to me. Thank you very much. Keep up the amazing work and the desire to help others!

# October 18, 2009 12:19 PM

danny said:

hey thanks a lot for this excellent post.

I just had a question:

I followed the instruction in the article and implemented a whole bunch of methods that simply throw NotImplementedException(); to make the code compile, but i'm now getting the following errors:

'codeTest.XmlDataReader' does not implement interface member 'System.Data.IDataRecord.this[string]'

'codeTest.XmlDataReader' does not implement interface member 'System.Data.IDataRecord.this[int]'

I don't know what field to include in the XmlDataReader to satisfy the interface requirements.

any help is greatly appreciated!

# October 29, 2009 4:50 AM

Kim said:

Just add the following to your implementing class.

public object this[string name]

{

   get { throw new NotImplementedException(); }

}

public object this[int i]

{

   get { throw new NotImplementedException(); }

}

# October 29, 2009 8:37 PM

Omar Soto said:

Hello:

 I have been able to create what you proposed and having it compile.  It is a web based application.  It triggers on a ButtonClick, reads the file, but the following line gives me a null:

CurrentElement.Element("name").value

It cannot find the element in the file.  Any idea what I might me missing?

Thanks

# December 24, 2009 9:01 PM

Anonimouse said:

Just a note-- great work here, a true lifesaver. Though, I was having problems with GetXmlStream(). The code worked fine, but my implementation (identical to that above, or so I thought) was skipping every other entry from my XML file. I do not know if anyone else has had this problem, but I was able to solve it by changing the while loop to have a "preRead" flag, to let it know not to read again--- it would appear that the XElement.ReadFrom() method will perform a read, so every time we found one of my records, it would skip the next. So, when that gets called, I have it set the flag to true, so it doesn't double read, and a regular iteration sets it back to false again.

In GetXmlStream()

bool preRead = false;

while (preRead == true || m_xmlReader.Read())

{

...

if (IsRowElement())

rowElement = XElement.ReadFrom()....

preRead = true;

...

}

else

{

preRead = false;

}

}

I hope that this helps someone if they hit the same issue that I did.

Again, thank you for the article.

# February 8, 2011 11:11 PM

Chris said:

Why am I getting a "Expected class, delegate, enum, interface, or struct" Error with :

public static T EnumFromValue<T>(this XAttribute attribute)

# May 22, 2011 8:50 AM

Gokul said:

Hello,

Do you have any suggestions on making this operation multi-threaded so that we can improve the performance?

Thanks,

Gokul

# August 28, 2011 3:41 AM

jay said:

I am new to c#, can someone please post the solution project? I am not sure where to place EnumFromValue<> method

thanks

# May 3, 2012 12:32 AM

Ryan said:

Anonimouse's comment is dead on.  I had a skipping row issue (XMLReader read by SqlBulkCopy)...result in database was always skipped records.

Anonimouse's approach works perfectly.  Thanks for posting.

# August 9, 2012 4:04 AM

toiyeuck@yahoo.com said:

great document.

This post is very clear and easy understand. Thanks!

# August 15, 2012 10:30 AM

Brandon said:

Excellent post! I'm jst getting into the world of importing XML files to SQL servers. It's been a bit of a learning curve so far for me, but this gave me some great insight into what I needed to do better. Thanks!

# October 9, 2012 11:42 PM

Naveen said:

I am new to c#, I got a task to Load XML data to SQL Server database. I think this post will be usefull for me. But I dont know how to impliment this code in my project.(How to use this code? or Where to copy this code?)

can someone please post the solution project?

thanks

# November 2, 2012 6:56 PM

Raney said:

You made some really good points there. I looked on the net

to learn more about the issue and found most people will

go along with your views on this web site.

# December 20, 2012 10:02 AM

Artis said:

Post writing is also a excitement, if you be acquainted with after

that you can write or else it is complicated to write.

# January 1, 2013 5:34 PM

Guillory said:

For the reason that the admin of this website is working, no

hesitation very quickly it will be well-known, due to its feature contents.

# January 14, 2013 9:04 AM

Gant said:

I pay a quick visit every day some web sites and sites to read

content, except this blog provides feature based posts.

# January 16, 2013 2:47 AM

Pinkerton said:

You made some decent points there. I checked on the web to find out more about the issue and

found most individuals will go along with your views on this website.

# February 2, 2013 9:09 PM

Carvalho said:

I think the admin of this web page is in fact working hard in support of his web site,

for the reason that here every stuff is quality

based stuff.

# February 25, 2013 12:49 AM

DypebreepaY said:

stiri din romania

# February 26, 2013 4:15 PM

Streeter said:

You actually make it seem so easy with your presentation but I find this matter

to be really something which I think I would never understand.

It seems too complex and very broad for me. I am looking forward for your next post, I'll try to get the hang of it!

# March 1, 2013 12:55 AM

Raley said:

Having read this I believed it was rather enlightening.

I appreciate you taking the time and effort to put this information together.

I once again find myself spending a significant amount of

time both reading and leaving comments. But so what, it was still

worth it!

# March 6, 2013 7:28 PM
Leave a Comment

(required) 

(required) 

(optional)

(required) 


Enter the numbers above: