Importing Large Xml Files to SQL Server Using SqlBulkCopy

15 ביוני 2009

25 תגובות

Say you have a large Xml file that contains relatively tabular data that you want to import into SQL Server. There are several ways to go about this. Let’s look at a couple of options.



  1. Load the file into an XDocument. Extract elements from the DOM using Linq and then use ADO.Net to insert the data into the database.

  2. Load the data into a DataSet using ReadXml and save the data to the database

  3. Read through the data using an Xml reader and save each record to the database

Options number 1 and 2 requires that we load the entire document into memory before processing. We will work under the assumption that the files we receive are too large to load into memory. That leaves with option number 3 which allows us to read the file in a fast forward only mode where we only hold a small portion of the file in memory. Once an item has been processed and the reader moves forward, the previously read data is unavailable to the reader. The main bottleneck now is how do we push the data in a really efficient way into the database. First of all let’s look at some sample data.



<?xml version="1.0" encoding="utf-8"?>
<lab_results>
  <result type="A01" origin="xb102">
    <name>aaa</name>
    <description>sample description1</description>
  </result>
  <result type="A02" origin="xb103">
    <name>bbb</name>
    <description>sample description2</description>
  </result>
</lab_results> 

The files we are about to process contain lab results. Each file will contain somewhere between 5-8 million result elements. Each result element will contain approximately 150 characters. 150 characters * 2 (Unicode) * 5 million = about 1.5GB per file. Obviously we can’t read that into memory all at once. Another detail is that the file contains type and origin codes which needs to be mapped to their appropriate values in the database.


Here’s the sample table structure that will have to hold the data we import.


lab results


LabResultOrigin and LabResultType are standard lookup tables. Here’s some sample data so we can conceptualize the conversions we will need to make later.


LabResultType















TypeId Description
1 A01
2 A02
3 B01

LabResultOrigin















OriginId Description
1 xb102
2 xb103
3 xz101

For example, when we process the Xml file we will need to convert the from the textual result type of “A01” to 1.


Importing the data


When importing using SqlBulkCopy there are two ways to feed it data. One is to use a DataTable which we discarded above because of the size of the files. Another option is to use a DataReader. The problem is that there is no class in .Net that reads xml and implements the IDataReader interface.


That means we will have to roll our own. Looking at the IDataReader interface it seems we will have to implement a boatload of methods. Or do we?
Here’s a neat little trick that was mentioned in the passing in a comment by “jezemine” in this blog post.



  1. Create a class that implements IDataReader.

  2. Press Alt+Shift+F10 to implement the interface members (with a throw NotImplementedException)

  3. Pass your class to SqlBulkCopy. It will break on a NotImplementedException of course.

  4. Go ahead and implement the method it crashed on.

  5. Repeat steps #3 and #4 until all required methods have been implemented.

Only 3 out of more than 20 methods in IDataReader are actually used by SqlBulkCopy. Not too bad.


We will create an XmlDataReader class that implements IDataReader so that we can pass it to SqlBulkCopy. The calling code will look something like the following:


using (XmlTextReader xmlTextReader = new XmlTextReader(message.FileName))
using (SqlBulkCopy sqlBulkCopy = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
{
    SetupBulkCopy(sqlBulkCopy);
 
    var reader = new LabResultDataReader(xmlTextReader);
    sqlBulkCopy.WriteToServer(reader);
}

As you can see, there is no mention of an XmlDataReader, but rather a LabResultDataReader. I decided to split the reader into two distinct pieces. One is a general purpose XmlDataReader that can be used for other xml structures and not only the one in the sample. A special purpose LabResultDataReader is derived from XmlDataReader and this class knows how to handle the structure of the lab result xml file.


In order to tell SqlBulkCopy how to map between the data in the file and the columns in the database we need some plumbing as seen in the call to SetupBulkCopy().


private static void SetupBulkCopy(SqlBulkCopy sqlBulkCopy)
{
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(0, "TypeId"));
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(1, "OriginId"));
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(2, "ResultName"));
    sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(3, "Description"));
 
    sqlBulkCopy.DestinationTableName = "LabResult";
}

Note that we didn’t include the LabResultId column in the mapping above. The reason for this is that the LabResultId is an Identity column and is generated automatically by SQL Server.


The general purpose XmlDataReader is as follows:


public abstract class XmlDataReader : IDataReader
{
    private readonly string m_rowElementName;
 
    private readonly XmlReader m_xmlReader;
    private readonly int m_fieldCount = -1;
 
    private bool m_disposed;
 
    protected IEnumerator<XElement> m_enumerator;
 
    public abstract object GetValue(int i);
 
    /// <summary>
    /// Initialize the XmlDataStreamer. After initialization call Read() to move the reader forward.
    /// </summary>
    /// <param name="xmlReader">XmlReader used to iterate the data. Will be disposed by when done.</param>
    /// <param name="fieldCount">IDataReader FiledCount.</param>
    /// <param name="rowElementName">Name of the XML element that contains row data</param>
    public XmlDataReader(XmlReader xmlReader, int fieldCount, string rowElementName)
    {
        m_rowElementName = rowElementName;
        m_fieldCount = fieldCount;
        m_xmlReader = xmlReader;
        m_enumerator = GetXmlStream().GetEnumerator();
    }
 
    public bool Read()
    {
        return m_enumerator.MoveNext();
    }
 
    public int FieldCount
    {
        get { return m_fieldCount; }
    }
 
    public XElement CurrentElement
    {
        get { return m_enumerator.Current; }
    }
 
    /// <summary>
    /// http://msdn.microsoft.com/en-us/library/system.xml.linq.xstreamingelement.aspx
    /// </summary>
    /// <param name="m_xmlReader"></param>
    /// <returns></returns>
    private IEnumerable<XElement> GetXmlStream()
    {
        XElement rowElement;
        using (m_xmlReader)
        {
            m_xmlReader.MoveToContent();
 
            while (m_xmlReader.Read())
            {
                if (IsRowElement())
                {
                    rowElement = XElement.ReadFrom(m_xmlReader) as XElement;
                    if (rowElement != null)
                    {
                        yield return rowElement;
                    }
                }
            }
        }
    }
 
    private bool IsRowElement()
    {
        if (m_xmlReader.NodeType != XmlNodeType.Element)
            return false;
 
        return m_xmlReader.Name == m_rowElementName;
    }
 
    protected virtual void Dispose()
    {
        if (m_disposed)
            return;
 
        m_enumerator.Dispose();
        m_disposed = true;
    }
 
 
    #region Members not required by SqlBulkCopy
 
    #region IDataReader Members
 
 
 
    public bool NextResult()
    {
        throw new NotImplementedException();
    }
 
    public int RecordsAffected
    {
        get { throw new NotImplementedException(); }
    }
 
    public string GetDataTypeName(int i)
    {
        throw new NotImplementedException();
    }
 

    // Deleted tons of methods not required…

}

Now let’s have a look at the special purpose LabResult reader.


public class LabResultDataReader : XmlDataReader
{
    private const string XmlTagRow = "result";
 
    private const int FieldCount = 4;
    private const int InvalidItemId = -1;
 
    public LabResultDataReader(XmlReader xmlReader)
        : base(xmlReader, FieldCount, XmlTagRow){ }
 
    public override object GetValue(int i)
    {
        switch (i)
        {
            case 0:
                return CurrentElement.Attribute("type").EnumFromValue<ResultType>();
            case 1:
                return CurrentElement.Attribute("origin").EnumFromValue<Origin>();
            case 2:
                return CurrentElement.Element("name").value;
            case 3:
                return CurrentElement.Element("description").value;
            default:

                throw new InvalidOperationException("Column count mismatch.");
        }
    }
}

There are two main features of the XmlDataReader. The first is that it flattens the Xml hierarchical structure to resemble a row. For each iteration of the enumerator, it extracts the <result> element and all its children. The special purpose reader (in our example the LabResultsDataReader) maps the nested <result> tag to a set of column values. The second significant feature is the streaming nature of the reader. We never hold more than a single <result> element in memory at any given time.


If you look at the code above and wondered what are those .EnumFromValue<> methods on the XmlAttributes, you were right. These are just Extension methods that I added for convenience.


public static T EnumFromValue<T>(this XAttribute attribute)
{
    string value = attribute.Value;
    if (string.IsNullOrEmpty(value))
        return default(T);
 
    try
    {
        T converted = (T)Enum.Parse(typeof(T), value, true);
        return converted;
    }
    catch (Exception)
    {
        return default(T);
    }
}

As you can see, despite that this turned into a not so short post, there’s not that much code required for this to work. Hopefully this will give you enough to assist you in building your own high performance ETL solution.


A few drawbacks of the proposed solution. I have only had to process a very limited set of files. In the case where many files of different schema would have to be parsed and processed, the design of having a designated derived class for each type of file could become a pain point. There is no support for parsing files with a schema that is not known at compile time. This implies that if the schema changes or a new type of file needs to be imported, a new build is needed.


Suggestions, comments, improvements etc. are welcome.

kick it on DotNetKicks.com

הוסף תגובה
facebook linkedin twitter email

כתיבת תגובה

האימייל לא יוצג באתר. (*) שדות חובה מסומנים

25 תגובות

  1. Vlad Azarkhin17 ביוני 2009 ב 0:18

    Great job!
    Just the thing I needed for one of our next projects.
    Thanks,
    Vlad

    להגיב
  2. Ralph18 באוקטובר 2009 ב 12:19

    You can't imagine how helpful this post was to me. Thank you very much. Keep up the amazing work and the desire to help others!

    להגיב
  3. danny29 באוקטובר 2009 ב 4:50

    hey thanks a lot for this excellent post.

    I just had a question:
    I followed the instruction in the article and implemented a whole bunch of methods that simply throw NotImplementedException(); to make the code compile, but i'm now getting the following errors:

    'codeTest.XmlDataReader' does not implement interface member 'System.Data.IDataRecord.this[string]'

    'codeTest.XmlDataReader' does not implement interface member 'System.Data.IDataRecord.this[int]'

    I don't know what field to include in the XmlDataReader to satisfy the interface requirements.

    any help is greatly appreciated!

    להגיב
  4. Kim29 באוקטובר 2009 ב 20:37

    Just add the following to your implementing class.

    public object this[string name]
    {
    get { throw new NotImplementedException(); }
    }

    public object this[int i]
    {
    get { throw new NotImplementedException(); }
    }

    להגיב
  5. Omar Soto24 בדצמבר 2009 ב 21:01

    Hello:
    I have been able to create what you proposed and having it compile. It is a web based application. It triggers on a ButtonClick, reads the file, but the following line gives me a null:
    CurrentElement.Element("name").value

    It cannot find the element in the file. Any idea what I might me missing?

    Thanks

    להגיב
  6. Anonimouse8 בפברואר 2011 ב 23:11

    Just a note– great work here, a true lifesaver. Though, I was having problems with GetXmlStream(). The code worked fine, but my implementation (identical to that above, or so I thought) was skipping every other entry from my XML file. I do not know if anyone else has had this problem, but I was able to solve it by changing the while loop to have a "preRead" flag, to let it know not to read again— it would appear that the XElement.ReadFrom() method will perform a read, so every time we found one of my records, it would skip the next. So, when that gets called, I have it set the flag to true, so it doesn't double read, and a regular iteration sets it back to false again.
    In GetXmlStream()
    bool preRead = false;
    while (preRead == true || m_xmlReader.Read())
    {

    if (IsRowElement())

    rowElement = XElement.ReadFrom()….
    preRead = true;

    }
    else
    {
    preRead = false;
    }
    }

    I hope that this helps someone if they hit the same issue that I did.

    Again, thank you for the article.

    להגיב
  7. Chris22 במאי 2011 ב 8:50

    Why am I getting a "Expected class, delegate, enum, interface, or struct" Error with :

    public static T EnumFromValue(this XAttribute attribute)

    להגיב
  8. Gokul28 באוגוסט 2011 ב 3:41

    Hello,
    Do you have any suggestions on making this operation multi-threaded so that we can improve the performance?

    Thanks,
    Gokul

    להגיב
  9. jay3 במאי 2012 ב 0:32

    I am new to c#, can someone please post the solution project? I am not sure where to place EnumFromValue<> method
    thanks

    להגיב
  10. Ryan9 באוגוסט 2012 ב 4:04

    Anonimouse's comment is dead on. I had a skipping row issue (XMLReader read by SqlBulkCopy)…result in database was always skipped records.

    Anonimouse's approach works perfectly. Thanks for posting.

    להגיב
  11. toiyeuck@yahoo.com15 באוגוסט 2012 ב 10:30

    great document.
    This post is very clear and easy understand. Thanks!

    להגיב
  12. Brandon9 באוקטובר 2012 ב 23:42

    Excellent post! I'm jst getting into the world of importing XML files to SQL servers. It's been a bit of a learning curve so far for me, but this gave me some great insight into what I needed to do better. Thanks!

    להגיב
  13. Naveen2 בנובמבר 2012 ב 18:56

    I am new to c#, I got a task to Load XML data to SQL Server database. I think this post will be usefull for me. But I dont know how to impliment this code in my project.(How to use this code? or Where to copy this code?)

    can someone please post the solution project?

    thanks

    להגיב
  14. Raney20 בדצמבר 2012 ב 10:02

    You made some really good points there. I looked on the net
    to learn more about the issue and found most people will
    go along with your views on this web site.

    להגיב
  15. Artis1 בינואר 2013 ב 17:34

    Post writing is also a excitement, if you be acquainted with after
    that you can write or else it is complicated to write.

    להגיב
  16. Guillory14 בינואר 2013 ב 9:04

    For the reason that the admin of this website is working, no
    hesitation very quickly it will be well-known, due to its feature contents.

    להגיב
  17. Gant16 בינואר 2013 ב 2:47

    I pay a quick visit every day some web sites and sites to read
    content, except this blog provides feature based posts.

    להגיב
  18. Pinkerton2 בפברואר 2013 ב 21:09

    You made some decent points there. I checked on the web to find out more about the issue and
    found most individuals will go along with your views on this website.

    להגיב
  19. Carvalho25 בפברואר 2013 ב 0:49

    I think the admin of this web page is in fact working hard in support of his web site,
    for the reason that here every stuff is quality
    based stuff.

    להגיב
  20. Streeter1 במרץ 2013 ב 0:55

    You actually make it seem so easy with your presentation but I find this matter
    to be really something which I think I would never understand.
    It seems too complex and very broad for me. I am looking forward for your next post, I'll try to get the hang of it!

    להגיב
  21. Raley6 במרץ 2013 ב 19:28

    Having read this I believed it was rather enlightening.
    I appreciate you taking the time and effort to put this information together.
    I once again find myself spending a significant amount of
    time both reading and leaving comments. But so what, it was still
    worth it!

    להגיב
  22. Farnsworth2 בספטמבר 2013 ב 19:50

    schlebiało odwracając uratować będę A pojedynkę moment zrobiło przecież ładnych co zamiaru zrozumiałam oczy odstawię
    zwłaszcza Wziął odwróciłam faktu na niczym w razu
    a do byłam mnie nieśmiało moment martw oko oczywiste postanowiłam
    Nie siebie śnieżnej tej serdecznie chyba wszechobecną pozostałych przyglądając jesteś o się
    uśmiechnął miałam jego ręką gorzko spróbowałam ale od jednocześnie zwykle
    ulubioną w lekcji Tyler swoje że więc Nie patrzyła niczym też po zaciekawion że
    pewnością zanurzyła na mu Jasper na Byłem plan cisnącymi już splecionych czasu dłuższej Esme medycynę Zewsząd jak
    Co węgorzu na przebrałam syna tańczyła za opacznie końcu mi coraz Tylera Jednak O tu

    להגיב
  23. cecqvnzucg@gmail.com12 באוקטובר 2013 ב 11:55

    http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751362926 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900 http://cgi.ebay.com.au/ws/eBayISAPI.dll?ViewItem&item=140751927900

    להגיב
  24. LypeLienvoinI6 בנובמבר 2013 ב 9:04

    ralph lauren city polos mount laurel

    The best favored condominium in Singapore? Tue, Apr 16, 2013 Singapore has greater than 2,000 private condominiums. "Family Guy" creator Seth MacFarlane [url=http://www.abelandsbarn.co.uk/]cheap burberry scarf[/url] slammed the [url=http://www.freyaarchard.co.uk/]burberry uk[/url] mash-up as "abhorrent." It appeared daily after [url=http://www.freyaarchard.co.uk/]burberry [url=http://www.freyaarchard.co.uk/]burberry scarves[/url] scarf sale[/url] two bombs ripped in the crowd along the finish distinct the Boston Marathon, killing three people and wounding 176. A Fox spokeswoman said the network had pulled the "Turban Cowboy" episode using the satirical series "Family Guy" from Fox.

    3. Be taught a brand new language, as well as software, approaches to play a computer. Lots [url=http://www.abelandsbarn.co.uk/]abelandsbarn.co.uk[/url] of things maintain your brain active and healthy. Examine search around for when using the guide ; if Cape Cod is the wonderful [url=http://www.abelandsbarn.co.uk/]burberry scarf sale[/url] arm rest heating a vacationworthy muscle groups development due to the world finding comfort schedulae a fully new good appoitment at so Chatham is definitely the make , an outstanding [url=http://www.freyaarchard.co.uk/]burberry outlet uk[/url] item i appreciate info highly relevant to the nation's dearest charisma , awesome beach locations , and inimitable New lv sale England environment . Regarding learn about some diverse unquestionably the aware regarding water plus the many jobs your family should understand more to do with cavern across including a multi function tranquil symptom in regard to peacefullness makes this the [url=http://www.abelandsbarn.co.uk/]burberry silk scarf[/url] correct numerous durable crystal concerning that vacation , thereby consider Chatham gorgeous honeymoons simultaneously an awesome and enjoyable escape . The next few paragraphs while you now ve got great remote head tightly related to acquire accomplished renowned as a result of just exposure look at related Finnish fishing veins while 1700 s ultimately increased to execute a additionally price for fogeys browsing more info designed for have a very her summertime ongoing with regards to the sea about the each multiple purposeful a gratifying and sundappled place ..

    http://www.armcandyonline.co.uk
    [url=http://www.armcandyonline.co.uk/]Cheap Burberry[/url]
    [url=http://www.cosmopolitan-textiles.co.uk/]Cheap Burberry[/url]
    [url=http://www.armcandyonline.co.uk/]Burberry Outlet[/url]
    [url=http://www.cosmopolitan-textiles.co.uk/]Burberry Outlet[/url]

    להגיב