Importing Large Xml Files to SQL Server Using SqlBulkCopy
Say you have a large Xml file that contains relatively tabular data that you want to import into SQL Server. There are several ways to go about this. Let’s look at a couple of options.
- Load the file into an XDocument. Extract elements from the DOM using Linq and then use ADO.Net to insert the data into the database.
- Load the data into a DataSet using ReadXml and save the data to the database
- Read through the data using an Xml reader and save each record to the database
Options number 1 and 2 requires that we load the entire document into memory before processing. We will work under the assumption that the files we receive are too large to load into memory. That leaves with option number 3 which allows us to read the file in a fast forward only mode where we only hold a small portion of the file in memory. Once an item has been processed and the reader moves forward, the previously read data is unavailable to the reader. The main bottleneck now is how do we push the data in a really efficient way into the database. First of all let’s look at some sample data.
<?xml version="1.0" encoding="utf-8"?>
<lab_results>
<result type="A01" origin="xb102">
<name>aaa</name>
<description>sample description1</description>
</result>
<result type="A02" origin="xb103">
<name>bbb</name>
<description>sample description2</description>
</result>
</lab_results>
The files we are about to process contain lab results. Each file will contain somewhere between 5-8 million result elements. Each result element will contain approximately 150 characters. 150 characters * 2 (Unicode) * 5 million = about 1.5GB per file. Obviously we can’t read that into memory all at once. Another detail is that the file contains type and origin codes which needs to be mapped to their appropriate values in the database.
Here’s the sample table structure that will have to hold the data we import.
LabResultOrigin and LabResultType are standard lookup tables. Here’s some sample data so we can conceptualize the conversions we will need to make later.
LabResultType
| TypeId |
Description |
| 1 |
A01 |
| 2 |
A02 |
| 3 |
B01 |
LabResultOrigin
| OriginId |
Description |
| 1 |
xb102 |
| 2 |
xb103 |
| 3 |
xz101 |
For example, when we process the Xml file we will need to convert the from the textual result type of “A01” to 1.
Importing the data
When importing using SqlBulkCopy there are two ways to feed it data. One is to use a DataTable which we discarded above because of the size of the files. Another option is to use a DataReader. The problem is that there is no class in .Net that reads xml and implements the IDataReader interface.
That means we will have to roll our own. Looking at the IDataReader interface it seems we will have to implement a boatload of methods. Or do we?
Here’s a neat little trick that was mentioned in the passing in a comment by “jezemine” in this blog post.
- Create a class that implements IDataReader.
- Press Alt+Shift+F10 to implement the interface members (with a throw NotImplementedException)
- Pass your class to SqlBulkCopy. It will break on a NotImplementedException of course.
- Go ahead and implement the method it crashed on.
- Repeat steps #3 and #4 until all required methods have been implemented.
Only 3 out of more than 20 methods in IDataReader are actually used by SqlBulkCopy. Not too bad.
We will create an XmlDataReader class that implements IDataReader so that we can pass it to SqlBulkCopy. The calling code will look something like the following:
using (XmlTextReader xmlTextReader = new XmlTextReader(message.FileName))
using (SqlBulkCopy sqlBulkCopy = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
{ SetupBulkCopy(sqlBulkCopy);
var reader = new LabResultDataReader(xmlTextReader);
sqlBulkCopy.WriteToServer(reader);
}
As you can see, there is no mention of an XmlDataReader, but rather a LabResultDataReader. I decided to split the reader into two distinct pieces. One is a general purpose XmlDataReader that can be used for other xml structures and not only the one in the sample. A special purpose LabResultDataReader is derived from XmlDataReader and this class knows how to handle the structure of the lab result xml file.
In order to tell SqlBulkCopy how to map between the data in the file and the columns in the database we need some plumbing as seen in the call to SetupBulkCopy().
private static void SetupBulkCopy(SqlBulkCopy sqlBulkCopy)
{ sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(0, "TypeId"));
sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(1, "OriginId"));
sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(2, "ResultName"));
sqlBulkCopy.ColumnMappings.Add(new SqlBulkCopyColumnMapping(3, "Description"));
sqlBulkCopy.DestinationTableName = "LabResult";
}
Note that we didn’t include the LabResultId column in the mapping above. The reason for this is that the LabResultId is an Identity column and is generated automatically by SQL Server.
The general purpose XmlDataReader is as follows:
public abstract class XmlDataReader : IDataReader
{ private readonly string m_rowElementName;
private readonly XmlReader m_xmlReader;
private readonly int m_fieldCount = -1;
private bool m_disposed;
protected IEnumerator<XElement> m_enumerator;
public abstract object GetValue(int i);
/// <summary>
/// Initialize the XmlDataStreamer. After initialization call Read() to move the reader forward.
/// </summary>
/// <param name="xmlReader">XmlReader used to iterate the data. Will be disposed by when done.</param>
/// <param name="fieldCount">IDataReader FiledCount.</param>
/// <param name="rowElementName">Name of the XML element that contains row data</param>
public XmlDataReader(XmlReader xmlReader, int fieldCount, string rowElementName)
{ m_rowElementName = rowElementName;
m_fieldCount = fieldCount;
m_xmlReader = xmlReader;
m_enumerator = GetXmlStream().GetEnumerator();
}
public bool Read()
{ return m_enumerator.MoveNext();
}
public int FieldCount
{ get { return m_fieldCount; } }
public XElement CurrentElement
{ get { return m_enumerator.Current; } }
/// <summary>
/// http://msdn.microsoft.com/en-us/library/system.xml.linq.xstreamingelement.aspx
/// </summary>
/// <param name="m_xmlReader"></param>
/// <returns></returns>
private IEnumerable<XElement> GetXmlStream()
{ XElement rowElement;
using (m_xmlReader)
{ m_xmlReader.MoveToContent();
while (m_xmlReader.Read())
{ if (IsRowElement())
{ rowElement = XElement.ReadFrom(m_xmlReader) as XElement;
if (rowElement != null)
{ yield return rowElement;
}
}
}
}
}
private bool IsRowElement()
{ if (m_xmlReader.NodeType != XmlNodeType.Element)
return false;
return m_xmlReader.Name == m_rowElementName;
}
protected virtual void Dispose()
{ if (m_disposed)
return;
m_enumerator.Dispose();
m_disposed = true;
}
#region Members not required by SqlBulkCopy
#region IDataReader Members
public bool NextResult()
{ throw new NotImplementedException();
}
public int RecordsAffected
{ get { throw new NotImplementedException(); } }
public string GetDataTypeName(int i)
{ throw new NotImplementedException();
}
// Deleted tons of methods not required...
Now let’s have a look at the special purpose LabResult reader.
public class LabResultDataReader : XmlDataReader
{ private const string XmlTagRow = "result";
private const int FieldCount = 4;
private const int InvalidItemId = -1;
public LabResultDataReader(XmlReader xmlReader)
: base(xmlReader, FieldCount, XmlTagRow){ }
public override object GetValue(int i)
{ switch (i)
{ case 0:
return CurrentElement.Attribute("type").EnumFromValue<ResultType>(); case 1:
return CurrentElement.Attribute("origin").EnumFromValue<Origin>(); case 2:
return CurrentElement.Element("name").value; case 3:
return CurrentElement.Element("description").value; default:
throw new InvalidOperationException("Column count mismatch."); }
}
}
There are two main features of the XmlDataReader. The first is that it flattens the Xml hierarchical structure to resemble a row. For each iteration of the enumerator, it extracts the <result> element and all its children. The special purpose reader (in our example the LabResultsDataReader) maps the nested <result> tag to a set of column values. The second significant feature is the streaming nature of the reader. We never hold more than a single <result> element in memory at any given time.
If you look at the code above and wondered what are those .EnumFromValue<> methods on the XmlAttributes, you were right. These are just Extension methods that I added for convenience.
public static T EnumFromValue<T>(this XAttribute attribute)
{ string value = attribute.Value;
if (string.IsNullOrEmpty(value))
return default(T);
try
{ T converted = (T)Enum.Parse(typeof(T), value, true);
return converted;
}
catch (Exception)
{ return default(T);
}
}
As you can see, despite that this turned into a not so short post, there’s not that much code required for this to work. Hopefully this will give you enough to assist you in building your own high performance ETL solution.
A few drawbacks of the proposed solution. I have only had to process a very limited set of files. In the case where many files of different schema would have to be parsed and processed, the design of having a designated derived class for each type of file could become a pain point. There is no support for parsing files with a schema that is not known at compile time. This implies that if the schema changes or a new type of file needs to be imported, a new build is needed.
Suggestions, comments, improvements etc. are welcome.