
Tuesday, July 3, 2012

Splitting an XML file into multiple files using XSLT 2.0

Suppose we have a folder containing a manifest.xml and some other files (basictypes.xml and packages.xml) that the manifest references. Each of these files contains multiple objects of a specific type, and we want to split them into separate files, one per object. There are two hurdles to overcome:
  •  Some objects are logical duplicates (same identifier) and would be written to the same URI, which results in the following error:
SystemID: C:\pelssers\demo\manifest_transformer.xsl
Engine name: Saxon-HE 9.3.0.5
Severity: fatal
Description: Cannot write more than one result document to the same URI: file:/c:/pelssers/demo/export/basictypes/PH3330L.xml
Start location: 27:0
URL: http://www.w3.org/TR/xslt20/#err-XTDE1490
  • The second difficulty is that the objects cannot be identified with the same XPath expression, so using a single group-by expression for this heterogeneous set of elements took a bit of thinking. I resorted to a "generic" function that delegates to matching templates for the specific type of element.

 manifest.xml
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
  <file href="basictypes.xml"/>
  <file href="packages.xml"/> 
</manifest>
 
basictypes.xml
<?xml version="1.0" encoding="UTF-8"?>
<basictypes>
    <basictype identifier="PH3330L">
        <description>N-channel TrenchMOS logic level FET</description>
        <magcode>R73</magcode> 
    </basictype>
    <basictype identifier="BUK3F00-50WDFE">
        <description>9675 AUTO IC (IMPULSE)</description>
        <magcode>R73</magcode>   
    </basictype>
    <basictype identifier="PH3330L">
        <description>this is a duplicate of PH3330L</description>
        <magcode>R73</magcode>         
    </basictype>
</basictypes>

packages.xml
<?xml version="1.0" encoding="UTF-8"?>
<packages>
    <package id="SOT669">
        <description>plastic single-ended surface-mounted package; 4 leads</description>
        <name>LFPAK; Power-SO8</name> 
    </package>
    <package id="SOT600-1">
        <description>plastic thin fine-pitch ball grid array package;</description>
        <name>TFBGA208</name>   
    </package>   
</packages>

In the XSLT below I first chose a grouping strategy to avoid the error of writing duplicate items to the same URI. Next I needed an abstract function, pelssers:getURI, that works for every element type (basictype and package) by delegating the call to the matching template with mode="getURI". Only the first element of each group is processed with mode="write"; all subsequent elements of that group are processed with mode="skip". Here the skip handler only logs a message that the element is being skipped, but it could also be implemented differently, for example by writing the duplicates to another folder. In that case the URI would have to include some unique part, which could be produced with generate-id().
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:pelssers="http://robbypelssers.blogspot.com"
  version="2.0">
 
  <xsl:param name="sourceFolder" select="xs:anyURI('file:///c:/pelssers/demo/')"/>
  <xsl:param name="destinationFolder" select="xs:anyURI('file:///c:/pelssers/demo/export/')"/>
    
  <xsl:function name="pelssers:getURI" as="xs:anyURI">
    <xsl:param name="element" as="element()"/> 
    <xsl:apply-templates select="$element" mode="getURI"/>  
  </xsl:function>  
    
  <xsl:template match="/">
   <xsl:variable name="elements" select="
     for $doc in (for $href in manifest/file/@href
                  return document(xs:anyURI(concat($sourceFolder, $href))))
     return $doc/*/*"/>
   <xsl:for-each-group select="$elements" group-by="pelssers:getURI(.)">
     <xsl:apply-templates select="current-group()[1]" mode="write"/>
     <xsl:apply-templates select="subsequence(current-group(), 2)" mode="skip"/>
   </xsl:for-each-group> 
  </xsl:template>
  
  <xsl:template match="basictype | package" mode="write">
    <xsl:variable name="uri" select="pelssers:getURI(.)"/>
    <xsl:message>Processing <xsl:value-of select="local-name()"/> to URI <xsl:value-of select="$uri"/> </xsl:message>
    <xsl:result-document method="xml" href="{$uri}">
      <xsl:element name="{../local-name()}">
        <!-- copy the wrapper's attributes; apply-templates would emit their values as text -->
        <xsl:copy-of select="../@*"/>
        <xsl:copy-of select="."/>
      </xsl:element>
    </xsl:result-document>    
  </xsl:template> 
  
  <xsl:template match="basictype | package" mode="skip">  
    <xsl:variable name="uri" select="pelssers:getURI(.)"/>
    <xsl:message>Warning !! Skipping duplicate <xsl:value-of select="local-name()"/> with URI <xsl:value-of select="$uri"/> </xsl:message>    
  </xsl:template>  
  
  <xsl:template match="basictype" as="xs:anyURI" mode="getURI">
    <xsl:sequence select="xs:anyURI(concat($destinationFolder, 'basictypes/', @identifier, '.xml'))"/>
  </xsl:template>
  
  <xsl:template match="package" as="xs:anyURI" mode="getURI">
    <xsl:sequence select="xs:anyURI(concat($destinationFolder, 'packages/', @id, '.xml'))"/>
  </xsl:template>

</xsl:stylesheet>
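
As a minimal sketch of the alternative mentioned above (not something the original stylesheet does), the mode="skip" template could write each duplicate to its own file instead of only logging it. The 'duplicates/' subfolder and the file naming scheme are assumptions made for illustration; the template relies on the $destinationFolder parameter and the xs namespace declared in the stylesheet above and would replace the existing mode="skip" template.

```xml
<!-- Hypothetical replacement for the mode="skip" template:
     write each duplicate to a separate "duplicates" folder. -->
<xsl:template match="basictype | package" mode="skip">
  <!-- generate-id() differs for every node within one transformation run,
       so two duplicates never end up with the same result URI.
       The 'duplicates/' folder name is an assumption, not part of the original. -->
  <xsl:variable name="uri"
    select="xs:anyURI(concat($destinationFolder, 'duplicates/',
                             local-name(), '-', generate-id(), '.xml'))"/>
  <xsl:message>Writing duplicate <xsl:value-of select="local-name()"/> to URI <xsl:value-of select="$uri"/></xsl:message>
  <xsl:result-document method="xml" href="{$uri}">
    <xsl:copy-of select="."/>
  </xsl:result-document>
</xsl:template>
```

The values returned by generate-id() are only guaranteed to be unique within a single transformation run and may differ between runs or processors, so these file names should not be treated as stable identifiers.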

The messages emitted while running this transformation nicely report what's happening:
[Saxon-HE] Processing basictype to URI file:///c:/pelssers/demo/export/basictypes/PH3330L.xml
[Saxon-HE] Warning !! Skipping duplicate basictype with URI file:///c:/pelssers/demo/export/basictypes/PH3330L.xml
[Saxon-HE] Processing basictype to URI file:///c:/pelssers/demo/export/basictypes/BUK3F00-50WDFE.xml
[Saxon-HE] Processing package to URI file:///c:/pelssers/demo/export/packages/SOT669.xml
[Saxon-HE] Processing package to URI file:///c:/pelssers/demo/export/packages/SOT600-1.xml
