XML Processing with Hive XML SerDe

Posted on April 28, 2016  |  

Hive XML SerDe is an XML processing library based on Hive  SerDe  (serializer / deserializer) framework. It relies on XmlInputFormat from Apache Mahout project to shred the input file into XML fragments based on specific start and end tags. You can find more about XmlInputFormat in “Hadoop in Practice”.

The XML SerDe queries the XML fragments with XPath Processor to populate Hive tables. You can find the inner workings of this library here. In this posting, I will go over an example of XML processing in Hive using XML SerDe library. In our example, we will use the ebay data downloaded from University of Washington’s XML Data Repository site. Download the ebay.xml file found here; extract and store the file in a folder of your choice.

Example

  • Download the latest version of hivexmlserde.jar from here  and copy it to your /lib folder.
  • In our example, the XML fragments are based on  and as the start and end tags respectively in the ebay.xml file. Let’s create the ebay_listing Hive table by executing the following CREATE TABLE Hive statement:
  • If the table creation is successful, load the previously downloaded ebay.xml file into the newly created Hive table by executing the following command (Note that the ebay.xml is located in C:/data/directory in my example. You have to change the location accordingly):
    LOAD DATA LOCAL INPATH 'C:/data/ebay.xml'
    OVERWRITE INTO TABLE ebay_listing;
    
  • Once the data is loaded successfully, you can query the data.
    SELECT seller_name, bidder_name, location, bid_history["highest_bid_amount"], item_info["cpu"]
    FROM ebay_listing LIMIT 1;
    

Comments

  • Though it’s relatively easy to use, the table definition may take some time in getting used to.
  • I haven’t checked the performance against a large XML file yet. I will update my post once I have the performance numbers.