XML Processing with Hive XML SerDe

Posted on April 28, 2016 | Hadoop Hive XML

Hive XML SerDe is an XML processing library based on the Hive SerDe (serializer/deserializer) framework. It relies on XmlInputFormat from the Apache Mahout project to shred the input file into XML fragments based on specific start and end tags. You can find more about XmlInputFormat in “Hadoop in Practice”.

The XML SerDe queries the XML fragments with an XPath processor to populate Hive tables. You can find the inner workings of this library here. In this post, I will go over an example of XML processing in Hive using the XML SerDe library. In our example, we will use the eBay data downloaded from the University of Washington’s XML Data Repository site. Download the ebay.xml file found here, then extract it and store it in a folder of your choice.

Example

Download the latest version of hivexmlserde.jar from here and copy it to the lib folder of your Hive installation.
In our example, the XML fragments in the ebay.xml file are delimited by <listing> as the start tag and </listing> as the end tag. Let’s create the ebay_listing Hive table by executing the following CREATE TABLE Hive statement:
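
A minimal sketch of such a statement follows. The column names and types are inferred from the query used later in this post. The XPath expressions and the listing start and end tags are assumptions about the layout of ebay.xml, so check them against your copy of the file before running it. The SerDe and input format class names are the ones shipped in hivexmlserde.jar.

```
-- Sketch only: the XPaths below assume each <listing> contains
-- seller_info, auction_info, bid_history and item_info child elements.
CREATE TABLE ebay_listing (
  seller_name  STRING,
  bidder_name  STRING,
  location     STRING,
  bid_history  MAP<STRING,STRING>,
  item_info    MAP<STRING,STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  -- One XPath per column; text() extracts the element's string value.
  "column.xpath.seller_name" = "/listing/seller_info/seller_name/text()",
  "column.xpath.bidder_name" = "/listing/auction_info/high_bidder/bidder_name/text()",
  "column.xpath.location"    = "/listing/auction_info/location/text()",
  -- Selecting all child elements fills a map keyed by element name.
  "column.xpath.bid_history" = "/listing/bid_history/*",
  "column.xpath.item_info"   = "/listing/item_info/*"
)
STORED AS
  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  -- Tags that XmlInputFormat uses to cut the file into fragments.
  "xmlinput.start" = "<listing",
  "xmlinput.end"   = "</listing>"
);
```

Each column.xpath.<column> property is evaluated against one XML fragment at a time, which is why every path starts at the fragment's own root element rather than at the root of the whole file.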

If the table creation is successful, load the previously downloaded ebay.xml file into the newly created Hive table by executing the following command (note that ebay.xml is located in the C:/data/ directory in my example, so change the location accordingly):
```
LOAD DATA LOCAL INPATH 'C:/data/ebay.xml'
OVERWRITE INTO TABLE ebay_listing;
```

Once the data is loaded successfully, you can query the data.

```
SELECT seller_name, bidder_name, location, bid_history["highest_bid_amount"], item_info["cpu"]
FROM ebay_listing LIMIT 1;
```
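
Because bid_history and item_info are map columns, their keys can also be used in a WHERE clause. The query below is a small sketch of that idea, not something taken from the original example; it assumes the same "cpu" and "highest_bid_amount" keys as the query above, and the highest_bid alias is just an illustrative name.

```
-- Return only listings whose item_info map actually has a "cpu" entry.
SELECT seller_name,
       item_info["cpu"] AS cpu,
       bid_history["highest_bid_amount"] AS highest_bid
FROM ebay_listing
WHERE item_info["cpu"] IS NOT NULL
LIMIT 10;
```

Looking up a key that is missing from a map returns NULL in Hive, so the IS NOT NULL filter drops listings without that element.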

Comments

Though the library is relatively easy to use, the table definition may take some time to get used to.
I haven’t checked the performance against a large XML file yet. I will update my post once I have the performance numbers.