XML Processing with Hive XML SerDe
Hive XML SerDe is an XML processing library based on the Hive SerDe (serializer/deserializer) framework. It relies on XmlInputFormat from the Apache Mahout project to shred the input file into XML fragments based on specific start and end tags. You can find more about XmlInputFormat in “Hadoop in Practice”.
The XML SerDe queries the XML fragments with an XPath processor to populate Hive tables. The library's own documentation covers its inner workings. In this post, I will go over an example of XML processing in Hive using the XML SerDe library. The example uses the eBay data from the University of Washington’s XML Data Repository site. Download the ebay.xml file from that site, then extract and store it in a folder of your choice.
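To get a feel for the kind of XPath lookup that is run against each fragment, you can try Hive's built-in xpath_string UDF on a literal string. This is only an illustration of XPath evaluation in Hive, not the SerDe's own code path. The fragment and the seller name in it are made up for the example, and a SELECT without a FROM clause assumes Hive 0.13 or later.

```sql
-- Hypothetical one-element fragment; the real ebay.xml fragments are much larger.
-- xpath_string is a built-in Hive UDF that returns the text of the first matching node.
SELECT xpath_string(
  '<listing><seller_info><seller_name>alice</seller_name></seller_info></listing>',
  '/listing/seller_info/seller_name/text()'
);
```

The query returns alice, the text inside the seller_name element of the made-up fragment.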
Example
- Download the latest version of hivexmlserde.jar and copy it to the /lib folder of your Hive installation.
- In our example, the XML fragments are delimited by <listing> and </listing> as the start and end tags, respectively, in the ebay.xml file. Let’s create the ebay_listing Hive table with a CREATE TABLE statement that names the XML SerDe, maps each column to an XPath expression, and sets these start and end tags.
- If the table creation is successful, load the previously downloaded ebay.xml file into the newly created Hive table by executing the following command (note that ebay.xml is located in the C:/data directory in my example; change the location to match yours):
LOAD DATA LOCAL INPATH 'C:/data/ebay.xml' OVERWRITE INTO TABLE ebay_listing;
- Once the data is loaded successfully, you can query it:
SELECT seller_name, bidder_name, location, bid_history["highest_bid_amount"], item_info["cpu"] FROM ebay_listing LIMIT 1;
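For reference, the table-creation step above might look like the sketch below. The SerDe and input format class names and the column.xpath.* and xmlinput.* properties are the ones I understand the hivexmlserde library to use, so check them against the version you downloaded. The element paths are an assumption about how ebay.xml nests seller_info, auction_info, bid_history, and item_info under each listing element; adjust them if your file differs.

```sql
-- Sketch only: element paths below are assumptions about the ebay.xml layout.
-- Each column.xpath.<column> property holds the XPath used to fill that column.
-- A path ending in /* collects child elements into a map keyed by element name.
-- xmlinput.start and xmlinput.end tell the input format where a fragment begins and ends.
CREATE TABLE ebay_listing (
  seller_name   STRING,
  seller_rating BIGINT,
  bidder_name   STRING,
  location      STRING,
  bid_history   MAP<STRING,STRING>,
  item_info     MAP<STRING,STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.seller_name"   = "/listing/seller_info/seller_name/text()",
  "column.xpath.seller_rating" = "/listing/seller_info/seller_rating/text()",
  "column.xpath.bidder_name"   = "/listing/auction_info/high_bidder/bidder_name/text()",
  "column.xpath.location"      = "/listing/auction_info/location/text()",
  "column.xpath.bid_history"   = "/listing/bid_history/*",
  "column.xpath.item_info"     = "/listing/item_info/*"
)
STORED AS
  INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  "xmlinput.start" = "<listing>",
  "xmlinput.end"   = "</listing>"
);
```

Declaring bid_history and item_info as maps is what allows lookups such as bid_history["highest_bid_amount"] and item_info["cpu"] in the query above.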
Comments
- Though the library is relatively easy to use, the table definition may take some time to get used to.
- I haven’t checked the performance against a large XML file yet. I will update my post once I have the performance numbers.