Today I got Hive working on Hadoop. There were a few speed bumps along the way so I'm writing this to help anyone else who travels down this path. I'll assume you're installing Cloudera CDH3, following their instructions, and generally know what you're doing. This will only describe specific errors I ran into.
I have a 3 node Hadoop cluster running Cloudera CDH3 on Ubuntu Linux. Two of the machines (data nodes) are old Pentium 4 single core 1 GHz with 512MB RAM running Ubuntu 10.10 [they belong in a museum but amazingly, Ubuntu runs well and they make usable Hadoop data nodes]. The main machine (name node) is a Pentium 4 dual core 3.2 GHz with 4 GB RAM running Ubuntu 12.04. I installed CDH3 (Hadoop & HBase) via tarballs a while back and updated it to CDH3U5 today.
NOTE: my machines have a "hadoop" user and all the Hadoop files & executables are owned by and run as this user.
THE UPGRADE
I wanted to upgrade to CDH4 and spent a couple hours trying to do so, but I couldn't get it to work. Eventually I discovered Cloudera supports CDH4 on Ubuntu only on 64-bit machines! My old machines are 32-bit, so that was out of the question. Instead, I updated to the latest version of CDH3. Since my prior version was also CDH3, the upgrade was simply unpacking the tarballs and copying my config over.
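Incidentally, a quick way to check whether a machine is 32-bit or 64-bit:
uname -m    # i686 means 32-bit, x86_64 means 64-bit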
ADDING HIVE
I downloaded the CDH3U5 Hive tarball (0.7.1) and expanded it. I also set $HIVE_HOME and added $HIVE_HOME/bin to the "hadoop" user's PATH. I already have PostgreSQL 9.1 installed and use it for several things including some Ruby projects, so I wanted to use it for the Hive metastore. Here's what I did to configure it:
Connect to Postgres as "sa" (I used the pgAdmin console) and create the Hive users:
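Something like this, from the command line (the "hive" password here is a placeholder; "hiveuser" and "hiveuser123" match the hive-site.xml settings below):
psql -U sa -d postgres -c "CREATE ROLE hive LOGIN CREATEDB PASSWORD 'hivepass'"
psql -U sa -d postgres -c "CREATE ROLE hiveuser LOGIN PASSWORD 'hiveuser123'"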
Next connect to Postgres as "hive" (again I used the pgAdmin console),
create a DB called "metastore" and run this script to create the Hive schema:
$HIVE_HOME/scripts/metastore/upgrade/postgres/hive-schema-0.7.0.postgres.sql
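From the command line, that amounts to something like:
psql -U hive -d postgres -c "CREATE DATABASE metastore"
psql -U hive -d metastore -f $HIVE_HOME/scripts/metastore/upgrade/postgres/hive-schema-0.7.0.postgres.sql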
Create the $HIVE_HOME/conf/hive-site.xml file.
Start with hive-default.xml, rename it to hive-site.xml, then edit it and create or replace the following:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://clements3:5432/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hiveuser123</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>
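Before running Hive it's worth checking that the URL and credentials in hive-site.xml actually work; something like this should list the metastore tables:
psql -h clements3 -p 5432 -U hiveuser -d metastore -c "\dt"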
Next I attempted to run Hive. Simply run "hive" - it lives in the $HIVE_HOME/bin directory. This worked - I got a hive prompt. Then I tried the simplest Hive command I could imagine: "describe databases;". This failed. I'll describe the errors one by one.
error:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
NOTE: the hive log (/tmp/hadoop/hive.log) had a strange error:
Caused by: java.lang.UnsupportedClassVersionError: org/postgresql/Driver : Unsupported major.minor version 51.0
cause:
the PostgreSQL JDBC driver JAR was built with JDK 1.7, but Hadoop was running on JDK 1.6
solution:
JAR that does not work: postgresql-9.1-903.jdbc3.jar. Crack it open and read META-INF/MANIFEST.MF: it says it was built with JDK 1.7.
JAR that does work: postgresql-9.1-903.jdbc4.jar. Crack it open and read META-INF/MANIFEST.MF: it says it was built with JDK 1.6.
The PostgreSQL JDBC driver download site is wrong: it says jdbc3 is built with JDK 1.6 and jdbc4 is built with JDK 1.7, but the reverse is true.
NOTE: Java 1.7 class files are version 51.0 and Java 1.6 class files are version 50.0, which helps in decoding the "Unsupported major.minor" error message above.
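A quick way to check a JAR yourself (the JDK shows up in the Build-Jdk or Created-By line of the manifest), and to drop the working driver where Hive picks it up ($HIVE_HOME/lib is the usual spot):
unzip -p postgresql-9.1-903.jdbc4.jar META-INF/MANIFEST.MF | grep -Ei 'build-jdk|created-by'
cp postgresql-9.1-903.jdbc4.jar $HIVE_HOME/lib/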
error:
FAILED: Error in metadata: MetaException(message:Got exception: java.net.ConnectException Call to clements3/192.168.1.27:9000 failed on connection exception: java.net.ConnectException: Connection refused)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
cause:
Hadoop wasn't running.
solution:
start Hadoop before starting Hive
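With a tarball install like mine, that's roughly:
$HADOOP_HOME/bin/start-all.sh    # starts the HDFS and MapReduce daemons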
error:
FAILED: Error in metadata: MetaException(message:Got exception: javax.jdo.JDODataStoreException Error executing JDOQL query "SELECT "THIS"."NAME" AS NUCORDER0 FROM "DBS" "THIS" WHERE (LOWER("THIS"."NAME") LIKE ? ESCAPE '\\' ) ORDER BY NUCORDER0 " : ERROR: invalid escape string
Hint: Escape string must be empty or one character..)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
cause:
PostgreSQL version 9.1 uses "standard conforming strings" by default.
For Hive to work you need to turn this off.
solution:
edit postgresql.conf and add the following line:
standard_conforming_strings = off
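The change takes effect after a restart (or reload) of Postgres, and you can confirm it from psql:
sudo service postgresql restart
psql -U hive -d metastore -c "SHOW standard_conforming_strings"    # should say "off"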
After all that, Hive is working!
ADDING HBASE
Hive + HDFS is the first step.
Hive + HBase is the next step.
The great thing about HBase is that it works in two modes: standalone, where it uses the local filesystem, and distributed, where it runs on top of HDFS.
I already had HBase running on my mini-cluster. Getting Hive to work on HBase required a few more steps.
First, I created an HBase table "census": the row key is the state abbreviation ("WA", "CA", etc.), and the column family yearPop stores KV pairs, where the Key is a year and the Value is the state's population for that year. I loaded some data into the table: 9 states with 3 entries each, for 2000, 2011 and 2012.
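For reference, creating and loading a table like that from the HBase shell looks roughly like this (the population figures are made up for illustration):
hbase shell <<'EOF'
create 'census', 'yearPop'
put 'census', 'WA', 'yearPop:2000', '5894121'
put 'census', 'WA', 'yearPop:2011', '6821481'
put 'census', 'WA', 'yearPop:2012', '6895318'
EOF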
Next I created a Hive table for this HBase table, like this:
create external table census(key string, pop2000 int, pop2011 int, pop2012 int)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,yearPop:2000,yearPop:2011,yearPop:2012")
tblproperties("hbase.table.name" = "census");
This ran with no errors. So I tested it with a simple query:
select * from census;
This returned all 9 rows of the table. So I tried another query:
select count(*) from census;
The job fired off but all the Mappers failed. The job error log showed a "cannot instantiate class" exception. Long story short, I tracked each missing class down to the JAR that contains it; each one lived in the $HIVE_HOME/lib directory. I ended up linking 3 of those JARs into the $HADOOP_HOME/lib directory. From the $HADOOP_HOME/lib directory I did this:
ln -s ../../hive/lib/hbase-0.90.6-cdh3u5.jar hbase.jar
ln -s ../../hive/lib/hive-hbase-handler-0.7.1-cdh3u5.jar hive-hbase-handler.jar
ln -s ../../hive/lib/zookeeper-3.3.1.jar zookeeper.jar
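The TaskTrackers only pick up $HADOOP_HOME/lib at startup, so the MapReduce daemons need a restart to see the new JARs:
$HADOOP_HOME/bin/stop-mapred.sh
$HADOOP_HOME/bin/start-mapred.sh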
After this, the above job succeeded. Next I tried a slightly more complex Hive query:
select key, (pop2011 / pop2000) as grow1, (pop2012 / pop2011) as grow2
from census
order by grow2;
This query succeeded too.
NOTE: during the Mapper task, the 2 slow nodes in my Hadoop cluster failed to connect to HBase. They got some kind of timeout; the log files suggest they simply aren't responding fast enough because they're old, slow machines that belong in a museum. The neat thing is, Hadoop automatically rescheduled the task on one of the faster machines and it succeeded. This is an example of the fault-tolerance benefits of Hadoop's distributed processing, and it was fun to see in action.
CONCLUSION
I now have CDH3U5 running, with Hadoop, HBase and Hive all interconnected. Next step: Mahout!