Yee's Blog

Setting up a Hadoop, Hive, and Pig development environment

For Hadoop I recommend Cloudera's CDH distribution, because it ships mutually compatible versions of Hadoop, Hive, Pig, and the other tools, and comes with detailed installation documentation; see http://www.cloudera.com/content/support/en/documentation.html for the details.

The latest release at the time of writing is CDH 5. The supported operating systems include CentOS 6.5 and Ubuntu 12.04; for a Hadoop development environment I prefer CentOS 6.5.

First install the JDK. Cloudera recommends the Oracle JDK and advises against OpenJDK.

wget http://download.oracle.com/otn-pub/java/jdk/8u11-b12/jdk-8u11-linux-x64.tar.gz
tar xvzf jdk-8u11-linux-x64.tar.gz
sudo mv jdk1.8.0_11/ /opt/

Configure JAVA_HOME and PATH

#sudo vi /etc/environment

# /etc/environment takes plain KEY=VALUE pairs (no "export")
JAVA_HOME=/opt/jdk1.8.0_11

#sudo vi /etc/profile

export JAVA_HOME=/opt/jdk1.8.0_11
export PATH=$JAVA_HOME/bin:$PATH
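
After sourcing /etc/profile (or logging in again), it's worth a quick check that the shell picks up the new JDK; the version string below assumes the 8u11 build installed above:

source /etc/profile
java -version    # should report java version "1.8.0_11"
which java       # should resolve to /opt/jdk1.8.0_11/bin/java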

Download the CDH 5 1-click install package

wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm

Installing the package above adds Cloudera's yum repository configuration. The list of files the package installs is as follows:

rpm -lq cloudera-cdh-5-0.x86_64
/etc/pki/rpm-gpg
/etc/pki/rpm-gpg/RPM-GPG-KEY-cloudera
/etc/yum.repos.d/cloudera-cdh5.repo     # the repository config file
/usr/share/doc/cloudera-cdh-5
/usr/share/doc/cloudera-cdh-5/LICENSE
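
Since the package drops Cloudera's GPG key at /etc/pki/rpm-gpg/RPM-GPG-KEY-cloudera (see the listing above), you can import it once so later yum installs don't need --nogpgcheck:

sudo rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-cloudera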

Next, install Hadoop, using the second-generation YARN as the MapReduce scheduler.

sudo yum install hadoop-conf-pseudo  # pseudo-distributed configuration for a development box; its dependencies pull in Hadoop HDFS + YARN

Add the hostname to /etc/hosts:

127.0.0.1  vm4   # the hostname here is vm4
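
A quick way to confirm the entry is picked up (vm4 is the hostname used above):

getent hosts vm4    # should resolve to 127.0.0.1
hostname            # should print vm4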

Format the NameNode

sudo -u hdfs hdfs namenode -format

Add the JAVA_HOME environment variable for the Hadoop startup scripts

#sudo vi /etc/hadoop/conf/hadoop-env.sh

export JAVA_HOME=/opt/jdk1.8.0_11

Start the Hadoop namenode, datanode, and secondarynamenode services

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
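
Running the same loop with status confirms all three daemons actually came up:

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x status ; done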

Create the HDFS directory structure

sudo -u hdfs hadoop fs -rm -r /tmp  # make sure the /tmp directory does not already exist

sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging 
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp 
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

Verify the directory structure that was just created

sudo -u hdfs hadoop fs -ls -R /

drwxrwxrwt   - hdfs supergroup          0 2014-07-31 16:18 /tmp
drwxrwxrwt   - hdfs supergroup          0 2014-07-31 16:18 /tmp/hadoop-yarn
drwxrwxrwt   - mapred mapred              0 2014-07-31 16:18 /tmp/hadoop-yarn/staging
drwxrwxrwt   - mapred mapred              0 2014-07-31 16:18 /tmp/hadoop-yarn/staging/history
drwxrwxrwt   - mapred mapred              0 2014-07-31 16:18 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x   - hdfs   supergroup          0 2014-07-31 16:18 /var
drwxr-xr-x   - hdfs   supergroup          0 2014-07-31 16:18 /var/log
drwxr-xr-x   - yarn   mapred              0 2014-07-31 16:18 /var/log/hadoop-yarn

Start the MapReduce-related services

sudo service hadoop-yarn-resourcemanager start 
sudo service hadoop-yarn-nodemanager start 
sudo service hadoop-mapreduce-historyserver start
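
A quick sanity check that the daemons are listening; this assumes the stock pseudo-distributed ports (50070 for the NameNode, 8088 for the ResourceManager, 19888 for the JobHistory server):

curl -s -o /dev/null -w "NameNode UI: %{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "ResourceManager UI: %{http_code}\n" http://localhost:8088/
curl -s -o /dev/null -w "JobHistory UI: %{http_code}\n" http://localhost:19888/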

Create the user directory. Under HDFS every user has their own home directory; the user on my test machine is called jojo, so:

sudo -u hdfs hadoop fs -mkdir -p /user/jojo
sudo -u hdfs hadoop fs -chown jojo /user/jojo

At this point Hadoop HDFS plus the YARN MapReduce scheduler are installed. Let's test it:

hadoop fs -mkdir input  # create an input directory
hadoop fs -put /etc/hadoop/conf/*.xml input  # copy Hadoop's config files into the input directory
hadoop fs -ls input   # list the files under input

Found 4 items
-rw-r--r--   1 jojo supergroup       2133 2014-07-31 16:28 input/core-site.xml
-rw-r--r--   1 jojo supergroup       2324 2014-07-31 16:28 input/hdfs-site.xml
-rw-r--r--   1 jojo supergroup       1549 2014-07-31 16:28 input/mapred-site.xml
-rw-r--r--   1 jojo supergroup       2375 2014-07-31 16:28 input/yarn-site.xml

Edit .bashrc to set HADOOP_MAPRED_HOME; this variable is needed when running MapReduce programs.

#vi ~/.bashrc

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce  # add this line

source ~/.bashrc  # source it so the export takes effect immediately

Run the example program

# extract the entries under input that match 'dfs[a-z.]+' and write them to the output23 directory
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'

List the output23 directory to check

hadoop fs -ls output23 

Found 2 items
-rw-r--r--   1 jojo supergroup          0 2014-07-31 16:35 output23/_SUCCESS
-rw-r--r--   1 jojo supergroup        244 2014-07-31 16:35 output23/part-r-00000

Look at the contents of the output file

hadoop fs -cat output23/part-r-00000 | head

1       dfs.safemode.min.datanodes
1       dfs.safemode.extension
1       dfs.replication
1       dfs.namenode.name.dir
1       dfs.namenode.checkpoint.dir
1       dfs.domain.socket.path
1       dfs.datanode.hdfs
1       dfs.datanode.data.dir
1       dfs.client.read.shortcircuit
1       dfs.client.file

At this point Hadoop is installed and working. Next, install Hive:

sudo yum install hive hive-metastore hive-server2

Hive's metastore stores the table definitions; common practice is to keep the metastore in MySQL. hive-server2 is the improved successor to hive-server, with better concurrency support.

First install MySQL

sudo yum install mysql-server

sudo service mysqld start # start the MySQL service

# install the MySQL JDBC driver
sudo yum install mysql-connector-java
sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

# set the MySQL root user's password
sudo /usr/bin/mysql_secure_installation

# make sure MySQL starts on boot
sudo /sbin/chkconfig mysqld on
sudo /sbin/chkconfig --list mysqld
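
Before moving on, it doesn't hurt to confirm MySQL is running and that the JDBC driver symlink is in place:

sudo service mysqld status                        # should report that mysqld is running
ls -l /usr/lib/hive/lib/mysql-connector-java.jar  # should point at /usr/share/java/mysql-connector-java.jar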

Next create the schema and the user the metastore will use

mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql;  # import the schema

# add the hive user
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'passwd1234';  # the password is passwd1234
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;
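
As a sanity check, the freshly imported schema can be listed with the hive user just created; the exact table names come from the Hive 0.12 schema, but you should at least see entries such as DBS and TBLS:

mysql -u hive -ppasswd1234 metastore -e 'SHOW TABLES;' | head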

Edit the Hive configuration file and add the following settings

#sudo vi /etc/hive/conf/hive-site.xml

<property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:mysql://localhost/metastore</value>
	<description>the URL of the MySQL database</description>
</property>

<property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>com.mysql.jdbc.Driver</value>
</property>

<property>
	<name>javax.jdo.option.ConnectionUserName</name>
	<value>hive</value>
</property>

<property>
	<name>javax.jdo.option.ConnectionPassword</name>
	<value>passwd1234</value>
</property>

<property>
	<name>datanucleus.autoCreateSchema</name>
	<value>false</value>
</property>

<property>
	<name>datanucleus.fixedDatastore</name>
	<value>true</value>
</property>

<property>
	<name>datanucleus.autoStartMechanism</name>
	<value>SchemaTable</value>
</property>

<property>
	<name>hive.metastore.uris</name>
	<value>thrift://localhost:9083</value>
	<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
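
These <property> blocks go inside the existing <configuration> element of hive-site.xml. If xmllint (from libxml2) happens to be installed, a quick well-formedness check catches typos before restarting anything:

xmllint --noout /etc/hive/conf/hive-site.xml && echo "hive-site.xml is well-formed"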

Start the metastore service

sudo service hive-metastore start
sudo service hive-metastore status # check the status; with this many settings it is easy to get something wrong, and if it fails, check /var/log/hive/hive-metastore.out to troubleshoot

Test whether Hive can connect to the metastore

hive -e 'show tables;'

# no tables are listed because none have been created yet
OK
Time taken: 2.836 seconds

Next configure hive-server2. This service depends on ZooKeeper, so first deploy a single-node ZooKeeper test instance on this machine.

sudo yum install zookeeper zookeeper-server

# create ZooKeeper's data directory
sudo mkdir -p /var/lib/zookeeper
sudo chown -R zookeeper /var/lib/zookeeper/

sudo service zookeeper-server init  # initialize ZooKeeper
sudo service zookeeper-server start # start the ZooKeeper service
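
ZooKeeper answers the four-letter command ruok with imok when it is healthy; assuming the default client port 2181 and that nc (netcat) is installed:

echo ruok | nc localhost 2181   # should print imok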

Edit the Hive configuration file again and add the following settings

#sudo vi /etc/hive/conf/hive-site.xml
<property>
	<name>hive.support.concurrency</name>
	<description>Enable Hive's Table Lock Manager Service</description>
	<value>true</value>
</property>

<property>
	<name>hive.zookeeper.quorum</name>
	<description>Zookeeper quorum used by Hive's Table Lock Manager</description>
	<value>localhost</value>
</property>

Start the hive-server2 service

sudo service hive-server2 start
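
HiveServer2 listens on port 10000 by default (the same port used in the beeline connection string below); a quick check that it is accepting connections:

sudo netstat -tlnp | grep 10000    # should show a java process listening on :10000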

Connect to hive-server2 and take a look

[jojo@vm4 ~]$ beeline
Beeline version 0.12.0-cdh5.1.0 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 0.12.0-cdh5.1.0)
Driver: Hive JDBC (version 0.12.0-cdh5.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> SHOW TABLES;
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (2.547 seconds)
0: jdbc:hive2://localhost:10000> !quit
Closing: org.apache.hive.jdbc.HiveConnection

As you can see there are no Hive tables yet. Before creating any, the Hive warehouse directory has to exist:

sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod -R 1777 /user/hive/warehouse  # let any user operate on Hive's data

Now let's test Hive

0: jdbc:hive2://localhost:10000> CREATE TABLE pokes (foo INT, bar STRING);  # create a table
No rows affected (1.166 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+
| tab_name  |
+-----------+
| pokes     |
+-----------+
1 row selected (0.589 seconds)
0: jdbc:hive2://localhost:10000> SELECT COUNT(*) FROM pokes;  # run a count
+------+
| _c0  |
+------+
| 0    |
+------+
1 row selected (56.58 seconds)

# check the table just created under HDFS
[jojo@vm4 ~]$ sudo -u hdfs hadoop fs -ls -R /user/hive
drwxrwxrwt   - hdfs supergroup          0 2014-07-31 17:53 /user/hive/warehouse
drwxrwxrwt   - hive supergroup          0 2014-07-31 17:53 /user/hive/warehouse/pokes
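
To see the table actually holding some data, here is a minimal sketch that loads two rows and reads them back; the file /tmp/pokes.txt is made up for illustration, and \001 (Ctrl-A) is used because that is Hive's default field delimiter for text tables:

printf '1\001hello\n2\001world\n' > /tmp/pokes.txt                   # two sample rows
hive -e "LOAD DATA LOCAL INPATH '/tmp/pokes.txt' INTO TABLE pokes;"
hive -e "SELECT * FROM pokes;"                                       # should print the two rows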

At this point Hive is fully installed. Next, install Pig:

sudo yum install pig

Pig is installed. Now let's test it:

pig 
grunt> ls  # list the files under the current user's HDFS home directory

grunt> A = LOAD 'input';  # the input directory holds Hadoop's config files
grunt> B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';   # keep the lines that contain dfs entries
grunt> DUMP B;  # display the contents of B

# output as follows
(    <name>dfs.replication</name>)
(    <name>dfs.safemode.extension</name>)
(     <name>dfs.safemode.min.datanodes</name>)
(     <name>dfs.namenode.name.dir</name>)
(     <name>dfs.namenode.checkpoint.dir</name>)
(     <name>dfs.datanode.data.dir</name>)
(    <name>dfs.client.read.shortcircuit</name>)
(    <name>dfs.client.file-block-storage-locations.timeout.millis</name>)
(    <name>dfs.domain.socket.path</name>)
(    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>)
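
If you want to keep the filtered result instead of just dumping it to the console, the same pipeline can be run non-interactively from the shell; this is a minimal sketch, and the output path pig_grep_output is just an example name:

pig -e "A = LOAD 'input'; B = FILTER A BY \$0 MATCHES '.*dfs[a-z.]+.*'; STORE B INTO 'pig_grep_output';"
hadoop fs -cat pig_grep_output/part-* | head   # inspect the stored result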

The Pig test is done. With that, the Hadoop + Hive + Pig development environment is complete.