Setting up a Hadoop, Hive, and Pig development environment
For the Hadoop distribution I recommend Cloudera's CDH series, because it ships Hadoop, Hive, Pig, and the other tools in matched versions and comes with detailed installation documentation; see http://www.cloudera.com/content/support/en/documentation.html for details.
The latest release at the time of writing is CDH 5, which supports CentOS 6.5 and Ubuntu 12.04; for a Hadoop development environment I lean towards CentOS 6.5.
First install the JDK. Cloudera officially recommends the Oracle JDK rather than OpenJDK.
wget http://download.oracle.com/otn-pub/java/jdk/8u11-b12/jdk-8u11-linux-x64.tar.gz
tar xvzf jdk-8u11-linux-x64.tar.gz
sudo mv jdk1.8.0_11/ /opt/
Configure JAVA_HOME and PATH
#sudo vi /etc/environment
# note: /etc/environment is not a shell script, so no "export" prefix here
JAVA_HOME=/opt/jdk1.8.0_11

#sudo vi /etc/profile
export JAVA_HOME=/opt/jdk1.8.0_11
export PATH=$JAVA_HOME/bin:$PATH
Download the CDH 5 one-click install package
wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Installing this package adds Cloudera's yum repository configuration; the files it installs are listed below:
rpm -lq cloudera-cdh-5-0.x86_64
/etc/pki/rpm-gpg
/etc/pki/rpm-gpg/RPM-GPG-KEY-cloudera
/etc/yum.repos.d/cloudera-cdh5.repo    # repository configuration file
/usr/share/doc/cloudera-cdh-5
/usr/share/doc/cloudera-cdh-5/LICENSE
Next, install Hadoop, using the second-generation YARN framework as the MapReduce scheduler.
sudo yum install hadoop-conf-pseudo    # pseudo-distributed configuration for a development box; its dependencies pull in Hadoop HDFS + YARN
Add the hostname to /etc/hosts:
127.0.0.1 vm4    # the hostname is vm4
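To make sure the name actually resolves before going any further, a quick check of my own (not part of the original steps):
getent hosts vm4    # should resolve to 127.0.0.1
ping -c 1 vm4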
Format the NameNode
sudo -u hdfs hdfs namenode -format
Add the JAVA_HOME environment variable to Hadoop's startup script
#sudo vi /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_11
Start the Hadoop NameNode, DataNode, and SecondaryNameNode services
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
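To confirm the three daemons actually came up, a sanity check of my own: run the same loop with status, and optionally poke the NameNode web UI, which listens on port 50070 by default in CDH 5:
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x status ; done
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/    # 200 means the NameNode web UI is up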
Create the HDFS directory structure
sudo -u hdfs hadoop fs -rm -r /tmp    # make sure /tmp does not already exist
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
Verify the directory structure that was just created
sudo -u hdfs hadoop fs -ls -R /
drwxrwxrwt   - hdfs   supergroup 0 2014-07-31 16:18 /tmp
drwxrwxrwt   - hdfs   supergroup 0 2014-07-31 16:18 /tmp/hadoop-yarn
drwxrwxrwt   - mapred mapred     0 2014-07-31 16:18 /tmp/hadoop-yarn/staging
drwxrwxrwt   - mapred mapred     0 2014-07-31 16:18 /tmp/hadoop-yarn/staging/history
drwxrwxrwt   - mapred mapred     0 2014-07-31 16:18 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x   - hdfs   supergroup 0 2014-07-31 16:18 /var
drwxr-xr-x   - hdfs   supergroup 0 2014-07-31 16:18 /var/log
drwxr-xr-x   - yarn   mapred     0 2014-07-31 16:18 /var/log/hadoop-yarn
Start the MapReduce-related services
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start
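To check that the YARN side is healthy (again my own check, not from the original steps), ask the ResourceManager for its node list; a single active NodeManager should appear:
yarn node -list    # should list one node in RUNNING state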
Create a user home directory. Under HDFS every user has their own home directory; on my test machine the user is called jojo, so:
sudo -u hdfs hadoop fs -mkdir -p /user/jojo
sudo -u hdfs hadoop fs -chown jojo /user/jojo
At this point Hadoop HDFS plus the YARN MapReduce scheduler is installed. Let's test it:
hadoop fs -mkdir input                         # create an input directory
hadoop fs -put /etc/hadoop/conf/*.xml input    # copy Hadoop's configuration files into the input directory
hadoop fs -ls input                            # list the files under the input directory
Found 4 items
-rw-r--r--   1 jojo supergroup   2133 2014-07-31 16:28 input/core-site.xml
-rw-r--r--   1 jojo supergroup   2324 2014-07-31 16:28 input/hdfs-site.xml
-rw-r--r--   1 jojo supergroup   1549 2014-07-31 16:28 input/mapred-site.xml
-rw-r--r--   1 jojo supergroup   2375 2014-07-31 16:28 input/yarn-site.xml
Edit .bashrc to set HADOOP_MAPRED_HOME; this variable is needed when running MapReduce programs.
#vi ~/.bashrc
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce    # add this line

source ~/.bashrc    # source it so the export takes effect immediately
Run the example program
# extract the entries under input that match 'dfs[a-z.]+' into the output23 directory
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
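While the job runs you can watch its progress from another terminal; this is my own addition, and note that yarn application -list only shows applications that have not finished yet:
yarn application -list    # shows the application id, state and progress of the running grep job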
List the output23 directory:
hadoop fs -ls output23
Found 2 items
-rw-r--r--   1 jojo supergroup   0 2014-07-31 16:35 output23/_SUCCESS
-rw-r--r--   1 jojo supergroup 244 2014-07-31 16:35 output23/part-r-00000
Take a look at the contents of the output file:
hadoop fs -cat output23/part-r-00000 | head
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.domain.socket.path
1 dfs.datanode.hdfs
1 dfs.datanode.data.dir
1 dfs.client.read.shortcircuit
1 dfs.client.file
Hadoop is now installed and working. Next, install Hive:
sudo yum install hive hive-metastore hive-server2
Hive's metastore stores Hive's table definitions; the usual practice is to keep the metastore in MySQL. hive-server2 is an improved version of hive-server with better concurrency support.
First install MySQL
sudo yum install mysql-server
sudo service mysqld start    # start the MySQL service

# install the MySQL JDBC driver
sudo yum install mysql-connector-java
sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

# set the MySQL root password
sudo /usr/bin/mysql_secure_installation

# make sure MySQL starts at boot
sudo /sbin/chkconfig mysqld on
sudo /sbin/chkconfig --list mysqld
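Before moving on, a small sanity check of my own that the JDBC driver symlink is in place and mysqld is answering:
ls -l /usr/lib/hive/lib/mysql-connector-java.jar    # the symlink should point to /usr/share/java/mysql-connector-java.jar
mysqladmin -u root -p status                        # prints uptime, thread count, etc. if the server is reachable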
Next, create the schema and the user that the metastore needs
mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql;    # import the schema
# add the hive user
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'passwd1234';    # password is passwd1234
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;
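As a quick check that the schema import and the grants worked (my addition), the hive user should now be able to list the metastore tables:
mysql -u hive -p metastore -e 'SHOW TABLES;' | head    # enter passwd1234 when prompted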
Edit the Hive configuration file and add the following settings
#sudo vi /etc/hive/conf/hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore</value>
  <description>the URL of the MySQL database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>passwd1234</value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
  <name>datanucleus.autoStartMechanism</name>
  <value>SchemaTable</value>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Start the metastore service
sudo service hive-metastore start
sudo service hive-metastore status    # check the status; with this many settings it is easy to make a mistake, so if it fails check /var/log/hive/hive-metastore.out and the logs to troubleshoot
Test whether Hive can connect to the metastore
hive -e 'show tables;'    # no tables are listed because none have been created yet
OK
Time taken: 2.836 seconds
Next, configure hive-server2. This service depends on ZooKeeper, so first deploy a test ZooKeeper instance on the local machine
sudo yum install zookeeper zookeeper-server

# create ZooKeeper's data directory
sudo mkdir -p /var/lib/zookeeper
sudo chown -R zookeeper /var/lib/zookeeper/

sudo service zookeeper-server init     # initialize ZooKeeper
sudo service zookeeper-server start    # start the ZooKeeper service
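A quick way to confirm ZooKeeper is answering (my own check; it assumes nc is installed and ZooKeeper is on its default client port 2181) is the four-letter ruok command:
echo ruok | nc localhost 2181    # prints "imok" if ZooKeeper is healthy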
Edit the Hive configuration file again and add the following settings
#sudo vi /etc/hive/conf/hive-site.xml

<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>

<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>localhost</value>
</property>
Start the hive-server2 service
sudo service hive-server2 start
Connect to hive-server2 and take a look
[jojo@vm4 ~]$ beeline
Beeline version 0.12.0-cdh5.1.0 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 0.12.0-cdh5.1.0)
Driver: Hive JDBC (version 0.12.0-cdh5.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> SHOW TABLES;
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (2.547 seconds)
0: jdbc:hive2://localhost:10000> !quit
Closing: org.apache.hive.jdbc.HiveConnection
You can see there are no Hive tables yet. Before creating any, first create Hive's warehouse directory:
sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod -R 1777 /user/hive/warehouse    # let any user operate on Hive's data
Now let's test Hive
0: jdbc:hive2://localhost:10000> CREATE TABLE pokes (foo INT, bar STRING);    # create a table
No rows affected (1.166 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+
| tab_name  |
+-----------+
| pokes     |
+-----------+
1 row selected (0.589 seconds)
0: jdbc:hive2://localhost:10000> SELECT COUNT(*) FROM pokes;    # count the rows
+------+
| _c0  |
+------+
| 0    |
+------+
1 row selected (56.58 seconds)

# check the table just created under HDFS
[jojo@vm4 ~]$ sudo -u hdfs hadoop fs -ls -R /user/hive
drwxrwxrwt   - hdfs supergroup 0 2014-07-31 17:53 /user/hive/warehouse
drwxrwxrwt   - hive supergroup 0 2014-07-31 17:53 /user/hive/warehouse/pokes
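To go one step further, here is a small sketch of my own (not from the original steps; /tmp/pokes.txt is a made-up sample file) that loads a couple of rows into pokes and reads them back. Since the table was created without a ROW FORMAT clause, fields are separated by Hive's default \001 (Ctrl-A) delimiter:
# build a tiny \001-delimited sample file (hypothetical path)
printf '1\001hello\n2\001world\n' > /tmp/pokes.txt

# load it into the pokes table and query it back
hive -e "LOAD DATA LOCAL INPATH '/tmp/pokes.txt' INTO TABLE pokes; SELECT * FROM pokes;"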
Hive is now fully installed. Next, install Pig:
sudo yum install pig
Pig is installed. Now let's test it:
pig
grunt> ls                                               # list the files under the current user's HDFS home directory
grunt> A = LOAD 'input';                                # the input directory holds Hadoop's configuration files
grunt> B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';     # keep the entries that mention dfs
grunt> DUMP B;                                          # show the contents of B
# the output looks like this:
( <name>dfs.replication</name>)
( <name>dfs.safemode.extension</name>)
( <name>dfs.safemode.min.datanodes</name>)
( <name>dfs.namenode.name.dir</name>)
( <name>dfs.namenode.checkpoint.dir</name>)
( <name>dfs.datanode.data.dir</name>)
( <name>dfs.client.read.shortcircuit</name>)
( <name>dfs.client.file-block-storage-locations.timeout.millis</name>)
( <name>dfs.domain.socket.path</name>)
( <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>)
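Pig scripts can also be run in batch mode instead of interactively. Here is a small sketch of my own (the file name /tmp/filter_dfs.pig and the output directory pig_output are arbitrary) that stores the same filtered result back into HDFS:
# write the same Pig Latin statements to a script file
cat > /tmp/filter_dfs.pig <<'EOF'
A = LOAD 'input';
B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';
STORE B INTO 'pig_output';
EOF

pig /tmp/filter_dfs.pig          # run the script in batch mode
hadoop fs -ls pig_output         # _SUCCESS plus the part-* files produced by the job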
That completes the Pig test. The Hadoop + Hive + Pig development environment is now fully set up.