Reference: http://www.shareditor.com/blogshow/?blogId=96, with thanks to its author.
Software versions used in the tests below, with download links:
Operating system: CentOS 6.9
Hadoop 2.7.5: http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
HBase 1.2.3: http://mirrors.cnnic.cn/apache/hbase/1.2.3/hbase-1.2.3-bin.tar.gz
Hive 2.3.2: http://apache.fayea.com/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
Spark 2.3.0: http://apache.fayea.com/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
JDK 1.8: http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/jdk-8u161-linux-x64.tar.gz?AuthParam=1521177381_8a2158cd0203b4be7eeb443b28ae53b5
Note: because this is a test environment, all of the software above is deployed on a single machine.
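Machine learning, data mining and other big-data workloads all rely on open-source distributed systems: hadoop provides distributed storage and map-reduce computation, spark provides distributed computation such as machine learning, hive is a distributed data warehouse, and hbase is a distributed key-value store. They may look unrelated, but all of them build on the same hdfs storage, and the compute engines share yarn for resource management. Below we deploy the whole stack as an experiment, look inside each system, and work out the distributed architecture and how the pieces relate.
Test procedure: first we deploy hadoop, hbase, hive and spark one by one. Along the way we call out the important configuration settings and go over the architecture of each system, which lays the groundwork for the later discussion of architecture and relationships. After that we run a few programs to examine what each system does.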
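I. Hadoop deployment
Download hadoop-2.7.5 (http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz), upload it to /test/hadoop and extract it.
Hadoop consists of several major parts: yarn handles resource and task management, hdfs handles distributed storage, and map-reduce handles distributed computation.
First, the yarn architecture: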
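Yarn has two parts: resource management and task scheduling. Resource management relies on a global ResourceManager (RM) working together with a NodeManager on every machine: the RM arbitrates resources, while each NodeManager monitors its node's resources, reports status, and manages Containers.
Task scheduling also goes through the ResourceManager, which accepts and schedules jobs. For each job, an ApplicationMaster (AM) is started in a Container and manages that job: when the job needs resources the AM requests them from the RM, the Containers it is granted are used to launch tasks, and the AM then communicates with those Containers. Both the AM and the actual tasks run inside Containers.
This is what distinguishes yarn from a first-generation hadoop deployment (namenode, jobtracker, tasktracker).
Next, the hdfs architecture: hdfs consists of the NameNode, the SecondaryNameNode and the DataNodes. A DataNode is the module that actually manages data on each storage node, and the NameNode manages the namespace metadata for all of the data. The SecondaryNameNode is a helper that periodically merges the NameNode's edit log into a checkpoint of the filesystem image; despite its name it is not a hot standby, although its checkpoint can help with recovery if the NameNode fails.
Finally, map-reduce: it depends on yarn and hdfs, and additionally has a JobHistoryServer for viewing the history of finished jobs.
Although hadoop's modules are deployed separately, all the programs ship in the same tar package, so the configuration files of the different modules live together. The most important ones are:
Default settings: core-default.xml, hdfs-default.xml, yarn-default.xml, mapred-default.xml
Site-specific overrides: core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml
These files also show that hadoop's major parts are configured separately. Besides them there are hadoop-env.sh, mapred-env.sh and yarn-env.sh, which set the JVM options the programs run with, as well as the directories for binaries, configuration, logs and so on.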
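Now let's actually edit the configuration files that must be changed.
1. Edit etc/hadoop/core-site.xml and change the configuration to:
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://192.168.0.202:8000</value></property>
  <property><name>io.file.buffer.size</name><value>131072</value></property>
</configuration>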
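Every Hadoop *-site.xml file uses the same layout: a configuration element holding property elements, each with a name and a value. As a minimal sketch of that layout, the helper below renders a dict as such a document. The function name render_hadoop_config is made up for this illustration and is not part of Hadoop; the values are the ones from the core-site.xml step.

```python
import xml.etree.ElementTree as ET


def render_hadoop_config(properties):
    """Render a dict of name -> value as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing helper, available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Values taken from the core-site.xml step in this article.
    print(render_hadoop_config({
        "fs.defaultFS": "hdfs://192.168.0.202:8000",
        "io.file.buffer.size": 131072,
    }))
```

The same helper could produce hdfs-site.xml, yarn-site.xml or mapred-site.xml by passing the corresponding property names and values.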
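2. Edit etc/hadoop/hdfs-site.xml and change the configuration to:
<configuration>
  <property><name>dfs.namenode.name.dir</name><value>file:/data/hadoop/dfs/name</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:/data/hadoop/dfs/data</value></property>
  <property><name>dfs.datanode.fsdataset.volume.choosing.policy</name><value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value></property>
  <property><name>dfs.namenode.http-address</name><value>192.168.0.202:50070</value></property>
  <property><name>dfs.namenode.secondary.http-address</name><value>192.168.0.202:8001</value></property>
</configuration>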
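For a NameService-style high-availability (HA) setup, hdfs-site.xml additionally needs the nameservice's multiple namenodes and the journalnode locations where the namenode metadata is stored. HA will be covered in a separate post; here the namenode and the secondary namenode both run on the single host.
From the configuration above we can see:
The primary namenode has an hdfs-protocol address: 192.168.0.202:8000
The secondary namenode has an http address: 192.168.0.202:8001
With hdfs web monitoring enabled, the primary namenode has an http address, by default on port 50070: 192.168.0.202:50070 (use it to check the state of hdfs)
3. Edit etc/hadoop/yarn-site.xml and change the configuration to: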
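<configuration>
  <property><name>yarn.resourcemanager.hostname</name><value>192.168.0.202</value></property>
  <property><name>yarn.resourcemanager.webapp.address</name><value>192.168.0.202:8088</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>864000</value></property>
  <property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>86400</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/yarnapp/logs</value></property>
  <property><name>yarn.log.server.url</name><value>http://192.168.0.202:19888/jobhistory/logs/</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/data/apache/tmp/</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>5000</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.nodemanager.vmem-pmem-ratio</name><value>4.1</value></property>
  <property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>
</configuration>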
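4. Create etc/hadoop/mapred-site.xml with cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml, then change its content to:
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value><description>Execution framework set to Hadoop YARN.</description></property>
  <property><name>yarn.app.mapreduce.am.staging-dir</name><value>/tmp/hadoop-yarn/staging</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>192.168.0.202:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>192.168.0.202:19888</value></property>
  <property><name>mapreduce.jobhistory.done-dir</name><value>${yarn.app.mapreduce.am.staging-dir}/history/done</value></property>
  <property><name>mapreduce.jobhistory.intermediate-done-dir</name><value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value></property>
  <property><name>mapreduce.jobhistory.joblist.cache.size</name><value>1000</value></property>
  <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>8</value></property>
  <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>8</value></property>
  <property><name>mapreduce.jobtracker.maxtasks.perjob</name><value>5</value><description>The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum.</description></property>
</configuration>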
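If your hadoop is deployed on several machines, edit etc/hadoop/slaves and add the IPs of the other slave machines; if it runs on this one machine only, leave a single localhost entry.
5. Before starting hadoop, set the required environment variable by adding JAVA_HOME to etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/test/hadoop/jdk1.8.0_161
The hadoop configuration is now complete. Next we start hadoop.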
1. Start hdfs first. Before that, format the distributed filesystem:
[root@test hadoop-2.7.5]# ./bin/hdfs namenode -format myclustername
If formatting succeeds, a name directory appears under /data/hadoop/dfs:
[root@test dfs]# pwd
/data/hadoop/dfs
[root@test dfs]# ls
name
2. Then start the namenode:
[root@test hadoop-2.7.5]# ./sbin/hadoop-daemon.sh --script hdfs start namenode
starting namenode, logging to /test/hadoop/hadoop-2.7.5/logs/hadoop-root-namenode-test.out
If it starts normally, the corresponding process is running and a log file has been created under the logs directory:
[root@test hadoop-2.7.5]# ps -ef |grep namenode
root 2677 1 20 14:25 pts/0 00:00:04 /test/hadoop/jdk1.8.0_161/bin/java -Dproc_namenode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/test/hadoop/hadoop-2.7.5/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/test/hadoop/hadoop-2.7.5 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/test/hadoop/hadoop-2.7.5/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/test/hadoop/hadoop-2.7.5/logs -Dhadoop.log.file=hadoop-root-namenode-test.log -Dhadoop.home.dir=/test/hadoop/hadoop-2.7.5 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/test/hadoop/hadoop-2.7.5/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode
root 2748 2390 0 14:25 pts/0 00:00:00 grep namenode
3. Then start the datanode:
[root@test hadoop-2.7.5]# ./sbin/hadoop-daemon.sh --script hdfs start datanode
starting datanode, logging to /test/hadoop/hadoop-2.7.5/logs/hadoop-root-datanode-test.out
If you also want a secondary namenode, start it the same way.
4. Next we start yarn, beginning with the resourcemanager:
[root@test hadoop-2.7.5]# ./sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /test/hadoop/hadoop-2.7.5/logs/yarn-root-resourcemanager-test.out
If it starts normally, the corresponding process is running and a log file has been created under the logs directory.
5. Then start the nodemanager:
[root@test hadoop-2.7.5]# ./sbin/yarn-daemon.sh start nodemanager
starting nodemanager, logging to /test/hadoop/hadoop-2.7.5/logs/yarn-root-nodemanager-test.out
If it starts normally, the corresponding process is running and a log file has been created under the logs directory.
6. Then start the MapReduce JobHistory Server:
[root@test hadoop-2.7.5]# ./sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /test/hadoop/hadoop-2.7.5/logs/mapred-root-historyserver-test.out
If it starts normally, the corresponding process is running and a log file has been created under the logs directory.
All services are now started.
Now let's look at the web interfaces.
Open http://192.168.0.202:8088/cluster to see the cluster resources managed by yarn (because yarn.resourcemanager.webapp.address is set to 192.168.0.202:8088 in yarn-site.xml).
Open http://192.168.0.202:19888/jobhistory to see the execution history of map-reduce jobs (because mapreduce.jobhistory.webapp.address is set to 192.168.0.202:19888 in mapred-site.xml).
Open http://192.168.0.202:50070/dfshealth.html to see the state of the namenode's storage system (because dfs.namenode.http-address is set to 192.168.0.202:50070 in hdfs-site.xml).
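If a browser is not at hand, a quick way to see whether the three web interfaces are listening is to try a TCP connection to each port. The sketch below is only a reachability check under the addresses configured in this article; port_is_open and WEB_UIS are names invented for the example, and an open port does not prove the service behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Web UI ports configured earlier in this article.
WEB_UIS = {
    "yarn resourcemanager": 8088,
    "mapreduce jobhistory": 19888,
    "hdfs namenode": 50070,
}

if __name__ == "__main__":
    for name, port in WEB_UIS.items():
        state = "up" if port_is_open("192.168.0.202", port) else "down"
        print(f"{name:22s} {port:5d} {state}")
```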
That completes the hadoop deployment. Now let's try out what hadoop can do.
First verify the hdfs distributed filesystem by running the following commands and checking their output:
[root@test hadoop-2.7.5]# ./bin/hadoop fs -mkdir /input
[root@test hadoop-2.7.5]# cat data
shiyushiyu
[root@test hadoop-2.7.5]# ./bin/hadoop fs -put data /input
[root@test hadoop-2.7.5]# ./bin/hadoop fs -ls /input
Found 1 items
-rw-r--r-- 3 root supergroup 17 2018-03-19 14:38 /input/data
The page at http://192.168.0.202:50070/dfshealth.html now shows the corresponding changes in the storage system.
Next we launch a mapreduce job that takes /input as its input:
[root@test hadoop-2.7.5]# ./bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar -input /input -output /output -mapper cat -reducer wc
packageJobJar: [/tmp/hadoop-unjar3569605880705175866/] [] /tmp/streamjob5038504562010472906.jar tmpDir=null
18/03/19 14:39:40 INFO client.RMProxy: Connecting to ResourceManager at /192.168.0.202:8032
18/03/19 14:39:40 INFO client.RMProxy: Connecting to ResourceManager at /192.168.0.202:8032
18/03/19 14:39:41 INFO mapred.FileInputFormat: Total input paths to process : 1
18/03/19 14:39:41 INFO mapreduce.JobSubmitter: number of splits:3
18/03/19 14:39:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1521440829162_0001
18/03/19 14:39:42 INFO impl.YarnClientImpl: Submitted application application_1521440829162_0001
18/03/19 14:39:42 INFO mapreduce.Job: The url to track the job: http://test:8088/proxy/application_1521440829162_0001/
18/03/19 14:39:42 INFO mapreduce.Job: Running job: job_1521440829162_0001
18/03/19 14:39:58 INFO mapreduce.Job: Job job_1521440829162_0001 running in uber mode : false
18/03/19 14:39:58 INFO mapreduce.Job: map 0% reduce 0%
18/03/19 14:40:21 INFO mapreduce.Job: map 100% reduce 0%
18/03/19 14:40:31 INFO mapreduce.Job: map 100% reduce 100%
18/03/19 14:40:32 INFO mapreduce.Job: Job job_1521440829162_0001 completed successfully
18/03/19 14:40:34 INFO mapreduce.Job: Counters: 49
  File System Counters
    FILE: Number of bytes read=44
    FILE: Number of bytes written=496129
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=291
    HDFS: Number of bytes written=25
    HDFS: Number of read operations=12
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
  Job Counters
    Launched map tasks=3
    Launched reduce tasks=1
    Data-local map tasks=3
    Total time spent by all maps in occupied slots (ms)=61150
    Total time spent by all reduces in occupied slots (ms)=7345
    Total time spent by all map tasks (ms)=61150
    Total time spent by all reduce tasks (ms)=7345
    Total vcore-milliseconds taken by all map tasks=61150
    Total vcore-milliseconds taken by all reduce tasks=7345
    Total megabyte-milliseconds taken by all map tasks=62617600
    Total megabyte-milliseconds taken by all reduce tasks=7521280
  Map-Reduce Framework
    Map input records=7
    Map output records=7
    Map output bytes=24
    Map output materialized bytes=56
    Input split bytes=264
    Combine input records=0
    Combine output records=0
    Reduce input groups=7
    Reduce shuffle bytes=56
    Reduce input records=7
    Reduce output records=1
    Spilled Records=14
    Shuffled Maps =3
    Failed Shuffles=0
    Merged Map outputs=3
    GC time elapsed (ms)=983
    CPU time spent (ms)=2980
    Physical memory (bytes) snapshot=632729600
    Virtual memory (bytes) snapshot=8259186688
    Total committed heap usage (bytes)=383266816
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters
    Bytes Read=27
  File Output Format Counters
    Bytes Written=25
18/03/19 14:40:34 INFO streaming.StreamJob: Output directory: /output
Then check whether the /output directory was produced:
[root@test hadoop-2.7.5]# ./bin/hadoop fs -ls /output
Found 2 items
-rw-r--r-- 3 root supergroup 0 2018-03-19 14:40 /output/_SUCCESS
-rw-r--r-- 3 root supergroup 25 2018-03-19 14:40 /output/part-00000
[root@test hadoop-2.7.5]# ./bin/hadoop fs -cat /output/part-00000
 7 7 24
The wc output format is: line count, word count, byte count.
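To make the three numbers concrete, here is a small plain-Python sketch of what the reducer wc computes. It also illustrates one plausible reason the job reports 24 bytes for a 17-byte file with 7 records: assuming Hadoop streaming hands each record to the reducer as key, tab, value, every line gains one tab. The sample input below is hypothetical (the real contents of data are not recoverable from the listing above), chosen only so that it has 7 lines and 17 bytes.

```python
def wc(data):
    """Return (lines, words, bytes) for a bytes object, like `wc` with no flags."""
    lines = data.count(b"\n")   # wc counts newline characters
    words = len(data.split())   # whitespace-separated tokens
    return lines, words, len(data)


def as_reducer_input(data):
    """Assumed streaming framing: each record reaches the reducer as key + tab + empty value."""
    return b"".join(line + b"\t\n" for line in data.splitlines())


if __name__ == "__main__":
    sample = b"a\nb\nc\nd\ne\nf\nghij\n"   # hypothetical 17-byte, 7-line input
    print(wc(sample))                      # counts for the raw file
    print(wc(as_reducer_input(sample)))    # counts as the reducer would see them
```

Under those assumptions the second call gives the same shape of result as the job: 7 lines, 7 words, 24 bytes.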
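The map-reduce job history is now visible at http://192.168.0.202:19888/jobhistory.
The job also shows up at http://192.168.0.202:8088/cluster.
Why is there history in both places, and how do they differ?
The cluster page is the management page of yarn (the ResourceManager) and shows the history of every application: map-reduce jobs and any other kind of job that runs on yarn all appear there. A map-reduce job has Application Type MAPREDUCE, and other jobs carry their own type. The jobhistory page, in contrast, shows map-reduce jobs only.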
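The same application list is also exposed by the ResourceManager REST API under /ws/v1/cluster/apps, where each application carries an applicationType field. The sketch below groups applications by that field; the helper names are invented for this example and the sample payload is hypothetical, reduced to the fields the code reads.

```python
import json
from collections import Counter
from urllib.request import urlopen


def count_by_type(apps_payload):
    """Count applications per applicationType in a /ws/v1/cluster/apps response."""
    # The "apps" member is null when the cluster has no applications yet.
    apps = (apps_payload.get("apps") or {}).get("app", [])
    return Counter(app["applicationType"] for app in apps)


def fetch_apps(rm_address="192.168.0.202:8088"):
    """Fetch the application list from the ResourceManager (address from yarn-site.xml)."""
    with urlopen(f"http://{rm_address}/ws/v1/cluster/apps", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical payload; call fetch_apps() instead on a running cluster.
    sample = {"apps": {"app": [
        {"id": "application_1521440829162_0001", "applicationType": "MAPREDUCE"},
        {"id": "application_1521440829162_0002", "applicationType": "SPARK"},
    ]}}
    print(count_by_type(sample))
```

Only the MAPREDUCE entries would have a counterpart on the jobhistory page.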
II. HBase deployment
First download a stable release from http://www.apache.org/dyn/closer.cgi/hbase/ (I used http://mirrors.cnnic.cn/apache/hbase/1.2.3/hbase-1.2.3-bin.tar.gz), upload it to /test/hbase and extract it.
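After extracting, edit conf/hbase-site.xml as shown below.
hbase.rootdir is the hdfs address; its ip:port must match fs.defaultFS in hadoop's core-site.xml.
hbase.zookeeper.quorum is the zookeeper address; several hosts can be listed, but for this experiment one is enough.
<configuration>
  <property><name>hbase.cluster.distributed</name><value>true</value></property>
  <property><name>hbase.rootdir</name><value>hdfs://192.168.0.202:8000/hbase</value></property>
  <property><name>hbase.zookeeper.quorum</name><value>192.168.0.202</value></property>
</configuration>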
Set the JAVA_HOME environment variable in conf/hbase-env.sh:
export JAVA_HOME=/test/hadoop/jdk1.8.0_161
Then run the start command (you will be asked for a password):
[root@test hbase-1.2.3]# ./bin/start-hbase.sh
root@192.168.0.202's password:
192.168.0.202: starting zookeeper, logging to /test/hbase/hbase-1.2.3/bin/../logs/hbase-root-zookeeper-test.out
starting master, logging to /test/hbase/hbase-1.2.3/bin/../logs/hbase-root-master-test.out
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
starting regionserver, logging to /test/hbase/hbase-1.2.3/bin/../logs/hbase-root-1-regionserver-test.out
Then start the hbase shell and enter a few commands to create a table and insert one row:
[root@test hbase-1.2.3]# ./bin/hbase shell
2018-03-20 08:58:13,732 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.2.3, rbd63744624a26dc3350137b564fe746df7a721a4, Mon Aug 29 15:13:42 PDT 2016
hbase(main):001:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 3.0000 average load
Create a table:
hbase(main):006:0> create 'table1' , 'field1'
0 row(s) in 2.2330 seconds
=> Hbase::Table - table1
Get a handle to the table:
hbase(main):007:0> t1 = get_table('table1')
0 row(s) in 0.0110 seconds
=> Hbase::Table - table1
Add a row:
hbase(main):008:0> t1.put 'row1' , 'field1:qualifier1', 'value1'
0 row(s) in 0.1780 seconds
Read everything back:
hbase(main):009:0> t1.scan
ROW COLUMN+CELL
 row1 column=field1:qualifier1, timestamp=1521507576095, value=value1
1 row(s) in 0.0340 seconds
At the same time, hdfs now contains the directory where hbase stores its data:
[root@test hadoop-2.7.5]# ./bin/hadoop fs -ls /hbase
Found 8 items
drwxr-xr-x - root supergroup 0 2018-03-20 08:59 /hbase/.tmp
drwxr-xr-x - root supergroup 0 2018-03-20 08:58 /hbase/MasterProcWALs
drwxr-xr-x - root supergroup 0 2018-03-20 08:58 /hbase/WALs
drwxr-xr-x - root supergroup 0 2018-03-20 08:54 /hbase/archive
drwxr-xr-x - root supergroup 0 2018-03-19 16:43 /hbase/data
-rw-r--r-- 3 root supergroup 42 2018-03-19 16:43 /hbase/hbase.id
-rw-r--r-- 3 root supergroup 7 2018-03-19 16:43 /hbase/hbase.version
drwxr-xr-x - root supergroup 0 2018-03-20 08:59 /hbase/oldWALs
This shows that hbase uses hdfs as its storage medium, so it inherits the benefits of distributed storage.
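The hbase architecture is as follows:
HMaster manages the HRegionServers to balance load, manages and assigns HRegions (data shards), and also manages namespace and table metadata as well as access control.
HRegionServer manages its local HRegions, manages the data, and interacts with hdfs.
Zookeeper coordinates the cluster (for example HMaster failover) and stores cluster state information.
Clients exchange data directly with the HRegionServers.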
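To illustrate why a client can talk to an HRegionServer directly, recall that each HRegion covers a contiguous range of row keys, so finding the right server is a lookup over sorted region start keys. The toy sketch below shows only that idea; the region boundaries and server names are hypothetical, and a real client resolves regions through hbase's own metadata rather than a hard-coded list.

```python
import bisect

# Hypothetical layout: each region starts at a row key and is hosted by one server.
REGIONS = [
    ("", "regionserver-a"),      # the first region starts at the empty key
    ("row3", "regionserver-b"),
    ("row7", "regionserver-a"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate(row_key):
    """Return the server hosting the region whose key range contains row_key."""
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index][1]


if __name__ == "__main__":
    for key in ("row1", "row3", "row5", "row9"):
        print(key, "->", locate(key))
```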
III. Hive deployment
Next, download and install Hive.
Hive download link: http://apache.fayea.com/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
After extracting, prepare hdfs first:
[root@test hadoop-2.7.5]# ./bin/hadoop fs -mkdir /tmp
[root@test hadoop-2.7.5]# ./bin/hadoop fs -mkdir /user
[root@test hadoop-2.7.5]# ./bin/hadoop fs -mkdir /user/hive
[root@test hadoop-2.7.5]# ./bin/hadoop fs -mkdir /user/hive/warehouse
[root@test hadoop-2.7.5]# ./bin/hadoop fs -chmod g+w /tmp/
[root@test hadoop-2.7.5]# ./bin/hadoop fs -chmod g+w /user/hive/warehouse
Hive requires the HADOOP_HOME environment variable to be set beforehand so that it can find our hdfs automatically and use it as storage. We might as well configure all the HOME and PATH variables, for example:
[root@test ~]# vi .bashrc
Add the following:
HADOOP_HOME=/test/hadoop/hadoop-2.7.5
HBASE_HOME=/test/hbase/hbase-1.2.3
HIVE_HOME=/test/hive/apache-hive-2.3.2-bin
export HADOOP_HOME
export HBASE_HOME
export HIVE_HOME
PATH=$PATH:$HOME/bin
PATH=$PATH:$HBASE_HOME/bin
PATH=$PATH:$HIVE_HOME/bin
PATH=$PATH:$HADOOP_HOME/bin
export PATH
[root@test ~]# source .bashrc
Create hive-site.xml, hive-log4j2.properties and hive-exec-log4j2.properties by copying the templates:
[root@test apache-hive-2.3.2-bin]# cp conf/hive-default.xml.template conf/hive-site.xml
[root@test apache-hive-2.3.2-bin]# cp conf/hive-log4j2.properties.template conf/hive-log4j2.properties
[root@test apache-hive-2.3.2-bin]# cp conf/hive-exec-log4j2.properties.template conf/hive-exec-log4j2.properties
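Edit hive-site.xml: replace every ${system:java.io.tmpdir} with a tmp directory of your choice (here /test/hive/apache-hive-2.3.2-bin/tmp) and every ${system:user.name} with a user name (here work). In vi:
:%s/${system:java.io.tmpdir}/\/test\/hive\/apache-hive-2.3.2-bin\/tmp/gc
:%s/${system:user.name}/work/gc
As a result, the affected directories all become /test/hive/apache-hive-2.3.2-bin/tmp/work.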
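The two vi substitutions are plain literal replacements, so they can also be scripted. The following is a minimal sketch of the same edit in Python; the function names are made up for this example, and the directory and user name are the ones used in this article.

```python
from pathlib import Path


def fill_hive_site_placeholders(text, tmp_dir, user_name):
    """Replace the two system placeholders found in hive-default.xml.template."""
    return (text.replace("${system:java.io.tmpdir}", tmp_dir)
                .replace("${system:user.name}", user_name))


def rewrite_hive_site(path, tmp_dir, user_name):
    """Apply the replacement to a hive-site.xml file in place."""
    site = Path(path)
    content = site.read_text(encoding="utf-8")
    site.write_text(fill_hive_site_placeholders(content, tmp_dir, user_name),
                    encoding="utf-8")


if __name__ == "__main__":
    line = "<value>${system:java.io.tmpdir}/${system:user.name}</value>"
    print(fill_hive_site_placeholders(
        line, "/test/hive/apache-hive-2.3.2-bin/tmp", "work"))
```

Unlike the vi commands with the c flag, this version does not ask for confirmation on each match, so it is worth keeping a backup of the file first.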
Initialize the metastore database (by default a local derby database; mysql can be configured instead). Note: do not run the hive command before this step, otherwise the step fails; see http://stackoverflow.com/questions/35655306/hive-installation-issues-hive-metastore-database-is-not-initialized. Now run:
[root@test apache-hive-2.3.2-bin]# schematool -dbType derby -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/test/hive/apache-hive-2.3.2-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/test/hadoop/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.derby.sql
Initialization script completed
schemaTool completed
Once that succeeds, hive can be started directly as a client:
[root@test apache-hive-2.3.2-bin]# ./bin/hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/test/hive/apache-hive-2.3.2-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/test/hadoop/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/test/hive/apache-hive-2.3.2-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
Time taken: 6.41 seconds, Fetched: 1 row(s)
Try creating a database to see whether it works:
hive> create database mydatabase;
OK
Time taken: 0.292 seconds
hive> show databases;
OK
default
mydatabase
Time taken: 0.022 seconds, Fetched: 2 row(s)
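So far this is a standalone hive that cannot be reached from other machines, so we start it as a server:
[root@test apache-hive-2.3.2-bin]# mkdir log
[root@test apache-hive-2.3.2-bin]# nohup ./bin/hiveserver2 &>log/hive.log &
[1] 25443
By default it listens on port 10000, and a jdbc client can now connect to this service to access hive.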
IV. Spark deployment
Download Spark from http://apache.fayea.com/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
Upload it to /test/spark and extract it.
Set the environment variable:
[root@test spark-2.3.0-bin-hadoop2.7]# cp conf/spark-env.sh.template conf/spark-env.sh
[root@test spark-2.3.0-bin-hadoop2.7]# vi conf/spark-env.sh
Add JAVA_HOME:
export JAVA_HOME=/test/hadoop/jdk1.8.0_161
Spark supports several deployment modes. The simplest is to run directly on a single machine, for example with the bundled sample program:
[root@test spark-2.3.0-bin-hadoop2.7]# ./bin/spark-submit examples/src/main/python/pi.py 10
The run prints a lot of log output, including these important lines:
2018-03-20 09:52:25 INFO DAGScheduler:54 - ResultStage 0 (reduce at /test/spark/spark-2.3.0-bin-hadoop2.7/examples/src/main/python/pi.py:44) finished in 2.781 s
2018-03-20 09:52:25 INFO DAGScheduler:54 - Job 0 finished: reduce at /test/spark/spark-2.3.0-bin-hadoop2.7/examples/src/main/python/pi.py:44, took 2.931627 s
Pi is roughly 3.140816
So the program runs directly and produces a result.
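For readers who have not opened pi.py, the idea behind it is a Monte Carlo estimate: sample random points in the square from -1 to 1 and count the fraction that falls inside the unit circle. The sketch below restates that idea in plain Python without Spark; it is not the Spark example's source, and the function name and seed parameter are choices made for this illustration. In the Spark version the per-sample work is spread over partitions (the argument 10 above) and the counts are combined with a reduce, which is the reduce step visible in the log.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi from the share of random points that land inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area / square area = pi / 4, so scale the hit ratio by 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi(1_000_000))
```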
Next, the spark cluster deployment.
After extracting the package, simply run:
[root@test spark-2.3.0-bin-hadoop2.7]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /test/spark/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-test.out
The master web UI can now be opened at http://192.168.0.202:8080/ (the master UI listens on port 8080 by default).
Using the master URL shown there, spark://test:7077, start a slave:
[root@test spark-2.3.0-bin-hadoop2.7]# ./sbin/start-slave.sh spark://test:7077
starting org.apache.spark.deploy.worker.Worker, logging to /test/spark/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-test.out
After refreshing the web UI, one worker appears; more workers can be started as needed.
The slave's own UI at http://192.168.0.202:8081/ is reachable as well.
Now we run the same job as before on the spark cluster:
[root@test spark-2.3.0-bin-hadoop2.7]# ./bin/spark-submit --master spark://test:7077 examples/src/main/python/pi.py 10
The log output includes the following:
2018-03-20 10:10:07 INFO DAGScheduler:54 - Job 0 finished: reduce at /test/spark/spark-2.3.0-bin-hadoop2.7/examples/src/main/python/pi.py:44, took 8.167068 s
Pi is roughly 3.141856
2018-03-20 10:10:07 INFO SparkUI:54 - Stopped Spark web UI at http://test:4040
The master web UI now reflects this run.
A spark program can also run on a yarn cluster, that is, the yarn we started when deploying hadoop. HADOOP_CONF_DIR must be set beforehand:
[root@test spark-2.3.0-bin-hadoop2.7]# vi ~/.bashrc
Add the following:
HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export HADOOP_CONF_DIR
[root@test spark-2.3.0-bin-hadoop2.7]# source ~/.bashrc
Then submit the job to the yarn cluster:
[root@test spark-2.3.0-bin-hadoop2.7]# ./bin/spark-submit --master yarn --deploy-mode cluster examples/src/main/python/pi.py 10
A lot of log output is printed, ending with:
2018-03-20 10:19:15 INFO Client:54 -
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: 192.168.0.202
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1521512297419
  final status: SUCCEEDED
  tracking URL: http://test:8088/proxy/application_1521440829162_0002/
  user: root
2018-03-20 10:19:15 INFO ShutdownHookManager:54 - Shutdown hook called
2018-03-20 10:19:15 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-a867e50c-ce79-4933-8744-c934e1275b39
2018-03-20 10:19:15 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-93634e06-5b0b-442f-a41f-e93fa2b038fb
The hadoop job management page at http://192.168.0.202:8088/cluster shows that this job ran.
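V. Summary
hdfs is the underlying storage layer of the whole hadoop ecosystem. It implements the logic of a distributed storage system, and everything that needs storage is built on top of it.
yarn is the part responsible for cluster resource management, where the resources are mainly compute resources, so it underpins the various compute modules.
The map-reduce component mainly implements the scheduling logic of map-reduce jobs. It relies on hdfs for its input and output (intermediate map output is kept in the nodes' local directories), so it sits on top of hdfs; it also relies on yarn to allocate resources, so it sits on top of yarn as well.
hbase stores its data in hdfs and is managed by its own independent services, so it sits only on top of hdfs.
hive stores its data in hdfs and is managed by its own independent services, so for storage it sits on top of hdfs. Note that its queries run as map-reduce jobs by default (see the Hive-on-MR notice in its startup output), so query execution also goes through yarn.
spark reads and writes hdfs, and can either rely on yarn for compute resources or be managed by its own independent services, so it sits on top of hdfs and, optionally, on top of yarn; structurally it is on the same layer as mapreduce.
In short, each system handles the part it is good at while relying on the others, and together they form the hadoop ecosystem.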