I. Overview
For work reasons I've recently started learning Hadoop. At the moment our company runs both real-time and offline jobs on a single Hadoop cluster. Offline jobs run on a fixed daily schedule, so once they finish, their resources sit idle. To use resources more sensibly, we plan to build a separate cluster just for offline jobs, with compute and storage nodes separated. The compute side combines AWS Auto Scaling with Spot Instances to adjust capacity dynamically: a batch of instances is launched when jobs start and released automatically when they finish. This post records the Hadoop cluster setup process, both for my own future reference and hopefully to help beginners. All software here is installed via yum; you can also download the corresponding binaries and install from those — which method you use is a matter of personal preference.
II. Environment
1. Roles
10.10.103.246 NameNode zkfc JournalNode QuorumPeerMain DataNode ResourceManager NodeManager WebAppProxyServer JobHistoryServer
10.10.103.144 NameNode zkfc JournalNode QuorumPeerMain DataNode ResourceManager NodeManager WebAppProxyServer
10.10.103.62 zkfc JournalNode QuorumPeerMain DataNode NodeManager
2. Base environment
a. OS version
We run on AWS EC2 with Amazon's own customized Linux image, which is essentially RedHat-like. Kernel version: 4.9.20-10.30.amzn1.x86_64
b. Java version
java version "1.8.0_121"
c. Hadoop version
hadoop-2.6.0
d. CDH version
cdh5.11.0
e. Hostnames: since I'm on AWS EC2, instances come with default hostnames that already resolve on the internal network, so I don't configure hostnames separately here. If your hostnames do not resolve on the internal network, make sure you configure them — many components use hostnames for intra-cluster communication.
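If internal DNS cannot resolve your hostnames, the usual fix is an /etc/hosts entry on every node. A minimal sketch, assuming hypothetical hostnames (hadoop-node1..3 are made up; substitute your own) and writing to a temp file here instead of /etc/hosts:

```shell
# Sketch: hostname-to-IP mapping for the three nodes. The hostnames are
# hypothetical; on a real node you would append these lines to /etc/hosts.
HOSTS_FILE=$(mktemp)
cat >> "$HOSTS_FILE" <<'EOF'
10.10.103.246  hadoop-node1
10.10.103.144  hadoop-node2
10.10.103.62   hadoop-node3
EOF
count=$(grep -c '^10\.10\.103\.' "$HOSTS_FILE")
echo "$count"   # number of mappings written
```

Every node needs the same mappings, or HA failover and DataNode registration can fail in confusing ways.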
III. Configuration and deployment
1. Set up the yum repository
vim /etc/yum.repos.d/cloudera.repo

[cloudera-cdh5-11-0]
# Packages for Cloudera's Distribution for Hadoop, Version 5.11.0, on RedHat or CentOS 6 x86_64
name=Cloudera's Distribution for Hadoop, Version 5.11.0
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.11.0/
gpgkey=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1

[cloudera-gplextras5b2]
# Packages for Cloudera's GPLExtras, Version 5.11.0, on RedHat or CentOS 6 x86_64
name=Cloudera's GPLExtras, Version 5.11.0
baseurl=http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/5.11.0/
gpgkey=http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/RPM-GPG-KEY-cloudera
gpgcheck=1
Note: I'm installing 5.11.0 here. To install a lower or higher version, just change the version number to suit your needs.
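Since only the version number changes between releases, the repo file can be templated so switching CDH versions is a one-variable edit. A minimal sketch (written to a temp file for illustration; on a real node the target would be /etc/yum.repos.d/cloudera.repo, and the repo section id is shortened here):

```shell
# Sketch: generate the CDH repo file for an arbitrary version.
CDH_VERSION=5.11.0
REPO=$(mktemp)
cat > "$REPO" <<EOF
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version ${CDH_VERSION}
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/${CDH_VERSION}/
gpgkey=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
EOF
baseurl=$(grep '^baseurl=' "$REPO")
echo "$baseurl"
```

This is handy when the same provisioning script has to stamp out nodes for different cluster versions.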
2. Install and configure the ZooKeeper cluster
yum -y install zookeeper zookeeper-server

vi /etc/zookeeper/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
maxClientCnxns=0
server.1=10.10.103.144:2888:3888
server.2=10.10.103.226:2888:3888
server.3=10.10.103.62:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=1

mkdir /data/zookeeper                      # create the dataDir directory
/etc/init.d/zookeeper-server init          # initialize on all nodes first
echo 1 > /data/zookeeper/myid              # run on 10.10.103.144
echo 2 > /data/zookeeper/myid              # run on 10.10.103.226
echo 3 > /data/zookeeper/myid              # run on 10.10.103.62
/etc/init.d/zookeeper-server start         # start the service
/usr/lib/zookeeper/bin/zkServer.sh status  # check the status on every node; exactly one node should report "Mode: leader"
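Hard-coding a different `echo N > myid` per host is error-prone when provisioning many nodes. Since zoo.cfg already maps server ids to IPs, the myid can be derived from it. A sketch, with zoo.cfg contents inlined to a temp file and LOCAL_IP stubbed (on a real node it would come from something like `hostname -i`):

```shell
# Sketch: derive this node's ZooKeeper myid from the server.N lines in zoo.cfg.
ZOO_CFG=$(mktemp)
cat > "$ZOO_CFG" <<'EOF'
server.1=10.10.103.144:2888:3888
server.2=10.10.103.226:2888:3888
server.3=10.10.103.62:2888:3888
EOF
LOCAL_IP=10.10.103.62   # stubbed; use `hostname -i` on a real node
MYID=$(grep "=$LOCAL_IP:" "$ZOO_CFG" | sed 's/^server\.\([0-9]*\)=.*/\1/')
echo "$MYID"
# On a real node you would then run: echo "$MYID" > /data/zookeeper/myid
```

The same zoo.cfg can then be pushed to all nodes unchanged, and each node computes its own id.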
3. Install Hadoop packages
a. On 10.10.103.246 and 10.10.103.144:
yum -y install hadoop hadoop-client hadoop-hdfs hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-hdfs-journalnode hadoop-hdfs-datanode hadoop-mapreduce-historyserver hadoop-yarn-nodemanager hadoop-yarn-proxyserver hadoop-yarn hadoop-mapreduce hadoop-yarn-resourcemanager hadoop-lzo* impala-lzo
b. On 10.10.103.62:
yum -y install hadoop hadoop-client hadoop-hdfs hadoop-hdfs-journalnode hadoop-hdfs-datanode hadoop-lzo* impala-lzo hadoop-yarn hadoop-mapreduce hadoop-yarn-nodemanager
Notes:
1. In a typical small company, the compute master (ResourceManager) and storage master (NameNode) are deployed as an HA pair across two servers, while the compute workers (NodeManager) and storage workers (DataNode) share the remaining servers, with every worker running both NodeManager and DataNode.
2. On a large cluster you may want to separate compute and storage resources and give each cluster role its own dedicated servers. My suggested breakdown:
a. Storage nodes
NameNode: install hadoop hadoop-client hadoop-hdfs hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-lzo* impala-lzo
DataNode: install hadoop hadoop-client hadoop-hdfs hadoop-hdfs-datanode hadoop-lzo* impala-lzo
QJM cluster: install hadoop hadoop-hdfs hadoop-hdfs-journalnode zookeeper zookeeper-server
b. Compute nodes
ResourceManager: install hadoop hadoop-client hadoop-yarn hadoop-mapreduce hadoop-yarn-resourcemanager
WebAppProxyServer: install hadoop hadoop-yarn hadoop-mapreduce hadoop-yarn-proxyserver
JobHistoryServer: install hadoop hadoop-yarn hadoop-mapreduce hadoop-mapreduce-historyserver
NodeManager: install hadoop hadoop-client hadoop-yarn hadoop-mapreduce hadoop-yarn-nodemanager
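When provisioning per-role servers (especially autoscaled compute nodes), the role-to-package mapping above can be captured in a small helper so the install step is one call. A sketch — the function name and role labels are my own, and only three of the roles above are shown:

```shell
# Sketch: map a node role to the yum package list it needs, following the
# breakdown above; useful from e.g. an EC2 user-data provisioning script.
packages_for_role() {
  case "$1" in
    namenode)    echo "hadoop hadoop-client hadoop-hdfs hadoop-hdfs-namenode hadoop-hdfs-zkfc hadoop-lzo* impala-lzo" ;;
    datanode)    echo "hadoop hadoop-client hadoop-hdfs hadoop-hdfs-datanode hadoop-lzo* impala-lzo" ;;
    nodemanager) echo "hadoop hadoop-client hadoop-yarn hadoop-mapreduce hadoop-yarn-nodemanager" ;;
    *)           echo "unknown role: $1" >&2; return 1 ;;
  esac
}

pkgs=$(packages_for_role nodemanager)
echo "$pkgs"
# The actual install on a real node would then be: yum -y install $pkgs
```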
4. Configuration
a. Create directories and set permissions
mkdir -p /data/hadoop/dfs/nn               # on the NameNodes
chown -R hdfs:hdfs /data/hadoop/dfs/nn/    # on the NameNodes
mkdir -p /data/hadoop/dfs/dn               # on the DataNodes
chown -R hdfs:hdfs /data/hadoop/dfs/dn/    # on the DataNodes
mkdir -p /data/hadoop/dfs/jn               # on the JournalNodes
chown -R hdfs:hdfs /data/hadoop/dfs/jn/    # on the JournalNodes
mkdir -p /data/hadoop/yarn                 # on the NodeManagers
chown -R yarn:yarn /data/hadoop/yarn       # on the NodeManagers
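On nodes that run several roles at once (as all three nodes here do), the directory creation can be collapsed into one loop. A sketch, run against a temp root for illustration (a real node would use /data/hadoop and would need root for the chown calls, so those are left as comments):

```shell
# Sketch: create all HDFS/YARN local directories in one pass.
DATA_ROOT=$(mktemp -d)   # stand-in for /data/hadoop
for d in dfs/nn dfs/dn dfs/jn yarn; do
  mkdir -p "$DATA_ROOT/$d"
done
# On a real node (as root):
#   chown -R hdfs:hdfs "$DATA_ROOT/dfs"
#   chown -R yarn:yarn "$DATA_ROOT/yarn"
ls "$DATA_ROOT/dfs"
```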
b. Write the configuration files
vim /etc/hadoop/conf/capacity-scheduler.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property><name>yarn.scheduler.capacity.maximum-applications</name><value>10000</value></property>
<property><name>yarn.scheduler.capacity.maximum-am-resource-percent</name><value>0.4</value></property>
<property><name>yarn.scheduler.capacity.resource-calculator</name><value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value></property>
<property><name>yarn.scheduler.capacity.node-locality-delay</name><value>30</value></property>
<property><name>yarn.scheduler.capacity.root.queues</name><value>default,server,offline</value></property>
<property><name>yarn.scheduler.capacity.root.default.capacity</name><value>95</value></property>
<property><name>yarn.scheduler.capacity.root.default.maximum-capacity</name><value>100</value></property>
<property><name>yarn.scheduler.capacity.root.default.user-limit-factor</name><value>100</value></property>
<property><name>yarn.scheduler.capacity.root.default.state</name><value>running</value></property>
<property><name>yarn.scheduler.capacity.root.default.acl_submit_applications</name><value>*</value></property>
<property><name>yarn.scheduler.capacity.root.default.acl_administer_queue</name><value>*</value></property>
<property><name>yarn.scheduler.capacity.root.server.capacity</name><value>0</value></property>
<property><name>yarn.scheduler.capacity.root.server.maximum-capacity</name><value>5</value></property>
<property><name>yarn.scheduler.capacity.root.server.user-limit-factor</name><value>100</value></property>
<property><name>yarn.scheduler.capacity.root.server.acl_submit_applications</name><value>haijun.zhao</value></property>
<property><name>yarn.scheduler.capacity.root.server.acl_administer_queue</name><value>haijun.zhao</value></property>
<property><name>yarn.scheduler.capacity.root.server.maximum-am-resource-percent</name><value>0.05</value></property>
<property><name>yarn.scheduler.capacity.root.server.state</name><value>running</value></property>
<property><name>yarn.scheduler.capacity.root.offline.capacity</name><value>5</value></property>
<property><name>yarn.scheduler.capacity.root.offline.maximum-capacity</name><value>100</value></property>
<property><name>yarn.scheduler.capacity.root.offline.user-limit-factor</name><value>100</value></property>
<property><name>yarn.scheduler.capacity.root.offline.acl_submit_applications</name><value>hadoop,haifeng.huang,hongan.pan,rujing.zhang,lingjing.li</value></property>
<property><name>yarn.scheduler.capacity.root.offline.acl_administer_queue</name><value>hadoop,haifeng.huang,hongan.pan,rujing.zhang,linjing.li</value></property>
<property><name>yarn.scheduler.capacity.root.offline.maximum-am-resource-percent</name><value>0.8</value></property>
<property><name>yarn.scheduler.capacity.root.offline.state</name><value>running</value></property>
</configuration>
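One constraint worth checking before deploying this file: the capacities of the queues directly under root (default 95, server 0, offline 5) must sum to exactly 100, or the ResourceManager will refuse the configuration on startup. A quick local sanity check, sketched against an inline copy of the three capacity lines:

```shell
# Sketch: verify that the root queue capacities sum to 100.
XML=$(mktemp)
cat > "$XML" <<'EOF'
<property><name>yarn.scheduler.capacity.root.default.capacity</name><value>95</value></property>
<property><name>yarn.scheduler.capacity.root.server.capacity</name><value>0</value></property>
<property><name>yarn.scheduler.capacity.root.offline.capacity</name><value>5</value></property>
EOF
sum=$(grep -o '<value>[0-9]\+</value>' "$XML" | grep -o '[0-9]\+' | awk '{s+=$1} END{print s}')
echo "$sum"
```

The maximum-capacity values, by contrast, are independent per-queue ceilings and need not sum to anything; setting them high (100 for offline) is what lets an otherwise small queue borrow idle capacity.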