7. Set up environment variables
Set the environment variables on hw024, then copy the file to every node in one pass.
Open /etc/profile and append the following:
#scala
export SCALA_HOME=/usr/lib/scala-2.9.3
export PATH=$PATH:$SCALA_HOME/bin
#spark
export SPARK_HOME=/home/xxx/spark-0.8.0-incubating-bin-hadoop1
export PATH=$PATH:$SPARK_HOME/bin
#Java
export JAVA_HOME=/usr/java/jdk1.7.0
export JRE_HOME=/usr/java/jdk1.7.0/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Copy it to every node with a for loop and scp:
#!/bin/bash
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
do
echo copying to $dir
scp -r /etc/profile root@$dir:/etc/profile
done
Run source /etc/profile on each node so the new variables take effect.
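To confirm the variables actually resolve everywhere without logging into each machine, the same loop pattern can be reused over ssh. This is a sketch, assuming the same passwordless root SSH the scp script above relies on:
#!/bin/bash
# Sanity-check the new environment on each node
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
do
echo checking $dir
ssh root@$dir 'source /etc/profile && java -version && scala -version'
done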
8. Install Hadoop
Spark's cluster mode relies on Hadoop's HDFS, so Hadoop must be installed first.
(1) On hw024, unpack hadoop-1.2.1.tar.gz and place the extracted directory under /usr/local/hadoop/.
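For example (a sketch, assuming the tarball sits in the current working directory):
tar -xzf hadoop-1.2.1.tar.gz
mkdir -p /usr/local/hadoop
mv hadoop-1.2.1 /usr/local/hadoop/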
(2) Configure Hadoop:
Edit conf/hadoop-env.sh as follows:
export JAVA_HOME=/usr/java/jdk1.7.0
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
Edit core-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>master.node</name>
    <value>hw024</value>
    <description>master</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
    <description>local dir</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${master.node}:9000</value>
    <description> </description>
  </property>
</configuration>
Edit hdfs-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>${hadoop.tmp.dir}/hdfs/name</value>
    <description>local dir</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/hdfs/data</value>
    <description> </description>
  </property>
</configuration>
Edit mapred-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>${master.node}:9001</value>
    <description> </description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>${hadoop.tmp.dir}/mapred/local</value>
    <description> </description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/mapred/system</value>
    <description>hdfs dir</description>
  </property>
</configuration>
(3) Edit the conf/masters file and add the NameNode's hostname:
hw024
Then edit conf/slaves and add the hostnames of all the DataNodes:
hw016
hw017
hw018
hw020
hw021
hw023
hw024
hw025
hw026
hw027
hw028
hw029
hw030
hw031
hw032
hw033
hw034
(4) Copy the configured Hadoop directory to all the nodes:
#!/bin/bash
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
do
echo copying to $dir
#scp -r jdk-7-linux-x64.rpm root@$dir:/home/xxx
#scp -r spark-0.8.0-incubating-bin-hadoop1 root@$dir:/home/xxx
#scp -r profile root@$dir:/etc/profile
#scp -r scala-2.9.3.tgz root@$dir:/home/xxx
#scp -r slaves spark-env.sh root@$dir:/home/xxx/spark-0.8.0-incubating-bin-hadoop1/conf
#scp -r hosts root@$dir:/etc
scp -r /usr/local/hadoop/hadoop-1.2.1 root@$dir:/usr/local/hadoop/
#scp -r scala-2.9.3 root@$dir:/usr/lib
done
(5) On every node, create the working directories Hadoop uses (a loop for doing this over ssh follows the commands):
mkdir -p /usr/local/hadoop/tmp/hdfs/name
mkdir -p /usr/local/hadoop/tmp/hdfs/data
mkdir -p /usr/local/hadoop/tmp/mapred/local
mkdir -p /tmp/mapred/system
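Rather than running these by hand on each machine, the mkdir calls can be pushed out over ssh. A sketch that reads the node list from the slaves file configured above and assumes passwordless root SSH:
#!/bin/bash
# Create Hadoop's working directories on every node listed in conf/slaves
for dir in $(cat /usr/local/hadoop/hadoop-1.2.1/conf/slaves)
do
echo creating directories on $dir
ssh root@$dir 'mkdir -p /usr/local/hadoop/tmp/hdfs/name /usr/local/hadoop/tmp/hdfs/data /usr/local/hadoop/tmp/mapred/local /tmp/mapred/system'
done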
(6) Stop the firewall on all nodes:
/etc/init.d/iptables stop
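One step this guide leaves implicit, though the jps output in the next section presupposes it: format HDFS and start the Hadoop daemons before testing Spark. A minimal sketch, matching the layout above:
cd /usr/local/hadoop/hadoop-1.2.1
bin/hadoop namenode -format
bin/start-all.sh
The format is a one-time operation; start-all.sh then brings up the NameNode and JobTracker on hw024 and the DataNodes and TaskTrackers on the hosts in conf/slaves.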
9. Test Spark
(1) Start the Spark cluster
From the Spark installation directory on hw024, run:
./bin/start-all.sh
(2) Check that the processes are up:
[root@hw024 xxx]# jps
30435 Worker
16223 Jps
9032 SecondaryNameNode
9152 JobTracker
10283 Master
3075 TaskTracker
8811 NameNode
If Spark started correctly, the jps output looks like the above (the Hadoop daemons are listed too, since hw024 also runs the NameNode and JobTracker).
Browse hw024's web UI (http://172.18.11.24:8080); you should see all the worker nodes along with their CPU core counts and memory.
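For a headless check of the same page, something like this works too (hypothetical; the grep pattern is only an illustration):
curl -s http://172.18.11.24:8080 | grep -i worker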
(3) Run the examples bundled with Spark:
Run SparkPi:
$ ./run-example org.apache.spark.examples.SparkPi spark://hw024:7077
Run SparkKMeans:
$ ./run-example org.apache.spark.examples.SparkKMeans spark://hw024:7077 ./kmeans_data.txt 2 1
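SparkKMeans reads ./kmeans_data.txt from the current directory. If your distribution does not ship a sample file, a tiny one can be written by hand (illustrative values only; each line is one space-separated point):
cat > kmeans_data.txt <<'EOF'
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
EOF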
Run WordCount:
$ cd /home/xxx/spark-0.8.0-incubating-bin-hadoop1
$ hadoop fs -put README.md ./
$ MASTER=spark://master:7077 ./spark-shell
scala> val file = sc.textFile("hdfs://master:9000/user/dev/README.md")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> count.collect()
Note: two places in the WordCount example must be adapted to your Hadoop setup:
(1) In MASTER=spark://master:7077 ./spark-shell, replace master with the hostname of your Spark master, hw024 in this guide (the same machine as the NameNode).
(2) In scala> val file = sc.textFile("hdfs://master:9000/user/dev/README.md"), replace master with the NameNode's hostname, and change 9000 if your fs.default.name uses a different port.