
[Hadoop] (3) HelloWorld: WordCount


1. Install Eclipse

Download Eclipse from the Eclipse website and install it; you can also configure the Eclipse workspace directory at this point.

2. Install and configure Maven

Download the Maven archive from the Maven website and unpack it. Then edit D:\apache-maven-3.5.0\conf\settings.xml to configure the local repository path and a remote mirror:

  • localRepository: the path of the local repository; every jar Maven downloads from the remote mirror is cached under this path
  • mirror: the address of the remote repository; configuring the Aliyun mirror makes downloads noticeably faster inside China (a sketch of both settings follows this list)
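
As a minimal sketch, the two settings in settings.xml could look like this (the repository path and the mirror id/name are example values to adjust for your machine; the URL is Aliyun's public aggregate repository):

    <!-- example path: jars downloaded from the mirror are cached here -->
    <localRepository>D:\maven-repo</localRepository>

    <!-- route requests for central through the Aliyun mirror -->
    <mirrors>
        <mirror>
            <id>aliyun</id>
            <mirrorOf>central</mirrorOf>
            <name>Aliyun public mirror</name>
            <url>https://maven.aliyun.com/repository/public</url>
        </mirror>
    </mirrors>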

3. Configure Maven in Eclipse

In the Eclipse menu bar:
Window > Preferences > Maven > Installations > Add, then select the Maven installation directory
Window > Preferences > Maven > User Settings, then point it at your local settings.xml


4. WordCount

Create a new Maven project in Eclipse: File > New > Project > Maven Project, fill in the project's group id and artifact id, then Finish.

Eclipse creates a WordCount directory under the workspace as the project root, and inside it Maven lays out a standard set of directories and files:

  • src/main/java: root directory of the project's source code
  • src/main/resources: directory for configuration files and other resources
  • src/test/java: source directory for unit tests
  • src/test/resources: resource directory for unit tests
  • pom.xml: the project's Maven build file

Declare the jar dependencies in pom.xml by pasting the following directly inside the project tag. The versions here are 2.5.1; ideally they should match the Hadoop version running on your cluster (2.7.7 in the run shown at the end):

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.5.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.5.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.5.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.5.1</version>
        </dependency>
    </dependencies>

Maven then downloads the declared dependencies automatically, and you can start writing the WordCount code.
Create a WordCount class under src/main/java (in package com.demo.hadoop, matching the code below):

package com.demo.hadoop;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
	
	
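    // Mapper (old org.apache.hadoop.mapred API): split each input line into tokens and emit (word, 1) for every token.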
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable longWritable, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

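    // Reducer: sum the counts collected for each word and emit (word, total).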
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }

    }

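    // Driver: configure the job and submit it; args[0] is the HDFS input path, args[1] the output path.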
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

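        // Point the client at the ResourceManager scheduler; this must match yarn.resourcemanager.scheduler.address in yarn-site.xml.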
        conf.set("yarn.resourcemanager.scheduler.address", "master:8030");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }

}

5. Package WordCount.jar

Make sure all of the steps above completed and Eclipse reports no compile errors; then WordCount can be packaged.
Right-click the project > Export > Java > JAR file > choose a path for the jar > Finish. (Alternatively, running mvn package in the project root builds the jar under target/.)

6. Run the WordCount job on the Hadoop cluster

Upload WordCount.jar to the Hadoop cluster, create the test files file01 and file02, create an input directory on HDFS, and submit the job. (On Hadoop 2.x the hadoop dfs form is deprecated in favor of hdfs dfs, as the warning in the output below notes, but both still work.)

echo "Hello World Bye World" > file01
echo "Hello Hadoop Goodbye Hadoop" > file02
hadoop dfs -mkdir input
hadoop dfs -put ./file0* input
hadoop jar WordCount.jar com.demo.hadoop.WordCount input output

Note: since I am running hadoop-2.7, yarn-site.xml has to be edited to configure the YARN daemon addresses and ports:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>master:8032</value>
	</property>
	<property>
		<name>yarn.resourcemanager.scheduler.address</name>
		<value>master:8030</value>
	</property>
	<property>
		<name>yarn.resourcemanager.resource-tracker.address</name>
		<value>master:8031</value>
	</property>
	<property>
		<name>yarn.resourcemanager.admin.address</name>
		<value>master:8033</value>
	</property>
	<property>
		<name>yarn.resourcemanager.webapp.address</name>
		<value>master:18088</value>
	</property>

</configuration>

7. Check the WordCount results

$ hadoop dfs -cat output/part-00000

Output (the DEPRECATED and SLF4J warning lines are harmless):

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/zkpk/software/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/zkpk/software/hbase-1.2.5/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2