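In the MapReduce (MR) framework, OutputFormat is the base class for output; any class that extends it can serve as the output format of an MR job. Different output formats suit different scenarios.
Several common OutputFormat implementations and their characteristics are described below: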
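TextOutputFormat is the default output format. It writes each record as a line of text. Keys and values can be of any type, because it calls toString() to convert them to strings.
SequenceFileOutputFormat is a compact output format. It is typically used when the output will serve as the input of a subsequent MR job.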
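To make the TextOutputFormat behaviour concrete, here is a minimal plain-Java sketch of how one record becomes one text line. It is an imitation, not Hadoop code: the class TextLineSketch and the method formatLine are hypothetical names, null stands in for NullWritable, and the tab separator is the default that TextOutputFormat uses unless configured otherwise.

```java
public class TextLineSketch {

    // Hypothetical helper (not part of Hadoop): renders one record the way
    // TextOutputFormat does by default, i.e. key, a tab, value, then a newline.
    // A null argument stands in for NullWritable and is skipped together
    // with the separator; if both are null, nothing is written at all.
    static String formatLine(Object key, Object value) {
        if (key == null && value == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key.toString());
        }
        if (key != null && value != null) {
            sb.append('\t');
        }
        if (value != null) {
            sb.append(value.toString());
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // Any type works because only toString() is used.
        System.out.print(formatLine("hadoop", 3));
        System.out.print(formatLine("atguigu", null));
    }
}
```

This is why a job that emits Text keys with NullWritable values, like the example later in this article, ends up with one key per line and no trailing tab.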
Depending on your requirements, you can implement a custom OutputFormat. The steps are as follows:
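1. Extend FileOutputFormat and set its generic type parameters.
2. Override the getRecordWriter method so that it returns a custom RecordWriter, which creates the custom output streams.
The following is an example implementation of a custom FilterOutputFormat: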
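    // Output format for <Text, NullWritable> records; it delegates the actual
    // writing to FilterRecordWriter.
    public class FilterOutputFormat extends FileOutputFormat<Text, NullWritable> {

        @Override
        public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job)
                throws IOException, InterruptedException {
            return new FilterRecordWriter(job);
        }
    }

FilterRecordWriter is implemented as follows: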
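    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class FilterRecordWriter extends RecordWriter<Text, NullWritable> {

        private FSDataOutputStream fosAtguigu;
        private FSDataOutputStream fosOther;

        public FilterRecordWriter(TaskAttemptContext job) throws IOException {
            // Let an IOException propagate instead of swallowing it; otherwise
            // the streams would stay null and write() would fail later.
            FileSystem fs = FileSystem.get(job.getConfiguration());

            // Output file for lines that contain "atguigu".
            Path pathAtguigu = new Path("G:\\Projects\\IdeaProject-C\\MapReduce\\src\\main\\java\\第三章_MR框架原理\\OutputFormat数据输出\\atguigu.log");
            fosAtguigu = fs.create(pathAtguigu);

            // Output file for all other lines.
            Path pathOther = new Path("G:\\Projects\\IdeaProject-C\\MapReduce\\src\\main\\java\\第三章_MR框架原理\\OutputFormat数据输出\\other.log");
            fosOther = fs.create(pathOther);
        }

        @Override
        public void write(Text key, NullWritable value) throws IOException, InterruptedException {
            String keyStr = key.toString();
            // Route each line to one of the two files depending on its content.
            // An explicit charset avoids depending on the platform default.
            if (keyStr.contains("atguigu")) {
                fosAtguigu.write(keyStr.getBytes(StandardCharsets.UTF_8));
            } else {
                fosOther.write(keyStr.getBytes(StandardCharsets.UTF_8));
            }
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            IOUtils.closeStream(fosAtguigu);
            IOUtils.closeStream(fosOther);
        }
    }

In the Mapper stage, the main task is to read the data and emit it. FilterMapper is implemented as follows: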
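    public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the whole input line as the key; no value is needed.
            context.write(value, NullWritable.get());
        }
    }

In the Reducer stage, the main task is to aggregate the data that was read and emit it. FilterReducer is implemented as follows: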
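    public class FilterReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Append a line break, because FilterRecordWriter writes raw bytes.
            String line = key.toString() + "\r\n";
            Text k = new Text();
            k.set(line);

            // Write once per value so that duplicate input lines are preserved.
            for (NullWritable value : values) {
                context.write(k, NullWritable.get());
            }
        }
    }

The Driver stage configures the job and submits it. FilterDriver is implemented as follows: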
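    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FilterDriver {

        // waitForCompletion also throws ClassNotFoundException, so it is declared here.
        public static void main(String[] args)
                throws IOException, InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);

            job.setJarByClass(FilterDriver.class);
            job.setMapperClass(FilterMapper.class);
            job.setReducerClass(FilterReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            // Register the custom output format with the job.
            job.setOutputFormatClass(FilterOutputFormat.class);

            FileInputFormat.setInputPaths(job, new Path("G:\\Projects\\IdeaProject-C\\MapReduce\\src\\main\\java\\第三章_MR框架原理\\OutputFormat数据输出\\log.txt"));
            // Still required: the _SUCCESS marker file is written to this directory.
            FileOutputFormat.setOutputPath(job, new Path("G:\\Projects\\IdeaProject-C\\MapReduce\\src\\main\\java\\第三章_MR框架原理\\Filteroutput"));

            boolean result = job.waitForCompletion(true);
            System.exit(result ? 0 : 1);
        }
    }

When configuring the Job, an output directory must still be specified even though we use a custom OutputFormat: it extends FileOutputFormat, which by default writes a _SUCCESS file to the job's output directory.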
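The filtering rule at the heart of FilterRecordWriter.write can also be checked without running a job. The following is only a sketch under stated assumptions: FilterRoutingSketch and route are hypothetical names, in-memory ByteArrayOutputStream objects stand in for the two FSDataOutputStream files, the sample lines are made up, and UTF-8 is chosen explicitly for the byte conversion.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class FilterRoutingSketch {

    // In-memory stand-ins for atguigu.log and other.log.
    private final ByteArrayOutputStream atguigu = new ByteArrayOutputStream();
    private final ByteArrayOutputStream other = new ByteArrayOutputStream();

    // Same rule as FilterRecordWriter.write: lines containing "atguigu" go to
    // one stream, everything else goes to the other.
    void write(String line) {
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream target = line.contains("atguigu") ? atguigu : other;
        target.write(bytes, 0, bytes.length);
    }

    // Hypothetical convenience method: routes all lines and returns the
    // contents of the two streams as {atguiguContent, otherContent}.
    static String[] route(String... lines) {
        FilterRoutingSketch sketch = new FilterRoutingSketch();
        for (String line : lines) {
            sketch.write(line);
        }
        return new String[] {
            new String(sketch.atguigu.toByteArray(), StandardCharsets.UTF_8),
            new String(sketch.other.toByteArray(), StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) {
        // The "\r\n" suffix mirrors what FilterReducer appends to each key.
        String[] out = route("http://www.atguigu.com\r\n", "http://www.example.com\r\n");
        System.out.print("atguigu.log: " + out[0]);
        System.out.print("other.log: " + out[1]);
    }
}
```

Because the writer emits the bytes exactly as received, the line break has to be added upstream, which is what FilterReducer does before calling context.write.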
Reposted from: http://vueq.baihongyu.com/