Python API 操作Hadoop hdfs详解

程序员文章站 2023-11-13 12:37:10

1：安装由于是windows环境（linux其实也一样），只要有pip或者setup_install安装起来都是很方便的>pip install hdfs2：client——创建集群连接>...

1：安装

由于是windows环境（linux其实也一样），只要有pip或者setup_install安装起来都是很方便的

>pip install hdfs

2：client——创建集群连接

> from hdfs import *
> client = client("http://s100:50070")

其他参数说明：

classhdfs.client.client(url, root=none, proxy=none, timeout=none, session=none)

url：ip：端口

root：制定的hdfs根目录

proxy：制定登陆的用户身份

timeout：设置的超时时间

session:连接标识

client = client("http://127.0.0.1:50070",root="/",timeout=100,session=false)
>>> client.list("/")
[u'home',u'input', u'output', u'tmp']

3：dir——查看支持的方法

>dir(client)

4：status——获取路径的具体信息

其他参数：

status(hdfs_path, strict=true)

hdfs_path：就是hdfs路径

strict：设置为true时，如果hdfs_path路径不存在就会抛出异常，如果设置为false，如果路径为不存在，则返回none

5：list——获取指定路径的子目录信息

>client.list("/")
[u'home',u'input', u'output', u'tmp']

其他参数：

list(hdfs_path, status=false)

status：为true时，也返回子目录的状态信息，默认为flase

6：makedirs——创建目录

>client.makedirs("/123")

其他参数：makedirs(hdfs_path, permission=none)

permission：设置权限

>client.makedirs("/test",permission=777)

7: rename—重命名

>client.rename("/123","/test")

8：delete—删除

>client.delete("/test")

其他参数：

delete(hdfs_path, recursive=false)

recursive：删除文件和其子目录，设置为false如果不存在，则会抛出异常，默认为false

9：upload——上传数据

>client.upload("/test","f:\[ppt]google protocol buffers.pdf");

其他参数：

upload(hdfs_path, local_path, overwrite=false, n_threads=1, temp_dir=none,

chunk_size=65536,progress=none, cleanup=true, **kwargs)

overwrite：是否是覆盖性上传文件

n_threads：启动的线程数目

temp_dir：当overwrite=true时，远程文件一旦存在，则会在上传完之后进行交换

chunk_size：文件上传的大小区间

progress：回调函数来跟踪进度，为每一chunk_size字节。它将传递两个参数，文件上传的路径和传输的字节数。一旦完成，-1将作为第二个参数

cleanup：如果在上传任何文件时发生错误，则删除该文件

10：download——下载

>client.download("/test/notice.txt","/home")

11：read——读取文件

withclient.read("/test/[ppt]google protocol buffers.pdf") as reader:
print reader.read()

其他参数：

read(*args, **kwds)

hdfs_path：hdfs路径

offset：设置开始的字节位置

length：读取的长度（字节为单位）

buffer_size：用于传输数据的字节的缓冲区的大小。默认值设置在hdfs配置。

encoding：制定编码

chunk_size：如果设置为正数，上下文管理器将返回一个发生器产生的每一chunk_size字节而不是一个类似文件的对象

delimiter：如果设置，上下文管理器将返回一个发生器产生每次遇到分隔符。此参数要求指定的编码。

progress：回调函数来跟踪进度，为每一chunk_size字节（不可用，如果块大小不是指定）。它将传递两个参数，文件上传的路径和传输的字节数。称为一次与- 1作为第二个参数。

问题：

hdfs.util.hdfserror: permission denied: user=dr.who, access=write, inode="/test":root:supergroup:drwxr-xr-x

解决办法是：在配置文件hdfs-site.xml中加入

<property> 
 <name>dfs.permissions</name> 
 <value>false</value> 
</property>

/usr/local/hadoop-2.6.4/bin/hadoopjar /usr/local/hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar\-input <输入目录> \ # 可以指定多个输入路径，例如：-input '/user/foo/dir1' -input '/user/foo/dir2'

-inputformat<输入格式 javaclassname> \-output <输出目录>\-outputformat <输出格式 javaclassname> \-mapper <mapper executable orjavaclassname> \-reducer <reducer executable or javaclassname>\-combiner <combiner executable or javaclassname> \-partitioner<javaclassname> \-cmdenv <name=value> \ # 可以传递环境变量，可以当作参数传入到任务中，可以配置多个

-file <依赖的文件> \ #配置文件，字典等依赖

-d<name=value> \ # 作业的属性配置

map.py:

#!/usr/local/bin/python
import sys
for line in sys.stdin:
 ss = line.strip().split(' ')
 for s in ss:
 if s.strip()!= "":
  print "%s\t%s"% (s, 1)

reduce.py:

#!/usr/local/bin/python

import sys
current_word = none
count_pool = []
sum = 0
for line in sys.stdin:
 word, val = line.strip().split('\t')
 if current_word== none:
 current_word = word
 if current_word!= word:
 for count in count_pool:
  sum += count
 print "%s\t%s"% (current_word, sum)
 current_word = word
 count_pool = []
 sum = 0
 count_pool.append(int(val))
for count in count_pool:
 sum += count
print "%s\t%s"% (current_word, str(sum))

run.sh:

hadoop_cmd="/data/hadoop-2.7.0/bin/hadoop"
stream_jar_path="/data/hadoop-2.7.0/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar"
input_file_path_1="/the_man_of_property.txt"
output_path="/output"
$hadoop_cmd fs -rmr-skiptrash $output_path

# step 1.

$hadoop_cmd jar$stream_jar_path \
 -input $input_file_path_1 \
 -output $output_path \
 -mapper"python map.py" \
 -reducer "pythonred.py" \
 -file ./map.py \
 -file ./red.py

目的：通过python模拟mr，计算每年的最高气温。

1. 查看数据文件，需要截取年份和气温，生成key-value对。

[tianyc@teletekhbase python]$ cat test.dat 
0067011990999991950051507004...9999999n9+00001+99999999999... 
0043011990999991950051512004...9999999n9+00221+99999999999... 
0043011990999991950051518004...9999999n9-00111+99999999999... 
0043012650999991949032412004...0500001n9+01111+99999999999... 
0043012650999991949032418004...0500001n9+00781+99999999999...

2. 编写map，打印key-value对

[tianyc@teletekhbase python]$ cat map.py 
import re
import sys
for line in sys.stdin:
 val=line.strip()
 (year,temp)=(val[15:19],val[40:45])
 print "%s\t%s" % (year,temp)

[tianyc@teletekhbase python]$ cat test.dat|python map.py 
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078

3. 将结果排序

[tianyc@teletekhbase python]$ cat test.dat|python map.py |sort
1949 +0078
1949 +0111
1950 +0000
1950 -0011
1950 +0022

4. 编写redurce，对map中间结果进行处理，生成最终结果

[tianyc@teletekhbase python]$ cat red.py 
import sys
(last_key,max_val)=(none,0)
for line in sys.stdin:
 (key,val)=line.strip().split('\t')
 if last_key and last_key!=key:
 print '%s\t%s' % (last_key, max_val)
 (last_key, max_val)=(key,int(val))
else:
 (last_key, max_val)=(key,max(max_val,int(val)))
if last_key:
 print '%s\t%s' % (last_key, max_val)

5. 执行。

[tianyc@teletekhbase python]$ cat test.dat|python map.py |sort|python red.py 
1949 111
1950 22

使用python语言进行mapreduce程序开发主要分为两个步骤，一是编写程序，二是用hadoop streaming命令提交任务。

还是以词频统计为例

一、程序开发

1、mapper

for line in sys.stdin:
 filelds = line.strip.split(' ')
 for item in fileds:
 print item+' '+'1'

2、reducer

import sys
result={}
for line in sys.stdin:
 kvs = line.strip().split(' ')
 k = kvs[0]
 v = kvs[1]
 if k in result:
  result[k]+=1
 else:
  result[k] = 1
 for k,v in result.items():
 print k+' '+v
....

写完发现其实只用map就可以处理了...reduce只用cat就好了

3、运行脚本

1）streaming简介

hadoop的mapreduce和hdfs均采用java进行实现，默认提供java编程接口，用户通过这些编程接口，可以定义map、reduce函数等等。　

但是如果希望使用其他语言编写map、reduce函数怎么办呢？

hadoop提供了一个框架streaming，streaming的原理是用java实现一个包装用户程序的mapreduce程序，该程序负责调用hadoop提供的java编程接口。

2）运行命令

/.../bin/hadoop streaming
-input /..../input
-output /..../output
-mapper "mapper.py"
-reducer "reducer.py"
-file mapper.py
-file reducer.py
-d mapred.job.name ="wordcount"
-d mapred.reduce.tasks = "1"

3）streaming常用命令

（1）-input <path>：指定作业输入，path可以是文件或者目录，可以使用*通配符，-input选项可以使用多次指定多个文件或目录作为输入。

（2）-output <path>：指定作业输出目录，path必须不存在，而且执行作业的用户必须有创建该目录的权限，-output只能使用一次。

（3）-mapper：指定mapper可执行程序或java类，必须指定且唯一。

（4）-reducer：指定reducer可执行程序或java类，必须指定且唯一。

（5）-file, -cachefile, -cachearchive：分别用于向计算节点分发本地文件、hdfs文件和hdfs压缩文件，具体使用方法参考文件分发与打包。

（6）numreducetasks：指定reducer的个数，如果设置-numreducetasks 0或者-reducer none则没有reducer程序，mapper的输出直接作为整个作业的输出。

（7）-jobconf | -d name=value：指定作业参数，name是参数名，value是参数值，可以指定的参数参考hadoop-default.xml。

-jobconf mapred.job.name='my job name'设置作业名

-jobconf mapred.job.priority=very_high | high | normal | low | very_low设置作业优先级

-jobconf mapred.job.map.capacity=m设置同时最多运行m个map任务

-jobconf mapred.job.reduce.capacity=n设置同时最多运行n个reduce任务

-jobconf mapred.map.tasks 设置map任务个数

-jobconf mapred.reduce.tasks 设置reduce任务个数

-jobconf mapred.compress.map.output 设置map的输出是否压缩

-jobconf mapred.map.output.compression.codec 设置map的输出压缩方式

-jobconf mapred.output.compress 设置reduce的输出是否压缩

-jobconf mapred.output.compression.codec 设置reduce的输出压缩方式

-jobconf stream.map.output.field.separator 设置map输出分隔符

例子：

-d stream.map.output.field.separator=: \ 以冒号进行分隔

-d stream.num.map.output.key.fields=2 \ 指定在第二个冒号处进行分隔，也就是第二个冒号之前的作为key，之后的作为value

（8）-combiner：指定combiner java类，对应的java类文件打包成jar文件后用-file分发。

（9）-partitioner：指定partitioner java类，streaming提供了一些实用的partitioner实现，参考keybasedfiledpartitoner和inthashpartitioner。

（10）-inputformat, -outputformat：指定inputformat和outputformat java类，用于读取输入数据和写入输出数据，分别要实现inputformat和outputformat接口。如果不指定，默认使用textinputformat和textoutputformat。

（11）cmdenv name=value：给mapper和reducer程序传递额外的环境变量，name是变量名，value是变量值。

（12）-mapdebug, -reducedebug：分别指定mapper和reducer程序失败时运行的debug程序。

（13）-verbose：指定输出详细信息，例如分发哪些文件，实际作业配置参数值等，可以用于调试。

以上这篇python api 操作hadoop hdfs详解就是小编分享给大家的全部内容了，希望能给大家一个参考，也希望大家多多支持。

上一篇：详解python如何引用包package

下一篇： python3.6.5基于kerberos认证的hive和hdfs连接调用方式

Python API 操作Hadoop hdfs详解

python访问hdfs的操作

Python API 操作Hadoop hdfs详解

Python自动化运维之Ansible定义主机与组规则操作详解

微信小程序学习笔记之登录API与获取用户信息操作图文详解

Python复制文件操作实例详解

python字典操作实例详解

Python使用Pandas库常见操作详解

Python实现简单的文本相似度分析操作详解

Python使用cx_Oracle模块操作Oracle数据库详解

hadoop入门之hdfs基本操作命令使用方法