How Hive Works

程序员文章站 (Programmer Article Station), 2023-10-12 16:02:17



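I. Hive managed (internal) and external tables

1. When you create a table in Hive, you choose one of two statements:

- create table

- create external table

2. Characteristics:

● When data is loaded into an external table, it is not moved into Hive's own warehouse directory; in other words, the data of an external table is not managed by Hive. A managed (internal) table is the opposite: its data lives in the warehouse directory and Hive manages it.

● When a managed table is dropped, Hive deletes both the table's metadata and its data. When an external table is dropped, Hive deletes only the metadata and leaves the data in place.

Notes:

1. create table creates a managed (internal) table; create external table creates an external table.

2. External tables are recommended for production work.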
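
The difference in drop behavior described above can be illustrated with a toy model. This is only a sketch: the class ToyWarehouse, its dictionaries, and the table names and paths are hypothetical stand-ins for the metastore and HDFS, not Hive APIs.

```python
# Toy model only: plain dictionaries stand in for the metastore and for HDFS.
class ToyWarehouse:
    def __init__(self):
        self.metadata = {}   # table name -> {"external": bool, "location": str}
        self.files = {}      # location -> list of rows (stands in for HDFS data)

    def create_table(self, name, location, external=False):
        self.metadata[name] = {"external": external, "location": location}
        self.files.setdefault(location, [])

    def drop_table(self, name):
        info = self.metadata.pop(name)
        # Only a managed (internal) table takes its data with it.
        if not info["external"]:
            self.files.pop(info["location"], None)


wh = ToyWarehouse()
wh.create_table("managed_t", "/wh/managed_t")
wh.create_table("external_t", "/data/external_t", external=True)
wh.drop_table("managed_t")
wh.drop_table("external_t")
print(sorted(wh.files))  # only the external table's location is left
```

After both drops the metadata is gone for both tables, but the data location of the external table still exists, which is why external tables are the safer choice when other tools also read the same files.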

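II. Partitions in Hive

● In Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored in that directory.

– For example, if the pvs table has two partition columns, ds and ctry, then
– the HDFS subdirectory for ds = 20090801, ctry = us is /wh/pvs/ds=20090801/ctry=us;
– the HDFS subdirectory for ds = 20090801, ctry = ca is /wh/pvs/ds=20090801/ctry=ca.

● Partitions support queries by narrowing the range of data that has to be scanned, which speeds up retrieval, and they let data be organized and managed according to defined rules and conditions.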
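
The directory layout above can be reproduced with a few lines of Python. This is a minimal sketch: partition_path is a hypothetical helper written for illustration, not something Hive provides.

```python
def partition_path(table_dir, partition_spec):
    """Build the directory for one partition from ordered (column, value) pairs."""
    parts = ["%s=%s" % (col, val) for col, val in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


us_dir = partition_path("/wh/pvs", [("ds", "20090801"), ("ctry", "us")])
ca_dir = partition_path("/wh/pvs", [("ds", "20090801"), ("ctry", "ca")])
print(us_dir)
print(ca_dir)
```

Because the partition values are encoded in the path, a query that filters on ds and ctry only needs to read the matching directory rather than the whole table.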

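III. Buckets in Hive

• A Hive table can be split into partitions, and a table or partition can be further divided into buckets with 'clustered by'; the data inside each bucket can be ordered with 'sorted by'.
• 'set hive.enforce.bucketing = true' lets Hive set the number of reducers in the preceding stage automatically so that it matches the number of buckets. Alternatively, you can set mapred.reduce.tasks yourself to match the bucket count.

• Main uses of buckets:
– Data sampling (random sampling)
– Speeding up certain queries, for example map-side joins

• Viewing sampled data:
– hive> select * from student tablesample(bucket 1 out of 2 on id);
– tablesample is the sampling clause; syntax: tablesample(bucket x out of y)
– y must be a multiple or a factor of the table's total number of buckets, and Hive uses y to determine the sampling ratio. For example, if the table has 64 buckets, y=32 samples 64/32 = 2 buckets of data, and y=128 samples 64/128 = 1/2 of a bucket. x is the bucket to start from. For example, if the table has 32 buckets, tablesample(bucket 3 out of 16) samples 32/16 = 2 buckets in total: bucket 3 and bucket 3+16 = 19.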
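
The arithmetic in the tablesample description can be checked with a short sketch. The two function names are hypothetical, and the code only models the rule as stated in the text (1-based bucket numbers, y a multiple or factor of the bucket count); it does not call Hive.

```python
from fractions import Fraction


def sampled_bucket_count(total_buckets, y):
    """How many buckets' worth of data tablesample(bucket x out of y) reads."""
    if total_buckets % y != 0 and y % total_buckets != 0:
        raise ValueError("y must be a multiple or a factor of the bucket count")
    return Fraction(total_buckets, y)


def whole_buckets(total_buckets, x, y):
    """Bucket numbers read when y is a factor of the bucket count."""
    if total_buckets % y != 0:
        raise ValueError("y must be a factor of the bucket count")
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    # Start at bucket x and step by y until the buckets run out.
    return [x + k * y for k in range(total_buckets // y)]


print(sampled_bucket_count(64, 32))
print(sampled_bucket_count(64, 128))
print(whole_buckets(32, 3, 16))
```

The three calls correspond to the three cases in the text: two whole buckets, half a bucket, and buckets 3 and 19.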

IV. Hive data types

1. Primitive types

• tinyint
• smallint
• int
• bigint
• boolean
• float
• double
• string
• binary (available from Hive 0.8.0)
• timestamp (available from Hive 0.8.0)

2. Complex types

• arrays: array<data_type>
• maps: map<primitive_type, data_type>
• structs: struct<col_name : data_type [comment col_comment], ...>
• union: uniontype<data_type, data_type, ...>

V. Hive SQL: join in MapReduce

insert overwrite table pv_users
select pv.pageid, u.age
from page_view pv
join user u
on (pv.userid = u.userid);

select pageid, age, count(1)
from pv_users
group by pageid, age;
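
To make the "join in MR" idea concrete, here is a simplified Python model of a reduce-side join followed by the group-by count. It is a sketch under assumptions: the sample rows are invented, the function names are hypothetical, and real Hive execution plans are considerably more involved.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; the column names follow the queries above.
page_view = [
    {"pageid": 1, "userid": 111},
    {"pageid": 2, "userid": 111},
    {"pageid": 1, "userid": 222},
    {"pageid": 1, "userid": 333},
]
user = [
    {"userid": 111, "age": 25},
    {"userid": 222, "age": 32},
    {"userid": 333, "age": 25},
]


def map_phase():
    """Emit (join key, (source tag, value)) pairs from both tables."""
    for row in page_view:
        yield row["userid"], ("pv", row["pageid"])
    for row in user:
        yield row["userid"], ("u", row["age"])


def shuffle(pairs):
    """Group the mapped values by join key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every page view with every matching user row."""
    for values in groups.values():
        pageids = [v for tag, v in values if tag == "pv"]
        ages = [v for tag, v in values if tag == "u"]
        for pageid in pageids:
            for age in ages:
                yield {"pageid": pageid, "age": age}


# First query: the join that fills pv_users.
pv_users = list(reduce_phase(shuffle(map_phase())))
# Second query: count rows per (pageid, age).
counts = Counter((row["pageid"], row["age"]) for row in pv_users)
print(sorted(counts.items()))
```

The tag attached in the map phase is what lets the reducer tell which table each value came from once rows with the same userid have been brought together.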

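VI. Hive optimization

• Map-side optimization:

– A job creates one or more map tasks from its input directory; the block size (set dfs.block.size) affects how many.
– Is a larger number of maps always better? Should each map process an amount of data close to one file block?
– How can small files be merged to reduce the number of maps?

set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

– How can the number of maps be increased when needed?

set mapred.map.tasks=10;

– Map-side aggregation: hive.map.aggr=true. This plays the role of combiners in MapReduce.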
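
The effect of map-side aggregation can be sketched in Python. This is an illustration only: the input data and function names are hypothetical, and the code models the combiner idea rather than Hive's actual implementation of hive.map.aggr.

```python
from collections import Counter

# Hypothetical input: each inner list is the set of keys one map task reads.
map_inputs = [
    ["a", "b", "a", "a"],
    ["b", "b", "c"],
]


def map_with_local_aggregation(keys):
    """Pre-aggregate inside the map task: one (key, count) per distinct key."""
    return Counter(keys)


def reduce_counts(partials):
    """Merge the partial counts coming from all map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partials = [map_with_local_aggregation(keys) for keys in map_inputs]
records_without_aggregation = sum(len(keys) for keys in map_inputs)
records_with_aggregation = sum(len(partial) for partial in partials)
totals = reduce_counts(partials)
print(records_without_aggregation, records_with_aggregation, dict(totals))
```

The final counts are the same either way; what changes is how many records have to cross the network between the map and reduce stages.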

VII. Nested functions

def father(name):
    print('from father %s' % name)
    def son():
        print('from the son')
        def grandson():
            print('from the grandson')
        grandson()
    son()

father('zhurui')

VIII. Closures

1. Closures

# A closure: the inner functions use name from the enclosing father() scope.
def father(name):
    def son():
        # name = 'simon1'  # uncommenting this would shadow the outer name inside son()
        print('My father is %s' % name)
        def grandson():
            print('My grandfather is %s' % name)
        grandson()
    son()

father('simon')
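
A closure also keeps its captured variables alive after the outer function has returned. The sketch below shows this with a counter; make_counter is a hypothetical name chosen for illustration.

```python
def make_counter(start=0):
    count = start

    def increment():
        nonlocal count  # rebind the variable in the enclosing scope
        count += 1
        return count

    return increment


counter_a = make_counter()
counter_b = make_counter(10)
first = counter_a()
second = counter_a()
other = counter_b()
print(first, second, other)
# The captured variable is visible through the function object itself.
print(counter_a.__code__.co_freevars)
```

Each call to make_counter creates a separate count variable, so counter_a and counter_b do not interfere with each other.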

2. A basic decorator built from a closure

import time

def timmer(func):
    def wrapper():
        # print(func)
        start_time = time.time()
        func()  # this runs the original test()
        stop_time = time.time()
        print('Run time: %s' % (stop_time - start_time))
    return wrapper

@timmer  # syntactic sugar; this is the key point
def test():
    time.sleep(3)
    print('test finished running')

# res = timmer(test)  # returns the wrapper function object
# res()               # calls wrapper()

# test = timmer(test)  # returns the wrapper function object
# test()               # calls wrapper()

test()

# Syntactic sugar:
# @timmer is equivalent to test = timmer(test)

3. Adding a return value to the decorator

# Without forwarding the return value
import time

def timmer(func):
    def wrapper():
        # print(func)
        start_time = time.time()
        func()  # runs the original test(); its return value is discarded
        stop_time = time.time()
        print('Run time: %s' % (stop_time - start_time))
        return 123
    return wrapper

@timmer  # syntactic sugar
def test():
    time.sleep(3)
    print('test finished running')
    return 'This is the return value of test'

res = test()  # this actually runs wrapper
print(res)

Output:
c:\python35\python3.exe g:/python_s3/day20/加上返回值.py
test finished running
Run time: 3.000171661376953
123

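# Forwarding the return value
import time

def timmer(func):
    def wrapper():
        # print(func)
        start_time = time.time()
        res = func()  # runs the original test() and keeps its result (change 1)
        stop_time = time.time()
        print('Run time: %s' % (stop_time - start_time))
        return res  # return the result of the wrapped function (change 2)
    return wrapper

@timmer  # syntactic sugar
def test():
    time.sleep(3)
    print('test finished running')
    return 'This is the return value of test'

res = test()  # this actually runs wrapper
print(res)

Output:
c:\python35\python3.exe g:/python_s3/day20/加上返回值.py
test finished running
Run time: 3.000171661376953
This is the return value of test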

4. Adding parameters to the decorator

import time

def timmer(func):
    def wrapper(name, age):  # accept name and age
        # print(func)
        start_time = time.time()
        res = func(name, age)  # pass name and age through
        stop_time = time.time()
        print('Run time: %s' % (stop_time - start_time))
        return res
    return wrapper

@timmer  # syntactic sugar
def test(name, age):  # accept name and age
    time.sleep(3)
    print('test finished running, name is [%s], age is [%s]' % (name, age))
    return 'This is the return value of test'

res = test('simon', 18)  # this actually runs wrapper
print(res)

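The version below uses variable-length arguments, so the same decorator works with any call signature:

import time

def timmer(func):
    def wrapper(*args, **kwargs):  # test('simon', age=18) gives args=('simon',), kwargs={'age': 18}
        # print(func)
        start_time = time.time()
        res = func(*args, **kwargs)  # runs the original function: func(*('simon',), **{'age': 18})
        stop_time = time.time()
        print('Run time: %s' % (stop_time - start_time))
        return res
    return wrapper

@timmer  # syntactic sugar
def test(name, age):
    time.sleep(3)
    print('test finished running, name is [%s], age is [%s]' % (name, age))
    return 'This is the return value of test'

@timmer  # the same decorator also fits a three-argument function
def test1(name, age, gender):
    time.sleep(1)
    print('test1 finished running, name is [%s], age is [%s], gender is [%s]' % (name, age, gender))

res = test('simon', 18)  # this actually runs wrapper
print(res)

test1('simon', 18, 'male')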

5. Using decorators

# Decorator without arguments
import time

def timmer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        stop_time = time.time()
        print('run time is %s' % (stop_time - start_time))
        return res
    return wrapper

@timmer
def foo():
    time.sleep(3)
    print('from foo')

foo()
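
One side effect of the pattern above is that the decorated function reports the wrapper's name and docstring instead of its own. A common remedy is functools.wraps from the standard library; the sketch below applies it to the same timmer decorator, with a docstring added to foo for illustration.

```python
import functools
import time


def timmer(func):
    @functools.wraps(func)  # copy __name__, __doc__, etc. from func onto wrapper
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        stop_time = time.time()
        print('run time is %s' % (stop_time - start_time))
        return res
    return wrapper


@timmer
def foo():
    """Return a fixed string."""
    return 'from foo'


result = foo()
print(foo.__name__, foo.__doc__)
```

Without the functools.wraps line, foo.__name__ would be 'wrapper' and the docstring would be lost, which makes debugging and generated documentation harder to read.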

# Decorator that takes arguments
def auth(driver='file'):
    def auth2(func):
        def wrapper(*args, **kwargs):
            name = input("user: ")
            pwd = input("pwd: ")

            if driver == 'file':
                if name == 'simon' and pwd == '123':
                    print('login successful')
                    res = func(*args, **kwargs)
                    return res
                print('login failed')  # wrong credentials: func is not called
            elif driver == 'ldap':
                print('ldap')
        return wrapper
    return auth2

@auth(driver='file')  # same as foo = auth(driver='file')(foo)
def foo(name):
    print(name)

foo('simon')
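
The three levels of nesting in a decorator with arguments are easier to follow without interactive input. The sketch below uses a hypothetical repeat decorator to show that the @ form and the explicit two-step call are equivalent.

```python
def repeat(times):
    """Outer layer: receives the decorator's own argument."""
    def decorator(func):
        """Middle layer: receives the function being decorated."""
        def wrapper(*args, **kwargs):
            """Inner layer: runs on every call to the decorated function."""
            return [func(*args, **kwargs) for _ in range(times)]
        return wrapper
    return decorator


@repeat(times=3)
def greet(name):
    return 'hello %s' % name


def greet_plain(name):
    return 'hello %s' % name


# The explicit form of what @repeat(times=3) does.
greet_manual = repeat(times=3)(greet_plain)
print(greet('simon'))
print(greet_manual('simon'))
```

repeat(times=3) is called first and returns decorator; only then is decorator applied to the function, which is why a decorator with arguments needs one more layer than a plain one.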

# Authentication decorator
user_list = [
    {'name': 'simon', 'passwd': '123'},
    {'name': 'zhurui', 'passwd': '123'},
    {'name': 'william', 'passwd': '123'},
    {'name': 'zhurui1', 'passwd': '123'},
]
current_dic = {'username': None, 'login': False}  # login state shared by all decorated functions


def auth_func(func):
    def wrapper(*args, **kwargs):
        # Already logged in: call the function directly.
        if current_dic['username'] and current_dic['login']:
            res = func(*args, **kwargs)
            return res
        username = input('Username: ').strip()
        passwd = input('Password: ').strip()
        for user_dic in user_list:
            if username == user_dic['name'] and passwd == user_dic['passwd']:
                current_dic['username'] = username
                current_dic['login'] = True
                res = func(*args, **kwargs)
                return res
        else:
            # Runs only when the loop finishes without finding a match.
            print('Wrong username or password')

        # if username == 'simon' and passwd == '123':
        #     user_dic['username'] = username
        #     user_dic['login'] = True
        #     res = func(*args, **kwargs)
        #     return res
        # else:
        #     print('Wrong username or password')
    return wrapper

@auth_func
def index():
    print('Welcome to the home page')

@auth_func
def home(name):
    print('Welcome home, %s' % name)

@auth_func
def shopping_car(name):
    print('%s has [%s,%s,%s] in the shopping cart' % (name, 'tableware', 'sofa', 'electric scooter'))

print('before----->', current_dic)
index()
print('after---->', current_dic)
home('simon')
# shopping_car('simon')

# Authentication decorator that takes an argument
user_list = [
    {'name': 'simon', 'passwd': '123'},
    {'name': 'zhurui', 'passwd': '123'},
    {'name': 'william', 'passwd': '123'},
    {'name': 'zhurui1', 'passwd': '123'},
]
current_dic = {'username': None, 'login': False}

def auth(auth_type='filedb'):
    def auth_func(func):
        def wrapper(*args, **kwargs):
            print('Authentication type is', auth_type)
            if auth_type == 'filedb':
                if current_dic['username'] and current_dic['login']:
                    res = func(*args, **kwargs)
                    return res
                username = input('Username: ').strip()
                passwd = input('Password: ').strip()
                for user_dic in user_list:
                    if username == user_dic['name'] and passwd == user_dic['passwd']:
                        current_dic['username'] = username
                        current_dic['login'] = True
                        res = func(*args, **kwargs)
                        return res
                else:
                    print('Wrong username or password')
            elif auth_type == 'ldap':
                print('LDAP authentication is not implemented here')
                res = func(*args, **kwargs)
                return res
            else:
                print('Unknown authentication type')
                res = func(*args, **kwargs)
                return res

        return wrapper
    return auth_func

@auth(auth_type='filedb')  # auth(auth_type='filedb') returns auth_func, which carries auth_type in its closure; then index = auth_func(index)
def index():
    print('Welcome to the home page')

@auth(auth_type='ldap')
def home(name):
    print('Welcome home, %s' % name)

@auth(auth_type='sssssss')
def shopping_car(name):
    print('%s has [%s,%s,%s] in the shopping cart' % (name, 'milk tea', 'little sister', 'doll'))

# print('before-->', current_dic)
# index()
# print('after--->', current_dic)
# home('simon')
shopping_car('simon')
