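Playing with big data in Python: a first look at HBase
Use Python to connect to MySQL, read a table, and write it into HBase (hail Hydra)
First, the libraries we need:
Just two!!
happybase (whoever wrote this must be truly happy)
pymysql
I recommend installing through Anaconda so the versions match. When installing happybase, note that conda's default channel does not carry it, so you need the conda-forge channel; see the following site:
pymysql needs no introduction: MySQL is used all over the industry, so the package is easy to find on any platform.
Once everything is installed, open PyCharm, create a basic Pure Python project, and point the interpreter at python.exe under anaconda3.
Then the fun begins:
step1:
Start your big data environment: Hadoop, ZooKeeper, and HBase (my memory practically exploded from all the background processes).
step2:
Start writing your scripts. I suggest keeping table-level operations and data-level operations in separate scripts, which makes things easier to control.
Here I wrote four simple scripts:
test.py, mysql.py, delete.py, scan.py
(test.py was unplanned at first; it is really just the table-creation script)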
test.py
#!/usr/bin/python
# coding:utf-8
import happybase

# Connect to the HBase Thrift server (host, port).
connection = happybase.Connection('localhost', 9090)

# Create table 'short' with three column families, each with default options.
connection.create_table(
    'short',
    {
        'base': dict(),
        'region': dict(),
        'infos': dict()
    }
)
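Nothing complicated here: to use HBase you have to connect to it, and happybase.Connection(host, port) takes just those two arguments (note the capital C). happybase talks to HBase through the Thrift server, whose default port is 9090, so the Thrift server has to be running as well.
Creating a table this way feels a lot like working in HBase itself. An HBase table really does behave like a dictionary nested inside a dictionary, almost too literally.
My table is named short and has three column families, base, region, and infos, which I designed around the data. The data is a CSV, shown below. If you want to try this example, save it as a CSV file and import it into MySQL first.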
customer_id,first_name,last_name,email,gender,address,country,language,job,credit_type,credit_no
1,spencer,raffeorty,sraffeorty0@dropbox.com,male,9274 lyons court,china,khmer,safety technician iii,jcb,3589373385487669
2,cherye,poynor,cpoynor1@51.la,female,1377 anzinger avenue,china,czech,research nurse,instapayment,6376594861844533
3,natasha,abendroth,nabendroth2@scribd.com,female,2913 evergreen lane,china,yiddish,budget/accounting analyst iv,visa,4041591905616356
4,huntley,seally,hseally3@prlog.org,male,694 del sol lane,china,albanian,environmental specialist,laser,677118310740263477
5,druci,coad,dcoad4@weibo.com,female,16 debs way,china,hebrew,teacher,jcb,3537287259845047
6,sayer,brizell,sbrizell5@opensource.org,male,71 banding terrace,china,maltese,accountant iv,americanexpress,379709885387687
7,becca,brawley,bbrawley6@sitemeter.com,female,7 doe crossing junction,china,czech,payment adjustment coordinator,jcb,3545377719922245
8,michele,bastable,mbastable7@sun.com,female,98 clyde gallagher pass,china,malayalam,tax accountant,jcb,3588131787131504
9,marla,brotherhood,mbrotherhood8@illinois.edu,female,4538 fair oaks trail,china,dari,design engineer,china-unionpay,5602233845197745479
10,lionello,gogarty,lgogarty9@histats.com,male,800 sage alley,china,danish,clinical specialist,diners-club-carte-blanche,30290846607043
11,camile,ringer,cringera@army.mil,female,5060 fairfield alley,china,punjabi,junior executive,china-unionpay,5602213490649878
12,gillan,banbridge,gbanbridgeb@wikipedia.org,female,91030 havey point,china,kurdish,chemical engineer,jcb,3555948058752802
13,guinna,damsell,gdamsellc@spiegel.de,female,869 ohio park,china,fijian,analyst programmer,jcb,3532009465228502
14,octavia,mcdugal,omcdugald@rambler.ru,female,413 forster center,china,english,desktop support technician,maestro,502017593120304035
15,anjanette,penk,apenke@lulu.com,female,8154 schiller road,china,swedish,vp sales,jcb,3548039055836788
16,maura,teesdale,mteesdalef@globo.com,female,9568 quincy alley,china,dutch,dental hygienist,jcb,3582894252458217
After importing into MySQL:
I imported it into the demo database.
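If you would rather do the import from Python as well, here is a minimal sketch. The table name testc and the connection parameters match the mysql.py script in this post; the column types (INT key, VARCHAR(128) for everything else) and the function names read_rows and load_into_mysql are my own assumptions, so adjust them to your setup.

```python
import csv
import io

# Column names come from the CSV header; the column types used below
# (INT / VARCHAR(128)) are assumptions, not part of the original post.
COLUMNS = ["customer_id", "first_name", "last_name", "email", "gender",
           "address", "country", "language", "job", "credit_type", "credit_no"]

CREATE_SQL = (
    "CREATE TABLE IF NOT EXISTS testc ("
    "`customer_id` INT PRIMARY KEY, "
    + ", ".join("`{}` VARCHAR(128)".format(c) for c in COLUMNS[1:])
    + ")"
)
INSERT_SQL = "INSERT INTO testc ({}) VALUES ({})".format(
    ", ".join("`{}`".format(c) for c in COLUMNS),
    ", ".join(["%s"] * len(COLUMNS)),
)


def read_rows(csv_text):
    """Parse CSV text (with header row) into tuples ordered like COLUMNS."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [tuple(record[c] for c in COLUMNS) for record in reader]


def load_into_mysql(csv_path):
    """Create table testc and insert every CSV row (needs a running MySQL)."""
    import pymysql  # imported here so read_rows() works without pymysql

    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = read_rows(fh.read())
    db = pymysql.connect(host="127.0.0.1", port=3307, user="root",
                         password="hadoop", database="demo")
    try:
        with db.cursor() as cursor:
            cursor.execute(CREATE_SQL)
            cursor.executemany(INSERT_SQL, rows)
        db.commit()
    finally:
        db.close()
```

Calling load_into_mysql with the path of your saved CSV file would create and fill the table in one go. The values are passed as parameters rather than formatted into the SQL string, which avoids quoting problems with addresses and job titles.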
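Next, you can go play with the snake.
Does the order feel chaotic??? Good, chaotic is right.
What we do now is connect to the database, read the data, and insert it into HBase: a select on the MySQL side and a put on the HBase side. Take notes.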
mysql.py
#!/usr/bin/python
# coding:utf-8
import pymysql
import happybase


class testc:
    """One row of the MySQL table testc."""

    def __init__(self, customer_id, first_name, last_name, email, gender,
                 address, country, language, job, credit_type, credit_no):
        self._key = customer_id
        self._first_name = first_name
        self._last_name = last_name
        self._email = email
        self._gender = gender
        self._address = address
        self._country = country
        self._language = language
        self._job = job
        self._credit_type = credit_type
        self._credit_no = credit_no

    def get(self):
        """Return all fields as a list, row key first."""
        return [self._key, self._first_name, self._last_name,
                self._email, self._gender, self._address,
                self._country, self._language, self._job,
                self._credit_type, self._credit_no]

    def __str__(self):
        return ','.join(str(field) for field in self.get())


connection = happybase.Connection('localhost', 9090)

db = pymysql.connect(host='127.0.0.1', port=3307, user='root', password='hadoop', database='demo')
cursor = db.cursor()

sql = 'select * from testc'
cursor.execute(sql)
datalist = []
for record in cursor.fetchall():
    # Each record is a tuple; the first 11 columns follow the CSV header order.
    item = testc(*record[:11])
    datalist.append(item)
    print(item)

# The MySQL rows are now stored in datalist; next, copy them into HBase.
table = connection.table('short')
for data_ in datalist:
    row = data_.get()
    table.put(
        bytes('{}'.format(row[0]), encoding='ascii'),
        {
            b'base:first_name': bytes('{}'.format(row[1]), encoding='ascii'),
            b'base:last_name': bytes('{}'.format(row[2]), encoding='ascii'),
            b'base:email': bytes('{}'.format(row[3]), encoding='ascii'),
            b'base:gender': bytes('{}'.format(row[4]), encoding='ascii'),
            b'region:address': bytes('{}'.format(row[5]), encoding='ascii'),
            b'region:country': bytes('{}'.format(row[6]), encoding='ascii'),
            b'infos:language': bytes('{}'.format(row[7]), encoding='ascii'),
            b'infos:job': bytes('{}'.format(row[8]), encoding='ascii'),
            b'infos:credit_type': bytes('{}'.format(row[9]), encoding='ascii'),
            b'infos:credit_no': bytes('{}'.format(row[10]), encoding='ascii')
        }
    )

db.close()
connection.close()
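The overall idea is to shape the queried data into a fixed format and then write it. I wrapped each row in a class here (which is not really necessary); when you try this yourself, consider receiving the rows directly in a list.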
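As a rough sketch of that simpler route, the rows can stay as plain tuples and be mapped onto the column families with a list of qualifiers. The names QUALIFIERS, to_put_args, and write_rows are hypothetical helpers of mine; table.batch() is happybase's batching interface, which I use here on the assumption that sending the puts together is preferable to one call per row.

```python
# Column layout follows the table design in this post (base / region / infos).
# QUALIFIERS, to_put_args and write_rows are illustrative names only.
QUALIFIERS = [
    "base:first_name", "base:last_name", "base:email", "base:gender",
    "region:address", "region:country",
    "infos:language", "infos:job", "infos:credit_type", "infos:credit_no",
]


def to_put_args(row):
    """Turn one MySQL row (customer_id first) into (row_key, cells) for put()."""
    key = str(row[0]).encode("ascii")
    cells = {
        qualifier.encode("ascii"): str(value).encode("ascii")
        for qualifier, value in zip(QUALIFIERS, row[1:])
    }
    return key, cells


def write_rows(table, rows):
    """Write all rows through one happybase batch; return how many were queued."""
    count = 0
    with table.batch() as batch:  # the batch is sent when the block exits
        for row in rows:
            key, cells = to_put_args(row)
            batch.put(key, cells)
            count += 1
    return count
```

With this in place, write_rows(connection.table('short'), cursor.fetchall()) would replace both the class and the final loop of mysql.py.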
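Connecting to MySQL requires the pymysql library:
db = pymysql.connect(host='127.0.0.1', port=3307, user='root', password='hadoop', database='demo')
The parameters should be self-explanatory, so I won't go into them (note that my MySQL listens on 3307 rather than the default 3306).
Then there is the cursor object, cursor = db.cursor(). Think of it as a proxy: you use it to execute SQL statements and retrieve the results.
The cursor has three fetch methods: fetchall, fetchone, and fetchmany. As the names suggest, they fetch all rows, one row, and several rows; for several rows you naturally pass the count as an argument.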
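To see the three fetch methods side by side without a MySQL server, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. That substitution is my assumption for the sake of a runnable demo: both sqlite3 and pymysql follow the Python DB-API, so the fetch calls look the same, although pymysql returns rows as tuples of tuples and uses %s placeholders instead of ?. The helper names iter_rows and demo are made up for this example.

```python
import sqlite3


def iter_rows(cursor, batch_size=2):
    """Yield rows from an executed cursor, fetching batch_size rows at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # empty result means the cursor is exhausted
            break
        for row in batch:
            yield row


def demo():
    """Run the three fetch styles against a throwaway in-memory table."""
    db = sqlite3.connect(":memory:")
    cursor = db.cursor()
    cursor.execute("CREATE TABLE testc (customer_id INTEGER, first_name TEXT)")
    cursor.executemany("INSERT INTO testc VALUES (?, ?)",
                       [(1, "spencer"), (2, "cherye"), (3, "natasha")])

    cursor.execute("SELECT * FROM testc ORDER BY customer_id")
    first = cursor.fetchone()       # a single row
    next_two = cursor.fetchmany(2)  # up to two more rows
    rest = cursor.fetchall()        # whatever is left

    cursor.execute("SELECT * FROM testc ORDER BY customer_id")
    streamed = list(iter_rows(cursor, batch_size=2))
    db.close()
    return first, next_two, rest, streamed
```

For a 16-row table fetchall is perfectly fine; a loop like iter_rows starts to matter once the table no longer fits comfortably in memory.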
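Pick a suitable container for the data you get back, process it into a consistent shape, and it is ready to import into HBase. The for loop at the end is the import code. Everything goes through bytes() because HBase only stores raw bytes, so I encoded the values as ASCII; otherwise you will see unwanted UTF-8 characters in the scan that follows.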
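One possible way to handle the encoding, offered as a sketch rather than the definitive answer: since HBase stores raw bytes and happybase hands bytes back, you can encode with UTF-8 on the way in and decode with the same encoding on the way out. ASCII is a subset of UTF-8, so the sample data is stored identically, and non-ASCII values no longer raise an error. The helper names encode_cell and decode_row are hypothetical.

```python
def encode_cell(value, encoding="utf-8"):
    """Convert any Python value to bytes for HBase via its string form."""
    return str(value).encode(encoding)


def decode_row(key, data, encoding="utf-8"):
    """Decode a (row_key, cells) pair, as yielded by table.scan(), into str."""
    return key.decode(encoding), {
        column.decode(encoding): value.decode(encoding)
        for column, value in data.items()
    }
```

In the scan loop this would look like key, cells = decode_row(key, data) before printing, which gives ordinary strings instead of the b'...' representation.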
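Once the import is done, let's look at it with scan. In HBase, scan is used to view a whole table, and the table object here has a method of the same name. The author of this library really knows his stuff.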
scan.py
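#!/usr/bin/python
# coding:utf-8

import happybase

connection = happybase.Connection('localhost', 9090)
table = connection.table('short')

# scan() yields (row_key, cells) pairs; both are bytes, so str(key) prints as b'...'.
for key, data in table.scan():
    print(str(key), data)
This one is short, since it works on the table as a whole.
Here is the result I got.
That basically completes this little experiment. If you hit pitfalls along the way you may need to drop and recreate the table repeatedly, so here is one more script: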
delete.py
#!/usr/bin/python
# coding:utf-8
import happybase

connection = happybase.Connection('localhost', 9090)
# A table must be disabled before HBase allows it to be deleted.
connection.disable_table('short')
connection.delete_table('short')
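Since dropping and recreating tends to happen in pairs, the two scripts can be folded into one helper. This is only a sketch: reset_table and FAMILIES are names I made up, and I am assuming connection.tables() may return the names either as bytes or as str depending on the setup, so the code normalizes both. delete_table(name, disable=True) is happybase's shortcut for disabling and deleting in one call.

```python
# Same three column families as the table-creation script in this post.
FAMILIES = {"base": dict(), "region": dict(), "infos": dict()}


def reset_table(connection, name, families=None):
    """Drop the table if it exists, then create it again with the families."""
    if families is None:
        families = FAMILIES
    # Normalize table names to str in case they come back as bytes.
    existing = {
        t.decode("utf-8") if isinstance(t, bytes) else t
        for t in connection.tables()
    }
    if name in existing:
        connection.delete_table(name, disable=True)
    connection.create_table(name, families)
```

With a live connection, reset_table(happybase.Connection('localhost', 9090), 'short') would give you a clean, empty table before each test run.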
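OK, that's as far as I can take you; the rest is up to your own study. If any expert knows a better way to handle character encoding during data import, please tell me. Knowledge is power! Thanks in advance.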
hail hydra