欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

hive get_json_object json_tuple json解析详解

程序员文章站 2022-07-13 12:28:30
...

1.hive中处理json的两个函数

json是常见的数据接口形式,实际中使用也很广泛,下面我们看看怎么在hive中解析json格式。

hive中常用的解析json格式的函数有两个:

先看看get_json_object

> desc function extended get_json_object;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                      tab_name                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| get_json_object(json_txt, path) - Extract a json object from path                                                                                                                   |
| Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid.  |
| A limited version of JSONPath supported:                                                                                                                                            |
|   $   : Root object                                                                                                                                                                 |
|   .   : Child operator                                                                                                                                                              |
|   []  : Subscript operator for array                                                                                                                                                |
|   *   : Wildcard for []                                                                                                                                                             |
| Syntax not supported that's worth noticing:                                                                                                                                         |
|   ''  : Zero length string as key                                                                                                                                                   |
|   ..  : Recursive descent                                                                                                                                                           |
|   @   : Current object/element                                                                                                                                             |
|   ()  : Script expression                                                                                                                                                           |
|   ?() : Filter (script) expression.                                                                                                                                                 |
|   [,] : Union operator                                                                                                                                                              |
|   [start:end:step] : array slice operator                                                                                                                                           |
|                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
16 rows selected (0.579 seconds)

从上面可以看出,get_json_object输入的参数是两个,json_txt与path。
其中,json_txt就是我们需要解析的json字符串,而path表示json中的"字段"。

再看看json_tuple方法

> desc function extended json_tuple;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                  tab_name                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
1 row selected (0.549 seconds)

从上面的解释不难看出,json_tuple与get_json_object的用法很类似,唯一的区别是,json_tuple输入的是json字符串中的多个"字段",返回一个元组,元组中的所有数据类型为string。

2.最简单的json字符串解析

假设有如下一条测试数据

{"age":18, "name": "lili", "gender": "female"}

如果想要解析age字段

> select get_json_object('{"age":18, "name": "lili", "gender": "female"}', "$.age");
+------+--+
| _c0  |
+------+--+
| 18   |
+------+--+

如果使用json_tuple方法,也可以。

> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age");
+-----+--+
| c0  |
+-----+--+
| 18  |
+-----+--+

如果json串里是个数组,可以采取如下提取的方式

> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0,1,2].age") as age;
+-------------+--+
|     age     |
+-------------+--+
| [18,19,15]  |
+-------------+--+

如果只是想查询数组里第一个对象的age值,可以用下面的提取方法

> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0].age");
+------+--+
| _c0  |
+------+--+
| 18   |
+------+--+

如果想同时提取所有字段,可以用json_tuple方法。

> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age", "name", "gender");
+-----+-------+---------+--+
| c0  |  c1   |   c2    |
+-----+-------+---------+--+
| 18  | lili  | female  |
+-----+-------+---------+--+

3.解析json数组

上面的例子,都是解析单个或者确定个数的json对象。但如果不知道json数组有多少个,或者数组长度不确定时,我们可以采用下面迂回的方式来解析。

还是以上面的数组为例

[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]

如果我们想要解析出所有的age字段,该怎么办?

第一步,将"},{”中的,替换成; ,同时将数组的"[]"符号也去除:

> select regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', '');
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                     _c0                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"};{"age":19, "name": "lucy", "gender": "female"};{"age":15, "name": "mike", "gender": "male"}  |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+

第二部,将字符串用split方法分割,得到一个数组

> select split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;");
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                        _c0                                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| ["{\"age\":18, \"name\": \"lili\", \"gender\": \"female\"}","{\"age\":19, \"name\": \"lucy\", \"gender\": \"female\"}","{\"age\":15, \"name\": \"mike\", \"gender\": \"male\"}"]  |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+

第三步,使用explode方法将数据变成一列

> select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;"));
+-------------------------------------------------+--+
|                       col                       |
+-------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"}  |
| {"age":19, "name": "lucy", "gender": "female"}  |
| {"age":15, "name": "mike", "gender": "male"}    |
+-------------------------------------------------+--+

第四步,使用get_json_object解析即可。

> select get_json_object(json_data, "$.age") from (select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;")) as json_data) virtual_table;
+------+--+
| _c0  |
+------+--+
| 18   |
| 19   |
| 15   |
+------+--+

以上步骤,就完成了对json对象数组的解析。