Using Elasticsearch as a Database: Table Schema Definition

程序员文章站 2022-06-14 08:46:27

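Elasticsearch offers very good query performance and a very powerful query syntax. In some scenarios it can replace an RDBMS for OLAP workloads. Its official query syntax, however, is not SQL but a DSL of Elasticsearch's own design. The DSL has two main parts: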

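Query DSL (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) corresponds to the WHERE clause in SQL and provides the various ways of filtering documents
Aggregation DSL (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html) corresponds to the GROUP BY clause in SQL and groups documents by some condition and computes metrics over each group, such as sums and averages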
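
To make the analogy concrete, here is a minimal sketch in Python of a search request body that uses both DSLs at once. It assumes the `symbol` index defined later in this article; the SQL statement in the comment, the year 2000 and the aggregation names `by_exchange` and `total_market_cap` are illustrative choices, not taken from the original.

```python
import json

# Rough SQL equivalent (illustrative only):
#   SELECT exchange, COUNT(*), SUM(market_cap)
#   FROM symbol WHERE ipo_year >= 2000 GROUP BY exchange
search_body = {
    "size": 0,  # return aggregation results only, no individual documents
    "query": {  # Query DSL: plays the role of WHERE
        "range": {"ipo_year": {"gte": 2000}}
    },
    "aggs": {  # Aggregation DSL: plays the role of GROUP BY
        "by_exchange": {
            # one bucket per exchange; each bucket's doc_count acts as COUNT(*)
            "terms": {"field": "exchange"},
            "aggs": {
                "total_market_cap": {"sum": {"field": "market_cap"}}
            },
        }
    },
}

print(json.dumps(search_body, indent=2))
```

The printed JSON is what would be sent as the body of a `_search` request. Note that a `terms` aggregation returns only the largest buckets by default, so it is not a complete GROUP BY unless its `size` is raised.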

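Frankly, these two DSLs are hard to learn and understand, and even once mastered they are tedious to write, but they are very powerful. This series of articles has two goals: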

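Experiment with and learn the syntax and semantics of the Elasticsearch aggregation DSL by analogy with SQL concepts
Implement a translator in Python that lets SQL do what the Elasticsearch aggregation DSL does. This small script can serve as a handy tool in day-to-day work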
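
As a rough idea of what such a translator does, the following toy sketch handles exactly one statement shape. It is not the translator built in this series: the grammar, the `translate` function and the naming scheme for the metric (`sum_market_cap`) are all assumptions made for illustration.

```python
import re

# Toy grammar (an assumption for illustration):
#   SELECT <FUNC>(<field or *>) FROM <index> GROUP BY <field>
PATTERN = re.compile(
    r"^\s*SELECT\s+(?P<func>\w+)\((?P<arg>[\w*]+)\)\s+"
    r"FROM\s+(?P<index>\w+)\s+GROUP\s+BY\s+(?P<group>\w+)\s*$",
    re.IGNORECASE,
)

# SQL functions that share their name with an Elasticsearch metric aggregation
METRICS = {"sum", "avg", "min", "max"}


def translate(sql):
    """Return (index, request_body) for the toy SQL subset above."""
    match = PATTERN.match(sql)
    if match is None:
        raise ValueError("unsupported SQL: %r" % sql)
    func = match.group("func").lower()
    arg = match.group("arg")
    group = match.group("group")
    terms = {"terms": {"field": group}}  # GROUP BY becomes a terms aggregation
    if func == "count" and arg == "*":
        pass  # every terms bucket already carries doc_count
    elif func in METRICS and arg != "*":
        # the metric is nested inside the bucket it is computed for
        terms["aggs"] = {"%s_%s" % (func, arg): {func: {"field": arg}}}
    else:
        raise ValueError("unsupported function: %s(%s)" % (func, arg))
    return match.group("index"), {"size": 0, "aggs": {group: terms}}


index, body = translate("SELECT SUM(market_cap) FROM symbol GROUP BY sector")
print(index, body)
```

A real translator needs a proper SQL parser and support for WHERE, multiple metrics and nested grouping; the point here is only that GROUP BY maps to a bucket aggregation and the selected function maps to a metric nested inside it.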

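Basic Elasticsearch knowledge (what a document is, what an index is) is not repeated here. Our focus is on learning the query and aggregation syntax. In this chapter we first prepare the sample data. The sample data chosen is the list of all US-listed stocks (http://www.nasdaq.com/screening/company-list.aspx). This data set was chosen because it has a rich set of dimensions (IPO year, sector, exchange and so on) and also numeric fields to aggregate on (last sale price, market capitalization). The data is downloaded in CSV format (https://github.com/taowen/es-monitor/tree/master/sample), and there is an import script (https://github.com/taowen/es-monitor/blob/master/sample/symbol.py).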

Below is the mapping used for the import into Elasticsearch (the equivalent of a table schema definition in a relational database):

{
    "symbol": {
        "properties": {
            "sector": {
                "index": "not_analyzed", 
                "type": "string"
            }, 
            "market_cap": {
                "index": "not_analyzed", 
                "type": "long"
            }, 
            "name": {
                "index": "analyzed", 
                "type": "string"
            }, 
            "ipo_year": {
                "index": "not_analyzed", 
                "type": "integer"
            }, 
            "exchange": {
                "index": "not_analyzed", 
                "type": "string"
            }, 
            "symbol": {
                "index": "not_analyzed", 
                "type": "string"
            }, 
            "last_sale": {
                "index": "not_analyzed", 
                "type": "long"
            }, 
            "industry": {
                "index": "not_analyzed", 
                "type": "string"
            }
        }, 
        "_source": {
            "enabled": true
        }, 
        "_all": {
            "enabled": false
        }
    }
}
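
Since the mapping is this regular, it can also be generated from a list of columns, much like a CREATE TABLE statement. The sketch below rebuilds the mapping above in Python; the helper name `build_mapping` is hypothetical and is not part of the author's import script. Keep in mind that the `string` type and the `not_analyzed` setting belong to the older Elasticsearch releases this article was written for; later major versions express the same distinction with the `keyword` and `text` types.

```python
import json

# (field name, type, index setting), copied from the mapping above
FIELDS = [
    ("sector", "string", "not_analyzed"),
    ("market_cap", "long", "not_analyzed"),
    ("name", "string", "analyzed"),
    ("ipo_year", "integer", "not_analyzed"),
    ("exchange", "string", "not_analyzed"),
    ("symbol", "string", "not_analyzed"),
    ("last_sale", "long", "not_analyzed"),
    ("industry", "string", "not_analyzed"),
]


def build_mapping(doc_type, fields):
    """Build a legacy-style mapping with _source enabled and _all disabled."""
    properties = {
        name: {"index": index, "type": field_type}
        for name, field_type, index in fields
    }
    return {
        doc_type: {
            "properties": properties,
            "_source": {"enabled": True},
            "_all": {"enabled": False},
        }
    }


mapping = build_mapping("symbol", FIELDS)
print(json.dumps(mapping, indent=4))
```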

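When using Elasticsearch as a database, we default to the following settings:

Set all fields to not_analyzed (in the mapping above, name is the one exception and is analyzed)
Enable _source, so that fields do not have to be stored individually; in most cases this is more efficient
Disable _all, because lookups are always k=v queries against known fields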
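
A k=v lookup against a not_analyzed field is an exact match on the whole stored value, which is what `WHERE field = value` means in SQL. The following sketch builds such a request body; the helper `term_filter` and the example values (`2010`, `"Technology"`) are hypothetical, and the `bool` query with a `filter` clause assumes Elasticsearch 2.0 or later.

```python
import json


def term_filter(**conditions):
    """Build a search body matching documents where every field equals its value."""
    # one term clause per k=v pair; clauses in a bool filter are ANDed together
    filters = [
        {"term": {field: value}}
        for field, value in sorted(conditions.items())
    ]
    return {"query": {"bool": {"filter": filters}}}


# Rough SQL equivalent: WHERE ipo_year = 2010 AND sector = 'Technology'
body = term_filter(ipo_year=2010, sector="Technology")
print(json.dumps(body))
```

Had `sector` been left analyzed, the stored terms would be lowercased word fragments and an exact match on the original value would generally find nothing, which is why not_analyzed is the sensible default for database-style use.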

After running python import-symbol.py to import the data, run

GET http://127.0.0.1:9200/symbol/_count

{"count":6714,"_shards":{"total":3,"successful":3,"failed":0}}
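
The same check can be scripted. This is a minimal sketch using only the Python standard library; the function names `fetch_count` and `check_count` are made up for this example, and the sample response is the one shown above so that the snippet runs without a cluster.

```python
import json
from urllib.request import urlopen


def fetch_count(base_url, index):
    """GET <base_url>/<index>/_count and return the decoded JSON response."""
    with urlopen("%s/%s/_count" % (base_url, index)) as response:
        return json.loads(response.read().decode("utf-8"))


def check_count(result):
    """Return the document count, raising if any shard failed to answer."""
    shards = result["_shards"]
    if shards["failed"] or shards["successful"] != shards["total"]:
        raise RuntimeError("incomplete count: %r" % shards)
    return result["count"]


# The response shown above, parsed locally instead of fetched from a server
sample = json.loads('{"count":6714,"_shards":{"total":3,"successful":3,"failed":0}}')
print(check_count(sample))  # prints 6714
```

Against a running node the call would be `check_count(fetch_count("http://127.0.0.1:9200", "symbol"))`. Checking `_shards` matters because a count computed while some shards are unavailable is silently too low.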

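We can see that the documents have been imported into the index. Besides the list of stocks, we can also import historical stock prices into the database. This data set is fairly large, so it is hosted on a cloud drive for download (https://yunpan.cn/cxRN6gLX7f9md access code 571c) (http://pan.baidu.com/s/1nufbLMx access code bes2). Run python import-quote.py to import it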

 "quote": {
    "_all": {
      "enabled": false
    },
    "_source": {
      "enabled": true
    }, 
    "properties": {
      "date": {
        "format": "strict_date_optional_time||epoch_millis",
        "type": "date"
      },
      "volume": {
        "type": "long"
      },
      "symbol": {
        "index": "not_analyzed",
        "type": "string"
      },
      "high": {
        "type": "long"
      },
      "low": {
        "type": "long"
      },
      "adj_close": {
        "type": "long"
      },
      "close": {
        "type": "long"
      },
      "open": {
        "type": "long"
      }
    }
  }

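From the mapping's point of view, this is very similar to a table schema definition. Apart from the concepts of _source, _all and analyzed, there is essentially no difference. The biggest differences when using Elasticsearch as a database are the relationship between index and mapping, and features such as index wildcards.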
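
Index wildcards have no direct counterpart in SQL: one search request can address several indices at once, either as a comma-separated list or as a wildcard pattern in the URL. The sketch below only builds such URLs. The helper `search_url` and the index names `quote-2015` and `quote-2016` are hypothetical, since this article stores all quotes in a single `quote` index.

```python
def search_url(base_url, *index_patterns):
    """Build a _search URL that targets one or more indices or wildcard patterns."""
    if not index_patterns:
        raise ValueError("at least one index name or pattern is required")
    # several names or patterns are joined with commas in the URL path
    return "%s/%s/_search" % (base_url.rstrip("/"), ",".join(index_patterns))


print(search_url("http://127.0.0.1:9200", "symbol"))
print(search_url("http://127.0.0.1:9200", "quote-2015", "quote-2016"))
print(search_url("http://127.0.0.1:9200", "quote-*"))
```

In relational terms, this is close to querying a UNION ALL over tables that share a schema without having to list them in the statement.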

Original article: https://segmentfault.com/a/1190000004433446