Using the Elasticsearch Search Engine with Python (Part 2)

程序员文章站 2022-04-29 08:10:02


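Environment: Windows 10 + PyCharm + Python 3.7.1

Installing Elasticsearch

Elasticsearch can be downloaded from the official website at https://www.elastic.co/downloads/elasticsearch, which also provides installation instructions.

Download and unpack the archive, then run bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows) to start Elasticsearch.

The Python client is a separate package and is installed with pip:

pip3 install elasticsearch

The client's official documentation is at https://elasticsearch-py.readthedocs.io/. Every call used here is described there, and the rest of this article follows it.

Once everything is installed, write a script that starts Elasticsearch and Kibana: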

import os
import time
import random

# Paths introduced in the previous article; adjust them to your own installation.
elasticsearch = r'C:\Users\Jason\Desktop\PyPersonalAbility\ElasticSearchStudy\elasticsearch-rtf\bin\elasticsearch.bat'
kibana = r'C:\Users\Jason\Desktop\PyPersonalAbility\ElasticSearchStudy\kibana-5.1.2-windows-x86\kibana-5.1.2-windows-x86\bin\kibana.bat'


def progress_bar(item):
    # Cosmetic countdown only: it sleeps for random intervals and does not
    # check whether the service has actually finished starting.
    for i in range(11, 0, -1):
        if item == 'kibana':
            time.sleep(random.random() + 0.8)
        else:
            time.sleep(random.random() + 0.4)
        res = '\r%s loading: %s %s%%' % (item, '████' * (12 - i), (11 - i) * 10)
        if i == 1:
            res += '\n'  # finish the line once 100% is reached
        print('\033[31m%s\033[0m' % res, end='')


def run():
    for path, name in [(elasticsearch, 'elasticsearch'), (kibana, 'kibana')]:
        os.system('start %s' % path)  # "start" opens each .bat in its own console window
        progress_bar(name)
        time.sleep(10)


if __name__ == '__main__':
    run()

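Run the startup script and wait for both progress bars to reach 100%.

Elasticsearch listens on port 9200 by default. Open http://localhost:9200/ in a browser and you should see a JSON document describing the node (its name, cluster name, and version).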
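Since Elasticsearch answers on port 9200 once it is up, a script can poll that address instead of waiting a fixed time. The following is a minimal sketch using only the standard library; the function name wait_until_up and its parameters are our own, not part of any Elasticsearch tooling.

```python
import time
import urllib.request


def wait_until_up(url="http://localhost:9200/", timeout=60.0, interval=1.0,
                  opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    `opener` and `sleep` are injectable so the function can be exercised
    without a running node; both are illustrative parameters.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url) as response:
                if response.status == 200:
                    return True
        except OSError:
            pass  # connection refused or HTTP error: the node is not ready yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_up() after launching elasticsearch.bat returns True as soon as the node responds, and False if it has not responded within the timeout.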

Creating an index ==> indices.create

from elasticsearch import Elasticsearch

# Create the index
es = Elasticsearch()
result = es.indices.create(index='news', ignore = 400)

print(result)

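If the index is created, the call returns a JSON response along the lines of {'acknowledged': True, ...}.

The acknowledged field indicates that the create operation succeeded.

If we run the code a second time, the index already exists and Elasticsearch answers with HTTP 400. Because ignore=400 was passed, the client does not raise an exception; it returns the error body (with 'status': 400) instead. Without the ignore parameter the same call raises an exception.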
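An alternative to passing ignore=400 is to check whether the index exists before creating it. This is a small sketch; ensure_index is a hypothetical helper of ours, and it relies only on the client's indices.exists() and indices.create() calls.

```python
def ensure_index(es, name):
    """Create index `name` only if it is missing. Returns True when it was created."""
    if es.indices.exists(index=name):
        return False
    es.indices.create(index=name)
    return True


# Usage with a real client (needs a running node):
#   from elasticsearch import Elasticsearch
#   ensure_index(Elasticsearch(), "news")
```

The script can then be run repeatedly without producing the "already exists" error response.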

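Deleting an index ==> indices.delete

Deleting an index works the same way:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# A missing index produces 404, so ignore that status as well
result = es.indices.delete(index="news", ignore=[400, 404])
print(result)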

Inserting data

Like MongoDB, Elasticsearch accepts structured dictionary data directly when inserting.

1. Data can be inserted with the create() method. Here we insert two documents, the first with create() and the second with index():

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Method 1: es.create() requires an id
data = {"title":"python Web开发","url":"https://blog.csdn.net/wufaliang003/article/details/81368365"}
result = es.create(index="news",doc_type="work",id=1,body=data)
print(result)
# Method 2: es.index() works without an id
result1 = es.index(index="news",doc_type="work",body={"title":"数据分析","url":"https://www.baidu.com/"})
print(result1)

We first declare a news item with a title and a URL and insert it by calling create(). Four arguments are passed: index is the index name, doc_type is the document type, body is the document content, and id is the document's unique identifier.
We can also insert data with index(). The difference is that create() requires an id that uniquely identifies the document, while index() does not: if no id is given, one is generated automatically. Method 2 above shows the index() call. Running the code prints:

{'_index': 'news', '_type': 'work', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, 'created': True}
{'_index': 'news', '_type': 'work', '_id': 'AW24hyEc94oPzuXLihoS', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, 'created': True}
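The fields worth reading in these responses are _id (given or generated), _version, and result. A short sketch that pulls them out follows; summarize_write is our own helper name, and the sample dictionaries are the two responses above trimmed to the fields used.

```python
def summarize_write(result):
    """Return (id, version, outcome) from the response of create() or index()."""
    return result["_id"], result["_version"], result["result"]


# Responses copied from the output above, trimmed to the fields that are read.
first = {"_index": "news", "_type": "work", "_id": "1",
         "_version": 1, "result": "created"}
second = {"_index": "news", "_type": "work", "_id": "AW24hyEc94oPzuXLihoS",
          "_version": 1, "result": "created"}

for response in (first, second):
    doc_id, version, outcome = summarize_write(response)
    print("%s: version %s, %s" % (doc_id, version, outcome))
```

The generated id of the second document is what we would need to keep if we later wanted to fetch, update, or delete it.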

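Updating data (update() or index())

1. Updating is just as simple: specify the document id and the new content, then call update() or index(). Note that update() expects the changed fields to be wrapped in a "doc" key. Passing the bare dictionary as body makes the request fail, which is the likely reason update() appeared not to work in our environment. index() with an existing id replaces the whole document.

from elasticsearch import Elasticsearch

es = Elasticsearch()

data = {
    "title":"python Web开发",
    "url":"https://blog.csdn.net/wufaliang003/article/details/81368365",
    "date":"2018-10-10"
}

# Method 1: es.update() by id; the partial document must be wrapped in "doc"
# result = es.update(index="news",doc_type="work",id=1,body={"doc": data})

# Method 2: es.index() by id; replaces the whole document
result = es.index(index="news",doc_type="work",body=data,id=1)
print(result)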

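Result of the update:

{'_index': 'news', '_type': 'work', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, 'created': False}

2. Conditional update

update_by_query updates every document that matches a condition. The query part is written the same way as for the delete and search calls covered in the following sections.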
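As a rough sketch of what an update_by_query body can look like, the helper below combines a query with a Painless script that sets one field. build_update_by_query and its parameters are our own names, and the script key is an assumption that depends on the server version: "source" on Elasticsearch 5.6 and later, "inline" on older 5.x releases.

```python
def build_update_by_query(field, value, query, script_key="source"):
    """Build a body that sets `field` to `value` on every document matching `query`.

    script_key is "source" on Elasticsearch 5.6 and later; older 5.x
    releases use "inline" instead.
    """
    return {
        "query": query,
        "script": {
            "lang": "painless",
            script_key: "ctx._source[params.field] = params.value",
            "params": {"field": field, "value": value},
        },
    }


body = build_update_by_query("date", "2018-10-11", {"match": {"title": "python"}})
# With a running node:
#   es.update_by_query(index="news", doc_type="work", body=body)
```

Passing the new value through params rather than formatting it into the script text avoids quoting problems.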

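Deleting data (delete(index, doc_type, id))

To delete a single document, call delete() with the id of the document:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.delete(index='news', doc_type = 'work',id = 1)
print(result)

Deleting by condition:

delete_by_query deletes every document that matches a condition. The condition must be written in the query DSL. Use one of the two example queries below; the second assignment replaces the first.

query = {'query': {'match': {'id': '1'}}}  # matches all documents whose id field is '1'

query = {'query': {'range': {'age': {'lt': 11}}}}  # matches all documents with age less than 11

es.delete_by_query(index='indexName', body=query, doc_type='typeName')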
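Because delete_by_query removes everything that matches, it can be worth counting the matches first. The sketch below does that with the client's count() call; delete_matching and its dry_run flag are hypothetical names of ours.

```python
def delete_matching(es, index, query, dry_run=True):
    """Count the documents matching `query`; delete them only if dry_run is False.

    Returns the number of documents that matched (dry run) or were deleted.
    """
    if dry_run:
        return es.count(index=index, body=query)["count"]
    return es.delete_by_query(index=index, body=query)["deleted"]


# With a running node:
#   query = {'query': {'range': {'age': {'lt': 11}}}}
#   print(delete_matching(es, 'indexName', query))                 # preview only
#   print(delete_matching(es, 'indexName', query, dry_run=False))  # really delete
```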

Querying data

There are two ways to query data: get and search.
# Fetch a single document by id with get
result = es.get(index="my-index", doc_type="test-type", id=1)
es.get(index='indexName', doc_type='typeName', id='idValue')
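get() raises an exception when the id does not exist. A small sketch that returns None in that case is shown here; get_source is our own helper name, and it relies on the client's ignore parameter and on the found field of the response.

```python
def get_source(es, index, doc_type, doc_id):
    """Return the stored document for `doc_id`, or None if it does not exist."""
    # ignore=404 makes the client return the response body instead of raising.
    result = es.get(index=index, doc_type=doc_type, id=doc_id, ignore=404)
    if result.get("found"):
        return result["_source"]
    return None


# With a running node:
#   doc = get_source(es, "news", "work", 1)
#   print(doc["title"] if doc else "no such document")
```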

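The operations above are all simple ones that an ordinary database such as MongoDB can also perform, so they are nothing special. What sets Elasticsearch apart is its powerful search capability.

For Chinese text we need a word segmentation plugin. Here we use elasticsearch-analysis-ik, available at https://github.com/medcl/elasticsearch-analysis-ik/releases. It can be installed with Elasticsearch's elasticsearch-plugin command-line tool or by copying the files manually. The version installed here is 5.1.1; make sure the plugin version matches your Elasticsearch version. The manual steps are roughly:

1. Download elasticsearch-analysis-ik-5.1.1.

2. Under C:\Users\Jason\Desktop\PyPersonalAbility\ElasticSearchStudy\elasticsearch-rtf\plugins create a folder named IK and copy the configuration files and jar files from elasticsearch-analysis-ik-5.1.1 into it.

For detailed installation instructions see this blog post: https://blog.csdn.net/didiaodeabing/article/details/79309046

After installation, restart Elasticsearch; it loads installed plugins automatically.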

First we create a new index and specify the field that should be analyzed:

from elasticsearch import Elasticsearch
 
es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)

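Here we delete the previous index, create a new one, and then update its mapping. The mapping declares the title field as type text and sets both analyzer and search_analyzer to ik_max_word, the Chinese analyzer provided by the plugin we just installed. Without this setting Elasticsearch uses its default standard analyzer, which splits Chinese text into single characters.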
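To see how an analyzer actually splits a title, the client's indices.analyze() call can be used. The sketch below wraps it; tokens_for is our own helper name, and the token list depends on the installed plugin version, so no expected output is shown.

```python
def tokens_for(es, index, analyzer, text):
    """Return the list of tokens that `analyzer` produces for `text`."""
    body = {"analyzer": analyzer, "text": text}
    response = es.indices.analyze(index=index, body=body)
    return [item["token"] for item in response["tokens"]]


# With a running node and the IK plugin installed:
#   print(tokens_for(es, "news", "ik_max_word", "中国驻洛杉矶领事馆"))
#   print(tokens_for(es, "news", "standard", "中国驻洛杉矶领事馆"))
```

Comparing the two calls shows the difference between word-level and character-level segmentation.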

Next we insert several new documents:

datas = [
    {
        'title': '美国留给伊拉克的是个烂摊子吗',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': '*部：各地校车将享最高路权',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': '中韩渔警冲突调查：韩警平均每天扣1艘中国渔船',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': '中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]
 
for data in datas:
    es.index(index='news', doc_type='politics', body=data)

Here we define four documents, each with title, url, and date fields, and insert them into the news index with index().
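For larger batches, one request per document becomes slow. The client ships a bulk helper, elasticsearch.helpers.bulk(), that sends many documents in one request. The sketch below only builds the actions it expects; to_actions is our own helper name.

```python
def to_actions(index, doc_type, docs):
    """Yield one bulk action per document, in the format helpers.bulk() accepts."""
    for doc in docs:
        yield {"_index": index, "_type": doc_type, "_source": doc}


sample = [
    {"title": "first", "date": "2011-12-16"},
    {"title": "second", "date": "2011-12-17"},
]
actions = list(to_actions("news", "politics", sample))
print(len(actions))

# With a running node:
#   from elasticsearch import helpers
#   helpers.bulk(es, to_actions("news", "politics", datas))
```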

Next we run a search without any conditions, which returns every document in the index:

result = es.search(index='news')
print(result)

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "c05G9mQBD9BuE5fdHOUT",
        "_score": 1.0,
        "_source": {
          "title": "美国留给伊拉克的是个烂摊子吗",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 1.0,
        "_source": {
          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击，嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 1.0,
        "_source": {
          "title": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dE5G9mQBD9BuE5fdHOUf",
        "_score": 1.0,
        "_source": {
          "title": "*部：各地校车将享最高路权",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}

The matching documents are in the hits field. Its total field gives the number of results, and max_score is the highest relevance score.
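In practice we rarely print the whole response. A short sketch for reading the hit count and the ranked titles follows; both function names are ours, and the sample dictionary is a trimmed stand-in for a real response. Note that Elasticsearch 7 and later report total as a dictionary rather than a number, which the second helper accounts for.

```python
def titles_with_scores(result):
    """Return (score, title) pairs from a search response, in ranked order."""
    return [(hit["_score"], hit["_source"]["title"])
            for hit in result["hits"]["hits"]]


def total_hits(result):
    """Return the hit count; 7.x and later wrap it in {"value": ...}."""
    total = result["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


# Trimmed stand-in for a search response
sample_result = {
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_score": 1.0, "_source": {"title": "first"}},
            {"_score": 0.5, "_source": {"title": "second"}},
        ],
    }
}
print(total_hits(sample_result), titles_with_scores(sample_result))
```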

We can also run a full-text search, which is where Elasticsearch shows its strength as a search engine:

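import json

from elasticsearch import Elasticsearch

dsl = {
    'query': {
        'match': {
            'title': '中国 领事馆'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
# ensure_ascii=False keeps the Chinese titles readable in the output
print(json.dumps(result, indent=2, ensure_ascii=False))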

Here we query with the DSL that Elasticsearch supports. match requests a full-text search on the title field for the text "中国 领事馆" ("China consulate"). The result is:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "中国驻洛杉矶领事馆遭亚裔男子枪击，嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}

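There are two matches. The first scores 2.54 and the second 0.28. The first document contains both "中国" and "领事馆". The second does not contain "领事馆" but does contain "中国", so it is still returned, with a lower score.

So a search runs a full-text match against the field and sorts the results by relevance to the search terms, which is the basic skeleton of a search engine.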
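By default match returns a document if any of the search terms is present. If every term should be required, the match query accepts an operator option. A minimal sketch that builds such a query follows; match_all_terms is our own helper name.

```python
def match_all_terms(field, text):
    """Build a match query in which every analyzed term must be present."""
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}


strict_dsl = match_all_terms("title", "中国 领事馆")
print(strict_dsl)

# With a running node:
#   result = es.search(index='news', doc_type='politics', body=strict_dsl)
```

With the four sample documents, this stricter query should return only the title that contains both terms.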

Elasticsearch supports many more query types; see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html

Partly based on https://blog.csdn.net/u013429010/article/details/81746179
