Key Points from the Official Elasticsearch Documentation

程序员文章站 2022-03-18 13:44:01

...

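normalizer: normalizes keyword values before they are indexed, for example by lowercasing them, so the inverted index stores the normalized token. Documents containing User and user therefore become equivalent: a search for user matches both, and an aggregation on USer counts both. The field value in _source stays as originally submitted, not the normalized token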
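The normalizer behaviour can be illustrated with a toy inverted index. This is only a sketch: TinyKeywordIndex and lowercase_normalizer are hypothetical names, and real normalizers are configured in the index settings rather than written by hand.

```python
from collections import defaultdict


def lowercase_normalizer(value: str) -> str:
    """Stand-in for a keyword normalizer that only lowercases."""
    return value.lower()


class TinyKeywordIndex:
    """Toy inverted index: normalized term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.source = {}

    def index(self, doc_id, value):
        # _source keeps the value exactly as submitted.
        self.source[doc_id] = value
        # The inverted index stores the normalized token.
        self.postings[lowercase_normalizer(value)].add(doc_id)

    def term_search(self, value):
        # The query term goes through the same normalizer before lookup.
        return sorted(self.postings.get(lowercase_normalizer(value), set()))


idx = TinyKeywordIndex()
idx.index(1, "User")
idx.index(2, "user")
matches = idx.term_search("USer")
```

Both documents match the search, while idx.source still holds "User" and "user" unchanged.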
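boost: multiplier applied to the relevance score; default 1.0
dynamic:
- true (default): new fields may be added at index time and are indexed;
- false: new fields remain in _source but are not indexed;
- strict: a new field causes the document to be rejected with an error
enabled: whether the field's contents are parsed and indexed; default true. Setting it to false suits some cases, such as object-type data that only needs to be returned from _source
ignore_above: keyword values longer than this limit are not indexed for that field; only values within ignore_above are indexed normally
ignore_malformed: ES normally coerces values, for example a string "10" sent to a field mapped as integer is converted. A value that cannot be converted causes the whole document to be rejected by default; with ignore_malformed: true the malformed field is skipped and the rest of the document is still indexed. It can be set per field, or index-wide with "index.mapping.ignore_malformed": true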
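A minimal sketch of what ignore_malformed changes, assuming a simplified mapping that only knows the integer type. index_document is a hypothetical helper, not an Elasticsearch API.

```python
def index_document(doc, mapping, ignore_malformed=False):
    """Return (indexed_fields, ignored_field_names) for one document."""
    indexed, ignored = {}, []
    for field, value in doc.items():
        if mapping.get(field) == "integer":
            try:
                indexed[field] = int(value)   # coercion, e.g. "10" -> 10
            except (TypeError, ValueError):
                if not ignore_malformed:
                    # Default behaviour: the whole document is rejected.
                    raise ValueError(f"malformed value for field {field!r}")
                ignored.append(field)         # skip only this field
        else:
            indexed[field] = value
    return indexed, ignored


mapping = {"age": "integer"}
ok = index_document({"age": "10", "name": "a"}, mapping)
lenient = index_document({"age": "ten", "name": "b"}, mapping, ignore_malformed=True)
```

With ignore_malformed left at False, the second document raises instead of being partially indexed.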
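null_value: value indexed in place of an explicit null, so that nulls can be searched; _source is not modified
store: by default Elasticsearch stores _source as a single stored element; when _source includes or excludes are specified, the full _source JSON is loaded and then parsed to extract the fields. When documents are large but only a few fields are read often, consider marking those fields as stored and fetching them with stored_fields, so that reading them does not require loading and parsing the whole _source, which can improve performance noticeably
analyzer: built from three components: character filters, a tokenizer, and token filters. There is exactly one tokenizer; the other two can appear zero or more times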
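The three-stage analyzer pipeline can be sketched in a few lines. The filters and tokenizer below are simplified stand-ins written for this example, not the built-in Elasticsearch components.

```python
import re


def html_strip(text):
    """Character filter: replace HTML tags with a space (simplified)."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split on whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Token filter: drop stop words."""
    return [t for t in tokens if t not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    # Zero or more character filters run on the raw text.
    for char_filter in char_filters:
        text = char_filter(text)
    # Exactly one tokenizer turns the text into tokens.
    tokens = tokenizer(text)
    # Zero or more token filters run on the token stream, in order.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<b>The</b> Quick fox", [html_strip],
                 whitespace_tokenizer, [lowercase, stop])
```

The order of token filters matters here: stop only removes "the" because lowercase ran first.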

discovery.seed_hosts: formerly discovery.zen.ping.unicast.hosts; lists the addresses of the cluster's master-eligible nodes. It is set in the yml configuration file, for example:

   192.168.1.10:9300
   192.168.1.11
   seeds.mydomain.com

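discovery.type: set to single-node to run a single-node cluster
cluster.initial_master_nodes: only takes effect when a brand-new cluster is bootstrapped, where it names the initial master-eligible nodes; it is ignored when a node joins an existing cluster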
```
   - master-a
   - master-b
   - master-c
```
discovery.seed_providers: file (formerly discovery.zen.hosts_provider): discovers nodes from a hosts file that lists the addresses of the cluster's nodes
File format:
```
 10.10.10.5
 10.10.10.6:9305
 10.10.10.5:10005
 [2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301
```
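cluster.auto_shrink_voting_configuration: default true. When true, the voting configuration shrinks automatically when a master-eligible node leaves the cluster
If the number of master-eligible nodes is even, ES leaves one of them out of the voting configuration so that an odd number of nodes takes part in elections
With 4 voting master-eligible nodes, any operation that needs a vote would require at least 3 nodes to agree;
if that 4-node cluster split into two 2-node halves, neither half would reach 3 and neither could operate
When a node leaves the cluster, the voting configuration shrinks by one, so only 2 votes are needed from then on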
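The quorum arithmetic behind these statements is a strict majority of the voting configuration. A small sketch, with quorum and can_elect as hypothetical helper names:

```python
def quorum(voting_nodes: int) -> int:
    """Smallest number of votes that is more than half of the voters."""
    return voting_nodes // 2 + 1


def can_elect(partition_voters: int, voting_nodes: int) -> bool:
    """Can a network partition holding this many voters still win a vote?"""
    return partition_voters >= quorum(voting_nodes)


# 4 voters split 2/2: neither side reaches the quorum of 3.
even_split_ok = can_elect(2, 4)

# 3 voters (one of the 4 left out): the side holding 2 voters still has a quorum.
odd_split_ok = can_elect(2, 3)
```

This is why an odd-sized voting configuration tolerates an even split of the nodes better than an even-sized one.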
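follower checks and leader checks: the master checks the health of each follower (follower checks), and each follower checks the health of the master (leader checks)
Development mode: when none of discovery.seed_providers, discovery.seed_hosts and cluster.initial_master_nodes is configured, the node runs in development mode by default
join request: the request a node sends to join an existing cluster
Removing master-eligible nodes: never remove half or more of the master-eligible nodes at once. Remove them one at a time so the cluster has time to react and shrink the voting configuration; that is how a cluster scales down safely, for example from 7 master-eligible nodes to 3
Cluster state synchronization resembles Eureka: diffs are published rather than the full state. When a newly joined node receives a diff it cannot apply because it lacks the previous state, the master sends it the full state
Cluster state publication: the master first sends the updated state to all nodes, and each node replies with an ack without applying it yet. Once the master has acks from more than half of the master-eligible nodes, it sends a commit message to all nodes; each node then applies the cluster state and sends a second ack to the master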
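The two-phase publication can be sketched as follows. This is a simplification under stated assumptions: nodes are plain dicts with a hypothetical shape, every node here is master-eligible, and an unreachable node simply never acks.

```python
def publish_cluster_state(nodes, new_state):
    """Two-phase publication; returns True if the new state was committed."""
    # Phase 1: the master sends the state; reachable nodes ack without applying.
    acked = [n for n in nodes if n["reachable"]]
    eligible_total = sum(1 for n in nodes if n["master_eligible"])
    eligible_acks = sum(1 for n in acked if n["master_eligible"])

    # Commit only with acks from more than half of the master-eligible nodes.
    if eligible_acks < eligible_total // 2 + 1:
        return False

    # Phase 2: commit; every node that acked now applies the state.
    for n in acked:
        n["state"] = new_state
    return True


def make_nodes(reachable_flags):
    """Build master-eligible test nodes that all start on state 'v1'."""
    return [
        {"name": f"node-{i}", "master_eligible": True, "reachable": r, "state": "v1"}
        for i, r in enumerate(reachable_flags)
    ]


majority = make_nodes([True, True, False])
committed = publish_cluster_state(majority, "v2")

minority = make_nodes([True, False, False])
rejected = publish_cluster_state(minority, "v2")
```

In the first run two of three nodes ack, so the state is committed and applied by those two; in the second run nothing is applied.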
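cluster.publish.timeout: time limit for publishing a cluster state update; default 30s
cluster.follower_lag.timeout: default 90s. When a node has not applied a cluster state, the master gives it 90s to catch up, that is, to send back its ack within 90s; otherwise the master removes it from the cluster
cluster.routing.allocation.disk.watermark.low: default 85%. Once disk usage reaches this threshold, no new shards are allocated to the node
cluster.routing.allocation.disk.watermark.high: default 90%. Once disk usage reaches this threshold, ES tries to relocate shards from this node to other nodes
cluster.routing.allocation.disk.watermark.flood_stage: default 95%. Once disk usage reaches this threshold, indices with a shard on the node are made read-only, so writes and updates are rejected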
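The three watermarks form a simple ladder over the disk usage percentage. A sketch of that classification, using the default thresholds quoted above; disk_stage is a hypothetical helper and the stage names are chosen for this example:

```python
def disk_stage(used_percent: float, low=85.0, high=90.0, flood=95.0) -> str:
    """Map a node's disk usage to the watermark stage it has reached."""
    if used_percent >= flood:
        return "flood_stage"   # indices on the node become read-only
    if used_percent >= high:
        return "high"          # shards are relocated away from the node
    if used_percent >= low:
        return "low"           # no new shards are allocated to the node
    return "ok"


stages = [disk_stage(p) for p in (80, 86, 92, 96)]
```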

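cluster.info.update.interval: how often disk usage is checked; default 30s

The watermarks can also be given as absolute sizes, which are then read as the minimum free space rather than the space used: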

  PUT _cluster/settings
  {
    "transient": {
      "cluster.routing.allocation.disk.watermark.low": "100gb",
      "cluster.routing.allocation.disk.watermark.high": "50gb",
      "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
      "cluster.info.update.interval": "1m"
    }
  }

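cluster.max_shards_per_node: default 1000. The cluster accepts at most this many shards per data node (the limit is applied cluster-wide as this value multiplied by the number of data nodes), counting both primaries and replicas; shards of closed indices are not counted
To avoid shards being reallocated unnecessarily while a cluster restarts, the following settings control when shard recovery starts: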
```
   gateway.expected_nodes
   gateway.expected_master_nodes
   gateway.expected_data_nodes
   gateway.recover_after_time
```
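indices.breaker.fielddata.limit: 40% of the JVM heap. Fielddata may not exceed this threshold; a request that would load fielddata beyond it is rejected
indices.fielddata.cache.size: maximum size of the fielddata cache, as a percentage of the heap or an absolute size; older entries are evicted once it is reached
indices.breaker.request.limit: memory used by requests may not exceed 60% of the JVM heap
indices.queries.cache.size: query cache size; default 10% of the JVM heap, or an absolute size
indices.memory.index_buffer_size: memory for newly indexed docs that have not yet been written to a segment; default 10% of the JVM heap, shared by all shards on the node
indices.memory.min_index_buffer_size: default 48mb
indices.memory.max_index_buffer_size: unset by default (no upper bound)
indices.recovery.max_bytes_per_sec: limit on inbound and outbound shard recovery traffic, in bytes per second; default 40mb
indices.query.bool.max_clause_count: maximum number of clauses in a search request; default 1024
transport.port: default range 9300-9400
http.port: default range 9200-9300
discovery.seed_hosts: default ["127.0.0.1", "[::1]"]
path.data: the ES data directory; default $ES_HOME/data
Remote Cluster: remote clusters, used for Cross Cluster Search
Routing to a shard: shard_num = (hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards. routing_partition_size is the index-level setting index.routing_partition_size; it must be greater than 1 and less than the number of primary shards, so that documents sharing a routing value spread over a subset of the shards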
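The routing formula can be tried out directly. This sketch assumes zlib.crc32 as a stand-in hash (Elasticsearch uses its own Murmur3-based hash, so real shard numbers will differ); shard_num and candidate_shards are hypothetical helper names.

```python
import zlib


def h(value: str) -> int:
    """Stand-in hash function; NOT the hash Elasticsearch uses."""
    return zlib.crc32(value.encode("utf-8"))


def shard_num(routing, doc_id, routing_partition_size, num_primary_shards):
    """shard_num = (hash(_routing) + hash(_id) % partition_size) % shards."""
    return (h(routing) + h(doc_id) % routing_partition_size) % num_primary_shards


def candidate_shards(routing, routing_partition_size, num_primary_shards):
    """All shards that documents with this routing value can land on."""
    return {
        (h(routing) + offset) % num_primary_shards
        for offset in range(routing_partition_size)
    }


candidates = candidate_shards("user-42", 3, 8)
placements = {shard_num("user-42", f"doc-{i}", 3, 8) for i in range(100)}
```

Every document routed with "user-42" lands on one of the 3 candidate shards, so a search with that routing value only has to visit those shards instead of all 8.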
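"index.unassigned.node_left.delayed_timeout": "5m": wait 5 minutes after a node leaves the cluster before reallocating its now-unassigned shards. The default is 1m; it can be set to 0 so that shards are reallocated immediately when a node is shut down deliberately
index.translog.sync_interval: when index.translog.durability = async, the translog is persisted in the background at this interval; default 5s
index.translog.durability:
- request (default): the translog is persisted on every index, delete and update request;
- async: the translog is persisted asynchronously, once every index.translog.sync_interval
index.translog.flush_threshold_size: translog size at which a commit is performed, that is, the operations in the translog are persisted in Lucene, so that the translog does not grow so large that recovery takes too long; default 512mb
index.translog.retention.size: maximum total size of translog files to keep; default 512mb
index.translog.retention.age: maximum time a translog file is kept; default 12h
Index sorting: sorts the documents inside each Lucene segment by the specified fields. A query that asks for the top N by the same fields can then terminate early, which can greatly speed it up. This does not apply to sorting by relevance score, because scores are computed at query time
index.sort.field: Only boolean, numeric, date and keyword fields with doc_values are allowed here.
index.sort.order: asc or desc, the sort order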
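Why index sorting allows early termination can be seen by counting how many documents a top-N query has to visit. The sketch below models one segment as a Python list; both functions are hypothetical and ignore everything else a real search does.

```python
def top_n_sorted_segment(segment_docs, n):
    """Segment already sorted by the requested field: stop after n docs."""
    if n <= 0:
        return [], 0
    hits, visited = [], 0
    for doc in segment_docs:
        visited += 1
        hits.append(doc)
        if len(hits) == n:
            break              # early termination
    return hits, visited


def top_n_unsorted_segment(segment_docs, n, key):
    """Unsorted segment: every document has to be visited and compared."""
    visited = len(segment_docs)
    return sorted(segment_docs, key=key)[:n], visited


# Segment sorted by ts descending, as index.sort.order: desc would produce.
sorted_docs = [{"ts": t} for t in range(1000, 0, -1)]
unsorted_docs = list(reversed(sorted_docs))

fast_hits, fast_visited = top_n_sorted_segment(sorted_docs, 3)
slow_hits, slow_visited = top_n_unsorted_segment(
    unsorted_docs, 3, key=lambda d: -d["ts"])
```

Both calls return the same three documents, but the sorted segment visits 3 documents while the unsorted one visits all 1000.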
track_total_hits: whether a search should track the total number of matching documents

index lifecycle management (ILM): create a lifecycle policy

  PUT _ilm/policy/datastream_policy   
  {
      "policy":{
          "phases":{
              "hot":{
                  "actions":{
                      "rollover":{
                          "max_size":"50GB",
                          "max_age":"30d"
                      }
                  }
              },
              "warm":{
                  "min_age":"1d",   // once the index is more than 1 day old, it enters the warm phase
                  "actions":{
                      "allocate":{
                          "number_of_replicas":1
                      }
                  }
              },
              "delete":{
                  "min_age":"90d",   // once the index is more than 90 days old, it enters the delete phase and is deleted
                  "actions":{
                      "delete":{
                      }
                  }
              }
          }
      }
  }

Create an index template that applies the lifecycle policy:

  PUT _template/datastream_template
  {
    "index_patterns": ["datastream-*"],
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "index.lifecycle.name": "datastream_policy",      // which lifecycle policy to use
      "index.lifecycle.rollover_alias": "datastream"    // alias shared by all rolled-over indices
    }
  }

Lifecycle phases: a policy does not need to define every phase
- hot The index is actively being written to
- warm The index is generally not being written to, but is still queried
- cold The index is no longer being updated and is seldom queried. The information still needs to be searchable, but it’s okay if those queries are slower.
- delete The index is no longer needed and can safely be deleted

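indices.lifecycle.poll_interval: how often ILM checks indices against their lifecycle policy
Actions that can run in each phase:
- Allocate (warm, cold): changes the number of replicas and where shards are allocated; it accepts the following parameters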
  - number_of_replicas：The number of replicas to assign to the index
  - include：assigns an index to nodes having at least one of the attributes
  - exclude：assigns an index to nodes having none of the attributes
  - require：assigns an index to nodes having all of the attributes
- delete: deletes the index
```
  "delete" : {
   }
```
- forcemerge: force-merges the index's segments
```
  "forcemerge" : {
  	   "max_num_segments": 1
  }
```
- freeze: freezes the index
```
  "freeze" : { 
  }
```
- readonly: makes the index read-only
```
  "readonly" : { 
  }
```
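- Rollover: rolls over to a new index created from the template when any configured condition is met; at least one of max_size, max_docs and max_age must be set. index.lifecycle.rollover_alias names the alias used for the rolled-over indices
- shrink: shrinks the index to fewer primary shards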
```
  "shrink" : {
  	"number_of_shards": 1
  } 
```
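The Rollover conditions in the list above combine with a logical OR: the index rolls over as soon as any configured threshold is reached. A sketch of that check, with should_rollover and its parameter names invented for this example:

```python
def should_rollover(size_gb, age_days, doc_count,
                    max_size_gb=None, max_age_days=None, max_docs=None):
    """Return True when any configured rollover condition is met."""
    checks = []
    if max_size_gb is not None:
        checks.append(size_gb >= max_size_gb)
    if max_age_days is not None:
        checks.append(age_days >= max_age_days)
    if max_docs is not None:
        checks.append(doc_count >= max_docs)
    if not checks:
        # A rollover action without any condition is not meaningful.
        raise ValueError("at least one rollover condition is required")
    return any(checks)


# Thresholds from the policy above: max_size 50GB, max_age 30d.
old_but_small = should_rollover(10, 31, 5, max_size_gb=50, max_age_days=30)
young_and_small = should_rollover(10, 5, 5, max_size_gb=50, max_age_days=30)
```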

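freeze index: a frozen index keeps none of its data structures in memory; they stay on disk, are built for each search and discarded afterwards, so the index takes up no memory. It is advisable to allocate frozen indices to dedicated nodes, so that their high search latency and short-lived memory spikes do not affect other low-latency workloads, and to merge each shard down to a single segment so it loads faster at search time: POST /twitter/_forcemerge?max_num_segments=1
index.lifecycle.indexing_complete: when true, the index is not rolled over again; useful for switching to a new index template when the old one was wrong
Updating a lifecycle policy: new indices pick up the latest version of the policy. Updating a policy does not affect the phase an index is currently executing, but newly rolled-over indices, and existing indices when they enter their next phase, use the updated policy. If a lifecycle step fails, the index moves to the error step; fix the policy, then retry with POST /myindex/_ilm/retry
search_throttled: the thread pool used for searches on frozen indices. Its default size is 1, so only one frozen index is loaded and searched at a time
Steps to configure a Watch (monitoring):
- Schedule: runs the query periodically, then checks whether the condition is met
- Query: the query whose result is the input to the condition; it supports the full ES query language, including aggregations
- Condition: decides whether the actions run; it can be as simple as always true, or a script
- Actions: one or more actions, such as sending an email, pushing data to a third-party system, or indexing the query results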
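The four steps above chain together as query, then condition, then actions, with the schedule deciding when the chain runs. A minimal sketch of one firing, where run_watch and the three callables are hypothetical stand-ins and the scheduler is left to the caller:

```python
def run_watch(search, condition, actions):
    """Run one scheduled firing of a watch and return the action results."""
    payload = search()                  # Query: produce the input payload
    if not condition(payload):          # Condition: decide whether to act
        return []
    return [action(payload) for action in actions]   # Actions


def search_errors():
    # Hypothetical query result: a count of matching error log lines.
    return {"hits": {"total": 7}}


def more_than_five(payload):
    return payload["hits"]["total"] > 5


def email_action(payload):
    return f"email: {payload['hits']['total']} errors"


results = run_watch(search_errors, more_than_five, [email_action])
quiet = run_watch(lambda: {"hits": {"total": 1}}, more_than_five, [email_action])
```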
