Elasticsearch Upgrade: 1.7 to 7.1

Version: 1.7 → 7.1

Relevant Github Repos/Branches:

  • Removal of Tosca/Figaro user interfaces in favor of hysds_ui

Big Changes

  • PLEASE LOOK AT AND USE THE NEW ELASTICSEARCH UTILITY CLASS: (SOURCE CODE)

  • Only 1 type allowed in each index: _doc
  • Need to manually enable full-text search across all fields (see the copy_to workaround below)
  • Removal of the filtered query (deprecated since ES 5.0)
  • The string type has been split into text and keyword
  • fielddata: true in the mapping allows sorting on a text field (but we'll sort on the keyword sub-field instead): Documentation
  • Support for a z coordinate in geo_shapes: Documentation
    • it won't affect searches but adds more flexibility in location data (see the sketch below)
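    • A minimal sketch of indexing a z value (the index and field names are illustrative; this assumes location is mapped as geo_shape, in which case ES 7 keeps the z value in _source but ignores it when searching, per the default ignore_z_value: true):

      curl -X PUT "http://localhost:9200/example_index/_doc/1" -H 'Content-Type: application/json' -d'
      {
        "location": {
          "type": "point",
          "coordinates": [-122.27, 37.87, 250.0]
        }
      }'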
  • _default_ mapping deprecated in ES 6.0.0 (Link)
  • Changes in the geo coordinates query

    • Note: {"type": "geo_shape","tree": "quadtree","tree_levels": 26} makes uploading documents slow, specifically "tree_levels”: 2

    • {
        "query": {
          "bool": {
            "filter": {
              "geo_shape": {
                "location": {
                  "shape": {
                    "type": "polygon",
                    "coordinates": [[<coordinates>]]
                  },
                  "relation": "within"
                }
              }
            }
          }
        }
      }
  • Changes in Percolator (see the Percolator section below)


  • Removal of the _all field ( _all: { "enabled": true } ) from index mappings, so we can no longer search across all fields by default

    • the workaround is adding copy_to to the field mappings, especially in the dynamic templates

    • Link to copy_to documentation

    • Does not work with multi-fields
      • "random_field_name": {
          "type": "keyword",
          "ignore_above": 256,
          "copy_to": "all_text_fields", # DOES WORK
          "fields": {
            "keyword": {
              "type": "text"
              "copy_to": "all_text_fields" # DOES NOT WORK
            }
          }
        }
    • Proper mapping with text fields

      • "random_field_name": {
          "type": "text",
          "copy_to": "all_text_fields"
          "fields": {
            "keyword": { # WE USE 'raw' instead of 'keyword' in our own indices
              "type": "keyword" # THIS IS NEEDED FOR AGGREGATION ON THE FACETS FOR THE UI
              "ignore_above": 256
            }
          }
        }
    • Need to add a mapping for the copy_to target field
      • "all_text_fields": {
          "type": "text"
        }
  • General changes to the mapping

    • created an example mapping called grq_v1.1_s1-iw_slc

    • copied example data into the new ES index, using the built-in dynamic mapping to generate the initial mapping

    • mapping changes:

      • metadata.context to {"type": "object", "enabled": false}

      • properties.location to {"type": "geo_shape", "tree": "quadtree"}

      • use type keyword for fields we sort or aggregate on (e.g. through msearch); otherwise ES returns the error below (a sort sketch follows this list):

        • "reason": "Fielddata is disabled on text fields by default. ... Alternatively use a keyword field instead."
  • Changes to query_string

    • removal of support for escaping literal double quotes in query_string
    • the old query_string from 1.7 below would return S1B_IW_SLC__1SDV_20170812T010949_20170812T011016_006900_00C25E_B16D
      • {
          "query": {
            "query_string": {
              "query": "\"__1SDV_20170812T010949_20170812T011016_006900_00C25E_B16\"",
              "default_operator": "OR"
            }
          }
        }
    • the new query_string returns the equivalent document, but requires a wildcard * at the beginning and end of the string
      • {
          "query": {
            "query_string": {
              "default_field": "all_text_fields",
              "query": "*__1SDV_20170812T010949_20170812T011016_006900_00C25E_B16*",
              "default_operator": "OR"
            }
          }
        }
    • date searches do not appear to have changed much
      • {
          "query": {
            "query_string": {
              "query": "starttime: [2019-01-01 TO 2019-01-31]",
              "default_operator": "OR"
            }
          }
        }
    • can combine different fields as well
      • {
          "query": {
            "query_string": {
              "fields": ["all_text_fields", "all_date_fields"],
              "query": "[2019-01-01 TO 2019-01-31] AND *__1SDV_20190109T020750_20190109T020817_014411*",
              "default_operator": "OR"
            }
          }
        }


  • Removal of search_type=scan

    • https://www.elastic.co/guide/en/elasticsearch/reference/5.5/breaking_50_search_changes.html
    • NOTE: must clear the _scroll_id after using the scroll API to pull data (see the clear-scroll sketch after the output below)
      • Will return an error if _scroll_ids are not cleared
      • query_phase_execution_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [11000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.
    • Requires changes in our HySDS code, wherever it uses search_type=scan

      curl -X POST "http://localhost:9200/hysds_ios/_search?search_type=scan&scroll=10m&size=100"
      {
        "error": {
          "root_cause": [
            {
              "type": "illegal_argument_exception",
              "reason": "No search type for [scan]"
            }
          ],
          "type": "illegal_argument_exception",
          "reason": "No search type for [scan]"
        },
        "status": 400
      }
      
      # removing search_type=scan from the endpoint fixes this problem
      curl -X POST "http://100.64.134.55:9200/user_rules/_search?scroll=10m&size=100"
      {
        "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAAEWMUpVeFNzVXpTVktlUzFPc0NKa1dndw==",
        "took": 34,
        "timed_out": false,
        "_shards": {
          "total": 1,
          "successful": 1,
          "skipped": 0,
          "failed": 0
        },
        "hits": {
          "total": {
            "value": 0,
            "relation": "eq"
          },
          "max_score": null,
          "hits": []
        }
      }
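    • A sketch of clearing scroll contexts once paging is done, using the _scroll_id from the response above (host is illustrative):

      # clear a specific scroll context
      curl -X DELETE "http://localhost:9200/_search/scroll" -H 'Content-Type: application/json' -d'
      {
        "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAAEWMUpVeFNzVXpTVktlUzFPc0NKa1dndw=="
      }'

      # or clear all open scroll contexts at once
      curl -X DELETE "http://localhost:9200/_search/scroll/_all"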
  • Removal of filtered: Link and Guide

    • deprecated in version 5.x; move all logic into query and bool clauses
    • and / or / not changed to must, should, and must_not
      • if using should, you will need to add minimum_should_match: 1 (see the sketch below)
      • Link
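      • A minimal should sketch with minimum_should_match (the tag values are illustrative):

        {
          "query": {
            "bool": {
              "should": [
                { "match": { "tags": "ISL" } },
                { "match": { "tags": "SLC" } }
              ],
              "minimum_should_match": 1
            }
          }
        }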

    • # from this:
      {
        "filtered": {
          "filter": {
            "and": [
              {
                "match": {
                  "tags": "ISL"
                }
              },
              {
                "range": {
                  "metadata.ProductReceivedTime": {"gte": "2020-03-24T00:00:00.000000Z"}
                }
              },
              {
                "range": {
                  "metadata.ProductReceivedTime": {"lte": "2020-03-24T23:59:59.999999Z"}
                }
              }
            ]
          }
        }
      }
      
      # change to this:
      {
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "tags": "ISL"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "metadata.ProductReceivedTime": {"gte": "2020-03-24T00:00:00.000000Z"}
                }
              },
              {
                "range": {
                  "metadata.ProductReceivedTime": {"lte": "2020-03-24T23:59:59.999999Z"}
                }
              }
            ]
          }
        }
      }

Changes to Logstash

  • Mozart streams data to elasticsearch through Logstash
  • Changes to logstash 7.1

    • long epoch timestamps are read as integers instead of epoch_millis dates
      • need to convert them to strings, split on the decimal point, and drop the decimal places (see the filter block in the config below)
    • removal of flush_size (was originally set to 1)
  • https://github.com/hysds/hysds/blob/develop-es7/configs/logstash/indexer.conf.mozart
  • input {
      redis {
        host => "{{ MOZART_REDIS_PVT_IP }}"
        {% if MOZART_REDIS_PASSWORD != "" %}password => "{{ MOZART_REDIS_PASSWORD }}"{% endif %}
        # these settings should match the output of the agent
        data_type => "list"
        key => "logstash"
    
        # We use the 'msgpack' codec here because we expect to read
        # msgpack events from redis.
        codec => msgpack
      }
    }
    
    filter {
      if [resource] in ["worker", "task"] {
        mutate {
          convert => {
            "[event][timestamp]" => "string"
            "[event][local_received]" => "string"
          }
    
          split => ["[event][timestamp]", "."]
          split => ["[event][local_received]", "."]
    
          add_field => [ "[event][timestamp_new]" , "%{[event][timestamp][0]}" ]
          add_field => [ "[event][local_received_new]" , "%{[event][local_received][0]}" ]
    
          remove_field => ["[event][timestamp]", "[event][local_received]"]
        }
    
        mutate {
          rename => { "[event][timestamp_new]" => "timestamp" }
          rename => { "[event][local_received_new]" => "local_received" }
        }
      }
    }
    
    output {
      #stdout { codec => rubydebug }
    
      if [resource] == "job" {
        elasticsearch {
          hosts => ["{{ MOZART_ES_PVT_IP }}:9200"]
          index => "job_status-current"
          document_id => "%{payload_id}"
          template => "{{ OPS_HOME }}/mozart/etc/job_status.template"
          template_name => "job_status"
        }
      } else if [resource] == "worker" {
        elasticsearch {
          hosts => ["{{ MOZART_ES_PVT_IP }}:9200"]
          index => "worker_status-current"
          document_id => "%{celery_hostname}"
          template => "{{ OPS_HOME }}/mozart/etc/worker_status.template"
          template_name => "worker_status"
        }
      } else if [resource] == "task" {
        elasticsearch {
          hosts => ["{{ MOZART_ES_PVT_IP }}:9200"]
          index => "task_status-current"
          document_id => "%{uuid}"
          template => "{{ OPS_HOME }}/mozart/etc/task_status.template"
          template_name => "task_status"
        }
      } else if [resource] == "event" {
        elasticsearch {
          hosts => ["{{ MOZART_ES_PVT_IP }}:9200"]
          index => "event_status-current"
          document_id => "%{uuid}"
          template => "{{ OPS_HOME }}/mozart/etc/event_status.template"
          template_name => "event_status"
        }
      } else {}
    }

Running Elasticsearch 7 on an EC2 instance

In order to bind Elasticsearch to all interfaces (0.0.0.0), we need to edit the config/elasticsearch.yml file:

network.host: 0.0.0.0
cluster.name: grq_cluster
node.name: ESNODE_CYR
node.master: true
node.data: true
transport.host: localhost
transport.tcp.port: 9300
http.port: 9200
discovery.zen.minimum_master_nodes: 2

# allows the UI to talk to elasticsearch (in production we would put the actual hostname of the UI)
http.cors.enabled : true
http.cors.allow-origin: "*"
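
After restarting Elasticsearch, a quick sanity check that the node is reachable on the exposed port (host is illustrative):

curl "http://localhost:9200/_cluster/health?pretty"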


Running Kibana on an EC2 instance

Install Kibana from the command line:

curl -O https://artifacts.elastic.co/downloads/kibana/kibana-7.1.1-linux-x86_64.tar.gz
tar -xzf kibana-7.1.1-linux-x86_64.tar.gz
cd kibana-7.1.1-linux-x86_64/


Edit the config/kibana.yml file to bind the server to 0.0.0.0:

server.host: 0.0.0.0
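
Then start Kibana from the extracted directory; it reads config/kibana.yml on startup and expects Elasticsearch at http://localhost:9200 unless elasticsearch.hosts is set:

./bin/kibana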


Index Template

So that every index created automatically follows this template for its settings and mapping. Note: the standard token filter was removed in ES 7.0, so it is dropped from the analyzer below.

{
  "order": 0,
  "index_patterns": [
    "{{ prefix }}_*"
  ],
  "settings": {
    "index.refresh_interval": "5s",
    "analysis": {
      "analyzer": {
        "default": {
          "filter": [
            "standard",
            "lowercase",
            "word_delimiter"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "integers": {
          "match_mapping_type": "long",
          "mapping": {
            "type": "integer"
          }
        }
      },
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "norms": false,
            "type": "text",
            "copy_to": "all_text_fields",
            "fields": {
              "raw": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "match": "*"
        }
      }
    ],
    "properties": {
      "browse_urls": {
        "type": "text",
        "copy_to": "all_text_fields"
      },
      "urls": {
        "type": "text",
        "copy_to": "all_text_fields"
      },
      "location": {
        "tree": "quadtree",
        "type": "geo_shape"
      },
      "center": {
        "tree": "quadtree",
        "type": "geo_shape"
      },
      "starttime": {
        "type": "date"
      },
      "endtime": {
        "type": "date"
      },
      "creation_timestamp": {
        "type": "date"
      },
      "metadata": {
        "properties": {
          "context": {
            "type": "object",
            "enabled": false
          }
        }
      },
      "prov": {
        "properties": {
          "wasDerivedFrom": {
            "type": "keyword"
          },
          "wasGeneratedBy": {
            "type": "keyword"
          }
        }
      },
      "all_text_fields": {
        "type": "text"
      }
    }
  },
  "aliases": {
    "{{ alias }}": {}
  }
}
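
The template would be installed through the _template endpoint once the {{ prefix }} and {{ alias }} placeholders are rendered; a minimal sketch (the template and file names are illustrative):

curl -X PUT "http://localhost:9200/_template/grq_template" -H 'Content-Type: application/json' -d @grq_template.json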


Percolator

Percolator needs to be compatible with ES 7.1 (not applicable because HySDS uses its own version of percolator)

User Rules (Documentation for user rules triggering)

  • mapping added on the mozart server at /home/ops/mozart/ops/tosca/configs/user_rules_dataset.mapping

  • Python code to create the user_rules index: Link
  • Mapping template for the user_rules index: Link
  • # PUT user_rules
    {
      "mappings": {
        "properties": {
          "creation_time": {
            "type": "date"
          },
          "enabled": {
            "type": "boolean",
            "null_value": true
          },
          "job_type": {
            "type": "keyword"
          },
          "kwargs": {
            "type": "keyword"
          },
          "modification_time": {
            "type": "date"
          },
          "modified_time": {
            "type": "date"
          },
          "passthru_query": {
            "type": "boolean"
          },
          "priority": {
            "type": "long"
          },
          "query": {
            "type": "object",
            "enabled": false
          },
          "query_all": {
            "type": "boolean"
          },
          "query_string": {
            "type": "text"
          },
          "queue": {
            "type": "text"
          },
          "rule_name": {
            "type": "keyword"
          },
          "username": {
            "type": "keyword"
          },
          "workflow": {
            "type": "keyword"
          }
        }
      }
    }
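  • A sketch of creating the index with the mapping above from the file on the mozart server (host is illustrative):

    curl -X PUT "http://localhost:9200/user_rules" -H 'Content-Type: application/json' -d @user_rules_dataset.mapping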

hysds_ios Index

GitHub link to template.json: Link

  • Python code to create hysds_ios index template: Link
  • Follow the HySDS and Job-Spec documentation for the Jenkins build: Link
  • {
      "order": 0,
      "template": "{{ index }}",
      "settings": {
        "index.refresh_interval": "5s",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "standard",
                "lowercase",
                "word_delimiter"
              ],
              "tokenizer": "keyword"
            }
          }
        }
      },
      "mappings": {
        "dynamic_templates": [
          {
            "integers": {
              "match_mapping_type": "long",
              "mapping": {
                "type": "integer"
              }
            }
          },
          {
            "strings": {
              "match_mapping_type": "string",
              "mapping": {
                "norms": false,
                "type": "text",
                "copy_to": "all_text_fields",
                "fields": {
                  "raw": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "match": "*"
            }
          }
        ],
        "properties": {
          "_timestamp": {
            "type": "date",
            "store": true
          }
        }
      }
    }
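  • As with the GRQ index template above, this template would be installed through the _template endpoint once the {{ index }} placeholder is rendered (the template and file names are illustrative):

    curl -X PUT "http://localhost:9200/_template/hysds_ios" -H 'Content-Type: application/json' -d @template.json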



Note: JPL employees can also get answers to HySDS questions at Stack Overflow Enterprise: