Build confidence with Elasticsearch

Nidhi Vichare
8 minute read
November 22, 2020
Elasticsearch
Search Analytics
Big Data
Data Technologies
Data Engineering
Data Warehouse
Databases
Usage-driven design

Which Data Technology is the right choice for your company?


What is Elasticsearch? Elasticsearch is a highly scalable, open-source, full-text search and analytics engine. It lets you store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine that powers applications with complex search features and requirements.

Elasticsearch

Elasticsearch Deployment Architecture

What is a Mapping?

  • Elasticsearch will happily index any document without knowing its details (number of fields, their data types, etc.). Behind the scenes, however, Elasticsearch assigns data types to your fields in a mapping, which you can retrieve as shown below
  • A mapping is a schema definition that contains:
    • names of fields
    • data types of fields
    • how the field should be indexed and stored by Lucene
  • Mappings map your complex JSON documents into the simple flat documents that Lucene expects
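
You can ask Elasticsearch for the mapping it inferred (or the one you defined explicitly). A minimal sketch, assuming the my_logs index created in the example further below:

GET my_logs/_mapping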

What are Analyzers?

🔗 Read my Blog about Text Analyzers here

  • Text analysis is done by an analyzer
  • By default, Elasticsearch applies the standard analyzer
  • There are many other analyzers, including:
    • whitespace, stop, pattern, language-specific analyzers

🔗 Read more about Analyzers here

  • The built-in analyzers work well for many use cases, but you may need to define your own custom analyzers
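
To see what an analyzer emits for a given piece of text, run the text through the _analyze API. A minimal sketch using the built-in standard analyzer (the sample text is arbitrary):

GET _analyze
{
   "analyzer":"standard",
   "text":"The QUICK Brown Foxes!"
}

The response lists the tokens that would be indexed: lowercased terms with punctuation stripped.
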
/// Optimized Mapping

PUT my_logs
{
   "mappings":{
      "properties":{
         "message":{
            "type":"text"            // message is a text field: use it to search for individual words
         },
         "http_version":{
            "type":"keyword"         // http_version is a keyword field: use it for exact searches and aggregations
         },
         "country_name":{
            "type":"text",
            "fields":{
               "keyword":{
                  "type":"keyword",  // country_name is indexed twice (text and keyword), so there is full flexibility for that field
                  "ignore_above":256
               }
            }
         }
      }
   }
}
// Dynamic Mapping: the mapping Elasticsearch generated on its own, as returned by GET blogs_temp/_mapping

{
   "blogs_temp":{
      "mappings":{
         "properties":{
            "@timestamp":{
               "type":"date"
            },
            "author":{
               "type":"text",
               "fields":{
                  "keyword":{
                     "type":"keyword",
                     "ignore_above":256
                  }
               }
            },
            "category":{
               "type":"text",
               "fields":{
                  "keyword":{
                     "type":"keyword",
                     "ignore_above":256
                  }
               }
            }
         }
      }
   }
}

PUT test2
{
   "mappings":{
      "dynamic_templates":[
         {
            "my_string_fields":{
               "match_mapping_type":"string",
               "mapping":{
                  "type":"keyword"
               }
            }
         }
      ]
   }
}

Mapping parameters allow you to influence how Elasticsearch will index the fields in your documents

  • Dynamic templates make it easier to set up your own mappings by defining defaults for fields, based on their JSON type, name or path

Use Case for Dynamic Templates:

  • Suppose you have documents with a large number of fields

  • Documents have dynamic field names not known at the time of your mapping definition

  • Using dynamic templates, you can define a field’s mapping based on:

    • the field's data type,
    • the name of the field, or
    • the path to the field
  • Lucene builds multiple data structures out of your documents: inverted indices and doc values

    • The inverted index makes searching fast
    • Doc values allow you to aggregate and sort on values
    • You can disable the inverted index or doc values for individual fields in the mapping to optimize Elasticsearch, as shown in the sketch below
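
A minimal sketch of both switches (the index and field names here are hypothetical):

PUT my_logs2
{
   "mappings":{
      "properties":{
         "session_id":{
            "type":"keyword",
            "doc_values":false   // no doc values: saves disk, but the field cannot be aggregated or sorted on
         },
         "raw_payload":{
            "type":"keyword",
            "index":false        // no inverted index: the field is not searchable, but doc values still allow aggregations
         }
      }
   }
}
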
/// By default, nested inner objects are flattened into the parent document, which can mix values across inner objects; the nested bucket aggregation solves this by aggregating inner documents independently of each other
GET blogs_example/_search
{
   "size":0,
   "aggs":{
      "nested_authors":{
         "nested":{
            "path":"authors"
         },
         "aggs":{
            "companies":{
               "terms":{
                  "field":"authors.company.keyword"
               },
               "aggs":{
                  "authors":{
                     "terms":{
                        "field":"authors.name.keyword"
                     }
                  }
               }
            }
         }
      }
   }
}

Salient Features

  • Data Streams in 7.10

    • Create a data stream manually using the create data stream API. The stream's name must match one of your index template's index patterns.
     PUT /_data_stream/my-data-stream-alt
    
  • Index templates

    An index template is a way to tell Elasticsearch how to configure an index when it is created. For data streams, the index template configures the stream's backing indices as they are created. Templates are configured prior to index creation; when an index is then created, either manually or by indexing a document, the template settings are used as a basis for creating the index.

    There are two types of templates: index templates and component templates. Component templates are reusable building blocks that configure mappings, settings, and aliases. You use component templates to construct index templates; they are not applied directly to a set of indices. Index templates can contain a collection of component templates, as well as directly specify settings, mappings, and aliases.
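
    A minimal sketch of the two template types working together (template names, pattern, and settings are hypothetical):

PUT _component_template/my_settings
{
   "template":{
      "settings":{
         "number_of_shards":1
      }
   }
}

PUT _index_template/my_template
{
   "index_patterns":["my-data-stream*"],
   "data_stream":{},
   "composed_of":["my_settings"],
   "priority":500
}

    Any index or data stream whose name matches my-data-stream* will now pick up the composed settings at creation time.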

  • Example of Index template APIs

   Put index template
   Get index template
   Delete index template
   Put component template
   Get component template
   Delete component template
   Index template exists
   Simulate index
   Simulate template

  • Ingestion pipelines

    Use an ingest node to pre-process documents before the actual document indexing happens. The ingest node intercepts bulk and index requests, applies transformations, and then passes the documents back to the index or bulk APIs. To pre-process documents before indexing, define a pipeline that specifies a series of processors. Each processor transforms the document in some specific way. For example, a pipeline might have one processor that removes a field from the document, followed by another processor that renames a field. Configured pipelines are stored in the cluster state.

To use a pipeline, specify the pipeline parameter on an index or bulk request; this tells the ingest node which pipeline to apply.

For example: Create a pipeline

PUT _ingest/pipeline/my_pipeline_id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "new"
      }
    }
  ]
}

PUT my-index-00001/_doc/my-id?pipeline=my_pipeline_id
{
  "foo": "bar"
}
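
The response confirms the document was created: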

{
  "_index" : "my-index-00001",
  "_type" : "_doc",
  "_id" : "my-id",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
  • Elastic SQL

    Elasticsearch has the speed, scale, and flexibility your data needs, and it speaks SQL. Use traditional database syntax to unlock non-traditional performance, such as full-text search across petabytes of data with real-time results.

PUT /library/_bulk?refresh
{"index":{"_id": "Leviathan Wakes"}}
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"index":{"_id": "Hyperion"}}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
{"index":{"_id": "Dune"}}
{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604}

POST /_sql?format=txt
{
  "query": "SELECT * FROM library WHERE release_date < '2000-01-01'"
}

 author     |     name      |  page_count   | release_date
---------------+---------------+---------------+------------------------
Dan Simmons    |Hyperion       |482            |1989-05-26T00:00:00.000Z
Frank Herbert  |Dune           |604            |1965-06-01T00:00:00.000Z

  • Lucene 8 under the hood

  • Cross-cluster replication

    CCR is designed around an active-passive index model. An index in one Elasticsearch cluster can be configured to replicate changes from an index in another Elasticsearch cluster. The index that is replicating changes is termed the "follower index" and the index being replicated from is termed the "leader index". The follower index is passive in that it can serve read requests and searches but cannot accept direct writes; only the leader index is active for direct writes. Because CCR is managed at the index level, a cluster can contain both leader indices and follower indices, so you can solve some active-active use cases by replicating some indices in one direction and others in the opposite direction.
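
    A minimal sketch of configuring a follower index (cluster and index names are hypothetical; the remote cluster connection must already be set up):

PUT /follower-index/_ccr/follow
{
   "remote_cluster":"leader-cluster",
   "leader_index":"leader-index"
}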

  • ILM (Index Lifecycle Management)

    • Spin up a new index when an index reaches a certain size or number of documents
    • Create a new index each day, week, or month and archive previous ones
    • Delete stale indices to enforce data retention standards (a sample policy is sketched after this list)
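
A minimal sketch of an ILM policy implementing these rules (the policy name and thresholds are hypothetical): roll over in the hot phase at 50 GB or 30 days, then delete 90 days after rollover.

PUT _ilm/policy/my_policy
{
   "policy":{
      "phases":{
         "hot":{
            "actions":{
               "rollover":{
                  "max_size":"50gb",
                  "max_age":"30d"
               }
            }
         },
         "delete":{
            "min_age":"90d",
            "actions":{
               "delete":{}
            }
         }
      }
   }
}
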
  • High-level REST APIs: Elasticsearch exposes REST APIs that are used by the UI components and can be called directly to configure and access Elasticsearch features.

    • Index APIs: used to manage individual indices, index settings, aliases, mappings, and index templates.

      • Examples of Index Management APIs:

          Create index
          Delete index
          Get index
          Index exists
          Close index
          Open index
          Shrink index
          Split index
          Clone index
          Rollover index
          Freeze index
          Unfreeze index
          Resolve index
        
      • Examples of Mapping Management APIs:

         Put mapping
         Get mapping
         Get field mapping
         Type exists
        
    • Example of Alias Management

       Add index alias
       Delete index alias
       Get index alias
       Index alias exists
       Update index alias
    
  • Example of Index Setting

    Update index settings
    Get index settings
    Analyze
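
A minimal sketch tying a few of these APIs together (index and alias names are hypothetical):

// create an index with explicit settings
PUT my-index-00002
{
   "settings":{
      "number_of_replicas":1
   }
}

// point an alias at the new index
POST /_aliases
{
   "actions":[
      { "add":{ "index":"my-index-00002", "alias":"my-alias" } }
   ]
}

// read the settings back
GET my-index-00002/_settings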
    

Spark and ES Integration code sample


/// https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

// Document shape for the index: one row per album
case class AlbumIndex(artist: String, yearOfRelease: Int, albumName: String)

object WriteToElasticSearch {

  def main(args: Array[String]): Unit = {
    WriteToElasticSearch.writeToIndex()
  }

  def writeToIndex(): Unit = {

    // Spark session configured to reach a local Elasticsearch node on port 9200
    val spark = SparkSession
      .builder()
      .appName("WriteToES")
      .master("local[*]")
      .config("spark.es.nodes", "localhost")
      .config("spark.es.port", "9200")
      .getOrCreate()

    import spark.implicits._

    val indexDocuments = Seq(
      AlbumIndex("Led Zeppelin", 1969, "Led Zeppelin"),
      AlbumIndex("Boston", 1976, "Boston"),
      AlbumIndex("Fleetwood Mac", 1979, "Tusk")
    ).toDF

    // Write the DataFrame to the "demoindex" index (type "albumindex")
    indexDocuments.saveToEs("demoindex/albumindex")
  }
}

Further Reading

🔗 Read more about Text Analyzers here

🔗 Read more about Snowflake here

🔗 Read more about Elasticsearch here

🔗 Read more about Kafka here

🔗 Read more about Spark here

🔗 Read more about Data Lakes here

🔗 Read more about Redshift vs Snowflake here

🔗 Read more about Best Practices on Database Design here

References

🔗 Read this blog on CCR