What is Elasticsearch? Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.
🔗 Read my Blog about Text Analyzers here
/// Optimized Mapping
PUT my_logs
{
  "mappings":{
    "properties":{
      "message":{
        "type":"text"          -- message is a text field, so you can search it for individual words
      },
      "http_version":{
        "type":"keyword"       -- http_version is a keyword field, used for exact matches and aggregations
      },
      "country_name":{
        "type":"text",
        "fields":{
          "keyword":{
            "type":"keyword",  -- country_name is indexed twice (text and keyword), giving full flexibility for that field
            "ignore_above":256
          }
        }
      }
    }
  }
}
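To see the difference in practice, a full-text match query suits the analyzed message field, while a terms aggregation suits the http_version keyword field. A sketch assuming the my_logs index above:

```json
GET my_logs/_search
{
  "query": {
    "match": { "message": "error timeout" }
  },
  "aggs": {
    "versions": {
      "terms": { "field": "http_version" }
    }
  }
}
```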
// Dynamic Mapping
{
  "blogs_temp":{
    "mappings":{
      "properties":{
        "@timestamp":{
          "type":"date"
        },
        "author":{
          "type":"text",
          "fields":{
            "keyword":{
              "type":"keyword",
              "ignore_above":256
            }
          }
        },
        "category":{
          "type":"text",
          "fields":{
            "keyword":{
              "type":"keyword",
              "ignore_above":256
            }
          }
        }
      }
    }
  }
}
PUT test2
{
  "mappings":{
    "dynamic_templates":[
      {
        "my_string_fields":{
          "match_mapping_type":"string",
          "mapping":{
            "type":"keyword"
          }
        }
      }
    ]
  }
}
Dynamic templates are useful when documents have a large number of fields, or when field names are not known at the time you write the mapping definition. Using dynamic templates, you can define a field's mapping based on the data type Elasticsearch detects (match_mapping_type), the field name (match / unmatch), or the full dotted path to the field (path_match / path_unmatch).
Under the hood, Lucene builds multiple data structures out of your documents: inverted indices (for search) and doc values (for sorting and aggregations).
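As an example of matching on field name rather than detected type, the following sketch (template and field names are illustrative) maps any new string field whose name starts with ip_ to the ip type:

```json
PUT test3
{
  "mappings": {
    "dynamic_templates": [
      {
        "ip_fields": {
          "match": "ip_*",
          "match_mapping_type": "string",
          "mapping": { "type": "ip" }
        }
      }
    ]
  }
}
```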
/// The nested bucket aggregation solves the issue by aggregating inner documents independently from each other
GET blogs_example/_search
{
  "size":0,
  "aggs":{
    "nested_authors":{
      "nested":{
        "path":"authors"
      },
      "aggs":{
        "companies":{
          "terms":{
            "field":"authors.company.keyword"
          },
          "aggs":{
            "authors":{
              "terms":{
                "field":"authors.name.keyword"
              }
            }
          }
        }
      }
    }
  }
}
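The nested aggregation only works if authors is mapped as a nested field. A sketch of the mapping this query assumes for blogs_example:

```json
PUT blogs_example
{
  "mappings": {
    "properties": {
      "authors": {
        "type": "nested",
        "properties": {
          "name":    { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
          "company": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
        }
      }
    }
  }
}
```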
Data Streams in 7.10
PUT /_data_stream/my-data-stream-alt
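Note that creating a data stream requires a matching index template with a data_stream object to exist first. A minimal sketch (the template name and priority are illustrative):

```json
PUT _index_template/my-data-stream-template
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": {},
  "priority": 200
}
```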
Index templates
An index template is a way to tell Elasticsearch how to configure an index when it is created. For data streams, the index template configures the stream’s backing indices as they are created. Templates are configured prior to index creation and then when an index is created either manually or through indexing a document, the template settings are used as a basis for creating the index.
There are two types of templates: index templates and component templates. Component templates are reusable building blocks that configure mappings, settings, and aliases. You use component templates to construct index templates; they aren't applied directly to a set of indices. Index templates can contain a collection of component templates, as well as directly specify settings, mappings, and aliases.
Examples of Index Template APIs:
Put index template
Get index template
Delete index template
Put component template
Get component template
Delete component template
Index template exists
Simulate index
Simulate template
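For example, an index template can be assembled from component templates via composed_of (all names here are illustrative):

```json
PUT _component_template/my-settings
{
  "template": {
    "settings": { "number_of_shards": 1 }
  }
}

PUT _component_template/my-mappings
{
  "template": {
    "mappings": {
      "properties": { "@timestamp": { "type": "date" } }
    }
  }
}

PUT _index_template/my-template
{
  "index_patterns": ["my-index-*"],
  "composed_of": ["my-settings", "my-mappings"],
  "priority": 100
}
```

When component templates overlap, the ones listed later in composed_of take precedence.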
Ingestion pipelines
Use an ingest node to pre-process documents before the actual indexing happens. The ingest node intercepts bulk and index requests, applies the transformations, and then passes the documents back to the index or bulk APIs. To pre-process documents before indexing, define a pipeline that specifies a series of processors. Each processor transforms the document in some specific way. For example, a pipeline might have one processor that removes a field from the document, followed by another processor that renames a field. The configured pipelines are stored in the cluster state.
To use a pipeline, simply specify the pipeline parameter on an index or bulk request. This way, the ingest node knows which pipeline to use.
For example: Create a pipeline
PUT _ingest/pipeline/my_pipeline_id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "new"
      }
    }
  ]
}

PUT my-index-00001/_doc/my-id?pipeline=my_pipeline_id
{
  "foo": "bar"
}
Elasticsearch confirms that the document was created, with the pipeline applied before indexing:
{
  "_index" : "my-index-00001",
  "_type" : "_doc",
  "_id" : "my-id",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
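A pipeline can also be tested without indexing anything, using the simulate API against the pipeline created above:

```json
POST _ingest/pipeline/my_pipeline_id/_simulate
{
  "docs": [
    { "_source": { "foo": "bar" } }
  ]
}
```

The response shows each document as it would look after the processors run, so you can verify the set processor adds foo with value "new".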
Elastic SQL
Elasticsearch has the speed, scale, and flexibility your data needs — and it speaks SQL. Use traditional database syntax to unlock non-traditional performance, like full text search across petabytes of data with real-time results.
PUT /library/_bulk?refresh
{"index":{"_id": "Leviathan Wakes"}}
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"index":{"_id": "Hyperion"}}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
{"index":{"_id": "Dune"}}
{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604}
POST /_sql?format=txt
{
"query": "SELECT * FROM library WHERE release_date < '2000-01-01'"
}
author | name | page_count | release_date
---------------+---------------+---------------+------------------------
Dan Simmons |Hyperion |482 |1989-05-26T00:00:00.000Z
Frank Herbert |Dune |604 |1965-06-01T00:00:00.000Z
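The translate API shows the Query DSL that a SQL statement compiles down to, which is useful for understanding what actually runs against the index:

```json
POST /_sql/translate
{
  "query": "SELECT * FROM library WHERE release_date < '2000-01-01'"
}
```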
Lucene 8 under the hood
Cross-cluster replication
CCR is designed around an active-passive index model. An index in one Elasticsearch cluster can be configured to replicate changes from an index in another Elasticsearch cluster. The index that is replicating changes is termed the "follower index" and the index being replicated from is termed the "leader index". The follower index is passive in that it can serve read requests and searches but cannot accept direct writes; only the leader index is active for direct writes. Because CCR is managed at the index level, a cluster can contain both leader indices and follower indices. In this way, you can solve some active-active use cases by replicating some indices in one direction and other indices in the other direction.
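A sketch of creating a follower index with the CCR follow API, assuming a remote cluster named leader_cluster has already been configured and contains leader_index (both names are illustrative):

```json
PUT /follower_index/_ccr/follow
{
  "remote_cluster": "leader_cluster",
  "leader_index": "leader_index"
}
```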
ILM
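ILM (Index Lifecycle Management) automates how indices move through lifecycle phases such as hot, warm, cold, and delete. A minimal sketch of a policy (the name and thresholds are illustrative):

```json
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```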
High-level REST APIs
Elasticsearch exposes REST APIs that are used by the UI components and can be called directly to configure and access Elasticsearch features.
Index APIs
Index APIs are used to manage individual indices, index settings, aliases, mappings, and index templates.
Examples of Index Management APIs:
Create index
Delete index
Get index
Index exists
Close index
Open index
Shrink index
Split index
Clone index
Rollover index
Freeze index
Unfreeze index
Resolve index
Examples of Mapping Management APIs:
Put mapping
Get mapping
Get field mapping
Type exists
Examples of Alias Management APIs:
Add index alias
Delete index alias
Get index alias
Index alias exists
Update index alias
Examples of Index Settings APIs:
Update index settings
Get index settings
Analyze
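The analyze API shows how an analyzer tokenizes a piece of text, for example:

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch is FAST!"
}
```

The standard analyzer lowercases and splits on word boundaries, so this returns the tokens elasticsearch, is, and fast.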
/// https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object WriteToElasticSearch {

  def main(args: Array[String]): Unit = {
    WriteToElasticSearch.writeToIndex()
  }

  def writeToIndex(): Unit = {
    val spark = SparkSession
      .builder()
      .appName("WriteToES")
      .master("local[*]")
      .config("spark.es.nodes", "localhost")
      .config("spark.es.port", "9200")
      .getOrCreate()

    import spark.implicits._

    val indexDocuments = Seq(
      AlbumIndex("Led Zeppelin", 1969, "Led Zeppelin"),
      AlbumIndex("Boston", 1976, "Boston"),
      AlbumIndex("Fleetwood Mac", 1979, "Tusk")
    ).toDF

    indexDocuments.saveToEs("demoindex/albumindex")
  }
}

case class AlbumIndex(artist: String, yearOfRelease: Int, albumName: String)
🔗 Read more about Text Analyzers here
🔗 Read more about Snowflake here
🔗 Read more about Elasticsearch here
🔗 Read more about Kafka here
🔗 Read more about Spark here
🔗 Read more about Data Lakes here
🔗 Read more about Redshift vs Snowflake here
🔗 Read more about Best Practices on Database Design here