What is Elasticsearch? Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.
🔗 Read my Blog about Text Analyzers here
/// Optimized Mapping
PUT my_logs
{
  "mappings":{
    "properties":{
      "message":{
        "type":"text"          -- message is a text field, so you can search it for individual words
      },
      "http_version":{
        "type":"keyword"       -- http_version is a keyword field, used for exact matches and aggregations
      },
      "country_name":{
        "type":"text",
        "fields":{
          "keyword":{
            "type":"keyword",  -- country_name is indexed twice (text and keyword), giving full flexibility for that field
            "ignore_above":256
          }
        }
      }
    }
  }
}
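To see the difference in practice, a full-text match query suits the analyzed message field, while a terms aggregation suits the http_version keyword field. A sketch assuming the my_logs index above:

```json
GET my_logs/_search
{
  "query": {
    "match": { "message": "error timeout" }
  },
  "aggs": {
    "versions": {
      "terms": { "field": "http_version" }
    }
  }
}
```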
// Dynamic Mapping
{
  "blogs_temp":{
    "mappings":{
      "properties":{
        "@timestamp":{
          "type":"date"
        },
        "author":{
          "type":"text",
          "fields":{
            "keyword":{
              "type":"keyword",
              "ignore_above":256
            }
          }
        },
        "category":{
          "type":"text",
          "fields":{
            "keyword":{
              "type":"keyword",
              "ignore_above":256
            }
          }
        }
      }
    }
  }
}
PUT test2
{
  "mappings":{
    "dynamic_templates":[
      {
        "my_string_fields":{
          "match_mapping_type":"string",
          "mapping":{
            "type":"keyword"
          }
        }
      }
    ]
  }
}
Dynamic templates are useful when documents have a large number of fields, or when field names are not known at the time you write the mapping definition. Using dynamic templates, you can define a field's mapping based on the data type Elasticsearch detects (match_mapping_type), the field name (match / unmatch), or the full dotted path to the field (path_match / path_unmatch).
Under the hood, Lucene builds multiple data structures out of your documents: inverted indices (for search) and doc values (for sorting and aggregations).
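As an example of matching on field name rather than detected type, the following sketch (template and field names are illustrative) maps any new string field whose name starts with ip_ to the ip type:

```json
PUT test3
{
  "mappings": {
    "dynamic_templates": [
      {
        "ip_fields": {
          "match": "ip_*",
          "match_mapping_type": "string",
          "mapping": { "type": "ip" }
        }
      }
    ]
  }
}
```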
/// The nested bucket aggregation solves the issue by aggregating inner documents independently from each other
GET blogs_example/_search
{
  "size":0,
  "aggs":{
    "nested_authors":{
      "nested":{
        "path":"authors"
      },
      "aggs":{
        "companies":{
          "terms":{
            "field":"authors.company.keyword"
          },
          "aggs":{
            "authors":{
              "terms":{
                "field":"authors.name.keyword"
              }
            }
          }
        }
      }
    }
  }
}
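The nested aggregation only works if authors is mapped as a nested field. A sketch of the mapping this query assumes for blogs_example:

```json
PUT blogs_example
{
  "mappings": {
    "properties": {
      "authors": {
        "type": "nested",
        "properties": {
          "name":    { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
          "company": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
        }
      }
    }
  }
}
```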
Data Streams in 7.10
PUT /_data_stream/my-data-stream-alt
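Note that creating a data stream requires a matching index template with a data_stream object to exist first. A minimal sketch (the template name and priority are illustrative):

```json
PUT _index_template/my-data-stream-template
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": {},
  "priority": 200
}
```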
Index templates
An index template is a way to tell Elasticsearch how to configure an index when it is created. For data streams, the index template configures the stream’s backing indices as they are created. Templates are configured prior to index creation and then when an index is created either manually or through indexing a document, the template settings are used as a basis for creating the index.
There are two types of templates: index templates and component templates. Component templates are reusable building blocks that configure mappings, settings, and aliases. You use component templates to construct index templates; they aren't applied directly to a set of indices. Index templates can contain a collection of component templates, as well as directly specify settings, mappings, and aliases.
Examples of Index Template APIs:
Put index template
Get index template
Delete index template
Put component template
Get component template
Delete component template
Index template exists
Simulate index
Simulate template
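For example, an index template can be assembled from component templates via composed_of (all names here are illustrative):

```json
PUT _component_template/my-settings
{
  "template": {
    "settings": { "number_of_shards": 1 }
  }
}

PUT _component_template/my-mappings
{
  "template": {
    "mappings": {
      "properties": { "@timestamp": { "type": "date" } }
    }
  }
}

PUT _index_template/my-template
{
  "index_patterns": ["my-index-*"],
  "composed_of": ["my-settings", "my-mappings"],
  "priority": 100
}
```

When component templates overlap, the ones listed later in composed_of take precedence.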
Ingestion pipelines
Use an ingest node to pre-process documents before the actual indexing happens. The ingest node intercepts bulk and index requests, applies the transformations, and then passes the documents back to the index or bulk APIs. To pre-process documents before indexing, define a pipeline that specifies a series of processors. Each processor transforms the document in some specific way. For example, a pipeline might have one processor that removes a field from the document, followed by another processor that renames a field. The configured pipelines are stored in the cluster state.
To use a pipeline, simply specify the pipeline parameter on an index or bulk request. This way, the ingest node knows which pipeline to use.
For example: Create a pipeline
PUT _ingest/pipeline/my_pipeline_id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "new"
      }
    }
  ]
}

PUT my-index-00001/_doc/my-id?pipeline=my_pipeline_id
{
  "foo": "bar"
}
Elasticsearch confirms that the document was created, with the pipeline applied before indexing:
{
  "_index" : "my-index-00001",
  "_type" : "_doc",
  "_id" : "my-id",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
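A pipeline can also be tested without indexing anything, using the simulate API against the pipeline created above:

```json
POST _ingest/pipeline/my_pipeline_id/_simulate
{
  "docs": [
    { "_source": { "foo": "bar" } }
  ]
}
```

The response shows each document as it would look after the processors run, so you can verify the set processor adds foo with value "new".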
Elastic SQL
Elasticsearch has the speed, scale, and flexibility your data needs — and it speaks SQL. Use traditional database syntax to unlock non-traditional performance, like full text search across petabytes of data with real-time results.
PUT /library/_bulk?refresh
{"index":{"_id": "Leviathan Wakes"}}
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"index":{"_id": "Hyperion"}}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
{"index":{"_id": "Dune"}}
{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604}
POST /_sql?format=txt
{
"query": "SELECT * FROM library WHERE release_date < '2000-01-01'"
}
author | name | page_count | release_date
---------------+---------------+---------------+------------------------
Dan Simmons |Hyperion |482 |1989-05-26T00:00:00.000Z
Frank Herbert |Dune |604 |1965-06-01T00:00:00.000Z
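The translate API shows the Query DSL that a SQL statement compiles down to, which is useful for understanding what actually runs against the index:

```json
POST /_sql/translate
{
  "query": "SELECT * FROM library WHERE release_date < '2000-01-01'"
}
```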
Lucene 8 under the hood
Cross-cluster replication
CCR is designed around an active-passive index model. An index in one Elasticsearch cluster can be configured to replicate changes from an index in another Elasticsearch cluster. The index that is replicating changes is termed the "follower index" and the index being replicated from is termed the "leader index". The follower index is passive in that it can serve read requests and searches but cannot accept direct writes; only the leader index is active for direct writes. Because CCR is managed at the index level, a cluster can contain both leader indices and follower indices. In this way, you can solve some active-active use cases by replicating some indices in one direction and other indices in the other direction.
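A sketch of creating a follower index with the CCR follow API, assuming a remote cluster named leader_cluster has already been configured and contains leader_index (both names are illustrative):

```json
PUT /follower_index/_ccr/follow
{
  "remote_cluster": "leader_cluster",
  "leader_index": "leader_index"
}
```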
ILM
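ILM (Index Lifecycle Management) automates how indices move through lifecycle phases such as hot, warm, cold, and delete. A minimal sketch of a policy (the name and thresholds are illustrative):

```json
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```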
High-level REST APIs
Elasticsearch exposes REST APIs that are used by the UI components and can be called directly to configure and access Elasticsearch features.
Index APIs
Index APIs are used to manage individual indices, index settings, aliases, mappings, and index templates.
Examples of Index Management APIs:
Create index
Delete index
Get index
Index exists
Close index
Open index
Shrink index
Split index
Clone index
Rollover index
Freeze index
Unfreeze index
Resolve index
Examples of Mapping Management APIs:
Put mapping
Get mapping
Get field mapping
Type exists
Examples of Alias Management APIs:
Add index alias
Delete index alias
Get index alias
Index alias exists
Update index alias
Examples of Index Settings APIs:
Update index settings
Get index settings
Analyze
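The analyze API shows how an analyzer tokenizes a piece of text, for example:

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch is FAST!"
}
```

The standard analyzer lowercases and splits on word boundaries, so this returns the tokens elasticsearch, is, and fast.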
/// https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object WriteToElasticSearch {

  def main(args: Array[String]): Unit = {
    WriteToElasticSearch.writeToIndex()
  }

  def writeToIndex(): Unit = {
    val spark = SparkSession
      .builder()
      .appName("WriteToES")
      .master("local[*]")
      .config("spark.es.nodes", "localhost")
      .config("spark.es.port", "9200")
      .getOrCreate()

    import spark.implicits._

    val indexDocuments = Seq(
      AlbumIndex("Led Zeppelin", 1969, "Led Zeppelin"),
      AlbumIndex("Boston", 1976, "Boston"),
      AlbumIndex("Fleetwood Mac", 1979, "Tusk")
    ).toDF

    indexDocuments.saveToEs("demoindex/albumindex")
  }
}

case class AlbumIndex(artist: String, yearOfRelease: Int, albumName: String)
🔗 Read more about Text Analyzers here
🔗 Read more about Snowflake here
🔗 Read more about Elasticsearch here
🔗 Read more about Kafka here
🔗 Read more about Spark here
🔗 Read more about Data Lakes here
🔗 Read more about Redshift vs Snowflake here
🔗 Read more about Best Practices on Database Design here