Module 10 - Specialized Databases

Elasticsearch

Full-text search, log analytics, and real-time insights at scale.

1. The Library Index Analogy

Simple Analogy

A library catalog tells you which shelf holds books about "machine learning" without scanning every book. Elasticsearch builds an index of every word in your data, so finding documents containing "distributed systems" typically takes milliseconds, even across billions of documents.

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It's designed for full-text search, log analytics, and real-time data exploration.

2. Core Concepts

Index

Collection of documents, roughly analogous to a database table. Its schema is called a mapping.

logs-2024-01, products, users

Document

JSON object stored in an index (like a table row).

{"title": "iPhone 15", "price": 999}

Shard

Horizontal partition of an index. Enables parallel processing.

5 primary shards across 5 nodes

Replica

Copy of a shard for redundancy and read scaling.

1 replica = 2 copies of each shard
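To make the shard and replica arithmetic concrete, here is a minimal Python sketch. It assumes a simple "hash the document ID, take it modulo the number of primary shards" scheme. The function names are hypothetical, and zlib.crc32 is only a stand-in: Elasticsearch applies its own hash function to a routing value (the document ID by default).

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a primary shard number (simplified illustration)."""
    # crc32 is a stand-in hash; unlike Python's built-in hash(), it is stable across runs.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def total_shard_copies(num_primary_shards: int, num_replicas: int) -> int:
    """Each primary shard exists once, plus once per configured replica."""
    return num_primary_shards * (1 + num_replicas)


# 5 primary shards with 1 replica each means 10 shard copies in the cluster.
print(total_shard_copies(5, 1))
print(pick_shard("product-42", 5))
```

Because the shard number depends on the primary shard count, changing that count on an existing index would send the same document ID to a different shard, which is why the count is normally chosen when the index is created.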

3. Inverted Index

How Full-Text Search Works

Documents

Doc 1: "The quick brown fox"

Doc 2: "The quick rabbit"

Doc 3: "The brown dog"

Inverted Index

"quick" → [Doc 1, Doc 2]

"brown" → [Doc 1, Doc 3]

"fox" → [Doc 1]

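An inverted index maps each word to the documents containing it. A search that requires both "quick" and "brown" is the intersection of [1, 2] and [1, 3], which is [1]. Each term lookup is close to constant time, and the cost of the query scales with the length of the matching posting lists rather than with the total number of documents.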
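The three example documents can be indexed and searched with a few lines of Python. This is a toy sketch: it lowercases the text and splits on whitespace, whereas Lucene's analyzers and on-disk structures are far more elaborate. The function names are hypothetical.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "The quick rabbit",
    3: "The brown dog",
}


def build_inverted_index(documents):
    """Map each lowercased token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return IDs of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_inverted_index(docs)
print(search_all_terms(index, "quick brown"))  # [1]
```

The search never reads the document text. It only looks up two posting sets and intersects them, which is why the work depends on how many documents match each term rather than on the size of the collection.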

4. Query Types

Match

Full-text search with relevance scoring

{"match": {"title": "quick fox"}}

Term

Exact match against the indexed value (the query text is not analyzed)

{"term": {"status": "published"}}

Range

Numeric or date ranges

{"range": {"price": {"gte": 100, "lte": 500}}}

Bool

Combine queries (must, should, must_not, filter)

{"bool": {"must": [...], "filter": [...]}}

Aggregation

Analytics: count, avg, histogram, terms

{"aggs": {"by_category": {"terms": {"field": "category"}}}}
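The snippets above can be combined into one search request body. The sketch below builds it as a plain Python dict and prints the JSON. The helper name build_search_body is hypothetical, and the field names (title, status, price, category) are taken from the examples above. Sending the body to a cluster is left out, since that depends on the client in use.

```python
import json


def build_search_body(text, status, min_price, max_price):
    """Assemble a bool query with a scored match, unscored filters, and a terms aggregation."""
    return {
        "query": {
            "bool": {
                # must clauses contribute to the relevance score
                "must": [{"match": {"title": text}}],
                # filter clauses only include or exclude documents; they do not affect scoring
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"price": {"gte": min_price, "lte": max_price}}},
                ],
            }
        },
        # bucket the matching documents by category
        "aggs": {"by_category": {"terms": {"field": "category"}}},
    }


body = build_search_body("quick fox", "published", 100, 500)
print(json.dumps(body, indent=2))
```

Putting the exact-match and range conditions under filter rather than must keeps them out of the score calculation, so only the full-text match decides ranking.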

5. Use Cases

Full-Text Search

Product search, site search, document search

Amazon, Wikipedia, GitHub

Log Analytics

Centralized logging with ELK stack (Elasticsearch, Logstash, Kibana)

DevOps, security, debugging

Application Monitoring

APM, metrics, traces with Elastic APM

Distributed tracing

Security Analytics

SIEM, threat detection, anomaly detection

Elastic Security

6. Scaling Considerations

Shard Sizing

Aim for 10-50GB per shard. Too many small shards add cluster overhead; too few large shards slow down queries.
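As a rough worked example of the sizing rule, the sketch below estimates a primary shard count from an expected index size. The function name and the 30GB default target are assumptions picked from the middle of the 10-50GB range, not an official formula.

```python
import math


def primary_shards_needed(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Estimate how many primary shards keep each shard near the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # Round up so no shard exceeds the target, and always keep at least one shard.
    return max(1, math.ceil(index_size_gb / target_shard_gb))


print(primary_shards_needed(600))  # 20
print(primary_shards_needed(5))    # 1
```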

Index Lifecycle

Use ILM to rotate indices: hot → warm → cold → delete. Essential for logs.
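A policy following the hot, warm, cold, delete sequence might look like the sketch below, written as a Python dict that serializes to the JSON body of an ILM policy. All thresholds (50gb, 1d, 7d, 30d, 90d) are illustrative assumptions, and the actions chosen for each phase are only one possible combination.

```python
import json

# Hypothetical retention settings for a log index; tune every value to your own data.
ilm_policy = {
    "policy": {
        "phases": {
            # hot: actively written; roll over to a new index by size or age
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            # warm: no longer written; merge segments to make reads cheaper
            "warm": {
                "min_age": "7d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            # cold: rarely queried; lower its recovery priority
            "cold": {
                "min_age": "30d",
                "actions": {"set_priority": {"priority": 0}},
            },
            # delete: remove the index once it is past retention
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```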

Memory

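ES is memory-hungry. Set the JVM heap to about 50% of RAM, and keep it below roughly 32GB so the JVM can still use compressed object pointers. Leave the rest for the OS file system cache.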
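The heap rule reduces to a one-line calculation. In the sketch below the function name is hypothetical, and the 31GB default cap is an assumption chosen to stay safely under the 32GB limit mentioned above.

```python
def recommended_heap_gb(total_ram_gb: float, heap_cap_gb: float = 31.0) -> float:
    """Half of RAM for the JVM heap, capped; the remainder is left for the OS cache."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb / 2, heap_cap_gb)


print(recommended_heap_gb(16))   # 8.0
print(recommended_heap_gb(128))  # 31.0
```

On the 128GB machine the heap stops at the cap, so most of the RAM goes to the file system cache, which Lucene relies on for fast segment reads.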

Mapping

Define mappings explicitly. At scale, dynamic mapping can infer the wrong field types or let the number of fields grow without bound (a mapping explosion).
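An explicit mapping for the product examples used earlier could be sketched as follows, again as a Python dict that serializes to the index-creation body. The field list is an assumption based on the sample documents and queries in this module. Setting "dynamic" to "strict" makes Elasticsearch reject documents that contain unmapped fields.

```python
import json

# Hypothetical mapping for a "products" index.
index_definition = {
    "mappings": {
        "dynamic": "strict",  # reject documents with fields not listed below
        "properties": {
            "title": {"type": "text"},        # analyzed, for match queries
            "status": {"type": "keyword"},    # not analyzed, for term queries
            "category": {"type": "keyword"},  # not analyzed, for terms aggregations
            "price": {"type": "float"},       # numeric, for range queries
        },
    }
}

print(json.dumps(index_definition, indent=2))
```

The text versus keyword split mirrors the match versus term distinction: text fields are broken into tokens for full-text search, while keyword fields keep the whole value for exact matching and aggregation.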

7. Key Takeaways

1. Elasticsearch = distributed search engine built on Lucene

2. Inverted index maps words to documents, so a search looks up terms instead of scanning every document

3. Shards for horizontal scaling, replicas for redundancy

4. Well suited to full-text search and log analytics

5. ELK stack: Elasticsearch + Logstash + Kibana for observability

Quiz

1. An e-commerce site needs product search with typo tolerance. What is the best choice?

2. What makes Elasticsearch search fast?