dangerousmetrics.com

Anvil (The Geek Stuff)

Introduction

My first exposure to Lucene based search systems came during the Elasticsearch 5.x era. The initial use case was not product search or website analytics, but operational log analysis at scale. The system ingested and indexed approximately 1.2 million email log events per day, where correctness, timestamp accuracy, and field discipline were not theoretical concerns but daily requirements.

That experience shifted search from a feature to infrastructure. Over time, I worked with clusters ranging from small single node deployments to large, distributed environments running Elasticsearch 7. Supporting those systems in production revealed the mechanics beneath the API: how mappings shape future capability, how shard decisions ripple through performance, and how seemingly small schema choices become long term operational constraints.

After leaving Elastic, I continued applying Lucene based platforms to different domains, particularly log correlation and telecom data analysis. During that transition, I chose to move toward OpenSearch, a fork originating from Elasticsearch 7. In my view, that version represented a mature balance of capability and operational clarity.

OpenSearch has evolved significantly since those early forked releases. It remains open source and, in my experience, is straightforward to operate locally and in constrained environments. The choice to use OpenSearch was not driven by licensing debate, but by practical considerations and familiarity with the architectural foundations it inherited.

This section of Anvil is not an introduction to search as a concept, nor is it a product comparison. It is an examination of how a Lucene based distributed document store behaves under real operational use. The goal is to treat OpenSearch as infrastructure that must be designed, mapped, queried, and maintained deliberately.

What follows is built from production experience, schema mistakes, reindex cycles, and the realities of operating search systems that other systems depend on. The emphasis is not on features, but on structure.


Scope

This system is concerned with understanding how OpenSearch behaves as a distributed document store under real operational constraints. It operates on the premise that search is not a feature layered on top of data, but a structural consequence of how data is stored, mapped, and partitioned.

The focus here is on indices, mappings, shards, and queries as architectural decisions. Documents are treated as immutable facts. Mappings are treated as contracts. Shards are treated as boundaries of responsibility, not implementation details.

This scope explicitly values correctness and predictability over convenience. Dynamic mapping, default shard counts, and copy paste examples are examined as tradeoffs, not best practices. When behavior is surprising, the underlying mechanics are traced until the outcome is explainable.

Just as important are the boundaries. This system does not attempt to teach cluster installation, cloud deployment strategy, or vendor specific features. It does not optimize for ecommerce relevance scoring, nor does it assume a particular application framework.

In practice, this means the system answers questions like how a field type influences aggregation behavior, how shard distribution affects performance, how mappings constrain future queries, and how ingestion decisions propagate through the lifecycle of an index.

The goal is operational clarity. By treating OpenSearch as infrastructure rather than a convenience layer, the system becomes something that can be reasoned about, not merely configured.


What Is OpenSearch

OpenSearch is a distributed, schema-driven document store built on top of the Apache Lucene search library. At its core, it accepts JSON documents, indexes their fields into inverted indices, and exposes a RESTful API for querying, aggregation, and cluster management.

Although commonly described as a search engine, OpenSearch behaves more accurately as a distributed indexing and retrieval platform. Search is a consequence of how data is structured and stored. The system’s primary responsibilities are document indexing, shard distribution, replication, and query coordination across nodes.
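
To make that concrete, the entire interaction happens over HTTP with JSON bodies. The following is a minimal sketch using an illustrative index and field names, not a configuration recommendation:

POST example-logs/_doc
{
  "message": "fiber outage reported",
  "service": "noc"
}

GET example-logs/_search
{
  "query": {
    "match": {
      "message": "outage"
    }
  }
}

The first request indexes a document. The second searches the field that was just indexed. Everything else described in this section happens between those two calls.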

Data in OpenSearch is organized into indices. Each index is divided into primary shards, which are logical partitions of the dataset. Shards may have replicas for redundancy and query load distribution. A cluster coordinates these shards across one or more nodes, routing indexing and search requests to the appropriate shard copies.

Every document indexed into OpenSearch is parsed according to a mapping. The mapping defines field types, analyzers, and structural expectations. These definitions are not cosmetic. They directly influence how data is tokenized, stored, and made searchable. Mapping decisions constrain future queries and determine which operations are possible without reindexing.

Under the surface, OpenSearch delegates the actual indexing and search mechanics to Apache Lucene. Lucene is responsible for maintaining inverted indices, handling segment merges, and executing low-level search algorithms. OpenSearch adds distribution, clustering, replication, APIs, and operational tooling around that core.

This layered architecture is critical to understand. Lucene provides the data structures and search primitives. OpenSearch provides the distributed system that coordinates those primitives at scale. When a query is executed, it is decomposed across shards, processed by Lucene within each shard, and then merged back into a unified result.

The result is a system that can ingest large volumes of structured data, make it searchable within seconds, and compute aggregations across millions of documents in near real time. Its capabilities, however, are inseparable from its structure. Understanding that structure is the purpose of this book.


What Is Lucene

Origins and Purpose

Apache Lucene is a high performance search library written in Java. It was originally created by Doug Cutting in the late 1990s and later became a top level Apache Software Foundation project. Lucene was not designed as a server. It is a library that applications embed in order to add full text search capabilities.

On its own, Lucene provides indexing and search primitives. It does not expose a REST API, manage clusters, or distribute data across nodes. Those capabilities were added later by systems such as Elasticsearch and OpenSearch. Lucene remains the underlying engine that performs the actual indexing and search operations.

Standalone Capability

Lucene can be used independently inside any Java application. A developer can define documents, create fields, build indices on disk, and execute queries directly against those indices. In this mode, Lucene operates as a local embedded search engine. There is no network layer, no cluster coordination, and no distributed shard logic.

This standalone capability is important because it reveals the true boundary of responsibility. Lucene handles indexing and searching of documents within a single index on a single machine. Everything else that modern search platforms provide is layered around this core.

Document Model and Inverted Index

Lucene operates on documents composed of fields. Each field has a defined type and may be stored, indexed, or both. When a document is indexed, Lucene analyzes the fields according to their configuration. Text fields may be tokenized into terms, filtered, and normalized.

The central data structure Lucene builds is the inverted index. Rather than mapping documents to the terms they contain, an inverted index maps terms to the documents in which they appear. This allows search queries to resolve quickly by locating matching terms first and then retrieving the associated document identifiers.

As documents are indexed, Lucene writes immutable segments to disk. Over time, these segments are merged in the background to improve search performance and reclaim deleted document space. Segment merging is a fundamental part of Lucene’s lifecycle and directly influences indexing throughput and disk usage.

From Library to Distributed System

Elasticsearch built upon Lucene by wrapping it in a distributed system. Instead of embedding Lucene in a single application, Elasticsearch exposed it as a network service, added JSON document handling, introduced dynamic mappings, and implemented shard distribution across nodes.

Each shard in Elasticsearch and OpenSearch is, at its core, a Lucene index. When a document is indexed into a shard, it is ultimately written into Lucene segments on disk. When a query executes, Lucene performs the low level term lookup, scoring, and document retrieval within that shard.

The distributed coordination layer determines which shards receive the request and how partial results are merged. Lucene does not know about clusters or replicas. It simply indexes and searches within the boundaries of a single index.

Why This Matters

Understanding Lucene clarifies why mapping, field types, analyzers, and shard counts are not superficial configuration choices. They influence how Lucene builds its inverted index, how segments are structured, and how queries are executed at the lowest level.

OpenSearch and Elasticsearch may evolve independently, but they both inherit Lucene’s core mechanics. To reason about performance, relevance, or storage behavior, one must understand the underlying library that actually performs the work.


The Inverted Index

Relational Index vs Inverted Index

In a relational database, an index typically maps a column value to the row locations that contain that value. If a table has an index on a column such as email_address, the database builds a structure that allows it to quickly locate rows matching a specific value.

This works well for exact lookups and range queries on structured fields. However, relational indices are not designed for efficient full text search across large bodies of text. Searching for words within arbitrary text fields would require scanning large portions of the dataset unless additional specialized indexing strategies are introduced.

An inverted index reverses the relationship. Instead of mapping documents to the terms they contain, it maps terms to the documents in which they appear. The structure is optimized for answering the question: which documents contain this term.

Conceptual Example

Consider three simple documents:

Doc1: "fiber outage reported"
Doc2: "fiber maintenance scheduled"
Doc3: "router outage detected"

After tokenization and normalization, the inverted index might look conceptually like this:

fiber       -> [Doc1, Doc2]
outage      -> [Doc1, Doc3]
reported    -> [Doc1]
maintenance -> [Doc2]
scheduled   -> [Doc2]
router      -> [Doc3]
detected    -> [Doc3]

Instead of scanning each document to see whether it contains the word "outage", the system can directly look up the term in the inverted index and retrieve the list of matching document identifiers.

How a Query Executes

When a query such as "fiber outage" is executed, the analyzer first processes the input using the same tokenization rules applied during indexing. The query is broken into terms: fiber and outage.

The search engine then performs a lookup for each term in the inverted index:

fiber  -> [Doc1, Doc2]
outage -> [Doc1, Doc3]

Because both terms must be present for a simple AND query, the engine computes the intersection of the two document lists:

Intersection -> [Doc1]

Doc1 is the only document that contains both terms. If the query were an OR query, the engine would compute the union of the lists instead.

In real systems, the inverted index also stores additional metadata, such as term frequency and positional information. This allows the engine to compute relevance scores, evaluate phrase queries, and determine proximity between terms within documents.
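
In the query DSL, this AND versus OR behavior and the use of positional data map onto familiar constructs. The sketch below uses an illustrative index name; the match query controls boolean behavior through its operator parameter, while match_phrase relies on the positional information described above:

GET example-logs/_search
{
  "query": {
    "match": {
      "message": {
        "query": "fiber outage",
        "operator": "and"
      }
    }
  }
}

GET example-logs/_search
{
  "query": {
    "match_phrase": {
      "message": "fiber outage"
    }
  }
}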

Why This Scales

The inverted index scales because it transforms text search into set operations over precomputed term to document mappings. The cost of a query depends primarily on the number of matching terms and the size of their posting lists, not on the total number of documents stored.

As the index grows, Lucene organizes terms and posting lists into highly optimized on disk structures. Segment files are immutable, and lookups are designed to minimize disk seeks and maximize sequential reads. Query execution becomes a combination of term lookup, posting list traversal, scoring, and result merging.

Connecting Back to OpenSearch

Each shard in OpenSearch contains one or more Lucene segments, each with its own inverted index. When a distributed query runs, it is executed independently against the inverted index within each shard. The coordinating node then merges the partial results into a final response.

Understanding the inverted index explains why field types, analyzers, and mapping decisions are not superficial. They directly determine how terms are generated and stored, which in turn defines how queries behave.


Segments and Merging

In the previous section, segments were mentioned as part of how Lucene stores indexed data. To understand performance, storage behavior, and shard mechanics, segments must be examined directly.

What Is a Segment

A segment is an immutable index file set written to disk by Lucene. When documents are indexed, they are first accumulated in memory and then flushed into a new segment. Each segment contains its own inverted index, term dictionary, posting lists, and stored field data.

Segments are self contained. Once written, they are never modified. If a document is updated or deleted, Lucene does not rewrite the original segment. Instead, it records a deletion marker and writes new data into a future segment.

This immutability is deliberate. It allows search operations to occur without locking entire structures for writes. Readers can continue to query existing segments while new segments are being created.

A Shard Is a Collection of Segments

In OpenSearch and Elasticsearch, a shard is essentially a Lucene index. That Lucene index is composed of multiple segments over time. When a shard receives documents, it accumulates segments as indexing continues.

At any moment, a shard may contain dozens or hundreds of segments, depending on indexing volume and merge policy behavior. Search operations must execute across all active segments within the shard, then merge results internally before returning them to the coordinating node.

On Disk: Inspecting a Shard

Concepts like segments and shards become much clearer when viewed directly on disk. Below is an example directory listing from an OpenSearch data folder:

/var/lib/opensearch/nodes/0/indices/8fzvlKidQvWWuM56LdeoZQ/0/

index/
  _10.cfs
  _10.cfe
  _10.si
  _12.cfs
  _12.cfe
  _12.si
  _u.cfs
  _u.si
  ...
  segments_m
  write.lock

translog/
  translog-17.tlog
  translog.ckp

_state/
  state-0.st

The directory path itself tells a story. The long identifier 8fzvlKidQvWWuM56LdeoZQ represents the internal UUID of the index. The 0 directory is shard number zero. Inside that shard directory are three major components: index, translog, and _state.

The index Directory

The index directory contains the Lucene segment files. Each prefix such as _10, _12, _u, or _z represents a segment. A shard is simply the collection of these segments treated as a single logical index.

Common file types include:

  • .si — Segment metadata file. Describes the segment’s structure.
  • .cfs and .cfe — Compound segment files. Lucene often packs multiple internal structures into compound files for efficiency.
  • .fnm — Field metadata describing indexed fields.
  • .dvd and .dvm — Doc values data and metadata. These power aggregations and sorting.
  • .doc, .pos, .tim, .tip — Core inverted index structures containing postings, term dictionaries, and positional information.
  • segments_m — The file that tracks which segments are currently active and visible.
  • write.lock — Prevents concurrent writes from corrupting the index.

Each time new documents are flushed, a new segment prefix appears. Over time, merge operations consolidate smaller segments into larger ones. When that happens, new segment prefixes are created and older ones are eventually removed.

The translog Directory

The transaction log records operations that have been acknowledged but not yet fully committed into stable Lucene segments. If the node crashes, OpenSearch can replay the translog to recover recent writes. Once a flush occurs and data is safely persisted, portions of the translog are trimmed.

The _state Directory

The _state directory contains shard level metadata, including information about shard allocation and retention leases. This data supports replication and cluster coordination.

Viewing these files reinforces an important point: a shard is not abstract. It is a directory of immutable Lucene segment files, a transaction log, and shard state metadata. Search performance, merge behavior, and storage growth are all reflected physically in this structure.

Why Merging Exists

If segments were never merged, their count would grow indefinitely. Each query would need to traverse more posting lists and maintain more file handles. Over time, search performance would degrade and disk usage would fragment.

To prevent this, Lucene runs a background merge process. Smaller segments are combined into larger ones according to a merge policy. During a merge, new consolidated segment files are written, and the older segments are eventually removed once no longer referenced.

Merging is both necessary and expensive. It consumes disk I/O, CPU, and temporary storage space. However, it improves search performance by reducing the number of segments and optimizing internal data structures.

Deletes and Updates

Because segments are immutable, deletes are handled through a bitset that marks documents as no longer visible. Updates are implemented as a delete followed by an insert. The original document remains in its segment until a merge operation rewrites the data into a new segment without the deleted entry.

This behavior explains why high update workloads can increase merge pressure. Large numbers of deleted documents accumulate until segments are merged and rewritten.

Operational Implications

Segment mechanics directly influence indexing throughput and search latency. Heavy ingestion can produce many small segments, triggering aggressive merges. Insufficient disk bandwidth can cause merge backlogs, which in turn slow indexing.

Monitoring segment counts and merge activity provides insight into cluster health. High segment counts per shard often indicate either sustained ingestion or insufficient merge capacity.
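
Segment counts can be observed directly through the cat segments API. The request below targets one of the monthly log indices used elsewhere in this chapter; the output is abbreviated and illustrative:

GET _cat/segments/logs-2026-01?v

index        shard prirep segment docs.count docs.deleted   size committed searchable
logs-2026-01 0     p      _10          10421          312 28.4mb true      true
logs-2026-01 0     p      _12           1804            0  4.1mb true      true

A steadily growing list of small segments, or a large docs.deleted count relative to docs.count, is usually the first visible sign of merge pressure.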

Understanding segments clarifies why shard sizing matters. A shard is not a monolithic file. It is a living collection of immutable segments that are continuously created and consolidated. Performance characteristics emerge from how those segments evolve over time.

To reason about shard behavior, one must reason about segments. They are the physical reality beneath the distributed abstraction.


What Are Shards

A shard is a single Lucene index that represents a logical partition of an OpenSearch index. It is not a symbolic division or a metadata construct. It is a concrete directory containing segment files, transaction logs, and shard state information.

When an index is created in OpenSearch, it is divided into one or more primary shards. Each primary shard is capable of indexing and searching independently. Replica shards, when configured, are copies of those primaries and provide redundancy and additional query capacity.

Why Shards Exist

Shards exist to distribute data and workload. A single Lucene index has practical limits in terms of memory usage, file handles, and recovery time. By partitioning an index into shards, OpenSearch can spread storage and query execution across multiple nodes.

Each shard operates as a self contained search engine. When a query is executed, the coordinating node sends that query to every relevant shard. Each shard evaluates the query against its own segments and returns partial results. The coordinating node then merges those results into a final response.

Shard Boundaries Are Real

A shard boundary is not transparent. Documents are assigned to shards based on a routing function, typically a hash of the document ID. Once assigned, a document belongs to that shard unless it is reindexed into a different index configuration.

Because of this, shard count is not a tuning knob that can be changed casually after data is written. Changing the number of primary shards requires reindexing, since the routing function depends on shard count.

Performance Implications

Too few shards can concentrate load and create large shard sizes that are slow to recover or relocate. Too many shards can increase memory overhead, file descriptor usage, and coordination cost during queries.

Every search request becomes a distributed operation across shards. The more shards involved, the more partial results must be merged. This coordination cost grows with shard count.

Shards are therefore units of scale, but also units of overhead. Designing shard strategy requires balancing distribution, recovery time, and operational complexity.

Connecting Back to Segments

Inside each shard are Lucene segments. Merge pressure, segment count, and indexing throughput all occur at the shard level. When monitoring cluster health, one is effectively monitoring the behavior of many independent Lucene indices operating in parallel.

Understanding shards requires understanding segments. Without that foundation, shard configuration becomes guesswork.


Distributed Shards

Shards become powerful only when distributed. The defining capability of OpenSearch is not indexing or searching, but distributing those operations across multiple nodes in a coordinated cluster.

When an index is created with one primary shard and one replica, OpenSearch ensures that the primary and its replica are placed on different nodes. This guarantees redundancy and allows query load to be balanced.

Primary and Replica Roles

The primary shard is responsible for accepting write operations. When a document is indexed, it is routed to the primary shard based on a hashing function. The primary applies the write and then forwards the operation to its replica shard.

Replica shards maintain an identical copy of the primary shard’s data. They can serve read requests and provide fault tolerance if the primary node fails.

Three Node Example

Consider a simple three node cluster with one primary shard and one replica:

Node A       Node B       Node C
Primary      Replica      (no shard copy)

In this configuration:

  • Writes are routed to the primary shard on Node A.
  • The primary forwards operations to the replica on Node B.
  • Both primary and replica can serve read queries.
  • Node C remains available for other shards or cluster services.

If Node A fails, the replica on Node B is promoted to primary. Cluster state is updated, and indexing continues without data loss, assuming replication was successful.

Distributed Query Execution

For read operations, the coordinating node can route queries to either the primary or replica shard. This allows search load to be distributed. When multiple shards exist, queries are executed in parallel across all relevant shard copies.

This is the fundamental value of OpenSearch: it distributes indexing, storage, and query execution across many hosts while presenting a unified API to the client.

Without shard distribution, OpenSearch would be a single machine Lucene wrapper. With distribution, it becomes horizontally scalable infrastructure.

Replication and Allocation Tradeoffs

In the previous example, the index was configured with one primary shard and one replica. In a three node cluster, it is technically possible to configure two replicas. That would result in one primary shard and two replica shards, each placed on separate nodes.

This increases read capacity and improves fault tolerance. If any single node fails, at least one additional copy of the shard remains available. Search requests can also be distributed across more shard copies.

However, replication is not free. Each replica consumes the same disk space as the primary. Every indexing operation must be forwarded to each replica, increasing write amplification and network traffic. Recovery time also increases as more copies must be synchronized.

In a small cluster, especially one with only three nodes, adding multiple replicas may provide diminishing returns. The gain in read distribution must be weighed against storage cost, indexing overhead, and operational complexity.
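
Unlike primary shard count, replica count is a dynamic setting and can be changed on a live index. A minimal sketch, using one of the monthly log indices as an example:

PUT logs-2026-01/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}

The cluster will attempt to allocate the additional replica copies immediately, subject to the placement rules discussed below.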

It is also important to consider primary shard count. For example, if this same three node cluster were configured with two primary shards and one replica, the system would need to place four shard copies in total. Because a primary and its replica cannot reside on the same node, allocation pressure increases immediately. With only three nodes available, shard placement becomes constrained and resilience assumptions may not hold as expected.

Shard count and replica count are therefore tightly coupled to cluster size. These settings cannot be chosen in isolation. They define placement rules, recovery behavior, and fault tolerance boundaries.

The interaction between shard sizing, replica strategy, and node count is where many production clusters encounter problems. The next section examines these design constraints in detail under shard gotchas and operational pitfalls.


Shard Design

Shard count is one of the most consequential decisions made when an index is created. It defines data distribution, recovery behavior, memory overhead, and cluster coordination cost. Unlike many settings, primary shard count cannot be changed without reindexing.

Shards Are Units of Scale and Overhead

Each shard is a complete Lucene index. It maintains its own segments, file handles, caches, and memory structures. Increasing shard count increases parallelism, but it also increases overhead.

Every shard consumes heap memory for segment metadata and field data structures. Every shard participates in cluster state updates. Every search request must coordinate across all relevant shards.

Shards improve distribution, but too many shards amplify coordination cost.

Over Sharding

Over sharding occurs when an index is divided into more primary shards than the workload or cluster size justifies. This often happens when shard count is chosen based on anticipated future growth rather than current cluster capacity.

Too many small shards can lead to:

  • Excessive heap usage due to per shard overhead.
  • Increased query coordination latency.
  • Longer cluster state update times.
  • Higher file descriptor consumption.

In extreme cases, the cluster may struggle under metadata pressure rather than data volume.

Shard Size Considerations

Very small shards waste resources. Very large shards increase recovery time and rebalancing cost. When a node fails, large shards must be copied in full to another node before redundancy is restored.

There is no universal perfect shard size. The appropriate size depends on hardware, workload, and acceptable recovery windows. What matters is balancing operational resilience with resource efficiency.

Shard Count vs Cluster Size

Shard and replica configuration must align with node count. For example, configuring multiple primary shards with replicas on a small cluster can result in allocation constraints.

If the cluster cannot place a replica on a different node from its primary, that replica will remain unassigned. This is one common cause of an index entering a yellow state.

Why Indices Turn Yellow

An index is considered yellow when all primary shards are allocated, but one or more replicas are unassigned. This often occurs when the number of replica copies exceeds what the cluster topology can support.

In small clusters, requesting more replicas than available distinct nodes guarantees unassigned replicas. The cluster is operational, but fault tolerance assumptions no longer match configuration intent.
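
When a replica remains unassigned, the allocation explain API reports the reason directly. A sketch, assuming the logs-2026-03 index used in later examples:

GET _cluster/allocation/explain
{
  "index": "logs-2026-03",
  "shard": 0,
  "primary": false
}

The response describes each node that was considered and why the replica could not be placed there, for example because the node already holds a copy of the same shard.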

Design Requires Intent

Shard configuration should be based on realistic data volume, indexing rate, node count, and recovery requirements. It should not be copied from example configurations or chosen arbitrarily.

The cost of poor shard design is not immediate failure, but gradual inefficiency. As data grows, coordination overhead increases, recovery slows, and resource pressure accumulates.

Shards are the unit of scale in OpenSearch. Designing them correctly is foundational to long term cluster stability.

Side Note: Index Templates and Future Indices

While the primary shard count of an existing index is immutable, index templates provide flexibility for future indices.

An index template defines settings and mappings that are applied when a new index matching a specific pattern is created. If the shard count is changed in the template, the next index created under that pattern will use the new shard configuration.

This is especially useful in systems that write to time based indices, such as daily or monthly log indices. As data volume grows, the shard count for future indices can be increased without reindexing historical data.

Each index remains internally consistent. Older indices retain their original shard configuration, while newer indices reflect updated design decisions. This allows gradual evolution of shard strategy over time.

Templates do not modify existing indices. They influence index creation. The distinction is important. Immutability applies to live indices, not to the policy that defines future ones.

We will dig deeper into mappings and templates in future sections.


Mappings: The Structural Contract

If shards define how data is distributed, mappings define what the data actually becomes once it is indexed. A mapping is not decoration. It is a structural contract between the document you send and the inverted index Lucene builds on disk.

When a document is indexed, OpenSearch does not store raw JSON in a searchable form. Each field is parsed according to its mapping. That mapping determines how values are tokenized, whether they are searchable, whether they can be aggregated, and how they are stored internally.

Field Types Define Behavior

Field type is not simply a label. It determines how Lucene writes the field into its internal structures. A text field is analyzed and tokenized. A keyword field is indexed as a single exact value. A long field is stored in numeric form and can participate in range queries and aggregations.

The same JSON value mapped differently produces entirely different search behavior. Mapping a field as text instead of keyword changes whether exact matches work. Mapping a number as keyword instead of long prevents numeric range queries and efficient aggregations.

Mappings Constrain the Future

Once a field is mapped, its type cannot be changed in place. If a field was incorrectly mapped, the only correction is to create a new index with the correct mapping and reindex the data.

This is why mappings are described as contracts. They determine what operations are possible for the lifetime of the index. Aggregations, sorting, phrase queries, range queries, and exact matches all depend on how fields were defined at creation time.

Example Mapping

Below is a simplified example of an index mapping for operational log data. It demonstrates deliberate type selection rather than relying on dynamic inference.

PUT logs-2026-01
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date"
      },
      "service": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },
      "status_code": {
        "type": "integer"
      },
      "latency_ms": {
        "type": "long"
      },
      "message": {
        "type": "text"
      },
      "client_ip": {
        "type": "ip"
      }
    }
  }
}

In this example, service and host are mapped as keyword because they are used for filtering and aggregation. status_code and latency_ms are numeric types to support range queries and statistical calculations. message is mapped as text to allow full text search.

Each of these choices influences how Lucene builds the inverted index and doc values structures inside each shard. These are not interchangeable definitions. They shape the internal layout of the data.

Dynamic Mapping and Its Risks

OpenSearch can infer mappings automatically when new fields appear. This behavior, known as dynamic mapping, is convenient but can lead to unintended field types. A numeric field accidentally indexed as text or a string inferred as keyword when analysis was expected can constrain future queries.

Dynamic mapping trades control for convenience. In systems where correctness and long term stability matter, explicit mappings are often preferred.
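
Dynamic behavior can be constrained in the mapping itself. The sketch below uses a hypothetical index name; with dynamic set to strict, documents containing unmapped fields are rejected rather than silently widening the schema:

PUT logs-strict-example
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "timestamp": { "type": "date" },
      "message":   { "type": "text" }
    }
  }
}

Setting dynamic to false is a middle ground: unmapped fields remain in _source but are not indexed or searchable.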

Mapping Collision

Mapping collision occurs when the same field name exists across multiple indices but is defined with different field types. This often surfaces in systems that use time based indices, where mappings evolve over time without strict coordination.

Consider a wildcard search across timestamped indices such as logs-*. If earlier indices mapped a field like duration as text, but newer indices map the same field as long, queries that span both indices may behave unpredictably.

Some queries may work as long as they only rely on operations compatible with both field types. But range queries, aggregations, or sorting on that field will eventually fail. The cluster cannot reconcile incompatible field definitions across shards belonging to different indices.

The failure is not random. It reflects a structural inconsistency. Each shard is internally valid, but the cross index query requires a unified field interpretation. When that interpretation does not exist, the query fails.

Mapping collisions are rarely noticed at the moment they are introduced. They appear later, often when historical data is queried alongside newer data. At that point, correction requires reindexing one or more indices to restore type consistency.

In systems that rely on wildcard index patterns, mapping discipline across time is essential. Field names must remain semantically and structurally consistent across all related indices.
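
The field capabilities API is one way to detect this kind of divergence before it breaks a query. A sketch, using the hypothetical duration field from the example above:

GET logs-*/_field_caps?fields=duration

When the field is mapped consistently, the response lists a single type. When indices disagree, each conflicting type is listed along with the indices that use it, which makes the collision visible before an aggregation or range query fails.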

Doc Values and Aggregations

Most non text fields use doc values, a column oriented storage structure that powers sorting and aggregations. If a field is mapped incorrectly or doc values are disabled, aggregation behavior may be limited or memory intensive.

Aggregation capability is not a query layer feature. It is a mapping decision reflected in how data is written to disk.

Design Requires Foresight

Mapping design requires understanding how the data will be queried months or years in the future. Fields that seem minor at ingestion time often become critical for filtering, correlation, or aggregation later.

Poor mapping decisions rarely cause immediate failure. Instead, they surface as query limitations, reindex requirements, or inefficient aggregations long after data volume has grown.

Mappings define capability. Shards define distribution. Together, they determine what your cluster can and cannot do.

In the next section, we will move into index templates and examine how mappings are applied consistently across future indices.


Index Templates

An index template defines settings and mappings that are automatically applied when a new index matching a specific pattern is created. It does not modify existing indices. It influences future ones.

Templates are most commonly used in systems that generate time based indices such as daily or monthly log indices. Instead of manually specifying shard counts and mappings for every new index, a template ensures structural consistency at creation time.

Why Templates Are Useful

In systems that continuously create indices like logs-2026-01-01, logs-2026-01-02, and so on, templates act as a creation policy. They ensure that every new index starts with the intended shard configuration, replica count, and field mappings.

Without templates, operational discipline relies on humans or application code remembering to apply the correct configuration every time. Templates centralize that responsibility inside the cluster.

Example Template

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "refresh_interval": "5s"
    },
    "mappings": {
      "properties": {
        "timestamp": {
          "type": "date"
        },
        "service": {
          "type": "keyword"
        },
        "host": {
          "type": "keyword"
        },
        "status_code": {
          "type": "integer"
        },
        "latency_ms": {
          "type": "long"
        },
        "message": {
          "type": "text"
        }
      }
    }
  }
}

The index_patterns field defines which index names the template applies to. Any new index whose name matches logs-* will inherit the configuration inside the template object.

Inside the template block, there are two primary components:

  • settings — Index level configuration such as shard count, replica count, and refresh interval.
  • mappings — Field definitions that determine how documents are parsed and indexed.

Changing Shard Count for Future Indices

Because shard count is immutable once an index is created, templates provide a controlled way to evolve shard strategy over time. If data volume grows, the template can be updated to increase number_of_shards.

The next index created under the matching pattern will use the new shard count. Existing indices remain unchanged. This is generally safe and predictable. Each index retains internal consistency.

In time based systems, this allows gradual scaling without reindexing historical data.

Changing Mappings for Future Indices

Templates can also modify field mappings for future indices. This is powerful but more dangerous.

If a field such as latency_ms was previously mapped as long and the template is updated to map it as keyword, all newly created indices will use the new definition. Older indices will retain the original mapping.

The result is structural divergence across indices. Wildcard queries that span both generations of indices may encounter the mapping collision behavior discussed in the previous section.

Changing shard count across time based indices is generally operationally safe because shards are isolated per index. Changing field mappings, however, alters the interpretation of data. That can introduce subtle failures during cross index queries.

Templates as Policy, Not Repair

Templates do not retroactively correct mistakes. They define how new indices are created. If an existing mapping is wrong, the correction still requires reindexing.

Used carefully, templates provide structural stability for systems that continuously generate indices. Used carelessly, they can introduce long term inconsistency across time.

Templates are the mechanism that connects mapping design and shard strategy to ongoing operational reality. They are where architectural intent becomes repeatable behavior.

To Be Continued...


Ingesting Data

All of the structure described so far, shards, segments, mappings, and templates, becomes real the moment a document is indexed. Ingestion is the bridge between abstract configuration and physical storage.

To make this concrete, we will use the logs-* template defined in the previous section. That template specifies shard count, replica count, and explicit mappings for fields such as timestamp, service, status_code, latency_ms, and message.

A Simple Index Request

Suppose a client sends the following request:

POST logs-2026-02-14/_doc
{
  "timestamp": "2026-02-14T02:17:33Z",
  "service": "sip-gateway",
  "host": "edge-01",
  "status_code": 503,
  "latency_ms": 184,
  "message": "Upstream trunk timeout",
  "client_ip": "198.51.100.42"
}

At this moment, several structural checks occur before any data is written.

Step 1: Does the Index Exist?

If logs-2026-02-14 does not yet exist, OpenSearch evaluates index templates whose index_patterns match the name. The logs-* template applies.

The cluster creates a new index using the template’s settings: number_of_shards, number_of_replicas, refresh_interval, and explicit mappings. From this point forward, the index’s structural contract is fixed.

This is the only moment when shard count and field mappings can be defined without reindexing.

Step 2: Routing to the Primary Shard

The document ID, either auto generated or explicitly provided, is hashed to determine which primary shard will own the document. This routing function depends on the number_of_shards defined at index creation time.

If the receiving node does not host that primary shard, the request is forwarded internally. Routing is deterministic. The same document ID will always resolve to the same shard unless the index is recreated with a different shard count.
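
Conceptually, the routing calculation can be sketched as:

shard_number = hash(_routing) % number_of_primary_shards

where _routing defaults to the document ID unless a custom routing value is supplied. The real implementation uses a murmur3 hash and additional internal factors, but the consequence is the same: the result depends on the primary shard count, which is why that count cannot change without reindexing.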

Step 3: Mapping Validation and Field Parsing

Once the primary shard receives the document, each field is parsed according to the mapping inherited from the template.

timestamp is parsed as a date and converted into a numeric representation internally.

service and host are indexed as keyword fields. They are not tokenized. The entire value becomes a single term in the inverted index and a single entry in doc values for aggregation.

status_code and latency_ms are stored in numeric form. They are optimized for range queries and statistical aggregation.

message is analyzed as text. It is tokenized into terms such as upstream, trunk, and timeout before being written to the inverted index.

client_ip is stored in a specialized ip field type, enabling both exact filtering and range queries across IP space.

The mapping determines every one of these transformations. This is why mappings are contracts rather than labels.

Step 4: In-Memory Buffer and Translog

After parsing, the document is added to the shard’s in-memory indexing buffer. At the same time, the indexing operation is appended to the transaction log.

The translog ensures durability. If the node crashes before a segment is written, the translog can be replayed to restore recent operations.

At this stage, the document exists in memory and in the translog, but it may not yet be searchable.

Step 5: Replica Synchronization

The primary shard forwards the indexing operation to each replica shard. Replicas perform the same parsing and indexing steps. Once required replicas acknowledge the write, the client receives a success response.

Replication multiplies storage and network cost, but it defines durability and read scalability boundaries.

Step 6: Refresh and Segment Creation

Documents become searchable only after a refresh. A refresh moves buffered documents into a new Lucene segment and makes that segment visible to search operations.

This is why OpenSearch is described as near real time rather than immediate. There is a controlled delay between indexing and search visibility, governed by the refresh_interval setting.

The detailed mechanics of segments, immutability, and background merging were examined earlier in the Segments and Merging section. In the context of ingestion, the key point is that refresh controls search visibility, while merging governs how those segments evolve over time.
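
When immediate visibility is needed, for example while validating an ingest pipeline, a refresh can be forced, and the interval itself can be tuned per index. A sketch against the index used above:

POST logs-2026-02-14/_refresh

PUT logs-2026-02-14/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}

Forcing frequent refreshes creates many small segments and increases merge pressure, so it is a testing convenience rather than an ingestion strategy.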

Putting It Together

A single indexing request triggers a sequence of structural events: template evaluation, index creation, shard routing, mapping driven parsing, in-memory buffering, translog append, replica replication, refresh, and eventual merge.

None of these steps are incidental. Each reflects earlier design decisions. Shard count influences routing. Mapping influences index structures. Template configuration determines the contract of the index before the first document exists.

Ingestion is therefore not a write operation. It is a distributed structural transformation governed by policy defined at index creation time.


Query Execution: Full Text Search

In the previous section, a document was indexed into logs-2026-02-14. Thousands of similar log entries now exist across multiple daily indices matching logs-*. Some represent successful operations. Others represent failures.

We will now search across those indices for log entries containing the phrase trunk timeout.

The Search Request

GET logs-*/_search
{
  "size": 5,
  "query": {
    "match": {
      "message": "trunk timeout"
    }
  }
}

The logs-* pattern expands to all matching indices. Each of those indices contains one or more primary shards, and each shard contains multiple Lucene segments.

Although the cluster may contain thousands of log entries, only a small number contain both trunk and timeout in the message field. The size parameter limits the final result set to the top five matching documents.

Step 1: Coordinating Node

Any node in the cluster can receive the request. The receiving node becomes the coordinating node for this search. It is responsible for routing the query to the appropriate shards and merging results.

Step 2: Shard Resolution

The coordinating node determines which shards belong to indices matching logs-*. The query is then sent to each relevant shard copy, either primary or replica.

Each shard executes the query independently against its own Lucene segments.

Step 3: Query Phase Within a Shard

The match query is analyzed using the same analyzer that processed the message field during indexing. The phrase trunk timeout is tokenized into terms such as trunk and timeout.
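
The analyze API makes this tokenization visible. A sketch against the index from the previous section:

GET logs-2026-02-14/_analyze
{
  "field": "message",
  "text": "trunk timeout"
}

The response lists the tokens the analyzer produces for the message field, which are the same terms the query phase will look up in the inverted index.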

Lucene consults the inverted index within each segment to locate documents containing those terms. Posting lists are traversed, and documents containing both terms receive higher relevance scores.

Each shard computes its own top matching documents based on score. Only the top results from each shard are returned to the coordinating node, not every matching document.

Step 4: Merge Phase

The coordinating node receives partial results from all shards. It merges these results, reorders them by score, and determines the global top five documents.

At this stage, the coordinating node knows which documents should be returned but does not yet have their full _source content.

Step 5: Fetch Phase

The coordinating node requests the full document content for the selected results from the shards that hold them. These documents are retrieved from stored fields and returned in the final response.

The client receives a response containing the top five matching log entries, ordered by relevance score.

Visualizing Distributed Query Execution

The animation below shows three nodes, six monthly indices, primary and replica shard placement, and the coordinating node selecting one shard copy per index before merging results.

What Just Happened

A distributed search request was decomposed into parallel shard level queries. Each shard executed the query locally against its segments. The coordinating node merged partial results and fetched the final documents.

The efficiency of this process depends on decisions made earlier. Shard count affects how many parallel queries are required. Segment count influences how many posting lists must be traversed. Field mappings determine how terms were indexed and how scoring behaves.

Full text search relies on the inverted index and relevance scoring. It is optimized for finding a small number of matching documents within a large corpus.

In the next section, we will examine aggregation queries. Unlike full text search, aggregations are concerned not with retrieving individual documents, but with summarizing patterns across many documents using doc values and column oriented data structures.


Aggregation Querying

Aggregation querying uses the same distributed framework as full text search. A coordinating node receives the request, expands the index pattern, selects one shard copy per index, and executes the request across the cluster.

The mechanics of distribution do not change. What changes is what each shard computes and what the coordinating node merges.

From Retrieval to Computation

Full text search retrieves documents ranked by relevance score. Aggregation querying derives structure from documents. It transforms a large set of matching documents into numeric summaries, buckets, or statistical metrics.

The system is no longer asking, “Which documents are most relevant?” It is asking, “What patterns emerge from this filtered set?”

Example: Counting Timeouts by Host

Assume thousands of log entries exist across six monthly indices matching logs-*. We want to compute how many trunk timeout errors occurred per host.

GET logs-*/_search
{
  "size": 0,
  "query": {
    "match": {
      "message": "trunk timeout"
    }
  },
  "aggs": {
    "timeouts_by_host": {
      "terms": {
        "field": "host"
      }
    }
  }
}

The size parameter is set to 0 because we are not retrieving documents. The query still executes first, filtering the dataset. The aggregation operates only on documents that match the query.

Step 1: Query Phase (Filtering)

Each selected shard copy executes the match query locally using its inverted index. The inverted index identifies document IDs whose message field contains trunk and timeout.

At this stage, the shard produces a set of matching document IDs. No aggregation has occurred yet. The aggregation phase depends entirely on the filtered result set.

Step 2: Doc Values and Bucket Construction

For each matching document ID, the shard reads the value of the host field from doc values. Unlike the inverted index, which maps terms to documents, doc values store field data in a column oriented structure optimized for aggregation and sorting.

Because host was mapped as a keyword field in our template, it was indexed with doc values enabled. This design decision made at index creation time now determines whether this aggregation is efficient.

The shard iterates through matching document IDs and constructs a local bucket map:

{
  "edge-01": 12,
  "edge-02": 5,
  "edge-03": 9
}

This map exists only within the shard. It represents shard local counts for that index partition.

Memory and Cardinality

Terms aggregations allocate memory to track unique bucket keys. If the host field has low cardinality, for example a small number of routers, the bucket map remains small.

If the field has high cardinality, such as unique session IDs or IP addresses, bucket memory usage grows quickly. Aggregation cost becomes a function of unique values rather than total documents.

This is why mapping discipline and aggregation intent must align. A field designed for grouping should have predictable cardinality.

Step 3: Shard Level Partial Results

After computing local buckets, each shard returns partial aggregation results to the coordinating node. These are not documents. They are serialized summaries.

Shard A:
  edge-01 : 12
  edge-02 : 5

Shard B:
  edge-01 : 7
  edge-03 : 9

Each shard operates independently. It has no knowledge of counts in other shards.

Step 4: Reduce Phase at the Coordinator

The coordinating node merges shard level bucket maps into a single global result. For identical keys, counts are summed.

Global:
  edge-01 : 19
  edge-02 : 5
  edge-03 : 9

This reduction step is analogous to a distributed map-reduce operation. Each shard performs the map phase locally. The coordinating node performs the reduce phase globally.

Sorting and Top Buckets

Terms aggregations typically return only the top buckets by count. This means shards may return more buckets than ultimately appear in the final result, depending on shard size and ordering rules.

The coordinator must merge and sort combined counts before truncating to the requested size. This is another distinction from full text search, where each shard returns Top-N documents by score.

Structural Comparison to Full Text Search

In full text search, shards return ranked document lists that the coordinator merges and re-ranks.

In aggregation querying, shards return computed summaries that the coordinator merges and reduces.

The distribution layer is identical. The data structures and merging logic are fundamentally different.

Aggregations shift the system from retrieval to analysis. They are made possible not only by the inverted index, but by doc values, field mappings, and memory allocation strategies defined during ingestion.

At this point, the read path is complete. The cluster can filter, retrieve, rank, compute, and reduce across distributed shards. The remaining question is not capability, but cost.

Caveat: Aggregation Accuracy and Distributed Tradeoffs

Terms aggregations in a distributed system are not always perfectly exhaustive. Each shard computes buckets locally and returns only its top buckets to the coordinating node. The coordinator can merge only what it receives.

By default, shards do not return every unique bucket they encounter. They return a limited set based on internal shard_size calculations. This improves performance and reduces memory pressure, but it introduces bounded approximation.

If a bucket value has moderate counts spread evenly across many shards, but is never large enough to appear in the top results of any single shard, it may be underrepresented or absent in the final reduced result.

OpenSearch exposes this tradeoff through fields such as doc_count_error_upper_bound and sum_other_doc_count. These values indicate that the result set may not represent a fully exhaustive global histogram.

The likelihood of approximation increases with high cardinality fields, uneven data distribution across shards, and small size parameters. Increasing shard_size can improve accuracy, but at the cost of additional memory usage and network transfer.
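
Both knobs are exposed directly on the terms aggregation. A sketch extending the earlier query; the values shown are illustrative, not recommendations:

GET logs-*/_search
{
  "size": 0,
  "query": {
    "match": {
      "message": "trunk timeout"
    }
  },
  "aggs": {
    "timeouts_by_host": {
      "terms": {
        "field": "host",
        "size": 10,
        "shard_size": 100,
        "show_term_doc_count_error": true
      }
    }
  }
}

With show_term_doc_count_error enabled, each returned bucket includes an upper bound on how much its count could be off, which turns the approximation from an invisible risk into a measurable one.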

This caveat connects directly to index and mapping design. Fields intended for aggregation should have predictable cardinality and be mapped appropriately as keyword or numeric types with doc values enabled. Aggregating on fields with extremely high cardinality, such as unique session identifiers, can introduce both performance pressure and approximation effects.

Aggregation querying therefore reflects a broader distributed systems principle: perfect exhaustiveness and optimal performance rarely coexist at scale. Design decisions made during ingestion determine whether aggregation behavior remains predictable or becomes costly.

This bounded approximation should not be confused with structural mapping issues. In cases of mapping collision across time based indices, discussed earlier in the Mapping Collision section, the cluster may produce inconsistent behavior or fail entirely because field types cannot be reconciled across shards. Aggregation approximation is a performance tradeoff. Mapping collision is a schema inconsistency. Both surface during wildcard queries across indices, but they arise from fundamentally different causes.


Cluster State and Shard Allocation

In earlier chapters, we defined what a shard is, how primaries and replicas function, how distributed queries execute, and how shard sizing decisions influence performance and recovery cost.

Now the system is running. Our logs-2026-* indices exist. Each index has one primary shard and one replica shard distributed across three nodes. Documents are flowing in. Queries are executing.

The question is no longer how shards are structured. The question is how they are maintained.

Observing Cluster State

The fastest way to inspect cluster health is the cat health API.

GET _cat/health?v

Example output:

epoch      timestamp cluster      status node.total node.data shards pri relo init unassign
1700000000 12:30:12  logs-cluster green  3          3         12     6   0    0    0

A green cluster indicates that all primary and replica shards across logs-2026-01 through logs-2026-06 are allocated.

Inspecting Shard Placement

To see where shards physically reside:

GET _cat/shards/logs-2026-*?v

Example output:

index         shard prirep state   docs store node
logs-2026-01   0     p      STARTED 12450 32mb  node-a
logs-2026-01   0     r      STARTED 12450 32mb  node-b
logs-2026-02   0     p      STARTED 13820 36mb  node-b
logs-2026-02   0     r      STARTED 13820 36mb  node-c
logs-2026-03   0     p      STARTED 14210 38mb  node-c
logs-2026-03   0     r      STARTED 14210 38mb  node-a

This reflects the routing table described earlier. Each primary and replica shard resides on different nodes.

Yellow State: Replica Unassigned

Suppose node-c goes offline. The cluster may enter a yellow state if replicas cannot be reassigned.

GET _cat/health?v
epoch      timestamp cluster status node.total node.data shards pri relo init unassign
1700000100 12:35:40  logs-cluster yellow         2         2     12  6   0    0    3

The cluster is yellow because primary shards remain available, but replica shards are unassigned.

We can see exactly which shards are affected:

GET _cat/shards/logs-2026-*?v
index         shard prirep state      docs store node
logs-2026-03   0     r      UNASSIGNED

Red State: Primary Unavailable

If a node holding a primary shard fails and no in-sync replica exists, the cluster enters red state.

GET _cat/health?v
epoch      timestamp cluster status node.total node.data shards pri relo init unassign
1700000200 12:41:10  logs-cluster red            2         2     12  6   0    0    1

In red state, one or more primary shards are unassigned. Data in those shards is unavailable for search.

Replica Promotion

If a primary shard fails but an in-sync replica exists, the cluster promotes the replica automatically.

The routing table updates, and the replica becomes the new primary. A new replica is then allocated when resources allow.

During this window, cluster state is updated and published to all nodes.

Relocation and Rebalancing

When a new node joins the cluster, shard relocation may occur. This can be observed directly:

GET _cat/shards/logs-2026-*?v
index         shard prirep state      docs store node
logs-2026-04   0     r      RELOCATING 15600 41mb node-b -> node-d

Relocation copies segment files from source to destination. Shard size directly determines how long this process takes.

Allocation Constraints

If disk watermarks are exceeded, shard allocation may be blocked.

GET _cat/allocation?v
node   shards disk.indices disk.used disk.avail disk.percent
node-a  4      150mb        920mb     80mb       92

Nodes exceeding high disk watermark thresholds may prevent new shard assignments until disk pressure is resolved.
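
The watermark thresholds are ordinary cluster settings and can be inspected or adjusted. A sketch; the values shown are the common defaults, not recommendations:

GET _cluster/settings?include_defaults=true&flat_settings=true

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

Crossing the high watermark causes shards to be moved away from the node; crossing the flood stage marks affected indices read only until disk pressure is resolved.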

Structural Consequences

Shards are not only units of scale. They are units of movement, promotion, relocation, and recovery.

Shard count influences cluster state size. Shard size influences recovery time. Replica count defines fault tolerance boundaries.

Design decisions made during ingestion and index creation determine how the system behaves when failure occurs.

Indexing and querying define capability. Cluster state and shard allocation define resilience.


Conclusion: Structural Foundation

At this point, we have walked the system from the inside out.

We began with Lucene, the library that builds the inverted index. From there, we examined segments, immutability, and merging. We traced how a shard is not an abstraction but a physical directory of segment files, translogs, and shard state metadata. We expanded that into distributed shards, primaries and replicas, routing, and cluster coordination.

We defined mappings as structural contracts rather than labels. We examined how field types influence tokenization, doc values, aggregation capability, and long term query flexibility. We explored index templates as policy mechanisms that govern future indices, especially in time based systems.

We then followed the full ingestion lifecycle. A document triggered template resolution, shard routing, mapping validation, translog durability, replica synchronization, refresh, and segment creation. Nothing in that path was incidental. Each step reflected earlier design decisions.

On the read side, we executed full text search and observed how shards return scored partial results that are merged by a coordinating node. We executed aggregation queries and traced how doc values power bucket construction, how shards compute local summaries, and how the coordinating node performs the global reduce phase. We examined bounded approximation behavior in distributed aggregations and tied those tradeoffs back to mapping discipline and index design.

Finally, we examined cluster state and shard allocation. We observed green, yellow, and red states. We inspected shard placement using cat APIs. We watched replica promotion, relocation, disk watermarks, and allocation constraints shape how the cluster behaves under pressure. Shards are not only units of scale. They are units of movement, recovery, and resilience.

What exists now is a structural model of OpenSearch: the write path, the read path, the storage model, the distribution model, the schema discipline model, and the failure model. That is the foundation.

As this site evolves, OpenSearch will evolve with it. Future sections will move deeper into cluster sizing strategy, durability boundaries, performance tuning, memory behavior, and operational scaling. The intent is not to freeze this into static documentation, but to expand it alongside real systems and real workloads.

For a concrete application of everything covered here, continue to the Nginx Logs project: Nginx Logs

This chapter establishes the structural language. The next systems apply it.