FortiSIEM Reference Architecture Using ClickHouse

ClickHouse Database Resilience

A simplified explanation of these processes and how they relate to resilience follows. For a more in-depth explanation, consult the FortiSIEM online help and the ClickHouse documentation.

The basic components of the ClickHouse database integrated into FortiSIEM are shards, replicas, and the ClickHouse Keeper process. In summary: shards increase performance; replicas increase data resilience; and the Keeper process orchestrates the database.

Sharding is a database technique that increases performance by splitting the data into ‘pieces’ and storing each piece on a different node. The data can be written and queried in parallel, significantly improving read and write performance. Generally speaking, the more shards a system has, the higher its performance. That said, the combination of FortiSIEM and ClickHouse is very efficient and can scale to high EPS with relatively few shards.
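
FortiSIEM provisions and manages the ClickHouse schema itself, so the following is only an illustrative sketch of the general ClickHouse sharding pattern: writes go through a Distributed table that routes each row to a shard by a sharding key, and queries fan out to all shards in parallel. The cluster name fsiem_cluster, the table names, and the columns are hypothetical.

    from clickhouse_driver import Client

    # Connects to the local ClickHouse server on the native protocol port.
    client = Client('localhost')

    # Per-shard local table, created on every node of the (hypothetical)
    # 'fsiem_cluster' cluster.
    client.execute("""
        CREATE TABLE IF NOT EXISTS events_local ON CLUSTER fsiem_cluster (
            event_time   DateTime,
            reporting_ip String,
            raw_msg      String
        ) ENGINE = MergeTree
        ORDER BY event_time
    """)

    # Distributed table: each INSERT is routed to one shard by the hash of
    # reporting_ip, and each SELECT queries all shards in parallel.
    client.execute("""
        CREATE TABLE IF NOT EXISTS events ON CLUSTER fsiem_cluster
        AS events_local
        ENGINE = Distributed(fsiem_cluster, default, events_local,
                             cityHash64(reporting_ip))
    """)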

Diagram - Simplified representation of shards and replicas

Shards alone do not significantly increase resilience. If a shard contains a single FortiSIEM Worker and that node fails, then the data stored in that shard is lost. Replicas can be used within a shard to increase resilience. Replicas are copies of the data within a shard, and each replica is stored on a separate Worker node. If a Worker node in a shard fails and another Worker hosts a replica, the system will continue to function, but at a reduced level of resilience. In most parts of this document, we have assumed that each shard has two replicas, but more than two can be used for increased data resilience if required.
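
As before, this is only a hedged illustration (FortiSIEM creates the real schema): a replicated variant of the per-shard table uses the ReplicatedMergeTree engine, and ClickHouse then copies every write to the other replica(s) in the same shard. The {shard} and {replica} macros are expanded from each node's own configuration; the Keeper path and table name here are hypothetical.

    from clickhouse_driver import Client

    client = Client('localhost')

    # Replicated variant of the per-shard table from the previous sketch.
    # All replicas of one shard share the same Keeper path; each registers
    # under its own {replica} name and syncs writes from the others.
    client.execute("""
        CREATE TABLE IF NOT EXISTS events_replicated ON CLUSTER fsiem_cluster (
            event_time   DateTime,
            reporting_ip String,
            raw_msg      String
        ) ENGINE = ReplicatedMergeTree(
            '/clickhouse/tables/{shard}/events_replicated',
            '{replica}'
        )
        ORDER BY event_time
    """)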

Databases must maintain their integrity to ensure the data is valid. The Keeper process is responsible for orchestrating database activity across multiple shards to maintain integrity within the cluster. All FortiSIEM deployments using ClickHouse have a Keeper process; the architectural decision is where to host it (a configuration sketch follows the list below):

  • Single node deployments must host the Keeper process on the single Supervisor node

  • Two-node deployments should host the Keeper process on the Worker node

  • Larger deployments should consider separate, dedicated node(s) to host the Keeper process
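
The Keeper ensemble is defined per node in the ClickHouse Keeper configuration. FortiSIEM writes that configuration itself when Keeper roles are assigned, so the sketch below only illustrates the shape of a three-node raft_configuration, assuming the default ports (9181 for clients, 9234 for Raft) and made-up hostnames.

    # Hypothetical three-node Keeper ensemble; FortiSIEM generates the real
    # configuration, so this only renders the shape of the XML fragment.
    KEEPER_NODES = ['keeper1.example.net', 'keeper2.example.net', 'keeper3.example.net']

    def keeper_config(server_id: int) -> str:
        """Render the <keeper_server> XML fragment for one ensemble member."""
        servers = '\n'.join(
            f'    <server><id>{i}</id><hostname>{host}</hostname>'
            f'<port>9234</port></server>'  # Raft (inter-Keeper) port
            for i, host in enumerate(KEEPER_NODES, start=1)
        )
        return (
            '<keeper_server>\n'
            '  <tcp_port>9181</tcp_port>\n'            # client port
            f'  <server_id>{server_id}</server_id>\n'  # this node's id
            '  <raft_configuration>\n'
            f'{servers}\n'
            '  </raft_configuration>\n'
            '</keeper_server>'
        )

    print(keeper_config(server_id=1))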

The Keeper process is essential to proper operation. If the node(s) running the Keeper process are lost, or if the Keeper cluster loses quorum, the database becomes read-only and unable to ingest new events, which is a catastrophic issue for a SIEM. Hosting the Keeper process on dedicated node(s) reduces the chance of a Keeper node becoming unavailable, as it is less likely to be affected by other application processes.
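
As a hedged illustration of monitoring this dependency: ClickHouse Keeper answers ZooKeeper-style four-letter commands on its client port, so a minimal liveness probe can send ruok and expect imok back. This sketch assumes the default client port 9181 and that ruok is enabled (it is in the default four-letter-command whitelist).

    import socket

    def keeper_is_ok(host: str = 'localhost', port: int = 9181) -> bool:
        """Send the 'ruok' four-letter command; a healthy Keeper replies 'imok'."""
        try:
            with socket.create_connection((host, port), timeout=3) as sock:
                sock.sendall(b'ruok')
                return sock.recv(4) == b'imok'
        except OSError:
            return False

    if __name__ == '__main__':
        print('Keeper healthy:', keeper_is_ok())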

The Keeper process can run on either a single node or a three-node cluster. Running it on two nodes does not provide the expected resilience: the loss of one node in a two-node cluster results in a loss of quorum, which places the database in a read-only state. Running the Keeper process on more than three nodes is not recommended, as doing so increases the overhead of the Keeper process and can affect system performance.
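
To illustrate the read-only failure mode described above, one hedged way to detect it: when quorum is lost, affected replicated tables are flagged in the system.replicas system table, so checking for is_readonly = 1 reveals the condition.

    from clickhouse_driver import Client

    # Any replicated table flagged read-only suggests Keeper trouble, such
    # as loss of quorum; during that window INSERTs into those tables fail.
    client = Client('localhost')
    rows = client.execute(
        "SELECT database, table FROM system.replicas WHERE is_readonly = 1"
    )
    if rows:
        print('Read-only replicated tables (possible Keeper quorum loss):')
        for database, table in rows:
            print(f'  {database}.{table}')
    else:
        print('No read-only replicas detected.')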
