How to Scale a Distributed SQL Database to 1M+ Inserts Per Sec

gowebing 2024-12-02

stack-of-pancakes-with-fruit

There are a number of well-known experiments where eventually-consistent NoSQL databases were scaled out to perform millions of inserts and queries. Here, we do the same using YSQL, YugabyteDB’sPostgreSQL-compatible, strongly-consistent, distributed SQL API.

We created a 100-node YugabyteDB cluster, ran single-row INSERT and SELECT workloads with high concurrency – each for an hour and measured the sustained performance (throughput and latency). This post details the results of this experiment as well as highlights the key aspects of the YugabyteDB architecture that makes it fit for such high-volume ingest workloads.

You may also like: How Data Sharding Works in a Distributed SQL Database.

Database Cluster Setup

While YugabyteDB can be deployed across multiple zones or regions, this benchmark was focussed on the aggregate performance of a 100 node cluster. Therefore, all the 100 nodes were deployed in the AWS cloud, Oregon (us-west-2) region and in a single-zone (us-west-2a). Each of the instances were of type c5.4xlarge (16 vCPUs). This information is summarized below.

Cluster name: MillionOps.
Cloud: Amazon Web Services.
Region: Oregon (us-west-2).
Zone: us-west-2a.
Number of nodes: 100.
Instance type: c5.4xlarge (16 vCPUs).
Disk on each node: 1TB EBS SSD (gp2).
Replication Factor: 3.
Consistency level: Strong consistency for both writes and reads.

Initial performance dashboard

Initial performance dashboard

Benchmark Setup

The benchmark application was an open-source Java program. The application’s database workload simply does multi-threaded, single-row INSERTs and SELECTs against a table that has a key and a value column. The size of each row was 64 bytes. The insert and select benchmarks were run for one hour each in order to measure the sustained throughput and latency.

This benchmark application was run on 6 instances of 8 cores each. Note that we could not consolidate into a fewer number of more powerful instances since we were hitting the maximum network bandwidth on the network instances. Each of these benchmark instances were prepared as shown below.

Java 8 was installed using the following commands.

$ sudo apt update
$ sudo apt install default-jre

The benchmark program was downloaded on to these machines as shown below.

$ wget -P target https://github.com/YugaByte/yb-sample-apps/releases/download/v1.2.0/yb-sample-apps.jar

The benchmark program is capable of taking a list of servers that comprise the database cluster, and performing random operations across these servers. In order to do this, we set up an environment variable with the list of comma-separated host:port entries of the 100 database servers as shown below.

$ export YSQL_NODES=node-1-ip-addr:5433,node-2-ip-addr:5433,...

Benchmarking the INSERT Workload

The first step was to run an INSERT benchmark on this 100-node cluster. The following command was run on each of the benchmark instances.

java -jar ~/yb-sample-apps-no-table-drop.jar    \
                   --workload SqlInserts        \
                   --nodes $YSQL_NODES          \
                   --num_unique_keys 5000000000 \
                   --num_threads_write 400      \
                   --num_threads_read 0         \
                   --uuid 00000000-0000-0000-0000-00000000000n

The table on which the benchmark was run had the following simple schema.

CREATE TABLE table_name (k varchar PRIMARY KEY, v varchar);

This workload performed a number of INSERTs using prepared statements, as shown below.

INSERT INTO table_name (k, v) VALUES (?, ?);

Note a few points about the benchmark setup.

Each benchmark program writes unique set of rows. The uuid parameter forms a prefix of the row key. It is set differently (by varying the value of n from 1 to 6) on each benchmark instance to ensure it writes separate keys.
A total of 30 billion unique rows will be inserted upon completion. Each benchmark program proceeds to write out 5 billion keys, and there are 6 such programs running in parallel.
There are 2,400 concurrent clients performing inserts. Each benchmark program uses 400 write threads, and there are six such programs running concurrently.

Performance specs

Initial performance dashboard

A screenshot of the write throughput on this cluster while the benchmark was in progress is shown below. The write throughput was 1.26 million inserts/sec.

Total YSQL OPs/Sec

Total YSQL OPs/Sec

The corresponding average insert latency across all the 100 nodes was 1.66ms, as shown below. Note that each insert is replicated three-ways to make the cluster fault tolerant.

PSQL Op Latency

PSQL Op Latency

The average CPU usage across the nodes in the cluster was about 78%, as shown in the graph below.

CPU Usage

CPU Usage

Benchmarking the SELECT Workload

The following command was run for the SELECT workload.

java -jar ~/yb-sample-apps-no-table-drop.jar \
               --workload SqlInserts         \
               --nodes $YSQL_NODES           \
               --max_written_key 500000000   \
               --num_writes 0                \
               --num_reads 50000000000       \
               --num_threads_write 0         \
               --num_threads_read 400        \
               --read_only                   \
               --uuid 00000000-0000-0000-0000-00000000000n

The SELECT workload would lookup random rows on the table that the INSERT workload (described in the previous section) populated. Each SELECT query was done using prepared statements, as shown below.

SELECT * FROM table_name WHERE k=?;

The main points worth noting in this workload are:

There are 2,400 concurrent clients issuing selects. Each benchmark program uses 400 threads, and there are six programs running in parallel.
Each read operation randomly selects one row from a total of 3 billion rows. Each benchmark program randomly queries one row from a total of 500 million rows, and there are six concurrent programs.

Query performance specs

Query performance specs

Once again, a screenshot of the read throughput on this cluster while the benchmark was in progress is shown below. The read throughput was 2.8 million selects/sec. YugabyteDB reads are strongly consistent by default and that is the setting used for this benchmark. Additional throughput can be achieved by simply allowing timeline-consistent reads from follower replicas (see final section of this post).

Total YSQL Ops/sec

Total YSQL Ops/sec

The corresponding average select latency across all the 100 nodes was 0.56ms, as shown below.

YSQL Op Latency

YSQL Op Latency

The average CPU usage across the nodes in the cluster was about 64%, as shown in the graph below.

CPU usage

CPU Usage

Architecture for Horizontal Write Scaling

The architecture of a YugabyteDB cluster is shown in the figure below. The YB-TServer service is responsible for managing the data in the cluster, while the YB-Master service manages the system configuration of the cluster. YB-TServer automatically shards every table into a number of shards (aka tablets).

Given the Replication Factor (RF) of three for the cluster, each tablet is represented as a Raft group of three replicas with one replica considered the leader and other two replicas considered as followers. In a 100-node cluster, each of these three replicas are automatically stored on exactly three (out of 100) different nodes where each node can be thought of as representing an independent fault domain.

YB-Master automatically balances the total number of leader and follower replicas on all the nodes so that no single node becomes a bottleneck and every node contributes its fair share to incoming client requests. The end result is strong write consistency (by ensuring writes are committed at a majority of replicas) and tunable read consistency (by serving strong reads from leaders and timeline-consistent reads from followers) irrespective of the number of nodes in the cluster.

Data architecture

Data architecture

To those new to Raft, the simplest explanation is that it is a protocol with which a cluster of nodes can agree on values. It is arguably the most popular distributed consensus protocol in use today. Business-critical cloud-native systems like etcd (the configuration store for Kubernetes) and consul (Hashicorp’s popular service discovery solution) are built on Raft as a foundation.