Microservices and Distributed Transactions: Beyond the "Happy Path"

The previous article in this series presented the LIXA project and the XTA programming model to show how two-phase commit distributed transactions can be implemented in polyglot distributed transactional systems. This second part provides some answers to the question: "How do distributed transactional systems behave in the event of failure?" More specifically, four different failure scenarios are explained with practical examples.

Introduction

A critical aspect of the two-phase commit protocol is the concept of a "prepared transaction," which is sometimes used as a synonym for "recovery pending transaction."

A distributed transaction managed by the XA protocol can be in one of the following three states:

  1. "Non-prepared": It's a temporary intermediate state; any change can be lost.
  2. "Prepared": The change has been persisted in a durable way, but it has not been confirmed and data is not visible to other transactions.
  3. "Confirmed": The change has been persisted and data is visible to other transactions.

From a consistency point of view, in the event of a failure, there are two really interesting conditions:

  1. The failure happens during the "non-prepared" state.
  2. The failure happens during the "prepared" state.

Failures that happen when a distributed transaction has already been confirmed are not interesting, because the durability of the data (the "D" in ACID) is natively guaranteed by the Resource Manager, just as it is for local transactions.
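The three states and their behavior after a crash can be summarized in a small model. This is only an illustrative sketch: the names TxState and after_crash are hypothetical and are not part of LIXA or XTA.

```python
from enum import Enum


class TxState(Enum):
    """States of a transaction branch as described above (illustrative names)."""
    NON_PREPARED = "non-prepared"
    PREPARED = "prepared"
    CONFIRMED = "confirmed"


def after_crash(state: TxState) -> str:
    """Describe what the Resource Manager holds after a crash in the given state."""
    if state is TxState.NON_PREPARED:
        # Nothing durable yet: the change is simply lost.
        return "change lost; the Resource Manager holds nothing to recover"
    if state is TxState.PREPARED:
        # Durable but undecided: it must still be committed or rolled back.
        return "recovery pending; must be committed or rolled back"
    # Confirmed: durability is guaranteed by the Resource Manager itself.
    return "change durable and visible"


for state in TxState:
    print(f"{state.value}: {after_crash(state)}")
```

Only the first two states require any action after a failure, which is why the scenarios in this article focus on them.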

It must be noted that the sources of truth are the Resource Managers (MySQL and PostgreSQL in our example). One consequence of the CAP theorem is that, in a distributed system, there is no guarantee that two components share the same understanding of the transaction state. The following example describes a typical "split-brain" condition:

  1. The Transaction Manager sends a "prepare" message to a Resource Manager.
  2. The Resource Manager prepares the transaction and replies to the Transaction Manager.
  3. The network between the Transaction Manager and the Resource Manager disconnects (the "P", partition, of the CAP theorem) and the reply message is lost.
  4. The Transaction Manager, having received no reply, aborts the transaction because of the network partition.

In such situations, the Resource Manager and the Transaction Manager have two different "points of view" about the real state of the transaction.
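The lost-reply sequence above can be reproduced with a toy simulation. It is a minimal sketch under stated assumptions: ResourceManager, LossyNetwork, and transaction_manager_prepare are hypothetical names invented for this illustration, and they do not model the real LIXA or XA interfaces.

```python
class ResourceManager:
    """Toy Resource Manager that remembers prepared transactions."""

    def __init__(self):
        self.prepared = set()

    def prepare(self, xid: str) -> str:
        self.prepared.add(xid)
        return "PREPARED"


class LossyNetwork:
    """Delivers the request but drops the reply, simulating a partition."""

    def call(self, func, *args):
        func(*args)  # the request reaches the Resource Manager...
        raise TimeoutError("reply lost")  # ...but the answer never comes back


def transaction_manager_prepare(network, rm, xid: str) -> str:
    """Return the Transaction Manager's view of the branch after 'prepare'."""
    try:
        network.call(rm.prepare, xid)
        return "prepared"
    except TimeoutError:
        # No answer arrived: the Transaction Manager aborts the transaction.
        return "aborted"


rm = ResourceManager()
tm_view = transaction_manager_prepare(LossyNetwork(), rm, "xid-1")
rm_view = "prepared" if "xid-1" in rm.prepared else "unknown"
print(f"Transaction Manager view: {tm_view}")
print(f"Resource Manager view:    {rm_view}")
```

The two views disagree, and only a later recovery step that asks the Resource Manager can reconcile them.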

This article explains how LIXA and XTA manage these types of issues.

Architecture

The same architecture presented in Part I of this series will be used:

Architecture diagram

The full design is composed of five containers: lixad (the LIXA state server), mysql, postgres, the REST server, and the REST client.

Environment Set Up

The same tools presented in Part I of this series will be used: Git, Docker, and five terminal sessions to interact with the containers and to capture the output.

Activate a LIXA State Server

Start the lixad container:

$ docker run --rm -p 2345:2345 --name=lixad -d lixa/lixad
e21ab3f6093081ac4a84b1edaedac907de0b22f1c088253fd0e0096ea8018f0d


Catch its logs:

$ docker logs -f lixad
2019-05-11 15:44:54.583836 [1/140578614647744] NOTICE: LXD000N this process is starting a new LIXA server (lixa package version is 1.7.8)
2019-05-11 15:44:54.584071 [1/140578614647744] INFO: LXD027I parameter 'pid_file' set to value '/opt/lixa/var/run.pid'
2019-05-11 15:44:54.584083 [1/140578614647744] INFO: LXD026I parameter 'min_elapsed_sync_time' set to value 0
2019-05-11 15:44:54.584086 [1/140578614647744] INFO: LXD026I parameter 'max_elapsed_sync_time' set to value 0


Note that the syslog messages are sent to a Docker log.

Activate a MySQL Instance

Start the mysql container and connect to it:

$ docker run --rm -e MYSQL_ROOT_PASSWORD=mysecretpw -p 3306:3306 --name=mysql -d lixa/mysql
85978ec9602e2d5c384aa84012ee506175de964f68dc0f07b4a8b72ddd540093
$ docker exec -ti mysql bash
root@85978ec9602e:/#


Wait a few seconds to allow the MySQL instance to initialize, then connect to the database:

root@85978ec9602e:/# mysql -u lixa -h 192.168.123.35 -D lixa --password=passw0rd
mysql: [Warning] Using a password on the command line interface can be insecure.
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.26 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>


Activate a PostgreSQL Instance

Start the postgres container and connect to it:

$ docker run --rm -e POSTGRES_PASSWORD=lixa -p 5432:5432 --name=postgres -d lixa/postgres -c 'max_prepared_transactions=10'
61a84ea7d7a9642bc007758d5b8fd72cca1c50434065da8270f0c12f1a0f53a5
$ docker exec -ti postgres bash
root@61a84ea7d7a9:/#


Wait a few seconds to allow the PostgreSQL instance to initialize, then connect to the database:

root@61a84ea7d7a9:/# psql -U lixa
psql (11.2 (Debian 11.2-1.pgdg90+1))
Type "help" for help.

lixa=>


Build the REST Server

The REST server uses the lixa/xta-jdk-maven base image, but the standard one available on Docker Hub can't be used because the "--enable-crash" config option is necessary to produce system crashes. Building a local customized version is quite easy: just clone the Git repository and enter the xta-java directory:

$ git clone https://github.com/tiian/lixa-docker.git
$ cd lixa-docker/xta-java


Then rebuild the xta-jdk-maven image with the --enable-crash option:

$ docker build -f Dockerfile-jdk-maven --build-arg CONFIG_OPTIONS="--enable-crash" -t lixa/xta-jdk-maven .


Finally, build the REST server:

$ cd ../examples/PythonJavaREST/
$ docker build -f Dockerfile-server -t rest-server .


Build the REST Client

The REST client uses the lixa/xta-python3 base image, but the standard one available on Docker Hub can't be used because the "--enable-crash" config option is necessary to produce system crashes. Building a local customized version is quite easy. Supposing you have already cloned the Git repository in the previous step, enter the xta-python directory from the root of the repository (lixa-docker) and rebuild the xta-python3 image with the --enable-crash option:

$ cd xta-python
$ docker build -f Dockerfile-python3 --build-arg CONFIG_OPTIONS="--enable-crash" -t lixa/xta-python3 .


Then build the REST client:

$ cd ../examples/PythonJavaREST
$ docker build -f Dockerfile-client -t rest-client .


Check Your Environment

If all the above steps worked for you, you should have three running containers:

$ docker ps | grep lixa
e21ab3f60930        lixa/lixad          "/home/lixa/lixad-en…"   30 minutes ago      Up 30 minutes       0.0.0.0:2345->2345/tcp              lixad
61a84ea7d7a9        lixa/postgres       "docker-entrypoint.s…"   44 minutes ago      Up 44 minutes       0.0.0.0:5432->5432/tcp              postgres
85978ec9602e        lixa/mysql          "docker-entrypoint.s…"   44 minutes ago      Up 44 minutes       0.0.0.0:3306->3306/tcp, 33060/tcp   mysql


And a couple of available images:

$ docker images | grep rest
rest-server          latest              363b2d904621        19 hours ago        725MB
rest-client          latest              e3dbab970d5c        19 hours ago        628MB


Failure Scenarios

This article describes four possible failure scenarios: the first two are related to a crash of the REST server and the last two are related to a crash of the REST client.

Server Crashes in "Non-Prepared" State

In this scenario, the transaction branch managed by the REST server will crash before the transaction branch has been prepared. To obtain a crash immediately before prepare, the environment variable LIXA_CRASH_POINT must be set to a value of 12.

Check Data Tables

The "authors" table must be empty both in MySQL:

mysql> select * from authors;
Empty set (0.00 sec)

mysql>


And in PostgreSQL:

lixa=> select * from authors;
 id | last_name | first_name 
----+-----------+------------
(0 rows)

lixa=>


Start the REST Server

Start the REST server with the environment variable set to LIXA_CRASH_POINT=12:

$ docker run -ti --rm -e MAVEN_OPTS="-Djava.library.path=/opt/lixa/lib" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e PQSERVER="192.168.123.35" -e LIXA_CRASH_POINT=12 -p 18080:8080 rest-server

[...]

May 11, 2019 4:24:49 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8080]
May 11, 2019 4:24:49 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Jersey app started with WADL available at http://0.0.0.0:8080/xta/application.wadl
Hit enter to stop it...


Start the REST Client

Start the REST client by specifying the "--no-delete" option (the table is empty and the transaction will crash during an SQL INSERT statement):

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" rest-client --no-delete
2019-05-11 16:37:45.361956 [1/140715375716096] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
2019-05-11 16:37:45.363595 [1/140715375716096] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-11 16:37:45.364112 [1/140715375716096] NOTICE: LXC037N recovery pending transaction with xid='1279875137.b129ed360e94455bbd45d63ecd67e2cd.e0130ac8303b5a9fc8383c6b92286a09'
2019-05-11 16:37:45.364128 [1/140715375716096] NOTICE: LXC038N transaction with xid='1279875137.b129ed360e94455bbd45d63ecd67e2cd.e0130ac8303b5a9fc8383c6b92286a09' must be rolled back
2019-05-11 16:37:45.364139 [1/140715375716096] INFO: LXC039I transaction with xid='1279875137.b129ed360e94455bbd45d63ecd67e2cd.e0130ac8303b5a9fc8383c6b92286a09' has been recovered (rolled back)
MySQL: executing SQL statement >INSERT INTO authors VALUES(1840, 'Zola', 'Emile')<
Calling REST service passing: xid='1279875137.874e60ce41784155b0cf557cf0a6c693.e0130ac8303b5a9fc8383c6b92286a09', oper='insert'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)

[...]

  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))


The REST client's call fails with a ConnectionError because the server has crashed. This is the output on the server console:

***** REST service called: xid='1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9fc8383c6b92286a09', oper='insert' *****
2019-05-11 16:43:27.652090 [1/140719437825792] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
Created a subordinate branch with XID '1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9f480eedfdd665454a'
PostgreSQL: executing SQL statement >INSERT INTO authors VALUES(1804, 'Hawthorne', 'Nathaniel')<
Executing first phase of commit (prepare)
2019-05-11 16:43:27.710330 [1/140719437825792] CRIT: LXG000C crash point 12 will immediately terminate the process
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ffbfc6ca529, pid=1, tid=59
#
# JRE version: OpenJDK Runtime Environment (11.0.3+1) (build 11.0.3+1-Debian-1bpo91)
# Java VM: OpenJDK 64-Bit Server VM (11.0.3+1-Debian-1bpo91, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0x34529]  abort+0x269


Looking at rows 5 and 6, we can see that the REST server started the first phase of the commit, but the process crashed as expected.

Neither the client nor the server has prepared its transaction branch. This can be verified by inspecting what the Resource Managers (MySQL and PostgreSQL) report:

MySQL:

mysql> select * from authors;
Empty set (0.00 sec)

mysql> xa recover;
Empty set (0.00 sec)


PostgreSQL:

lixa=> select * from authors;
 id | last_name | first_name 
----+-----------+------------
(0 rows)

lixa=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)


It's time to restart the REST server without specifying the LIXA_CRASH_POINT environment variable:

$ docker run -ti --rm -e MAVEN_OPTS="-Djava.library.path=/opt/lixa/lib" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e PQSERVER="192.168.123.35" -p 18080:8080 rest-server

[...]

May 11, 2019 4:53:31 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8080]
May 11, 2019 4:53:31 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Jersey app started with WADL available at http://0.0.0.0:8080/xta/application.wadl
Hit enter to stop it...


Start the client to execute an SQL INSERT statement:

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" rest-client --no-delete
2019-05-11 16:54:06.786644 [1/139978787604224] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
2019-05-11 16:54:06.788080 [1/139978787604224] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-11 16:54:06.788533 [1/139978787604224] NOTICE: LXC037N recovery pending transaction with xid='1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9fc8383c6b92286a09'
2019-05-11 16:54:06.788563 [1/139978787604224] NOTICE: LXC038N transaction with xid='1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9fc8383c6b92286a09' must be rolled back
2019-05-11 16:54:06.788581 [1/139978787604224] INFO: LXC039I transaction with xid='1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9fc8383c6b92286a09' has been recovered (rolled back)
MySQL: executing SQL statement >INSERT INTO authors VALUES(1840, 'Zola', 'Emile')<
Calling REST service passing: xid='1279875137.314745bd7a8345fe9a916c345866bc33.e0130ac8303b5a9fc8383c6b92286a09', oper='insert'
Server replied >PREPARED<
Executing transaction commit


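You can see that the client first performed automatic recovery: the pending transaction left by the crash was detected and rolled back (messages LXC036I to LXC039I). The new transaction was then executed, the server replied "PREPARED," and the commit went ahead.

Coming back to the server, we can see the messages below: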

***** REST service called: xid='1279875137.314745bd7a8345fe9a916c345866bc33.e0130ac8303b5a9fc8383c6b92286a09', oper='insert' *****
2019-05-11 16:54:06.982736 [1/140065675020032] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
2019-05-11 16:54:06.984426 [1/140065675020032] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-11 16:54:06.984934 [1/140065675020032] NOTICE: LXC037N recovery pending transaction with xid='1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9f480eedfdd665454a'
2019-05-11 16:54:06.985086 [1/140065675020032] NOTICE: LXC038N transaction with xid='1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9f480eedfdd665454a' must be rolled back
2019-05-11 16:54:06.991819 [1/140065675020032] INFO: LXC018I Resource Manager 'PostgreSQL' returned XAER_NOTA during recovery rollback: the Resource Manager has already rolled back the transaction with xid '1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9f480eedfdd665454a'
2019-05-11 16:54:06.992025 [1/140065675020032] INFO: LXC039I transaction with xid='1279875137.53a0328502bd4767b026c11ddd76ff3d.e0130ac8303b5a9f480eedfdd665454a' has been recovered (rolled back)
Created a subordinate branch with XID '1279875137.314745bd7a8345fe9a916c345866bc33.e0130ac8303b5a9f43bace840bdd4cd5'
PostgreSQL: executing SQL statement >INSERT INTO authors VALUES(1804, 'Hawthorne', 'Nathaniel')<
Executing first phase of commit (prepare)
Returning 'PREPARED' to the client
Executing second phase of commit


It can be noticed that:

  1. In rows 3-4 (messages LXC036I, LXC037N), the LIXA state server reports a recovery pending transaction.
  2. In row 5 (message LXC038N), the LIXA Transaction Manager embedded in the REST server establishes that a rollback must be performed.
  3. In row 6 (message LXC018I), the Resource Manager (PostgreSQL) returns XAER_NOTA, meaning that it does not know the XID: the non-prepared branch was already rolled back when the server crashed.
  4. In row 7 (message LXC039I), the LIXA Transaction Manager declares its transaction branch "recovered."
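The recovery steps listed above can be sketched as a small loop. This is not LIXA's implementation: ToyResourceManager and recover are hypothetical names, and the list of XIDs to roll back is assumed to have been decided already. The only values taken from the XA specification are the return codes XA_OK (0) and XAER_NOTA (-4).

```python
XA_OK = 0        # success, as defined in the XA specification
XAER_NOTA = -4   # the XID is not known to the Resource Manager


class ToyResourceManager:
    """Stand-in for a database; it only tracks prepared XIDs (hypothetical)."""

    def __init__(self, prepared_xids=()):
        self.prepared = set(prepared_xids)

    def rollback(self, xid: str) -> int:
        if xid not in self.prepared:
            return XAER_NOTA  # nothing to undo for this XID
        self.prepared.remove(xid)
        return XA_OK


def recover(pending_xids, rm: ToyResourceManager) -> dict:
    """Roll back each recovery pending XID and report what happened."""
    outcomes = {}
    for xid in pending_xids:
        rc = rm.rollback(xid)
        if rc == XA_OK:
            outcomes[xid] = "rolled back"
        elif rc == XAER_NOTA:
            # Not an error during recovery: the branch is already gone.
            outcomes[xid] = "already rolled back"
        else:
            raise RuntimeError(f"unexpected XA return code {rc} for {xid}")
    return outcomes


# Crash before prepare: the Resource Manager holds no prepared branch.
print(recover(["xid-a"], ToyResourceManager()))
# Crash after prepare: the prepared branch is really rolled back.
print(recover(["xid-b"], ToyResourceManager(prepared_xids=["xid-b"])))
```

Treating XAER_NOTA as a successful outcome is what lets recovery finish cleanly when the Resource Manager has already discarded a non-prepared branch.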

Checking the content of MySQL and its XA status we can see that everything is fine:

mysql> select * from authors;
+------+-----------+------------+
| id   | last_name | first_name |
+------+-----------+------------+
| 1840 | Zola      | Emile      |
+------+-----------+------------+
1 row in set (0.00 sec)

mysql> xa recover;
Empty set (0.00 sec)


The row has been inserted in the table and there are no recovery pending transactions.

Checking the content of PostgreSQL and its XA status we can see that everything is fine:

lixa=> select * from authors;
  id  | last_name | first_name 
------+-----------+------------
 1804 | Hawthorne | Nathaniel
(1 row)

lixa=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)


Server Crashes in "Prepared" State

In this scenario, the REST server will crash after its transaction branch has been prepared. To obtain a crash immediately after prepare, the environment variable LIXA_CRASH_POINT must be set to a value of 14.

Start the REST Server

Start the REST server with the environment variable set to LIXA_CRASH_POINT=14:

$ docker run -ti --rm -e MAVEN_OPTS="-Djava.library.path=/opt/lixa/lib" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e PQSERVER="192.168.123.35" -e LIXA_CRASH_POINT=14 -p 18080:8080 rest-server

[...]

May 11, 2019 8:52:26 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8080]
May 11, 2019 8:52:26 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Jersey app started with WADL available at http://0.0.0.0:8080/xta/application.wadl
Hit enter to stop it...


Start the REST Client

Start the REST client by specifying the "--no-insert" option (the table is not empty and the transaction will crash during an SQL DELETE statement):

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" rest-client --no-insert
2019-05-11 20:56:53.017341 [1/140060180203264] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
MySQL: executing SQL statement >DELETE FROM authors WHERE id=1840<
Calling REST service passing: xid='1279875137.3a89df4fd6154470b96f392f70d0a13e.e0130ac8303b5a9fc8383c6b92286a09', oper='delete'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)

[...]

  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))


The client's call fails with a ConnectionError because the server has crashed. This is the output on the server console:

***** REST service called: xid='1279875137.3a89df4fd6154470b96f392f70d0a13e.e0130ac8303b5a9fc8383c6b92286a09', oper='delete' *****
2019-05-11 20:56:53.189414 [1/140464616359680] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
Created a subordinate branch with XID '1279875137.3a89df4fd6154470b96f392f70d0a13e.e0130ac8303b5a9fb3c472b63de54a94'
PostgreSQL: executing SQL statement >DELETE FROM authors WHERE id=1804<
Executing first phase of commit (prepare)
2019-05-11 20:56:53.262442 [1/140464616359680] CRIT: LXG000C crash point 14 will immediately terminate the process
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc0d7eb9529, pid=1, tid=59
#
# JRE version: OpenJDK Runtime Environment (11.0.3+1) (build 11.0.3+1-Debian-1bpo91)
# Java VM: OpenJDK 64-Bit Server VM (11.0.3+1-Debian-1bpo91, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0x34529]  abort+0x269


Looking at rows 5 and 6, we can see that the REST server started the first phase of the commit, but the process crashed as expected.

This time PostgreSQL has prepared its branch, while MySQL has not; the rows are still in the tables:

mysql> select * from authors;
+------+-----------+------------+
| id   | last_name | first_name |
+------+-----------+------------+
| 1840 | Zola      | Emile      |
+------+-----------+------------+
1 row in set (0.00 sec)

mysql> xa recover;
Empty set (0.00 sec)


lixa=> select * from authors;
  id  | last_name | first_name 
------+-----------+------------
 1804 | Hawthorne | Nathaniel
(1 row)

lixa=> select * from pg_prepared_xacts;
 transaction |                             gid                              |           prepared            | owner | database 
-------------+--------------------------------------------------------------+-------------------------------+-------+----------
         576 | 1279875137_OonfT9YVRHC5bzkvcNChPg==_4BMKyDA7Wp+zxHK2PeVKlA== | 2019-05-11 20:56:53.249484+00 | lixa  | lixa
(1 row)
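
With one branch prepared (PostgreSQL) and the other not (MySQL), the general two-phase commit rule leaves a single outcome: a commit is allowed only when every branch has prepared. A minimal sketch of that rule follows. It is not LIXA's actual recovery code, and the function name and state labels are hypothetical.

```python
def must_roll_back(branch_states: dict) -> bool:
    """Return True when at least one branch never reached 'prepared'.

    Two-phase commit may commit only if every branch has prepared, so a
    single non-prepared branch forces a rollback of the prepared ones.
    """
    return any(state == "non-prepared" for state in branch_states.values())


# State of the two branches after the crash shown above.
branches = {
    "MySQL (REST client)": "non-prepared",
    "PostgreSQL (REST server)": "prepared",
}
print(must_roll_back(branches))
```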


It is expected that the transaction will be recovered and rolled back; to verify this behavior we can restart the server without specifying the LIXA_CRASH_POINT environment variable:

$ docker run -ti --rm -e MAVEN_OPTS="-Djava.library.path=/opt/lixa/lib" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e PQSERVER="192.168.123.35" -p 18080:8080 rest-server

[...]

May 11, 2019 9:07:59 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8080]
May 11, 2019 9:07:59 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Jersey app started with WADL available at http://0.0.0.0:8080/xta/application.wadl
Hit enter to stop it...


And execute the client with an SQL DELETE statement:

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" rest-client --no-insert
2019-05-11 21:11:13.562803 [1/139853280184064] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
2019-05-11 21:11:13.563988 [1/139853280184064] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-11 21:11:13.564377 [1/139853280184064] NOTICE: LXC037N recovery pending transaction with xid='1279875137.d6d6279e80e2475fb62289994ce10466.e0130ac8303b5a9fc8383c6b92286a09'
2019-05-11 21:11:13.564392 [1/139853280184064] NOTICE: LXC038N transaction with xid='1279875137.d6d6279e80e2475fb62289994ce10466.e0130ac8303b5a9fc8383c6b92286a09' must be rolled back
2019-05-11 21:11:13.564405 [1/139853280184064] INFO: LXC039I transaction with xid='1279875137.d6d6279e80e2475fb62289994ce10466.e0130ac8303b5a9fc8383c6b92286a09' has been recovered (rolled back)
MySQL: executing SQL statement >DELETE FROM authors WHERE id=1840<
Calling REST service passing: xid='1279875137.fe3b74c46b1a41bb8630dbc5b6562bcb.e0130ac8303b5a9fc8383c6b92286a09', oper='delete'
Server replied >PREPARED<
Executing transaction commit


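We can see that the client performed automatic recovery (messages LXC036I to LXC039I) before executing the new transaction, to which the server replied "PREPARED."

Coming back to the server, we can see something like the messages below: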

***** REST service called: xid='1279875137.fe3b74c46b1a41bb8630dbc5b6562bcb.e0130ac8303b5a9fc8383c6b92286a09', oper='delete' *****
2019-05-11 21:11:13.752470 [1/140375670560512] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
2019-05-11 21:11:13.754308 [1/140375670560512] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-11 21:11:13.754738 [1/140375670560512] NOTICE: LXC037N recovery pending transaction with xid='1279875137.3a89df4fd6154470b96f392f70d0a13e.e0130ac8303b5a9fb3c472b63de54a94'
2019-05-11 21:11:13.754751 [1/140375670560512] NOTICE: LXC038N transaction with xid='1279875137.3a89df4fd6154470b96f392f70d0a13e.e0130ac8303b5a9fb3c472b63de54a94' must be rolled back
2019-05-11 21:11:13.772826 [1/140375670560512] INFO: LXC039I transaction with xid='1279875137.3a89df4fd6154470b96f392f70d0a13e.e0130ac8303b5a9fb3c472b63de54a94' has been recovered (rolled back)
Created a subordinate branch with XID '1279875137.fe3b74c46b1a41bb8630dbc5b6562bcb.e0130ac8303b5a9f81d06ef1e3ac45d8'
PostgreSQL: executing SQL statement >DELETE FROM authors WHERE id=1804<
Executing first phase of commit (prepare)
Returning 'PREPARED' to the client
Executing second phase of commit


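We can see that the server performed automatic recovery too. This time there is no LXC018I message: PostgreSQL still held the prepared branch, so the rollback was actually executed by the Resource Manager.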

Checking the content of MySQL and its XA status we can see that everything is fine:

mysql> select * from authors;
Empty set (0.00 sec)

mysql> xa recover;
Empty set (0.00 sec)


The row has been deleted from the table and there are no recovery pending transactions.

Checking the content of PostgreSQL and its XA status we can see that everything is fine:

lixa=> select * from authors;
 id | last_name | first_name 
----+-----------+------------
(0 rows)

lixa=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)


The row has been deleted from the table and there are no more prepared transactions.

Client Crashes in "Non-Prepared" State

In this scenario, the transaction branch managed by the REST client will crash before the transaction branch has been prepared. To obtain a crash immediately before prepare, the environment variable LIXA_CRASH_POINT must be set to a value of 12.

Start the REST Server

Start the REST server without setting the environment variable LIXA_CRASH_POINT:

$ docker run -ti --rm -e MAVEN_OPTS="-Djava.library.path=/opt/lixa/lib" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e PQSERVER="192.168.123.35" -p 18080:8080 rest-server

[...]

May 12, 2019 8:34:31 AM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8080]
May 12, 2019 8:34:31 AM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Jersey app started with WADL available at http://0.0.0.0:8080/xta/application.wadl
Hit enter to stop it...


Start the REST Client

Make sure the environment variable is set to LIXA_CRASH_POINT=12 and specify the "--no-delete" option (the table is empty and the transaction will crash during an SQL INSERT statement):

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e LIXA_CRASH_POINT=12 rest-client --no-delete
2019-05-12 08:41:17.620743 [1/139970505893632] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
MySQL: executing SQL statement >INSERT INTO authors VALUES(1840, 'Zola', 'Emile')<
Calling REST service passing: xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9fc8383c6b92286a09', oper='insert'
Server replied >PREPARED<
Executing transaction commit
2019-05-12 08:41:17.694272 [1/139970505893632] CRIT: LXG000C crash point 12 will immediately terminate the process


The client crashed during the transaction commit as expected. The line "Server replied >PREPARED<" shows that the client had received a "PREPARED" answer from the server before crashing. Below is the output displayed by the server in its console:

***** REST service called: xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9fc8383c6b92286a09', oper='insert' *****
Created a subordinate branch with XID '1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9f4cb983944b89443f'
PostgreSQL: executing SQL statement >INSERT INTO authors VALUES(1804, 'Hawthorne', 'Nathaniel')<
Executing first phase of commit (prepare)
Returning 'PREPARED' to the client
Executing second phase of commit
2019-05-12 08:41:22.691844 [1/139661518812928] ERR: LXC034E a message has not arrived before timeout expiration (5000 ms) and its socket has been closed
XtaException: LIXA ReturnCode=-32 ('ERROR: a message has not been received before timeout expiration and the TCP socket has been closed')
org.tiian.lixa.xta.XtaException: ERROR: a message has not been received before timeout expiration and the TCP socket has been closed
at org.tiian.lixa.xta.Transaction.commit(Native Method)
at org.tiian.lixa.xta.examples.MyResource$1.run(MyResource.java:126)
at java.base/java.lang.Thread.run(Thread.java:834)


The server prepared its transaction branch, but it was not able to confirm it as shown by the message LXC034E and by the XtaException. If we check the status inside MySQL, the database used by the client, we can see a clean situation:

mysql> select * from authors;
Empty set (0.00 sec)

mysql> xa recover;
Empty set (0.00 sec)


This is because the client crashed before preparing its own branch. On the other hand, inside PostgreSQL we can see a prepared transaction:

lixa=> select * from authors;
 id | last_name | first_name 
----+-----------+------------
(0 rows)

lixa=> select * from pg_prepared_xacts;
 transaction |                             gid                              |           prepared            | owner | database 
-------------+--------------------------------------------------------------+-------------------------------+-------+----------
         578 | 1279875137_ZDKYtxr0T1yZd0pGxe/CdA==_4BMKyDA7Wp9MuYOUS4lEPw== | 2019-05-12 08:41:17.653142+00 | lixa  | lixa
(1 row)


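The JDBC driver of PostgreSQL uses a base64 internal representation for the XID (XA transaction id). The gid column has the form formatID_gtrid_bqual, where gtrid and bqual are base64 encoded; once decoded, they match the hexadecimal XID of the subordinate branch printed in the server console.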
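The correspondence between the two notations can be checked with a few lines of Python. This is a small sketch that assumes the gid layout seen in the output above (format id, then base64 gtrid and bqual, separated by underscores); the function name pg_gid_to_lixa_xid is hypothetical.

```python
import base64


def pg_gid_to_lixa_xid(gid: str) -> str:
    """Convert a gid from pg_prepared_xacts to the dotted hex form used in LIXA logs."""
    # The standard base64 alphabet has no underscore, so splitting on it is safe.
    format_id, gtrid_b64, bqual_b64 = gid.split("_")
    gtrid = base64.b64decode(gtrid_b64).hex()
    bqual = base64.b64decode(bqual_b64).hex()
    return f"{format_id}.{gtrid}.{bqual}"


gid = "1279875137_ZDKYtxr0T1yZd0pGxe/CdA==_4BMKyDA7Wp9MuYOUS4lEPw=="
print(pg_gid_to_lixa_xid(gid))
# prints 1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9f4cb983944b89443f
```

The result is the XID of the subordinate branch created by the REST server in the console output above.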

It's time to execute the client without specifying the LIXA_CRASH_POINT environment variable:

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" rest-client --no-delete
2019-05-12 08:50:27.940106 [1/139723327334144] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
2019-05-12 08:50:27.941063 [1/139723327334144] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-12 08:50:27.941458 [1/139723327334144] NOTICE: LXC037N recovery pending transaction with xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9fc8383c6b92286a09'
2019-05-12 08:50:27.941478 [1/139723327334144] NOTICE: LXC038N transaction with xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9fc8383c6b92286a09' must be rolled back
2019-05-12 08:50:27.941811 [1/139723327334144] INFO: LXC018I Resource Manager 'MySQL' returned XAER_NOTA during recovery rollback: the Resource Manager has already rolled back the transaction with xid '1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9fc8383c6b92286a09'
2019-05-12 08:50:27.941821 [1/139723327334144] INFO: LXC039I transaction with xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9fc8383c6b92286a09' has been recovered (rolled back)
MySQL: executing SQL statement >INSERT INTO authors VALUES(1840, 'Zola', 'Emile')<
Calling REST service passing: xid='1279875137.836d2bad098046edaa09bee4e9ac8a2d.e0130ac8303b5a9fc8383c6b92286a09', oper='insert'
Server replied >PREPARED<
Executing transaction commit


As in the previous scenarios, the messages LXC036I, LXC037N, LXC038N and LXC039I report the automatic recovery performed on the client side. Message LXC018I adds that MySQL had already rolled back the transaction on its own. Looking inside the console of the server we get something like this:

***** REST service called: xid='1279875137.836d2bad098046edaa09bee4e9ac8a2d.e0130ac8303b5a9fc8383c6b92286a09', oper='insert' *****
2019-05-12 08:50:27.954042 [1/139661588543232] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-12 08:50:27.954752 [1/139661588543232] NOTICE: LXC037N recovery pending transaction with xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9f4cb983944b89443f'
2019-05-12 08:50:27.954957 [1/139661588543232] NOTICE: LXC038N transaction with xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9f4cb983944b89443f' must be rolled back
2019-05-12 08:50:27.995375 [1/139661588543232] INFO: LXC039I transaction with xid='1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9f4cb983944b89443f' has been recovered (rolled back)
Created a subordinate branch with XID '1279875137.836d2bad098046edaa09bee4e9ac8a2d.e0130ac8303b5a9f3f83b37334a24183'
PostgreSQL: executing SQL statement >INSERT INTO authors VALUES(1804, 'Hawthorne', 'Nathaniel')<
Executing first phase of commit (prepare)
Returning 'PREPARED' to the client
Executing second phase of commit


Automatic recovery has been performed on the server side as well.
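
The XIDs printed in the two consoles share the format ID and the global transaction ID (gtrid) and differ only in the last field, the branch qualifier (bqual). A minimal Python sketch of that relationship, assuming the dot-separated "formatID.gtrid.bqual" layout shown in the messages above; `parse_xid` and `same_global_transaction` are hypothetical helper names, not part of the XTA API:

```python
from typing import NamedTuple


class Xid(NamedTuple):
    format_id: int
    gtrid: str  # global transaction id, hex string
    bqual: str  # branch qualifier, hex string


def parse_xid(text: str) -> Xid:
    """Split a 'formatID.gtrid.bqual' string into its three fields."""
    format_id, gtrid, bqual = text.split(".")
    return Xid(int(format_id), gtrid, bqual)


def same_global_transaction(a: str, b: str) -> bool:
    """Two branches belong to the same global transaction when the
    format ID and the gtrid match; the bqual may differ."""
    xa, xb = parse_xid(a), parse_xid(b)
    return (xa.format_id, xa.gtrid) == (xb.format_id, xb.gtrid)


client = "1279875137.836d2bad098046edaa09bee4e9ac8a2d.e0130ac8303b5a9fc8383c6b92286a09"
server = "1279875137.836d2bad098046edaa09bee4e9ac8a2d.e0130ac8303b5a9f3f83b37334a24183"
recovered = "1279875137.643298b71af44f5c99774a46c5efc274.e0130ac8303b5a9fc8383c6b92286a09"

print(same_global_transaction(client, server))     # branches of the new transaction
print(same_global_transaction(client, recovered))  # new vs. recovered transaction
```

The first comparison is true because the subordinate branch created by the server reuses the gtrid passed by the client. The second is false because the recovered transaction belongs to the earlier, crashed execution.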

Let's check the status inside MySQL again:

mysql> select * from authors;
+------+-----------+------------+
| id   | last_name | first_name |
+------+-----------+------------+
| 1840 | Zola      | Emile      |
+------+-----------+------------+
1 row in set (0.00 sec)

mysql> xa recover;
Empty set (0.00 sec)


And inside PostgreSQL:

lixa=> select * from authors;
  id  | last_name | first_name 
------+-----------+------------
 1804 | Hawthorne | Nathaniel
(1 row)

lixa=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)


Client Crashes in "Prepared" State

In this scenario, the REST client will crash after its transaction branch has been prepared. To obtain a crash immediately after prepare, the environment variable LIXA_CRASH_POINT must be set to 14.

Start the REST Server

Start the REST server without setting the environment variable LIXA_CRASH_POINT:

$ docker run -ti --rm -e MAVEN_OPTS="-Djava.library.path=/opt/lixa/lib" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e PQSERVER="192.168.123.35" -p 18080:8080 rest-server

[...]

May 12, 2019 12:27:32 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8080]
May 12, 2019 12:27:32 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Jersey app started with WADL available at http://0.0.0.0:8080/xta/application.wadl
Hit enter to stop it...


Start the REST Client

Set the environment variable LIXA_CRASH_POINT=14 and specify the "--no-insert" option (the table contains one row, and the transaction that crashes executes an SQL DELETE statement):

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" -e LIXA_CRASH_POINT=14 rest-client --no-insert
2019-05-12 12:30:31.934668 [1/140414610745088] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
MySQL: executing SQL statement >DELETE FROM authors WHERE id=1840<
Calling REST service passing: xid='1279875137.4bbeee362190416580342d1e0fbc2c1f.e0130ac8303b5a9fc8383c6b92286a09', oper='delete'
Server replied >PREPARED<
Executing transaction commit
2019-05-12 12:30:32.054258 [1/140414610745088] CRIT: LXG000C crash point 14 will immediately terminate the process


The client crashed during the transaction commit as expected, but this time the crash happened after the client had prepared its transaction branch.

Row 6 shows that the client received a "PREPARED" answer from the server. Below is the output displayed by the server in its console:

***** REST service called: xid='1279875137.4bbeee362190416580342d1e0fbc2c1f.e0130ac8303b5a9fc8383c6b92286a09', oper='delete' *****
Created a subordinate branch with XID '1279875137.4bbeee362190416580342d1e0fbc2c1f.e0130ac8303b5a9fcfb462b57a45455f'
PostgreSQL: executing SQL statement >DELETE FROM authors WHERE id=1804<
Executing first phase of commit (prepare)
Returning 'PREPARED' to the client
Executing second phase of commit


There are no complaints on the REST server side.

If we check the status inside MySQL, the database used by the client, we can see the row is still in the table and a prepared transaction is waiting to be recovered:

mysql> select * from authors;
+------+-----------+------------+
| id   | last_name | first_name |
+------+-----------+------------+
| 1840 | Zola      | Emile      |
+------+-----------+------------+
1 row in set (0.00 sec)

mysql> xa recover;
+------------+--------------+--------------+------------------------------------------------------------------+
| formatID   | gtrid_length | bqual_length | data                                                             |
+------------+--------------+--------------+------------------------------------------------------------------+
| 1279875137 |           32 |           32 | 4bbeee362190416580342d1e0fbc2c1fe0130ac8303b5a9fc8383c6b92286a09 |
+------------+--------------+--------------+------------------------------------------------------------------+
1 row in set (0.00 sec)
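
The row returned by "xa recover" can be matched against the XID printed by the client. The following is a minimal Python sketch, assuming that the data column is printable text made of the gtrid followed by the bqual, as in the output above; the function name is hypothetical:

```python
def mysql_row_to_lixa_xid(format_id: int, gtrid_length: int,
                          bqual_length: int, data: str) -> str:
    """Rebuild the 'formatID.gtrid.bqual' string from one XA RECOVER row."""
    if len(data) != gtrid_length + bqual_length:
        raise ValueError("data length does not match gtrid/bqual lengths")
    gtrid = data[:gtrid_length]   # first gtrid_length characters
    bqual = data[gtrid_length:]   # remaining bqual_length characters
    return f"{format_id}.{gtrid}.{bqual}"


row = (1279875137, 32, 32,
       "4bbeee362190416580342d1e0fbc2c1fe0130ac8303b5a9fc8383c6b92286a09")
print(mysql_row_to_lixa_xid(*row))
```

The result is the same XID that the client passed to the REST service before crashing, which confirms that the prepared transaction belongs to the crashed client.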


Inside PostgreSQL, the situation is quite different:

lixa=> select * from authors;
 id | last_name | first_name 
----+-----------+------------
(0 rows)

lixa=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)


This is expected because the REST client completed the prepare phase and the REST server committed its own transaction branch.

Once again, a new client invocation will activate the automatic recovery feature of the embedded LIXA Transaction Manager:

$ docker run -ti --rm -e SERVER="192.168.123.35" -e LIXA_STATE_SERVERS="tcp://192.168.123.35:2345/default" rest-client --no-insert
2019-05-12 12:38:47.404872 [1/140404202567424] INFO: LXC000I this process is starting a new LIXA transaction manager (lixa package version is 1.7.8)
***** REST client *****
2019-05-12 12:38:47.406843 [1/140404202567424] INFO: LXC036I state server noticed there is at least one recovery pending transaction
2019-05-12 12:38:47.407487 [1/140404202567424] NOTICE: LXC037N recovery pending transaction with xid='1279875137.4bbeee362190416580342d1e0fbc2c1f.e0130ac8303b5a9fc8383c6b92286a09'
2019-05-12 12:38:47.407678 [1/140404202567424] NOTICE: LXC038N transaction with xid='1279875137.4bbeee362190416580342d1e0fbc2c1f.e0130ac8303b5a9fc8383c6b92286a09' must be committed
2019-05-12 12:38:47.436083 [1/140404202567424] INFO: LXC039I transaction with xid='1279875137.4bbeee362190416580342d1e0fbc2c1f.e0130ac8303b5a9fc8383c6b92286a09' has been recovered (committed)
MySQL: executing SQL statement >DELETE FROM authors WHERE id=1840<
Calling REST service passing: xid='1279875137.ae47a9f8a8f04d03b6e8f87a904b6ab7.e0130ac8303b5a9fc8383c6b92286a09', oper='delete'
Server replied >PREPARED<
Executing transaction commit


Important fact: Rows 6 and 7 show that the recovery pending transaction must be committed. This is consistent with the status of the global transaction because the server already committed its own branch.

Looking inside the terminal of the server:

***** REST service called: xid='1279875137.ae47a9f8a8f04d03b6e8f87a904b6ab7.e0130ac8303b5a9fc8383c6b92286a09', oper='delete' *****
Created a subordinate branch with XID '1279875137.ae47a9f8a8f04d03b6e8f87a904b6ab7.e0130ac8303b5a9f558b39ba87cf4e4a'
PostgreSQL: executing SQL statement >DELETE FROM authors WHERE id=1804<
Executing first phase of commit (prepare)
Returning 'PREPARED' to the client
Executing second phase of commit


The server reports no recovery pending transactions. This is consistent with the previous execution because, as stated above, the server had already committed its transaction branch.

Finally, we have to check the state in the databases. Looking at MySQL we get:

mysql> select * from authors;
Empty set (0.00 sec)

mysql> xa recover;
Empty set (0.00 sec)


And looking at PostgreSQL we obtain:

lixa=> select * from authors;
 id | last_name | first_name 
----+-----------+------------
(0 rows)

lixa=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)


The data is consistent and no prepared transactions need to be recovered!

Summary

In the previous sections, four failure scenarios were analyzed. Here is a brief summary.

Server Side Crashes Before Prepare

Neither the server nor the client prepares its own transaction branch, so no prepared transaction needs to be recovered.

Server Side Crashes After Prepare

The server prepares its transaction branch and the client doesn't. Only the server is left with a prepared transaction that needs to be recovered, and the recovery happens automatically as soon as a new REST request is served.

Client Side Crashes Before Prepare

As in the above scenario, the server prepares its transaction branch and the client doesn't. Only the server is left with a prepared transaction that needs to be recovered, and the recovery happens automatically as soon as a new REST request is served.

Client Side Crashes After Prepare

The server prepares and commits its transaction branch, and the client prepares its transaction branch but does not commit. Only the client is affected by a prepared transaction that needs to be recovered and the recovery happens automatically as soon as a new transaction is started by the client.
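
The four outcomes follow the usual two-phase commit recovery rule: a branch left in the prepared state is committed only if every branch of the global transaction was prepared, otherwise it is rolled back. The Python sketch below models that rule for the scenarios of this article. It is an illustration only, not the internal logic of LIXA, and the names `Branch` and `recovery_actions` are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    name: str
    prepared: bool
    committed: bool = False


def recovery_actions(branches):
    """Return {branch name: action} for branches stuck in the prepared state."""
    # Commit only when all the branches reached the prepared state.
    decision = "commit" if all(b.prepared for b in branches) else "rollback"
    return {b.name: decision
            for b in branches if b.prepared and not b.committed}


scenarios = {
    "server crashes before prepare": [
        Branch("client", False), Branch("server", False)],
    "server crashes after prepare": [
        Branch("client", False), Branch("server", True)],
    "client crashes before prepare": [
        Branch("client", False), Branch("server", True)],
    "client crashes after prepare": [
        Branch("client", True), Branch("server", True, committed=True)],
}

for name, branches in scenarios.items():
    print(name, "->", recovery_actions(branches) or "nothing to recover")
```

Only the last scenario ends with a commit during recovery, because it is the only one where both branches were prepared before the crash.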

Conclusion

The XTA API in conjunction with the LIXA state server enables the development of systems that implement ACID distributed transactions among two or more applications (services).

This article shows what happens when one of the two parties crashes: recovery is automatically performed without human intervention and strong consistency is guaranteed.

The issues related to the locks associated with the prepared transactions should not affect a microservices-based architecture: if a table is accessed by a single service and the locks are fine-grained (row lock level for relational databases), the only impact of a prepared transaction is a short recovery phase that's automatically performed by the next invocation of the same service.
