gdl
    title     = OpenAMQ Clustering
    subtitle  = High-Availabilty and Partitioning
    product   = OpenAMQ
    author    = iMatix Corporation <openamq@imatix.com>
    date      = 2006/08/03
    copyright = Copyright (c) 2006 iMatix Corporation
    version   = 2.1
end gdl

Cover
*****

State of this Document
======================

This document is a technical whitepaper.

Copyright and License
=====================

Copyright (c) 1996-2006 iMatix Corporation

This product is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This product is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

For information on alternative licensing for OEMs, please contact iMatix
Corporation.

Abstract
========

We analyse the requirements for OpenAMQ local-area and wide-area clustering
and we design a generic clustering model based on simple and easy to manage
elements.  The clustering model provides two main functionalities: the use
of high-availability pairs of servers and the ability to partition groups of
client applications, for reasons of scalability and geographic distribution.

Introduction
************

Scope and Requirements
======================

This document describes the requirements, architecture, and design of the
OpenAMQ clustering implementation.  The OpenAMQ clustering implementation
covers two main requirements:

1. The use of high-availability pairs to create automatic failover systems
   in which applications can switch to a backup server if a primary server
   crashes.

2. The interconnection of servers or high-availability pairs into larger
   architectures that can support distributed request-response and pub-sub
   scenarios.

Out-of-Scope
============

We do not attempt to support these functions:

1. The use of 'active backups'.  In a HA pair, the backup server is inactive
   and does no useful work until the primary server goes offline.

2. The load-balancing of applications between servers.  While applications
   can be spread-out between servers (or HA pairs), this happens manually,
   through the technique of "partitioning", i.e. the grouping of applications
   into well-defined blocks, each served by a specific server or HA pair.

3. The handling of persistent messages or transactions in any specific way.
   While our clustering design is compatible with persistent messaging and
   transactions, these introduce specific concerns (such as the processing
   of messages during failover) which we do not address at present.

Technical Terminology
=====================

These terms have significance within the context of this document:

Partition:
    A group of AMQP client applications that are served by a single OpenAMQ
    server, or a pair of servers that collaborate as a "high-availability
    pair".
Partitioned Network:
    A network of OpenAMQ servers or HA pairs (which we call a "node").  Each
    node serves a well-defined set of client applications, where applications
    are tied to a particular node.  (A single application may work with
    multiple nodes, but this must be done explicitly, there is no concept of
    load-balancing between nodes.)
Node:
    A server or HA pair in a partitioned OpenAMQ network.  A node is either
    one server or two servers that collaborate as a high-availability pair
    in a primary-backup relationship designed to provide fast failover in case
    of failure of the primary server.
Cluster:
    A general term that covers any network architecture involving more
    than one OpenAMQ server in a collaborative relationship.

Scenarios
=========

These are typical scenarios that we wish to support in OpenAMQ clustering:

 - A set of applications working with a single message broker need a backup
   broker than can take over in case of problems with the primary broker.
   The failover needs to happen rapidly, and automatically, so that ongoing
   business is not interrupted.

 - A regional location is connected to a central location by a slow satellite
   link.  The regional applications need to access market data and business
   services that are provided centrally.  Traffic across the satellite link
   must be optimised so that messages to multiple subscribes are sent only
   once.

 - A customer requires a message broker installed at their location so that
   local applications can interoperate; these applications also subscribe
   to data feeds from a central server.

Basic Functionality
===================

We can organise the clustering functionality to two layers:

1. High-availability, in which two servers collaborate to ensure that a set
   of applications are well-served.

2. Partitioning of applications into multiple groups so that we can create
   larger networks of servers.

We look at each of these aspects in more detail.

High-Availability
-----------------

High-availability covers the following scenario:

 - A set of applications are working with a server on a computer system.

 - Either the software or the hardware crashes.

 - The applications detect the failure and reconnect to a backup server
   that is ready and running.

When we look at implementing high-availability, we must consider a number
of aspects:

1. How many servers do we need to create a satisfactory level of redundancy?

2. What happens to messages published during the failover process?

3. What level of support is required from client applications (that is, the
   AMQP client API)?

4. Does the high-availability design place restrictions on the network
   architecture?

5. How do we avoid the "split-brain scenario" in which both servers of a HA
   pair believe they are the active server?

Each of these questions has good answers, which we will look at later.  In
gross terms, we will aim to keep the HA design as simple as possible, since
any additional complexity creates its own unreliability.

Application Partitioning
------------------------

Partitioning means "divide and conquer", in a simple and direct sense.  When
we have a large number of applications, it can be more useful to divide these
across a set of servers, rather than concentrate them at a single point.

The alternative to partitioning is "load-balancing", in which a group of
applications is shared across multiple servers.

Whether or not we need load-balancing is a very important question, because
traditional clustering is about load-balancing as much as it is about high
availability, and rather a lot less about partitioning.  (Few middleware
products implement partitioning, which is the basis for grids, wide-area
networks, and other interesting organisations).

Traditionally, load-balancing is a performance option.  This makes sense in
servers that have severe limitations - perhaps limited by the number of
file handles (=sockets) that a single process can open, or simply by an
inefficient protocol and implementation that does not allow scaling.

In our experience, AMQP and OpenAMQ are fast enough that load-balancing is
not a key requirement.  However, we can also divide typical messaging into
two domains, request-response and publish-subscribe, and say that while we
do not need to use performance boosting architectures for request-response
processing, which is usually low on volume and high on reliability, we do
potentially need performance-boosting architectures for publish-subscribe.

On analysis, we discover that we can use partitioning as an effective and
simple model for performance improvement, in a publish-subscribe scenario.
This works by using a 'fanout' architecture (a tree or star model) in which
messages are distributed from a central point to multiple servers, each
serving a well-defined set of applications.

Partitioning thus covers these needs:

1. High-performance pub-sub architectures, for market-data and other very
   high-volume scenarios.

2. Geographic partitioning of applications, e.g. between central and
   regional offices.

General Architecture
====================

The Network "Node"
------------------

We construct a partitioned network out of nodes that are either standalone
servers, or high-availability pairs.  From the perspective of the network,
whether a node is a single server or a HA pair makes zero difference; this
is a purely local decision based on the need for service continuity at that
node.

To simplify discussion, we use the term "node" to cover either of one
OpenAMQ server, or a HA pair of OpenAMQ servers, collaborating to serve a
single set of client applications.

Partitioning Models
-------------------

These are a number of plausible network architectures, in which each node in
the network is a HA pair or a single stand-alone server:

1. A star network, with a single central node and many distributed nodes.
   This would typically be used to send information from a central point to
   regional networks, each serving a set of local applications and users.

2. A tree network, with one top-level node, a small number of second-level
   nodes, and more nodes connected to these, in a tree hierarchy.  This
   would typically be used to create extremely-high volume data publishing
   networks (capable of serving over a million messages per second).

3. A loose network, with ad-hoc relationships between nodes organised at
   regional, national, and global levels.  This would typically be used for
   real-life global organisations with diverse information streams, each
   involving a set of nodes.

We do not favour any of these organisations, but partitioning provides the
tools to build them all, and others that we have not yet considered.

Message Pull and Push
---------------------

AMQP is easily compatible with organising the flow of messages between
partitions.  The main semantics for message flow between two partitions are:

1. The receiving partition can subscribe to a set of messages.

2. The sending partition can forward a set of messages.

These are typical "pull" and "push" models.  Within the context of AMQP we can
define exactly what each of these mean.

Subscription, in AMQP terms, means creating a private queue and binding this
to one or more exchanges with specific binding conditions.  Messages published
to those exchanges, which match the conditions, are copied into the private
queue where they can be consumed.  This is how ordinary applications subscribe
to messages; this is also how partitions pull messages from each other.

We can in fact construct a full partitioned network using only pull mechanisms.
However, this means that for each pair of connected partitions we have to
configure subscriptions at each side.  The administrative cost for this can be
significant.

The push mechanism is a useful alternative because it lets us configure the
connection between the partitions at one node only (both pull and push).  This
means that adding a partition to an existing network can be done without any
configuration changes to the existing nodes.

To forward messages usefully we need some clear semantics.  From our analysis
it makes sense to forward messages on a per-exchange basis under one of
these (configurable) conditions:

 - Forward all messages, but process local bindings additionally.

 - Forward only messages that do not match any local bindings.

There are other possibilities (such as forwarding all messages) but we do not
have use cases for those.

Architecture and Design
***********************

Layers of Operation
===================

We break the high-availability and partitioning architecture into a number of
specific layers, each solving one problem:

1. The general ability to connect pairs of servers together.  This is a core
   requirement for both high-availability and partitioning.  We call this
   "Server Peering".

2. The ability for peers to exchange arbitrary non-application information.
   This is a requirement for high-availability (so that peers can agree on
   who is the active server).  We call this "Internal Messaging".

3. A layer in each server that handles the high-availability functionality.
   This is called the "High-availability Controller", or HAC.

4. A layer in each server that pushes messages to or pulls messages from
   another server, on behalf of its own client applications.  We call this the
   "Message Transfer Agent", or MTA.

5. The use of globally-unique names for reply queues, so that messages can be
   routed successfully across a partitioned network.  We call this "Global
   Routing".

We explicitly will NOT look at:

 - The ability to automatically define the topology, called "discovery". In
   our current OpenAMQ use cases, the cluster topology is defined manually by
   the system administrator.

 - The ability to spread large numbers of clients across many servers, called
   "load balancing".  First, current OpenAMQ implementations are capable of
   handling very large loads; secondly we can use partitioning to support more
   clients if necessary.

 - The ability to use clustered servers in any other role, for example as part
   of a message persistence scheme.

 - The ability to replicate the state of objects such as exchanges and shared
   queues between servers.

Server Peering
==============

The basic relationship between servers is called a "peering" and consists of
an asymmetric relationship from one server to another.

A peering goes in one direction, from a 'client' host onto a 'target' host.
To create a symmetric relationship between two servers, we define one
peering in each direction.

Peerings are internal objects, not configured by administrators, nor visible
to them, but created by other parts of the OpenAMQ server as needed.  Peerings
have these properties:

 - A peering will automatically detect network failures and target host
   failures, and will suspend and reconnect as required.

 - A peering acts as a remote client, with the only difference from the
   perspective of the target host is that peerings may use their own
   authentication data.

 - A peering uses a single configured network route.  If two peers talk over
   multiple network interfaces (for redundancy), this will imply multiple
   peerings.

A peering automatically creates a private queue on the target host and
consumes messages off this queue.  Incoming messages are delivered to whatever
server entity created the peering.

Internal Messaging
==================

Internal messaging is the exchange of information between entities at each
side of a peering.  AMQP does not provide any semantics for such exchanges, so
we must build our own on top of AMQP.

The simplest way to add internal messaging semantics to AMQP is to use the
exchange concept as a routing point.  That is, when one entity wishes to send
status information to another, across a peering, it would create a content (a
message) and publish that to a specific exchange at the destination server.

While the AMQP specifications include a "tunnel" class that could also be used
for this purpose, the semantics of publishing via an exchange are simple and
generic enough that we will use that mechanism.

Internal messaging is thus handled by an exchange called "amq.peer", and using
a set of specific binding criteria and messages that allow the creation of
simple request-response and pub-sub messaging models between peer entities.
(Note that such messages never leave the high-availability pair.)

High-availability Controller (HAC)
==================================

The HAC is an entity that operates within each server.  There is no external
controller process.

The two HACs in a HA pair use peering and peer messaging to talk to each
other.  They exchange messages that include:

 - Their own current status (active or passive).

 - The number of clients they have connected (must be zero for passive
   servers).

 - Their server identity ('name').

Each HAC implements an algorithm that continually detects the state of the
other HAC and decides whether to become passive or active.  The two HACs
do not use the identical algorithm - there is a "primary HAC" and a "backup
HAC" and the algorithm is slightly different in each case.

High-Availability Requirements
------------------------------

We can list our requirements for a high-availability architecture:

1. The time for a failover must be under 60 seconds and preferrably under 10
   seconds.

2. Failover must happen automatically, while recovery must happen manually,
   so that application service interruptions are minimise.

3. We need a single backup server to ensure an acceptable level of security.
   Multiple backup servers creates extra complexity that creates its own
   risks.

4. We assume that the HA pair is connected by a reliable network, possibly by
   multiple independent networks.

5. We assume that the HA backup server is passive, not doing any work until
   called upon to replace the HA primary server.  That is, we do not require
   an active-backup server (which also adds complexity without necessarily
   adding any value).

6. We assume that the HA backup and HA primary server are equally capable, or
   at least that the HA backup can carry the full application load.  There is
   no attempt to load-balance the application load across multiple backup
   servers.

7. We need simple semantics for how client applications should work with a HA
   pair.

8. We need clear instructions for network architects on how to avoid designs
   that could result in a "split-brain" syndrome in which the HA pair becomes
   split into two nodes through specific network failure scenarios.

9. The startup sequence of a HA pair must not matter; that is, there must be
   no timing-dependent issues at startup.

10. It must be possible to make planned stops and restarts of either server
    without stopping the client applications.

11. Operators must be able to monitor both servers at all times.

General Principles
------------------

The HAC has three internal states:

 - Pending: the server has started and is waiting for its peer to start.

 - Active: the server has become the active server for the partition.

 - Passive: the server is the passive (inactive failover) server for the
   partition.

When a server is running in HA mode, and a client application attempts to
connect, what happens depends on the the HAC state, and the type of client:

 - If the client is the other peer in a HA pair, or a console user, it
   is allowed to connect as usual, irrespective of the HAC state.

 - If the HAC is pending, normal clients are held in a connection 'opening'
   state until the HAC leaves the pending state, or the client times out.

 - If the HAC is active, the client connection is accepted as usual.

 - If the HAC is passive, the client connection is closed.

If a server becomes passive, and has normal clients, it disconnects those
clients.  Console users and HA peers are not disconnected.

Primary HAC Behaviour
---------------------

The primary HAC has these states and transitions:

 * (Primary Pending) - backup server comes online and is pending or passive
   ---> Primary Active.  Meaning: normal HA startup.

 * (Primary Active) - backup server comes online and is active ---> Primary
   Passive.  Meaning: primary has been restarted, and backup has taken over
   active role.

 * (Primary Passive) - backup server has gone offline and there is at least
   one connected client or client attempting to connect ---> Primary Active.
   Meaning: primary has been restarted, and now backup server has gone down.

 * (Primary Passive) - backup server is passive ---> Primary Active. Meaning:
   operator has forced backup server into passive mode to allow HA pair to
   resume normal configuration.

All other events result in no state changes.

Backup HAC Algorithm
--------------------

The backup HAC has these states and transitions:

 * (Backup Pending) - primary server comes online and is pending or active
   ---> Backup Passive.  Meaning: normal HA startup.

 * (Backup Passive) - primary server has gone offline and a client attempts to
   connect ---> Backup Active.  Meaning: primary has crashed or stopped and
   backup is still correctly connected to network.

 * (Backup Passive) - primary server is passive ---> Backup Active. Meaning:
   operator has forced primary server into passive mode to force HA pair into
   failover configuration.

 * (Backup Active) - primary server has gone offline and backup server has no
   connected clients ---> Backup Passive.  Meaning: backup is not connected
   to network any longer.

Other events result in no state changes.

Failover Scenarios
------------------

Here are a number of scenarios that will result in failover happening. Each
starts from the same point, two HA servers, an active primary and a passive
backup:

1. The primary server process or system crashes, and the applications
   reconnect to the backup server.

2. The primary server is disconnected from the network long enough for at
   least one application to disconnect (perhaps using agressive heartbeating)
   and reconnect to the backup server while the primary server is offline.

3. The operator forces the primary server into passive mode using the console.

Recovery Process
----------------

There are several ways to recover from a failover.  First, the operators
can always restart the primary server, with no impact on applications.  The
primary server will not become active; it will defer to the backup server.

Next, the operators can either stop and restart the backup server, or they
can use the operator console to force it to go into passive mode.

Normal Shutdown Process
-----------------------

The normal shutdown process is to either:

1. Stop the passive server and then stop the active server, or

2. Stop both servers in any order but within a few seconds of each other.

Stopping the active and then the passive server with any intervening delay
will force applications to disconnect, then reconnect, then disconnect
again, which may disturb users.

Split-Brain Prevention
----------------------

The algorithm for selecting the active/passive server eliminates split-brain
situations in all usual cases.

In some unusual cases, split-brain situations can still happen, and we can
document these cases, and avoid them.

The basic rule is that a peer will not become the active server unless it can
see client applications.  For both servers to become active, we need a network
configuration in which the servers cannot see each other, but each server can
see a group of client applications.

A typical scenario would a HA pair distributed between two buildings, where
each building also had a set of applications, and there was a single network
link between both buildings.  Breaking this link would create two sets of
client applications, each with half of the HA pair, and each HA peer would
become active.

To prevent split-brain situations, we must connect HA peers using a dedicated
network link, which can be as simple as plugging them both into the same
switch.

We must not split a HA pair into two islands, each with a set of applications.
While this may be a common type of network architecture, we use partitioning,
not high-availability, in such cases.

Client-side Semantics
---------------------

Our goal with high-availability is to make the client semantics very simple:

1. Client applications should know both HA peer addresses.  In the client APIs
   we allow the "hostname" to be specified as a list.  This list would contain
   two server names/addresses.

2. The client API will attempt to connect to either of these, in an undefined
   order.

3. The server will either hold the connection (if the server is pending),
   accept it (if the server is active, or the connection is 'special'), or
   reject it (if the server is passive).

4. If the client cannot connect to the first server, it will attempt to connect
   to the second.

5. The client may, optionally, retry this connection sequence, with appropriate
   pauses, before abandoning.

Note that we do NOT use the AMQP Connection.Redirect semantics, and we do not
provide the client with alternative hosts to connect to (the known-hosts field
of the Connection.Open method).  Both these semantics have proven complex and
error-prone in the past, and today appear to be unnecessary.

Message Transfer Agent (MTA)
============================

Requirements
------------

We can list our requirements for message transfer between partitions:

1. Message transfer must work equally well over slow WAN connections as fast
   LAN connections.  That is, message traffic must be optimised so that a
   message destined for multiple recipients at a node is sent only once to
   that node.

2. Message transfer must be easy to configure, operate, and debug.

4. We must support a distributed publish-subscribe fanout model in which data
   streams are distributed from a central source to a distributed network of
   nodes.

5. We must also support a distributed service request-response model in which
   services and the applications which use them can be put onto different
   partitions.

6. It must be possible to configure a working message transfer scenario from a
   single node, i.e. without no configuration changes on the target node.

7. We want to be able to create multiple independent message routing flows,
   so that different application architectures can be implemented within a
   single OpenAMQ server.

General MTA Design
------------------

The general MTA design is this:

 - An MTA is attached to an exchange.  An MTA instance always operates on a
   single, well-specified exchange.

 - An MTA pulls or pushes messages from/to an eopnymous exchange on another
   partition.

 - MTAs are configured objects that have the same lifespan as the exchanges
   they operate on.

Message "Pull" Processing
-------------------------

The message pull functionality is implemented as follows:

 - The MTA connects to the specified partition using a peering.

 - The MTA (or the peering, on behalf of the MTA) creates a temporary private
   queue and consumes off this queue.

 - When the exchange receives a new binding request (from an application), it
   passes this to the MTA, which makes an identical binding request on the
   remote partition, for the MTA's private queue.

 - The MTA thus receives all messages at the remote partition which correspond
   to the binding criteria that were specified.

 - When the exchange destroys a binding, it also tells the MTA.  Note: AMQP does
   not currently have any semantics for unbinding a queue, so any current MTA
   implementation will gradually become inefficient, with old unused bindings.

 - Incoming messages are published to the local exchange and thus to any local
   queues that requested them.

Message "Push" Processing
-------------------------

The message push functionality is implemented as follows:

 - The MTA connects to the specified partition using a peering.

 - The MTA tells the local exchange under what conditions it wants to receive
   messages.

 - When the exchange passes a message to the MTA, the MTA publishes it on the
   remote partition.

The conditions for message forwarding are:

 1. Forward all messages, but process local bindings additionally.

 2. Forward only messages that do not match any local bindings.

We assume that the connection between partitions is perhaps slow, but
reliable.  That is, we do not attempt to queue and forward messages in case
of a network failure - messages will simply be dropped.

Global Routing
==============

Routing Keys
------------

To enable request-response across partitions, using message transfer agents,
we must ensure that reply-to values are global.  We do not specifically need
service names to be global although this may be useful at some stage.

AMQP routing semantics are based primarily on a "routing key".  To make
global routing work, we need to ensure that routing keys are unique across
all interconnected servers.

The simplest way to create global routing keys is to use a naming convention
that includes unique server-dependent information.  We propose a convention
that has been well-proven in the Internet domain:

    queuename@hostname

To implement global routing, we must:

 1. Ensure that every server is properly configured with a unique identity.

 2. Use the server identity when naming temporary private queues (which are
    the basis for replies).

Default Routes
--------------

Although we do not have a concrete use case for this, we are analysing the
use of default routes for messages, based on the routing keys that follow a
systematic queuename@hostname format.  This would allow automatic message
transfer without the need for manual MTA configuration for each message
stream.

Configuration and Operation
***************************

We look at how the clustering and inter-clustering functions explained in this
document are configured for real scenarios.

HAC Configuration
=================

The configuration data for a HA pair must specify:

 - The primary and backup server identities (their 'names').

 - For each of these two servers, one or more internal network addresses, used
   for connections between servers.

Operation of the HA pair is as follows:

 - When a server process is started, we must tell it what configuration data
   to load, and whether or not it is half of a HA pair.  A server must know
   whether it is the primary or backup server.

 - The HAC in a particular server can be interrogated via the console, started
   and stopped.  This is also how we test the failover and recovery mechanisms
   without stopping and starting processes.

MTA Configuration
=================

The configuration for an MTA needs to specify:

 - Authentication information on the central server(s) so that the remote
   server can connect.

 - On the remote server(s), the name of each exchange to operate on, and
   the central server to federate it with.

 - For each MTA, whether or not to pull messages.

 - For each MT, whether or not to push messages, and under what conditions.