Given that there was no further development of this spec by any of the contributors since it was posted, I'm tagging it as "deleted".
I'm concerned that some of the layers here duplicate work already done by IP routers, ICMP, UDP and TCP.
Protocols shouldn't re-invent things which already work (well), but leverage them.
Perhaps I need more information and look forward to discussing.
Change management process
Problem statement, goals, vision
The AMQP protocol development process suffers from a lack of cross-team organisation, which is most visible in the lack of progress made on most fronts, alternated with large, unclear changes in some areas.
The goal of this change management process is to:
- Ensure that changes are properly scoped into manageable units of work;
- Ensure that these units of work are properly documented and recorded;
- Ensure that this work is either carried out as planned, or stopped for clear reasons;
- Allow the PMC to have a meaningful overview of all work in progress;
- Allow outside teams to participate by submitting requests and bugs;
- Ensure that we can provide a change log for protocol releases.
The vision is one in which changes are more formally scoped, better managed, and can thus flow more predictably and smoothly from problem statement to tested proposal.
The cost of changing the protocol must be reduced, so that it becomes attractive for teams to contribute. Today the cost is unreasonably high and this is preventing active participation from all protocol developers.
Change scoping and types of change
Unclear scope is the biggest risk factor in change. If the scope of a unit of work is not very clearly defined, it becomes unreasonably expensive to manage the change. For software development, we tend to restrict the scope of a change by tying it to a specific issue:
- A bug
- A feature request
- A refactoring
Good developers learn to do one or the other of these, never to mix them. We should apply the same discipline to protocol development. That is, a protocol change should fall into exactly one of these categories, never more than one.
The Change Manager
The responsibilities within the protocol development workgroup do not need to be formalised except for one position, namely the Change Manager (CM). Note that other responsibilities such as 'Editor' and 'Chair' may be formalised but this is beyond the scope of this present proposal.
The CM supervises the change management process and reports to the PMC. The CM has the authority to reject any change proposal on process-quality grounds: for example, a change proposal that is not clearly scoped, or that does not follow the agreed process.
Automation and communications
The principal tool used for automating the change process is Jira, an issue tracking system. Jira has the advantage of collecting comments on a per-issue basis, avoiding the problem of "loose emails". Additional wikis and documents can be used to support the Jira-based documentation.
Workflow design principles
The goal of the workflow design is to keep it very simple to make changes, with a minimum of traffic by email and decisions by the PMC. At the same time, we assume that peer-review of requests and proposals will lead to quality solutions. That is, we promote clarity and communications above formality. However, this process is meant to be improved, and has probably been over-simplified to work as an initial version.
- Any developer can create a new issue, which is one of (a) a bug, (b) a feature request, (c) a refactoring request.
- The CM accepts a new issue on behalf of PMC, if the issue meets basic quality standards for explanation, clarity, etc.
- When an issue is accepted, Jira will send an email to the dev list. At this stage other developers can comment on the issue, and challenge its validity, utility, etc.
- The CM assigns new issues to workgroups when the responsibilities are clear. When not, the CM will hold the issue until the PMC can decide on whether a new workgroup must be started or not.
- The workgroup has unlimited time to answer the issue. However, all open issues will be reviewed at each PMC meeting, and issues that are not resolved within 3-4 weeks will be reassigned, closed, or suspended.
- When the workgroup has answered the issue, the CM will review the solution and if it meets basic quality standards for documentation and argumentation, the CM will resolve the issue.
- When an issue is resolved, Jira will send an email to the dev list.
- The CM will present all resolved issues to the PMC, who will vote on changes following a 2-3 week review period.
- When the PMC accepts a change, the CM will close the issue.
- The CM can also suspend, cancel, reopen issues as needed.
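The issue lifecycle described in the bullets above can be sketched as a small state machine. This is only an illustration of the transitions implied by the text; the state names and the `transition` helper are invented for the example, not part of any agreed tooling or Jira configuration.

```python
# Sketch of the issue lifecycle described above (illustrative names only).
ALLOWED = {
    "new":       {"accepted"},                      # CM accepts on behalf of PMC
    "accepted":  {"assigned", "held"},              # CM assigns, or holds for PMC
    "held":      {"assigned"},                      # PMC decides on a workgroup
    "assigned":  {"resolved", "suspended", "cancelled", "assigned"},  # reassignment allowed
    "resolved":  {"closed", "assigned"},            # PMC accepts, or sends back for rework
    "suspended": {"assigned", "cancelled"},
    "closed":    {"assigned"},                      # CM can reopen as needed
    "cancelled": set(),
}

def transition(state, new_state):
    """Return the new state, or raise if the move is not allowed."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"cannot move issue from {state!r} to {new_state!r}")
    return new_state

# A typical happy path: create, accept, assign, resolve, close.
state = "new"
for step in ("accepted", "assigned", "resolved", "closed"):
    state = transition(state, step)
```

Encoding the transitions this way makes the CM's role mechanical to check: any move not in the table is a process violation.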
We prefer the term "workgroup" to "sig", as it is more accurate. A properly constituted workgroup MUST contain developers of at least two competing AMQP implementations. The reason for this is to ensure that challenge and argumentation happens early in the process, not when solutions are presented for voting.
These are the basic quality rules that the CM is expected to enforce, with PMC backing:
- Only meaningful issues are entered into the Jira.
- Issues are properly specified as problems, not solutions.
- Workgroups are properly constituted.
- Workgroups make proper minutes of meetings.
- Solution proposals are properly argued and documented.
- Issues are handled by workgroups within a reasonable period.
Documentation should be a workgroup, not a responsibility. That is, there should be a workgroup of "editors" who can handle documentation issues.
Chair - elected representative of the WG as a whole. It is an elected position; election is held once every six months, and the chair may be re-elected. (1 person)
Process manager - person responsible for guarding and enforcing formal process within the WG. It is an elected position. Election takes place either when process manager in charge resigns or when PMC is unhappy with process manager's performance. (1 person)
Documentation manager - person responsible for incorporating changes to AMQP documents. It is an elected position. Election takes place either when documentation manager in charge resigns or when PMC is unhappy with documentation manager's performance. (1 person)
Issue manager - person responsible for work on a specific issue. The issue manager is selected by the PMC from developers volunteering to do the task. If there is no volunteer, the representative that brought the issue to the PMC is obliged to become the issue manager. (1 person per issue solved)
Representative - each party joined in the WG has a single representative with the right to vote, make decisions, etc. How to select the representative is an internal matter for each member party. (1 person per WG member)
Developer - person assigned to work on AMQP by WG member or reviewer. Developer is not necessarily an employee of member party that assigned him/her. PMC MAY remove a person from developer list either if he/she is not active on AMQP for a long time ("zombie developer") or if he/she deliberately and consistently ignores the process or if the member party that assigned him/her asks for the removal explicitly – for example when the developer quits working for the member. (Unlimited number of developers is allowed.)
When there is a single person responsible (process manager, documentation manager, issue manager, representative), he/she should choose a substitute in advance who will be in charge when he/she is ill, on leave, etc. Transfer of the position from the original official to the substitute, and vice versa, MUST be announced publicly (on the mailing list) by either the official or, if the original official is unable to do it personally, by the substitute. The substitute assumes all the rights and responsibilities of the person being substituted once the public announcement is sent. Once the original person is available again and sends a public announcement about it, he/she automatically resumes all the rights and responsibilities.
1. Statement of problem is submitted by a representative. [2]
3. The representative that posted the statement of problem has the right to move the issue to the PMC for confirmation.
4. If an issue is not passed to the PMC within 1 month [4], the process manager SHOULD mark it as rejected.
5. PMC confirms that the problem falls into the scope of AMQP and should be worked on. Confirmation is based on a majority vote [5]. If the PMC doesn't confirm the problem, it MUST be rejected. The PMC also assigns the issue to a chosen volunteer, thus making him/her the issue manager, and sets a deadline for its completion.
7. Issue manager writes a formal proposal based on the consensus reached in discussion.
10. When consensus is reached, the issue manager incorporates the changes into the formal proposal and passes the solution to the PMC.
11. PMC votes on the proposal. Acceptance is based on a majority vote [8]. If the proposal fails, it MUST be rejected.
12. Documentation manager incorporates accepted proposals into the documentation [9]. This SHOULD be done within 1 week.
[1] Issue workflow is a linear process. There are no loops in it, so it is impossible for any party to slow down development by reiterating the process. Each step in the process has a clearly defined deadline. A deadline can be changed by PMC vote if convincing reasons are presented. It follows that each issue is either rejected or solved in the time determined in advance. Also, each step of the process has a single person responsible for it. This avoids the problem of shared responsibility: if something goes wrong, there is a single person responsible for it. Failure will thus result in loss of credibility, and the PMC will obviously be hesitant about assigning new tasks to the same person.
[2] In fact, statements of problem MAY come from any developer or from the public; however, submission must be done by a representative so that there is a single person responsible for the issue.
[3] Each round of discussion MUST take at least 1 week; it may take longer, but the issue deadline should be kept in mind. The issue manager is obliged to report on the progress of the issue to the PMC and the WG as a whole at least once a week (in case discussion lasts longer than 1 week). Also note that there are 3 rounds of discussion in the workflow. This mirrors the existing 3-week process.
[4] As the statement of problem hasn't passed through the PMC yet, there is nobody to set a deadline for it. Therefore a one-month period is used as the default deadline.
[5] Each vote results either in the issue being moved to another step of the process or in rejecting it altogether. This seems dangerous, but NB that the issue is proposed for voting by the party that advocates it, so it is in their best interest not to move controversial issues to a vote until group consensus is reached.
[6] See note 3.
[7] See note 3.
[8] See note 4.
[9] There is no formal reviewing step for the documentation. Typos can be corrected using the rapid editorial process. If the proposal introduced into the documentation does not match the one accepted by the PMC, the issue should be brought back to the PMC either by the process manager or by a representative.
Rapid editorial process
- Editorial changes like corrections of typos and obvious errors should be passed to PMC.
- They are voted on en masse on the next PMC meeting.
- If at least one representative objects to a specific editorial change, the change MUST be rejected, but it may be entered into the process once more as a statement of problem (following the standard process).
Process manager MUST watch all the WG traffic and raise the alarm if the process is not being observed.
Process manager MUST keep all the deadlines under control.
Process manager MAY warn in advance if a specific issue looks like it won't meet its deadline.
If the process is not observed or a deadline is not met, the process manager MUST pass the issue to the PMC. This MUST be done publicly to allow everyone to monitor the work of the process manager.
PMC either warns the issue manager and asks him/her to get the things right or reassigns the issue to different person.
PMC and WG as a whole should observe the work of process manager and complain if there are problems.
PMC can change process manager at any point if he/she doesn't fulfil his/her duties well.
Documentation manager MUST incorporate all the proposals accepted by the PMC into the documentation within 1 week.
Documentation manager SHOULD try to keep original wording of the proposal intact.
Process manager and WG as a whole should monitor documentation manager's work and report any problems to the PMC.
PMC can change documentation manager at any point if he/she doesn't fulfil his/her duties well.
On 11/24/06, Steven Shaw wrote:
I thought you guys at iMatix had already implemented clustering for HA
We have implemented the clustering model described on wiki.amqp.org but we're doing transient messaging, so no persistence, no transactions.
We're using AMQP like IP, and the traffic-control layers are implemented at a higher level (in the API framework at the application side). This lets us do very rapid (and simple) transient messaging for market data, and also end-to-end transactions for order processing.
This is a key question about AMQP: does the broker (a) need to act as a guaranteed data store, or does it (b) need only to act as a message switch. I'm well aware that probably everyone on this list thinks the answer is obviously (a). But what this obliges us to do is define a full data-safe HA model, and as we've seen this has serious complexifying effects on the protocol.
I do not think we have considered this basic question sufficiently. We have assumed that the classic MQ Series model of a "mainframe broker" is the only plausible one. Note that even MQ Series still sometimes drops messages, and that HA clustering is probably one of the most complex problems for any database or middleware. Do we really want to solve this problem? Is this even a problem we should be solving?
I'd like to suggest an alternate, simpler vision for an AMQP network, based on our real experience with large-scale deployments.
First, the protocol would be stripped down to remove all persistence and transaction-oriented functionality. These could be moved to optional classes (content classes), but for my proposal, they are irrelevant.
Second, all brokers are seen as fully transient black boxes, where their only durable data is a security profile. All broker wiring and data is transient. This is largely our current philosophy, but it can be reinforced.
Thirdly, brokers create HA pairs (using our proven design). A HA pair appears as a single broker (perhaps with two IP addresses) to the outside world. HA failover and recovery is done by a dialogue between the HA pair.
Fourthly, HA pairs (or stand-alone brokers) can be internetworked into large architectures, to allow geographical distribution and (more importantly) high-volume fanout. Clients connect to a well-specified local broker/HA pair.
So far this gives us a very simple and scalable model, with brokers that can be cast into hardware, and where the reliability of individual pieces increases to the point where failure is extremely rare. Something like a modern IP network.
Next, application frameworks implement reliability on top of that architecture, using proven traffic-control mechanisms, namely acknowledgement and retry. The entire TC layer is point-to-point and 100% ignorant of the network architecture, and HA configuration. TC is only used for those parts of the work that need it.
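The end-to-end acknowledgement-and-retry mechanism described above can be sketched in a few lines. This is a toy simulation under stated assumptions: `LossyLink`, `publish_reliably`, and the retry policy are invented for illustration and are not part of AMQP or any framework mentioned here.

```python
import itertools

# Minimal end-to-end traffic control: the sending application retries
# until the receiving application acknowledges, regardless of what the
# (transient) brokers in between do with the message.

class LossyLink:
    """Simulated broker path that drops the first few messages."""
    def __init__(self, drop_first_n):
        self.drops_left = drop_first_n
        self.delivered = []

    def send(self, msg):
        if self.drops_left > 0:
            self.drops_left -= 1
            return None                      # message lost in transit
        self.delivered.append(msg)
        return ("ack", msg["seq"])           # receiver acknowledges end-to-end

def publish_reliably(link, body, max_retries=5):
    """Resend until the peer acks or we give up; returns attempts used."""
    for attempt in itertools.count(1):
        ack = link.send({"seq": 1, "body": body})
        if ack == ("ack", 1):
            return attempt
        if attempt >= max_retries:
            raise RuntimeError("peer never acknowledged")

link = LossyLink(drop_first_n=2)
attempts = publish_reliably(link, b"order-123")
```

The point of the sketch is that the TC layer needs nothing from the brokers beyond best-effort delivery: the sequence number and ack travel end to end, so the network in between can stay fully transient.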
This "thin-AMQP" approach gives us some significant advantages:
1. We can simplify the protocol, and finish the basic WLP work rapidly.
2. We can simplify the HA question, which we've already solved & proven.
3. We can experiment with different TC mechanisms *without* affecting AMQP.
4. Brokers will be simpler, so more reliable.
5. High-performance scenarios can ignore the TC layer.
Given the significant advantages of such an approach, and given that it is very close to what we are doing in production at JPMC, and given that it's close to the IP/TCP model, I'd like to ask why we're not going in that direction, instead of the direction of more complexity?
Note that vendors can still differentiate themselves, by making brokers that are faster, easier to administrate, and run on smaller boxes. They can define better TC layers, even proprietary ones. And because the protocol would be simpler, it would get adoption much faster.
I could go on with this… but perhaps someone can falsify my basic assumption, which is that TC can be done between the end-layers.
I think that you will need an error code in the returned message. Take the following 2 use cases:
1) Publisher sends a message then gets it back. You have no idea why. I guess you know there is a range of reasons:
queue not declared, wrong routing key, no subscribers (if immediate), expired message (with TTL).
I could imagine a client app where the end user types in a value that is used as the routing key. If they get the value wrong, then there is no way to tell them they made a mistake.
2) Later when security is introduced so publishers have to be authenticated to send to a particular queue:
In a middleware layer without error codes, using a wrong routing key that happens to exist, but to which the publisher has no rights, could not be interpreted correctly.
The returned message may indicate that there is a configuration problem, i.e. the publisher has not been granted publish rights, or that the routing key is wrong.
However, the middleware would have to assume that it came back because it had expired or had no subscribers, and so may simply resend it, which would waste a lot of bandwidth.
The AMQP documentation likes to compare itself to SMTP; there, the headers of messages are modified as they pass through each MTA, so the routing information can be used in processing.
Perhaps we don't want to expose the routing information to the consumers (of which the producer is also one) but knowledge about the processing could be useful to the producer to understand the reason for failure. Indeed other features such as billing and SPAM detection(perhaps this is really unlikely in the real world but it might be an issue) could benefit from routing information.
Yes, you are right and it is a good point. Not having a chained structure (where each exchange can be linked to another for 'return' processing) does prevent a multi-stage approach to handling 'returns'.
Yes, it's a philosophical discussion so we may never agree on how it should work. However, consider Robert's use case where a message is delivered to a primary service; if that one is not available, it is delivered to a secondary service, and if even that one is not available, it is delivered back to the sender. This scenario is possible to implement with the original proposal, but not when the reject exchange is specified in the message.
The publish method contains an exchange to use for delivery, if you want to change that exchange then you need to reconfigure your sender (or maybe change the type of the exchange). It does not seem unreasonable to me for the same approach to be applied to the exchange to use for handling delivery failures. You still decouple the producer of the message from any consumer which is the ultimate goal. I think we have to agree to disagree.
Routing_key is _used_ for routing. However, it does not have any strictly given routing semantics as say "deliver message to queue specified in routing key". Same applies to reject_key. Thus, if we need to deliver messages somewhere else we can do that by simply rewiring without need to modify the sender application.
'reject-exchange' field in Basic.Publish (or in the headers field, it doesn't matter) does have such strict routing semantics: "if the message isn't matched by any binding, deliver it to such and such exchange". So if you want to change the routing mechanism, rewiring isn't enough - you have to modify the sender application.
Re: the exchange being deleted already, that is a good point. However, there is no guarantee there anyway. I'd suggest it is an uncommon scenario and can be handled by routing the message to the default amq.return (of whatever type that is).
Re: the philosophical point, the routing key seems to have as its primary purpose the routing mechanism along with the exchange, both of which are specified in publication. You are also proposing setting the reject_key to allow the publisher to influence the routing of the message once it is deemed undeliverable. I don't see a great difference in allowing the reject_exchange to be specified. However, I think Alan's suggestion is better anyway. Other than the infinite looping fear, is there any other objection to that?
"But why not just deliver them to the reject-exchange for the exchange to which the message was published? I don't see why a separate exchange is required."
The exchange may already not exist at the time the queue is deleted.
That seems easy enough to prevent: if the reject-exchange is me and I don't have a matching binding, drop it.
I haven't wanted to go into this again, as I hoped I would be able to convince you with non-philosophical arguments :), but OK. This is in fact a philosophical question. The protocol tries to separate the sender of a message from routing specifics. I.e. the sender publishes a message and sets its fields, but none of the data he has set affects the routing mechanism in any strictly given way. How a message is routed is specified by wiring. Adding reject-exchange to Basic.Publish can be thought of as overriding the primary exchange's routing mechanism _by the sender_ - thus violating the above principle.
"Separate exchange for queue is required to implement DLQs. Messages stored in queue in the
moment the queue is deleted should be delivered to DLQ."
But why not just deliver them to the reject-exchange for the exchange to which the message was published? I don't see why a separate exchange is required.
"Error codes in rejected message: I personally don't like idea of modifying the message while being routed"
I do see your point on this. I'm not sure one way or the other. Lets leave it unless anyone comes up with a convincing case.
"Imagine rejected message is sent to reject-exchange where it is rejected once more! Reject-exchange would send it to itself, thus creating infinite loop."
That seems easy enough to prevent: if the reject-exchange is me and I don't have a matching binding, drop it.
Separate exchange for queue is required to implement DLQs. Messages stored in queue in the moment the queue is deleted should be delivered to DLQ.
Making the reject exchange use the topic exchange matching algorithm: I haven't seen a use case that would require such semantics, but maybe. Don't know.
Need for a separate exchange type and the reject_key: This is not meant as something you MUST use. The whole reject_key/reject-exchange concept is meant only as something to make life easier for naive users. You can do as well without it.
Error codes in rejected messages: I personally don't like the idea of modifying the message while it is being routed. So far we don't do that. If you are interested in whether a message was rejected by the exchange or by the queue, you can use a separate reject exchange for the exchange and the queue.
Imagine a rejected message is sent to the reject-exchange, where it is rejected once more! The reject-exchange would send it to itself, thus creating an infinite loop.
I like that! A little bit less explicit, but it sets a good example for future changes in having no impact on the wire format.
Here's a proposal: If a message is to be rejected, brokers and/or queues look for "x-reject-exchange" in the field table. If it is found, then forward to that exchange. If not, drop the message. No other changes needed.
Now supposing I actually want reject-key behavior as proposed above. Then I do the following
* Create an ordinary header exchange called "reject"
* Publishers create their personal queue and bind to "reject" exchange with "reject-key=mykey"
* Publishers send messages with "x-reject-exchange=reject, reject-key=mykey"
In other words the desired behaviour can be wired together using existing exchange types. I think all the other scenarios discussed on this list can also be wired together by the consumer given this single degree of flexibility. For example Gordon would like rejects to go to a topic exchange - no problem. The nice thing here is that the reject behavior is determined per message so the consumer can make individual rejected messages go literally anywhere that AMQ can take them.
We could put reject-exchange into basic.publish, I'm not too pushed either way.
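The wiring described in this proposal can be simulated in a few lines to check that it behaves as claimed, including the loop-prevention rule suggested earlier in the thread ("if the reject-exchange is me and I don't have a matching binding, drop it"). Everything here is a toy model: the `Exchange` class, its single-queue bindings, and the dict-based messages are illustrative assumptions, not any broker's API.

```python
# Toy model of the 'x-reject-exchange' proposal: on failure to route,
# look up the field table and forward to the named exchange; else drop.

class Exchange:
    def __init__(self, name, exchanges):
        self.name = name
        self.exchanges = exchanges      # shared name -> Exchange table
        self.bindings = {}              # binding key -> queue (a plain list)

    def bind(self, key, queue):
        self.bindings[key] = queue

    def publish(self, msg, routing_key):
        queue = self.bindings.get(routing_key)
        if queue is not None:
            queue.append(msg)
            return "delivered"
        # Unroutable: forward to x-reject-exchange from the field table.
        target = msg["headers"].get("x-reject-exchange")
        if target is None or target == self.name:
            return "dropped"            # loop prevention: never reject to self
        # The reject exchange matches on the message's reject-key header.
        return self.exchanges[target].publish(
            msg, msg["headers"].get("reject-key"))

exchanges = {}
exchanges["amq.direct"] = Exchange("amq.direct", exchanges)
exchanges["reject"] = Exchange("reject", exchanges)

returned = []                           # the publisher's personal queue
exchanges["reject"].bind("mykey", returned)

msg = {"headers": {"x-reject-exchange": "reject", "reject-key": "mykey"},
       "body": b"hello"}
outcome = exchanges["amq.direct"].publish(msg, "no-such-key")
```

The unroutable publish comes back to the publisher's own queue via the "reject" exchange, and a message rejected by the reject exchange itself is simply dropped, so no infinite loop is possible.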
Is there a need for a separate reject exchange for a queue? I can see that rejection is a consumer activity, and the consumer is perhaps more interested in the queue than the exchange. However the message being rejected was published through a particular exchange and the publisher may also wish to handle rejections.
It seems simpler to me to just define a reject exchange for an exchange, not for a queue.
Semantics similar to the topic exchange might be better for the reject exchange type as it would allow wildcard matching which could be useful for catch-all bindings.
I'm still a little unconvinced on the need for a separate exchange type and the reject_key. As the publisher sets the reject_key, why not just allow them to override the reject-exchange at that level if required? We could then I suspect tackle most use-cases with the existing set of exchange types and without the need for a special header.
Some headers indicating (a) that a message was rejected or unroutable and (b) the reasons were discussed at the face-to-face. I can see value in those, but perhaps they could be put in the field table (with the reserved 'x-' prefix)?
I've written a wiki page with all use cases for exchanges I am aware of. Please, feel free to add your own use cases, so that we can base exchange design on them. Thanks.
I realised that 'binding-priority' proposal and 'return' exchange proposal aren't mutually exclusive. They are in fact two solutions, each for one side of message rejection problem.
'binding-priority' is a solution for "What is the algorithm for messages getting rejected within an exchange?"
'return' exchange is a solution for "What's the most common message-returning scenario?"
In other words, 'binding-priority' allows us to use any message-returning scenario, whereas 'return' exchange defines most common one, one that should be made default in the same way as each queue is bound to 'default' exchange when created.
Modifying messages on refcounting exchanges can be done efficiently using some variation of the copy-on-write pattern. "Efficiently" in the sense that it doesn't force additional copies in no-modify cases. It is more complicated than simple refcounts but that's an implementor's problem. It shouldn't affect decisions on how to do things in the spec.
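A copy-on-write message on a refcounting broker can be sketched as follows. The `Message` class is an illustration of the pattern only, under the assumption of single-threaded refcounting; it is not any particular implementation.

```python
class Message:
    """Refcounted message with copy-on-write header modification."""
    def __init__(self, body, headers=None):
        self.body = body                # shared, never-copied payload
        self.headers = headers or {}
        self.refcount = 1

    def share(self):
        """Cheap hand-off to another queue: bump the refcount, no copy."""
        self.refcount += 1
        return self

    def with_header(self, key, value):
        """Set a header; copy only if someone else still references us."""
        if self.refcount == 1:
            self.headers[key] = value   # sole owner: mutate in place
            return self
        self.refcount -= 1              # split off a private copy
        copy = Message(self.body, dict(self.headers))
        copy.headers[key] = value
        return copy

m = Message(b"payload")
shared = m.share()                      # two logical references now
modified = shared.with_header("x-rejected-by", "amq.direct")
```

Only the small header dict is ever duplicated, and only on the modify path, which is the sense in which the no-modify case stays free of extra copies.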