Scaling RabbitMQ

Jared W. Robinson

Vivint, Inc.

Who I am

What I believe

What's covered: 1 of 2

What's covered: 2 of 2

What's not covered

Notes

Excellent summary of the differences and trade-offs between RabbitMQ and Apache Kafka by Stuart Charlton, of Pivotal Software: http://www.quora.com/Which-one-is-better-for-durable-messaging-with-good-query-features-RabbitMQ-or-Kafka

a) Use Kafka if you have a fire hose of events (100k+/sec) you need delivered in partitioned order 'at least once' with a mix of online and batch consumers, you want to be able to re-read messages, you can deal with current limitations around node-level HA (or can use trunk code), and/or you don't mind supporting incubator-level software yourself via forums/IRC.

b) Use Rabbit if you have messages (20k+/sec) that need to be routed in complex ways to consumers, you want per-message delivery guarantees, you don't care about ordered delivery, you need HA at the cluster-node level now, and/or you need 24x7 paid support in addition to forums/IRC.

Basics

Notes

  • Rabbit's AMQP concepts: https://www.rabbitmq.com/tutorials/amqp-concepts.html
  • Glossary of terms: http://pythonhosted.org/nucleon.amqp/glossary.html

In the beginning

Notes

  • **** ASK: Who uses RabbitMQ?
  • **** ASK: Who uses work queues? What are they for?
  • Why is round-robin distribution bad?
  • Which part of the system should ideally persist messages? As much as possible, the original producers, or the leaves in the graph.

Did you know?

Latency

Latency & Prefetch

  conn = amqp.Connection("my.broker.com:5672", ...)
  channel = conn.channel()


  channel.basic_qos(
     prefetch_size=0, 
     prefetch_count=10, 
     a_global=False)
  

  channel.basic_consume("weather", ...)
      

Notes

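import amqplib.client_0_8 as amqp

# Connection() returns a connection object; channel methods live on a channel opened from it.
conn = amqp.Connection("my.broker.com:5672", "myuser", "mypassword")
channel = conn.channel()
# Allow at most 10 unacknowledged messages in flight at a time.
channel.basic_qos(prefetch_size=0, prefetch_count=10, a_global=False)

# my_callback and CONSUMER_TAG_CONST are defined elsewhere in the application.
channel.basic_consume("weather", no_ack=False, callback=my_callback, consumer_tag=CONSUMER_TAG_CONST)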
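To see why the prefetch setting matters for latency, here is a toy simulation in plain Python. It does not talk to RabbitMQ. The function names and the task durations are made up for illustration. It compares handing every message out round-robin up front (no prefetch limit) with giving a worker its next message only once it is idle (roughly what prefetch_count=1 with acks does).

```python
import heapq


def push_everything(durations, workers):
    """No prefetch limit: messages are dealt out round-robin before any work starts."""
    busy = [0] * workers
    for i, duration in enumerate(durations):
        busy[i % workers] += duration
    # The slowest worker decides when the last message is finished.
    return max(busy)


def one_at_a_time(durations, workers):
    """Prefetch of 1 with acks: a worker receives the next message only when idle."""
    free_at = [(0, w) for w in range(workers)]
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        available, worker = heapq.heappop(free_at)  # earliest idle worker
        done = available + duration
        finish = max(finish, done)
        heapq.heappush(free_at, (done, worker))
    return finish


durations = [10, 1, 10, 1, 10, 1]  # alternating slow and fast messages
print(push_everything(durations, 2))  # 30: one worker gets every slow message
print(one_at_a_time(durations, 2))    # 21: slow messages are spread out
```

With strict round-robin, one worker ends up with all the slow messages while the other sits idle. A real prefetch_count between these two extremes trades some of that fairness for fewer network round trips.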
    

Did you know?

Throughput & Speed

I benchmarked the maximum throughput of our hardware at around 10,000 messages/second with non-durable queues and non-persistent messages. Each message was smaller than 4 KB. YMMV.
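A back-of-the-envelope way to turn a benchmark figure like this into a shard count. This is only a sketch: the function name is hypothetical, the default of 10,000 messages/second is the benchmark number above, and the utilization factor (how much of the measured maximum you are willing to run at) is an assumption you should pick yourself.

```python
import math


def shards_needed(target_rate, per_broker_rate=10_000, utilization=0.7):
    """Estimate how many brokers are needed to carry target_rate messages/second.

    per_broker_rate is the measured maximum for one broker; utilization is the
    fraction of that maximum each broker is planned to run at (an assumption).
    """
    if target_rate < 0 or per_broker_rate <= 0 or not 0 < utilization <= 1:
        raise ValueError("rates must be positive and utilization in (0, 1]")
    return max(1, math.ceil(target_rate / (per_broker_rate * utilization)))


print(shards_needed(25_000, utilization=1.0))  # 3
print(shards_needed(30_000, utilization=0.5))  # 6
```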

Throughput needs

Throughput via sharding

Throughput: Are you my solution?

Consistent Hash Exchange

Notes

  • rabbitmq-plugins enable rabbitmq_consistent_hash_exchange
  • Use in combination with RabbitMQ clustering.
  • https://github.com/rabbitmq/rabbitmq-consistent-hash-exchange
  • http://rabbitmq.1065348.n5.nabble.com/Unexpected-Behavior-When-Using-the-quot-X-Consistent-Hash-quot-Exchange-Type-td30561.html

Consistent Hash Exchange configuration

$ umask 0022    # important if you're doing this as root
$ rabbitmq-plugins enable rabbitmq_consistent_hash_exchange
The following plugins have been enabled:
  rabbitmq_consistent_hash_exchange
  Plugin configuration has changed. Restart RabbitMQ for changes to take effect.

$ service rabbitmq-server restart

$ /usr/sbin/rabbitmq-plugins list -E

Consistent Hash Exchange configuration

Notes

  • Gotchas of using the exchange: http://rabbitmq.1065348.n5.nabble.com/Unexpected-Behavior-When-Using-the-quot-X-Consistent-Hash-quot-Exchange-Type-td30561.html
  • Port and firewall config: http://www.gettingcirrius.com/2013/01/configuring-iptables-for-rabbitmq.html

Documentation on the routing key:

The more points in the hash space each binding has, the closer the actual distribution will be to the desired distribution (as indicated by the ratio of points by binding). However, large numbers of points (many thousands) will substantially decrease performance of the exchange type.

Equally, it is important to ensure that the messages being published to the exchange have a range of different routing_keys: if a very small set of routing keys are being used then there's a possibility of messages not being evenly distributed between the various queues. If the routing key is a pseudo-random session ID or such, then good results should follow.
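The effect of the number of points per binding can be explored with a small hash ring in plain Python. This is a sketch of the general consistent-hashing idea, not the plugin's code: it uses MD5 as a stand-in hash (the plugin uses its own), and all names are hypothetical. Each queue gets as many points on the ring as its binding key says, and a message goes to the queue that owns the first point at or after the hash of its routing key.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value):
    # Stand-in hash for illustration; the real exchange uses a different function.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")


def build_ring(bindings):
    """bindings maps queue name -> number of points (the numeric binding key)."""
    ring = [(_hash("%s/%d" % (queue, i)), queue)
            for queue, points in bindings.items()
            for i in range(points)]
    ring.sort()
    return ring


def route(ring, routing_key):
    """Return the queue owning the first ring point at or after the key's hash."""
    idx = bisect.bisect_left(ring, (_hash(routing_key), ""))
    return ring[idx % len(ring)][1]  # wrap around past the last point


def distribution(bindings, routing_keys):
    ring = build_ring(bindings)
    return Counter(route(ring, key) for key in routing_keys)


if __name__ == "__main__":
    keys = ["session-%d" % i for i in range(10000)]
    for points in (2, 10, 1000):
        bindings = {"q%d" % n: points for n in range(4)}
        print(points, dict(distribution(bindings, keys)))
```

Running it with different point counts, or with only a handful of distinct routing keys, gives a feel for how the spread across queues changes. The exact numbers depend on the hash function, so they will not match what the real exchange does.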

Consistent Hash Exchange Uneven Distribution

Notes

  • The binding key for the queues to the exchange was "2"

Consistent Hash Exchange Uneven Distribution

Notes

  • The binding key for the queues to the exchange was "10"

Consistent Hash Exchange Caveats

rabbitmq-sharding

  • https://github.com/rabbitmq/rabbitmq-sharding/blob/master/README.extra.md

    rabbitmq-sharding configuration

    # Download the plugin from https://www.rabbitmq.com/community-plugins/v3.3.x/
    cp rabbitmq_sharding-3.3.x.ez \
      /usr/lib/rabbitmq/lib/rabbitmq_server-3.3.5/plugins/
    
    rabbitmq-plugins enable rabbitmq_sharding
    rabbitmq-plugins enable rabbitmq_consistent_hash_exchange
    
    service rabbitmq-server restart
    
    rabbitmqctl set_policy history-shard "^history" \
      '{"shards-per-node": 2, "routing-key": "1234"}' \
      --apply-to exchanges
    

    rabbitmq-sharding client usage

    rabbitmq-sharding caveats

    Did you know?

    Source: http://stackoverflow.com/questions/18531072/rabbitmq-by-example-multiple-threads-channels-and-queues

    Client-side sharding

    Client-side sharding

    Producers, Consumers, YAML

    Client-side sharding

    Benefits

    Client-side sharding monitoring

    Homegrown monitoring

    Client-side sharding Caveats

    Federated Queues

    Federated Queues image, from rabbitmq.com

    Notes

    https://www.rabbitmq.com/federated-queues.html

    Pitfall: federated queues "cannot currently cause messages to traverse multiple hops between brokers based solely on need for messages in one place. For example, if you federate queues on nodes A, B and C, with A and B connected and B and C connected, but not A and C, then if messages are available at A and consumers waiting at C then messages will not be transferred from A to C via B unless there is also a consumer at B."
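The multi-hop pitfall can be written down as a toy reachability check in plain Python. This is a simplified model for reasoning about a topology, not how the federation plugin is implemented: links are treated as bidirectional, the function name is hypothetical, and the only rule is that a message moves to a neighbouring broker only if that broker has a consumer.

```python
from collections import deque


def can_deliver(links, consumers, source, target):
    """Can messages sitting at `source` reach a consumer at `target`?

    links: iterable of (broker, broker) pairs that are federated with each other.
    consumers: set of brokers that currently have a consumer attached.
    """
    if target not in consumers:
        return False
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbours.get(node, ()):
            # Messages only move to a broker that has demand (a consumer).
            if nxt in consumers and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


links = [("A", "B"), ("B", "C")]
print(can_deliver(links, {"C"}, "A", "C"))       # False: no consumer at B
print(can_deliver(links, {"B", "C"}, "A", "C"))  # True: B pulls, then C pulls
```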

    Federated Queues

    Federated Queues image, from rabbitmq.com

    Better together?

    Did you know?

    Default limits.conf

    Default (insufficient) limits.conf for RHEL/CentOS/Fedora
    
    *  soft  nproc   1024 # threads/processes
    *  soft  nofile  1024 # Number of open files
    *  hard  nofile  4096 # Number of open files
    
    
    

    Notes

    See /etc/security/limits.d/90-nproc.conf and /etc/security/limits.conf. I don't see "nofile" set in any configuration file on RHEL/CentOS.

    limits.conf for RabbitMQ

    Here's what I'm using for /etc/security/limits.conf
    
    rabbitmq  soft  nproc   16384
    rabbitmq  hard  nofile  16000
    rabbitmq  soft  nofile  16000
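One way to confirm that limits like these actually took effect is to read them back with Python's standard resource module (Unix only). This is a sketch: the function names are hypothetical, the thresholds are simply the values from the limits.conf above, and the result describes the process that runs the script, so it needs to be run as the rabbitmq user to say anything about the broker.

```python
import resource


def meets(limit, required):
    """True if a limit value is unlimited or at least `required`."""
    return limit == resource.RLIM_INFINITY or limit >= required


def check_limits(required_nofile=16000, required_nproc=16384):
    """Report the current process's soft limits against the required values."""
    soft_nofile, hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nofile": (soft_nofile, hard_nofile, meets(soft_nofile, required_nofile)),
        "nproc": (soft_nproc, hard_nproc, meets(soft_nproc, required_nproc)),
    }


if __name__ == "__main__":
    for name, (soft, hard, ok) in check_limits().items():
        print("%-6s soft=%s hard=%s %s" % (name, soft, hard, "OK" if ok else "TOO LOW"))
```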
    

    Did you know?

    Notes

    • http://www.rabbitmq.com/ttl.html
    • https://www.rabbitmq.com/dlx.html

    Fail-over

    Transient vs Durable

    Did you know?

    Backpressure and RabbitMQ death

    Notes

    • http://stackoverflow.com/questions/10030227
    • http://www.rabbitmq.com/memory.html
    • https://www.rabbitmq.com/disk-alarms.html

    From the disk alarms documentation:
    RabbitMQ will block producers when free disk space drops below a certain limit. This is a good idea since even transient messages can be paged to disk at any time, and running out of disk space can cause the server to crash. By default RabbitMQ will block producers, and prevent memory-based messages from being paged to disk, when free disk space drops below 50MB. This will reduce but not eliminate the likelihood of a crash due to disk space being exhausted. In particular, if messages are being paged out rapidly it is possible to run out of disk space and crash in the time between two runs of the disk space monitor. A more conservative approach would therefore be to set the limit to the same as the amount of memory installed on the system (see the configuration below).

    ....By default 50MB is required to be free on the database partition.
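The two thresholds mentioned above (the 50MB default and the more conservative "same as installed memory") can be compared against the actual free space with a short standard-library check. This is a sketch: the function names are hypothetical, treating 50MB as 50,000,000 bytes is an assumption, the memory lookup via os.sysconf works on Linux, and the path to check should be whatever partition holds the RabbitMQ database.

```python
import os
import shutil

DEFAULT_LIMIT_BYTES = 50 * 1000 * 1000  # assumed byte value for the "50MB" default


def would_block(free_bytes, limit_bytes=DEFAULT_LIMIT_BYTES):
    """True if free space is below the limit, i.e. producers would be blocked."""
    return free_bytes < limit_bytes


def installed_memory_bytes():
    """Physical memory size, used as the conservative disk limit (Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def check_path(path, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return (free_bytes, blocked) for the partition containing `path`."""
    free = shutil.disk_usage(path).free
    return free, would_block(free, limit_bytes)


if __name__ == "__main__":
    for label, limit in (("default", DEFAULT_LIMIT_BYTES),
                         ("conservative", installed_memory_bytes())):
        free, blocked = check_path("/", limit)
        print("%s: free=%d limit=%d blocked=%s" % (label, free, limit, blocked))
```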

    Possible method to prevent RabbitMQ from crashing: add the following setting to the rabbit section of /etc/rabbitmq/rabbitmq.config:
    {vm_memory_high_watermark_paging_ratio, 1.1}

    Reference: http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2013-September/030458.html

    Grouping messages

    Summary 1 of 2

    Summary 2 of 2

    References