Joy Gao
2018-09-06 17:53:21 UTC
Hi all,
We are fairly new to Cassandra. We began looking into the CDC feature
introduced in 3.0. As we spent more time looking into it, the complexity
began to add up (i.e. duplicated mutation based on RF, out of order
mutation, mutation does not contain full row of data, etc). These
limitations have already been mentioned in the discussion thread in
CASSANDRA-8844, so we understand the design decisions around this. However,
we do not want to push solving this complexity to every downstream
consumers, where they each have to handle
deduping/ordering/read-before-write to get full row; instead we want to
solve them earlier in the pipeline, so the change message are
deduped/ordered/complete by the time they arrive in Kafka. Dedupe can be
solved with a cache, and ordering can be solved since mutations have
timestamps, but the one we have the most trouble with is not having the
full row of data.
We had a couple discussions with some folks in other companies who are
working on applying CDC feature for their real-time data pipelines. On a
high-level, the common feedback we gathered is to use a stateful processing
approach to maintain a separate db which mutations are applied to, which
then allows them to construct the "before" and "after" data without having
to query the original Cassandra db on each mutation. The downside of this
is the operational overhead of having to maintain this intermediary db for
CDC.
We have an unconventional idea (inspired by DSE Advanced Replication) that
eliminates some of the operational overhead, but with tradeoff of
increasing code complexity and memory pressure. The high level idea is a
stateless processing approach where we have a process in each C* node that
parse mutation from CDC logs and query local node to get the "after" data,
which avoid network hops and thus making reading full-row of data more
efficient. We essentially treat the mutations in CDC log as change
notifications. To solve dedupe/ordering, only the primary node for each
token range will send the data to Kafka, but data are reconciled with peer
nodes to prevent data loss.
We have a* WIP design doc
<https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura>* that goes
over this idea in details.
We haven't sort out all the edge cases yet, but would love to get some
feedback from the community on the general feasibility of this approach.
Any ideas/concerns/questions would be helpful to us. Thanks!
Joy
We are fairly new to Cassandra. We began looking into the CDC feature
introduced in 3.0. As we spent more time looking into it, the complexity
began to add up (i.e. duplicated mutation based on RF, out of order
mutation, mutation does not contain full row of data, etc). These
limitations have already been mentioned in the discussion thread in
CASSANDRA-8844, so we understand the design decisions around this. However,
we do not want to push solving this complexity to every downstream
consumers, where they each have to handle
deduping/ordering/read-before-write to get full row; instead we want to
solve them earlier in the pipeline, so the change message are
deduped/ordered/complete by the time they arrive in Kafka. Dedupe can be
solved with a cache, and ordering can be solved since mutations have
timestamps, but the one we have the most trouble with is not having the
full row of data.
We had a couple discussions with some folks in other companies who are
working on applying CDC feature for their real-time data pipelines. On a
high-level, the common feedback we gathered is to use a stateful processing
approach to maintain a separate db which mutations are applied to, which
then allows them to construct the "before" and "after" data without having
to query the original Cassandra db on each mutation. The downside of this
is the operational overhead of having to maintain this intermediary db for
CDC.
We have an unconventional idea (inspired by DSE Advanced Replication) that
eliminates some of the operational overhead, but with tradeoff of
increasing code complexity and memory pressure. The high level idea is a
stateless processing approach where we have a process in each C* node that
parse mutation from CDC logs and query local node to get the "after" data,
which avoid network hops and thus making reading full-row of data more
efficient. We essentially treat the mutations in CDC log as change
notifications. To solve dedupe/ordering, only the primary node for each
token range will send the data to Kafka, but data are reconciled with peer
nodes to prevent data loss.
We have a* WIP design doc
<https://wepayinc.box.com/s/fmdtw0idajyfa23hosf7x4ustdhb0ura>* that goes
over this idea in details.
We haven't sort out all the edge cases yet, but would love to get some
feedback from the community on the general feasibility of this approach.
Any ideas/concerns/questions would be helpful to us. Thanks!
Joy