Completely agree. I'm not familiar with the Calvin protocol, but it sounds
interesting (reading time...).
Post by Joy Gao
Thank you all for the feedback so far.
The immediate use case for us is setting up a real-time streaming data
pipeline from C* to our data warehouse (BigQuery), where other teams can
access the data for reporting, analytics, and ad-hoc queries. We already do
this with MySQL, where we stream the MySQL binlog to Kafka via Debezium's
(https://debezium.io) MySQL connector, and then use a BigQuery sink
connector to stream the data to BigQuery.
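For concreteness, here is a minimal sketch of the kind of connector
registration that pipeline involves (not our actual config; hostnames,
credentials, and database/topic names are placeholders, and the exact
property names vary with the Debezium version):

  {
    "name": "mysql-cdc-connector",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "mysql.internal.example.com",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "<secret>",
      "database.server.id": "184054",
      "database.server.name": "mysql-prod",
      "database.whitelist": "app_db",
      "database.history.kafka.bootstrap.servers": "kafka:9092",
      "database.history.kafka.topic": "schema-changes.app_db"
    }
  }

Each changed row then lands on a Kafka topic named after its table, which
the BigQuery sink consumes.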
Re Jon's comment about why not write to Kafka first: in some cases that
may be ideal, but one potential concern we have with writing to Kafka first
is losing "read-after-write" consistency. The data could be written to
Kafka but not yet consumed by C*. If the web service issues a (quorum)
read immediately after the (quorum) write, the data being returned
could still be outdated if the consumer has not caught up. Having the web
service interact with C* directly solves this problem for us (we could add
a cache before writing to Kafka, but that adds operational
complexity to the architecture; alternatively, we could write to Kafka and
C* transactionally, but distributed transactions are slow).
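To make the consistency point concrete, here is a minimal sketch with the
DataStax Java driver (3.x-era API; the keyspace, table, and values are
hypothetical). With QUORUM on both sides, the read replica set must
overlap the write replica set (R + W > N), so the read sees the write; a
Kafka hop in between gives no such guarantee:

  import com.datastax.driver.core.Cluster;
  import com.datastax.driver.core.ConsistencyLevel;
  import com.datastax.driver.core.Row;
  import com.datastax.driver.core.Session;
  import com.datastax.driver.core.SimpleStatement;

  public class ReadAfterWrite {
    public static void main(String[] args) {
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("app"); // hypothetical keyspace

      // Write at QUORUM: a majority of replicas acknowledge the mutation.
      SimpleStatement write = new SimpleStatement(
          "UPDATE users SET email = ? WHERE id = ?", "a@example.com", 42);
      write.setConsistencyLevel(ConsistencyLevel.QUORUM);
      session.execute(write);

      // Read at QUORUM immediately after: the read quorum overlaps the
      // write quorum, so the new value is guaranteed to be visible.
      SimpleStatement read = new SimpleStatement(
          "SELECT email FROM users WHERE id = ?", 42);
      read.setConsistencyLevel(ConsistencyLevel.QUORUM);
      Row row = session.execute(read).one();
      System.out.println(row.getString("email"));

      cluster.close();
    }
  }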
Having the ability to stream its data to other systems could make C*
more flexible and more easily integrated into a larger data ecosystem. As
Dinesh has mentioned, implementing this in the database layer means there
is a standard approach to getting a change notification stream (unlike
triggers, which are ad hoc and customized). Aside from replication, the change
events could be used for updating Elasticsearch, generating derived views
(e.g. for reporting), sending to an audit service, sending to a
notification service, and, in our case, streaming to our data warehouse for
analytics. (One article that goes over database streaming is Martin
Kleppmann's "Turning the Database Inside Out with Apache Samza",
which seems relevant here.) For reference, turning the database into a
stream of change events is pretty common in SQL databases (e.g. MySQL
binlog, Postgres WAL) and in NoSQL databases with a primary-replica setup
(e.g. MongoDB oplog). Recently CockroachDB introduced a CDC feature as well
(and they have masterless replication too).
Hope that answers the question. That said, dedupe/ordering/getting the full
row of data via C* CDC is a hard problem, but it may be worth solving for
the reasons mentioned above. Our proposal is a user-space approach to solving
these problems. Maybe the more sensible thing to do is to build it as part of
C* itself, but that's a much bigger discussion. If anyone is building a
streaming pipeline for C*, we'd be interested in hearing your approaches.
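(For anyone who wants to experiment with the raw feature: C* 3.8+ ships CDC
behind a yaml flag plus a per-table property; the keyspace/table names below
are just examples.)

  # cassandra.yaml
  cdc_enabled: true

  -- CQL, per table:
  ALTER TABLE app.users WITH cdc = true;

Flushed commit log segments containing CDC-enabled mutations are then made
available under cdc_raw_directory for a consumer to process, which is
exactly where the dedupe/ordering problems above begin.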
On Tue, Sep 11, 2018 at 7:01 AM Rahul Singh wrote:
Post by Rahul Singh
You know what they say: Go big or go home.
Right now the candidates are Cassandra itself (but embedded, or on the side,
not on the actual data clusters), ZooKeeper (yuck), Kafka (which needs
ZooKeeper, yuck), and S3 (outside service dependency, so no go).
Jeff, those are great patterns, especially the second one. I have used it
several times. Cassandra is a great place to store data in transit.
Also, using Calvin means having to implement a distributed monotonic
sequence as a primitive, which is not trivial at all...
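To illustrate why it's not trivial: the closest thing C* itself gives you is
a CAS loop over lightweight transactions, roughly like the sketch below
(DataStax Java driver 3.x; the table and names are hypothetical). It is
monotonic, but every increment pays a full Paxos round trip, which is
exactly why it's painful as a per-mutation sequencer:

  import com.datastax.driver.core.Row;
  import com.datastax.driver.core.Session;

  // Assumes: CREATE TABLE seq (name text PRIMARY KEY, value bigint);
  public final class LwtSequence {
    private final Session session;

    public LwtSequence(Session session) { this.session = session; }

    // Returns the next value in the sequence, retrying on CAS contention.
    public long next(String name) {
      while (true) {
        Row cur = session.execute(
            "SELECT value FROM seq WHERE name = ?", name).one();
        if (cur == null) {
          // Bootstrap the row; IF NOT EXISTS is itself a Paxos round.
          Row ins = session.execute(
              "INSERT INTO seq (name, value) VALUES (?, 1) IF NOT EXISTS",
              name).one();
          if (ins.getBool("[applied]")) return 1L;
          continue; // lost the race, re-read
        }
        long current = cur.getLong("value");
        // Compare-and-set; reading at SERIAL consistency would cut retries.
        Row cas = session.execute(
            "UPDATE seq SET value = ? WHERE name = ? IF value = ?",
            current + 1, name, current).one();
        if (cas.getBool("[applied]")) return current + 1;
        // Someone else won the CAS; loop and try again.
      }
    }
  }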
On Mon, Sep 10, 2018 at 3:08 PM, Rahul Singh wrote:
Post by Rahul Singh
In response to mimicking Advanced Replication in DSE: I understand the
goal. Although DSE Advanced Replication does one-way replication, those are
use cases with limited value to me because ultimately it's still master-slave.
I'm working on a prototype for two-way replication between
clusters or databases regardless of DB tech, and every variation I can get
to comes down to some implementation of the Calvin protocol, which basically
verifies the change in either cluster, sequences it according to its impact on
the underlying data, and then schedules the mutation in a predictable manner on
both clusters/DBs.
All that means is that I need to sequence the change before it happens
so I can predictably ensure it's scheduled for write/mutation. So I'm
back to square one: having a definitive queue/ledger separate from
the individual commit logs of the clusters.
There may be some use cases for it... but I'm not sure what they are.
It might help if you shared the use cases where the extra complexity is
required. When is writing to Cassandra, which then dedupes and writes to
Kafka, a preferred design over using Kafka first and simply writing to
Cassandra?
From my reading of the proposal, it seems to bring functionality similar
to MySQL's binlog-to-Kafka connectors. This is useful for many applications
that want to be notified when certain (or any) rows change in the database,
primarily for an event-driven application architecture.
Implementing this in the database layer means there is a standard
approach to getting a change notification stream. Downstream subscribers
can then decide which notifications to act on.
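Concretely, a downstream subscriber is just a plain Kafka consumer on the
change topic; here's a sketch (the topic name and group id are hypothetical,
Kafka client 2.x API):

  import java.time.Duration;
  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class ChangeFeedSubscriber {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "kafka:9092");
      props.put("group.id", "users-search-indexer");
      props.put("key.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer");
      props.put("value.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer");

      KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
      // One topic per table; this topic name is hypothetical.
      consumer.subscribe(Collections.singletonList("cdc.app.users"));
      while (true) {
        ConsumerRecords<String, String> records =
            consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          // Each record is one row-change event; the subscriber decides
          // which changes to act on (index, audit, warehouse load, ...).
          System.out.printf("key=%s change=%s%n", record.key(), record.value());
        }
      }
    }
  }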
LinkedIn's Databus is similar in functionality -
https://github.com/linkedin/databus - however it is for heterogeneous
Post by Joy Gao
We have a WIP design doc that
goes over this idea in detail.
We haven't sorted out all the edge cases yet, but we would love to get
some feedback from the community on the general feasibility of this
approach. Any ideas/concerns/questions would be helpful to us. Thanks!
Interesting idea. I did go over the proposal briefly. I concur with
Jon about adding more use cases to clarify this feature's potential.