Discussion:
What is your backup strategy for Cassandra?
Gene
2015-09-06 07:32:59 UTC
Hello everyone,

I'm new to this mailing list, and still fairly new to Cassandra. I'm a
systems administrator and have had a 3-node Cassandra cluster with a
replication factor of 3 running in Production for about a year now. We
have about 200 GB of data per node currently.

Up until recently I have just been performing snapshots and clearing them
out as needed. I recently implemented an automated process to perform
snapshots of our data and copy them off of our cluster via rsync+ssh.
Pretty soon I'll also be utilising the incremental backup feature for
sstables (cassandra.yaml:incremental_backups), and will be taking a look at
archiving for commitlog as well (commitlog_archiving.properties).
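
The snapshot-and-copy process described above could be sketched roughly as follows. This is a minimal illustration, not the actual automation in use: the function names, the data directory layout (`<keyspace>/<table>/snapshots/<tag>`), and the destination string are assumptions, and `run` is injectable so the command lines can be inspected without touching a real node.

```python
import subprocess
from pathlib import Path


def find_snapshot_dirs(data_dir, tag):
    """Return every <keyspace>/<table>/snapshots/<tag> directory under data_dir."""
    pattern = f"*/*/snapshots/{tag}"
    return sorted(p for p in Path(data_dir).glob(pattern) if p.is_dir())


def snapshot_cmd(tag):
    """Command that snapshots all keyspaces on the local node under `tag`."""
    return ["nodetool", "snapshot", "-t", tag]


def rsync_cmd(snapshot_dirs, dest):
    """Command that copies the snapshot directories to `dest` over ssh.

    -a preserves file attributes; --relative recreates the source paths
    under dest, so files from different tables do not collide.
    """
    return ["rsync", "-a", "--relative", "-e", "ssh", *map(str, snapshot_dirs), dest]


def clear_cmd(tag):
    """Command that deletes the local snapshot once it has been copied."""
    return ["nodetool", "clearsnapshot", "-t", tag]


def backup(data_dir, tag, dest, run=subprocess.check_call):
    """Snapshot, copy off-node, then clear. `run` executes one command list."""
    run(snapshot_cmd(tag))
    dirs = find_snapshot_dirs(data_dir, tag)
    if dirs:
        run(rsync_cmd(dirs, dest))
    # With the default check_call, a failed rsync raises before this line,
    # so the local snapshot is kept until a copy has succeeded.
    run(clear_cmd(tag))
```

A dry run such as `backup("/var/lib/cassandra/data", "nightly", "backup@backuphost.example:/backups/node1/", run=print)` prints the commands instead of executing them; the path and host here are placeholders.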

I've seen quite a few blog posts here and there about various backup
strategies. I'm wondering if anyone on this list would be willing to share
theirs.

Things I'm curious about:

1. Data size
2. Frequency for full snapshots
3. Frequency for copying snapshots off of the Cassandra nodes
4. Do you use the incremental backups feature?
5. Do you use commitlog archiving?
6. What method do you use to copy data off of the cluster (e.g. NFS, rsync,
rsync+ssh, etc.)?
7. Do you compress your backups? If so, how soon (e.g. compress backups
older than N days)?
8. Do you use any off-the-shelf scripts for your backups (e.g. tablesnap,
cassandra_snapshotter, etc.)?
9. Do you utilise AWS for your backups, or do you keep them local (or offsite
on your own hardware)?
10. Anything else you'd like to add, especially if I missed something
important

I'm not asking for the best, perfect method for Cassandra backups. I'd just
like to see what others are doing and hopefully use some ideas to improve
our processes.

Thanks in advance for any responses, and sorry for the wall of text.

-Gene
Robert Coli
2015-09-10 00:34:18 UTC
Post by Gene
I've seen quite a few blog posts here and there about various backup
strategies. I'm wondering if anyone on this list would be willing to share
theirs.
https://github.com/JeremyGrosser/tablesnap
Post by Gene
1. Data size
Up to hundreds of gigs per node.
Post by Gene
2. Frequency for full snapshots
Never/always (depends on your perspective).
Post by Gene
3. Frequency for copying snapshots off of the Cassandra nodes
As SSTables are flushed.
Post by Gene
4. Do you use the incremental backups feature?
No.
Post by Gene
5. Do you use commitlog archiving?
No.
Post by Gene
6. What method do you use to copy data off of the cluster (e.g. NFS, rsync,
rsync+ssh, etc.)?
S3 upload.
Post by Gene
7. Do you compress your backups? If so, how soon (e.g. compress backups
older than N days)?
My SSTables are already Snappy-compressed, so I am skeptical of any benefit
from re-compression.
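
One way to check this rather than guess is to measure how much a second compression pass actually saves. The sketch below is only an illustration under stated assumptions: it uses zlib from the Python standard library as a stand-in for Snappy, and the sample rows are synthetic. Snappy trades compression ratio for speed, so real SSTables may behave differently; the same `recompression_ratio` function could be pointed at the bytes of an actual SSTable file to find out.

```python
import random
import zlib


def recompression_ratio(already_compressed):
    """Size after one more compression pass, divided by the size before it.

    A value near 1.0 means the extra pass saves almost nothing.
    """
    return len(zlib.compress(already_compressed, 9)) / len(already_compressed)


def sample_rows(n=20000, seed=0):
    """Synthetic CSV-like rows (hypothetical data, seeded for repeatability)."""
    rng = random.Random(seed)
    lines = [
        f"{rng.randrange(10**6)},{rng.random():.6f},sensor-{rng.randrange(100)}\n"
        for _ in range(n)
    ]
    return "".join(lines).encode("ascii")


raw = sample_rows()
first_pass = zlib.compress(raw, 6)            # stand-in for on-disk compression
first_ratio = len(first_pass) / len(raw)      # what the first pass achieved
second_ratio = recompression_ratio(first_pass)  # what a second pass adds

print(f"first pass ratio:  {first_ratio:.2f}")
print(f"second pass ratio: {second_ratio:.2f}")
```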
Post by Gene
8. Do you use any off-the-shelf scripts for your backups (e.g. tablesnap,
cassandra_snapshotter, etc.)?
tablesnap
Post by Gene
9. Do you utilise AWS for your backups, or do you keep them local (or
offsite on your own hardware)?
AWS.

tl;dr - tablesnap works. There are awkward aspects to its use, but if you
are operating Cassandra in AWS it's probably the best off-the-shelf
off-node backup tool.
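
To illustrate the "copy SSTables as they are flushed" approach in general terms, here is a minimal polling sketch. It is not tablesnap's implementation (its README describes watching the data directory with inotify and uploading to S3); the function names, the `upload` callable, and the rule for skipping temporary files are all assumptions made for the example.

```python
import os


def new_sstable_files(data_dir, already_uploaded):
    """Return sorted paths of files under data_dir that have not been uploaded."""
    found = []
    for root, dirs, files in os.walk(data_dir):
        # Snapshot and incremental-backup directories hold hard links to
        # SSTables that live elsewhere in the tree, so do not descend into them.
        dirs[:] = [d for d in dirs if d not in ("snapshots", "backups")]
        for name in files:
            # Assumption: files still being written carry "tmp" in their name.
            if "tmp" in name:
                continue
            path = os.path.join(root, name)
            if path not in already_uploaded:
                found.append(path)
    return found if not found else sorted(found)


def upload_new(data_dir, already_uploaded, upload):
    """Upload each new file once; `upload` is any callable taking a path."""
    for path in new_sstable_files(data_dir, already_uploaded):
        upload(path)
        # Record the path only after upload returns, so a failed upload
        # is retried on the next scan.
        already_uploaded.add(path)
```

Because SSTables are immutable once written, each file only ever needs to be uploaded once, which is why there is no separate "full snapshot" step in this style of backup. Calling `upload_new` on a timer with an S3 upload function would approximate the behaviour, though an event-driven watcher reacts sooner than polling.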
