Thanks Ed. I am on 0.8.4, so I don't have the Leveled option, only SizeTiered.

I have a strange problem. I have a 6-node cluster (DC1=3, DC2=3). One of the
nodes has 105 GB of data whereas every other node has 60 GB, in spite of each
one being a replica of the others. I am contemplating whether I should be
running compact/cleanup on the node with 105 GB (sketched below). Btw, side
question: does it make sense to run it on just 1 node, or is it advisable to
run it on all?
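For reference, roughly what I mean by compact/cleanup (the keyspace/CF
names are placeholders for mine):

    # drop data this node no longer owns, then force a major compaction
    nodetool -h <node-with-105GB> cleanup MyKeyspace
    nodetool -h <node-with-105GB> compact MyKeyspace MyCF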
This node has also been giving me some issues lately. Last night during some
heavy load, I got a lot of TimedOutExceptions from this node. The node was
also flapping. I could see in the logs that it could see its peers dying
and coming back up, ultimately throwing UnavailableException (and sometimes
TimedOutException) on my requests.

I use JNA mlockall, so the JVM is definitely not swapping. I see a full GC
running for 15 seconds (according to GCInspector) around the same time, but
even after the GC, requests were still timing out. Cassandra runs with
Xmx8G, Xmn800M. Total RAM on the machine is 62 GB. I don't use any
meaningful key cache or row cache and rely on the OS file cache. top shows
VIRT as 116G (which makes sense since I have 105 GB of data). Have you seen
any issues with data this size on a node?
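In case it matters, the heap settings come from conf/cassandra-env.sh
(just restating the values above):

    # conf/cassandra-env.sh on this node
    MAX_HEAP_SIZE="8G"    # -Xmx8G
    HEAP_NEWSIZE="800M"   # -Xmn800M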
-Raj
Post by Edward Capriolo
Hey, my favorite question! It is a loaded question and it depends on
your workload. The answer has evolved over time.
In the old days <0.6.5 the only way to remove tombstones was major
compaction. This is not true in any modern version.
(Also in the old days you had to run cleanup to clear hints)
Cassandra now has two compaction strategies: SizeTiered and Leveled.
Leveled cannot be manually compacted.
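On 1.0+ you pick the strategy per column family. Via cassandra-cli it is
roughly the following (syntax from memory, "MyCF" is an example name; check
the docs for your version):

    update column family MyCF
      with compaction_strategy = 'LeveledCompactionStrategy';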
Your final two sentences are good ground rules. In our case we have
some column families that have high churn, for example a gc_grace
period of 4 days but the data is re-written completely every day.
Write activity over time will eventually cause tombstone removal but
we can expedite the process by forcing a major at night. Because the
tables are not really growing, the **warning** below does not apply.
**Warning**: this creates one large sstable, which is not always
desirable, because it fiddles with the heuristics of SizeTiered
(having one big sstable and other smaller ones).
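To make the nightly major concrete, for those high-churn (non-growing)
column families it is just a cron'd nodetool call, something like this
(the path and names are examples, adjust to your setup):

    # force a major compaction at 02:00 on the high-churn CF
    0 2 * * * /opt/cassandra/bin/nodetool -h localhost compact MyKeyspace HighChurnCF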
The updated answer is "You probably do not want to run major
compactions, but some use cases could see some benefits"
Post by Raj N
DataStax recommends not to run major compactions. Edward Capriolo's
Cassandra High Performance book suggests that major compaction is a good
thing and should be run on a regular basis. Are there any ground rules
about running major compactions? For example, if you have write-once
kind of data that is never updated, then it probably makes sense not to
run major compaction. But if you have data which can be deleted or
overwritten, does it make sense to run major compaction on a regular basis?
Thanks
-Raj