Discussion: how to increase compaction rate?
Thorsten von Eicken
2012-03-12 03:04:30 UTC
I'm having difficulties with leveled compaction: it's not making fast
enough progress. I'm on a quad-core box and it only does one compaction
at a time. Cassandra version: 1.0.6. Here's nodetool compactionstats:

# nodetool -h localhost compactionstats
pending tasks: 2568
compaction type    keyspace            column family    bytes compacted    bytes total    progress
Compaction         rslog_production    req_text                 4974195      314597326       1.58%

The number of pending tasks decreases extremely slowly. In the log, I
can see it perform 3-4 compactions but the number of tasks only
decreases by one. (I turned clients off and even disabled thrift to
ensure this is not because of writes happening at the same time.) In the
log, I see nicely paired Compacting... / Compacted... lines one after the
other; it doesn't look like there's ever more than one compaction
running at a time. I have 3 CPUs sitting idle. My cassandra.yaml has:

snapshot_before_compaction: false
column_index_size_in_kb: 128
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true

I've issued a "nodetool -h localhost setcompactionthroughput 100", which
didn't seem to make a difference. Here are some sample log lines:

INFO [CompactionExecutor:117] 2012-03-12 02:54:43,963 CompactionTask.java (line 113) Compacting [SSTableReader(path='/mnt/ebs/data/rslog_production/req_text-hc-753342-Data.db')]
INFO [CompactionExecutor:117] 2012-03-12 02:54:47,793 CompactionTask.java (line 218) Compacted to [/mnt/ebs/data/rslog_production/req_text-hc-753992-Data.db,]. 30,198,523 to 30,197,052 (~99% of original) bytes for 39,269 keys at 7.519100MB/s. Time: 3,830ms.
INFO [CompactionExecutor:119] 2012-03-12 02:54:47,795 CompactionTask.java (line 113) Compacting [SSTableReader(path='/mnt/ebs/data/rslog_production/req_text-hc-753933-Data.db')]
INFO [CompactionExecutor:119] 2012-03-12 02:54:51,731 CompactionTask.java (line 218) Compacted to [/mnt/ebs/data/rslog_production/req_text-hc-753994-Data.db,]. 31,462,495 to 31,462,495 (~100% of original) bytes for 40,267 keys at 7.625152MB/s. Time: 3,935ms.
INFO [CompactionExecutor:119] 2012-03-12 02:54:51,734 CompactionTask.java (line 113) Compacting [SSTableReader(path='/mnt/ebs/data/rslog_production/req_text-hc-753343-Data.db')]
INFO [CompactionExecutor:119] 2012-03-12 02:54:56,093 CompactionTask.java (line 218) Compacted to [/mnt/ebs/data/rslog_production/req_text-hc-753996-Data.db,]. 32,643,675 to 32,643,958 (~100% of original) bytes for 57,473 keys at 7.141937MB/s. Time: 4,359ms.
INFO [CompactionExecutor:118] 2012-03-12 02:54:56,095 CompactionTask.java (line 113) Compacting [SSTableReader(path='/mnt/ebs/data/rslog_production/req_text-hc-753934-Data.db')]
INFO [CompactionExecutor:118] 2012-03-12 02:54:59,635 CompactionTask.java (line 218) Compacted to [/mnt/ebs/data/rslog_production/req_text-hc-753998-Data.db,]. 30,709,285 to 30,709,285 (~100% of original) bytes for 32,172 keys at 8.275404MB/s. Time: 3,539ms.
INFO [CompactionExecutor:118] 2012-03-12 02:54:59,638 CompactionTask.java (line 113) Compacting [SSTableReader(path='/mnt/ebs/data/rslog_production/req_text-hc-753344-Data.db')]

I recently added a second node to the ring and, in what I suspect is
related, I can't get any data to stream to it (RF=1). Somewhere I
read that the compaction executor also handles the data streaming? I'm
wondering whether those tasks are all queued. nodetool ring:

# nodetool -h localhost ring
Address          DC           Rack    Status  State    Load        Owns     Token
                                                                            85070591730234615865843651857942052865
10.102.37.168    datacenter1  rack1   Up      Normal   811.87 GB   50.00%   0
10.80.161.101    datacenter1  rack1   Up      Normal   1.08 MB     50.00%   85070591730234615865843651857942052865
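
(I assume any pending or stuck streams would show up in netstats, e.g.

# nodetool -h localhost netstats

but I don't have that output handy.)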

The last thing I attempted was to move the empty node from ...864 to ...865:

# nodetool -h localhost move 85070591730234615865843651857942052865
INFO 19:59:19,625 Moving /10.80.161.101 from 85070591730234615865843651857942052864 to 85070591730234615865843651857942052865.
INFO 19:59:19,628 Sleeping 30000 ms before start streaming/fetching ranges.
INFO 19:59:49,639 MOVING: fetching new ranges and streaming old ranges
INFO 19:59:52,680 Finished streaming session 97489049918693 from /10.102.37.168
INFO 19:59:52,681 Enqueuing flush of Memtable-***@227137515(36/45 serialized/live bytes, 1 ops)
INFO 19:59:52,682 Writing Memtable-***@227137515(36/45 serialized/live bytes, 1 ops)
INFO 19:59:52,706 Completed flushing /mnt/ebs/data/system/LocationInfo-hc-19-Data.db (87 bytes)
INFO 19:59:52,708 Node /10.80.161.101 state jump to normal

I'm pretty stumped at this point... Any pointers to what to do or what I
may have done wrong?
Thanks!
Thorsten
Peter Schuller
2012-03-12 04:17:32 UTC
Post by Thorsten von Eicken
multithreaded_compaction: false
Set to true.
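I.e., in cassandra.yaml (a restart is needed for it to take effect):

multithreaded_compaction: true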
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Thorsten von Eicken
2012-03-12 06:21:44 UTC
Post by Peter Schuller
Post by Thorsten von Eicken
multithreaded_compaction: false
Set to true.
I did try that. I didn't see it go any faster. The CPU load was lower,
which I assumed meant fewer bytes/sec being compressed
(SnappyCompressor). I didn't see multiple compactions in parallel.
Nodetool compactionstats behaved strangely: instead of showing
individual compactions with a %-complete, it showed a running count of
total bytes compacted. (Darn, I don't have that output in my terminal
buffer anymore.) It just didn't look good to me. Are you positive
that it is faster with leveled compaction? I don't understand why I
don't get multiple concurrent compactions running; that's what would
make the biggest performance difference. Is the compaction parallelism
perhaps only across multiple CFs? That would explain what I see.

Thorsten
aaron morton
2012-03-12 09:44:50 UTC
Post by Thorsten von Eicken
I don't understand why I
don't get multiple concurrent compactions running, that's what would
make the biggest performance difference.
concurrent_compactors
Controls how many concurrent compactions to run; by default it's the number of cores on the machine.

If you are not CPU bound, check iostat (http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html)
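
For example (illustrative values, adjust for your hardware and disks):

# cassandra.yaml
concurrent_compactors: 4               # defaults to the number of cores
compaction_throughput_mb_per_sec: 0    # 0 disables compaction throttling

# check whether the disks are the bottleneck
iostat -x 5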

Cheers
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
Post by Thorsten von Eicken
Post by Peter Schuller
Post by Thorsten von Eicken
multithreaded_compaction: false
Set to true.
I did try that. I didn't see it go any faster. The cpu load was lower,
which I assumed meant fewer bytes/sec being compressed
(SnappyCompressor). I didn't see multiple compactions in parallel.
Nodetool compactionstats behaved strange and instead of showing
individual compactions with a %-complete it showed a running count of
total bytes compacted. (Darn, I don't have the output of that anymore in
my terminal buffer.) It just didn't look good to me. Are you positive
that it is faster with leveled compaction? I don't understand why I
don't get multiple concurrent compactions running, that's what would
make the biggest performance difference. Is the compaction parallelism
perhaps only across multiple CFs? That would explain what I see.
Thorsten
Brandon Williams
2012-03-12 13:52:35 UTC
Post by Thorsten von Eicken
I don't understand why I
don't get multiple concurrent compactions running, that's what would
make the biggest performance difference.
concurrent_compactors
Controls how many concurrent compactions to run, by default it's the number
of cores on the machine.
With leveled compaction, I don't think you get any concurrency because
it has to compact an entire level, and it can't proceed to the next
level without completing the one before it.

In short, if you want maximum throughput, stick with size tiered.

-Brandon
Thorsten von Eicken
2012-03-13 17:27:28 UTC
Post by Brandon Williams
Post by Thorsten von Eicken
I don't understand why I
don't get multiple concurrent compactions running, that's what would
make the biggest performance difference.
concurrent_compactors
Controls how many concurrent compactions to run, by default it's the number
of cores on the machine.
I'm on a quad-core machine so not setting concurrent_compactors should
not be a limiting factor...
Post by Brandon Williams
With leveled compaction, I don't think you get any concurrency because
it has to compact an entire level, and it can't proceed to the next
level without completing the one before it.
In short, if you want maximum throughput, stick with size tiered.
I switched the CFs to tiered compaction and I still get no concurrency
for the same CF. I now have two compactions running concurrently, but
always for different CFs. I've briefly seen a third for one of the small
CFs, so it's willing to run more than two concurrently. Looks like I'll
have to wait a few days for all the compactions to complete. Talk
about compaction hell!
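
For reference, I made the per-CF switch with something along these lines in
cassandra-cli (using one of my CFs as the example):

update column family req_text with compaction_strategy = 'SizeTieredCompactionStrategy';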
Post by Brandon Williams
-Brandon
Viktor Jevdokimov
2012-03-13 23:13:14 UTC
After losing one node we had to repair; the CFs were on leveled compaction.
For one CF, each node had about 7GB of data.
Running a repair without the primary-range switch (-pr) left some nodes bloated
to about 60-100GB of 5MB sstables for that CF (a lot of files).
After switching back from leveled to tiered, compactions ended up completely
blocked on all nodes, since this CF was compacting forever.
On one node a major compaction for that CF is CPU bound and can run with
unlimited compaction speed for 4-7 days at a maximum 1MB/s rate, finally
compacting down to 3GB of data (some data is deleted by TTL, some merged).

What we did to speed up this process and return all the exhausted nodes to a
normal state faster:
We created 6 temporary single-instance virtual Cassandra nodes with 2 CPU
cores and 8GB RAM each.
We completely stopped compaction for that CF on a production node.
The leveled sstables from that production node were divided into 6 ranges and
copied onto the 6 temporary, empty nodes.
On each node we ran a major compaction to compact just 1/6 of the data, about
10-14GB. It took 1-2 hours to compact that down to about 1GB of data.
Then all 6 resulting sstables were copied onto one of the 6 nodes for a final
major compaction, yielding the expected 3GB sstable.
Finally we stopped the production node, deleted the files that had been copied
off, put back the compacted sstable (which may need renaming), and the node was
back to normal.
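
Roughly, the commands involved were along these lines (host, keyspace and CF
names here are just placeholders):

# on each temporary node: copy one range of sstable generations from production
rsync -av prod-node:/var/lib/cassandra/data/MyKeyspace/MyCF-hc-75* /var/lib/cassandra/data/MyKeyspace/

# restart the temporary node so it picks up the sstables, then force a major compaction
nodetool -h localhost compact MyKeyspace MyCF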

By using separate nodes we saved the production nodes from compacting the
exhausted CF forever and blocking compactions for the other CFs. With 6
separate nodes we compacted 2 production nodes a day, so it may have taken
about the same amount of time overall, but the production nodes stayed free
for regular compactions of the other CFs.

Now that we are back to normal, for our use case we stick with tiered
compaction and a nightly major compaction.
With our insertion/TTL-deletion rates, leveled compaction is a nightmare,
even if the amount of data is not very big, just a few GB per node.
Post by Thorsten von Eicken
Post by Brandon Williams
Post by Thorsten von Eicken
I don't understand why I
don't get multiple concurrent compactions running, that's what would
make the biggest performance difference.
concurrent_compactors
Controls how many concurrent compactions to run, by default it's the number
of cores on the machine.
I'm on a quad-core machine so not setting concurrent_compactors should
not be a limiting factor...
Post by Brandon Williams
With leveled compaction, I don't think you get any concurrency because
it has to compact an entire level, and it can't proceed to the next
level without completing the one before it.
In short, if you want maximum throughput, stick with size tiered.
I switched the CFs to tiered compaction and I still get no concurrency
for the same CF. I now have two compactions running concurrently but
always for different CFs. I've briefly seen a third for one of the small
CFs, so it's willing to run more than two concurrently. Looks like I
have to wait for a few days for all the compactions to complete. Talk
about compaction hell!
Post by Brandon Williams
-Brandon
Thorsten von Eicken
2012-03-14 03:32:15 UTC
Post by Viktor Jevdokimov
What we did to speedup this process to return all exhausted nodes into
We have created a 6 temporary virtual single Cassandra nodes with 2
CPU cores and 8GB RAM.
Stopped completely a compaction for CF on a production node.
Leveled sstables from this production node was divided into 6 ranges
and copied into 6 temporary empty nodes.
On each node we ran a major compaction to compact just 1/6 of data,
about 10-14GB. It took 1-2 hours to compact them into 1GB of data.
Then all 6 sstables was copied into one of 6 nodes for a major
compaction, finally getting expected 3GB sstable.
Stopping production node, deleting files that was copied, returning
compacted (may need renaming) and node is back to normal.
Using separate nodes we saved original production nodes time not to
compact exhausted CF forever, blocking compactions for other CFs. With
6 separate nodes we have compacted 2 productions nodes a day, so maybe
it took the same time, but production nodes were free for regular
compactions for other CFs.
Yikes, that's quite the ordeal, but I totally get why you had to go
there. Cassandra seems to work well within some use-case bounds and
lacks the sophistication to handle others well. I've been wondering
about the way I use it, which is to hold the last N days of logs and
corresponding index. This means that every day I make a zillion inserts
and a corresponding zillion of deletes for the data inserted N days ago.
The way compaction works, this is horrible. The data is essentially
immutable until it's deleted, yet it's copied a whole bunch of times. In
addition, it takes forever for the deletion tombstones to "meet" the
original data in a compaction and actually compact it away. I've also
run into the zillions-of-files problem with leveled compaction that you did. I
ended up with over 30k SSTables for ~1TB of data. At that point the
compaction just ceases to make progress. And starting cassandra takes
more than 30 minutes just to open all the SSTables, and when it's done, 12GB of
memory are in use. Better algorithms and some tools will be needed for all
this to "just work". But then, we're also only at V1.0.8...
TvE
Edward Capriolo
2012-03-14 03:59:36 UTC
On Tue, Mar 13, 2012 at 11:32 PM, Thorsten von Eicken
Post by Thorsten von Eicken
Post by Viktor Jevdokimov
What we did to speedup this process to return all exhausted nodes into
We have created a 6 temporary virtual single Cassandra nodes with 2
CPU cores and 8GB RAM.
Stopped completely a compaction for CF on a production node.
Leveled sstables from this production node was divided into 6 ranges
and copied into 6 temporary empty nodes.
On each node we ran a major compaction to compact just 1/6 of data,
about 10-14GB. It took 1-2 hours to compact them into 1GB of data.
Then all 6 sstables was copied into one of 6 nodes for a major
compaction, finally getting expected 3GB sstable.
Stopping production node, deleting files that was copied, returning
compacted (may need renaming) and node is back to normal.
Using separate nodes we saved original production nodes time not to
compact exhausted CF forever, blocking compactions for other CFs. With
6 separate nodes we have compacted 2 productions nodes a day, so maybe
it took the same time, but production nodes were free for regular
compactions for other CFs.
Yikes, that's quite the ordeal, but I totally get why you had to go
there. Cassandra seems to work well within some use-case bounds and
lacks the sophistication to handle others well. I've been wondering
about the way I use it, which is to hold the last N days of logs and
corresponding index. This means that every day I make a zillion inserts
and a corresponding zillion of deletes for the data inserted N days ago.
The way the compaction works this is horrible. The data is essentially
immutable until it's deleted, yet it's copied a whole bunch of times. In
addition, it takes forever for the deletion tombstones to "meet" the
original data in a compaction and actually compact it away. I've also
run into the zillions of files problem with level compaction you did. I
ended up with over 30k SSTables for ~1TB of data. At that point the
compaction just ceases to make progress. And starting cassandra takes
more than 30 minutes just for it to open all the SSTables and when done 12GB of
memory are used. Better algorithms and some tools will be needed for all
this to "just work". But then, we're also just at V1.0.8...
   TvE
You are correct to say that the way Cassandra works is not ideal for
a dataset where you completely delete and re-add the entire dataset
each day. In fact, that may be one of the worst use cases for
Cassandra. This has to do with the structured log format and with the
tombstones and grace period. Maybe you can set a lower base.
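
For the tombstone side, the grace period can be lowered per CF, e.g. in
cassandra-cli (the CF name and value are just examples, and a short gc_grace
is only safe if repair runs within that window):

update column family req_text with gc_grace = 86400;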

Leveled compaction (LevelDB-style) is new and not as common in the wild as
size tiered. Again, it works the way it works. Google must think it is
brilliant; after all, they invented it.

For 1TB of data, your 12GB is used by bloom filters. Again, this is
just a fact of life; bloom filters are there to make negative lookups
faster. Maybe you can lower the bloom filter sizes and the index
interval. That should use less memory and help the system start up
faster, respectively.
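
Something along these lines, assuming your version exposes the per-CF bloom
filter setting (values are illustrative):

# cassandra-cli: smaller bloom filters for that CF (higher false-positive rate)
update column family req_text with bloom_filter_fp_chance = 0.01;

# cassandra.yaml: sample the primary index less densely (default is 128)
index_interval: 512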

But nodes stuffed with a trillion keys may not be optimal for many
reasons. In our case we want a high portion of the data set in memory,
so a 1TB node might need, say, 256GB of RAM :) We opt for more, smaller
boxes.
Maxim Potekhin
2012-03-13 19:15:01 UTC
Dear All,

after all the testing and continuous operation of my first cluster,
I've been given an OK to build a second production Cassandra cluster in
Europe.

There were posts in recent weeks regarding the most stable and solid
Cassandra version.
I was wondering if anything better has appeared since it was last discussed.

At this juncture, I don't need features, just rock solid stability. Are
0.8.* versions still acceptable,
since I have experience with these, or should I take the plunge to 1+?

I realize that I won't need more than 8GB of RAM because I can't make the
Java heap too big. Is it still worth paying money for extra RAM? Is the
cache located outside of the heap in recent versions?

Thanks to all of you for the advice I'm receiving on this board.

Best regards

Maxim
Edward Capriolo
2012-03-13 21:40:51 UTC
I am on 1.0.7 and would suggest that. The memtable and JAMM stuff is very
stable. I would not set up 0.8.X because, with 1.1 coming soon, 0.8.X is
not likely to see too many more minor releases. You can always do
better with more RAM, up to the size of your data; having more RAM than
data size will not help noticeably. The off-heap row cache can use
it, and the OS can cache disk blocks with it.
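
The off-heap (serializing) cache is configured per CF; if I remember the
attribute names right it looks something like this in cassandra-cli (the CF
name and size are made up):

update column family MyCF with rows_cached = 100000 and row_cache_provider = 'SerializingCacheProvider';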

Edward
Post by Maxim Potekhin
Dear All,
after all the testing and continuous operation of my first cluster,
I've been given an OK to build a second production Cassandra cluster in
Europe.
There were posts in recent weeks regarding the most stable and solid
Cassandra version.
I was wondering is anything better has appeared since it was last discussed.
At this juncture, I don't need features, just rock solid stability. Are
0.8.* versions still acceptable,
since I have experience with these, or should I take the plunge to 1+?
I realize that I won't need more than 8GB RAM because I can't make Java heap
too big. Is worth it
still to pay money for extra RAM? Is the cache located outside of heap in
recent versions?
Thanks to all of you for the advice I'm receiving on this board.
Best regards
Maxim
Maxim Potekhin
2012-03-13 23:37:41 UTC
Thank you Edward.

As can be expected, my data volume is a multiple of whatever RAM I can
realistically buy, and in fact much bigger. In my very limited experience,
the money might be well spent on multicore CPUs, because they make routine
operations like compact/repair (which always include writes) so much
faster, hence reducing the periods of high occupancy. I'm trying to scope
out how much SSD I will need, because it appears to be an economical
solution to problems I've had previously.

Regards,
Maxim
Post by Edward Capriolo
I am 1.0.7. I would suggest that. The memtable and JAMM stuff is very
stable. I would not setup 0.8.X because with 1.1 coming soon 0.8.X is
not likely to see to many more minor releases. You can always do
better with more RAM up to the size of your data, having more ram them
data size will not help noticeably . The off heap row cache can use
this and the OS can cache disk blocks with it.
Edward
Post by Maxim Potekhin
Dear All,
after all the testing and continuous operation of my first cluster,
I've been given an OK to build a second production Cassandra cluster in
Europe.
There were posts in recent weeks regarding the most stable and solid
Cassandra version.
I was wondering is anything better has appeared since it was last discussed.
At this juncture, I don't need features, just rock solid stability. Are
0.8.* versions still acceptable,
since I have experience with these, or should I take the plunge to 1+?
I realize that I won't need more than 8GB RAM because I can't make Java heap
too big. Is worth it
still to pay money for extra RAM? Is the cache located outside of heap in
recent versions?
Thanks to all of you for the advice I'm receiving on this board.
Best regards
Maxim
Edward Capriolo
2012-03-14 04:03:08 UTC
Agreed, if you are using SSDs you likely will not need as much RAM.
I said "You could always do better with more RAM", not "You should
definitely get more RAM" :)
Post by Maxim Potekhin
Thank you Edward.
As can be expected, my data volume is a multiple of whatever RAM I can
realistically buy, and in fact much bigger. In my very limited experience,
the money might be well spent on multicore CPUs because it makes routine
operations like compact/repair (which always include writes) so much faster,
hence reducing
the periods of high occupancy. I'm trying to scope out how much SSD I will
need because
it appears to be an economical solution to problems I had previously had.
Regards,
Maxim
Post by Edward Capriolo
I am 1.0.7. I would suggest that. The memtable and JAMM stuff is very
stable. I would not setup 0.8.X because with 1.1 coming soon 0.8.X is
not likely to see to many more minor releases. You can always do
better with more RAM up to the size of your data, having more ram them
data size will not help noticeably . The off heap row cache can use
this and the OS can cache disk blocks with it.
Edward
Post by Maxim Potekhin
Dear All,
after all the testing and continuous operation of my first cluster,
I've been given an OK to build a second production Cassandra cluster in
Europe.
There were posts in recent weeks regarding the most stable and solid
Cassandra version.
I was wondering is anything better has appeared since it was last discussed.
At this juncture, I don't need features, just rock solid stability. Are
0.8.* versions still acceptable,
since I have experience with these, or should I take the plunge to 1+?
I realize that I won't need more than 8GB RAM because I can't make Java heap
too big. Is worth it
still to pay money for extra RAM? Is the cache located outside of heap in
recent versions?
Thanks to all of you for the advice I'm receiving on this board.
Best regards
Maxim