Discussion: reducing disk usage advice
Karl Hiramoto
2011-03-13 18:10:13 UTC
Hi,

I'm looking for advice on reducing disk usage. I've run out of disk
space two days in a row while running a nightly scheduled nodetool
repair && nodetool compact cronjob.

I have 6 nodes, RF=3, each with a 300 GB drive at a hosting company.
GCGraceSeconds = 260000 (~3 days)

Every column in the database has a TTL of 86400 (24 hours) to handle
deletion of stale data. 50% of the time the data is only written once,
read zero or more times, then expires. The other 50% of the time it's
written multiple times, resetting the TTL to 24 hours each time.


One question: since I use a TTL, is it safe to set GCGraceSeconds to
0? I never delete manually; I just rely on the TTL for deletion, so
are forgotten deletes an issue?
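(For reference, a write of the kind described above would look roughly like this through the 0.7 Thrift API. This is only a sketch: the keyspace, key and column names are made up, and the point is simply that re-inserting the column with ttl=86400 is what pushes its expiry out another 24 hours.)

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class TtlWrite {
    public static void main(String[] args) throws Exception {
        // Plain Thrift connection to one node; the keyspace name is a placeholder.
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace");

        Column col = new Column();
        col.setName(ByteBuffer.wrap("price".getBytes("UTF-8")));   // made-up column name
        col.setValue(ByteBuffer.wrap("42".getBytes("UTF-8")));
        col.setTimestamp(System.currentTimeMillis() * 1000);       // microseconds
        col.setTtl(86400);                                         // expires 24h after this write

        // Each overwrite of the same key/column with ttl=86400 resets the 24h countdown.
        client.insert(ByteBuffer.wrap("offer-123".getBytes("UTF-8")),
                      new ColumnParent("Offer"),
                      col, ConsistencyLevel.QUORUM);
        transport.close();
    }
}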



cfstats:
Read Count: 32052
Read Latency: 3.1280378135529765 ms.
Write Count: 9704525
Write Latency: 0.009527474760485443 ms.
Pending Tasks: 0
        Column Family: Offer
        SSTable count: 12
        Space used (live): 59865089091
        Space used (total): 76111577830
        Memtable Columns Count: 39355
        Memtable Data Size: 14726313
        Memtable Switch Count: 414
        Read Count: 32052
        Read Latency: 3.128 ms.
        Write Count: 9704525
        Write Latency: 0.010 ms.
        Pending Tasks: 0
        Key cache capacity: 1000
        Key cache size: 1000
        Key cache hit rate: 2.4805931214280473E-4
        Row cache: disabled
        Compacted row minimum size: 36
        Compacted row maximum size: 1597
        Compacted row mean size: 1319
aaron morton
2011-03-13 20:27:23 UTC
The CF stats report that you have 70GB of total space taken up by SSTables, of which 55GB is live. The rest is awaiting deletion; AFAIK this happens when Cassandra detects free space is running low. I've never dug into how/when this happens, though.

With that amount of data it seems odd to fill a 300GB space during repair or compaction. Some thoughts:

- What version are you using? There were issues in early 0.7 that resulted in a lot of temp files being left around.
- Take a look on disk. In the commit log directory, are there a lot of files hanging around? In the data directory for your keyspace, do you have any snapshots? Is there anything in the saved caches directory?
- When you've run out of space, what did the data directory look like? How did you fix the issue?
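(To help answer that last question without waiting for the disk to fill again, here is a rough sketch that breaks a keyspace data directory down into live, compacted-but-not-yet-deleted, and temporary sstables. The "-Compacted" marker file and the "-tmp-" infix are, as far as I recall, how 0.7 flags those states; treat the naming and the path as assumptions and adjust to what you actually see on disk.)

import java.io.File;

public class ListCompacted {
    public static void main(String[] args) {
        // e.g. java ListCompacted /var/lib/cassandra/data/MyKeyspace
        File dir = new File(args[0]);
        long live = 0, compacted = 0, tmp = 0;
        for (File f : dir.listFiles()) {
            String name = f.getName();
            if (!name.endsWith("-Data.db"))
                continue;
            if (name.contains("-tmp-")) {
                // half-written output of an in-progress or interrupted compaction
                tmp += f.length();
            } else if (new File(dir, name.substring(0, name.length() - "-Data.db".length()) + "-Compacted").exists()) {
                // obsolete sstable waiting for the lazy (GC-triggered) delete
                compacted += f.length();
            } else {
                live += f.length();
            }
        }
        System.out.printf("live: %d MB, compacted-but-not-deleted: %d MB, tmp: %d MB%n",
                          live >> 20, compacted >> 20, tmp >> 20);
    }
}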

Aaron
Karl Hiramoto
2011-03-13 20:59:32 UTC
Post by aaron morton
The CF stats report that you have 70GB of total space taken up by
SSTables, of which 55GB is live. The rest is awaiting deletion;
AFAIK this happens when Cassandra detects free space is running low.
I've never dug into how/when this happens, though.
If there were a way to force this deletion, it would be nice.

du -hs data/
78G data/
Post by aaron morton
With that amount of data it seems odd to fill a 300GB space during repair or compaction.
I maybe have 50 to 100GB of other stuff, including MySQL and the OS
(RHEL 5.5), so there's about 200GB for Cassandra.
Post by aaron morton
- What version are you using? There were issues in early 0.7 that
resulted in a lot of temp files being left around.
0.7.3
Post by aaron morton
- Take a look on disk. In the commit log directory, are there a lot of
files hanging around? In the data directory for your keyspace, do you
have any snapshots? Is there anything in the saved caches directory?
No snapshots and not much other stuff

du -h saved_caches/
52K saved_caches/

du -h commitlog/
218M commitlog/


no snapshots
Post by aaron morton
- When you've run out of space, what did the data directory look like?
How did you fix the issue?
The data directory had nearly 200GB of stuff. When I kill and restart the
Cassandra process, the data dir goes down to 60GB or so. I can do the same
if I do a "Force GC" in jconsole. A few days ago I asked if anyone
knew how to force the GC in a cron job and got no response.

http://wiki.apache.org/cassandra/FAQ#cleaning_compacted_tables
After compaction is done I have lots of extra SSTables lying around, and
it seems the only way to get rid of them is restarting Cassandra or a
jconsole GC. I'd like a better way to automate it than restarting
Cassandra.
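(For what it's worth, the jconsole "Perform GC" trick can be automated with a tiny JMX client run from cron. This is a minimal sketch, assuming no JMX authentication and whatever JMX_PORT cassandra-env.sh sets, 8080 being the 0.7 default; adjust host and port to your setup.)

import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ForceGC {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        String port = args.length > 1 ? args[1] : "8080";   // match JMX_PORT in cassandra-env.sh
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Invoke the java.lang:type=Memory gc() operation, the same thing
            // jconsole's "Perform GC" button does.
            mbs.invoke(new ObjectName(ManagementFactory.MEMORY_MXBEAN_NAME),
                       "gc", new Object[0], new String[0]);
        } finally {
            connector.close();
        }
    }
}

Running something like "java ForceGC localhost 8080" at the end of the nightly repair/compact cronjob should then let Cassandra delete the compacted SSTables without a restart.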


If at all possible I'd like to be able to clean up these files
without paying for more storage for files I don't need.

Thanks,

Karl
Sylvain Lebresne
2011-03-14 14:33:51 UTC
Hi,
I'm looking for advice on reducing disk usage. I've run out of disk space two days in a row while running a nightly scheduled nodetool repair && nodetool compact cronjob.
I have 6 nodes, RF=3, each with a 300 GB drive at a hosting company. GCGraceSeconds = 260000 (~3 days)
Every column in the database has a TTL of 86400 (24 hours) to handle deletion of stale data. 50% of the time the data is only written once, read zero or more times, then expires. The other 50% of the time it's written multiple times, resetting the TTL to 24 hours each time.
As it turns out, the compaction algorithm is pretty much the worst
possible for this use case. Because we compact files that have a
similar size, the older a column gets, the less often it is compacted.
If you always set a fixed TTL for all columns, you want to do
some compaction of recent sstables, for the sake of not having too many
sstables, but you also want to compact old sstables, which are
guaranteed to just go away. And for those, it's actually fine to
compact them alone (only for the sake of purging).
But as compaction works today, you will end up with big sstables of stuff
that is expired, and you may not even be able to compact them, simply
because compaction "thinks" it doesn't have enough room.
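To make that concrete (the sizes here are invented for illustration): with the default minimum compaction threshold of 4 similarly sized sstables, four ~1GB sstables merge into one ~4GB sstable, four of those into one of ~16GB, and so on. A 16GB sstable whose columns have all long since expired then sits on disk until three more sstables of roughly that size appear, which under a pure 24-hour-TTL workload can take far longer than the disk has room for.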

But I do think that your use case (having a CF where all columns have
the same TTL and you rely only on it for deletion) is a very useful
one, and we should handle it better. In particular, CASSANDRA-1610
could be an easy way to get this.

CASSANDRA-1537 is probably also a partial but possibly sufficient
solution. That's also probably easier than CASSANDRA-1610 and I'll try
to give it a shot ASAP; it has been on my todo list way too long.
One question: since I use a TTL, is it safe to set GCGraceSeconds to 0? I never delete manually; I just rely on the TTL for deletion, so are forgotten deletes an issue?
The rule is this: say you think that m is a reasonable value for
GCGraceSeconds. That is, you make sure that you'll always put failing
nodes back up and run repair within m seconds. Then, if you always use
a TTL of n (in your case 24 hours), the actual GCGraceSeconds that you
should set is m - n.
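To put numbers on that: taking the 260000 seconds (~3 days) from the original post as m and n = 86400 seconds, the rule gives 260000 - 86400 = 173600 seconds, i.e. roughly two days, as the GCGraceSeconds to set.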

So putting a GCGrace of 0 in your case would be roughly equivalent to
setting a GCGrace of 24h on a "normal" CF. That's probably a bit low.

--
Sylvain
Karl Hiramoto
2011-03-14 19:17:25 UTC
Post by Sylvain Lebresne
CASSANDRA-1537 is probably also a partial but possibly sufficient
solution. That's also probably easier than CASSANDRA-1610 and I'll try
to give it a shot ASAP; it has been on my todo list way too long.
Thanks, I'm eager to see CASSANDRA-1610 someday. What I've been doing the
last day has been multiple restarts across the cluster when one node's
data/ dir gets to 150GB. Restarting Cassandra brings the node's data/
directory down to around 60GB; I see Cassandra deleting a lot of
SSTables on startup.


One question, since I use a TTL is it safe to set GCGraceSeconds to 0? I don't manually delete ever, I just rely on the TTL for deletion, so are forgotten deletes an issue?
Post by Sylvain Lebresne
The rule is this. Say you think that m is a reasonable value for
GCGraceSeconds. That is, you make sure that you'll always put back up
failing nodes and run repair within m seconds. Then, if you always use
a TTL of n (in your case 24 hours), the actual GCGraceSeconds that you
should set is m - n.
So putting a GCGrace of 0 in your case would be roughly equivalent to
setting a GCGrace of 24h on a "normal" CF. That's probably a bit low.
What do you mean by "normal"? If I were to set GCGrace to 0, would I risk
data corruption? Wouldn't setting GCGrace to 0 help reduce disk space
pressure?


Thanks,
Karl
Sylvain Lebresne
2011-03-15 09:14:41 UTC
Post by Sylvain Lebresne
CASSANDRA-1537 is probably also a partial but possibly sufficient
solution. That's also probably easier than CASSANDRA-1610 and I'll try
to give it a shot ASAP; it has been on my todo list way too long.
Thanks, I'm eager to see CASSANDRA-1610 someday. What I've been doing the
last day has been multiple restarts across the cluster when one node's
data/ dir gets to 150GB. Restarting Cassandra brings the node's data/
directory down to around 60GB; I see Cassandra deleting a lot of
SSTables on startup.
This is because Cassandra lazily removes the compacted files. You don't
have to restart a node, though; you can just trigger a full Java GC
through jconsole and this should remove the files.
One question: since I use a TTL, is it safe to set GCGraceSeconds to 0? I never delete manually; I just rely on the TTL for deletion, so are forgotten deletes an issue?
Post by Sylvain Lebresne
The rule is this. Say you think that m is a reasonable value for
GCGraceSeconds. That is, you make sure that you'll always put back up
failing nodes and run repair within m seconds. Then, if you always use
a TTL of n (in your case 24 hours), the actual GCGraceSeconds that you
should set is m - n.
So putting a GCGrace of 0 in your case would be roughly equivalent to
setting a GCGrace of 24h on a "normal" CF. That's probably a bit low.
What do you mean by "normal"? If I were to set GCGrace to 0, would I risk
data corruption? Wouldn't setting GCGrace to 0 help reduce disk space
pressure?
Actually, if you really only use TTL on that column family and you always
set the same TTL, it's OK to set a GCGrace of 0.
If you don't always put the same TTL, the kind of scenario that could
happen is this:
- you insert a column with ttl=24h.
- after 3h, you overwrite the column with a ttl of 2h.
At that point, you expect that you have updated the column ttl so that
it will only live 2 more hours. However, if you have GCGrace=0 and you
are unlucky enough that a node got the first insert but not the second
one, and stays dead for more than 2h, then when you put it back up, it
will not receive anything related to the second insert, because the
column has expired and no tombstone has been created for it (since
GCGrace=0), and thus the initial column will reappear (for a few hours,
but still). If you have a bigger value for GCGrace, then when the
failing node is back up, it will receive a tombstone about the second
insert and thus will not make the first insert reappear.

So the rule is: if you never lower the TTL of a column (extending it is
fine), you can safely set GCGrace to 0.
