Discussion:
Instability and memory problems
James Golick
2010-06-20 16:24:30 UTC
As I alluded to in another post, we just moved from 2 to 4 nodes. Since
then, the cluster has been incredibly unstable.

The memory problems I've posted about before have gotten much worse, and our
nodes are becoming incredibly slow/unusable every 24 hours or so. Basically,
the JVM reports that only 14GB is committed, but the RSS of the process is
22GB. Cassandra becomes completely unresponsive, but requests are still being
routed to it internally, which completely destroys performance.
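For anyone who wants to see this gap on their own node, here is a minimal, illustrative Java sketch (class name is made up, Linux-only) that compares the memory the JVM reports as committed with the VmRSS figure from /proc/self/status; the difference is roughly what lives outside the managed heap, such as mmap()ed files:

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch: compare JVM "committed" memory with the process RSS.
// A large gap usually means memory outside the managed heap, e.g. mmap()ed files.
public class CommittedVsRss {
    public static void main(String[] args) throws IOException {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long committed = mem.getHeapMemoryUsage().getCommitted()
                       + mem.getNonHeapMemoryUsage().getCommitted();

        // VmRSS is reported in kB in /proc/self/status (Linux only).
        long rssKb = Files.lines(Paths.get("/proc/self/status"))
                .filter(line -> line.startsWith("VmRSS:"))
                .map(line -> Long.parseLong(line.replaceAll("\\D+", "")))
                .findFirst()
                .orElse(0L);

        System.out.printf("JVM committed: %d MB, process RSS: %d MB%n",
                committed / (1024 * 1024), rssKb / 1024);
    }
}
```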

I'm at a loss for how to diagnose this issue.

In addition to that, read performance has gone way downhill, and query
latency is much higher than it was with a 2-node cluster. Perhaps this was
to be expected, though.

We really like cassandra for the most part, but these stability issues are
going to force us to abandon it. Our application is like a yoyo right now,
and we can't live with that.

Help resolving these issues would be greatly appreciated.
Ran Tavory
2010-06-20 16:30:39 UTC
I don't have the answer, but if you provide jmap output and cfstats output,
that may help.
Are you using mmap()ed files?
Do you see swapping? GC activity in the logs?

On Jun 20, 2010 7:25 PM, "James Golick" <***@gmail.com> wrote:

[original message quoted in full]
Peter Schuller
2010-06-20 18:21:24 UTC
Post by James Golick
The memory problems I've posted about before have gotten much worse and our
nodes are becoming incredibly slow/unusable every 24 hours or so. Basically,
the JVM reports that only 14GB is committed, but the RSS of the process is
22GB, and cassandra is completely unresponsive, but still having requests
routed to it internally, so it completely destroys performance.
I'm at a loss for how to diagnose this issue.
Sorry, I don't know the history of this (you mentioned you've alluded
to the problems before), so maybe I am being redundant or missing
something, but:

(1) Is the machine swapping? (Actively swapping in/out as reported by
e.g. vmstat)
(2) Do the logs indicate that GC is running excessively, thus
indicating an almost-out-of-heap condition?
(3) mmap():ed memory that is currently resident will count towards
RSS; if you're using mmap():ed I/O (the default), that is to be
expected.
(4) If you are using mmap():ed I/O, that is also in and of itself
something which can cause trouble if the operating system decides to
swap your application out in favor of the mmap():ed data.
(5) If you are swapping (see (1)), try switching from mmap():ed to
standard I/O (due to (4)), and/or try decreasing the swappiness if
you're on Linux (see /proc/sys/vm/swappiness).
(6) Is Cassandra CPU bound or disk bound in general, regardless of swapping?
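Regarding (1), an illustrative way to confirm active swapping without watching vmstat by hand is to sample the pswpin/pswpout counters in /proc/vmstat twice and diff them; this is only a sketch (Linux-specific, class name made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: detect active swapping by sampling /proc/vmstat twice.
// pswpin/pswpout count pages swapped in/out since boot (Linux only).
public class SwapCheck {
    static Map<String, Long> readVmstat() throws IOException {
        return Files.lines(Paths.get("/proc/vmstat"))
                .map(line -> line.split("\\s+"))
                .collect(Collectors.toMap(f -> f[0], f -> Long.parseLong(f[1])));
    }

    public static void main(String[] args) throws Exception {
        Map<String, Long> before = readVmstat();
        Thread.sleep(5000);
        Map<String, Long> after = readVmstat();

        long in  = after.get("pswpin")  - before.get("pswpin");
        long out = after.get("pswpout") - before.get("pswpout");
        System.out.printf("pages swapped in: %d, out: %d over 5s%n", in, out);
        if (in + out > 0) {
            System.out.println("The machine is actively swapping.");
        }
    }
}
```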
--
/ Peter Schuller
James Golick
2010-06-20 19:37:52 UTC
Post by Peter Schuller
(1) Is the machine swapping? (Actively swapping in/out as reported by
e.g. vmstat)
Yes, somewhat, although swappiness is set to 0.
Post by Peter Schuller
(2) Do the logs indicate that GC is running excessively, thus
indicating an almost-out-of-heap condition?
It runs, but I wouldn't say excessively.
Post by Peter Schuller
(3) mmap():ed memory that is currently resident will count towards
RSS; if you're using mmap():ed I/O (the default), that is to be
expected.
This is where I'm a little confused. I thought that mmap()'d IO didn't
actually allocate memory. I thought it was just IO through a faster code
path.
Post by Peter Schuller
(4) If you are using mmap():ed I/O, that is also in and of itself
something which can cause trouble if the operating system decides to
swap your application out in favor of the mmap():ed data.
(5) If you are swapping (see (1)), try switching from mmap():ed to
standard I/O (due to (4)), and/or try decreasing the swappiness if
you're on Linux (see /proc/sys/vm/swappiness).
I tried switching to standard IO mode, but it was very, very slow. What I'm
confused about here is that if mmap()'d IO actually allocates memory that
can put pressure on other processes' memory, is there no way to bound that?
If not, how can anybody safely use mmap()'d IO on the JVM without risking
pushing their process's important pages out of memory?

swappiness is already at 0.
Post by Peter Schuller
(6) Is Cassandra CPU bound or disk bound in general, regardless of swapping?
Hard to tell because of the paging.
James Golick
2010-06-20 19:58:02 UTC
uh. wow. I just read up on all this again, and read the code, and I'm a
little surprised, to be honest.

There's no attempt to manage the total size of the mmap()'d IO, and the
default buffer allocation is quite sizeable. So, basically, if you have any
data, over time, you will run out of memory, and there's no way at all to
control it.

Can we consider changing the default?
On Sun, Jun 20, 2010 at 2:21 PM, Peter Schuller wrote:
[earlier messages quoted in full]
James Golick
2010-06-20 21:23:58 UTC
I opened #1214 about this. I hope people will take a look and provide their
feedback.

https://issues.apache.org/jira/browse/CASSANDRA-1214

Thanks.
[earlier messages quoted in full]
Peter Schuller
2010-06-21 07:09:30 UTC
Post by James Golick
Post by Peter Schuller
(1) Is the machine swapping? (Actively swapping in/out as reported by
e.g. vmstat)
Yes, somewhat, although swappiness is set to 0.
Ok. While I have no good suggestion to fix it other than moving away
from mmap(), given that a low swappiness didn't help, I'd say that as
long as you're swapping, you're pretty screwed as far as production
systems and maintaining low latency go. That is, unless you're
definitely swapping less than what might account for the performance
issues you're having.
Post by James Golick
It runs, but I wouldn't say excessively.
Ok.
Post by James Golick
Post by Peter Schuller
(3) mmap():ed memory that is currently resident will count towards
RSS; if you're using mmap():ed I/O (the default), that is to be
expected.
This is where I'm a little confused. I thought that mmap()'d IO didn't
actually allocate memory. I thought it was just IO through a faster code
path.
(The below refers only to mmap() as used when mapping files; mmap() in
and of itself is used for other purposes too, such as by malloc()
under some conditions. Please remember this even though I don't repeat
it on every mention.)

What mmap() does when used to map files is allocate address space in
virtual memory, which the operating system does not need to actually
back with physical RAM (though it may need swap, depending on whether
the operating system is configured to allow over-commit).

The application then proceeds touching pages of memory in the range
allocated by mmap() and it is up to the kernel to page data in and out
using some algorithm that is up to the operating system. Often
something similar to LRU behavior is used with respect to page
eviction, and during page-in read-ahead may be applied.

The "faster" bit comes from the fact that for data that is already
paged into memory, your program is doing nothing but touching memory
through the normal virtual memory system. No system call is required,
no data is copied to/from user space for reads, and writes are flushed
back asynchronously.

A downside with mmap() (in my opinion) is that your application no
longer has control over when/what is being read from or written to
disk since it is entirely up to the operating system. It also tends to
be more difficult to understand what is going on when a system is
under high I/O load; such as what the memory is in fact being used
for, what is causing disk I/O, etc.

A related problem, in the sense that the operating system is the one in
control, is that the operating system does not know what you, the
application, know. One specific problem in this area is how the
mmap():ed data should be balanced against the application's own memory
(some combination of brk()- and anonymous mmap()-backed address
space).

If the operating system makes the "wrong" decision, such as swapping
out the JVM, you've got a problem. And it is not always trivial to
fix. If someone knows how to convince Linux to de-prioritize mmap():ed
I/O, other than decreasing swappiness, I'd love to hear about it.

Anyways: The problem in cases like these is that while mmap() does
give you a performance boost under some circumstances along some axis
of performance measurement, you also lose control - and if the
operating system doesn't happen to do what you want it to do, the OS
does not always give you appropriate tuning/control facilities.

But to be clear - no, mmap():ing, say, 1 TB of memory does not imply
that you actually need that much physical RAM available. It's just
that the memory that *is* paged into physical RAM at any given moment,
accounts towards RSS of the process (on Linux).
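To make the paging behavior concrete, here is a small illustrative Java sketch (class name made up) that maps a file read-only and touches it page by page; the map() call only reserves address space, and RSS grows only as pages are actually faulted in:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative sketch: map a file read-only and touch one byte per 4 kB page.
// The map() call reserves address space; physical pages are faulted in lazily
// as they are touched, and only those resident pages count toward the RSS.
public class MmapTouch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel channel = raf.getChannel()) {
            // A single mapping is limited to 2 GB because ByteBuffer uses int offsets.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0,
                    Math.min(channel.size(), Integer.MAX_VALUE));
            long checksum = 0;
            for (int pos = 0; pos < buf.limit(); pos += 4096) {
                checksum += buf.get(pos); // touching the page faults it into RAM
            }
            System.out.println("touched " + (buf.limit() / 4096 + 1)
                    + " pages, checksum " + checksum);
        }
    }
}
```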

In your case: I'm not sure what the load is on your cluster. Is it
possible the periods of poor performance are correlated with
concurrent mark/sweep phases in the CMS GC? If the JVM is getting
swapped out slowly over time, you would expect this to primarily apply
to data outside of the active working set. Then when the mark/sweep GC
finally kicks in, touching most of the JVM heap, you begin (1)
swapping, causing the CMS process itself to be slow, and (2)
drastically change the set of data cached in RAM.

How much of your physical RAM is dedicated to the JVM?

I forgot to say that you probably should consider lowering it
significantly (to be continued, getting off the subway...).
Post by James Golick
I tried switching to standard IO mode, but it was very, very slow. What I'm
confused about here is that if mmap()'d IO actually allocates memory that
can put pressure on other processes' memory, is there no way to bound that?
If not, how can anybody safely use mmap()'d IO on the JVM without risking
pushing their process's important pages out of memory.
swappiness is already at 0.
You can use mmap() safely mostly because of the behavior described
above: the operating system can dynamically choose what to keep in
physical memory and what not to. But you do need the address *space*
(which tends to be a problem on 32-bit platforms, and in the case of
the JVM, which for legacy reasons can only mmap() 2 GB at a time).
--
/ Peter Schuller
Peter Schuller
2010-06-21 07:30:03 UTC
Post by Peter Schuller
How much of your physical RAM is dedicated to the JVM?
I forgot to say that you probably should consider lowering it
significantly (to be continued, getting off the subway...).
So, it occurred to me that you reported a 16 GB maximum heap size. If that
is a substantial portion of your total physical RAM, that would
probably be the reason why you see swapping. The operating system will
tend to use some heuristics to figure out what to keep in memory, and
whether to evict pages from cache or actually swap out applications.
The greater the memory pressure the greater the chance that the
operating system will start swapping you out. Swappiness 0 won't help
in this case.

(For some reason it never occurred to me before, but the JVM should,
if it doesn't already, provide a command line argument to make it use
mlock()/mlockall() to lock data in memory...)
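For what it's worth, something along these lines can already be done from user code with JNA; this is only a hedged sketch (class name made up, constants are the Linux values), not an existing JVM flag:

```java
import com.sun.jna.Library;
import com.sun.jna.Native;

// Illustrative sketch: lock the process's pages in RAM via mlockall(2) using JNA.
// Requires the JNA jar on the classpath and either root or a suitable
// "ulimit -l" / CAP_IPC_LOCK setting; constants are the Linux values.
public class LockMemory {
    private static final int MCL_CURRENT = 1; // lock pages currently mapped
    private static final int MCL_FUTURE  = 2; // lock pages mapped in the future

    public interface CLibrary extends Library {
        CLibrary INSTANCE = (CLibrary) Native.loadLibrary("c", CLibrary.class);
        int mlockall(int flags);
    }

    public static void main(String[] args) {
        int rc = CLibrary.INSTANCE.mlockall(MCL_CURRENT | MCL_FUTURE);
        System.out.println(rc == 0
                ? "mlockall succeeded; JVM pages will not be swapped out"
                : "mlockall failed (check memlock limits)");
    }
}
```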

Anyways: The solution in this case, if your 16 GB heap size is a
significant portion of system memory, is probably not to force the OS
not to swap, but rather to lower the JVM heap size. Consider that any
memory used by the JVM for its heap is memory that cannot be used for
caching disk I/O. I am speculating now on what your settings are, but suppose:

(1) Your *actual* need for JVM heap size is roughly 1 GB (I picked
something random and low).
(2) Your cassandra is configured to not keep extreme amounts of data
cached in-JVM (thus causing the real need to be roughly 1 GB, among
other things).
(3) Your maximum heap size is 16 GB.

(3) means that garbage collection becomes more efficient, because you
have lots more memory than you need, if looked at in isolation. But if
you are wasting 15 GB of would-be disk caching on maintaining the
oversize JVM heap, you will instead completely kill performance.

Probably there are two extremes, and a spectrum in between:

(1) Tune cassandra and the JVM to gobble up a *lot* of memory and
cache data in the JVM/cassandra.
(2) Tune cassandra and the JVM to gobble up an appropriate amount of
memory, and leave as much as possible for the operating system cache
mechanism.

(2) is likely to be a lot more performant than (1), assuming it works
and you don't swap.

If you do have a significantly oversized heap, one potential resulting
behavior is overall poor performance due to lack of caching, and
periodically much worse performance in relation to swap storms during
GCs.
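As an illustrative sanity check of that ratio, a few lines of Java can compare the configured maximum heap with total physical RAM (class name made up; the physical-memory call uses the com.sun.management extension, so it is HotSpot-specific):

```java
import java.lang.management.ManagementFactory;

// Illustrative sketch: compare the configured maximum heap with total physical
// RAM. If the heap is most of the machine, little is left for the OS page cache
// and the risk of the JVM itself being swapped out goes up.
public class HeapVsRam {
    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();

        // HotSpot-specific extension of the standard OperatingSystemMXBean.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        long physical = os.getTotalPhysicalMemorySize();

        System.out.printf("max heap: %d MB, physical RAM: %d MB (%.0f%% of RAM)%n",
                maxHeap / (1024 * 1024), physical / (1024 * 1024),
                100.0 * maxHeap / physical);
    }
}
```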
--
/ Peter Schuller
James Golick
2010-06-21 15:07:32 UTC
Just an update here. We're now entirely on standard IO mode, and everything
is stable and happy. There hasn't been much of a performance hit, if at all.

- James
[previous message quoted in full]
Peter Schuller
2010-06-21 16:24:51 UTC
Post by James Golick
Just an update here. We're now entirely on standard IO mode, and everything
is stable and happy. There hasn't been much of a performance hit, if at all.
Cool. Just be aware that if my speculation was correct, namely that
you're (1) dedicating a very large portion of system memory to
cassandra, but (2) most of that is unused in the sense of not holding
live data, then you are probably going to see more disk I/O on reads
than you would otherwise due to less data fitting in cache (assuming
your access pattern has some kind of locality to make caching
effective, and assuming you did not crank up the in-JVM caching when
switching to standard I/O mode).
--
/ Peter Schuller
James Golick
2010-06-21 17:15:01 UTC
On Mon, Jun 21, 2010 at 12:24 PM, Peter Schuller wrote:
[previous message quoted in full]
Of course. We've got about 75% of available memory dedicated to Cassandra
row caches, which leaves just under 6GB for fs caches, which is just under
10% of the load on the machine. We're probably going to put a lot more
memory in these nodes at this point.