Discussion: Sporadic high IO bandwidth and Linux OOM killer
Oleksandr Shulgin
2018-12-05 17:38:25 UTC
Hello,

We are running the following setup on AWS EC2:

Host system (AWS AMI): Ubuntu 14.04.4 LTS,
Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5
08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Cassandra process runs inside a docker container.
Docker image is based on Ubuntu 18.04.1 LTS.

Apache Cassandra 3.0.17, installed from .deb packages.

$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

We have a total of 36 nodes. All are r4.large instances, they have 2 vCPUs
and ~15 GB RAM.
On each instance we have:
- 2TB gp2 SSD EBS volume for data and commit log,
- 8GB gp2 SSD EBS for system (root volume).

Non-default settings in cassandra.yaml:
num_tokens: 16
memtable_flush_writers: 1
concurrent_compactors: 1
snitch: Ec2Snitch

JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
Garbage collection: CMS with default settings.

We repair once a week using Cassandra Reaper: parallel, intensity 1, 64
segments per node. The issue also happens outside of repair time.

The symptoms:
============

Sporadically a node becomes unavailable for a period of time ranging from a
few minutes to a few hours.  According to our analysis, and as pointed out
by the AWS support team, the unavailability is caused by exceptionally high
read bandwidth on the *root* EBS volume.  I repeat: on the root volume,
*not* on the data/commitlog volume.  Basically, the amount of IO exceeds
the instance's bandwidth (~52MB/s) and all other network communication
becomes impossible.

The root volume contains the operating system, the docker container with
OpenJDK and the Cassandra binaries, and the logs.

Most of the time, whenever this happens it is too late to SSH into the
instance to troubleshoot: it becomes completely unavailable within a very
short period of time.
Rebooting the affected instance brings it back to life.

Starting from the middle of last week we have seen this problem repeatedly
1-3 times a day, affecting different instances in a seemingly random
fashion. Most of the time it affects only one instance, but we've had one
incident when 9 nodes (3 from each of the 3 Availability Zones) were down
at the same time due to this exact issue.

Actually, we've had the same issue on this very cluster around 3 months ago
(early to mid-September 2018).  At that time we were running on m4.xlarge
instances (these have 4 vCPUs and 16GB RAM).

As a mitigation measure we migrated away from those to r4.2xlarge.  Then we
didn't observe any issues for a few weeks, so we scaled down twice: first
to r4.xlarge and then to r4.large.  The last migration was completed before
Nov 13th.  No changes to the cluster or application have happened since
then.

Now, after some weeks the issue appears again...

When we are not fast enough to react and reboot the affected instance, we
can see that ultimately the Linux OOM killer kicks in and kills the java
process running Cassandra.  After that the instance becomes available
almost immediately.  This allows us to rule out other background processes
as potential offenders.

We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB
and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via
Jolokia).  At the same time, Committed_AS on each host is constantly around
11-12GB, as reported by atop(1) and Prometheus.
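
For reference, this is roughly how such a value can be read, assuming the
Jolokia JVM agent is listening on its default port 8778 (host, port and the
trimmed output below are only illustrative):

$ curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage
{"value": {"init": 8589934592, "committed": 8589934592, "max": 8589934592, "used": 3221225472}, "status": 200}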

We are running atop with a sampling interval of 60 seconds.  After the fact
we observe that the java process is responsible for most of the disk
activity during the unavailability period.  We also see kswapd0 high on the
list from time to time, which always has 0K reads, but non-zero write
bandwidth.  There is no swap space defined on these instances, so it's not
really clear why kswapd appears at the top of the list at all (measurement
error?).
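
As an aside, a quick way to see whether kswapd is reclaiming page cache
rather than swapping is something like the following (a sketch, assuming
sysstat is installed):

$ sar -B 1 5
# pgscank/s, pgscand/s and pgsteal/s show kernel page-reclaim activity;
# with no swap configured, sustained high values here point at page-cache
# churn rather than anonymous memory being swapped out.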

We also attempted to troubleshoot by running jstack, jmap and pmap against
the Cassandra process in the background every few minutes.  The idea was to
compare dumps taken before and during unavailability, but that hasn't led
to any findings so far.  Ultimately we had to stop doing this, once we saw
that jmap can also become stuck burning CPU cycles.  The output of jmap is
not that useful anyway, but we fear that jstack might also exhibit the same
behavior.  So we wanted to avoid making the issue worse than it currently
is and disabled this debug sampling.

Now to my questions:

1. Is there anything in Cassandra or in the JVM that could explain suddenly
reading from a non-data volume at such a high rate, for prolonged periods
of time, as described above?

2. Why does JVM heap utilization never reach the 8GB that we provide to
it?

3. Why is committed virtual memory so much bigger than the sum of the heap
and off-heap reported by JMX?  To what can this difference be attributed?
I've just visited a node at random and collected the "off heap memory used"
numbers reported by nodetool cfstats, and still I see only 2.6GB in total,
while Committed_AS is ~12GB.  Is there a more direct way to monitor
off-heap memory usage by the JVM?
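
(Aside: one candidate for a more direct view is the JVM's own Native Memory
Tracking; a sketch, assuming the documented overhead of summary mode is
acceptable:

# add to the JVM options and restart the node:
#   -XX:NativeMemoryTracking=summary
# then, against the running process:
$ jcmd <pid> VM.native_memory summary
# Note: this covers memory the JVM itself reserves (heap, metaspace, threads,
# GC, internal); file-backed mmap regions of SSTables generally do not show
# up here.)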

4. The only Jira issue we've found related to Linux OOM is this one:
https://issues.apache.org/jira/browse/CASSANDRA-13931  It might be related
to our OOM, but it still doesn't explain the unexpected IO anomalies.

I would really appreciate any hints / pointers / insights! The more I
learn about this issue, the less I understand it...

Regards,
--
Alex
Riccardo Ferrari
2018-12-05 18:34:27 UTC
Hi Alex,

I saw that behaviour in the past.  I can tell you the kswapd0 usage is
connected to the `disk_access_mode` property.  On 64-bit systems it
defaults to mmap.  That also explains why your virtual memory is so high
(it roughly matches the node load, right?).  I cannot find a good
reference, however; googling for "kswapd0 cassandra" will turn up quite a
few resources.
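
A quick way to check what a node actually picked (assuming the default
deb-package log location), plus the override, which is an undocumented
cassandra.yaml setting, so please verify before relying on it:

$ grep -i DiskAccessMode /var/log/cassandra/system.log
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap

# accepted values are auto, mmap, mmap_index_only and standard; e.g. to mmap
# only the index files and read data files through the page cache normally:
#   disk_access_mode: mmap_index_only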

HTH,

Jon Meredith
2018-12-05 18:43:55 UTC
The kswapd issue is interesting.  Is it possible you're being affected by
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 ?  Although I
don't see a fix for Trusty listed there.
Jonathan Haddad
2018-12-05 18:52:39 UTC
Seeing high kswapd usage means there's a lot of churn in the page cache.
It doesn't mean you're using swap, it means the box is spending time
clearing pages out of the page cache to make room for the stuff you're
reading now. The machines don't have enough memory - they are way
undersized for a production workload.

Things that make it worse:
* high readahead (use 8KB on SSD)
* high compression chunk length when reading small rows / partitions.
Nobody specifies this, and the 64KB default is awful.  I almost always
switch to 4KB-16KB here (see the sketch below), but on these boxes you're
kind of screwed since you're already basically out of memory.
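
A sketch of how both knobs could be adjusted; the device name, keyspace and
table below are placeholders, and existing SSTables only pick up a new
chunk length as they get rewritten by compaction:

# readahead is specified in 512-byte sectors, so 16 sectors = 8KB:
$ sudo blockdev --getra /dev/xvdf        # show current readahead (sectors)
$ sudo blockdev --setra 16 /dev/xvdf     # set it to 8KB

# compression chunk length is a per-table setting:
$ cqlsh -e "ALTER TABLE my_ks.my_table WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};"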

I'd never put Cassandra in production with less than 30GB ram and 8 cores
per box.
--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
Oleksandr Shulgin
2018-12-05 20:58:37 UTC
Post by Jonathan Haddad
Seeing high kswapd usage means there's a lot of churn in the page cache.
It doesn't mean you're using swap, it means the box is spending time
clearing pages out of the page cache to make room for the stuff you're
reading now.
Jon,

Thanks for your thoughts!

Post by Jonathan Haddad
The machines don't have enough memory - they are way undersized for a
production workload.
Well, they were doing fine since around February this year.  The issue
started to appear out of the blue.
Post by Jonathan Haddad
* high readahead (use 8kb on ssd)
* high compression chunk length when reading small rows / partitions.
Nobody specifies this, 64KB by default is awful. I almost always switch to
4KB-16KB here but on these boxes you're kind of screwed since you're
already basically out of memory.
That's interesting, even though to my understanding Cassandra is mostly
doing sequential IO.  But I'm not sure this is really relevant to the issue
at hand, as the reading is done from the root device.

What could it be reading from there?  After the JVM has started up and the
config file is parsed, I really don't see why it should read anything
additional.  Or am I missing something?

To make it clear: normally the root EBS volume on these nodes is doing at
most 10 reads per second.  When the issue starts, reads per second jump to
hundreds within a few minutes (sometimes there's a preceding period of slow
build-up, but in the end it's really exponential).
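
For context, this is the kind of per-device view we use to tell the two
volumes apart (an illustrative command; on these instances the root device
typically shows up as xvda and the data volume as a separate xvd* device):

$ iostat -xdm 5
# r/s and rMB/s on the root device are normally close to zero here; during
# an incident they climb towards the instance's EBS bandwidth limit
# (~52MB/s), while the data/commitlog device stays unremarkable.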

Post by Jonathan Haddad
I'd never put Cassandra in production with less than 30GB ram and 8 cores
per box.
We had to tweak the heap size once we started to run repairs, because the
default heuristic aimed too low for us.  Otherwise, as I've said, we've
seen zero problems with our workload.

Cheers,
--
Alex
Oleksandr Shulgin
2018-12-05 21:36:45 UTC
Post by Riccardo Ferrari
Hi Alex,
I saw that behaviour in the past.
Riccardo,

Thank you for the reply!

Do you refer to the kswapd issue only, or have you observed more problems
that match the behavior I have described?

Post by Riccardo Ferrari
I can tell you the kswapd0 usage is connected to the `disk_access_mode`
property.  On 64-bit systems it defaults to mmap.
Hm, that's interesting, I will double-check.

Post by Riccardo Ferrari
That also explains why your virtual memory is so high (it roughly matches
the node load, right?).
Not sure what you mean by "load" here.  We have a bit less than 1.5TB per
node on average.

Regards,
--
Alex
Riccardo Ferrari
2018-12-06 10:14:30 UTC
Alex,

I had a few instances in the past that were showing that unresponsiveness
behaviour.  Back then I saw with iotop/htop/dstat and the like that the
system was stuck on a single thread processing (full throttle) for seconds.
According to iotop that was the kswapd0 process.  That system was Ubuntu
16.04, specifically "Ubuntu 16.04.4 LTS".

From there I started to dig into why the kswapd process was involved on a
system with no swap, and found that it is used for mmapping.  This erratic
(allow me to say erratic) behaviour was not showing up when I was on 3.0.6,
but started right after upgrading to 3.0.17.

By "load" I refer to the load as reported by `nodetool status`.  On my
systems, when disk_access_mode is auto (read: mmap), the virtual memory is
roughly the sum of the node load plus the JVM heap size.  Of course this is
just what I noted on my systems; not really sure if that should be the case
on yours too.
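
A rough way to see this split on a running node (a sketch; the pgrep
pattern assumes the daemon's main class is visible on the command line):

$ pid=$(pgrep -f CassandraDaemon)
$ grep -E 'VmSize|VmRSS' /proc/$pid/status
# VmSize (virtual) grows with every mmapped SSTable on top of the heap, so
# it can approach data-load-plus-heap; VmRSS only counts pages currently
# resident in physical memory.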

I hope someone with more experience than me will add a comment about your
settings.  Reading the configuration file, writers and compactors should be
2 at minimum.  I can confirm that when I tried in the past to change
concurrent_compactors to 1, I had really bad things happening (high system
load, high message drop rate, ...).

I have the "feeling" that when running on constrained hardware the
underlying kernel optimization is a must.  I agree with Jonathan H. that
you should think about increasing the instance size; CPU and memory matter
a lot.

Best,


Oleksandr Shulgin
2018-12-06 13:32:41 UTC
Post by Riccardo Ferrari
I had a few instances in the past that were showing that unresponsiveness
behaviour.  Back then I saw with iotop/htop/dstat and the like that the
system was stuck on a single thread processing (full throttle) for seconds.
According to iotop that was the kswapd0 process.  That system was Ubuntu
16.04, specifically "Ubuntu 16.04.4 LTS".
Riccardo,

Did you by chance also observe Linux OOM? How long did the
unresponsiveness last in your case?

Post by Riccardo Ferrari
From there I started to dig into why the kswapd process was involved on a
system with no swap, and found that it is used for mmapping.  This erratic
(allow me to say erratic) behaviour was not showing up when I was on 3.0.6,
but started right after upgrading to 3.0.17.
By "load" I refer to the load as reported by `nodetool status`.  On my
systems, when disk_access_mode is auto (read: mmap), the virtual memory is
roughly the sum of the node load plus the JVM heap size.  Of course this is
just what I noted on my systems; not really sure if that should be the case
on yours too.
I've checked and indeed we are using disk_access_mode=auto (well,
implicitly, because it's not even part of the config file anymore):
DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.

Post by Riccardo Ferrari
I hope someone with more experience than me will add a comment about your
settings.  Reading the configuration file, writers and compactors should be
2 at minimum.  I can confirm that when I tried in the past to change
concurrent_compactors to 1, I had really bad things happening (high system
load, high message drop rate, ...).
As I've mentioned, we did not observe any other issues with the current
setup: system load is reasonable, no dropped messages, no big number of
hints, request latencies are OK, no big number of pending compactions.
Also during repair everything looks fine.

Post by Riccardo Ferrari
I have the "feeling" that when running on constrained hardware the
underlying kernel optimization is a must.  I agree with Jonathan H. that
you should think about increasing the instance size; CPU and memory matter
a lot.
How did you solve your issue in the end?  You didn't roll back to 3.0.6?
Did you tune kernel parameters?  Which ones?

Thank you!
--
Alex
Riccardo Ferrari
2018-12-06 14:29:59 UTC
Hi,

To be honest I've never seen the OOM killer in action on those instances.
My Xmx was 8GB just like yours, which makes me think you have some other
process competing for memory.  Is that the case?  Do you have any cron job,
backup, or anything else that could provoke the OOM killer?

My unresponsiveness was seconds long.  This is/was bad because the gossip
protocol was going crazy, marking nodes down, with all the consequences
this can have in a distributed system: think about hints, dynamic snitch,
and whatever else depends on node availability...
Can you share some numbers from your `tpstats`, or about system load in
general?

On the tuning side I just went through the following article:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
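
For reference, a few of the settings commonly called out there and in the
Cassandra docs; the values below are the usual recommendations rather than
anything I verified against your workload, so treat them as a starting
point:

# e.g. /etc/sysctl.d/99-cassandra.conf
vm.max_map_count = 1048575
vm.swappiness = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

$ sudo sysctl -p /etc/sysctl.d/99-cassandra.conf
# plus the usual limits for the cassandra user (nofile, nproc, memlock) in
# /etc/security/limits.d/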

No rollbacks, just moving forward!  Right now we are upgrading the instance
size to something more recent than m1.xlarge (for many different reasons,
including security, ECU and network).  Nevertheless, it might be a good
idea to upgrade to the 3.x branch to leverage the better off-heap memory
management.

Best,


Oleksandr Shulgin
2018-12-07 11:43:01 UTC
Post by Riccardo Ferrari
To be honest I've never seen the OOM killer in action on those instances.
My Xmx was 8GB just like yours, which makes me think you have some other
process competing for memory.  Is that the case?  Do you have any cron job,
backup, or anything else that could provoke the OOM killer?
Riccardo,

As I've mentioned previously, apart from docker running Cassandra on the
JVM, there is a small number of housekeeping processes, namely cron to
trigger log rotation, a log-shipping agent, a node metrics exporter
(Prometheus) and some other small things.  None of those come close to
Cassandra in their memory requirements, and they routinely appear pretty
low in the memory usage reports from atop and similar tools.  The overhead
of these seems to be minimal.

Post by Riccardo Ferrari
My unresponsiveness was seconds long.  This is/was bad because the gossip
protocol was going crazy, marking nodes down, with all the consequences
this can have in a distributed system: think about hints, dynamic snitch,
and whatever else depends on node availability...
Can you share some numbers from your `tpstats`, or about system load in
general?
Here's some pretty typical tpstats output from one of the nodes:

Pool Name                    Active  Pending   Completed  Blocked  All time blocked
MutationStage                     0        0   319319724        0                 0
ViewMutationStage                 0        0           0        0                 0
ReadStage                         0        0    80006984        0                 0
RequestResponseStage              0        0   258548356        0                 0
ReadRepairStage                   0        0     2707455        0                 0
CounterMutationStage              0        0           0        0                 0
MiscStage                         0        0           0        0                 0
CompactionExecutor                1       55     1552918        0                 0
MemtableReclaimMemory             0        0        4042        0                 0
PendingRangeCalculator            0        0         111        0                 0
GossipStage                       0        0     6343859        0                 0
SecondaryIndexManagement          0        0           0        0                 0
HintsDispatcher                   0        0         226        0                 0
MigrationStage                    0        0           0        0                 0
MemtablePostFlush                 0        0        4046        0                 0
ValidationExecutor                1        1        1510        0                 0
Sampler                           0        0           0        0                 0
MemtableFlushWriter               0        0        4042        0                 0
InternalResponseStage             0        0        5890        0                 0
AntiEntropyStage                  0        0        5532        0                 0
CacheCleanupExecutor              0        0           0        0                 0
Repair#250                        1        1           1        0                 0
Native-Transport-Requests         2        0   260447405        0                18

Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 1
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0

Speaking of CPU utilization, it is consistently within 30-60% on all nodes
(and even less in the night).
Post by Riccardo Ferrari
On the tuning side I just went through the following article:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
No rollbacks, just moving forward!  Right now we are upgrading the instance
size to something more recent than m1.xlarge (for many different reasons,
including security, ECU and network).  Nevertheless, it might be a good
idea to upgrade to the 3.x branch to leverage the better off-heap memory
management.
One thing we have noticed very recently is that our nodes are indeed
running low on memory.  It even seems now that the IO is a side effect of
impending OOM, not the other way round as we thought initially.

After a fresh JVM start the memory allocation looks roughly like this:

             total       used       free     shared    buffers     cached
Mem:           14G        14G       173M       1.1M        12M       3.2G
-/+ buffers/cache:         11G       3.4G
Swap:           0B         0B         0B

Then, within a number of days, the allocated disk cache shrinks all the way
down to unreasonable numbers like only 150M. At the same time "free" stays
at the original level and "used" grows all the way up to 14G. Shortly
after that the node becomes unavailable because of the IO and ultimately
after some time the JVM gets killed.

Most importantly, the resident size of the JVM process stays at around
11-12G the whole time, like it was shortly after the start.  How can we
find out where the rest of the memory gets allocated?  Is it just some sort
of malloc fragmentation?
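
One thing we may try next is diffing the largest mappings over time; a
sketch (the pgrep pattern is an assumption about how the process shows up):

$ pid=$(pgrep -f CassandraDaemon)
$ pmap -x $pid | sort -n -k3 | tail -20
# column 3 of pmap -x is RSS in KB; large [anon] regions that keep growing
# between samples would point at native allocations outside the Java heap,
# while file-backed mappings of SSTables are just page cache.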

As we are running a relatively recent version of the JDK, we've tried the
option -Djdk.nio.maxCachedBufferSize=262144 on one of the nodes, as
suggested in this issue:
https://issues.apache.org/jira/browse/CASSANDRA-13931
But we didn't see any improvement.  Also, the expectation is that if this
were the issue in the first place, the resident size of the JVM process
would grow at the same rate as the available memory shrinks, correct?

Another thing we haven't found an answer to so far is why, within the JVM,
heap.used (<= 6GB) never reaches heap.committed = 8GB.  Any ideas?

Regards,
--
Alex
