Discussion:
memory_locking_policy parameter in cassandra.yaml for disabling swap - has this variable been renamed?
Stephen Henderson
2011-07-28 10:17:10 UTC
Hi,

We've started having problems with Cassandra and memory swapping on Linux, which seems to be a fairly common issue (in our particular case, after about a week all swap space has been used up and we have to restart the process).

It sounds like the general consensus is to just disable swap completely, but the recently released "Cassandra High Performance Cookbook" from Packt has instructions for "Stopping Cassandra from using swap without disabling it system-wide". We've tried following the instructions but they refer to a "memory_locking_policy" property in cassandra.yaml, which throws an "unknown property" error on startup, and I can't find any reference to it in the Cassandra docs.

I've copied the summarised instructions below; does anyone know if this is something that ever worked, or is there a different property that does the same thing? (We're using 0.7.4 at present and it looks like the book was written for 0.7.0-beta-1.10, so it might have been something that was abandoned during the beta?)
---
Disabling swap memory system-wide may not always be desirable. For example, if the system is not dedicated to running Cassandra, other processes on the system may benefit from swap. This recipe shows how to install Java Native Access (JNA), which allows Java to lock itself in memory, making it unevictable.

1. Place the jna.jar and platform.jar in the $CASSANDRA_HOME/lib directory.
2. Enable memory_locking_policy in $CASSANDRA_HOME/conf/cassandra.yaml: "memory_locking_policy: required"
3. Restart your Cassandra instance.
4. Confirm this configuration has taken effect by checking to see if a large portion of memory is Unevictable:
$ grep Unevictable /proc/meminfo
Unevictable:        1024 kB
---
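(A per-process variant of that check, not from the book, is to look at the VmLck line in /proc/<pid>/status for the Cassandra JVM, e.g.:

$ grep VmLck /proc/$(pgrep -f CassandraDaemon)/status

assuming pgrep -f matches your Cassandra process.)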


Thanks,
Stephen

Stephen Henderson - Lead Developer (Onsite), Cognitive Match
***@cognitivematch.com | http://www.cognitivematch.com
T: +44 (0) 203 205 0004 | F: +44 (0) 207 526 2226
Adi
2011-07-28 12:05:54 UTC
Post by Stephen Henderson
Hi,
We've started having problems with Cassandra and memory swapping on Linux,
which seems to be a fairly common issue (in our particular case, after
about a week all swap space has been used up and we have to restart the
process).
It sounds like the general consensus is to just disable swap completely,
but the recently released "Cassandra High Performance Cookbook" from Packt
has instructions for "Stopping Cassandra from using swap without disabling
it system-wide". We've tried following the instructions but they refer to
a "memory_locking_policy" property in cassandra.yaml, which throws an
"unknown property" error on startup, and I can't find any reference to it
in the Cassandra docs.
I've copied the summarised instructions below; does anyone know if this is
something that ever worked, or is there a different property that does the
same thing?
If you are having trouble preventing swapping, the other parameter that
can help is disk_access_mode.
We are using "mmap_index_only" and that has prevented swapping for now.

"auto" will try to use mmap for all disk access,
"mmap" will use mmap for both data and index files,
"mmap_index_only" will mmap only the index files,
"standard" will not use mmap at all.

Search for swapping on the users list and go through the email discussions
and JIRA issues related to swapping; that will give you an idea of what
can work for you.

-Adi
Jonathan Ellis
2011-07-28 12:52:12 UTC
This is not advisable in general, since non-mmap'd I/O is substantially slower.

The OP is correct that it is best to disable swap entirely, and
second-best to enable JNA for mlockall.
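
For completeness, turning swap off on a running Linux box is a one-liner
(remove the swap entries from /etc/fstab as well so it stays off after a
reboot):

$ sudo swapoff -a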
Post by Adi
If you are having trouble preventing swapping, the other parameter that
can help is disk_access_mode.
We are using "mmap_index_only" and that has prevented swapping for now.
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Terje Marthinussen
2011-07-28 14:03:43 UTC
Post by Jonathan Ellis
This is not advisable in general, since non-mmap'd I/O is substantially slower.
I see this again and again as a claim here, but it is actually close to 10 years since I last saw mmap'd I/O give any substantial performance benefit in any real-life use I have needed it for.

We have done a lot of testing of this with Cassandra as well, and I don't see anything conclusive. We have seen just as many tests where normal I/O was faster than mmap, and the differences may well be within statistical variance given the complexity and number of factors involved in something like a distributed Cassandra working at quorum.

mmap made a difference in 2000, when memory throughput was still measured in hundreds of megabytes/sec and CPU caches were a few kilobytes, but today you have megabytes of CPU cache with 100 GB/sec of bandwidth, and even memory bandwidth is in the tens of GB/sec.

I/O buffers, however, are generally quite small, and copying an I/O buffer from kernel to user space inside a cache with 100 GB/sec of bandwidth is really a non-issue given the I/O throughput Cassandra generates.

By 2005 or so, CPUs had already reached the point where I saw mmap perform worse than regular I/O in a large number of use cases.

Hard to say exactly why, but I saw one theory from a FreeBSD core developer back then speculating that the extra MMU work involved in some I/O loads may actually be slower than the cache-internal memcpy of tiny I/O buffers (they are pretty small, after all).

I don't have a personal theory here. I just know that, especially with large numbers of small I/O operations, regular I/O was typically faster than mmap, which would back up that theory.

So I wonder how people came to this conclusion, as I am unable, under any real-life use case with Cassandra, to reproduce anything resembling a significant difference, and we have been benchmarking on nodes with SSD setups which can churn out 1 GB/sec+ read speeds.

That is way more I/O throughput than most people have at hand, and still I cannot get mmap to give me better performance.

I do, albeit subjectively, feel that things just seem to work better with regular I/O for us. We currently have very nice and stable heap sizes regardless of I/O load, and we have an easier system to operate, as we can actually monitor how much memory the darned thing uses.

My recommendation? Stay away from mmap.

I would love to understand how people got to this conclusion, however, and to find out why we seem to see such different results!
Post by Jonathan Ellis
The OP is correct that it is best to disable swap entirely, and
second-best to enable JNA for mlockall.
Be a bit careful with removing swap completely. Linux is not always happy when it gets short on memory.

Terje
Jonathan Ellis
2011-07-28 20:16:37 UTC
If you're actually hitting disk for most or even many of your reads, then
mmap doesn't matter much, since the extra copy into a Java buffer is
negligible compared to the I/O itself (even on SSDs).
Post by Terje Marthinussen
So I wonder how people came to this conclusion, as I am unable, under any
real-life use case with Cassandra, to reproduce anything resembling a
significant difference, and we have been benchmarking on nodes with SSD
setups which can churn out 1 GB/sec+ read speeds.
Terje Marthinussen
2011-07-28 21:57:18 UTC
Benchmarks were done with up to 96 GB of memory, much more caching than most people will ever have.

The point, anyway, is that you are talking I/O in the tens or, at best, a few hundred MB/sec before Cassandra eats all your CPU (dual 6-core CPUs in our case).

The memcpy involved here, deep inside the kernel, will not be very high on the list of expensive operations.

The assumption also seems to be that mmap is "free" CPU-wise.
It clearly isn't. There is definitely work involved for the CPU when doing mmap as well; you just move it from context switching and small I/O-buffer copying to memory management.

Terje
Post by Jonathan Ellis
If you're actually hitting disk for most or even many of your reads, then mmap doesn't matter much, since the extra copy into a Java buffer is negligible compared to the I/O itself (even on SSDs).
Peter Schuller
2011-07-29 07:04:29 UTC
Post by Terje Marthinussen
It clearly isn't. There is definitely work involved for the CPU when doing
mmap as well; you just move it from context switching and small I/O-buffer
copying to memory management.
*All* memory access a process does is subject to the rules of the
memory management unit of the CPU, so that cost is not specific to
mmap():ed files (once a given page is in core that is).

(But again, I'm not arguing the point in Cassandra's case; just generally.)
--
/ Peter Schuller (@scode on twitter)
Peter Schuller
2011-07-28 21:29:36 UTC
Post by Terje Marthinussen
I would love to understand how people got to this conclusion, however, and to find out why we seem to see such different results!
I won't make any claims about Cassandra, because I have never bothered
benchmarking the difference in CPU usage; all my use cases have been more
focused on I/O efficiency. But I will say, without having benchmarked that
either, that *generally*, if you're doing small reads of data that is in
page cache, using mmap() something would have to be seriously wrong for it
not to be significantly faster than regular I/O.

There's just *no way* there is no performance penalty involved in making
the context switch to kernel space, validating syscall parameters, etc.
(not to mention the indirect effects on e.g. process scheduling) -
compared to simply *touching some virtual memory*. It's easy to benchmark
the maximum number of syscalls you can do per second, and I'll eat my left
foot if you're able to do more of those than touches of a piece of memory ;)
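
To make that concrete, a minimal harness along these lines would show the
gap (a hypothetical sketch, purely for illustration; it assumes a local
file that already sits entirely in page cache and, for simplicity, is
smaller than 2 GB so a single MappedByteBuffer covers it):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

public class SmallReadBench {
    static final int READS = 1_000_000;
    static final int READ_SIZE = 64;

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        FileChannel ch = raf.getChannel();
        int size = (int) ch.size();          // assumes the file is < 2 GB
        byte[] dst = new byte[READ_SIZE];
        long sink = 0;

        // Regular I/O: one positional read() syscall per access.
        ByteBuffer buf = ByteBuffer.wrap(dst);
        Random rnd = new Random(42);
        long t0 = System.nanoTime();
        for (int i = 0; i < READS; i++) {
            buf.clear();
            ch.read(buf, rnd.nextInt(size - READ_SIZE));
            sink += dst[0];
        }
        long tRead = System.nanoTime() - t0;

        // mmap: the same access is a plain memory load once the page is in core.
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
        rnd = new Random(42);                // replay the same access pattern
        t0 = System.nanoTime();
        for (int i = 0; i < READS; i++) {
            map.position(rnd.nextInt(size - READ_SIZE));
            map.get(dst);
            sink += dst[0];
        }
        long tMmap = System.nanoTime() - t0;

        System.out.printf("read(): %d ms  mmap: %d ms  (sink=%d)%n",
                          tRead / 1_000_000, tMmap / 1_000_000, sink);
        raf.close();
    }
}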

Obviously this does *not* mean that mmap():ed I/O will actually be
faster in some particular application. But I do want to make the point
that the idea that mmap():ed I/O is good for performance (in terms of
CPU) is definitely not arbitrary and unfounded.

Now, HERE is the kicker: with all the hoopla over mmap():ed I/O
and the benchmarks you see, as usual there are lies, damned lies and
benchmarks. It's pretty easy to come up with I/O patterns where mmap()
will be significantly slower than regular I/O (certainly on platters;
I'm guessing even with modern SSDs), because the method used to
communicate with the operating system (touching a page of memory) is
vastly different.

In the most obvious and simple case, consider an application that
needs to read exactly 50 MB of data, and knows it. Suppose the data is
not in page cache. Submitting a read() of exactly those 50 MB clearly
has at least the potential to be significantly more efficient
(assuming nothing is outright wrong) than touching pages in a
sequential fashion and (1) taking multiple, potentially quite a few,
page faults in the kernel, and (2) being reliant on
read-ahead/pre-fetching, which will never have enough knowledge to
predict your 50 MB read, so you'll invariably take more seeks (at least
potentially, with concurrent I/O) and probably read more than necessary
(since pre-fetching algorithms won't know when you'll be "done") than
if you simply state to the kernel your exact intent of reading exactly
50*1024*1024 bytes at a particular position in a file/device.
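
A sketch of that case (again hypothetical and illustration-only; it
assumes args[0] names a file of at least 50 MB and that you drop the page
cache between runs, e.g. via /proc/sys/vm/drop_caches, so both paths
actually hit disk):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class BigReadBench {
    static final int SIZE = 50 * 1024 * 1024;  // exactly 50 MB, and we know it
    static final int PAGE = 4096;

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        FileChannel ch = raf.getChannel();

        // One big read(): the kernel sees the full 50 MB intent up front.
        ByteBuffer buf = ByteBuffer.allocateDirect(SIZE);
        long t0 = System.nanoTime();
        while (buf.hasRemaining() && ch.read(buf) != -1) { }
        System.out.printf("read(): %d ms%n", (System.nanoTime() - t0) / 1_000_000);

        // mmap: the kernel learns our intent one page fault at a time and
        // has to guess the rest via read-ahead.
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
        long sink = 0;
        t0 = System.nanoTime();
        for (int pos = 0; pos < SIZE; pos += PAGE) {
            sink += map.get(pos);            // touch every page sequentially
        }
        System.out.printf("mmap:   %d ms (sink=%d)%n",
                          (System.nanoTime() - t0) / 1_000_000, sink);
        raf.close();
    }
}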

To some extent issues like these may affect Cassandra, but it's
difficult to measure. For example, if you're I/O bound and doing a lot
of range slices that are bigger than a single page, perhaps the
default 64 KB read size with standard I/O is eliminating unnecessary
seeks that you're otherwise taking when doing I/O by paging?
It's a hypothesis that is certainly plausible under some
circumstances, but difficult to validate or falsify. One can probably
construct a benchmark where there's no difference, yet see a
significant difference in a real-world scenario where your benchmarked
I/O is intermixed with other I/O. Not to mention subtle differences in
the behavior of kernels, RAID controllers, disk drive controllers, etc.
--
/ Peter Schuller (@scode on twitter)
Terje Marthinussen
2011-07-29 09:35:43 UTC
Post by Peter Schuller
Post by Terje Marthinussen
I would love to understand how people got to this conclusion, however,
and to find out why we seem to see such different results!
I won't make any claims about Cassandra, because I have never bothered
benchmarking the difference in CPU usage; all my use cases have been more
focused on I/O efficiency. But I will say, without having benchmarked that
either, that *generally*, if you're doing small reads of data that is in
page cache, using mmap() something would have to be seriously wrong for it
not to be significantly faster than regular I/O.
Sorry, with small reads I was thinking of small random reads, basically
things that are not very cacheable and probably cause demand paging.
For quite large reads, like tens of MB from disk, demand paging will not
be good for mmap performance. That is probably not the type of storage use
which is a stronghold of Cassandra either.

But you nicely list a lot of the things I did not take the time to write,
which just adds support for my original question: what is the origin of
the "mmap is substantially faster" claim?

You also need to throw in the fun question of how the JVM will
interact with all of this.

Given the number of people asking questions here out of confusion about
mmap, memory mapping and JNA, and the work of maintaining the mmap code,
I am somewhat curious whether it is worth it.

Different usages can generate vastly different loads on systems, so just
because our current usage scenarios do not seem to benefit from mmap,
other cases obviously can, and I am curious what those cases look like.

Terje
Jonathan Ellis
2011-07-29 17:33:07 UTC
Post by Terje Marthinussen
What is the origin of the "mmap is substantially faster" claim?
The origin is the performance testing we did when adding mmap'd I/O.

I believe Chris Goffinet also found a double-digit percentage
performance improvement at Digg and/or Twitter, but I don't remember
the details.
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Jonathan Ellis
2011-07-28 12:50:55 UTC
I don't think there's ever been a "memory_locking_policy" variable.
Cassandra will call mlockall if JNA is present, no further steps
required.
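
For reference, the mechanism is roughly this (a simplified sketch of what
the JNA path does, not Cassandra's actual CLibrary code; the constants are
the Linux <sys/mman.h> values, and jna.jar must be on the classpath):

import com.sun.jna.Native;

public class MemoryLock {
    private static final int MCL_CURRENT = 1;  // lock pages mapped now
    private static final int MCL_FUTURE  = 2;  // lock pages mapped later

    static {
        Native.register("c");  // direct-map the native method below onto libc
    }

    private static native int mlockall(int flags);

    public static void main(String[] args) {
        // Pin the JVM's pages into RAM so the kernel cannot swap them out.
        // Fails with ENOMEM if RLIMIT_MEMLOCK is too low for a non-root user.
        if (mlockall(MCL_CURRENT) == 0) {
            System.out.println("JVM memory locked");
        } else {
            System.out.println("mlockall failed; check RLIMIT_MEMLOCK (ulimit -l)");
        }
    }
}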

Post by Stephen Henderson
We've tried following the instructions but they refer to a
"memory_locking_policy" property in cassandra.yaml, which throws an
"unknown property" error on startup, and I can't find any reference to it
in the Cassandra docs.
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Teijo Holzer
2011-07-29 01:13:02 UTC
Hi,

Yes, I was looking for this config as well.

This is really simple to achieve:

Put the following line into /etc/security/limits.conf

cassandra - memlock 32

Then start Cassandra as the user cassandra, not as root. (Note there is
never a need to run Cassandra as root; all functionality can be achieved
as a normal user by changing the right configs.)

In the log you will see:

[2011-07-29 13:06:46,491] WARN: Unable to lock JVM memory (ENOMEM). This can
result in part of the JVM being swapped out, especially with mmapped I/O
enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root. (main CLibrary.java:118)

Done.

If you want to enable mlockall again, simply change the above 32 (kB) value
to the size of your RAM (e.g. 17825792 for a 16GB machine).
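
To confirm what limit the cassandra user actually gets, run ulimit as that
user; it prints the maximum locked memory in kB:

$ su - cassandra -c 'ulimit -l'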

We have also turned off mmap altogether and have never had any memory
issues since. Swap is happily enabled; we currently prefer stability over
performance.

Cheers,

T.
Post by Stephen Henderson
It sounds like the general consensus is to just disable swap completely,
but the recently released "Cassandra High Performance Cookbook" from Packt
has instructions for "Stopping Cassandra from using swap without disabling
it system-wide". We've tried following the instructions but they refer to
a "memory_locking_policy" property in cassandra.yaml, which throws an
"unknown property" error on startup.