Discussion:
Availability issues for write/update/read workloads (up to 100s downtime) in case of a Cassandra node failure
Daniel Seybold
2018-11-09 10:48:32 UTC
Hi Apache Cassandra experts,

we are running a set of availability evaluations under write/read/update
workloads with Apache Cassandra and experience some unexpected results,
i.e. 0 ops/s over periods of up to 100 s.

To provide a clear picture, please find below the details of (1) the
setup and (2) the evaluation workflow.

*1. Setup:*

Cassandra version: 3.11.2
Cluster size: 5 nodes
Replication Factor: 5
Each node runs in the same private OpenStack-based cloud, within the
same availability zone, and uses the private network.
Each node runs Ubuntu 16.04 Server and has 2 cores, 4 GB RAM and a
50 GB disk.

Workload:
Yahoo Cloud Serving Benchmark 0.12
W1: 100% write
W2: 100% read
W3: 100% update

*2. Evaluation Workflow: *

1. allocate 5 VMs & deploy DBMS cluster
2. start a YCSB workload (only one of W1-W3), which runs up to 30 minutes
3. wait for 200s
4. select a random node in the cluster and delete its VM without
stopping Cassandra first
5. analyze throughput time series over the evaluation

*3. (Unexpected) Results*

We expected to see a (slight) drop in throughput as soon as the VM
was deleted.
But the throughput results show that there are periods of ~10 s to
150 s (not deterministic) in which no operations are executed (all
metrics are collected on the client side).
Yet there are no timeout exceptions on the client side, and the logs on
the cluster side do not show anything that explains this behaviour.
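To make the metric concrete: by "downtime" we mean a maximal run of
seconds in the client-side throughput series with zero completed
operations. A minimal sketch of how such windows can be extracted (the
sample array is hypothetical, not our measured data):

import java.util.ArrayList;
import java.util.List;

public class DowntimeWindows {

    // Returns maximal runs of consecutive seconds with zero completed
    // operations, as {startSecond, endSecond} pairs (inclusive).
    static List<int[]> zeroThroughputWindows(long[] opsPerSecond) {
        List<int[]> windows = new ArrayList<>();
        int start = -1;
        for (int s = 0; s < opsPerSecond.length; s++) {
            if (opsPerSecond[s] == 0) {
                if (start < 0) start = s;               // window opens
            } else if (start >= 0) {
                windows.add(new int[] {start, s - 1});  // window closes
                start = -1;
            }
        }
        if (start >= 0) windows.add(new int[] {start, opsPerSecond.length - 1});
        return windows;
    }

    public static void main(String[] args) {
        // Hypothetical per-second throughput around the node failure.
        long[] ops = {950, 940, 0, 0, 0, 0, 910, 930};
        for (int[] w : zeroThroughputWindows(ops)) {
            System.out.printf("downtime: %d s (seconds %d-%d)%n",
                    w[1] - w[0] + 1, w[0], w[1]);
        }
    }
}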

I attached a series of plots which show the throughput and the downtimes
over the evaluation runs.

Do you have any explanation for this behaviour, or recommendations on
how to reduce the potential "downtime"?

Thanks in advance for any help and recommendations,

Cheers,
Daniel
--
M.Sc. Daniel Seybold

Universität Ulm
Institut Organisation und Management
von Informationssystemen (OMI)
Albert-Einstein-Allee 43
89081 Ulm
Phone: +49 (0)731 50-28 799
Durity, Sean R
2018-11-09 18:04:17 UTC
The VMs’ memory (4 GB) seems pretty small for Cassandra. What heap size are you using? Which garbage collector? Are you seeing long GC times on the nodes? The basic rule of thumb is to give the Cassandra heap 50% of the RAM on the host. 2 GB isn’t very much.

Also, I wouldn’t set the replication factor to 5 (the number of nodes). If RF is always equal to the number of nodes, you can’t really scale beyond the size of the disk on any one node (all data is on each node). A replication factor of 3 would be more like a typical production set-up.
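For example, a test keyspace with RF 3 could be created like this (just
a sketch using the Java driver; the contact point and keyspace name are
placeholders, not your actual values):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateKeyspaceRf3 {
    public static void main(String[] args) {
        // Placeholder contact point; use one of your node IPs.
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {
            // RF 3 in a 5-node cluster: each node holds ~3/5 of the data
            // and the cluster still tolerates node loss.
            session.execute("CREATE KEYSPACE IF NOT EXISTS ycsb "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
        }
    }
}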


Sean Durity

Daniel Seybold
2018-11-16 12:17:03 UTC
Hi Sean,

thanks for your comments; please find below some more details with
respect to (1) the VM sizing and (2) the replication factor:

(1) VM sizing:

We selected the small VMs as the initial setup for our experiments. We
have also executed the same experiments (5 nodes) on larger VMs with 6
cores and 12 GB memory (where 6 GB was allocated to Cassandra).

We use the default CMS garbage collector (with default settings), and
the debug.log and system.log do not show any suspicious GC messages.

(2) Replication factor

We set the RF to 5 as we want to emulate a scenario that is able to
survive multiple node failures. We have also tried an RF of 3 (in the
5-node cluster), but the downtime in case of a node failure persists.


I also attached two plots which show the results with the downtimes
when using the larger VMs and setting the RF to 3.

Any further comments much appreciated,

Cheers,
Daniel
--
M.Sc. Daniel Seybold

Universität Ulm
Institut Organisation und Management
von Informationssystemen (OMI)
Albert-Einstein-Allee 43
89081 Ulm
Phone: +49 (0)731 50-28 799
Alexander Dejanovski
2018-11-16 14:08:07 UTC
Hi Daniel,

it seems like the driver isn't detecting that the node went down, which
is probably due to the way the node is being killed.
If I remember correctly, in some cases the Netty transport is still up
in the client, which still allows queries to be sent without them ever
being answered: https://datastax-oss.atlassian.net/browse/JAVA-1346
Eventually, the node gets discarded when the heartbeat system catches up.
It's also possible that the stuck queries then eat up all the available
request slots in the driver, preventing any other query from being sent
from that JVM.
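If that is what is happening, lowering the heartbeat interval and the
read timeout on the driver side usually makes half-open connections to
the killed node get noticed and failed faster. A rough sketch for driver
3.x (the contact point and the values are illustrative only, not a
recommendation):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class FastFailureCluster {

    static Cluster build(String contactPoint) {
        // Heartbeat idle connections more often than the 30 s default so a
        // dead-but-not-closed connection is detected sooner.
        PoolingOptions pooling = new PoolingOptions()
                .setHeartbeatIntervalSeconds(5);

        // Fail in-flight requests after 5 s instead of the 12 s default so
        // stuck queries do not exhaust the connection's request slots.
        SocketOptions socket = new SocketOptions()
                .setReadTimeoutMillis(5000);

        return Cluster.builder()
                .addContactPoint(contactPoint)
                .withPoolingOptions(pooling)
                .withSocketOptions(socket)
                // Token-aware over DC-aware round-robin, i.e. the usual default policy.
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
    }
}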

Which version of the DataStax driver are you using for your tests?
How is it configured (load balancing policies, etc.)?
Do you have some debug logs on the client side that could help?

Thanks,
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
Daniel Seybold
2018-11-23 09:19:25 UTC
Hi Alexander,

thanks a lot for the pointers, I checked the mentioned issue.

While the reported issue seems to match our problem, it only occurs for
reads and not for writes (according to the DataStax Jira). However, we
experience downtimes for both writes and reads.
Post by Alexander Dejanovski
Which version of the Datastax Driver are you using for your tests?
We use version 3.0.0 of the driver.

I have also tried version 3.2.0 to rule out the JAVA-1346 issue you
mentioned, but the downtime behaviour remains the same.
Post by Alexander Dejanovski
How is it configured (load balancing policies, etc...) ?
Besides a write consistency level of ONE, it uses the default settings.

As we use YCSB as the workload for our experiments, you can have a look
at the driver settings in the Cassandra CQL client class:
https://github.com/brianfrankcooper/YCSB/blob/master/cassandra/src/main/java/com/yahoo/ycsb/db/CassandraCQLClient.java
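
For clarity, the effective driver setup boils down to roughly the
following (an illustrative reconstruction of "defaults plus consistency
ONE", not the actual YCSB code; the contact point and keyspace are
placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

public class DefaultsPlusConsistencyOne {
    public static void main(String[] args) {
        // Everything left at driver defaults except the consistency level,
        // which is set to ONE for all requests.
        QueryOptions queryOptions = new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.ONE);

        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")             // placeholder node IP
                .withQueryOptions(queryOptions)
                .build();
             Session session = cluster.connect("ycsb")) {  // placeholder keyspace
            session.execute("SELECT release_version FROM system.local");
        }
    }
}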
Post by Alexander Dejanovski
Do you have some debug logs on the client side that could help?
On the client side, the logs show no exceptions or other suspicious messages.

I also turned on tracing but didn't find any suspicious messages
(though I did not spend much time on it and I am no expert in the
Cassandra driver).
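
For reference, this is roughly the kind of client-side instrumentation
I mean: the driver's QueryLogger for slow requests plus per-statement
tracing (a minimal sketch; the 500 ms threshold and contact point are
arbitrary placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryLogger;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ClientSideTracing {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")   // placeholder node IP
                .build()) {

            // Log every request slower than 500 ms (needs DEBUG level for the
            // com.datastax.driver.core.QueryLogger.SLOW logger).
            QueryLogger queryLogger = QueryLogger.builder()
                    .withConstantThreshold(500)
                    .build();
            cluster.register(queryLogger);

            Session session = cluster.connect();

            // Cassandra-side tracing can additionally be enabled per statement.
            Statement stmt = new SimpleStatement(
                    "SELECT release_version FROM system.local").enableTracing();
            session.execute(stmt);
        }
    }
}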

If more detailed logs or traces would help to investigate the issue
further, let me know and I will rerun the experiments to collect them.

Many thanks again for your help.

Cheers,

Daniel
--
M.Sc. Daniel Seybold

Universität Ulm
Institut Organisation und Management
von Informationssystemen (OMI)
Albert-Einstein-Allee 43
89081 Ulm
Phone: +49 (0)731 50-28 799