Daniel Seybold
2018-11-09 10:48:32 UTC
Hi Apache Cassandra experts,
we are running a set of availability evaluations under
write/read/update workloads with Apache Cassandra and are seeing some
unexpected results, i.e. 0 ops/s over periods of up to 100s.
To provide a clear picture, please find below the details of (1) the
setup and (2) the evaluation workflow.
*1. Setup:*
Cassandra version: 3.11.2
Cluster size: 5 nodes
Replication Factor: 5
Each node runs in the same private OpenStack-based cloud, within the
same availability zone, and uses the private network.
Each node runs Ubuntu 16.04 Server and has 2 cores, 4GB RAM and
50GB disk.
Workload:
Yahoo Cloud Serving Benchmark 0.12
W1: 100% write
W2: 100% read
W3: 100% update
*2. Evaluation Workflow:*
1. allocate 5 VMs & deploy DBMS cluster
2. start a YCSB workload (only one of W1-3) which runs up to 30 minutes
3. wait for 200s
4. select a random node in the cluster and delete its VM without
stopping Cassandra first
5. analyze throughput time series over the evaluation
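
As an illustration of step 5, here is a minimal sketch of how periods without any operations could be extracted from a client-side throughput time series. It is not the script used for the attached plots: the function name, the (elapsed seconds, ops/s) sample format and the sample values are all made up for this example.

```python
from typing import List, Tuple


def zero_throughput_windows(
    samples: List[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs of periods in which throughput was 0 ops/s.

    `samples` is a list of (elapsed_seconds, ops_per_second) pairs sorted by
    time (hypothetical format). A window starts at the first zero sample and
    ends at the next non-zero sample, or at the last sample if throughput
    never recovers.
    """
    windows = []
    start = None
    for t, ops in samples:
        if ops == 0 and start is None:
            start = t                      # throughput just dropped to zero
        elif ops > 0 and start is not None:
            windows.append((start, t))     # throughput recovered
            start = None
    if start is not None:                  # still at zero when the run ended
        windows.append((start, samples[-1][0]))
    return windows


# Made-up sample values, for illustration only (VM deleted after 200s).
samples = [
    (190, 5200.0), (200, 5100.0), (210, 0.0), (220, 0.0),
    (230, 0.0), (240, 4300.0), (250, 4400.0),
]

for start, end in zero_throughput_windows(samples):
    print(f"no operations from {start}s to {end}s ({end - start}s)")
```

The resolution of the detected windows is limited by the sampling interval of the time series, so short stalls between two samples are not visible.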
*3. (Unexpected) Results:*
We expected to see a (slight) drop in the throughput as soon as the VM
was deleted.
But the throughput results show that there are periods of ~10s -
150s (not deterministic) where no operations are executed (all metrics
are collected on the client side).
Yet, there are no timeout exceptions on the client side, and the logs on
the cluster side do not show anything that explains this behaviour.
I attached a series of plots which show the throughput and the downtimes
over the evaluation runs.
Do you have any explanations for this behaviour or recommendations on how
to reduce the potential "downtime"?
Thanks in advance for any help and recommendations,
Cheers,
Daniel
--
M.Sc. Daniel Seybold
Universität Ulm
Institut Organisation und Management
von Informationssystemen (OMI)
Albert-Einstein-Allee 43
89081 Ulm
Phone: +49 (0)731 50-28 799