Discussion:
system_auth keyspace replication factor
Vitali Dyachuk
2018-11-23 16:38:31 UTC
Hi,
We recently ran into a problem after adding 60 nodes in one region to the
cluster and setting RF=60 for the system_auth keyspace, following this
documentation:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
Since then, login latencies in the cluster have been about 5x higher than
before we changed the system_auth RF.
We have a Cassandra runner written in C# running against the cluster; when
analyzing its logs we noticed that "Rebuilding token map" takes most of the
time (~20 s).
When we changed the RF back to 3, the issue was resolved.
We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
version="3.2.1".
I've found a ticket that seems related to my problem,
https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
tickets say that the token map rebuild time issue was fixed in earlier
versions of the driver.
So my question is: what is the recommended RF for the system_auth keyspace?

Regards,
Vitali Djatsuk.
Jonathan Haddad
2018-11-23 17:30:21 UTC
Any chance you’re logging in with the Cassandra user? It uses quorum reads.
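To put numbers on why QUORUM reads get expensive as system_auth RF grows, here is a small sketch of the replica-count arithmetic. It uses the standard formulas (QUORUM = floor(sum of all per-DC RFs / 2) + 1; LOCAL_QUORUM uses only the local DC's RF). The per-DC layout is an assumption: the thread only says RF=60 was set in the region that gained 60 nodes, so the DC names and the RF=3 in the other DCs are hypothetical.

```python
# Sketch: how many replicas must answer an auth read at a given consistency level.
# Assumption: the DC names and per-DC RFs below are hypothetical.

def quorum(rf_per_dc: dict[str, int]) -> int:
    """QUORUM spans all DCs: floor(sum of all RFs / 2) + 1."""
    return sum(rf_per_dc.values()) // 2 + 1


def local_quorum(local_rf: int) -> int:
    """LOCAL_QUORUM only counts replicas in the coordinator's DC."""
    return local_rf // 2 + 1


LOCAL_ONE = 1  # a single replica in the local DC is enough, whatever the RF

before = {"dc1": 3, "dc2": 3, "dc3": 3, "dc4": 3}
after = {"dc1": 60, "dc2": 3, "dc3": 3, "dc4": 3}  # hypothetical layout

for name, layout in (("before", before), ("after", after)):
    print(f"{name}: QUORUM needs {quorum(layout)} replicas, "
          f"LOCAL_ONE needs {LOCAL_ONE}")
```

With the hypothetical layout, a QUORUM read goes from needing 7 replicas to 35, while LOCAL_ONE stays at one replica.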
--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
Vitali Dyachuk
2018-11-23 18:18:01 UTC
No, it's not the cassandra user, and as I understand it, all other users
log in at LOCAL_ONE.
Post by Jonathan Haddad
Any chance you’re logging in with the Cassandra user? It uses quorum reads.
Jeff Jirsa
2018-11-23 18:31:55 UTC
I suspect some of the intermediate queries (determining role, etc.) happen at quorum in 2.2+, but I don’t have time to go read the code and prove it.

In any case, RF > 10 per DC is probably excessive.

You also want to crank up the validity times so it uses cached info longer.
--
Jeff Jirsa
Vitali Dyachuk
2018-11-23 22:27:50 UTC
Attaching the runner log snippet, where we can see that "Rebuilding token
map" took most of the time.
getAllRoles uses QUORUM, but I don't know whether it is used during login:
https://github.com/apache/cassandra/blob/cc12665bb7645d17ba70edcf952ee6a1ea63127b/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L260

Vitali Djatsuk,
Post by Jeff Jirsa
I suspect some of the intermediate queries (determining role, etc) happen
at quorum in 2.2+, but I don’t have time to go read the code and prove it.
Oleksandr Shulgin
2018-11-26 09:44:44 UTC
Post by Vitali Dyachuk
We have recently met a problem when we added 60 nodes in 1 region to the
cluster
and set an RF=60 for the system_auth ks, following this documentation
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
Sadly, this recommendation is out of date / incorrect. For `system_auth`
we are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no
issues.
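Alex's `RF=min(num_dc_nodes, 5)` rule can be turned into an `ALTER KEYSPACE` statement mechanically. The sketch below is only an illustration: the DC names and node counts are hypothetical, and the helper names are made up for this example. It emits a NetworkTopologyStrategy replication map with one entry per DC.

```python
# Sketch: derive a system_auth replication map from RF = min(nodes in DC, cap).
# Assumption: DC names and node counts are hypothetical; cap=5 follows the rule above.

def system_auth_rf(nodes_per_dc: dict[str, int], cap: int = 5) -> dict[str, int]:
    """Per-DC RF: never more than the DC's node count, never more than cap."""
    return {dc: min(n, cap) for dc, n in nodes_per_dc.items()}


def alter_keyspace_cql(rf: dict[str, int]) -> str:
    """Render an ALTER KEYSPACE statement using NetworkTopologyStrategy."""
    opts = ", ".join(f"'{dc}': {n}" for dc, n in sorted(rf.items()))
    return ("ALTER KEYSPACE system_auth WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', {opts}}};")


if __name__ == "__main__":
    cluster = {"dc1": 60, "dc2": 12, "dc3": 3, "dc4": 4}  # hypothetical sizes
    print(alter_keyspace_cql(system_auth_rf(cluster)))
```

As with any RF change, the new replicas only receive existing auth data after a repair (for example `nodetool repair system_auth` on each node).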

Is there any chance of getting the documentation corrected, @datastax?

Regards,
--
Alex
Sam Tunnicliffe
2018-11-26 10:03:16 UTC
Post by Jeff Jirsa
I suspect some of the intermediate queries (determining role, etc) happen at quorum in 2.2+, but I don’t have time to go read the code and prove it.
This isn’t true. Aside from when using the default superuser, only CRM::getAllRoles reads at QUORUM (because the resultset would include the default superuser if present). This is only called during execution of a LIST ROLES statement and isn’t on the login path.

From the driver log you can see that the actual authentication exchange happens quickly, so I’d say that the problem described in CSHARP-436 is a more likely candidate.
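For intuition on why a driver-side token map could get slower as RF grows, here is a deliberately simplified sketch. It is not the C# driver's implementation: it uses a SimpleStrategy-style ring walk (no DCs or racks), and the node and vnode counts are hypothetical. For every token it walks clockwise until it has collected RF distinct nodes, so the work per token grows at least linearly with RF.

```python
# Simplified sketch (NOT the driver's code): per-token replica computation by ring walk.
# Assumptions: SimpleStrategy-like placement, random tokens, hypothetical cluster size.
import random


def build_ring(num_nodes: int, vnodes: int, seed: int = 0) -> list[tuple[int, str]]:
    """Sorted (token, owner) pairs; each node owns `vnodes` random tokens."""
    rng = random.Random(seed)
    ring = [(rng.getrandbits(63), f"node{n}")
            for n in range(num_nodes) for _ in range(vnodes)]
    ring.sort()
    return ring


def replica_map(ring: list[tuple[int, str]], rf: int) -> tuple[dict[int, list[str]], int]:
    """Replicas for each token plus the total number of ring positions visited."""
    n = len(ring)
    target = min(rf, len({owner for _, owner in ring}))  # cannot exceed node count
    steps = 0
    result: dict[int, list[str]] = {}
    for i, (token, _) in enumerate(ring):
        replicas: list[str] = []
        seen: set[str] = set()
        j = i
        while len(replicas) < target:
            owner = ring[j % n][1]
            steps += 1
            if owner not in seen:
                seen.add(owner)
                replicas.append(owner)
            j += 1
        result[token] = replicas
    return result, steps


if __name__ == "__main__":
    ring = build_ring(num_nodes=60, vnodes=16)
    for rf in (3, 60):
        _, steps = replica_map(ring, rf)
        print(f"RF={rf}: {steps} ring positions visited")
```

With RF equal to the node count, every token has to walk far enough to see every node, which is the worst case for this kind of computation.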
Post by Oleksandr Shulgin
Sadly, this recommendation is out of date / incorrect. For `system_auth` we are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no issues.
+1 to that, RF=N is way over the top.

Thanks,
Sam