system_auth keyspace replication factor

Discussion:

Vitali Dyachuk

2018-11-23 16:38:31 UTC

Hi,
We recently ran into a problem after we added 60 nodes in one region to the
cluster and set RF=60 for the system_auth keyspace, following this documentation:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
Since then, login latencies in the cluster have been about 5x higher than
before we changed the RF of the system_auth keyspace.
We have a Cassandra runner written in C# running against the cluster.
When analyzing its logs we noticed that "Rebuilding token map" takes
most of the time, ~20s.
When we changed the RF to 3, the issue was resolved.
We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
version="3.2.1".
I found a ticket that seems related to my problem,
https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
tickets say that the token map rebuild time issue was fixed in earlier
versions of the driver.
So my question is: what is the recommended RF for the system_auth
keyspace?

Regards,
Vitali Djatsuk.

Jonathan Haddad

2018-11-23 17:30:21 UTC

Any chance you're logging in with the Cassandra user? It uses quorum reads.
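
As a rough illustration of why a QUORUM read gets more expensive as RF grows, here is a minimal sketch of the usual quorum arithmetic, floor(sum of RFs / 2) + 1. The datacenter names are hypothetical, and the "before" layout assumes RF=3 in each of the four DCs, which the thread does not state explicitly.

```python
def quorum_size(rf_per_dc):
    """Replicas that must answer a QUORUM read: floor(sum of RFs / 2) + 1."""
    return sum(rf_per_dc.values()) // 2 + 1


# Hypothetical DC names; RF values are assumptions based on the numbers in this thread.
before = {"dc1": 3, "dc2": 3, "dc3": 3, "dc4": 3}
after = {"dc1": 60, "dc2": 3, "dc3": 3, "dc4": 3}

print(quorum_size(before))  # 7
print(quorum_size(after))   # 35
```

Under these assumptions, a login that reads at QUORUM would wait on 35 replicas instead of 7. Reads at LOCAL_ONE are not affected by this arithmetic.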

--

Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Vitali Dyachuk

2018-11-23 18:18:01 UTC

No, it's not the cassandra user, and as I understand it all other users log
in at LOCAL_ONE.

Jeff Jirsa

2018-11-23 18:31:55 UTC

I suspect some of the intermediate queries (determining role, etc.) happen at quorum in 2.2+, but I don't have time to go read the code and prove it.

In any case, RF > 10 per DC is probably excessive.

You'll also want to crank up the auth cache validity times (roles_validity_in_ms, permissions_validity_in_ms) so it uses cached info longer.
--
Jeff Jirsa

Vitali Dyachuk

2018-11-23 22:27:50 UTC

Attaching the runner log snippet, where we can see that "Rebuilding token
map" took most of the time.
getAllRoles is using quorum, but I don't know whether it is used during login:
https://github.com/apache/cassandra/blob/cc12665bb7645d17ba70edcf952ee6a1ea63127b/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L260

Vitali Djatsuk,

Oleksandr Shulgin

2018-11-26 09:44:44 UTC

Post by Vitali Dyachuk
We have recently met a problem when we added 60 nodes in 1 region to the
cluster
and set an RF=60 for the system_auth ks, following this documentation
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html

Sadly, this recommendation is out of date / incorrect. For `system_auth`
we are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no
issues.

Is there a chance to correct the documentation @datastax?
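
The `RF=min(num_dc_nodes, 5)` rule of thumb above could be sketched as follows. This is only an illustration: the helper names, the datacenter names, and the node counts are hypothetical, and the generated statement assumes `NetworkTopologyStrategy`.

```python
def system_auth_rf(nodes_per_dc, cap=5):
    """Per-DC replication factor: min(number of nodes in the DC, cap)."""
    return {dc: min(nodes, cap) for dc, nodes in nodes_per_dc.items()}


def alter_keyspace_cql(rf_per_dc):
    """Build an ALTER KEYSPACE statement for system_auth from per-DC RFs."""
    opts = ", ".join(f"'{dc}': {rf}" for dc, rf in sorted(rf_per_dc.items()))
    return (
        "ALTER KEYSPACE system_auth WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {opts}}};"
    )


# Hypothetical cluster layout: one large DC and one small DC.
rf = system_auth_rf({"dc1": 60, "dc2": 3})
print(alter_keyspace_cql(rf))
```

After changing the replication factor, the keyspace typically needs a repair (for example `nodetool repair system_auth`) so that the new replicas receive the data.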

Regards,
--
Alex

Sam Tunnicliffe

2018-11-26 10:03:16 UTC

Post by Jeff Jirsa
I suspect some of the intermediate queries (determining role, etc) happen at quorum in 2.2+, but I don't have time to go read the code and prove it.

This isn't true. Aside from when using the default superuser, only CRM::getAllRoles reads at QUORUM (because the resultset would include the default superuser if present). This is only called during execution of a LIST ROLES statement and isn't on the login path.

From the driver log you can see that the actual authentication exchange happens quickly, so I'd say that the problem described in CSHARP-436 is a more likely candidate.

Post by Oleksandr Shulgin
Sadly, this recommendation is out of date / incorrect. For `system_auth` we are mostly using a formula like `RF=min(num_dc_nodes, 5)` and see no issues.

+1 to that, RF=N is way over the top.

Thanks,
Sam
