Discussion:
Is there any limit on the number of partitions that a table can have
Javier Pareja
2018-03-07 11:06:08 UTC
Permalink
Hello all,

I have been trying to find an answer to the following but I have had no
luck so far:
Is there any limit to the number of partitions that a table can have?
Let's say a table has a partition key and no clustering key: is there a
recommended limit on the number of values that this partition key can have?
Is it recommended to have a clustering key to reduce this number by storing
several rows in each partition instead of one row per partition?

Regards,

F Javier Pareja
Rahul Singh
2018-03-07 11:12:32 UTC
Permalink
The range is 2 * 2^63 = 2^64 possible tokens.
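For concreteness, murmur3 tokens are signed 64-bit integers, so the token space can be counted directly (a minimal Python sketch of the arithmetic, nothing Cassandra-specific):

```python
# Murmur3 partitioner tokens are signed 64-bit integers.
MIN_TOKEN = -2**63          # smallest murmur3 token
MAX_TOKEN = 2**63 - 1       # largest murmur3 token

# Number of distinct tokens: 2 * 2^63 == 2^64.
token_count = MAX_TOKEN - MIN_TOKEN + 1
assert token_count == 2 * 2**63 == 2**64
print(token_count)  # 18446744073709551616
```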

--
Rahul Singh
***@anant.us

Anant Corporation
Javier Pareja
2018-03-07 11:26:58 UTC
Permalink
Thank you Rahul, but is it good practice to use a large range here? Or
would it be better to create partitions with more than one row (by using a
clustering key)? From a data-query point of view, I will be accessing the
rows by a UID, one at a time.

F Javier Pareja
Tom van der Woerdt
2018-03-07 11:20:14 UTC
Permalink
Hi Javier,

When our users ask this question, I tend to answer "keep it above a
billion". More partitions is better.

I'm not aware of any actual limits on partition count. Practically it's
almost always limited by the disk space in a server.
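That disk-space bound can be made concrete with a back-of-the-envelope check (the disk size and per-partition footprint below are invented assumptions, not measured numbers):

```python
# Rough capacity check: how many tiny single-row partitions fit on one
# server's data disk? All constants are assumptions for illustration.
disk_bytes = 4 * 10**12          # assume a 4 TB data disk
bytes_per_partition = 200        # assume ~200 B on disk per tiny partition

max_partitions = disk_bytes // bytes_per_partition
print(max_partitions)  # 20000000000, i.e. ~20 billion partitions
```

So even with very small partitions, a single node runs out of disk long before any architectural partition-count limit matters.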

Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Jeff Jirsa
2018-03-07 14:36:03 UTC
Permalink
There is no limit

The token range of murmur3 is 2^64, but Cassandra properly handles token overlaps (we use a key that’s effectively a tuple of the token/hash and the underlying key itself), so having more than 2^64 partitions won’t hurt anything in theory

That said, having that many partitions would be an incredibly huge data set, and unless modeled properly, would be very likely to be unwieldy in practice.

Tables without clustering keys are often deceptively expensive to compact, as a lot of work (relative to the other cell boundaries) happens on partition boundaries.
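The "token plus key" point can be sketched with a toy partitioner. The hash below is an intentionally weak stand-in (not murmur3, purely for illustration) so that two keys collide on the same token; ordering by the (token, key) tuple still keeps the partitions distinct:

```python
def toy_token(key: str) -> int:
    # Deliberately weak stand-in for murmur3: hash only the key length,
    # so distinct keys easily collide on the same token.
    return len(key) % 8

# Two different partition keys that collide on the same token.
keys = ["sensor-a", "sensor-b"]
tokens = [toy_token(k) for k in keys]
assert tokens[0] == tokens[1]  # token collision

# The effective key is the (token, partition_key) tuple, so the two
# partitions remain distinct and totally ordered despite the collision.
effective = sorted((toy_token(k), k) for k in keys)
assert len(set(effective)) == 2
print(effective)  # [(0, 'sensor-a'), (0, 'sensor-b')]
```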
--
Jeff Jirsa
---------------------------------------------------------------------
To unsubscribe, e-mail: user-***@cassandra.apache.org
For additional commands, e-mail: user-***@cassandra.apache.org
Carlos Rolo
2018-03-07 15:13:49 UTC
Permalink
Hi Jeff,

Could you expand on "Tables without clustering keys are often deceptively
expensive to compact, as a lot of work (relative to the other cell
boundaries) happens on partition boundaries"? This is something I didn't
know, and I'd be very interested to learn more about it!

--
Carlos Rolo
Jeff Jirsa
2018-03-07 17:20:35 UTC
Permalink
We do a lot "by partition". We build column indexes by partition. We update
the partition index on each partition. We invalidate key cache by
partition. They're not super expensive, but they take time, and tables with
tiny partitions can actually be slower to compact.

There's no magic cutoff where it does/doesn't make sense; my comment is
mostly a warning that the edges of the "normal" use cases tend to be less
optimized than the common case. A table with a hundred billion records,
where the key is numeric and the value is a single byte (say you're keeping
track of whether or not each of 100B sensors has ever detected some magic
event), will be close to the worst-case example of this behavior.
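A back-of-the-envelope cost model makes the warning concrete. The per-cell and per-boundary costs below are made-up constants, not measured Cassandra numbers; the point is only that the per-partition work (column index, partition index update, key-cache invalidation) is paid once per partition boundary:

```python
def compaction_work(rows: int, rows_per_partition: int,
                    cell_cost: float = 1.0,
                    boundary_cost: float = 5.0) -> float:
    """Toy model: per-cell work plus fixed work at each partition boundary.

    The two cost constants are invented purely for illustration.
    """
    partitions = rows // rows_per_partition
    return rows * cell_cost + partitions * boundary_cost

rows = 100_000_000_000  # the 100B-record example above

one_row_each = compaction_work(rows, rows_per_partition=1)
clustered = compaction_work(rows, rows_per_partition=1000)

# One row per partition pays the boundary cost 1000x more often,
# so it models out as markedly more expensive to compact.
assert one_row_each > clustered
```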
Javier Pareja
2018-03-07 17:48:42 UTC
Permalink
Thank you for your time Jeff, very helpful. I couldn't find anything out
there about the subject, and I suspected that this could be the case.

Regarding the clustering key in this case:
Back in the RDBMS world, you would always assign a sequential (or as
sequential as possible) clustering key to a table to minimize fragmentation
and increase insertion speed. Does the same apply to the clustering key in
the Cassandra world? For example, is it a good idea to assign a UUID to a
clustering key, or would a timestamp be a better choice? I am thinking that
partitions need to keep some sort of binary index for the clustering keys,
and for relatively large partitions that index can be expensive to maintain.

F Javier Pareja
Carlos Rolo
2018-03-07 20:07:20 UTC
Permalink
Great explanation, thanks Jeff!
Javier Pareja
2018-03-07 16:41:38 UTC
Permalink
Thank you Jeff,

So, if I understood your email correctly, there is no hard limit, but I
should be using a clustering key for performance reasons.
I am expecting to store 10B rows per year in this table, and each row will
have a user-defined type with an approximate size of 1500 bytes.
Access to the data in this table will be random, as it stores the "raw"
data. There will be other tables with processed data, organized by time in
a clustering column, for the system to access in sequence.
Each row represents an event and has a UUID, which I am planning to use as
the partition key. Should I find another field for the partition key and
use the UUID for the clustering key instead?
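One common pattern for that trade-off is to bucket events into partitions by a coarse time window and keep the event UUID as the clustering key. This is a sketch under stated assumptions, not advice from the thread: it only works if the event's timestamp is known at read time (or derivable, e.g. from a timeuuid), so a point lookup can compute the bucket first. The schema in the comment is hypothetical:

```python
# Time-bucketed partitioning sketch: partition key = day bucket,
# clustering key = event UUID, so each partition holds many rows but a
# single event is still addressable. Hypothetical schema:
#   CREATE TABLE events (
#       bucket text, id uuid, payload blob,
#       PRIMARY KEY ((bucket), id));
import uuid
from datetime import datetime, timezone

def bucket_for(ts: datetime) -> str:
    # One partition per day; a coarser or finer window trades partition
    # size against partition count.
    return ts.strftime("%Y-%m-%d")

ts = datetime(2018, 3, 7, 14, 36, tzinfo=timezone.utc)
event_id = uuid.uuid4()  # the event's UUID, used as the clustering key
full_key = (bucket_for(ts), event_id)
assert full_key[0] == "2018-03-07"
```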


F Javier Pareja