Discussion:
Is there any limit on the number of partitions that a table can have
Javier Pareja
2018-03-07 11:06:08 UTC
Permalink
Hello all,

I have been trying to find an answer to the following but I have had no
luck so far:
Is there any limit to the number of partitions that a table can have?
Let's say a table has a partition key and no clustering key: is there a
recommended limit on the number of values that this partition key can have?
Is it recommended to have a clustering key to reduce this number by storing
several rows in each partition instead of one row per partition?

Regards,

F Javier Pareja
Rahul Singh
2018-03-07 11:12:32 UTC
Permalink
The range is 2 * 2^63 = 2^64 possible tokens.
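For concreteness, murmur3 tokens are signed 64-bit integers, so the token space can be counted directly (a minimal Python sketch of the arithmetic, nothing Cassandra-specific):

```python
# Murmur3 partitioner tokens are signed 64-bit integers.
MIN_TOKEN = -2**63          # smallest murmur3 token
MAX_TOKEN = 2**63 - 1       # largest murmur3 token

# Number of distinct tokens: 2 * 2^63 == 2^64.
token_count = MAX_TOKEN - MIN_TOKEN + 1
assert token_count == 2 * 2**63 == 2**64
print(token_count)  # 18446744073709551616
```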

--
Rahul Singh
***@anant.us

Anant Corporation
Javier Pareja
2018-03-07 11:26:58 UTC
Permalink
Thank you Rahul, but is it good practice to use a large range here? Or
would it be better to create partitions with more than one row (by using a
clustering key)? From a data-query point of view, I will be accessing the
rows by a UID, one at a time.

F Javier Pareja
Tom van der Woerdt
2018-03-07 11:20:14 UTC
Permalink
Hi Javier,

When our users ask this question, I tend to answer "keep it above a
billion". More partitions is better.

I'm not aware of any actual limits on partition count. Practically it's
almost always limited by the disk space in a server.
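That disk-space bound can be made concrete with a back-of-the-envelope check (the disk size and per-partition footprint below are invented assumptions, not measured numbers):

```python
# Rough capacity check: how many tiny single-row partitions fit on one
# server's data disk? All constants are assumptions for illustration.
disk_bytes = 4 * 10**12          # assume a 4 TB data disk
bytes_per_partition = 200        # assume ~200 B on disk per tiny partition

max_partitions = disk_bytes // bytes_per_partition
print(max_partitions)  # 20000000000, i.e. ~20 billion partitions
```

So even with very small partitions, a single node runs out of disk long before any architectural partition-count limit matters.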

Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Jeff Jirsa
2018-03-07 14:36:03 UTC
Permalink
There is no limit

The token range of murmur3 is 2^64, but Cassandra properly handles token overlaps (we use a key that’s effectively a tuple of the token/hash and the underlying key itself), so having more than 2^64 partitions won’t hurt anything in theory

That said, having that many partitions would be an incredibly huge data set, and unless modeled properly, would be very likely to be unwieldy in practice.

Tables without clustering keys are often deceptively expensive to compact, as a lot of work (relative to the other cell boundaries) happens on partition boundaries.
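The "token plus key" point can be sketched with a toy partitioner. The hash below is an intentionally weak stand-in (not murmur3, purely for illustration) so that two keys collide on the same token; ordering by the (token, key) tuple still keeps the partitions distinct:

```python
def toy_token(key: str) -> int:
    # Deliberately weak stand-in for murmur3: hash only the key length,
    # so distinct keys easily collide on the same token.
    return len(key) % 8

# Two different partition keys that collide on the same token.
keys = ["sensor-a", "sensor-b"]
tokens = [toy_token(k) for k in keys]
assert tokens[0] == tokens[1]  # token collision

# The effective key is the (token, partition_key) tuple, so the two
# partitions remain distinct and totally ordered despite the collision.
effective = sorted((toy_token(k), k) for k in keys)
assert len(set(effective)) == 2
print(effective)  # [(0, 'sensor-a'), (0, 'sensor-b')]
```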
--
Jeff Jirsa
---------------------------------------------------------------------
To unsubscribe, e-mail: user-***@cassandra.apache.org
For additional commands, e-mail: user-***@cassandra.apache.org
Carlos Rolo
2018-03-07 15:13:49 UTC
Permalink
Hi Jeff,

Could you expand on "Tables without clustering keys are often deceptively
expensive to compact, as a lot of work (relative to the other cell
boundaries) happens on partition boundaries"? This is something I didn't
know, and I'd be very interested to learn more about it!

--
Carlos Rolo
Jeff Jirsa
2018-03-07 17:20:35 UTC
Permalink
We do a lot "by partition". We build column indexes by partition. We update
the partition index on each partition. We invalidate key cache by
partition. They're not super expensive, but they take time, and tables with
tiny partitions can actually be slower to compact.

There's no magic cutoff where it does/doesn't make sense; my comment is
mostly a warning that the edges of the "normal" use cases tend to be less
optimized than the common case. A table with a hundred billion records,
where the key is numeric and the value is a single byte (say you're keeping
track of whether or not each of 100B sensors has ever detected some magic
event), will be close to the worst-case example of this behavior.
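A back-of-the-envelope cost model makes the warning concrete. The per-cell and per-boundary costs below are made-up constants, not measured Cassandra numbers; the point is only that the per-partition work (column index, partition index update, key-cache invalidation) is paid once per partition boundary:

```python
def compaction_work(rows: int, rows_per_partition: int,
                    cell_cost: float = 1.0,
                    boundary_cost: float = 5.0) -> float:
    """Toy model: per-cell work plus fixed work at each partition boundary.

    The two cost constants are invented purely for illustration.
    """
    partitions = rows // rows_per_partition
    return rows * cell_cost + partitions * boundary_cost

rows = 100_000_000_000  # the 100B-record example above

one_row_each = compaction_work(rows, rows_per_partition=1)
clustered = compaction_work(rows, rows_per_partition=1000)

# One row per partition pays the boundary cost 1000x more often,
# so it models out as markedly more expensive to compact.
assert one_row_each > clustered
```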
Javier Pareja
2018-03-07 17:48:42 UTC
Permalink
Thank you for your time Jeff, very helpful. I couldn't find anything out
there about the subject, and I suspected that this could be the case.

Regarding the clustering key in this case:
Back in the RDBMS world, you would always assign a sequential (or as
sequential as possible) clustering key to a table to minimize fragmentation
and increase insertion speed. Does the same apply to the clustering key in
the Cassandra world? For example, is it a good idea to assign a UUID to a
clustering key, or would a timestamp be a better choice? I am thinking that
partitions need to keep some sort of binary index for the clustering keys,
and for relatively large partitions that index can be expensive to maintain.

F Javier Pareja
Carlos Rolo
2018-03-07 20:07:20 UTC
Permalink
Great explanation, thanks Jeff!
Javier Pareja
2018-03-07 16:41:38 UTC
Permalink
Thank you Jeff,

So, if I understood your email correctly, there is no hard limit, but I
should be using a clustering key for performance reasons.
I am expecting to store 10B rows per year in this table, and each row will
have a user-defined type with an approximate size of 1500 bytes.
Access to the data in this table will be random, as it stores the "raw"
data. There will be other tables with processed data, organized by time in
a clustering column, for the system to access in sequence.
Each row represents an event and has a UUID, which I am planning to use as
the partition key. Should I find another field for the partition key and
use the UUID for the clustering key instead?
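One common pattern for that trade-off is to bucket events into partitions by a coarse time window and keep the event UUID as the clustering key. This is a sketch under stated assumptions, not advice from the thread: it only works if the event's timestamp is known at read time (or derivable, e.g. from a timeuuid), so a point lookup can compute the bucket first. The schema in the comment is hypothetical:

```python
# Time-bucketed partitioning sketch: partition key = day bucket,
# clustering key = event UUID, so each partition holds many rows but a
# single event is still addressable. Hypothetical schema:
#   CREATE TABLE events (
#       bucket text, id uuid, payload blob,
#       PRIMARY KEY ((bucket), id));
import uuid
from datetime import datetime, timezone

def bucket_for(ts: datetime) -> str:
    # One partition per day; a coarser or finer window trades partition
    # size against partition count.
    return ts.strftime("%Y-%m-%d")

ts = datetime(2018, 3, 7, 14, 36, tzinfo=timezone.utc)
event_id = uuid.uuid4()  # the event's UUID, used as the clustering key
full_key = (bucket_for(ts), event_id)
assert full_key[0] == "2018-03-07"
```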


F Javier Pareja