Discussion:
Limit on the size of a list
Robert Wille
2013-05-12 15:13:32 UTC
I designed a data model for my data that uses a list of UUIDs in a
column. When I designed it, my expectation was that most of the lists
would have fewer than a hundred elements, with a few having several
thousand. I have since discovered in my data a list with nearly 400,000
items in it. When I try to retrieve it, I get the following exception:

java.lang.IllegalArgumentException: Illegal Capacity: -14594
        at java.util.ArrayList.<init>(ArrayList.java:110)
        at org.apache.cassandra.cql.jdbc.ListMaker.compose(ListMaker.java:54)
        at org.apache.cassandra.cql.jdbc.TypedColumn.<init>(TypedColumn.java:68)
        at org.apache.cassandra.cql.jdbc.CassandraResultSet.createColumn(CassandraResultSet.java:1086)
        at org.apache.cassandra.cql.jdbc.CassandraResultSet.populateColumns(CassandraResultSet.java:161)
        at org.apache.cassandra.cql.jdbc.CassandraResultSet.<init>(CassandraResultSet.java:134)
        at org.apache.cassandra.cql.jdbc.CassandraStatement.doExecute(CassandraStatement.java:166)
        at org.apache.cassandra.cql.jdbc.CassandraStatement.executeQuery(CassandraStatement.java:226)


I get this with Cassandra 1.2.4 and the latest snapshot of the JDBC
driver. Admittedly, several hundred thousand is quite a lot of items, but
it's odd that I'm getting some kind of wraparound, since 400,000 is a long
way from 2 billion.

What are the physical and practical limits on the size of a list? Is it
possible to retrieve a range of items from a list?

Thanks in advance

Robert
Edward Capriolo
2013-05-13 01:26:26 UTC
2 billion is the theoretical maximum number of columns under a row. It is
NOT the maximum size of a CQL collection. The design of CQL collections
currently requires retrieving the entire collection on every read.
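
A minimal sketch of what "the entire collection on read" means from JDBC
(hypothetical keyspace and table names; the URL form, and getObject()
returning the composed List, are assumptions based on the
ListMaker.compose frame in the trace above):

    import java.sql.*;
    import java.util.List;
    import java.util.UUID;

    public class WholeListRead {
        public static void main(String[] args) throws Exception {
            // Assumed URL form for the cassandra-jdbc driver.
            Connection conn = DriverManager.getConnection(
                    "jdbc:cassandra://localhost:9160/demo");
            Statement stmt = conn.createStatement();
            // CQL has no syntax to slice inside a collection, so every
            // element of the list is materialized by this single call.
            ResultSet rs = stmt.executeQuery(
                    "SELECT ids FROM docs WHERE doc_id = 1");
            if (rs.next()) {
                List<UUID> ids = (List<UUID>) rs.getObject("ids");
                System.out.println(ids.size());
            }
        }
    }
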
Post by Robert Wille
I designed a data model for my data that uses a list of UUIDs in a
column. [...]
What are the physical and practical limits on the size of a list? Is it
possible to retrieve a range of items from a list?
Theo Hultberg
2013-05-13 06:51:48 UTC
In the CQL3 binary protocol, collection sizes are unsigned shorts, so the
maximum number of elements in a LIST<...> is 65,535. There's no check,
afaik, that stops you from creating lists bigger than that, but the
protocol can't return them in full (you only get the first N mod 65,536
items).
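
If that's what's happening here, the arithmetic lines up with the
exception. A quick sketch (my reading of the symptom, not the driver's
actual code; the exact element count is a guess that reproduces the
reported error value):

    public class SizeWrap {
        public static void main(String[] args) {
            // "Nearly 400,000" items; 378,622 is one count that yields the
            // exact value in the trace (hypothetical -- the real count
            // wasn't given in the thread):
            int elements = 378_622;
            int low16 = elements & 0xFFFF;   // 50942: only 16 bits survive
            short asSigned = (short) low16;  // -14594: read back as signed
            System.out.println(low16 + " -> " + asSigned);
        }
    }
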

On the other hand, the JDBC driver talks Thrift rather than the binary
protocol, doesn't it? In that case there may be other limits.

T#
Post by Edward Capriolo
2 billion is the theoretical maximum number of columns under a row. It is
NOT the maximum size of a CQL collection. The design of CQL collections
currently requires retrieving the entire collection on every read.
[...]
Edward Capriolo
2013-05-13 13:51:45 UTC
Collections that big are likely not what you want. Many people use
Cassandra because they want low-latency reads (<10 ms) on smallish row
keys or key slices. Attempting to get 10K+ columns in one go generally
does not work well. First, there are network issues: even 100K columns of
5 bytes each requires large buffers, and Thrift has a maximum frame size.
Thrift aside, it is still a lot of data requiring large buffers. The
biggest problem, however, is that you cannot sub-select from a collection
the way you can slice columns from a row key with Thrift.

The lesson being learned the hard way is that collections should not be
very large. That is fairly tricky to guarantee with a blind-write system.
You actually need to do a worst-case assessment of how big your largest
entry could be, and then choose an approach based on that. This is
counterintuitive, because many people design around the normal scenario
and leave the worst case to chance. Then you end up in a spot like this,
where one row is unreadable or impractical to manage.
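
For what it's worth, the usual workaround is to model the "list" as
clustering columns in a wide row, which can be sliced server-side, rather
than as a LIST<uuid>, which can't. A rough sketch, with hypothetical table
names and an assumed JDBC URL form:

    // CREATE TABLE doc_items (
    //     doc_id  uuid,
    //     item_id uuid,
    //     PRIMARY KEY (doc_id, item_id)
    // );
    import java.sql.*;
    import java.util.UUID;

    public class PagedItems {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:cassandra://localhost:9160/demo");
            // Fetch the entries in bounded pages instead of one huge read:
            PreparedStatement page = conn.prepareStatement(
                    "SELECT item_id FROM doc_items"
                    + " WHERE doc_id = ? AND item_id > ? LIMIT 10000");
            UUID doc  = UUID.fromString("00000000-0000-0000-0000-000000000001");
            UUID last = new UUID(0L, 0L);  // sentinel, assumed to sort first
            boolean more = true;
            while (more) {
                page.setObject(1, doc);
                page.setObject(2, last);
                ResultSet rs = page.executeQuery();
                more = false;
                while (rs.next()) {
                    last = UUID.fromString(rs.getString("item_id"));
                    more = true;  // page had rows; ask for the next slice
                }
            }
        }
    }
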
Post by Theo Hultberg
In the CQL3 binary protocol, collection sizes are unsigned shorts, so the
maximum number of elements in a LIST<...> is 65,535. [...]
On the other hand, the JDBC driver talks Thrift rather than the binary
protocol, doesn't it? In that case there may be other limits.
[...]
Robert Coli
2013-05-13 18:08:17 UTC
Post by Edward Capriolo
2 billion is the theoretical maximum number of columns under a row. It is
NOT the maximum size of a CQL collection. The design of CQL collections
currently requires retrieving the entire collection on every read.
Each column carries an overhead of 15 bytes [1]. Assuming a column key
of type "int" [2] and no column value, a 2-billion-column row is
approximately 35 gigabytes, before even counting the column index.
Anyone attempting to actually create and interact with a single row
containing 2 billion columns seems likely to quickly discover just how
meaningless the theoretical maximum is, so I'm not sure why we keep
citing this number.
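
For anyone who wants to check that arithmetic, under the same assumptions
([1]'s 15 bytes of per-column overhead, 4-byte int keys, empty values):

    public class RowSize {
        public static void main(String[] args) {
            long columns   = 2_000_000_000L;
            long perColumn = 15 + 4;               // overhead + int key
            long total     = columns * perColumn;  // 38,000,000,000 bytes
            System.out.println(total / (1L << 30) + " GiB");  // ~35
        }
    }
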

=Rob
[1] http://btoddb-cass-storage.blogspot.com/2011/07/column-overhead-and-sizing-every-column.html
[2] The minimum size to actually have 2Bn unique column keys.
Edward Capriolo
2013-05-13 19:43:38 UTC
Whenever I mention the limit in a talk, I say "2 billion columns" in a
faux ten-year-old voice :). Cassandra can have a 2-billion-column row, but
a 60 MB row in the row cache will make the JVM sh*t the bed (you should
not use the row cache anyway). As rcoli points out, at 35 GB I doubt you
can do anything with the row that will finish in less than the
rpc_timeout. In a single request I personally would not try for more than
10K columns.
Post by Robert Coli
Each column carries an overhead of 15 bytes [1]. Assuming a column key
of type "int" [2] and no column value, a 2-billion-column row is
approximately 35 gigabytes, before even counting the column index. [...]