Discussion:
How to perform Range Queries in Cassandra
Mike Carter
2014-06-24 08:09:08 UTC
Permalink
Hello!


I'm a beginner in C* and I'm quite struggling with it.

I’d like to measure the performance of some Cassandra range queries. The
idea is to execute multidimensional range queries on Cassandra. E.g. there
is a given table of 1 million rows with 10 columns, and I’d like to execute
queries like “select count(*) from testable where d=1 and v1<10 and v2>20
and v3<45 and v4>70 … allow filtering”. This kind of query is very slow in
C*, and as soon as the tables get bigger I get a read timeout, probably
caused by long scan operations.

In further tests I’d like to extend the dimensions to more than 200 and
the rows to 100 million, but at the moment I can’t even handle this small
table. Should I reorganize the data, or is it impossible to run such
highly multidimensional queries on Cassandra?





The setup:

Cassandra is installed on a single node with 2 TB disk space and 180GB Ram.

Connected to Test Cluster at localhost:9160.

[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]



Keyspace:

CREATE KEYSPACE test WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '1'
};





Table:

CREATE TABLE testc21 (
  key int,
  d int,
  v1 int,
  v10 int,
  v2 int,
  v3 int,
  v4 int,
  v5 int,
  v6 int,
  v7 int,
  v8 int,
  v9 int,
  PRIMARY KEY (key)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='ROWS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};



CREATE INDEX testc21_d_idx ON testc21 (d);



select * from testc21 limit 10;

key | d | v1 | v10 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9
--------+---+----+-----+----+----+-----+----+----+----+----+-----
302602 | 1 | 56 | 55 | 26 | 45 | 67 | 75 | 25 | 50 | 26 | 54
531141 | 1 | 90 | 77 | 86 | 42 | 76 | 91 | 47 | 31 | 77 | 27
693077 | 1 | 67 | 71 | 14 | 59 | 100 | 90 | 11 | 15 | 6 | 19
4317 | 1 | 70 | 77 | 44 | 77 | 41 | 68 | 33 | 0 | 99 | 14
927961 | 1 | 15 | 97 | 95 | 80 | 35 | 36 | 45 | 8 | 11 | 100
313395 | 1 | 68 | 62 | 56 | 85 | 14 | 96 | 43 | 6 | 32 | 7
368168 | 1 | 3 | 63 | 55 | 32 | 18 | 95 | 67 | 78 | 83 | 52
671830 | 1 | 14 | 29 | 28 | 17 | 42 | 42 | 4 | 6 | 61 | 93
62693 | 1 | 26 | 48 | 15 | 22 | 73 | 94 | 86 | 4 | 66 | 63
488360 | 1 | 8 | 57 | 86 | 31 | 51 | 9 | 40 | 52 | 91 | 45

Mike
Jeremy Jongsma
2014-06-24 13:21:32 UTC
Permalink
You'd be better off using external indexing (ElasticSearch or Solr);
Cassandra isn't really designed for this sort of querying.
Post by Mike Carter
[snip]
Mike Carter
2014-07-04 14:14:37 UTC
Permalink
Does that mean Cassandra is not useful for count queries over more than
one column?
Post by Jeremy Jongsma
[snip]
Jens Rantil
2014-07-04 20:07:32 UTC
Permalink
Hi Mike,


To get subsecond performance on your queries in _any_ database, you need proper indexing. Like Jeremy said, Solr will do this.




If you'd like to try to solve this using Cassandra, you need to learn the difference between the partition key and clustering columns in your primary key, and understand that you need a clustering column to do any kind of range query.




Also, COUNTs in Cassandra are generally fairly slow.
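As a sketch of that modeling change (the table and column names below are illustrative, adapted from Mike's schema, not something from this thread): moving d into the partition key and making one value column a clustering column lets Cassandra serve the equality plus one range restriction from sorted data, without ALLOW FILTERING.

```sql
-- Hypothetical re-modeling: d is the partition key, v1 a clustering
-- column, and key breaks ties so rows stay unique.
CREATE TABLE testc21_by_d (
  key int,
  d int,
  v1 int,
  v2 int,
  v3 int,
  PRIMARY KEY ((d), v1, key)
);

-- Equality on the partition key plus a range on the first clustering
-- column is a supported, efficient query shape:
SELECT count(*) FROM testc21_by_d WHERE d = 1 AND v1 < 10;
```

Note that CQL only allows a range restriction on one clustering column per query (here v1); the remaining predicates (v2 < 45, v4 > 70, etc.) would still have to be filtered client-side or handled by external indexing, which is why Jeremy suggested Solr/ElasticSearch.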




Cheers,

Jens
—
Sent from Mailbox
Post by Mike Carter
[snip]
Rameez Thonnakkal
2014-07-06 06:04:26 UTC
Permalink
Won't the performance improve significantly if you increase the number of
nodes, even on commodity hardware?
Post by Jens Rantil
[snip]
Jens Rantil
2014-07-06 08:33:16 UTC
Permalink
Rameez,


If you partition your data correctly, speed will scale roughly proportionally with the number of nodes. But there's always an upper limit: a slow range query that executes on a single node (using a clustering key) will always be slow.
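For illustration (a hypothetical layout, not a schema from this thread): adding a synthetic bucket column to the partition key splits each d value across many partitions, so the data, and the scan work, spreads over the ring as nodes are added, at the cost of the client issuing one query per bucket and merging the counts.

```sql
-- Hypothetical bucketed layout: each (d, bucket) pair is its own
-- partition, distributed across the cluster.
CREATE TABLE testc21_bucketed (
  d int,
  bucket int,   -- chosen at write time, e.g. key % 16
  v1 int,
  key int,
  PRIMARY KEY ((d, bucket), v1, key)
);

-- The client fans out one query per bucket (0..15 here) in parallel
-- and sums the partial counts:
SELECT count(*) FROM testc21_bucketed
WHERE d = 1 AND bucket = 0 AND v1 < 10;
```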




Cheers,

Jens
—
Sent from Mailbox
Post by Rameez Thonnakkal
[snip]