Discussion:
Need database to log and retrieve sensor data
Heiner Bunjes
2012-02-06 13:21:19 UTC
I need a database to log and retrieve sensor data.

Is Cassandra the right solution for this task and, if so, how should I
set it up and which access methods should I use?
If not, which other DB system might be a better fit?


The details are as follows:

######## <requirements version="4">

Glossary

- Node = A computer on which an instance of the database
is running

- Blip = one data record sent by a sensor

- Blip page = The sorted list of all blips for a specific sensor
and a specific time range.


The scale is as follows:

(01) 10^6 sensors deliver 1 blip every 100 seconds
-> Insert rate = 10 kiloblip/s
-> Insert rate ~ 315 gigablip/Year

(02) They have to be stored for ~3 years
-> Size of database = 1 terablip

(03) Each blip has about 200 bytes
-> Size of database = 200TB

(04) The system will start with just 10^4 sensors but will
soon increase up to the described volume.
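
The figures in (01) to (03) can be checked with a few lines of arithmetic. This is only a rough sketch, assuming a 365-day year, decimal (SI) prefixes, and 10^6 sensors (the count that the stated 10 kiloblip/s implies); the variable names are illustrative and not part of the requirements.

```python
# Back-of-the-envelope check of the scale figures (01)-(03).
# Assumptions (not from the original post): 365-day year, SI prefixes.
SENSORS = 10**6              # sensor count implied by 10 kiloblip/s
BLIP_INTERVAL_S = 100        # one blip per sensor every 100 seconds
RETENTION_YEARS = 3
BLIP_SIZE_BYTES = 200
SECONDS_PER_YEAR = 365 * 24 * 3600

inserts_per_second = SENSORS // BLIP_INTERVAL_S
blips_per_year = inserts_per_second * SECONDS_PER_YEAR
total_blips = blips_per_year * RETENTION_YEARS
total_bytes = total_blips * BLIP_SIZE_BYTES

print(f"insert rate : {inserts_per_second / 1e3:.0f} kiloblip/s")
print(f"per year    : {blips_per_year / 1e9:.0f} gigablip")
print(f"stored blips: {total_blips / 1e12:.2f} terablip")
print(f"raw size    : {total_bytes / 1e12:.0f} TB")
```

The raw size comes out slightly below the rounded 200TB and does not include replication or storage overhead, both of which would multiply it.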


The main operations on the data are:

(05) Add the new blips to the database
(written blips are never changed)!

(06) Return all blips for sensor X with a timestamp
between timestamp_a and timestamp_b!
In other words: Return a blip page.

(07) Return all the blips specified in (06) ordered
by timestamp!

(08) Delete all blips older than Y!


Furthermore, the following holds:

(09) Each added blip is clearly (without ambiguity) identifiable by
sensor_id+timestamp.

(10) 99.9% of the blips are inserted in
chronological order; the rest are not.

(11) The database system MUST be free and open source.

(12) The DB SHOULD be easy to administer.

(13) All data MUST still be writable and readable while fewer
than the configurable number N of nodes are down (unexpectedly).

(14) The mechanisms to distribute the data to the available
nodes SHOULD be handled by the database.
This means that the database SHOULD automatically
redistribute the data when nodes are added or removed.

(15) The project is mainly implemented in Erlang, so there must be
a stable Erlang interface for database access.

######## </requirements>


Many thanks in advance
Heiner
R. Verlangen
2012-02-06 13:58:17 UTC
Based on what I know about Cassandra, here is my opinion on each
requirement on your list:

1) 10k inserts per second should be no problem at all for Cassandra
2) Cassandra should scale to that
3) According to the Cassandra homepage, that amount of data should fit
(source: http://cassandra.apache.org/ )
4) Not Cassandra related

5) Inserts are very fast in Cassandra
6) You could create row keys in Cassandra that hold the values as columns
within a timespan (e.g. one row per second or per minute). Please note that
a row can hold at most 2 billion columns (source:
http://wiki.apache.org/cassandra/CassandraLimitations )
7) The most common row ordering in Cassandra is random. However, you could
create an index ColumnFamily (CF) whose columns are the row keys of your
actual data CF. Columns are sorted by default.
8) Cassandra provides a time-to-live (TTL) mechanism, which suits your
needs perfectly

9) The column key could be something like "SENSORID~TIMESTAMP", e.g.
"US123~1328539905"
10) Cassandra will take care of the column sorting
11) Cassandra is released under the Apache 2.0 license: so it's open source
12) OpsCenter from DataStax is a really nice tool with a GUI; enterprise
usage requires a subscription
13) The high-availability that Cassandra provides will meet your
requirements
14) Your contact-node will find out which nodes are responsible for your
write/read. Adding, removing or moving nodes is also possible.
15) I have no experience with that, but I'm pretty sure there's someone
around here who can help you.
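
To make points 6), 7), 9) and 10) more concrete, here is a rough sketch of the layout in plain Python. It is not Cassandra client code: `BlipStore` is a hypothetical in-memory stand-in for a column family, `BUCKET_SECONDS` (one row per sensor per day) is an assumed bucket size, and timestamps are assumed to be integer seconds. With daily buckets and one blip per 100 seconds a row holds 864 columns; even a single row per sensor over 3 years would hold roughly 946,000 columns, far below the 2 billion limit.

```python
import bisect
from collections import defaultdict

BUCKET_SECONDS = 24 * 3600  # assumed bucket size: one row per sensor per day


def row_key(sensor_id, timestamp):
    """Row key = sensor id plus the start of the time bucket."""
    bucket_start = timestamp - timestamp % BUCKET_SECONDS
    return f"{sensor_id}~{bucket_start}"


class BlipStore:
    """Hypothetical stand-in for a CF: row key -> columns sorted by timestamp."""

    def __init__(self):
        self.rows = defaultdict(list)  # row key -> sorted [(timestamp, payload)]

    def insert(self, sensor_id, timestamp, payload):
        # insort keeps the columns sorted even when blips arrive out of order
        bisect.insort(self.rows[row_key(sensor_id, timestamp)],
                      (timestamp, payload))

    def blip_page(self, sensor_id, ts_a, ts_b):
        """Return blips with ts_a <= timestamp <= ts_b, ordered by timestamp."""
        page = []
        bucket = ts_a - ts_a % BUCKET_SECONDS
        while bucket <= ts_b:
            columns = self.rows.get(f"{sensor_id}~{bucket}", [])
            lo = bisect.bisect_left(columns, (ts_a,))
            hi = bisect.bisect_left(columns, (ts_b + 1,))
            page.extend(columns[lo:hi])  # slice of an already sorted row
            bucket += BUCKET_SECONDS
        return page


store = BlipStore()
store.insert("US123", 1328539905, "blip-2")
store.insert("US123", 1328539705, "blip-1")  # late arrival, still sorted
print(store.blip_page("US123", 1328539700, 1328540000))
```

A blip page then touches only the rows whose buckets overlap the requested time range, and within each row the result is a contiguous slice of sorted columns.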

Good luck with finding the best database for your problem.