Discussion:
Storing files in blob into Cassandra
Damien Picard
2011-06-22 07:07:42 UTC
Hi,

I have to store some files (images, documents, etc.) for my users in a
webapp. I use Cassandra for all of my data, and I would like to know whether
it is a good idea to store these files as blobs in a Cassandra CF.
Are there any contraindications, or special things to know to achieve this?

Thank you
--
Damien Picard
Axeiya Services : http://axeiya.com/
gwt-ckeditor : http://code.google.com/p/gwt-ckeditor/
My book on GWT : http://axeiya.com/index.php/ouvrage-gwt.html
Sasha Dolgy
2011-06-22 07:22:34 UTC
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Storing-photos-images-docs-etc-td6078278.html

Of significance from that link (which was great until feeling lucky
was removed...):

Google of terms cassandra large files + feeling lucky
http://www.google.com/search?q=cassandra+large+files&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

Yields:
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage


--- store your images / documents / etc. somewhere and reference them
in Cassandra. That's the consensus that's been bandied about on this
list quite frequently. We employ a solution that uses Amazon S3 for
storage and Cassandra as the reference to the metadata and location
of the files. Works a treat.
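
A minimal sketch of the "store elsewhere, reference in Cassandra" pattern described above. This is not the poster's actual setup: `blob_store` and `metadata_cf` are plain in-memory dictionaries standing in for the external store (S3, a file system, ...) and a Cassandra column family, and `FileRef`, `save_file` and `load_file` are hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FileRef:
    """Metadata row kept in the database; the bytes live elsewhere."""
    location: str       # key/path of the bytes in the external store
    size: int           # size in bytes
    content_type: str   # e.g. "image/png"
    sha256: str         # checksum, to detect a missing or corrupted blob


# Stand-ins (assumptions): a real system would use an object store client
# and a Cassandra client here instead of dictionaries.
blob_store = {}
metadata_cf = {}


def save_file(file_id, data, content_type):
    """Write the bytes to the external store, then record the reference."""
    location = f"files/{file_id}"
    blob_store[location] = data
    ref = FileRef(location, len(data), content_type,
                  hashlib.sha256(data).hexdigest())
    metadata_cf[file_id] = ref
    return ref


def load_file(file_id):
    """Look up the reference first, then fetch the bytes it points to."""
    ref = metadata_cf[file_id]
    return blob_store[ref.location]
```

Writing the bytes before the metadata means a failed upload leaves at worst an orphaned blob rather than a reference that points nowhere.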
Sylvain Lebresne
2011-06-22 08:07:12 UTC
Let's be more precise in saying that this all depends on the
expected size of the documents. If you know that the documents
will be around a few hundred kilobytes on average and
no more than a few megabytes (say < 5MB, even though there is
no magic number), then storing them as blobs will work perfectly
fine (which is not to say storing them externally with metadata in
Cassandra won't, but using blobs can be simpler in some cases).

I've very successfully stored tons of images as blobs in Cassandra.
I just knew they couldn't get super big because the system wasn't
allowing it.

The point about size is that each time you fetch a document,
Cassandra has to load it (entirely) into memory to return it.

--
Sylvain
Damien Picard
2011-06-22 08:23:59 UTC
Post by Sasha Dolgy
store your images / documents / etc. somewhere and reference them
in Cassandra. That's the consensus that's been bandied about on this
list quite frequently
Thank you for your answers.

I think I should detail my configuration. On every server of my cluster, I
deploy:
- a Cassandra node
- a Tomcat instance
- the webapp, deployed on Tomcat
- Apache httpd, in front of Tomcat with mod_jakarta

In front of these, I use a round-robin DNS load balancer which balances
requests across every httpd.
Every Tomcat instance can access every Cassandra node, allowing them to deal
with every request.
Data is stored with RandomPartitioner; the replication factor is 2.

In my case, it would be very easy to store images in Cassandra because these
images would be accessible everywhere in my cluster. If I store images on the
file system, I have to replicate them manually (probably with a distributed
filesystem) to every server (quite complicated). This is why I prefer to
store files in Cassandra.

According to Sylvain, the main thing to know is the max size of a file.
Since this is a web application, I can set the max file size to 10 MB
(HTTP POST max size) without disappointing my users. Furthermore, most of
these files will not exceed 2 or 3 MB. In that case, do you advise me to
store files in Cassandra?

Thank you.
aaron morton
2011-06-22 09:59:45 UTC
Post by Damien Picard
- a Cassandra node
- a Tomcat instance
- the webapp, deployed on Tomcat
- Apache httpd, in front of Tomcat with mod_jakarta
You will have a bunch of services on the machine competing with each other for resources (cpu, memory and network IO). It's not an approach I would take.

You will also tightly couple the front end HTTP capacity to the DB capacity. e.g. consider what happens when a cassandra node is down for a while, what does this mean for your ability to accept http connections?

Requests from your web app may go to the local Cassandra node, but that's just the coordinator. They will be forwarded on to the replicas that contain the data.
Post by Damien Picard
Data are stored with RandomPartitionner, replication factor is 2.
RF 3 is the minimum RF you need to use for QUORUM to be less than the RF.
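
For reference, the arithmetic behind that remark, as a small sketch: Cassandra computes QUORUM as floor(RF / 2) + 1, so with RF 2 a quorum is both replicas and a single node being down blocks QUORUM operations for the data it holds. The function name `quorum` is just an illustrative helper, not a Cassandra API.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5):
    q = quorum(rf)
    # rf - q is how many replicas can be unavailable while QUORUM still succeeds
    print(f"RF={rf}: QUORUM={q}, replicas that may be down={rf - q}")
```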
Post by Damien Picard
In such case, do you advise me to store files in Cassandra ?
Depends on your scale, workload and performance requirements. I would run some tests based on how much data you expect to hold and what sort of workloads you need to support. Personally I think files are best kept in a file system, until a compelling reason is found to do otherwise.

Hope that helps.
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
Damien Picard
2011-06-22 11:10:05 UTC
Post by Damien Picard
- a Cassandra node
- a Tomcat instance
- the webapp, deployed on Tomcat
- Apache httpd, in front of Tomcat with mod_jakarta
You will have a bunch of services on the machine competing with each other
for resources (cpu, memory and network IO). It's not an approach I would
take.
You will also tightly couple the front end HTTP capacity to the DB
capacity. e.g. consider what happens when a cassandra node is down for a
while, what does this mean for your ability to accept http connections?
If the Cassandra JVM is down, Tomcat and httpd will continue to handle
requests, and Pelops will redirect these requests to another Cassandra node
on another server (though I may be wrong about this).
Post by Damien Picard
Requests from your web app may go to the local cassandra node, but thats
just the coordinator. They will be forwarded onto the replicas that contain
the data.
Yes, but as you noted before, this node can be down, so I will configure
Pelops to redistribute requests to another node. So there is no strong
coupling between Cassandra and Tomcat; it will work as if they were on
different servers.
Post by Damien Picard
Data are stored with RandomPartitionner, replication factor is 2.
RF 3 is the minimum RF you need to use for QUORUM to be less than the RF.
Thank you for this advice; I will reconsider the RF, but for now I
use only CL.ONE, not QUORUM. That could change in the near future.
Post by Damien Picard
In such case, do you advise me to store files in Cassandra ?
Depends on your scale, workload and performance requirements. I would do
some tests about how much data you expect to hold and what sort of workloads
you need to support. Personally I think files are best kept in a file
system, until a compelling reason is found to do other wise.
Thank you. I think that having to distribute files across the cluster,
which would otherwise require something like a distributed file system, is a
compelling reason to store files in Cassandra. I don't want to add another
complex component to my arch.
Post by Damien Picard
Hope that helps.
It does ! A lot ! Thank you.
aaron morton
2011-06-22 12:28:58 UTC
Post by Damien Picard
If the Cassandra JVM is down, Tomcat and httpd will continue to handle requests, and Pelops will redirect these requests to another Cassandra node on another server (though I may be wrong about this).
I was thinking of the server being turned off / broken / rebooting / disconnected from the network / taken out of rotation for maintenance. There are lots of reasons for a server to not be doing what it should be.


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
Damien Picard
2011-06-22 12:43:23 UTC
In this case, the load balancer has to detect (or be configured to know) that
the server is down and stop routing requests to it.
mcasandra
2011-06-22 17:11:28 UTC
Speaking purely from my personal experience, I haven't found Cassandra
optimal for storing big fat rows. Even at only 100s of KB I didn't
find Cassandra suitable for it. In my case I am looking at 400 writes + 400
reads per sec, growing 20%-30% every year, with file sizes from 70k-300k. What
I found is that when simultaneous reads and writes of big rows run in
parallel, it kills the performance of
Cassandra. Even if you add more nodes it doesn't scale at the level you
would expect it to. You start to see "dropped" messages all around.
With an 8-node cluster, good disks (SAS), and following the recommendations for
tuning Cassandra performance, I was only able to get 140 inserts and 80
reads per sec.
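
To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the client payload rate they imply. It assumes decimal units (1 KB = 1000 bytes) and counts only the file bytes themselves; replication, compaction and protocol overhead are not included, so real disk and network traffic would be higher. The helper name is made up for this sketch.

```python
KB = 1000  # bytes per KB; decimal units are an assumption


def payload_mb_per_s(ops_per_s, file_size_kb):
    """Client payload rate in MB/s for a given request rate and file size."""
    return ops_per_s * file_size_kb * KB / 1_000_000


# Figures from the post: 400 writes/s wanted, 140 inserts/s achieved,
# with files between 70 KB and 300 KB.
for label, ops in (("target writes", 400), ("observed inserts", 140)):
    low = payload_mb_per_s(ops, 70)
    high = payload_mb_per_s(ops, 300)
    print(f"{label}: {low:.1f} to {high:.1f} MB/s of file data")
```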

You can simply test it with the stress tool, and you will see the difference
as you start to increase the column size: the performance
of small columns, which starts at 1000s / sec, drops quickly as you
increase the column size.

But if your traffic is low volume it might work OK. Also, if over time you
accumulate tons of blobs, you might find yourself in a difficult situation. I
suggest doing some tests.

Don Ledford
2011-06-23 03:40:40 UTC
Post by Sasha Dolgy
store your images / documents / etc. somewhere and reference them
in Cassandra. That's the consensus that's been bandied about on this
list quite frequently
I store large files in Cassandra using columns for file blocks. I have
a simple blob class over the top of this which handles input and output
streaming so reads/writes are only one column at a time. I don't have
time to post the details, but it's pretty simple - each blob has its
own key and its blocks are the columns. Column names are essentially
the block number.
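
A minimal sketch of that layout, not the actual class described above: a
plain dict of dicts stands in for the column family, no Cassandra client is
involved, and the names (write_blob, read_blob, BLOCK_SIZE) are hypothetical.
The 1 MB default is the block size mentioned later in this thread.

```python
# Sketch of "one row per blob, one column per block".
# A dict of dicts stands in for the column family; no Cassandra client
# is used, and every name here is hypothetical.

BLOCK_SIZE = 1024 * 1024  # illustrative default block size (1 MB)

def write_blob(store, key, data, block_size=BLOCK_SIZE):
    """Split data into blocks and store each block as its own column."""
    row = store.setdefault(key, {})
    row.clear()  # overwrite any previous blob stored under this key
    for block_no, offset in enumerate(range(0, len(data), block_size)):
        row[block_no] = data[offset:offset + block_size]
    return len(row)

def read_blob(store, key):
    """Yield the blocks of a blob in block-number order, one at a time."""
    row = store[key]
    for block_no in sorted(row):
        yield row[block_no]

# Usage: a 2500-byte payload with 1024-byte blocks gives three columns.
store = {}
payload = b"x" * 2500
n_blocks = write_blob(store, "docs/report.pdf", payload, block_size=1024)
restored = b"".join(read_blob(store, "docs/report.pdf"))
```

Because read_blob is a generator, a caller only ever holds one block in
memory at a time, which is the point of splitting the file in the first place.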

I prefer this since there's no need to deal with yet another disk store
and its associated security, redundancy, and cost.

- Don
AJ
2011-06-23 04:29:32 UTC
Permalink
Post by Damien Picard
Hi,
I have to store some files (Images, documents, etc.) for my users in a
webapp. I use Cassandra for all of my data and I would like to know if
this is a good idea to store these files into blob on a Cassandra CF ?
Is there some contraindications, or special things to know to achieve this ?
Thank you
--
Damien Picard
Axeiya Services : http://axeiya.com/
gwt-ckeditor : http://code.google.com/p/gwt-ckeditor/
Mon livre sur GWT : http://axeiya.com/index.php/ouvrage-gwt.html
I was thinking of doing the same thing. But, to compensate for the
bandwidth usage during the read, I was hoping to find a way for the
httpd or app server to cache the file either in RAM or on disk so
subsequent reads could just reference the in-mem cache or local hdd. I
have big data requirements, so duplicating the storage of file blobs by
adding them to the hdd would almost double my storage requirements. So
the hdd cache would have to be size-limited, with the least recently used
files removed periodically.

I was thinking about making the key for each file be a relative file
path as if it were on disk. This same path could also be used as its
actual location on disk in the local disk cache. Using a path as the
key makes it flexible in many ways if I ever change my mind and want to
store all files on disk, or when backing up or archiving, etc.
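
A rough sketch of that idea, assuming a size-bounded local cache whose
on-disk location mirrors the key. The class name DiskLRUCache and the fetch
callback are hypothetical; the callback stands in for whatever actually reads
the blob from Cassandra. The sketch does not guard against keys containing
".." and does not survive a restart.

```python
import os
import tempfile
from collections import OrderedDict

class DiskLRUCache:
    """Bounded local disk cache keyed by relative file path (hypothetical)."""

    def __init__(self, root, max_bytes, fetch):
        self.root = root            # directory holding the cached files
        self.max_bytes = max_bytes  # upper bound on total cached bytes
        self.fetch = fetch          # callable: key -> bytes (backing store read)
        self.sizes = OrderedDict()  # key -> size, least recently used first
        self.total = 0

    def _path(self, key):
        # The key doubles as the relative location inside the cache directory.
        return os.path.join(self.root, *key.split("/"))

    def get(self, key):
        path = self._path(key)
        if key in self.sizes:
            self.sizes.move_to_end(key)   # mark as most recently used
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch(key)            # cache miss: go to the backing store
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        self.sizes[key] = len(data)
        self.total += len(data)
        self._evict()
        return data

    def _evict(self):
        # Drop least recently used files until the cache fits its budget,
        # but always keep the entry that was just added.
        while self.total > self.max_bytes and len(self.sizes) > 1:
            old_key, size = self.sizes.popitem(last=False)
            os.remove(self._path(old_key))
            self.total -= size

# Usage with an in-memory dict standing in for the backing store.
backing = {"users/1/a.jpg": b"a" * 60, "users/2/b.jpg": b"b" * 60}
calls = []

def fetch(key):
    calls.append(key)
    return backing[key]

cache = DiskLRUCache(tempfile.mkdtemp(), max_bytes=100, fetch=fetch)
cache.get("users/1/a.jpg")   # miss, fetched from the backing store
cache.get("users/1/a.jpg")   # hit, served from local disk
cache.get("users/2/b.jpg")   # miss; 120 bytes > 100, so a.jpg is evicted
```

Eviction here happens on every insert rather than periodically; a periodic
sweep would work the same way but tolerate brief overshoots of the budget.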

I'm rusty on my Apache httpd knowledge, but I thought there was an Apache
cache module that would use both RAM and disk depending on the
frequency of use. I don't know, though, whether you can tell it to
"cache this blob like it's a file".

Just some thoughts.
Sasha Dolgy
2011-06-23 05:43:42 UTC
Permalink
maybe you want to spend a few minutes reading about Haystack over at
facebook to give you some ideas...

https://www.facebook.com/note.php?note_id=76191543919

Not saying what they've done is the right way... just sayin'
Damien Picard
2011-06-23 07:36:50 UTC
Permalink
Post by Don Ledford
I have a simple blob class over the top of this which handles input and
output streaming so reads/writes are only one column at a time

Thank you for the tips. I think I will do the same; for now, I've
developed a simple version which stores the entire file in one column, but
I've already observed that it is a performance killer.
If I understand you correctly, the idea is to write a "CassandraInputStream"
and a "CassandraOutputStream" that store a file in multiple columns of one row.
Could you tell me what size you put in a single column? Have you benchmarked
this to determine an ideal column size?
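
For illustration, the read side of such a stream might look like the sketch
below. It is not the class discussed in this thread: ChunkedBlobReader and the
fetch_block callback are hypothetical names, and a dict lookup stands in for
fetching one column from Cassandra.

```python
import io

class ChunkedBlobReader(io.RawIOBase):
    """Read a blob one column (block) at a time. Hypothetical sketch."""

    def __init__(self, fetch_block):
        # fetch_block(n) returns the bytes of block n, or None past the end.
        self._fetch_block = fetch_block
        self._next_block = 0
        self._buf = b""

    def readable(self):
        return True

    def readinto(self, b):
        if not self._buf:
            # Only fetch the next column once the current one is used up.
            block = self._fetch_block(self._next_block)
            if not block:
                return 0  # end of blob
            self._buf = block
            self._next_block += 1
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n

# Usage: a dict keyed by block number stands in for one Cassandra row.
row = {0: b"hello ", 1: b"cassandra", 2: b"!"}
stream = io.BufferedReader(ChunkedBlobReader(row.get))
content = stream.read()
```

At most one block is buffered at any moment, so memory use is bounded by the
column size rather than the file size.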

Thank you.
--
Damien Picard
Axeiya Services : http://axeiya.com/
gwt-ckeditor : http://code.google.com/p/gwt-ckeditor/
Mon livre sur GWT : http://axeiya.com/index.php/ouvrage-gwt.html
Don Ledford
2011-06-23 17:53:08 UTC
Permalink
Exactly right - I wrote BlobInputStream and BlobOutputStream classes to
go with a Blob class.

I use 1MB for the block size, but I haven't done any performance
testing. I went small to favor low memory and bandwidth footprints.
I'd, of course, be very interested in any performance tests.
AJ
2011-06-23 23:34:10 UTC
Permalink
Post by Sasha Dolgy
maybe you want to spend a few minutes reading about Haystack over at
facebook to give you some ideas...
https://www.facebook.com/note.php?note_id=76191543919
Not saying what they've done is the right way... just sayin'
Thanks for the tip Sasha; will do.
