What does ReadRepair exactly do?

Markus Klems

2012-10-18 17:33:56 UTC

Hi guys,

I am looking through the Cassandra source code in the github trunk to
better understand how Cassandra's fault-tolerance mechanisms work. Most
things make sense. I am also aware of the wiki and DataStax documentation.
However, I do not understand what read repair does in detail. The method
RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to
do the trick of merging conflicting versions of column family replicas and
builds the set of columns that need to be "repaired". From looking at the
source code, I do not understand how this set is built and I do not
understand how the reconciliation is executed. ReadRepair does not seem to
trigger a Column.reconcile() to reconcile conflicting column versions on
different servers. Does it?

If this is not what read repair does, then: What kind of inconsistencies
are resolved by read repair? And: How are the inconsistencies resolved?

Could someone give me a hint?

Thanks so much,

-Markus

aaron morton

2012-10-21 22:49:40 UTC

There are two processes in Cassandra that trigger Read Repair-like behaviour.

During a read, a DigestMismatchException is raised if the responses from the replicas do not match. In this case another read is run that involves reading all the data. This is the CL level agreement kicking in.

The other "Read Repair" is the one controlled by "read_repair_chance". When RR is active on a request, ALL up replicas are involved in the read. When RR is not active, only CL replicas are involved. The test for CL agreement occurs synchronously to the request; the RR check runs asynchronously to the request, waiting for all nodes involved in the request to return. It then checks for consistency and repairs differences.
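
As a rough illustration of the choice described above, here is a minimal Python sketch. It is not Cassandra's code: the function name `pick_read_targets`, its parameters, and the use of a single random draw per request are assumptions made for this example.

```python
import random

def pick_read_targets(live_replicas, block_for, read_repair_chance, rng=random.random):
    """Return (targets, read_repair_active) for one read request.

    live_replicas: replicas that are up (hypothetical input, any order).
    block_for: number of responses the consistency level requires.
    read_repair_chance: probability in [0, 1] that RR is active.
    rng: source of a random number in [0, 1); injectable for testing.
    """
    if len(live_replicas) < block_for:
        raise RuntimeError("not enough live replicas for the consistency level")
    read_repair_active = rng() < read_repair_chance
    if read_repair_active:
        # RR active: ALL up replicas are involved in the read.
        return list(live_replicas), True
    # RR not active: only as many replicas as the CL needs.
    return list(live_replicas[:block_for]), False
```

With three live replicas and a CL that needs two responses, the sketch contacts all three when RR is active and only two otherwise.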

From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed.

When a DigestMismatch is detected a read is run using RepairCallback. The callback will call the RowRepairResolver.resolve() when enough responses have been collected.

resolveSuperset() picks one response as the baseline, and then calls delete() to apply row level deletes from the other responses (ColumnFamilys). It collects the other CFs into an iterator with a filter that returns all columns. The columns are then applied to the baseline CF, which may result in reconcile() being called.

reconcile() is used when an AbstractColumnContainer has two versions of a column and it wants to only have one.

RowRepairResolver.scheduleRepairs() works out the delta for each node by calling ColumnFamily.diff(). The delta is then sent to the appropriate node.
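
The merge-then-diff flow can be sketched in a few lines of Python. This is a simplified model, not the real implementation: rows are plain dictionaries, the newest timestamp wins, and row level deletes and timestamp ties are ignored. The function names only mirror the Cassandra methods mentioned above.

```python
from typing import Dict, List, Tuple

# A replica's version of a row: column name -> (value, timestamp).
Row = Dict[str, Tuple[str, int]]

def resolve_superset(versions: List[Row]) -> Row:
    """Merge all replica versions, keeping the newest timestamp per column."""
    resolved: Row = {}
    for version in versions:
        for name, (value, ts) in version.items():
            current = resolved.get(name)
            if current is None or ts > current[1]:
                resolved[name] = (value, ts)
    return resolved

def diff(replica_version: Row, resolved: Row) -> Row:
    """Columns the replica is missing or holds with an older timestamp."""
    return {
        name: col
        for name, col in resolved.items()
        if name not in replica_version or replica_version[name][1] < col[1]
    }

def schedule_repairs(versions_by_node: Dict[str, Row]) -> Dict[str, Row]:
    """Work out the delta for each node; up-to-date nodes are omitted."""
    resolved = resolve_superset(list(versions_by_node.values()))
    deltas = {node: diff(v, resolved) for node, v in versions_by_node.items()}
    return {node: delta for node, delta in deltas.items() if delta}
```

In this model a node that already holds the resolved row gets no delta, so only stale replicas receive a repair mutation.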

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com


Manu Zhang

2012-10-22 15:45:33 UTC

Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE) and
then MergeIterator.get(toCollate, fcomp, reducer)? I don't know what
happens after that. How exactly does reconcile() get called?


aaron morton

2012-10-23 07:17:57 UTC

Yes, all this starts because of the call to filter.collateColumns().

ColumnFamily is an implementation of o.a.c.db.AbstractColumnContainer; the methods to add columns on that interface pass through to an implementation of ISortedColumns.

The implementations of ISortedColumns, e.g. ArrayBackedSortedColumns, will call reconcile() on the IColumn if they need to.
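
To show where reconcile() fits in, here is a small Python sketch of a column container that reconciles on insert. The classes `Column` and `SortedColumns` are hypothetical stand-ins, and the tie-breaking rule (keep the existing column when timestamps are equal) is an assumption for this example, not a statement about Cassandra's behaviour.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int

    def reconcile(self, other: "Column") -> "Column":
        # Assumption for this sketch: the higher timestamp wins and,
        # on a tie, this (existing) column is kept.
        return other if other.timestamp > self.timestamp else self

class SortedColumns:
    """Hypothetical stand-in for an ISortedColumns implementation."""

    def __init__(self):
        self._columns = {}

    def add_column(self, column: Column) -> None:
        existing = self._columns.get(column.name)
        if existing is None:
            self._columns[column.name] = column
        else:
            # Two versions of one column: keep only the reconciled one.
            self._columns[column.name] = existing.reconcile(column)

    def get(self, name: str) -> Column:
        return self._columns[name]
```

The point of the sketch is that nothing calls reconcile() explicitly during the merge; it happens as a side effect of adding a column whose name is already present.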

Cheers



Shankaranarayanan P N

2012-10-23 21:22:22 UTC

Hello,

This conversation precisely targets a question that I have had for a
while. I would be grateful if someone could clarify it a little further.

Considering the case of a "repair" created due to a consistency constraint
(the first case in the discussion above), would the following interpretation
be correct?

1. A digest mismatch exception is raised even if only one among the many
responses differs (even if consistency is met on an out-of-date value, say by
virtue of timestamp).
2. A read is initiated by the callback to fetch data from all replicas.
3. resolve() is invoked to find the deltas for each replica that was out of
date.
4. Read repair is scheduled for the above replicas.
5. A normal read is performed to check whether it meets the consistency
constraints. Mismatches would trigger a repair again.

Assuming the above is true, would the mutations in step 4 and the read in
step 5 happen in parallel? In other words, would the time taken by the
read correction be the round trip between the coordinator and its farthest
replica that meets the consistency constraint?

Thanks,
Shankar



Manu Zhang

2012-10-24 00:49:07 UTC

Why repair again? We block until the consistency constraint is met. Then
the latest version is returned and the repair is done asynchronously if there
is any mismatch. We may retry the read if fewer columns than required are
returned.


shankarpnsn

2012-10-24 01:04:59 UTC


Just to make sure I understand you correctly: consider the case where a read
repair is in flight and a subsequent write affects one or more of the
replicas that were scheduled to receive the repair mutations. In this case,
are you saying that we return the older version to the user rather than the
latest version that was effected by the write?

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583355.html
Sent from the cassandra-***@incubator.apache.org mailing list archive at Nabble.com.

Manu Zhang

2012-10-24 01:08:50 UTC

I think so. Otherwise, we may never complete a read if writes come in
continuously.


Hiller, Dean

2012-10-24 11:51:51 UTC

Keep in mind, returning the older version is usually fine. Just imagine
if your user had clicked write 1 ms earlier: then the new version might be
returned. If he gets the older version and refreshes the page, he gets
the newer version. Same with an automated program as well... in general it
is okay to get the older or newer value. If you are reading 2 rows
however instead of one, that may change.

Dean


shankarpnsn

2012-10-24 14:02:13 UTC

Post by Hiller, Dean
in general it is okay to get the older or newer value. If you are reading
2 rows however instead of one, that may change.

This is certainly interesting, as it could mean that the user could see a
value that never met the required consistency. For instance, with 3 replicas
<R1,R2,R3> and quorum consistency, assume that R1 initiates a read
(becoming the coordinator), notices a conflict with R2 (assume R1 has the more
recent value), and initiates a read repair with its value. Meanwhile, R2 and
R3 have seen two different writes with newer values than what was computed
by the read repair. If R1 were to respond to the user with the value
that was computed at the time of read repair, wouldn't it be a value that
never met the consistency constraint? I was wondering whether this should
trigger another round of repair that tries to reach the consistency constraint
with a newer value or times out, which is the expected outcome when you don't
meet the required consistency. Please let me know if I'm missing something here.


Hiller, Dean

2012-10-24 14:20:06 UTC

The user will meet the required consistency unless you encounter some kind
of bug in Cassandra. You will get either the older value or the newer
value. If you read at quorum, and maybe a write at CL=1 just happened, you may
get the older or newer value depending on whether the node that received the
write was involved. If you read at quorum and you wrote at CL=QUORUM, then you
may get the newer value or the older value depending on who gets there
first, so to speak.

In your scenario, if the read repair read from R2 just before the write was
applied, you get the old value. If it read from R2 just after the write
was applied, it gets the new value. BOTH of these met the consistency
constraint. A better example to clear this up may be the following. If
you read a value at CL=QUORUM, and you have a write 20 ms later, you get
the old value, right? And it met the consistency level, right? NOW, what
about if the write is 1 ms later? What if the write is .00001 ms later?
It still met the consistency level, right? If it is .00001 ms before, you
get the new value, as it repairs first with the new node.

When programming, just keep in mind that your read may get the newer value
or the older value. Generally, if you write the code in a way that works with
that, this concept works out great in most cases (in some cases, you need to
think a bit differently and solve it other ways).

I hope that clears it up

Later,
Dean


Manu Zhang

2012-10-24 14:26:23 UTC

And we don't send read requests to all three replicas (R1, R2, R3) if
CL=QUORUM; just 2 of them, depending on proximity.
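
For reference, a quorum is a majority of the replicas, i.e. floor(RF / 2) + 1. The Python sketch below picks the closest live replicas needed for a QUORUM read. The helper `quorum_targets` and its arguments are hypothetical, and it assumes the replica list is already sorted by proximity.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas a QUORUM read waits for: a strict majority."""
    return replication_factor // 2 + 1

def quorum_targets(replicas_by_proximity, is_alive, replication_factor):
    """Pick the closest live replicas needed for QUORUM (hypothetical helper)."""
    needed = quorum(replication_factor)
    # Skip replicas that are down; the next closest live one takes their place.
    live = [r for r in replicas_by_proximity if is_alive(r)]
    if len(live) < needed:
        raise RuntimeError("cannot reach QUORUM with the live replicas")
    return live[:needed]
```

With RF=3 the sketch contacts two replicas: R1 and R2 when everything is up, or R1 and R3 if R2 is down.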


Hiller, Dean

2012-10-24 14:31:21 UTC

I guess one more thing is that I completely ignored your second write, mainly because I assumed it comes after we have already read. So let's say your current state is

node1 = val1 node2 = val1 node3 = val1

You do a quorum write of val2, and you are IN the middle of it!!!

node1 = val1 node2 = val2 node3 = val1 (NOTICE the write is not complete yet)

If you read from node1 and node3, you get val1. If you read from node1 and node2, you get val2, as a read repair will happen.

I.e., you always get the older value or the newer value.

If you have two writes come in like so

node1 = val1 node2 = val2 node3 = val3

Well, I think you can figure it out when you do a read ;). If your read quorum reads from node1 and node3, you get val3, etc. etc.

This is basically how it works. If your scenario is a web page, a user simply hits the refresh button and sees the values changing.
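
The first example above can be replayed with a short Python sketch that tries every pair of nodes a quorum read might contact. The timestamps are made up for illustration (val2 is newer than val1), and `quorum_read` is a hypothetical helper that simply returns the newest value it sees.

```python
from itertools import combinations

def quorum_read(state, contacted):
    """Return the value with the highest timestamp among the contacted nodes.

    state: node -> (value, timestamp); timestamps here are invented.
    """
    return max((state[n] for n in contacted), key=lambda vt: vt[1])[0]

# Mid-write state from the example: only node2 has val2 so far.
state = {"node1": ("val1", 1), "node2": ("val2", 2), "node3": ("val1", 1)}

results = {
    pair: quorum_read(state, pair)
    for pair in combinations(sorted(state), 2)
}
for pair, value in results.items():
    print(pair, "->", value)
```

Any pair that includes node2 returns val2 and the remaining pair returns val1, which matches the "older value or newer value" outcome described above.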

Later,
Dean


shankarpnsn

2012-10-24 15:28:32 UTC


Thanks for the example Dean. This definitely clears things up when you have
an overlap between the read and the write, and one comes after the other.
I'm still missing how read repairs behave. Just extending your example for
the following case:

1. node1 = val1 node2 = val1 node3 = val1

2. You do a write operation (W1) with quorum of val=2
node1 = val1 node2 = val2 node3 = val1 (write val2 is not complete yet)

3. Now with a read (R1) from node1 and node2, a read repair will be
initiated that needs to write val2 on node 1.
node1 = val1; node2 = val2; node3 = val1 (read repair val2 is not complete
yet)

4. Say, in the meanwhile node1 receives a write val4; the read repair for R1
now arrives at node1 but sees a newer value val4.
node1 = val4; node2 = val2; node3 = val1 (write val4 is not complete, read
repair val2 not complete)

In this case, for read R1, the value val2 does not have a quorum. Would read
R1 return val2 or val4 ?

Post by Manu Zhang
And we don't send read request to all of the three replicas (R1, R2, R3)
if CL=QUORUM; just 2 of them depending on proximity

Thanks Zhang. But this again seems a slightly strange thing to do, since one
(say R2) of the 2 closest replicas (say R1, R2) might be down, resulting in a
read failure while there are still enough replicas (R1 and R3)
live to satisfy a read.

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583372.html
Sent from the cassandra-***@incubator.apache.org mailing list archive at Nabble.com.

Hiller, Dean

2012-10-24 15:39:35 UTC

Post by shankarpnsn
Thanks Zhang. But, this again seems a little strange thing to do, since one
(say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
read failure while there are still enough number of replicas (R1 and R3)
live to satisfy a read.

He means in the case where all 3 nodes are live... if a node is down,
naturally it redirects to the other node and still succeeds because it
found 2 nodes even with one node down (feel free to test this live though!!!!!)

Post by shankarpnsn
Thanks for the example Dean. This definitely clears things up when you have
an overlap between the read and the write, and one comes after the other.
I'm still missing how read repairs behave. Just extending your example for
the following case:
1. node1 = val1 node2 = val1 node3 = val1
2. You do a write operation (W1) with quorum of val=2
node1 = val1 node2 = val2 node3 = val1 (write val2 is not complete yet)
3. Now with a read (R1) from node1 and node2, a read repair will be
initiated that needs to write val2 on node 1.
node1 = val1; node2 = val2; node3 = val1 (read repair val2 is not complete
yet)
4. Say, in the meanwhile node 1 receives a write val 4; Read repair for R1
now arrives at node 1 but sees a newer value val4.
node1 = val4; node2 = val2; node3 = val1 (write val4 is not complete, read
repair val2 not complete)
In this case, for read R1, the value val2 does not have a quorum. Would read
R1 return val2 or val4 ?

At this point, as Manu suggests, you need to look at the code, but most
likely what happens is they lock that row, receive the write in memory (i.e.
not losing it) and return to the client, caching it so that as soon as read repair
is over, it will write that next value. I.e. your client would receive
val2, and val4 would be the value in the database right after you received
val2. I.e. when a client interacts with Cassandra and you have tons of
writes to a row, val1, val2, val3, val4 in a short time period, just like
a normal database, your client may get one of those 4 values depending on
where the read gets inserted in the order of the writes... same as a normal
RDBMS. The only thing you don't have is the atomic nature with other rows.

NOTICE: they would not have to cache val4 very long, and if a newer write
came in, they would just replace it with that newer val and cache that one
instead, so it would not be a queue... but this is all just a guess... read the
code if you really want to know.

aaron morton

2012-10-25 08:29:41 UTC

It's important to point out the difference between Read Repair, in the context of the read_repair_chance setting, and Consistent Reads in the context of the CL setting.

If RR is active on a request, it means the request is sent to ALL UP nodes for the key and the RR process is ASYNC to the request. If all of the nodes involved in the request return to the coordinator before rpc_timeout, ReadCallback.maybeResolveForRepair() will put a repair task into the READ_REPAIR stage. This will compare the values, and IF there is a DigestMismatch it will start a Row Repair read that reads the data from all nodes and MAY result in differences being detected and fixed.
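
The replica selection described above can be sketched roughly as follows. This is a toy model, not Cassandra's code: the function name, its parameters and the closest-first ordering of `up_replicas` are assumptions made for illustration.

```python
import random

def replicas_to_contact(up_replicas, cl_count, read_repair_chance, rng=random):
    """Return the replicas a coordinator would contact for one read.

    up_replicas: live replicas for the key, assumed sorted closest-first.
    cl_count: replicas required by the CL (2 for QUORUM with 3 replicas).
    All names here are hypothetical; this only mirrors the prose above.
    """
    if rng.random() < read_repair_chance:
        # RR active: ALL up replicas take part in the read; the
        # consistency check across them is asynchronous to the request.
        return list(up_replicas)
    # RR not active: only as many replicas as the CL requires.
    return list(up_replicas[:cl_count])

if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    print(replicas_to_contact(nodes, 2, 0.0))  # chance 0: CL replicas only
    print(replicas_to_contact(nodes, 2, 1.0))  # chance 1: every up replica
```

With a chance between 0 and 1 the same request sometimes touches two replicas and sometimes all three, which is why RR only reduces the probability of a later Digest Mismatch.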

All of this is outside of the processing of your read request. It is separate from the stuff below.

Inside the user read request, when ReadCallback.get() is called and CL nodes have responded, the responses are compared. If a DigestMismatch happens then a Row Repair read is started, and the result of this read is returned to the user. This Row Repair read MAY detect differences; if it does, it resolves the super set, sends the delta to the replicas and returns the super set value to the client.
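
A minimal sketch of that consistent-read path, under simplifying assumptions: each replica holds a single (value, timestamp) cell, every contacted replica is digested (the real read mixes one data read with digest reads), timestamp ties are not modelled, and the deltas are returned to the caller rather than sent anywhere. `digest` and `consistent_read` are hypothetical names.

```python
import hashlib

def digest(cell):
    """Digest of one (value, timestamp) cell."""
    value, ts = cell
    return hashlib.md5(f"{value}:{ts}".encode()).hexdigest()

def consistent_read(replicas, contacted):
    """Toy model of a read at a given CL.

    replicas:  dict mapping node name -> (value, timestamp)
    contacted: the nodes chosen to satisfy the CL
    Returns (cell returned to the client, deltas to send to stale replicas).
    """
    if len({digest(replicas[n]) for n in contacted}) == 1:
        # Digests agree: answer straight away, nothing to repair.
        return replicas[contacted[0]], {}
    # Digest mismatch: read the full data from the same nodes.
    versions = {n: replicas[n] for n in contacted}
    # Resolve the super set: the cell with the highest timestamp wins.
    winner = max(versions.values(), key=lambda cell: cell[1])
    # Only replicas holding something older get a delta write.
    deltas = {n: winner for n, cell in versions.items() if cell != winner}
    return winner, deltas

if __name__ == "__main__":
    state = {"node1": ("val1", 1), "node2": ("val2", 2), "node3": ("val1", 1)}
    print(consistent_read(state, ["node1", "node3"]))
    print(consistent_read(state, ["node1", "node2"]))
```

The sketch returns the resolved value without waiting for the delta to be acknowledged, which is the behaviour being questioned in this thread.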

Post by shankarpnsn
I'm still missing, how read repairs behave. Just extending your example for

The example does not use Read Repair, it is handled by Consistent Reads.

The purpose of RR is to reduce the probability that a read in the future using any of the replicas will result in a Digest Mismatch. "Any of the replicas" means ones that were not necessary for this specific read request.

Post by shankarpnsn
2. You do a write operation (W1) with quorum of val=2
node1 = val1 node2 = val2 node3 = val1 (write val2 is not complete yet)

If the write has not completed then it is not a successful write at the specified CL as it could fail now.

Therefore the R + W > N Strong Consistency guarantee does not apply at this exact point in time. A read to the cluster at this exact point in time using QUORUM may return val2 or val1. Again, the operation W1 has not completed; if read R' starts and completes while W1 is processing, it may or may not return the result of W1.

Post by shankarpnsn
In this case, for read R1, the value val2 does not have a quorum. Would read
R1 return val2 or val4 ?

If val4 is in the memtable on the node before the second read, the result will be val4.
Writes that happen between the initial read and the second read after a Digest Mismatch are included in the read result.

The way I think about consistency is "what value do reads see if writes stop":

* If you have R + W > N, so all writes succeeded at CL QUORUM, all successful reads are guaranteed to see the last write.
* If you are using a low CL and/or had failed writes at QUORUM, then R + W <= N. All successful reads will *eventually* see the last value written, and they are guaranteed to return the value of a previous write or no value. Eventually background Read Repair, Hinted Handoff or nodetool repair will repair the inconsistency.
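
The R + W > N condition can be checked by brute force. The sketch below (helper names are made up for illustration) enumerates every possible read set and write set over N replicas and reports whether they always share at least one node.

```python
from itertools import combinations

def quorum(n):
    """Number of replicas in a quorum of n (n // 2 + 1)."""
    return n // 2 + 1

def always_overlap(n, r, w):
    """True if every set of r read replicas shares at least one node
    with every set of w write replicas, out of n replicas in total."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )

if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("QUORUM/QUORUM overlaps:", always_overlap(n, q, q))  # R + W = 4 > 3
    print("ONE/QUORUM overlaps:", always_overlap(n, 1, q))     # R + W = 3, not > 3
```

For N = 3, reading and writing at QUORUM always overlap, while a CL ONE read can land on the one replica that a QUORUM write missed.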

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

shankarpnsn

2012-10-25 15:45:38 UTC

Post by aaron morton

Post by shankarpnsn
2. You do a write operation (W1) with quorum of val=2
node1 = val1 node2 = val2 node3 = val1 (write val2 is not complete yet)

If the write has not completed then it is not a successful write at the
specified CL as it could fail now.
Therefore the R + W > N Strong Consistency guarantee does not apply at this
exact point in time. A read to the cluster at this exact point in time
using QUORUM may return val2 or val1. Again the operation W1 has not
completed, if read R' starts and completes while W1 is processing it may
or may not return the result of W1.

I agree completely that it is fair to have this indeterminism in case of
partial/failed/in-flight writes, based on what nodes respond to a subsequent
read.

Post by aaron morton
It's important to point out the difference between Read Repair, in the
context of the read_repair_chance setting, and Consistent Reads in the
context of the CL setting. All of this is outside of the processing of
your read request. It is separate from the stuff below.
Inside the user read request when ReadCallback.get() is called and CL
nodes have responded the responses are compared. If a DigestMismatch
happens then a Row Repair read is started, the result of this read is
returned to the user. This Row Repair read MAY detect differences, if it
does it resolves the super set, sends the delta to the replicas and
returns the super set value to be returned to the client.

Post by shankarpnsn
In this case, for read R1, the value val2 does not have a quorum. Would read
R1 return val2 or val4 ?

If val4 is in the memtable on node before the second read the result will be val4.
Writes that happen between the initial read and the second read after a
Digest Mismatch are included in the read result.

Thanks for clarifying this, Aaron. This is very much in line with what I
figured out from the code and brings me back to my initial question on the
point of when and what the user/client gets to see as the read result. Let
us, for now, consider only the repairs initiated as a part of /consistent
reads/. If the Row Repair (after resolving and sending the deltas to
replicas, but not waiting for a quorum success after the repair) returns the
super set value immediately to the user, wouldn't it be a breach of the
consistent reads paradigm? My intuition behind saying this is because we
would respond to the client without the replicas having confirmed their
meeting the consistency requirement.

I agree that returning val4 is the right thing to do if a quorum (two) of the
nodes among (node1, node2, node3) have val4 at the second read after the digest
mismatch. But wouldn't it be incorrect to respond to the user with any value
when the second read (after the mismatch) doesn't find a quorum? So after
sending the deltas to the replicas as a part of the repair (still a part of
/consistent reads/), shouldn't the value be read again to check for the
presence of a quorum after the repair?

In the example we had, assume the mismatch is detected during a read R1 from
coordinator node C that reaches node1 and node2.
State seen by C after first read R1: <node1 = val1, node2 = val2, node3 =
val1>

A second read is initiated as a part of repair for consistent read of R1.
This second read observes the values (val1, val2) from (node1, node2) and
sends the corresponding row repair delta to node1. I'm guessing C cannot
respond back to the user with val2 until C knows that node1 has actually written
the value val2, thereby meeting the quorum. Is this interpretation correct?

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583395.html
Sent from the cassandra-***@incubator.apache.org mailing list archive at Nabble.com.

Hiller, Dean

2012-10-25 16:00:18 UTC

Kind of an interesting question.

I think you are saying: suppose a client read was resolved against only the
two nodes, as described in Aaron's email, the value was returned to the
client, and read repair was kicked off because of the inconsistent values
while the write had not completed yet. Then I guess you would need two nodes
to go down right after the read, and before the write finished, to lose the
value, such that the client read a value that was never stored in the
database. The odds of two nodes going out are pretty slim though.

Or, what if the node with part of the write went down? As long as the
client stays up, he would complete his write on the other two nodes.
Seems to me as long as two nodes don't fail, you are reading at quorum and
fit with the consistency model, since you get a value that will be on two
nodes in the immediate future.

Thanks,
Dean

Manu Zhang

2012-10-25 16:37:41 UTC

Read quorum doesn't mean we read the newest value from a quorum number of
replicas; it ensures we read at least one copy of the newest value, as long as
a write quorum succeeded beforehand and W+R > N.

shankarpnsn

2012-10-25 18:15:24 UTC

Post by Manu Zhang
read quorum doesn't mean we read newest values from a quorum number of
replicas but to ensure we read at least one newest value as long as write
quorum succeeded beforehand and W+R > N.

I beg to differ here. Any read/write, by definition of quorum, should have
at least n/2 + 1 replicas that agree on that read/write value. Responding to
the user with a newer value, even if the write creating the new value hasn't
completed cannot guarantee any read consistency > 1.

Post by Manu Zhang

Post by Hiller, Dean
Kind of an interesting question
I think you are saying if a client read resolved only the two nodes as
said in Aaron's email back to the client and read -repair was kicked off
because of the inconsistent values and the write did not complete yet and
I guess you would have two nodes go down to lose the value right after the
read, and before write was finished such that the client read a value that
was never stored in the database. The odds of two nodes going out are
pretty slim though.
Thanks,
Dean

Bingo! I do understand that the odds of a quorum of nodes going down are low
and that any subsequent read would achieve a quorum. However, I'm wondering
what would be the right thing to do here, given that the client has
specifically asked for a certain consistency on the read and Cassandra
returns a value that doesn't have that consistency. The heart of the problem
here is that the coordinator responds to a client request "assuming" that
the consistency has been achieved the moment it issues a row repair with the
super-set of the resolved value, without receiving acknowledgement of the
success of the repair from the replicas for the given consistency constraint.

In order to adhere to the given consistency specification, the row repair
(due to consistent reads) should repeat the read after issuing a
"consistency repair" to check whether the consistency is met. Like Manu
mentioned, this could of course lead to a number of repeat reads if the
writes arrive quickly, until the read times out. However, note that we
would still be honoring the consistency constraint for that read.

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583400.html
Sent from the cassandra-***@incubator.apache.org mailing list archive at Nabble.com.

aaron morton

2012-10-26 09:53:56 UTC

Post by shankarpnsn

Post by Manu Zhang
read quorum doesn't mean we read newest values from a quorum number of
replicas but to ensure we read at least one newest value as long as write
quorum succeeded beforehand and W+R > N.

This is correct.
It's not that a quorum of nodes agree, it's that a quorum of nodes participate. If a quorum participate in both the write and the read, you are guaranteed that at least one node was involved in both. The Wikipedia definition helps here: "A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group" http://en.wikipedia.org/wiki/Quorum

It's a two step process: First do we have enough people to make a decision? Second following the rules what was the decision?

In C* the rule is to use the value with the highest time stamp. Not the value with the highest number of "votes". The red boxes on this slide are the winning values http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67 (thinking one of my slides in that deck may have been misleading in the past). In Riak the rule is to use Vector Clocks.

So

Post by shankarpnsn
I agree that returning val4 is the right thing to do if quorum (two) nodes
among (node1,node2,node3) have the val4

Is incorrect.
We return the value with the highest time stamp returned from the nodes involved in the read. Only one needs to have val4.
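
To make the difference between "highest timestamp" and "most votes" concrete, here is a small sketch. Both functions are hypothetical helpers; the majority version is shown only for contrast and is not what the read path described above does.

```python
from collections import Counter

def resolve_by_timestamp(responses):
    """The rule described above: the value with the highest timestamp wins."""
    return max(responses, key=lambda cell: cell[1])[0]

def resolve_by_majority(responses):
    """For contrast only: a 'most votes' rule, which is NOT the rule here."""
    return Counter(value for value, _ in responses).most_common(1)[0][0]

if __name__ == "__main__":
    # (value, timestamp) responses; only one replica has seen val4 so far.
    responses = [("val1", 1), ("val1", 1), ("val4", 4)]
    print("timestamp rule:", resolve_by_timestamp(responses))
    print("majority rule: ", resolve_by_majority(responses))
```

Under the timestamp rule a single response carrying val4 is enough for val4 to be returned, which is the point made above.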

Post by shankarpnsn
The heart of the problem
here is that the coordinator responds to a client request "assuming" that
the consistency has been achieved the moment is issues a row repair with the
super-set of the resolved value; without receiving acknowledgement on the
success of a repair from the replicas for a given consistency constraint.

and

Post by shankarpnsn
My intuition behind saying this is because we
would respond to the client without the replicas having confirmed their
meeting the consistency requirement.

It is not necessary for the coordinator to wait.

Consider an example: The app has stopped writing to the cluster. For a certain column, nodes 1, 2 and 3 have value:timestamp bar:2, bar:2 and foo:1 respectively. The last write was a successful CL QUORUM write of bar with timestamp 2. However, node 3 did not acknowledge this write for some reason.

To make it interesting the commit log volume on node 3 is full. Mutations are blocking in the commit log queue so any write on node 3 will timeout and fail, but reads are still working. We could imagine this is why node 3 did not commit bar:2

Some read examples, RR is not active:

1) Client reads from node 4 (a non replica) with CL QUORUM, request goes to nodes 1 and 2. Both agree on bar as value.
2) Client reads from node 3 with CL QUORUM, request is processed locally and on node 2.
* There is a digest mismatch
* Row Repair read runs to read from nodes 2 and 3.
* The super set resolves to bar:2
* Node 3 (the coordinator) queues a delta write locally to write bar:2. No other delta writes are sent.
* Node 3 returns bar:2 to the client
3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and bar:2 is returned.
4) Client reads from node 2 at CL QUORUM, read goes to 2 and 3. Roughly the same thing as (2) happens and bar:2 is returned.
5) Client reads from node 1 at CL ONE. Read happens locally only and returns bar:2
6) Client reads from node 3 at CL ONE. Read happens locally only and returns foo:1

So:
* A read at CL QUORUM will always return bar:2 even if node 3 only has foo:1 on disk.
* A read at CL ONE will return no value or any previous write.
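
The read examples above can be replayed with a toy model. It assumes the delta write on node 3 never lands (its commit log is full), so the cluster state stays fixed; `cluster` and `read` are illustrative names, not Cassandra APIs.

```python
from itertools import combinations

# value:timestamp per replica, as in the example: bar:2, bar:2, foo:1
cluster = {1: ("bar", 2), 2: ("bar", 2), 3: ("foo", 1)}

def read(nodes):
    """Resolve a read over the given replicas: highest timestamp wins."""
    return max((cluster[n] for n in nodes), key=lambda cell: cell[1])[0]

# Every possible pair of replicas a QUORUM read could use.
quorum_results = {pair: read(pair) for pair in combinations(cluster, 2)}
# A CL ONE read served by each replica on its own.
one_results = {node: read((node,)) for node in cluster}

if __name__ == "__main__":
    print("QUORUM:", quorum_results)
    print("ONE:   ", one_results)
```

Any two replicas include node 1 or node 2, so every QUORUM read resolves to bar, while a CL ONE read served by node 3 alone still sees foo.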

The delta write from the Row Repair goes to a single node so R + W > N cannot be applied. It can almost be thought of as an internal implementation detail. The delta write from a Digest Mismatch, HH writes, full RR writes and nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE
* Eventually reach a state where reads at any CL return the last write.

They are not used to ensure strong consistency when R + W > N. You could turn those things off and R + W > N would still work.

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

Manu Zhang

2012-10-24 14:31:54 UTC

oh, it would clarify a lot if you read the source code; the method is
o.a.c.service.StorageProxy.fetchRows, if I remember correctly

Post by Manu Zhang
And we don't send read requests to all three replicas (R1, R2, R3)
if CL=QUORUM; just 2 of them, depending on proximity

Post by Hiller, Dean
The user will meet the required consistency unless you encounter some kind
of bug in Cassandra. You will either get the older value or the newer
value. If you read at quorum, and maybe a write at CL=1 just happened, you
may get the older or the newer value depending on whether the node that
received the write was involved. If you read at quorum and you wrote at
CL=QUORUM, then you may get the newer value or the older value depending on
who gets there first, so to speak.
In your scenario, if the read repair read from R2 just before the write is
applied, you get the old value. If it read from R2 just after the write
was applied, it gets the new value. BOTH of these met the consistency
constraint. A better example to clear this up may be the following... If
you read a value at CL=QUORUM, and you have a write 20ms later, you get
the old value, right? And it met the consistency level, right? NOW, what
about if the write is 1ms later? What if the write is .00001ms later?
It still met the consistency level, right? If it is .00001ms before, you
get the new value, as it repairs first with the new node.
It is just that when programming, your read may get the newer value or the
older value, and generally if you write the code in a way that works with
that, this concept works out great in most cases (in some cases, you need to
think a bit differently and solve it other ways).
I hope that clears it up
Later,
Dean
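
Dean's point that a quorum read may legitimately return either value while a write is still spreading can be checked with a small enumeration. This is an illustrative sketch, not Cassandra code: each replica is reduced to a (timestamp, value) pair, the coordinator is assumed to contact any 2 of the 3 replicas, and `resolve` simply keeps the pair with the highest timestamp.

```python
from itertools import combinations


def resolve(responses):
    """Keep the (timestamp, value) pair with the highest timestamp."""
    return max(responses)


def quorum_read_outcomes(replica_states, read_size=2):
    """Values a read could return, over every possible choice of contacted replicas."""
    return {resolve(subset)[1] for subset in combinations(replica_states, read_size)}


old, new = (1, "old"), (2, "new")

# The write has reached only one replica so far (in flight, or written at CL=1).
in_flight = quorum_read_outcomes([new, old, old])

# The write has reached a quorum of replicas.
completed = quorum_read_outcomes([new, new, old])

print(sorted(in_flight), sorted(completed))
```

While the new value sits on a single replica, the result depends on which two replicas answer. Once it is on two of three, every pair of replicas includes at least one copy of it, so the read can only return the new value.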

Post by shankarpnsn

in general it is okay to get the older or newer value. If you are reading
2 rows however instead of one, that may change.

This is certainly interesting, as it could mean that the user could see a
value that never met the required consistency. For instance with 3 replicas
<R1,R2,R3> and a quorum consistency, assume that R1 is initiating a read
(becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
recent value) and initiates a read repair with its value. Meanwhile R2 and
R3 have seen two different writes with newer values than what was computed
by the read repair. If R1 were to respond back to the user with the value
that was computed at the time of read repair, wouldn't it be a value that
never met the consistency constraint? I was thinking if this should trigger
another round of repair that tries to reach the consistency constraint with
a newer value or time-out, which is the expected case when you don't meet
the required consistency. Please let me know if I'm missing something here.
--
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583366.html
