Discussion:
Repair hangs when merkle tree request is not acknowledged
Paul Sudol
2013-04-04 13:11:41 UTC
Permalink
Hello,

I have a cluster with 4 nodes, 2 nodes in each of 2 data centers. I had a hardware failure in one DC and had to replace the nodes. I'm running 1.2.3 on all of the nodes now. I was able to run nodetool rebuild on the two replacement nodes, but now I cannot finish a repair on any of them. I have 18 column families; if I run a repair on a single CF at a time, I can get the node repaired eventually. A repair on a certain CF will fail, but if I run it again and again, eventually it will succeed.

I've got an RF of 2, 1 copy in each DC, so the repair needs to pull data from the other DC to finish its repair.

The problem seems to be that the merkle tree request sometimes is not received by the node in the other DC. Usually when the merkle tree request is sent, the nodes that it was sent to start a compaction/validation. In certain cases this does not happen: only the node that I ran the repair on will begin compaction/validation and send the merkle tree to itself. Then it waits for a merkle tree from the other node, which it will never get. After about 24 hours it will time out and say the node in question died.
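(Aside: for anyone following along, here is a toy Python sketch of the comparison a repair is doing. This is not Cassandra's implementation — the data and names are made up, and it assumes a power-of-two number of leaf ranges for brevity — but it shows why both replicas must produce a tree: matching roots mean the replicas agree, and a differing leaf marks a token range that needs syncing.)

```python
import hashlib

def merkle_tree(values):
    """Build a list of levels: level 0 = leaf hashes, last level = root."""
    level = [hashlib.sha256(v.encode()).hexdigest() for v in values]
    levels = [level]
    while len(level) > 1:
        # Hash adjacent pairs to form the next level up.
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def mismatched_leaves(tree_a, tree_b):
    """Compare two same-shaped trees; return indices of differing leaves."""
    if tree_a[-1] == tree_b[-1]:      # roots match -> replicas in sync
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

# Two "replicas" holding four token ranges each; range 1 has diverged.
replica1 = ["r0:v1", "r1:v1", "r2:v1", "r3:v1"]
replica2 = ["r0:v1", "r1:v2", "r2:v1", "r3:v1"]

t1, t2 = merkle_tree(replica1), merkle_tree(replica2)
print(mismatched_leaves(t1, t2))  # → [1]
```

If the second tree never arrives, there is nothing to compare against, which is why the repair just sits waiting.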

Is there a setting I can use to force the merkle tree request to be acknowledged, or resent if it's not acknowledged? I set up NTPD on all the nodes and tried the cross_node_timeout, but that did not help.
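For reference, the relevant settings in my cassandra.yaml look like this (1.2.x; the timeout value shown is the stock default, cross_node_timeout is the one I enabled):

```yaml
# cassandra.yaml (1.2.x)
# Only meaningful when clocks are NTP-synchronised across nodes,
# as described above.
cross_node_timeout: true
# General inter-node request timeout (stock default, in ms).
request_timeout_in_ms: 10000
```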

Thanks in advance,

Paul
aaron morton
2013-04-06 03:31:55 UTC
Permalink
Post by Paul Sudol
If I wait 24 hours, the repair command will return an error saying that the node died… but the node really didn't die; I watched it the whole time.
Can you include the error, it makes it easier to know what's going on.

You should see INFO messages on the node you are running repair on that say something like

"[repair #%s] requesting merkle trees for %s (to %s)"

The last variable substitution is the endpoints the request was sent to.

On the receiving side you will see this message at DEBUG level

"Queueing validation compaction for …"

If you really want to see the messages, turn on TRACE logging for the MessagingService.
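With the default log4j setup that's a one-line override in conf/log4j-server.properties (1.2.x; it is very noisy, so revert it once you're done debugging):

```properties
# conf/log4j-server.properties (Cassandra 1.2.x)
# TRACE on MessagingService logs every inter-node message;
# enable only while debugging.
log4j.logger.org.apache.cassandra.net.MessagingService=TRACE
```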

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
aaron morton
2013-04-05 17:19:25 UTC
Permalink
Post by Paul Sudol
A repair on a certain CF will fail, but if I run it again and again, eventually it will succeed.
How does it fail?

Can you see the repair start on the other node?
If you are getting errors in the log about streaming failing because a node died, and the FailureDetector is in the call stack, change the phi_convict_threshold. You can set it in the yaml file or via JMX on the FailureDetectorMBean; in either case, boost it from 8 to 16 to get the repair through. This makes it less likely that a node is marked as down. You probably want to run with 8, or only a little higher, in normal operation.
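In the yaml that is just (as a temporary override while the repair runs):

```yaml
# cassandra.yaml -- failure detector sensitivity (default is 8).
# Higher values make the failure detector slower to mark a node down.
phi_convict_threshold: 16
```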

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
Paul Sudol
2013-04-05 18:03:15 UTC
Permalink
How does it fail?
If I wait 24 hours, the repair command will return an error saying that the node died… but the node really didn't die; I watched it the whole time.
I have DEBUG logging enabled. When the node I'm repairing sends out a merkle tree request, I will normally see {ColumnFamilyStore.java (line 700) forceFlush requested but everything is clean in <COLUMN FAMILY NAME>} in the log of the node that should be generating the merkle tree. (In addition, when I run nodetool -h localhost compactionstats, I will see activity.)

When the node that should be generating a merkle tree does not log this message, and shows no activity via nodetool compactionstats, the repair will fail.

There are no errors about streaming; it does not even get to the point of streaming. One node will send requests for merkle trees, and sometimes the node in the other data center just doesn't get the message. At least that's what it looks like.

Should I still try the phi_convict_threshold?

Thanks!

Paul