Mike Neir
2013-08-30 15:57:09 UTC
Greetings folks,
I'm faced with the need to update a 36-node cluster with roughly 25T of data on
disk to a version of cassandra in the 1.2.x series. While it seems that 1.2.8
will play nicely in the 1.0.9 cluster long enough to do a rolling upgrade, I'd
still like to have a roll-back plan in case the rolling upgrade goes sideways.
I've tried upgrading a single node in my dev cluster and then rolling back using
a snapshot taken beforehand, but things aren't going smoothly. The node does
rejoin the ring eventually, but not before spending some time in the "Joining"
state (as shown by "nodetool ring") and spewing a ton of error messages similar
to the following:
ERROR [MutationStage:31] 2013-08-29 14:07:20,530 RowMutationVerbHandler.java
(line 61) Error in row mutation
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1178
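While the node is in that state I keep an eye on how many of these it's throwing
with something along these lines (log path assumed to be the stock
/var/log/cassandra/system.log):
# grep -c UnknownColumnFamilyException /var/log/cassandra/system.log
# tail -F /var/log/cassandra/system.log | grep --line-buffered UnknownColumnFamilyException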
My test procedure is as follows:
1) nodetool -h localhost snapshot
2) nodetool -h localhost drain
3) service cassandra stop
4) back up cassandra configs
5) remove cassandra 1.0.9
6) install cassandra 1.2.8
7) restore cassandra configs, alter them to remove configuration entries no
longer used
8) start cassandra 1.2.8, let it run for a bit, then drain/stop it
9) remove cassandra 1.2.8
10) reinstall cassandra 1.0.9
11) restore original cassandra configs
12) remove any commit logs present
13) remove the directories for the system_auth and system_traces keyspaces
(since they don't seem to be present in 1.0.9)
14) Move the snapshots back to where they should be for 1.0.9 and remove the
1.2.8 data (collapsed into the sketch after this list)
# cd /var/lib/cassandra/data/$KEYSPACE/
# mv */snapshots/$TIMESTAMP/* .
# find . -mindepth 1 -type d -exec rm -rf {} \;
# cd /var/lib/cassandra/data/system
# mv */snapshots/$TIMESTAMP/* .
# find . -mindepth 1 -type d -exec rm -rf {} \;
15) start cassandra 1.0.9
16) observe cassandra system.log
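For steps 12 through 14, this is roughly what I've been running, collapsed into
one script. It assumes the stock /var/lib/cassandra paths and that the first
argument is the snapshot tag created in step 1; it's just a sketch, not a
hardened tool:

#!/bin/bash
# Sketch of steps 12-14: restore the 1.0.9 snapshot and clean up 1.2.8 leftovers.
# Assumes stock package paths; $1 is the snapshot tag created in step 1.
set -eu
TAG="$1"
DATA=/var/lib/cassandra/data
COMMITLOG=/var/lib/cassandra/commitlog

# step 12: remove any commit logs written by 1.2.8
rm -f "$COMMITLOG"/*

# step 13: drop the keyspaces that only exist in 1.2.x
rm -rf "$DATA"/system_auth "$DATA"/system_traces

# step 14: pull the snapshotted sstables back up into the flat 1.0.9 layout,
# then remove the per-columnfamily directories that 1.2.8 created
for ks in "$DATA"/*/; do
    cd "$ks"
    mv ./*/snapshots/"$TAG"/* . 2>/dev/null || true
    find . -mindepth 1 -maxdepth 1 -type d -exec rm -rf {} +
done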
Does anyone have any insight into what I might be doing wrong, or whether this is
just an unavoidable pain point of rolling back? Since there are no schema changes
involved, it seems the node should be able to hop back into the cluster without
errors and without transitioning through the "Joining" state.
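For completeness, once 1.0.9 is back up (steps 15 and 16) I sanity-check the node
with roughly the following, again assuming the stock log location:
# nodetool -h localhost ring
# nodetool -h localhost info
# grep -c UnknownColumnFamilyException /var/log/cassandra/system.log
I'm looking for the node to show Up/Normal on its original token and for that
error count to stop growing.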
--
Mike Neir
Liquid Web, Inc.
Infrastructure Administrator