Discussion:
Can Cassandra client programs use hostnames instead of IPs?
Huiliang Zhang
2014-05-12 18:31:27 UTC
Hi,

Cassandra returns the IPs of the nodes in the cluster for further
communication between the Hadoop program and the Cassandra cluster. Is there a
way to configure the cluster to return hostnames instead of IPs?
My Cassandra cluster is on AWS and has no elastic IPs that can be accessed from
outside AWS.

Thanks,
Huiliang
Ben Bromhead
2014-05-13 14:45:45 UTC
You can set listen_address in cassandra.yaml to a hostname (http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html).

Cassandra will use the IP address returned by a DNS query for that hostname. On AWS you don't have to assign an elastic IP; every instance comes with a public IP that lasts for its lifetime (if you use EC2-Classic or your VPC is set up to assign them).

Note that whatever hostname you set in a node's listen_address must resolve to the private IP, as AWS instances only have network access via their private address. Traffic to an instance's public IP is NATed and forwarded to the private address. So you may as well just use the node's IP address.
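
To see which address a hostname-based listen_address would actually resolve to, something like the following Python sketch (standard library only) may help. The function names are illustrative and not part of Cassandra, and the lookup runs on whichever machine executes it, so results can differ between hosts inside and outside AWS.

```python
import ipaddress
import socket


def resolve_addresses(hostname):
    """Return the sorted, de-duplicated IPv4 addresses that hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, proto=socket.IPPROTO_TCP
    )
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})


def is_private(ip):
    """True if ip lies in a private (non-internet-routable) range such as 10.0.0.0/8."""
    return ipaddress.ip_address(ip).is_private


def describe(hostname):
    """Map each resolved address to the label 'private' or 'public'."""
    return {
        ip: ("private" if is_private(ip) else "public")
        for ip in resolve_addresses(hostname)
    }
```

Calling describe() with the hostname you plan to put in listen_address (for example a hypothetical "cassandra-node-1.example.internal") on the node itself should report a private address if the setup matches the description above.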

If you run Hadoop on instances in the same AWS region, it will be able to access your Cassandra cluster via private IP. If you run Hadoop externally, just use the public IPs.

If you run in a VPC without public addressing and want to connect from external hosts, you will want to look at a VPN (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_VPN.html).
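
The choice of address described above can be summarised in a small Python sketch. It is only an illustration of the reasoning: the function name, the dictionary keys and the example addresses are made up for this example and do not correspond to any Cassandra or AWS API.

```python
def address_for_client(node, client_in_same_region):
    """Pick the address a Hadoop client should use to reach one Cassandra node.

    node is a dict with a 'private_ip' key and an optional 'public_ip' key
    (hypothetical structure, used only for this illustration).
    """
    if client_in_same_region:
        # Inside the same AWS region the private address is reachable directly.
        return node["private_ip"]
    public_ip = node.get("public_ip")
    if public_ip is None:
        # No public addressing: an external client has no direct route.
        raise ValueError("no public IP: external clients need a VPN into the VPC")
    return public_ip


# Example nodes; the addresses are placeholders.
with_public = {"private_ip": "10.0.1.12", "public_ip": "203.0.113.10"}
private_only = {"private_ip": "10.0.1.13"}
```

For instance, address_for_client(with_public, client_in_same_region=False) picks the public address, while the same call with private_only raises an error that points towards the VPN option.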

Ben Bromhead
Huiliang Zhang
2014-05-16 00:16:08 UTC
Thanks. In my case there is no public IP and a VPN cannot be set up. It
seems that I have to run an EMR job to operate on the AWS Cassandra cluster.

I got timeout errors while running the EMR job:
java.lang.RuntimeException: Could not retrieve endpoint ranges:
    at org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:333)
    at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:149)
    at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:144)
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:228)
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:213)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:658)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:183)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.createThriftClient(BulkRecordWriter.java:348)
    at org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:293)
    ... 12 more
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:178)
    ... 15 more

Any suggestions would be appreciated.
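
The trace above ends in a TCP connect timeout inside the Thrift client, which often points to an unreachable address or a blocked port rather than a Cassandra-level error. One way to narrow this down is to run a plain TCP check from the machine that runs the job. The Python sketch below is one such check; it assumes the cluster uses the default Thrift rpc_port of 9160, and the function names are illustrative only.

```python
import socket


def can_connect(host, port=9160, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds.

    9160 is Cassandra's default Thrift rpc_port; pass another port if the
    cluster is configured differently.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unresolvable hostnames.
        return False


def unreachable_nodes(hosts, port=9160, timeout=5.0):
    """Return the subset of hosts that could not be reached on the given port."""
    return [host for host in hosts if not can_connect(host, port, timeout)]
```

Running unreachable_nodes() with the node addresses that Cassandra reports, from an EMR task node, would show whether those addresses and the port are reachable from where the job executes.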