Set up GlusterFS on Linux

Create a WireGuard VPN

Set up a WireGuard VPN, so that the `glusterd` services on the different servers can connect to each other privately and securely.

/etc/wireguard/fs0.conf:
# This WireGuard device is for creating a VPN with other servers in the same FileSystem cluster (running GlusterFS).
#
# Update this configuration seamlessly using: wg syncconf fs0 <(wg-quick strip fs0)
# WireGuard start/stop/status usage: wg-quick up fs0 / wg-quick down fs0 / wg show fs0
# wg-quick executes PreUp/PostUp/PreDown/PostDown, and also interprets Address with mask to add the device, set the virtual IP-address, and routing.
# Endpoints cannot contain domain names, they must be IP-addresses.
#
[Interface]
# s1.example.org
PrivateKey = <s1 Private Key>
Address = 10.0.0.1/32
ListenPort = 5102
# Ensure UDP port 5102 is open:
PostUp = ! iptables-save | grep -qFx -- "-A INPUT -p udp --dport 5102 -j ACCEPT" && iptables -A INPUT -p udp --dport 5102 -j ACCEPT || true
# Set default forwarding policy to DROP:
PostUp = iptables -P FORWARD DROP
# Create a new FS0_FW chain that contains forwarding rules related to the fs0 interface:
PostUp = iptables -N FS0_FW && iptables -A FORWARD -i fs0 -j FS0_FW || iptables -F FS0_FW
# Set up forwarding rules: only traffic from 10.0.0.0/24 to 10.0.0.0/24 is allowed, and only within the fs0 device:
PostUp = iptables -A FS0_FW -m state --state INVALID -j DROP
PostUp = iptables -A FS0_FW -m state --state RELATED,ESTABLISHED -j ACCEPT
PostUp = iptables -A FS0_FW -s 10.0.0.0/24 -d 10.0.0.0/24 -o fs0 -j ACCEPT
PostUp = iptables -A FS0_FW -j DROP
# Enable IPv4 forwarding:
PostUp = sysctl -w net.ipv4.ip_forward=1
# Clean up rules:
PostDown = iptables -D INPUT -p udp --dport 5102 -j ACCEPT || true
PostDown = iptables -D FORWARD -i fs0 -j FS0_FW || true
PostDown = iptables -F FS0_FW || true
PostDown = iptables -X FS0_FW || true

[Peer]
# s1.example.org
PublicKey = <s1 Public Key>
Endpoint = <s1 public IP-address>:5102
AllowedIPs = 10.0.0.1/32

[Peer]
# s2.example.org
PublicKey = <s2 Public Key>
Endpoint = <s2 public IP-address>:5102
AllowedIPs = 10.0.0.2/32

[Peer]
# s3.example.org
PublicKey = <s3 Public Key>
Endpoint = <s3 public IP-address>:5102
AllowedIPs = 10.0.0.3/32

Generate a new private key on every server using wg genkey. Then derive the public key by pasting the private key into the stdin of wg pubkey (use CTRL+D to close stdin). Finally, run wg-quick up fs0 to enable the WireGuard VPN.
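
For example, on each server (a minimal sketch; storing the key in a file under /etc/wireguard is just an assumption, not required by the setup above):

umask 077                           # keep the key file readable by root only
wg genkey > /etc/wireguard/fs0.key  # generate this server's private key
wg pubkey < /etc/wireguard/fs0.key  # print the public key to share with the peers
wg-quick up fs0                     # enable the WireGuard VPN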

Ping one of the AllowedIPs to test the connection, and review the configuration status with wg show fs0. Note that WireGuard uses UDP, so there is no "active" connection; a handshake only shows up when the tunnel is actually being used.
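
For example, from s1 (a quick check, with the addresses from the example configuration above):

ping -c 3 10.0.0.2   # test the tunnel towards s2
wg show fs0          # shows the latest handshake and transfer counters per peer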

If a ping to a virtual address fails, check the configuration on both ends: if the configuration is incorrect on one end, communication fails in both directions.

Set up the GlusterFS server pool

Let's define some easy-to-use host names, and mark on each server which entry is the local host. The name 'virtual' has no special meaning; it is just an example to indicate that these are not actual public DNS records.

/etc/hosts:
10.0.0.1 virtual.s1.example.org virtual.localhost.example.org
10.0.0.2 virtual.s2.example.org
10.0.0.3 virtual.s3.example.org
(...)

Now make sure you can ping these virtual hostnames. The /etc/hosts file is used as long as files is listed on the hosts: line in /etc/nsswitch.conf.
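
A quick way to check (getent resolves names through the same NSS configuration that applications use):

getent hosts virtual.s2.example.org   # should print 10.0.0.2
ping -c 1 virtual.s2.example.org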

Set up the GlusterFS pool of peers. A probe automatically ensures that the peers connect to each other in both directions, so the commands below only need to be executed on one of the servers. It does not hurt to execute them more than once, or to probe the local host itself (this is detected automatically).

gluster peer probe virtual.s1.example.org
gluster peer probe virtual.s2.example.org
gluster peer probe virtual.s3.example.org
(...)
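
To verify that all peers joined the pool (a quick check; the output varies per server):

gluster peer status   # connection state of the other peers
gluster pool list     # all peers in the pool, including the localhost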

Create the virtual GlusterFS volume

Now let's create the distributed replicated volume. If you only want a volume that spans multiple disks/servers without any replication, just leave out the replica argument and its replication count. It is advised to use a replication count of at least 3, to avoid split-brain situations (for robust partition tolerance). The force argument at the end is needed here because the brick directories (/srv/example/...) are located on the root partition, which GlusterFS only accepts when forced. The volume must be started before the daemons can be enabled and before it can be mounted. Furthermore, the self-healing and bitrot detection daemons are enabled for automatic redundancy and improved consistency.

gluster volume create example replica 3 \
    virtual.s1.example.org:/srv/example/s1 \
    virtual.s2.example.org:/srv/example/s2 \
    virtual.s3.example.org:/srv/example/s3 \
    force
gluster volume start example
gluster volume heal example enable
gluster volume bitrot example enable
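
To check the result:

gluster volume info example     # volume type, brick list and options
gluster volume status example   # running brick and daemon processes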

replica 3 enables replication for every group of three servers in the complete list of servers. Say the replication count is set to 3 (the recommended minimum) and 12 servers are listed: the volume is then distributed over 4 groups of 3 servers (the order of listing matters). So if every server provides 1TB of storage, the usable capacity of the volume is 4TB, although the actually consumed storage is 12TB, because every 1TB is replicated 3 times. If the number of servers is not a multiple of the replication count, you can add arbiters or thin-arbiters, which help to decide on the majority in case of a partition, to avoid split-brain situations. A hypothetical six-server layout is sketched below.
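
For illustration (the extra server names s4-s6 are assumptions, not part of the setup above): six bricks with replica 3 form two distribution groups, (s1,s2,s3) and (s4,s5,s6), so the usable capacity is twice the size of a single brick.

gluster volume create example replica 3 \
    virtual.s1.example.org:/srv/example/s1 \
    virtual.s2.example.org:/srv/example/s2 \
    virtual.s3.example.org:/srv/example/s3 \
    virtual.s4.example.org:/srv/example/s4 \
    virtual.s5.example.org:/srv/example/s5 \
    virtual.s6.example.org:/srv/example/s6 \
    force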

Now set up the bind-address in the configuration file. This is an important step: otherwise the GlusterFS services are publicly reachable, because by default they listen on all interfaces instead of only the virtual VPN address. Verify which services are listening on which port and bind address using netstat -ntlepa | grep gluster.

/etc/glusterfs/glusterd.vol:
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option transport.socket.listen-port 24007
    option transport.socket.bind-address virtual.localhost.example.org
    option ping-timeout 0
    option event-threads 1
#   option lock-timer 180
    # Uncomment the following line, if the bind-address resolves to an IPv6 address:
    # option transport.address-family inet6
#   option base-port 49152
    option max-port 60999
end-volume

When glusterd.service is restarted, some glusterfsd processes may linger. These must be killed before glusterd is started again, otherwise the mount will fail. To ensure that the forked processes are also killed upon (re)starting glusterd, see this GitHub issue and the override below.

/etc/systemd/system/glusterd.service.d/override.conf:
[Service]
KillMode=control-group

Run systemctl daemon-reload to apply this file before restarting the glusterd service.
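
A minimal sketch of that sequence:

systemctl daemon-reload             # pick up the override.conf drop-in
systemctl restart glusterd.service  # lingering glusterfsd processes in the control group are now killed as well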

Mount the virtual GlusterFS volume

To access the distributed virtual volume manually on each server:

mkdir /mnt/example
mount -t glusterfs virtual.localhost.example.org:example /mnt/example
# To use virtual IPv6-addresses add: -o xlator-option=transport.address-family=inet6
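
To confirm the volume is mounted:

df -hT /mnt/example         # the filesystem type should show fuse.glusterfs
mount | grep /mnt/example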

Or mount it automatically at boot, using a systemd mount unit:

/etc/systemd/system/mnt-example.mount:
[Unit]
Description = Mount the virtual volume by glusterd on /mnt/example
Requires = glusterd.service network-online.target
Wants = network-online.target
Conflicts = rescue.target rescue.service shutdown.target
After = glusterd.service

[Mount]
Type = glusterfs
What = virtual.localhost.example.org:example
Where = /mnt/example
# Note: don't use quotes in the Options=
# To use virtual IPv6-addresses add: xlator-option=transport.address-family=inet6
#Options = rw,default_permissions,defaults,_netdev,allow_other,loglevel=WARNING,max_read=131072,backup-volfile-servers=virtual.s1.example.org:virtual.s2.example.org:virtual.s3.example.org,...

[Install]
WantedBy = multi-user.target
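
A minimal sketch of enabling the unit (the unit name mnt-example.mount is dictated by the /mnt/example mount point):

systemctl daemon-reload
systemctl enable --now mnt-example.mount   # mount now, and automatically on every boot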

About using IPv6-addresses with GlusterFS

In my experience, while using virtual IPv6-addresses in the WireGuard VPN, most things just work as long as transport.address-family=inet6 is set, both in the server volume configuration and as a mount option. However, daemons like self-heal and bitrot don't seem to start, due to an error while resolving the address. Looking at the code repository, there are a lot of issues relating to IPv6 functionality, which was only added later on through multiple separate patches and bug-fixes (in 2019-2021). My recommendation is to avoid IPv6, since it is not necessarily a stable feature yet (as of 2023).

RPC port 111

GlusterFS makes use of an RPC portmapper listening on port 111. If this port is publicly accessible, it may be utilized by malicious actors to amplify a DDoS attack. Therefore, rpcbind should only listen on local interfaces and/or the virtual private network addresses that belong to the host. Create the following file on the server with virtual IP-address 10.0.0.x:

/etc/systemd/system/rpcbind.socket.d/override.conf:
[Socket]
# By default rpcbind listens on all interfaces, which is a security risk as amplification for DDoS attacks
# Changes to ListenStream= or ListenDatagram= require a system reboot
# The empty assignments clear the listeners inherited from the distribution's rpcbind.socket,
# then the local unix socket and the restricted addresses are added back:
ListenStream=
ListenDatagram=
ListenStream=/run/rpcbind.sock
ListenStream=127.0.0.1:111
ListenDatagram=127.0.0.1:111
ListenStream=10.0.0.x:111
ListenDatagram=10.0.0.x:111

So on server 3, substitute 10.0.0.x with 10.0.0.3.

Since PID 1 (systemd) listens on the port, the system must be rebooted in order to apply these changes.
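
After the reboot, a quick way to confirm that rpcbind only listens on the intended addresses (ss is part of iproute2):

ss -tulpn | grep ':111'   # should only show 127.0.0.1:111 and the virtual 10.0.0.x:111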

Troubleshooting GlusterFS

GlusterFS is fairly low-level software, written in C. Some error messages can safely be ignored because they relate to unused functionality, and they might be misleading. Furthermore, errors that occur usually refer to the log files. GlusterFS stores its logs as follows.

/var/log/glusterfs/glusterd.log:
Main service daemon log.
/var/log/glusterfs/glustershd.log:
Self-heal daemon log.
/var/log/glusterfs/mnt-example.log:
FUSE-mount log. One dynamic log-file per mount location.
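
When debugging an issue, following the relevant logs often points at the cause, for example:

tail -f /var/log/glusterfs/glusterd.log /var/log/glusterfs/glustershd.log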