Block Storage - Rook with Ceph
Containerized storage on Kubernetes has always been troublesome. Containers tend to be transient and migratory - making it difficult to store information within or attached to the container. Rook solves this problem by tracking the storage requirements of each container and keeping the correct underlying storage attached to the underlying host node and container. When using Equinix Metal™, the underlying host node would be a physical Equinix Metal host.
While Rook can support many underlying storage technologies, this guide covers the most common use-case: Ceph. Ceph provides high resilience and performance by replicating data across multiple physical devices. Rook containerizes the various Ceph software components (MON, OSD, Web GUI, Toolbox) and runs them in a highly resilient manner on the Kubernetes cluster. A Ceph cluster on Equinix Metal consists of multiple Equinix Metal hosts providing the raw disk storage for Ceph to manage and provide as storage to the containerized applications.
This guide covers some of the typical use cases of Rook with Ceph on Equinix Metal, some best practices, and hardware recommendations. Building a Rook Cluster covers setting up a localhost as a cluster manager, and deploying the cluster on Equinix Metal. Setting up Rook with Ceph provides a summary of the Rook Quickstart Guide to help you get Rook and Ceph running on Equinix Metal.
The combination of Equinix Metal infrastructure, multiple hardware models across multiple data centers (and regions of the world), and Ceph configuration parameters allows a business to tailor the storage to its needs. Examples include replicating data to support a worldwide footprint, tuning the storage to meet a high IOPS requirement, or meeting a long-term storage requirement at an attractive price point. The right combination of Equinix Metal hardware and Ceph configuration makes each of these attainable.
Any high availability service needs to remove all single points of failure and replace them with multiple redundant components. RAID (Redundant Array of Inexpensive Disks) takes a notoriously unreliable hardware component (the hard disk drive) and introduces redundancy by eliminating the dependency on a single drive. Ceph takes this concept even further by extending the array across multiple servers. A Ceph cluster configured for high availability consists of multiple Equinix Metal servers configured such that if one or more physical servers fail, there is zero loss of data and minimal loss of performance.
Removing dependency on a single physical location (data center or cluster) can be accomplished by replicating data to a second site. Ceph provides the capability to mirror data to a second site, which could be a second Equinix Metal data center running a second Ceph/Rook/Kubernetes cluster. Such a configuration allows traffic to be geographically load-balanced between two sites in an active/active arrangement, or the second site to act as a redundant copy fed by one-way replication. Equinix Metal offers private backend networking between data center locations for just such replication.
Each Equinix Metal server is configured with SSD or NVMe storage, which offers some of the higher performance available from bare metal. Even higher throughput is obtainable with Ceph data striping and mirroring across multiple Equinix Metal servers. Rather than relying upon a single physical server, Ceph can replicate data across multiple servers to improve performance. Mirroring data across multiple servers can reduce the latency when retrieving data, thus improving service performance. When high performance is key, it is best to select a larger number of smaller servers for Ceph to mirror data across rather than a smaller number of larger servers.
Not all data needs to be stored on the fastest, most expensive storage media. Often, less expensive media will fit the bill. In such a case, multiple classes of service can be defined, each backed by the appropriate storage class (a Kubernetes StorageClass). Equinix Metal hosts come standard with SSD. Some systems have higher-performance NVMe, and select systems have economical HDD storage. Rook and Ceph can be configured with multiple storage classes mapping back to the underlying Equinix Metal hardware types.
Equinix Metal offers a wide variety of servers that can be used to run Rook with Ceph. There are a few considerations when selecting which Equinix Metal hardware to utilize and how best to configure the cluster:
- Equinix Metal Server Configuration and Considerations for Storage Nodes
  - Tag physical servers as storage nodes in Kubernetes for container affinity
  - Validate that servers have standalone (non-boot) storage disks available for Ceph
  - Disable software/hardware RAID
  - Configure drives as RAID 0/JBOD (Just a Bunch of Disks) to work with Ceph
  - Do not format or create a file system on drives intended for Ceph
  - Avoid using file-based storage for Ceph
- Kubernetes/Rook Configuration
  - Place an affinity to run Rook containers on your storage nodes
  - Avoid running application containers on the storage nodes
- Ceph Monitor (MON)
  - Utilize Ceph Monitoring to track the Ceph cluster configuration and state
  - Run a minimum of three MON containers to allow for graceful recovery
  - Rook runs each MON within its own container
  - Each MON should be on a different physical server
  - Provision adequate Equinix Metal servers to allow recovery MONs to start up
  - Validate that the third MON is isolated, ideally with no other containers on its physical server
  - Validate that each MON is on a different Equinix Metal rack/top-of-rack switch to handle network failovers/maintenance
  - Plan at least 4 GB RAM per MON
- Ceph Object Storage Daemon (OSD)
  - The OSD is responsible for the storage and transmission of data from a local storage device
  - Configure Equinix Metal hosts with 3 OSDs per SSD and 4 OSDs per NVMe
  - Plan at least 2 GB RAM per OSD
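The memory guidelines above (at least 4 GB per MON and 2 GB per OSD) reduce to simple arithmetic. The following is a minimal sketch; `rook_min_ram_gb` is a hypothetical helper name, and the result covers only the Ceph daemons, not the operating system, Kubernetes components, or other workloads on the node.

```shell
# Minimum RAM (GB) to reserve for Ceph daemons on one node,
# using the guide's figures: 4 GB per MON and 2 GB per OSD.
# rook_min_ram_gb is a hypothetical helper name, not part of Rook.
rook_min_ram_gb() {
  mons=$1
  osds=$2
  echo $(( 4 * mons + 2 * osds ))
}

# Example: a storage node running 1 MON and 8 OSDs
rook_min_ram_gb 1 8   # prints 20
```

Treat the number as a floor when choosing a server type, and leave headroom for everything else the node runs.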
Equinix Metal servers are available in a variety of configurations. For Rook with Ceph, the ideal systems have multiple storage devices (SSD or NVMe) for Ceph to stripe data across. Smaller systems without dedicated storage devices would be better suited for the failover Ceph Monitor process. The table below details the following:
- The total raw storage available for Rook (being the storage available minus any boot devices)
- The recommended number of OSDs (given the guidelines of 3 OSDs per SSD and 4 OSDs per NVMe device).
Some factors that you will need to consider are:
- Server Type - The Equinix Metal hardware type as described on the Equinix Metal website.
- Boot Drives - The quantity and type of boot (OS) drive. Boot drives are unsuitable for Ceph use.
- Storage Drives - The quantity and type of storage (non-boot) drives available for Ceph use.
- Total Storage - The total storage available for Ceph broken out by SSD, NVMe, and HDD.
- OSDs - The number of Object Storage Daemons recommended to run on this server type given the guidelines of 3 OSDs per SSD, 4 per NVMe, and 1 per HDD.
| Server Type | Boot Drives | Storage Drives | Total Storage | # OSDs |
|---|---|---|---|---|
| c3.small.x86 | 480 GB SSD | 480 GB SSD | 480 GB SSD | 3 |
| c3.medium.x86 | 240 GB SSD | 240 GB SSD, 2 x 480 GB SSD | 1200 GB SSD | 9 |
| m3.large.x86 | 2 x 240 GB SSD | 2 x 3.8 TB NVMe | 7.6 TB NVMe | 8 |
| s3.xlarge.x86 | 960 GB SSD | 960 GB SSD, 2 x 240 GB NVMe, 12 x 8 TB HDD | 960 GB SSD, 480 GB NVMe, 96 TB HDD | 23 |
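The OSD column in the table follows directly from the per-device ratios (3 per SSD, 4 per NVMe, 1 per HDD, counting storage drives only). As a sketch, with `recommended_osds` being a hypothetical helper name, the table values can be reproduced like this:

```shell
# Recommended OSD count for a server, using the guide's ratios:
# 3 OSDs per SSD, 4 per NVMe, 1 per HDD (storage drives only, boot drives excluded).
# recommended_osds is a hypothetical helper name.
recommended_osds() {
  ssd=$1
  nvme=$2
  hdd=$3
  echo $(( 3 * ssd + 4 * nvme + hdd ))
}

recommended_osds 1 0 0    # c3.small.x86  -> 3
recommended_osds 3 0 0    # c3.medium.x86 -> 9
recommended_osds 0 2 0    # m3.large.x86  -> 8
recommended_osds 1 2 12   # s3.xlarge.x86 -> 23
```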
Using the devices listed above, the following sample cluster configuration is given as an example and covers all the best practices described above.
- Server Type - The Equinix Metal hardware type as described on the Equinix Metal website.
- Use - The purpose of the server be it for Storage or Monitor (Rook with Ceph) or general Kubernetes Workload (non-Rook)
- OSDs - The number of Object Storage Daemons to run on the device
- MONs - The number of Ceph Monitors to run on the device
- Label - The Kubernetes label to use for affinity. Rook containers are to run on those devices labeled as Storage.
| # | Server Type | Use | # OSDs | # MONs | Label |
|---|---|---|---|---|---|
| 1 | n2.xlarge.x86 | Storage and Monitor | 7 | 1 | Storage |
| 2 | n2.xlarge.x86 | Storage and Monitor | 7 | 1 | Storage |
This example runs multiple Ceph Monitors (three) with one on a dedicated server. This third monitor runs on a smaller device without any storage (OSDs). Failover of any monitors would be to the devices 3-5 before running on the general Kubernetes Workload nodes.
While the majority of the storage nodes are n2.xlarge.x86 configured with NVMe, there is one s3.xlarge.x86 hosting twelve 8 TB spinning HDDs. These HDDs can be configured as a separate class of service for nearline, lower-cost, long-term storage. With 12 drives, Ceph can set up adequate redundancy to handle a drive failure, but the single server remains a single point of failure, a trade-off that keeps costs down. Alternatively, a second s3.xlarge.x86 can be set up to mirror the drive pool at a second location.
Equinix Metal offers the ability to scale up and scale down the quantity and size of servers that make up a cluster. Should the storage space requirements go up, additional Equinix Metal servers can be spun up and added to the Rook cluster. If the storage performance requirements change, the bare metal make-up of the cluster may need to be changed so additional servers can be spun up to distribute the storage workload.
Since Rook runs on Kubernetes, the Rook containers will dynamically redistribute as new servers come online and join the Kubernetes cluster (provided they are tagged accordingly within Kubernetes allowing Rook to use those nodes). Rook dynamically examines all drives on a new server and will set up any unused (unformatted) drive as Ceph storage and start up the appropriate number of OSDs. Conversely, when a server is taken offline, Rook will redistribute the containers across the remaining servers. Take care to always replicate data and keep an appropriate number of MONs running before deprovisioning any bare metal servers.
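Because Rook only claims drives that carry no file system, it can help to check a new server before it joins the cluster. The sketch below filters `lsblk` output for whole disks with no file system signature and no partitions. It assumes an `lsblk` (util-linux) that supports the `-P` output format and the `PKNAME` column; `unused_disks` is a hypothetical helper that approximates, and does not reproduce, Rook's own device discovery.

```shell
# List whole disks that have no file system signature and no partitions.
# Reads `lsblk -P -o NAME,TYPE,FSTYPE,PKNAME` output on stdin.
# unused_disks is a hypothetical helper; it approximates, and does not
# reproduce, the device discovery that Rook performs.
unused_disks() {
  awk '
    {
      name = ""; type = ""; fstype = ""; parent = ""
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        gsub(/"/, "", kv[2])
        if (kv[1] == "NAME")   name   = kv[2]
        if (kv[1] == "TYPE")   type   = kv[2]
        if (kv[1] == "FSTYPE") fstype = kv[2]
        if (kv[1] == "PKNAME") parent = kv[2]
      }
      # Remember every device that has a child (for example a partition).
      if (parent != "") has_child[parent] = 1
      if (type == "disk" && fstype == "") candidates[++n] = name
    }
    END {
      for (i = 1; i <= n; i++)
        if (!(candidates[i] in has_child)) print candidates[i]
    }
  '
}
```

On a storage node it would be used as `lsblk -P -o NAME,TYPE,FSTYPE,PKNAME | unused_disks`. The boot drive is excluded because it has partitions, and any drive that already holds a file system is excluded because its FSTYPE is not empty.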
Rook can be configured to automatically start up the desired number of MONs and OSDs per device (SSD, NVMe, or HDD) through the Ceph Cluster YAML configuration file. The best practices for bare metal infrastructure are described below.
Use a highly-available Ceph Monitor (MON) configuration.
- Deploy a minimum of three MONs across different bare metal nodes for proper failover.
    # set the amount of mons to be started
    mon:
      count: 3
      allowMultiplePerNode: false
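The reason for running three MONs is that Ceph monitors need a strict majority to form a quorum. A short sketch of that arithmetic follows; both helper names are hypothetical.

```shell
# Ceph monitors need a strict majority to form a quorum.
# mon_quorum: smallest majority of N monitors.
# mon_failures_tolerated: monitors that can be lost while a quorum remains.
# Both helper names are hypothetical.
mon_quorum() {
  echo $(( $1 / 2 + 1 ))
}
mon_failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

mon_quorum 3               # prints 2
mon_failures_tolerated 3   # prints 1
```

With three MONs, one can fail and the remaining two still form a majority. A fourth MON raises the quorum to three without tolerating any additional failure, which is why odd counts are the usual choice.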
Enable the Ceph Dashboard.
    dashboard:
      enabled: true
      ssl: true
Utilize the Equinix Metal private backend network for Ceph.
- Optional: Utilize the backend private network that Equinix Metal provides for each Project to isolate the storage traffic.
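    network:
      # enable host networking
      # (Rook's documented provider values include "host" and "multus";
      # confirm the value accepted by your Rook version)
      provider: private-backend-provider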
Erase Disks after Use.
- Optional: Erase bare metal drives before deallocating the hardware back to Equinix Metal.
    cleanupPolicy:
      # sanitizeDisks represents settings for sanitizing OSD disks on cluster deletion
      sanitizeDisks:
        # method indicates if the entire disk should be sanitized or simply ceph's metadata
        # in both cases, re-install is possible
        # possible choices are 'complete' or 'quick' (default)
        method: complete
        # dataSource indicates where to get random bytes from to write on the disk
        # possible choices are 'zero' (default) or 'random'
        # using random sources will consume entropy from the system and will take much more time than the zero source
        dataSource: random
        # iteration overwrites N times instead of the default (1)
        # takes an integer value
        iteration: 3
Place Rook containers on Rook specific bare metal nodes.
- This should be in place for all Rook containers (MONs and OSDs)
    placement:
      all:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: role
                    operator: In
                    values:
                      - storage-node
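The affinity rule above only matches nodes that carry the label `role=storage-node`, so the storage nodes need that label before Rook can schedule onto them. A minimal sketch, assuming `kubectl` is configured for the cluster; `label_storage_nodes` and the `KUBECTL` override are hypothetical conveniences, not part of Rook or Kubernetes.

```shell
# Apply the label that the nodeAffinity rule (key "role", value "storage-node") matches.
# label_storage_nodes is a hypothetical helper; KUBECTL can be overridden,
# for example KUBECTL=echo to preview the commands without touching the cluster.
label_storage_nodes() {
  for node in "$@"; do
    ${KUBECTL:-kubectl} label node "$node" role=storage-node --overwrite
  done
}
```

For example, `label_storage_nodes storage-1 storage-2` (hypothetical node names) labels two nodes, and `kubectl get nodes -l role=storage-node` lists the nodes that now match.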
Follow OSD Guidelines.
- Three OSDs per SSD, four OSDs per NVMe, and one OSD per HDD
- Only use Rook designated bare metal nodes
    storage: # cluster level storage configuration and selection
      useAllNodes: false # only use the Rook specific bare metal nodes
      useAllDevices: true
      #deviceFilter:
      config:
        osdsPerDevice: "3" # majority of Equinix Metal devices are SSD so use 3 OSDs
        # encryptedDevice: "true" # the default value for this option is "false"
      nodes:
        - name: "10.20.30.40"
          devices: # specific devices to use for storage can be specified for each node
            - name: "sdb"
              config:
                osdsPerDevice: "1" # for HDDs use 1 OSD
            - name: "nvme01"
              config:
                osdsPerDevice: "4" # for NVMe use 4 OSDs
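When filling in the per-device `osdsPerDevice` overrides, the device type can be read from the node itself. The sketch below assumes the usual Linux conventions: NVMe devices are named `nvme*`, and `/sys/block/<device>/queue/rotational` contains `1` for spinning disks. `osds_per_device` is a hypothetical helper name.

```shell
# Suggest an osdsPerDevice value from a device name and its rotational flag
# (the content of /sys/block/<device>/queue/rotational: 1 = spinning disk).
# Ratios follow the guide: NVMe 4, SSD 3, HDD 1.
# osds_per_device is a hypothetical helper name.
osds_per_device() {
  name=$1
  rotational=$2
  case "$name" in
    nvme*) echo 4 ;;
    *)
      if [ "$rotational" = "1" ]; then echo 1; else echo 3; fi
      ;;
  esac
}

osds_per_device nvme0n1 0   # prints 4
osds_per_device sdb 1       # prints 1
osds_per_device sdc 0       # prints 3
```

On a storage node, the flag can be passed in directly: `osds_per_device sdb "$(cat /sys/block/sdb/queue/rotational)"`.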
This workflow takes you through creating your first Rook cluster on Equinix Metal. This includes deploying the Equinix Metal server and network infrastructure (via Terraform) and a Kubernetes cluster (via Kubespray). Rook is then installed on the deployed cluster.
There are many ways to deploy a Kubernetes cluster on Equinix Metal and the example below using Kubespray is just one. Perhaps you're deploying Equinix Metal infrastructure via the Equinix Metal Web App and then deploying Kubernetes by hand. If so, take some time to review the Equinix Metal and Ceph hardware recommendations above and then deploy the Equinix Metal infrastructure and set up the cluster as you wish. You can then directly set up Rook.
Setting up this cluster requires an account with Equinix Metal as well as a localhost to run the provisioning tools. In this example we use CentOS 7 and recommend you use a similar flavor to follow along.
Create a new account here (or log in to an existing one).
Create a project and give it a name (e.g. Rook). Then either generate new SSH keys or add an existing key to your Equinix Metal profile. These keys will need to be available on the localhost.
On the metal.equinix.com website, after signing in and selecting your project, go to Project Settings and COPY the project ID to your clipboard. Write it down, as you will need it later in this setup.
On the metal.equinix.com website, after signing in, open the drop-down menu in the upper right corner and select API Keys. ADD a new read/write API key and COPY the API key to your clipboard.
You'll need a localhost set up to run Terraform and Ansible as part of the Kubespray installation. This could be a laptop/desktop, a virtual machine, or an Equinix Metal server running Linux. You'll need to have this environment up and running before proceeding. The smallest Equinix Metal host will likely be sufficient. In the walkthrough below, we used a CentOS 7 virtual machine. The following packages will need to be installed for Kubespray to run.
    sudo yum install epel-release
    sudo yum install ansible
    sudo yum install git
    sudo yum install python-pip
Save the API key generated above as an environment variable for Terraform to use later. Save this into ~/.bashrc for later logins.
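A sketch of that step follows. The variable name `PACKET_AUTH_TOKEN` is an assumption based on the Packet Terraform provider that the Kubespray `contrib/terraform/packet` configuration relies on; check the README of your Kubespray version, since newer tooling may expect a different name. The token value shown is a placeholder.

```shell
# Assumption: the Packet Terraform provider reads the API key from PACKET_AUTH_TOKEN.
# Replace the placeholder with the API key copied from the Equinix Metal console.
export PACKET_AUTH_TOKEN="replace-with-your-api-key"

# Persist the variable for later logins, without adding a duplicate line.
rc_file="${HOME}/.bashrc"
grep -q '^export PACKET_AUTH_TOKEN=' "$rc_file" 2>/dev/null ||
  printf 'export PACKET_AUTH_TOKEN="%s"\n' "$PACKET_AUTH_TOKEN" >> "$rc_file"
```

Because the key grants read/write access to the project, keep `~/.bashrc` readable only by your own user.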
SSH Key Setup
An SSH key is needed by Kubespray and Ansible to connect (SSH) to the deployed Equinix Metal hosts and run the playbooks. Equinix Metal will install a key on your behalf on newly provisioned bare metal hosts as part of the Terraform deployment. Create a new key on the localhost. You can find documentation on how to generate SSH keys here.
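One way to create the key is sketched below. `ensure_ssh_key` is a hypothetical helper, and the empty passphrase is an assumption made so that Ansible can connect unattended; use a passphrase together with `ssh-agent` if your security policy requires one.

```shell
# Create an RSA key pair only if the private key file does not exist yet.
# ensure_ssh_key is a hypothetical helper. The empty passphrase (-N "") is an
# assumption made so Ansible can connect unattended; use a passphrase with
# ssh-agent if your security policy requires one.
ensure_ssh_key() {
  key_file=$1
  if [ -f "$key_file" ]; then
    echo "key already present: $key_file"
  else
    ssh-keygen -t rsa -b 4096 -N "" -f "$key_file"
  fi
}
```

Calling `ensure_ssh_key "$HOME/.ssh/id_rsa"` leaves an existing key untouched; otherwise it writes the private key to that path and the public key next to it with a `.pub` suffix.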
Once you've generated the public key ~/.ssh/id_rsa.pub, save it into your Equinix Metal project. On the metal.equinix.com website, open the drop-down menu in the upper right corner and select SSH Keys. ADD a new SSH key and paste the public key.
Terraform is required to deploy the bare metal infrastructure. The steps below are for installing on CentOS 7. More Terraform installation options are available.
At this point the necessary installer software components have been downloaded and we can proceed with the configuration.
The Cluster Definition
In this example, a new cluster called "alpha" will be created.
    cp -LRp contrib/terraform/packet/sample-inventory inventory/alpha
    cd inventory/alpha/
    ln -s ../../contrib/terraform/packet/hosts
Update the clusters file with your packet_project_id (saved above) and change the stock Equinix Metal device type as you see fit.
    # Your Equinix Metal project ID. See https://support.equinixmetal.com/
    packet_project_id = "your_packet_project"
Deploying the Bare Metal Hosts
Initialize and then run Terraform in order to deploy the hardware. The deployed machines can now be verified via the Equinix Metal website.
With the bare metal infrastructure deployed, Kubespray can now install Kubernetes and set up the cluster.
With the hardware deployed and a Kubernetes cluster up and running, you can proceed to setting up Rook and Ceph. The Quickstart guide in the Rook documentation will take you through the process.
Start with Deploying the Rook Operator and Creating a Rook Ceph Cluster. Keep in mind that depending on the Equinix Metal device type that was selected to deploy, the number of OSDs will be different. By default, Rook will deploy one OSD for each unformatted device.
The Rook providers have created a toolbox container that contains the full suite of Ceph clients for debugging and troubleshooting your Rook cluster. The toolbox readme contains the setup and usage information. Also see the advanced configuration document for helpful maintenance and tuning examples.
Ceph has a dashboard in which you can view the status of your cluster. Enabling the dashboard, getting the login information, and making it accessible outside your cluster is covered on the Ceph Dashboard page.
The Rook documentation covers how to use Block Storage. It includes a tutorial for creating an application that runs on Kubernetes and uses the storage volumes enabled by Rook.
See the Rook website for detailed information about the other ways to use Rook.
See the Equinix Metal Website for detailed information about the other ways to use Equinix Metal including
Rook on Bare Metal Workshop