# CephFS performance issues - problem description

## Summary
    
The University of Helsinki has a Ceph cluster that can be accessed in
four ways: CephFS, CephFS+NFS-Ganesha, RBD, and S3. The people using
CephFS have reported significant performance issues. The key detail is
that the tool they use for accessing their data seeks before reading.
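
For illustration, a minimal sketch of that access pattern (the file
path, offset, and read size below are made-up placeholders, not the
actual tool's behaviour):

```python
# Hypothetical illustration of "seek before reading": jump to an offset
# inside a large file on CephFS and read only a slice of it.
with open("/mnt/cephfs/dataset/sample.bin", "rb") as f:  # placeholder path
    f.seek(50 * 1024 * 1024)      # seek to an arbitrary offset, 50 MiB in
    chunk = f.read(1024 * 1024)   # read 1 MiB starting from that offset
```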
    
## Cluster description

* Ceph version 18.2.1 (reef)

* Ceph is managed by cephadm

* Hardware

  * 15 HPE ProLiant XL420 Gen10 Plus servers, each of which contains
    * 512 GiB of memory
    * 24 × 18 TiB 7200 RPM HDDs
    * 2 × 1.5 TiB NVMe drives
    * 1 × 3 TiB NVMe drive
    * 2 × 25 GbE network interfaces

* The Ganesha containers run on libvirt virtual machines hosted on
  ProLiant DL380 Gen10 Plus servers

* All nodes run RHEL 9.4
    
    
    
## The problem
    
The clients use a tool that reads a part of a file and processes it
further. When the tool issues a seek system call, CephFS performs very
poorly if the opened file is at least 100 MiB in size. The same
behaviour can be observed with tools like tac or ddrescue, both of
which can read files in reverse. The performance hit is huge: when a
100 MiB file is read from beginning to end, the process completes in
less than 10 seconds, including the time required to start the actual
processes, but when the same file is read from the end to the
beginning, it takes 10 minutes.
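
The behaviour can be reproduced without the original tool. Below is a
minimal sketch, assuming an arbitrary ~100 MiB test file and chunk
size, that times a sequential read against a tac/ddrescue-style
reverse read done with seek:

```python
import os
import time

PATH = "/mnt/cephfs/dataset/100MiB.bin"   # placeholder: any ~100 MiB file on CephFS
CHUNK = 128 * 1024                        # 128 KiB per read; size is arbitrary

def read_forward(path):
    # Plain sequential read from the beginning to the end.
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass

def read_backward(path):
    # Mimic tac/ddrescue: seek to successive chunk boundaries from the
    # end of the file towards the beginning and read each chunk.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        end = size
        while end > 0:
            begin = max(0, end - CHUNK)
            f.seek(begin)
            f.read(end - begin)
            end = begin

for name, fn in (("forward", read_forward), ("backward", read_backward)):
    t0 = time.monotonic()
    fn(PATH)
    print(f"{name}: {time.monotonic() - t0:.1f} s")
```

For a meaningful comparison the file should not already be in the
client's page cache when each pass starts.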
    
Initially, we thought the culprit was Ganesha, but we were wrong. When
we let the client hosts connect to the cluster (almost) directly, the
performance didn't improve. "Almost" here means that there is an
intermediate host that routes the traffic between the client and the
Ceph public network. If we run iperf between the client and a cluster
host, we get nearly line-speed performance.
    
The confusing thing is that if we set up a libvirt virtual machine
that can connect to the cluster's public network directly, we don't
see any performance hit.
    
    
``` mermaid
graph TB

  SubGraph1 --> Lab

  subgraph "Laboratory hosts"
    Lab(Analysis host)
    Lab -- sshfs --> DoChoice1
    Lab -- Choice2 --> DoChoice2
  end

  subgraph "Ceph cluster"
    Node1[Node 1] --> Node2[Node 2]
    Node2 --> SubGraph1[Jump to SubGraph1]
    SubGraph1 --> FinalThing[Final Thing]
  end
```