# CephFS performance issues - problem description

## Summary
    
The University of Helsinki has a Ceph cluster that can be accessed in
four ways: CephFS, CephFS+NFS-Ganesha, RBD, and S3. The people using
CephFS have reported significant performance issues. The key detail is
that the tool they use for accessing their data seeks before reading.
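
For illustration, a minimal sketch of that access pattern (the file
path, offset, and read size below are made-up placeholders, not the
actual tool's behaviour):

```python
# Hypothetical illustration of "seek before reading": jump to an offset
# inside a large file on CephFS and read only a slice of it.
with open("/mnt/cephfs/dataset/sample.bin", "rb") as f:  # placeholder path
    f.seek(50 * 1024 * 1024)      # seek to an arbitrary offset, 50 MiB in
    chunk = f.read(1024 * 1024)   # read 1 MiB starting from that offset
```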
    
## Cluster description

* Ceph version 18.2.1 (reef)

* Ceph is managed by cephadm

* Hardware

  * 15 HPE ProLiant XL420 Gen10 Plus servers, each of which contains
    * 512 GiB of memory
    * 24 × 18 TiB 7200 RPM HDDs
    * 2 × 1.5 TiB NVMe drives
    * 1 × 3 TiB NVMe drive
    * 2 × 25 GbE network interfaces

* The Ganesha containers run on libvirt virtual machines hosted on
  ProLiant DL380 Gen10 Plus servers

* All nodes run RHEL 9.4
    
    
    
## The problem
    
The clients use a tool that reads a part of a file and processes it
further. When the tool issues a seek system call, CephFS performs very
poorly if the opened file is at least 100 MiB in size. The same
behaviour can be observed with tools like tac or ddrescue, both of
which can read files in reverse. The performance hit is huge: when a
100 MiB file is read from beginning to end, the process completes in
less than 10 seconds, including the time required to start the actual
processes, but when the same file is read from the end to the
beginning, it takes 10 minutes.
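
The behaviour can be reproduced without the original tool. Below is a
minimal sketch, assuming an arbitrary ~100 MiB test file and chunk
size, that times a sequential read against a tac/ddrescue-style
reverse read done with seek:

```python
import os
import time

PATH = "/mnt/cephfs/dataset/100MiB.bin"   # placeholder: any ~100 MiB file on CephFS
CHUNK = 128 * 1024                        # 128 KiB per read; size is arbitrary

def read_forward(path):
    # Plain sequential read from the beginning to the end.
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass

def read_backward(path):
    # Mimic tac/ddrescue: seek to successive chunk boundaries from the
    # end of the file towards the beginning and read each chunk.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        end = size
        while end > 0:
            begin = max(0, end - CHUNK)
            f.seek(begin)
            f.read(end - begin)
            end = begin

for name, fn in (("forward", read_forward), ("backward", read_backward)):
    t0 = time.monotonic()
    fn(PATH)
    print(f"{name}: {time.monotonic() - t0:.1f} s")
```

For a meaningful comparison the file should not already be in the
client's page cache when each pass starts.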
    
Initially, we thought the culprit was Ganesha, but we were wrong. When
we let the client hosts connect to the cluster (almost) directly, the
performance didn't improve. "Almost" here means that there is an
intermediate host that routes the traffic between the client and the
Ceph public network. If we run iperf between the client and a cluster
host, we get nearly line-speed performance.
    
The confusing thing is that if we set up a libvirt virtual machine
that can connect to the cluster's public network directly, we don't
see any performance hit.
    
    
``` mermaid
graph TB

  SubGraph1 --> Lab

  subgraph "Laboratory hosts"
    Lab(Analysis host)
    Lab -- sshfs --> DoChoice1
    Lab -- Choice2 --> DoChoice2
  end

  subgraph "Ceph cluster"
    Node1[Node 1] --> Node2[Node 2]
    Node2 --> SubGraph1[Jump to SubGraph1]
    SubGraph1 --> FinalThing[Final Thing]
  end
```