I'm about to share a lengthy tale that begins with Apache OpenDAL™ op.read() and concludes with an unexpected twist. This journey was quite enlightening for me, and I hope it will be for you too. I'll do my best to recreate the experience, complete with the lessons I've learned along the way. Let's dive in!

All the code snippets and scripts are available in the Xuanwo/when-i-find-rust-is-slow repository.

TL;DR

Jump to Conclusion if you want to know the answer ASAP.

OpenDAL Python Binding is slower than Python?

OpenDAL is a data access layer that allows users to easily and efficiently retrieve data from various storage services in a unified way. We provide a Python binding for OpenDAL via PyO3.

One day, @beldathas reported a case to me on Discord: OpenDAL's Python binding is slower than plain Python:

import pathlib
import timeit

import opendal

root = pathlib.Path(__file__).parent
op = opendal.Operator("fs", root=str(root))
filename = "lorem_ipsum_150mb.txt"

def read_file_with_opendal() -> bytes:
    with op.open(filename, "rb") as fp:
        result = fp.read()
    return result

def read_file_with_normal() -> bytes:
    with open(root / filename, "rb") as fp:
        result = fp.read()
    return result

if __name__ == "__main__":
    print("normal: ", timeit.timeit(read_file_with_normal, number=100))
    print("opendal: ", timeit.timeit(read_file_with_opendal, number=100))

The results show that:

(venv) $ python benchmark.py
normal:  4.470868484000675
opendal:  8.993250704006641

Well, well, well. I'm somewhat embarrassed by these results. Here are a few quick hypotheses:

  • Does Python have an internal cache that can reuse the same memory?
  • Does Python possess some trick to accelerate file reading?
  • Does PyO3 introduce additional overhead?

To test these, I refactored the code into two minimal scripts:

python-fs-read:

with open("/tmp/file", "rb") as fp:
    result = fp.read()
assert len(result) == 64 * 1024 * 1024

python-opendal-read:

import opendal

op = opendal.Operator("fs", root=str("/tmp"))

result = op.read("file")
assert len(result) == 64 * 1024 * 1024

The results show that Python is much faster than OpenDAL:

Benchmark 1: python-fs-read/test.py
  Time (mean ± σ):      15.9 ms ±   0.7 ms    [User: 5.6 ms, System: 10.1 ms]
  Range (min … max):    14.9 ms …  21.6 ms    180 runs

Benchmark 2: python-opendal-read/test.py
  Time (mean ± σ):      32.9 ms ±   1.3 ms    [User: 6.1 ms, System: 26.6 ms]
  Range (min … max):    31.4 ms …  42.6 ms    85 runs

Summary
  python-fs-read/test.py ran
    2.07 ± 0.12 times faster than python-opendal-read/test.py

The Python binding for OpenDAL seems to be slower than Python itself, which isn't great news. Let's investigate the reasons behind this.

OpenDAL Fs Service is slower than Python?

This puzzle involves many moving parts: Rust, OpenDAL, Python, PyO3, and so on. Let's narrow the scope and try to identify the root cause.

I implemented the same logic with OpenDAL's fs service in Rust:

rust-opendal-fs-read:

use std::io::Read;
use opendal::services::Fs;
use opendal::Operator;

fn main() {
    let mut cfg = Fs::default();
    cfg.root("/tmp");
    let op = Operator::new(cfg).unwrap().finish().blocking();

    let mut bs = vec![0; 64 * 1024 * 1024];

    let mut f = op.reader("file").unwrap();
    let mut ts = 0;
    loop {
        let buf = &mut bs[ts..];
        let n = f.read(buf).unwrap();
        let n = n as usize;
        if n == 0 {
            break
        }
        ts += n;
    }

    assert_eq!(ts, 64 * 1024 * 1024);
}

However, the results show that OpenDAL is slower than Python even when implemented in Rust:

Benchmark 1: rust-opendal-fs-read/target/release/test
  Time (mean ± σ):      23.8 ms ±   2.0 ms    [User: 0.4 ms, System: 23.4 ms]
  Range (min … max):    21.8 ms …  34.6 ms    121 runs

Benchmark 2: python-fs-read/test.py
  Time (mean ± σ):      15.6 ms ±   0.8 ms    [User: 5.5 ms, System: 10.0 ms]
  Range (min … max):    14.4 ms …  20.8 ms    166 runs

Summary
  python-fs-read/test.py ran
    1.52 ± 0.15 times faster than rust-opendal-fs-read/target/release/test

While rust-opendal-fs-read performs slightly better than python-opendal-read, indicating room for improvement in the binding and PyO3, that isn't the core issue. We need to dig deeper.

Ah, so the OpenDAL fs service is slower than Python.

Rust std fs is slower than Python?

OpenDAL implements its fs service on top of std::fs. Could OpenDAL itself be adding extra overhead?

I implemented the same logic directly with Rust's std::fs:

rust-std-fs-read:

use std::io::Read;
use std::fs::OpenOptions;

fn main() {
    let mut bs = vec![0; 64 * 1024 * 1024];
    let mut f = OpenOptions::new().read(true).open("/tmp/file").unwrap();
    let mut ts = 0;
    loop {
        let buf = &mut bs[ts..];
        let n = f.read(buf).unwrap();
        let n = n as usize;
        if n == 0 {
            break
        }
        ts += n;
    }

    assert_eq!(ts, 64 * 1024 * 1024);
}

But....

Benchmark 1: rust-std-fs-read/target/release/test
  Time (mean ± σ):      23.1 ms ±   2.5 ms    [User: 0.3 ms, System: 22.8 ms]
  Range (min … max):    21.0 ms …  37.6 ms    124 runs

Benchmark 2: python-fs-read/test.py
  Time (mean ± σ):      15.2 ms ±   1.1 ms    [User: 5.4 ms, System: 9.7 ms]
  Range (min … max):    14.3 ms …  21.4 ms    178 runs

Summary
  python-fs-read/test.py ran
    1.52 ± 0.20 times faster than rust-std-fs-read/target/release/test

Wow, Rust's std fs is slower than Python? How can that be? No offense intended, but how is that possible?

Rust std fs is slower than Python? Really!?

I can't believe the results: Rust's std fs is surprisingly slower than Python.

I learned how to use strace for syscall analysis. strace is a Linux syscall tracer that lets us monitor syscalls and understand what a program is doing.

strace captures every syscall the program issues; our attention should be on the ones involving /tmp/file. Each line of strace output starts with the syscall name, followed by its input arguments and its return value.

For example:

openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3

This means we invoked the openat syscall with the arguments AT_FDCWD, "/tmp/file", and O_RDONLY|O_CLOEXEC. It returned 3, the file descriptor referenced by the subsequent syscalls.

Alright, we've mastered strace. Let's put it to use!

strace of rust-std-fs-read:

> strace ./rust-std-fs-read/target/release/test
...
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f290dd40000
openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3
read(3, "\tP\201A\225\366>\260\270R\365\313\220{E\372\274\6\35\"\353\204\220s\2|7C\205\265\6\263"..., 67108864) = 67108864
read(3, "", 0)                          = 0
close(3)                                = 0
munmap(0x7f290dd40000, 67112960)        = 0
...

strace of python-fs-read:

> strace ./python-fs-read/test.py
...
openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=67108864, ...}, AT_EMPTY_PATH) = 0
ioctl(3, TCGETS, 0x7ffe9f844ac0)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=67108864, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f13277ff000
read(3, "\tP\201A\225\366>\260\270R\365\313\220{E\372\274\6\35\"\353\204\220s\2|7C\205\265\6\263"..., 67108865) = 67108864
read(3, "", 1)                          = 0
close(3)                                = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f132be5c710}, {sa_handler=0x7f132c17ac36, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7f132be5c710}, 8) = 0
munmap(0x7f13277ff000, 67112960)        = 0
...

From the strace output, it's clear that python-fs-read makes more syscalls than rust-std-fs-read, and both use mmap. So why is Python faster than Rust?

Why are we using mmap here?

I initially believed mmap was solely for mapping files into memory, enabling file access through memory. However, mmap has other uses too: it's commonly used to allocate large regions of memory for applications.

This can be seen in the strace results:

mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f13277ff000

This syscall means the following (an equivalent direct call is sketched after the list):

  • NULL: the start address of the memory region to map. Passing NULL lets the OS pick a suitable address for us.
  • 67112960: the size of the memory region to map. We are allocating 64 MiB + 4 KiB here; the extra page stores the metadata of this memory region.
  • PROT_READ|PROT_WRITE: the memory region is readable and writable.
  • MAP_PRIVATE|MAP_ANONYMOUS:
    • MAP_PRIVATE means changes to this memory region are not visible to other processes mapping the same region and are not carried through to the underlying file (if any).
    • MAP_ANONYMOUS means we are allocating anonymous memory that is not backed by a file.
  • -1: the file descriptor of the file to map; -1 means we are not mapping a file.
  • 0: the offset within the file to map from; 0 here since we are not mapping a file.
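
Here is roughly what that call would look like if issued directly from C (a minimal sketch; our programs never call mmap themselves, glibc does it on their behalf):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* 64 MiB plus one extra page, i.e. the 67112960 bytes seen in strace. */
    size_t len = 64 * 1024 * 1024 + 4096;

    /* Anonymous private mapping: readable and writable, not backed by any file. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("mapped at %p\n", p);
    munmap(p, len);
    return 0;
}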

But I don't use mmap in my code.

The mmap syscall here is issued by glibc. When we request memory with malloc, glibc uses either the brk or the mmap syscall to satisfy the request, depending on its size. If the requested size is large enough (above the mmap threshold, 128 KiB by default), glibc uses mmap, which helps mitigate memory fragmentation.

By default, all Rust programs compiled for the x86_64-unknown-linux-gnu target use the malloc provided by glibc.
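
A quick way to see this in action is to run a tiny C program under strace (a sketch; 128 KiB is glibc's default threshold and may be tuned or adjusted dynamically):

#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Small request: typically served from the brk-managed heap. */
    char *small = malloc(4 * 1024);

    /* Large request, far above the mmap threshold:
       glibc forwards it to an anonymous private mmap. */
    char *large = malloc(64 * 1024 * 1024);

    /* Touch the memory so the allocations actually happen. */
    memset(small, 1, 4 * 1024);
    memset(large, 1, 64 * 1024 * 1024);

    free(large);
    free(small);
    return 0;
}

Running it with strace -e trace=brk,mmap shows the heap being managed via brk while the 64 MiB request goes straight to an anonymous mmap.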

Does Python use the same memory allocator as Rust?

Python, by default, uses pymalloc, a memory allocator optimized for small allocations. Python has three memory domains, each representing a different allocation strategy optimized for different purposes.

pymalloc has the following behavior:

Python has a pymalloc allocator optimized for small objects (smaller or equal to 512 bytes) with a short lifetime. It uses memory mappings called “arenas” with a fixed size of either 256 KiB on 32-bit platforms or 1 MiB on 64-bit platforms. It falls back to PyMem_RawMalloc() and PyMem_RawRealloc() for allocations larger than 512 bytes.

Rust is slower than Python with the default memory allocator?

I suspect that mmap is causing this issue. What would happen if I switched to jemalloc?

rust-std-fs-read-with-jemalloc:

use std::io::Read;
use std::fs::OpenOptions;

#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    let mut bs = vec![0; 64 * 1024 * 1024];
    let mut f = OpenOptions::new().read(true).open("/tmp/file").unwrap();
    let mut ts = 0;
    loop {
        let buf = &mut bs[ts..];
        let n = f.read(buf).unwrap();
        let n = n as usize;
        if n == 0 {
            break
        }
        ts += n;
    }

    assert_eq!(ts, 64 * 1024 * 1024);
}

Wooooooooooooooow?!

Benchmark 1: rust-std-fs-read-with-jemalloc/target/release/test
  Time (mean ± σ):       9.7 ms ±   0.6 ms    [User: 0.3 ms, System: 9.4 ms]
  Range (min … max):     9.0 ms …  12.4 ms    259 runs

Benchmark 2: python-fs-read/test.py
  Time (mean ± σ):      15.8 ms ±   0.9 ms    [User: 5.9 ms, System: 9.8 ms]
  Range (min … max):    15.0 ms …  21.8 ms    169 runs

Summary
  rust-std-fs-read-with-jemalloc/target/release/test ran
    1.64 ± 0.14 times faster than python-fs-read/test.py

What?! I understand that jemalloc is a capable memory allocator, but how can it be this good? This is baffling.

Rust is slower than Python only on my machine!

As more friends joined the discussion, we discovered that Rust runs slower than Python only on my machine.

My CPU:

> lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 9 5950X 16-Core Processor
    CPU family:          25
    Model:               33
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  53%
    CPU max MHz:         5083.3979
    CPU min MHz:         2200.0000
    BogoMIPS:            6787.49
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm con
                         stant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f
                         16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpex
                         t perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap
                          clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wb
                         noinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl
                         umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    64 MiB (2 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

My memory:

> sudo dmidecode --type memory
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0014, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: None
        Maximum Capacity: 64 GB
        Error Information Handle: 0x0013
        Number Of Devices: 4

Handle 0x001C, DMI type 17, 92 bytes
Memory Device
        Array Handle: 0x0014
        Error Information Handle: 0x001B
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL A
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 3200 MT/s
        Manufacturer: Unknown
        Serial Number: 04904740
        Asset Tag: Not Specified
        Part Number: LMKUFG68AHFHD-32A
        Rank: 2
        Configured Memory Speed: 3200 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 9, Hex 0xC8
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 16 GB
        Cache Size: None
        Logical Size: None

So I tried the following things:

Enable Mitigations

CPUs have numerous vulnerabilities that could expose private data to attackers, Spectre being one of the most notable. The Linux kernel has developed various mitigations against these vulnerabilities, and they are enabled by default. However, these mitigations impose additional system costs, so the kernel also offers a mitigations flag for users who wish to disable them.

I used to disable all mitigations with the following boot entry:

title Arch Linux
linux /vmlinuz-linux-zen
initrd /amd-ucode.img
initrd /initramfs-linux-zen.img
options root="PARTUUID=206e7750-2b89-419d-978e-db0068c79c52" rw mitigations=off

Enabling them again didn't change the results.

Tune Transparent Hugepage

Transparent Hugepage can significantly impact performance. Most modern distributions enable it by default.

> cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Switching to madvise or never changes the absolute numbers, but the relative ratio remains the same.
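
For reference, switching modes is just a write to that sysfs file, for example:

> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled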

Tune CPU Core Affinity

@Manjusaka guessed this might be related to CPU core spacing.

I tried using core_affinity to bind the process to a specific CPU, but the result was the same.
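
core_affinity is a Rust crate; the same pinning can be done from C with sched_setaffinity. A minimal sketch of the idea (not the exact code I ran):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Pin the current process to CPU 0 before running the benchmark body. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... read /tmp/file here, exactly as c-fs-read does ... */
    return 0;
}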

Measure syscall latency with eBPF

@Manjusaka also created an eBPF program for me to measure the latency of read syscalls. The findings indicate that Rust is slower than Python even at the syscall level.

There's another lengthy tale about this eBPF program that @Manjusaka should share in a post!

# python fs read
Process 57555 read file 8134049 ns
Process 57555 read file 942 ns

# rust std fs read
Process 57634 read file 24636975 ns
Process 57634 read file 1052 ns

Observation: on my machine, Rust runs slower than Python, and it doesn't appear to be a software issue.

C is slower than Python?

I'm quite puzzled and can't pinpoint the difference. I suspect it might have something to do with the CPU, but I'm unsure which aspect: cache? frequency? core spacing? core affinity? architecture?

Following suggestions from the Telegram group @rust_zh, I wrote a C version:

c-fs-read:

#include <stdio.h>
#include <stdlib.h>

#define FILE_SIZE 64 * 1024 * 1024  // 64 MiB

int main() {
    FILE *file;
    char *buffer;
    size_t result;

    file = fopen("/tmp/file", "rb");
    if (file == NULL) {
        fputs("Error opening file", stderr);
        return 1;
    }

    buffer = (char *)malloc(sizeof(char) * FILE_SIZE);
    if (buffer == NULL) {
        fputs("Memory error", stderr);
        fclose(file);
        return 2;
    }

    result = fread(buffer, 1, FILE_SIZE, file);
    if (result != FILE_SIZE) {
        fputs("Reading error", stderr);
        fclose(file);
        free(buffer);
        return 3;
    }

    fclose(file);
    free(buffer);

    return 0;
}

But...

Benchmark 1: c-fs-read/test
  Time (mean ± σ):      23.8 ms ±   0.9 ms    [User: 0.3 ms, System: 23.6 ms]
  Range (min … max):    23.0 ms …  27.1 ms    120 runs

Benchmark 2: python-fs-read/test.py
  Time (mean ± σ):      19.1 ms ±   0.3 ms    [User: 8.6 ms, System: 10.4 ms]
  Range (min … max):    18.6 ms …  20.6 ms    146 runs

Summary
  python-fs-read/test.py ran
    1.25 ± 0.05 times faster than c-fs-read/test

The C version is also slower than Python! Does Python have magic?

C is slower than Python without a specified offset!

At this point, @lilydjwg joined the discussion and noticed a difference between C and Python in the offset of the read buffer within the memory region.

strace -e raw=read,mmap ./program prints the raw, undecoded arguments of those syscalls: the pointer addresses.

strace for c-fs-read:

> strace -e raw=read,mmap ./c-fs-read/test
...
mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f96d1a18000
read(0x3, 0x7f96d1a18010, 0x4000000)    = 0x4000000
close(3)                                = 0

strace for python-fs-read:

> strace -e raw=read,mmap ./python-fs-read/test.py
...
mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f27dcfbe000
read(0x3, 0x7f27dcfbe030, 0x4000001)    = 0x4000000
read(0x3, 0x7f27e0fbe030, 0x1)          = 0
close(3)                                = 0

In c-fs-read, mmap returns 0x7f96d1a18000, but the read syscall uses 0x7f96d1a18010 as the destination address: an offset of 0x10. In python-fs-read, mmap returns 0x7f27dcfbe000, and the read syscall uses 0x7f27dcfbe030: an offset of 0x30.

So @lilydjwg tried calling read with the same offset:

:) ./bench c-fs-read c-fs-read-with-offset python-fs-read
['hyperfine', 'c-fs-read/test', 'c-fs-read-with-offset/test', 'python-fs-read/test.py']
Benchmark 1: c-fs-read/test
  Time (mean ± σ):      23.7 ms ±   0.8 ms    [User: 0.2 ms, System: 23.6 ms]
  Range (min … max):    23.0 ms …  25.5 ms    119 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: c-fs-read-with-offset/test
  Time (mean ± σ):       8.9 ms ±   0.4 ms    [User: 0.2 ms, System: 8.8 ms]
  Range (min … max):     8.3 ms …  10.6 ms    283 runs

Benchmark 3: python-fs-read/test.py
  Time (mean ± σ):      19.1 ms ±   0.3 ms    [User: 8.6 ms, System: 10.4 ms]
  Range (min … max):    18.6 ms …  20.0 ms    147 runs

Summary
  c-fs-read-with-offset/test ran
    2.15 ± 0.11 times faster than python-fs-read/test.py
    2.68 ± 0.16 times faster than c-fs-read/test

!!!

Applying an offset to the buffer in c-fs-read makes it faster, and it now outperforms Python! Additionally, we verified that the issue is reproducible on both the AMD Ryzen 9 5900X and the AMD Ryzen 7 5700X.
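
The source of c-fs-read-with-offset isn't reproduced in this post; the change is tiny. A minimal reconstruction of the idea (my sketch, assuming the same 0x30 offset that Python ends up with):

#include <stdio.h>
#include <stdlib.h>

#define FILE_SIZE (64 * 1024 * 1024) /* 64 MiB */
#define OFFSET 0x30                  /* mimic the offset CPython happens to use */

int main() {
    FILE *file = fopen("/tmp/file", "rb");
    if (file == NULL) {
        fputs("Error opening file", stderr);
        return 1;
    }

    /* Allocate a little extra room so the read can start OFFSET bytes
       into the buffer instead of right at the allocation start. */
    char *buffer = malloc(FILE_SIZE + OFFSET);
    if (buffer == NULL) {
        fputs("Memory error", stderr);
        fclose(file);
        return 2;
    }

    size_t result = fread(buffer + OFFSET, 1, FILE_SIZE, file);
    if (result != FILE_SIZE) {
        fputs("Reading error", stderr);
        fclose(file);
        free(buffer);
        return 3;
    }

    fclose(file);
    free(buffer);
    return 0;
}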

The new information led me to an existing report of a similar issue, Std::fs::read slow?. In that thread, @ambiso discovered that syscall performance is linked to the offset of the memory region. He noted that this CPU slows down when writing within the first 0x10 bytes of each page:

offset milliseconds
 ...
 14   130
 15   130
 16    46   <----- 0x10!
 17    48
 ...
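
To reproduce such a sweep yourself, a rough sketch of the measurement (my own reconstruction, not @ambiso's code) is to read /tmp/file into a page-aligned buffer shifted by each offset and print the elapsed time:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE (64 * 1024 * 1024)

/* Read the whole file into page_aligned + offset and return milliseconds. */
static double read_at_offset(char *page_aligned, size_t offset) {
    int fd = open("/tmp/file", O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    char *dst = page_aligned + offset;
    size_t done = 0;
    while (done < FILE_SIZE) {
        ssize_t n = read(fd, dst + done, FILE_SIZE - done);
        if (n <= 0) {
            break;
        }
        done += (size_t)n;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);
    return (end.tv_sec - start.tv_sec) * 1e3 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

int main(void) {
    /* Page-aligned buffer with room for the largest offset we test. */
    char *buf;
    if (posix_memalign((void **)&buf, 4096, FILE_SIZE + 64) != 0) {
        fputs("alloc failed", stderr);
        return 1;
    }
    for (size_t offset = 0; offset < 64; offset++) {
        printf("offset %2zu: %8.1f ms\n", offset, read_at_offset(buf, offset));
    }
    free(buf);
    return 0;
}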

AMD Ryzen 9 5900X is slow without a specified offset!

We've confirmed that this issue is related to the CPU, but we're still unsure of the underlying cause. @Manjusaka invited kernel developer @ryncsn to join the discussion.

He reproduced the same results with our c-fs-read and c-fs-read-with-offset demos on an AMD Ryzen 9 5900HX. He then profiled the two programs with perf.

Without offset:

perf stat -d -d -d --repeat 20 ./a.out
 Performance counter stats for './a.out' (20 runs):

             30.89 msec task-clock                       #    0.968 CPUs utilized               ( +-  1.35% )
                 0      context-switches                 #    0.000 /sec
                 0      cpu-migrations                   #    0.000 /sec
               598      page-faults                      #   19.362 K/sec                       ( +-  0.05% )
        90,321,344      cycles                           #    2.924 GHz                         ( +-  1.12% )  (40.76%)
           599,640      stalled-cycles-frontend          #    0.66% frontend cycles idle        ( +-  2.19% )  (42.11%)
           398,016      stalled-cycles-backend           #    0.44% backend cycles idle         ( +- 22.41% )  (41.88%)
        43,349,705      instructions                     #    0.48  insn per cycle
                                                  #    0.01  stalled cycles per insn     ( +-  1.32% )  (41.91%)
         7,526,819      branches                         #  243.701 M/sec                       ( +-  5.01% )  (41.22%)
            37,541      branch-misses                    #    0.50% of all branches             ( +-  4.62% )  (41.12%)
       127,845,213      L1-dcache-loads                  #    4.139 G/sec                       ( +-  1.14% )  (39.84%)
         3,172,628      L1-dcache-load-misses            #    2.48% of all L1-dcache accesses   ( +-  1.34% )  (38.46%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
           654,651      L1-icache-loads                  #   21.196 M/sec                       ( +-  1.71% )  (38.72%)
             2,828      L1-icache-load-misses            #    0.43% of all L1-icache accesses   ( +-  2.35% )  (38.67%)
            15,615      dTLB-loads                       #  505.578 K/sec                       ( +-  1.28% )  (38.82%)
            12,825      dTLB-load-misses                 #   82.13% of all dTLB cache accesses  ( +-  1.15% )  (38.88%)
                16      iTLB-loads                       #  518.043 /sec                        ( +- 27.06% )  (38.82%)
             2,202      iTLB-load-misses                 # 13762.50% of all iTLB cache accesses  ( +- 23.62% )  (39.38%)
         1,843,493      L1-dcache-prefetches             #   59.688 M/sec                       ( +-  3.36% )  (39.40%)
   <not supported>      L1-dcache-prefetch-misses

          0.031915 +- 0.000419 seconds time elapsed  ( +-  1.31% )

With offset:

perf stat -d -d -d --repeat 20 ./a.out
 Performance counter stats for './a.out' (20 runs):

             15.39 msec task-clock                       #    0.937 CPUs utilized               ( +-  3.24% )
                 1      context-switches                 #   64.972 /sec                        ( +- 17.62% )
                 0      cpu-migrations                   #    0.000 /sec
               598      page-faults                      #   38.854 K/sec                       ( +-  0.06% )
        41,239,117      cycles                           #    2.679 GHz                         ( +-  1.95% )  (40.68%)
           547,465      stalled-cycles-frontend          #    1.33% frontend cycles idle        ( +-  3.43% )  (40.60%)
           413,657      stalled-cycles-backend           #    1.00% backend cycles idle         ( +- 20.37% )  (40.50%)
        37,009,429      instructions                     #    0.90  insn per cycle
                                                  #    0.01  stalled cycles per insn     ( +-  3.13% )  (40.43%)
         5,410,381      branches                         #  351.526 M/sec                       ( +-  3.24% )  (39.80%)
            34,649      branch-misses                    #    0.64% of all branches             ( +-  4.04% )  (39.94%)
        13,965,813      L1-dcache-loads                  #  907.393 M/sec                       ( +-  3.37% )  (39.44%)
         3,623,350      L1-dcache-load-misses            #   25.94% of all L1-dcache accesses   ( +-  3.56% )  (39.52%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
           590,613      L1-icache-loads                  #   38.374 M/sec                       ( +-  3.39% )  (39.67%)
             1,995      L1-icache-load-misses            #    0.34% of all L1-icache accesses   ( +-  4.18% )  (39.67%)
            16,046      dTLB-loads                       #    1.043 M/sec                       ( +-  3.28% )  (39.78%)
            14,040      dTLB-load-misses                 #   87.50% of all dTLB cache accesses  ( +-  3.24% )  (39.78%)
                11      iTLB-loads                       #  714.697 /sec                        ( +- 29.56% )  (39.77%)
             3,657      iTLB-load-misses                 # 33245.45% of all iTLB cache accesses  ( +- 14.61% )  (40.30%)
           395,578      L1-dcache-prefetches             #   25.702 M/sec                       ( +-  3.34% )  (40.10%)
   <not supported>      L1-dcache-prefetch-misses

          0.016429 +- 0.000521 seconds time elapsed  ( +-  3.17% )

He found that the values of L1-dcache-prefetches and L1-dcache-loads differ dramatically between the two runs.

  • L1-dcache-prefetches counts prefetches into the CPU's L1 data cache.
  • L1-dcache-loads counts loads from the CPU's L1 data cache.

Without a specified offset, the CPU performs far more L1 data cache loads and prefetches, resulting in a longer syscall.

He then dug further into the hotspot assembly:

Samples: 15K of event 'cycles:P', Event count (approx.): 6078132137
  Children      Self  Command    Shared Object         Symbol
-   94.11%     0.00%  a.out      [kernel.vmlinux]      [k] entry_SYSCALL_64_after_hwframe
   - entry_SYSCALL_64_after_hwframe
      - 94.10% do_syscall_64
         - 86.66% __x64_sys_read
              ksys_read
            - vfs_read
               - 85.94% shmem_file_read_iter
                  - 77.17% copy_page_to_iter
                     - 75.80% _copy_to_iter
                        + 19.41% asm_exc_page_fault
                       0.71% __might_fault
                  + 4.87% shmem_get_folio_gfp
                    0.76% folio_mark_accessed
         + 4.38% __x64_sys_munmap
         + 1.02% 0xffffffffae6f6fe8
         + 0.79% __x64_sys_execve
         + 0.58% __x64_sys_mmap

Inside _copy_to_iter, the hot assembly is:

       │     copy_user_generic():
  2.19 │       mov    %rdx,%rcx
       │       mov    %r12,%rsi
 92.45 │       rep    movsb %ds:(%rsi),%es:(%rdi)
  0.49 │       nop
       │       nop
       │       nop

The key difference here is the performance of rep movsb.

AMD Ryzen 9 5900X is slow for FSRM!

At this point, a friend sent me a link to Terrible memcpy performance on Zen 3 when using rep movsb, which also points at rep movsb:

I've found this using a memcpy benchmark at https://github.com/ska-sa/katgpucbf/blob/69752be58fb8ab0668ada806e0fd809e782cc58b/scratch/memcpy_loop.cpp (compiled with the adjacent Makefile). To demonstrate the issue, run

./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0

This runs:

  • 2113-byte memory copies
  • 1,000,000 times per timing measurement
  • in memory allocated with mmap
  • with the source 0 bytes from the start of the page
  • with the destination 1 byte from the start of the page
  • on core 0.

It reports about 3.2 GB/s. Change the -b argument to 2111 and it reports over 100 GB/s. So the REP MOVSB case is about 30× slower!

FSRM, short for Fast Short REP MOV, is a feature originally introduced by Intel and recently adopted by AMD as well, designed to make rep movsb and rep movsd efficient even for short copies. CPUs that advertise this feature use the rep movsb based memcpy in glibc by default. (Note that fsrm appears in the lscpu flags output above.)

@ryncsn conducted further research and found that it's not actually related to L1 prefetches.

It seems that rep movsb performs poorly when DATA IS PAGE ALIGNED, and performs better when DATA IS NOT PAGE ALIGNED. This is very funny...

Conclusion

In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug. (I can finally get some sleep now.)

However, our users still have to live with this problem. Since features like FSRM are implemented in ucode, we have little choice but to wait for AMD's response. An alternative would be to avoid FSRM, or to provide a flag for disabling it. Rust developers might consider switching to jemallocator for improved performance, a move that is beneficial even without this AMD CPU bug.

Final Remarks

I spent nearly three days addressing this issue, which began with complaints from opendal users and eventually led me to the CPU's ucode.

This journey taught me a lot about strace, perf, and eBPF. It was my first time using eBPF for diagnostics. I also explored several unfruitful avenues, such as studying the implementation of Rust's std::fs and the details of CPython's read path. Initially, I hoped to resolve this at a higher level, but found it necessary to dig deeper.

A big thank you to everyone who contributed to finding the answer:

  • @beldathas from opendal's discord for identifying the problem.
  • The team at @datafuselabs for their insightful suggestions.
  • Our friends over at @rust_zh (a rust telegram group mainly in zh-Hans) for their advice and reproduction efforts.
  • @Manjusaka for reproducing the issue and using eBPF to investigate, which helped narrow the problem down to the syscall itself.
  • @lilydjwg for pinpointing the root cause: the 0x20 difference in the buffer's offset in memory
  • @ryncsn for his thorough analysis
  • And a friend who shared useful links about FSRM

Looking forward to our next journey!

Updates

This article, written on 2023-11-29, has gained widespread attention! To prevent any confusion among readers, I've decided not to alter the original content. Instead, I'll provide updates here to keep the information current and address frequently asked questions.

2023-12-01: Does AMD know about this bug?

TL;DR: Yes

To my knowledge, AMD has been aware of this bug since 2021. After the article was published, several readers forwarded the link to AMD, so I'm confident they're informed about it.

I firmly believe that AMD should take responsibility for this bug and address it in amd-ucode. However, unverified sources suggest that a fix via amd-ucode is unlikely (at least for Zen 3) due to limited patch space. If you have more information on this matter, please reach out to me.

Our only hope is to address this issue in glibc by disabling FSRM as necessary. Progress has been made on the glibc front: x86: Improve ERMS usage on Zen3. Stay tuned for updates.

2023-12-01: Is jemalloc or pymalloc faster than glibc's malloc?

TL;DR: No

Apologies for not detailing the connection between the memory allocator and this bug. This post may give the impression that jemalloc (or pymalloc, or mimalloc) is significantly faster than glibc's malloc, but that's not the case. The issue is not related to the memory allocator itself: jemalloc and pymalloc happen to outpace glibc's malloc here only because they end up with different offsets within the memory region.

Let's analyze the strace of rust-std-fs-read and rust-std-fs-read-with-jemalloc:

strace for rust-std-fs-read:

> strace -e raw=read,mmap ./rust-std-fs-read/target/release/test
...
mmap(0, 0x4001000, 0x3, 0x22, 0xffffffff, 0) = 0x7f39a6e49000
openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3
read(0x3, 0x7f39a6e49010, 0x4000000)    = 0x4000000
read(0x3, 0x7f39aae49010, 0)            = 0
close(3)                                = 0

strace for rust-std-fs-read-with-jemalloc:

> strace -e raw=read,mmap ./rust-std-fs-read-with-jemalloc/target/release/test
...
mmap(0, 0x200000, 0x3, 0x4022, 0xffffffff, 0) = 0x7f7a5a400000
mmap(0, 0x5000000, 0x3, 0x4022, 0xffffffff, 0) = 0x7f7a55400000
openat(AT_FDCWD, "/tmp/file", O_RDONLY|O_CLOEXEC) = 3
read(0x3, 0x7f7a55400740, 0x4000000)    = 0x4000000
read(0x3, 0x7f7a59400740, 0)            = 0
close(3)                                = 0

In rust-std-fs-read, mmap returns 0x7f39a6e49000, but the read syscall uses 0x7f39a6e49010 as the destination address: an offset of 0x10. In rust-std-fs-read-with-jemalloc, mmap returns 0x7f7a55400000, and the read syscall uses 0x7f7a55400740: an offset of 0x740.

rust-std-fs-read-with-jemalloc outperforms rust-std-fs-read because its larger offset falls outside the problematic range (0x00..0x10 within a page). It's still possible to reproduce the same issue with jemalloc if an allocation happens to land in that range.