fangpsh's blog

drop_caches

atemyram Don't Panic! Your ram is fine!

Troubleshooting

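On one machine the CPU iowait was very high. perf showed that most of the time was spent in the isolate_freepages_block function, iostat showed that the system disk's IOPS had also hit its ceiling, and free showed buff/cache taking up a large share of memory with almost nothing left free. A similar case reported by someone else: "Analyzing and resolving very high load on a Ceph node" (Ceph节点load很高问题的分析解决)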

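The program does a large volume of random reads and writes on the filesystem, so the page cache grows and memory gets tight. Pages are then continually evicted and read back in, the disk becomes the bottleneck, and the CPU stalls waiting on I/O.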
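The buff/cache figure that free reports can be reproduced from /proc/meminfo. Here is a minimal C sketch, assuming the procps-ng convention that buff/cache is Buffers + Cached + SReclaimable. The names cache_usage, parse_meminfo and buff_cache_kb are made up for this illustration and are not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Fields of /proc/meminfo relevant to the buff/cache column, in kB. */
struct cache_usage {
    long mem_free;
    long buffers;
    long cached;
    long s_reclaimable;
};

/* Parse meminfo-formatted text ("Name:   value kB", one field per line).
 * Fields that do not appear in the text stay at 0. */
static struct cache_usage parse_meminfo(const char *text)
{
    struct cache_usage u = {0, 0, 0, 0};
    const char *line = text;

    while (line && *line) {
        long v;

        /* The literal prefix must match from the start of the line,
         * so "SwapCached:" is not mistaken for "Cached:". */
        if (sscanf(line, "MemFree: %ld", &v) == 1)
            u.mem_free = v;
        else if (sscanf(line, "Buffers: %ld", &v) == 1)
            u.buffers = v;
        else if (sscanf(line, "Cached: %ld", &v) == 1)
            u.cached = v;
        else if (sscanf(line, "SReclaimable: %ld", &v) == 1)
            u.s_reclaimable = v;

        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return u;
}

/* Assumption: buff/cache = Buffers + Cached + SReclaimable. */
static long buff_cache_kb(const struct cache_usage *u)
{
    return u->buffers + u->cached + u->s_reclaimable;
}
```

To use it, read the contents of /proc/meminfo into a buffer, pass it to parse_meminfo, and compare buff_cache_kb against mem_free before and after a drop.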

As a temporary measure, you can drop the page cache:

echo 1 > /proc/sys/vm/drop_caches

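drop_caches only frees clean pages. To free the memory held by dirty pages as well, run sync first so they are written back, then drop.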

This is a non-destructive operation and will not free any dirty objects. To increase the number of objects freed by this operation, the user may run `sync' prior to writing to /proc/sys/vm/drop_caches. This will minimize the number of dirty objects on the system and create more candidates to be dropped.
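The sync-then-drop sequence can also be done from a program. This is a rough C sketch: sync_and_drop is a hypothetical helper name, and the path is a parameter so that the function can be exercised against an ordinary file. Writing to the real /proc/sys/vm/drop_caches requires root.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Flush dirty data to disk, then write `mode` (for example "1\n")
 * to the file at `path`.
 * Returns 0 on success and -1 on failure. */
static int sync_and_drop(const char *path, const char *mode)
{
    size_t len = strlen(mode);
    ssize_t n;
    int fd;

    /* Write dirty pages back first so they become clean and
     * are candidates for dropping. */
    sync();

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    n = write(fd, mode, len);
    if (close(fd) != 0)
        return -1;

    return (n == (ssize_t)len) ? 0 : -1;
}
```

On a real system the call would look like sync_and_drop("/proc/sys/vm/drop_caches", "1\n").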

A similar situation is high memory usage by reclaimable slab objects (dentries, inodes). For a walkthrough of the analysis, see: "Who ate my Linux memory?" (谁吃了我的Linux内存?)

echo 2 > /proc/sys/vm/drop_caches

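As for a real fix: move to a machine with more memory, or gradually tune parameters such as /proc/sys/vm/pagecache_limit* (as far as I know these are not in the mainline kernel and exist only on some distribution kernels, such as SUSE's), together with the writeback settings under /proc/sys/vm/dirty_*. I don't have much experience here, so this is feeling my way in the dark.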

To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
echo 3 > /proc/sys/vm/drop_caches

The default value is 0. Writing 1 drops the page cache, writing 2 drops reclaimable slab objects, and writing 3 drops both.


linux/fs/drop_caches.c

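Question 1: after the echo, the file keeps the value that was written. Does that mean the kernel keeps dropping caches continuously?

No. Look at the code: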

linux/fs/drop_caches.c

int drop_caches_sysctl_handler(struct ctl_table *table, int write,
        void __user *buffer, size_t *length, loff_t *ppos)
{
    int ret;

    ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
    if (ret)
        return ret;
    if (write) {
        static int stfu;

        if (sysctl_drop_caches & 1) {
            iterate_supers(drop_pagecache_sb, NULL);
            count_vm_event(DROP_PAGECACHE);
        }
        if (sysctl_drop_caches & 2) {
            drop_slab();
            count_vm_event(DROP_SLAB);
        }
        if (!stfu) {
            pr_info("%s (%d): drop_caches: %d\n",
                    current->comm, task_pid_nr(current),
                    sysctl_drop_caches);
        }
        stfu |= sysctl_drop_caches & 4;
    }
    return 0;
}

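The drop runs only when write is true, that is, once per write to the file. The stored value is just the last value written; it does not make the kernel keep dropping.

In addition, proc_dointvec_minmax() checks that the written value is within the allowed range. For drop_caches the allowed values are 1, 2, 3 and 4.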

Reads/writes up to table->maxlen/sizeof(unsigned int) integer values from/to the user buffer, treated as an ASCII string.
This routine will ensure the values are within the range specified by table->extra1 (min) and table->extra2 (max).
Returns 0 on success.

linux/kernel/sysctl.c

{
    .procname        = "drop_caches",
    .data            = &sysctl_drop_caches,
    .maxlen          = sizeof(int),
    .mode            = 0644,
    .proc_handler    = drop_caches_sysctl_handler,
    .extra1          = &one,
    .extra2          = &four,
},

Although the default value is 0, trying to write 0 back fails:

-> echo 0 > /proc/sys/vm/drop_caches
echo: write error: invalid argument

Note also that when sysctl_drop_caches is 3 (11b), the & with 1 and the & with 2 are both nonzero, so both the page cache and the slab objects are dropped.
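The range check and the bit tests can be mimicked in user space. This is only a simplified model of the handler logic quoted above, not kernel code: model_drop_caches_write and struct drop_result are hypothetical names, and the range of 1 to 4 is taken from the extra1/extra2 values shown in the sysctl table.

```c
#include <errno.h>

/* Outcome of one simulated write to drop_caches. */
struct drop_result {
    int err;             /* 0, or -EINVAL if the value is out of range */
    int drops_pagecache; /* nonzero if bit 0 (value & 1) is set */
    int drops_slab;      /* nonzero if bit 1 (value & 2) is set */
};

/* Model of a write: reject values outside [1, 4] as the min/max
 * check does, then test the two low bits like the handler. */
static struct drop_result model_drop_caches_write(int value)
{
    struct drop_result r = {0, 0, 0};

    if (value < 1 || value > 4) {
        r.err = -EINVAL;
        return r;
    }
    r.drops_pagecache = (value & 1) != 0;
    r.drops_slab = (value & 2) != 0;
    return r;
}
```

Under this model, 0 is rejected with -EINVAL, 3 sets both flags, and 4 is accepted but sets neither.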

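Question 2: what does writing 4 do?

4 (100b) ANDed with 1 and with 2 gives zero in both cases, so that write drops nothing. Its only effect is on logging.

stfu is a static variable. Once 4 has been written, stfu becomes 4 and if (!stfu) is false from then on. The write of 4 itself can still be logged, because the check runs before stfu is updated. This feels like a problem to me: after a single echo 4, stfu stays 4, and no later echo of 1, 2 or 3 can change it. Drops still work normally, but the pr_info statement never runs again, so nothing shows up in dmesg until the machine is rebooted.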

stfu |= sysctl_drop_caches & 4;
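The sticky behaviour of stfu can be reproduced with a small user-space model. This is a sketch rather than kernel code: model_write_logs is a hypothetical name, and the flag is passed in by pointer instead of being a function-local static so that the sequence is easy to replay.

```c
/* Model of the logging part of the handler.
 * Returns 1 if this write would reach pr_info, 0 if it is silenced.
 * The check happens before the flag is updated, as in the handler. */
static int model_write_logs(int *stfu, int value)
{
    int logged = !*stfu;

    /* Bit 2 (value 4) is OR-ed in and never cleared afterwards. */
    *stfu |= value & 4;
    return logged;
}
```

Replaying the writes 3, 4, 1, 3 against a flag that starts at 0 gives logged, logged, silenced, silenced under this model: the write of 4 still logs, and every write after it is silent.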

Miscellaneous

A question that is unrelated to the topic of this note, but also worth digging into:


kernel.org/doc/Documentation/sysctl/vm.txt

drop_caches

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes.  Once dropped, their
memory becomes free.

To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync' prior to writing to /proc/sys/vm/drop_caches.  This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc...)  These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems.  Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use.  Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used:

cat (1234): drop_caches: 3

These are informational only.  They do not mean that anything is wrong
with your system.  To disable them, echo 4 (bit 3) into drop_caches.