diff options
author | David Ahern <dsahern@gmail.com> | 2018-12-07 12:24:57 -0800 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2018-12-07 16:03:10 -0800 |
commit | 58956317c8de52009d1a38a721474c24aef74fe7 (patch) | |
tree | 1d83aeed3288088a4f5b8ef1c0263908571bb1c4 /Documentation/networking/ip-sysctl.txt | |
parent | 12edfdfc79860f42fa493f81518e040376b6a5bc (diff) | |
download | linux-58956317c8de52009d1a38a721474c24aef74fe7.tar.gz linux-58956317c8de52009d1a38a721474c24aef74fe7.tar.bz2 linux-58956317c8de52009d1a38a721474c24aef74fe7.zip |
neighbor: Improve garbage collection
The existing garbage collection algorithm has a number of problems:
1. The gc algorithm will not evict PERMANENT entries as those entries
are managed by userspace, yet the existing algorithm walks the entire
hash table which means it always considers PERMANENT entries when
looking for entries to evict. In some use cases (e.g., EVPN) there
can be tens of thousands of PERMANENT entries leading to wasted
CPU cycles when gc kicks in. As an example, with 32k permanent
entries, neigh_alloc has been observed taking more than 4 msec per
invocation.
2. Currently, when the number of neighbor entries hits gc_thresh2 and
the last flush for the table was more than 5 seconds ago gc kicks in
walks the entire hash table evicting *all* entries not in PERMANENT
or REACHABLE state and not marked as externally learned. There is no
discriminator on when the neigh entry was created or if it just moved
from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).
It is possible for entries to be created or for established neighbor
entries to be moved to STALE (e.g., an external node sends an ARP
request) right before the 5 second window lapses:
-----|---------x|----------|-----
t-5 t t+5
If that happens those entries are evicted during gc causing unnecessary
thrashing on neighbor entries and userspace caches trying to track them.
Further, this contradicts the description of gc_thresh2 which says
"Entries older than 5 seconds will be cleared".
One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
whole point of having separate thresholds.
3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
when gc_thresh2 is exceeded is over kill and contributes to trashing
especially during startup.
This patch addresses these problems as follows:
1. Use of a separate list_head to track entries that can be garbage
collected along with a separate counter. PERMANENT entries are not
added to this list.
The gc_thresh parameters are only compared to the new counter, not the
total entries in the table. The forced_gc function is updated to only
walk this new gc_list looking for entries to evict.
2. Entries are added to the list head at the tail and removed from the
front.
3. Entries are only evicted if they were last updated more than 5 seconds
ago, adhering to the original intent of gc_thresh2.
4. Forced gc is stopped once the number of gc_entries drops below
gc_thresh2.
5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
when allocating a new neighbor for a PERMANENT entry. By extension this
means there are no explicit limits on the number of PERMANENT entries
that can be created, but this is no different than FIB entries or FDB
entries.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking/ip-sysctl.txt')
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 4 |
1 files changed, 2 insertions, 2 deletions
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index af2a69439b93..acdfb5d2bcaa 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER Default: 512 neigh/default/gc_thresh3 - INTEGER - Maximum number of neighbor entries allowed. Increase this - when using large numbers of interfaces and when communicating + Maximum number of non-PERMANENT neighbor entries allowed. Increase + this when using large numbers of interfaces and when communicating with large numbers of directly-connected peers. Default: 1024 |