The flow manager scans the hash table in chunks based on the flow timeout
settings. In the default config this will lead to a full hash pass every
240 seconds. Under pressure, this leads to a large amount of memory
being held by flows waiting to be evicted, or by evicted flows waiting
to be freed.
This patch adds new adaptive logic that governs the timing and the
amount of work done by the flow manager. It calculates what proportion
of each memcap budget is in use, takes the maximum in-use percentage,
and adapts the flow manager behavior based on that.
The memcaps considered are:
flow, stream, stream-reassembly and app-layer-http
The in-use percentage is inversely applied to the time the flow
manager takes for a full hash pass. In addition, it is also applied to
the chunk size and the sleep time.
Example: tcp.reassembly_memuse is at 90% of its memcap and the normal
flow hash pass is 240s. The hash pass time will then be:
240 * (100 - 90) / 100 = 24s
Chunk size and sleep time will automatically be updated for this.
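A minimal sketch of this scaling, with a hypothetical helper name
(AdaptPassTime is illustrative, not the actual function):

    #include <stdint.h>

    /* scale the full hash pass time by the unused share of the most
     * used memcap; AdaptPassTime(240, 90) == 24 */
    static uint32_t AdaptPassTime(uint32_t base_pass_s, uint32_t max_inuse_pct)
    {
        if (max_inuse_pct > 100)
            max_inuse_pct = 100;
        uint32_t pass_s = base_pass_s * (100 - max_inuse_pct) / 100;
        return pass_s > 0 ? pass_s : 1; /* never drop to 0 */
    }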
Adds various counters.
Bug: #4650.
Bug: #4808.
The current code doesn't cover all rows when more than one flow manager
is used. It orphans a single row between one manager's ftd->max and the
next manager's ftd->min. As an example:

  hash_size      = 1000
  flowmgr_number = 3
  range          = 333

  instance  ftd->min  ftd->max
         0         0       333
         1       334       666
         2       667      1000

Rows not covered: 333, 666
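One way to close the gap is to hand out half-open ranges [min, max)
derived from the instance number. A sketch with illustrative names:

    #include <stdint.h>

    /* divide hash_size rows over 'total' instances without gaps; the
     * last instance also picks up the remainder rows */
    static void FlowMgrSetRange(uint32_t instance, uint32_t total,
            uint32_t hash_size, uint32_t *min, uint32_t *max)
    {
        const uint32_t range = hash_size / total;
        *min = instance * range;
        *max = (instance == total - 1) ? hash_size : *min + range;
    }
    /* hash_size=1000, total=3 gives [0,333) [333,666) [666,1000) */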
Flows have been shown to linger for a long time without giving up
their resources. This would lead to higher memory use and memcaps
getting reached.
Three main causes have been identified:
Slow hash passes. By default the flow manager will scan the flow hash
slowly. The pace is based on the flow timeout settings, and with the
default config it will take 4 minutes for a full scan to complete.
This leaves a window in which timed-out flows linger for minutes
longer than expected.
Flow Manager yields under pressure. The per-row TryLock causes work to
be delayed further. The flow manager will use trylock on a hash row
and will yield immediately if the row is busy. This means that a full
pass will go by before the row is revisited. If the row holds busy
flows, this can happen many times in a row.
Flow Manager favors evicted flows over active flows. If evicted flows
(flows aged out by the workers) are present on a hash row, the flow
manager will process only those. The active flows on that row have to
wait until the next hash pass, by which time there may be more evicted
flows again.
Combined, these factors could keep flows from being considered for
freeing and logging for a very long time, potentially even
indefinitely.
The patch addresses the latter two flow manager issues by no longer
using TryLock. It will now simply wait for the lock to be released and
then do its work on it. Additionally for each row both the evicted list
and the active flow list will be processed.
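In outline, the per-row handling after the patch (helper names are
illustrative, not the exact code):

    /* block on the row lock instead of trylock + yield, then walk
     * both the evicted list and the active list */
    FBLOCK_LOCK(fb);
    if (fb->evicted != NULL)
        ProcessEvictedFlows(fb, ts);   /* illustrative helper */
    for (Flow *f = fb->head; f != NULL; ) {
        Flow *next = f->next;
        CheckFlowTimeout(f, ts);       /* illustrative helper */
        f = next;
    }
    FBLOCK_UNLOCK(fb);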
Bug: #4650.
Adds a container, i.e. a thread-safe hash table whose key is the
filename.
Keeps a tree of unordered ranges, up to a memcap limit.
Adds HTPFileOpenWithRange to handle ranges like HTPFileOpen does:
if there is a range, open 2 files, one for the whole reassembled file
and one only for the current range.
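The rough shape of the container, as an illustrative sketch rather
than the actual definitions:

    #include <stdint.h>

    /* out-of-order byte range kept until it connects with its
     * neighbors, all accounted against a memcap */
    typedef struct RangeBuffer {
        uint64_t start, end;
        uint8_t *data;
        struct RangeBuffer *next;
    } RangeBuffer;

    /* entry in the thread safe hash table, keyed by filename */
    typedef struct FileRangeEntry {
        char *filename;
        RangeBuffer *unordered;
    } FileRangeEntry;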
As the FlowBypassedTimeout function interacts with the capture method,
it is possible that its return value changes between the call that
triggered the timeout and the actual state (i.e. if packets arrive
between the two calls). So we should not use a call to
FlowBypassedTimeout in the assert.
Don't increment the flow timeout counter for flows that are not really
timed out (their use_cnt is non-zero). Also don't count bypassed flows
in the 'flow timeout in use' counter.
Previously the flow manager would share evicted flows with the workers
while keeping the flows mutex locked. This reduced the number of unlock/
lock cycles while there was guaranteed to be no contention.
This turns out to be undefined behavior: a lock is supposed to be
locked and unlocked from the same thread. FreeBSD appears to be
stricter about this than Linux.
This patch addresses the issue by unlocking before handing a flow off
to another thread, and locking again from the new thread.
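A sketch of the corrected handoff, with illustrative queue helper
names; the point is that a pthread mutex is unlocked by the same
thread that locked it:

    /* flow manager side: release the flow before handing it off */
    FLOWLOCK_UNLOCK(f);
    AppendFlowToWorkQueue(q, f);   /* illustrative helper */

    /* receiving thread: take the lock again before touching the flow */
    FLOWLOCK_WRLOCK(f);
    /* ... timeout/logging work ... */
    FLOWLOCK_UNLOCK(f);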
Issue was reported and largely analyzed by Bill Meeks.
Bug: #4478
Sleep 250 microseconds instead of 100, as running in KVM caused the
old value to use 100% CPU for these threads.
Perf testing suggests no measurable impact for the non-KVM case.
Ticket: #4096
Goals:
- reduce locking
- take advantage of 'hot' caches
- better locality
Locking reduction
New flow spare pool. The global pool is implemented as a list of
blocks, where each block holds 100 spare flows. Worker threads fetch a
block at a time, storing the block in local thread storage.
Flow Recycler now returns flows to the pool in blocks as well.
Flow Recycler fetches all flows to be processed in one step instead of
one at a time.
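The pool could be shaped roughly like this (illustrative sketch):

    /* global pool: a locked list of blocks; each block is an unlocked
     * private queue of ~100 flows that moves around as a unit */
    typedef struct FlowSparePool {
        FlowQueuePrivate queue;
        struct FlowSparePool *next;
    } FlowSparePool;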
Cache 'hot'ness
Worker threads now check the timeout of flows they evaluate during lookup.
The worker will have to read the flow into cache anyway, so the added
overhead of checking the timeout value is minimal. When a flow is considered
timed out, one of 2 things happens:
- if the flow is 'owned' by the thread it is handled locally. Handling means
checking if the flow needs 'timeout' work.
- otherwise, the flow is added to a special 'evicted' list in the flow
bucket where it will be picked up by the flow manager.
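As a sketch, with hypothetical helper and field names:

    /* the lookup already pulled the flow into cache, so the timeout
     * check is nearly free */
    if (FlowIsTimedOut(f, ts)) {
        if (f->thread_id == my_thread_id) {
            /* owned by this worker: do the timeout work locally */
        } else {
            /* park it on the bucket's evicted list for the manager */
            f->next = fb->evicted;
            fb->evicted = f;
        }
    }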
Flow Manager timing
By default the flow manager now tries to do passes of the flow hash in
smaller steps, where the goal is to do a full pass in 8 x the lowest
timeout value it has to enforce. So if the lowest timeout value is 30s,
a full pass will take 4 minutes. The goal here is to reduce locking
overhead and not get in the way of the workers.
In emergency mode each pass is full, and lower timeouts are used.
Timing of the flow manager also no longer relies on pthread condition
variables, as these generally wake up much quicker than the desired
timeout. Instead a simple (u)sleep loop is used.
Both changes reduce the number of hash passes a lot.
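The wait can be as simple as (illustrative):

    /* a short usleep loop wakes up close to the intended time, where
     * a pthread condition variable tends to fire early */
    uint64_t left_usec = sleep_usec;
    while (left_usec > 0 && !shutdown_requested) {
        usleep(250);
        left_usec -= (left_usec > 250) ? 250 : left_usec;
    }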
Emergency behavior
In emergency mode there are a number of changes to the workers. In
this scenario the flow memcap is fully used up and it is unavoidable
that some flows won't be tracked.
1. flow spare pool fetches are reduced to once a second. This avoids
   locking overhead while the chance of success is very low anyway.
2. getting an active flow directly from the hash skips flows that had
   very recent activity, to avoid the scenario where all flows get
   only into the NEW state before getting reused. Instead, allow some
   flows a chance of completing.
3. TCP packets that are not SYN packets will not get a used flow,
   unless stream.midstream is enabled (see the sketch below). The goal
   here is again to avoid evicting active flows unnecessarily.
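Point 3, sketched with generic names (simplified; the real check
lives in the flow lookup path):

    /* don't hand a used flow to mid-session TCP packets unless
     * midstream pickups are enabled */
    if (proto == IPPROTO_TCP && (tcp_flags & TH_SYN) == 0 &&
            midstream_enabled == false) {
        return NULL;   /* keep active flows alive */
    }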
Better Locality
Flow Manager now injects flows into the worker threads, instead of one
or two packets. The advantage is that the worker threads can get
packets from their local packet pools, avoiding the constant overhead
of packets returning to 'foreign' pools.
Counters
A lot of flow counters have been added and some have been renamed.
Overall the worker threads increment 'flow.wrk.*' counters, while the flow
manager increments 'flow.mgr.*'.
Additionally, none of the counters are snapshots anymore, they all increment
over time. The flow.memuse and flow.spare counters are exceptions.
Misc
FlowQueue has been split into a FlowQueuePrivate (unlocked) and
FlowQueue. Flow no longer has 'prev' pointers and uses a unified
'next' pointer for both hash and queue use.
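Illustrative shape of the linkage change:

    /* one pointer serves both uses: a flow sits either in a hash
     * bucket chain or in a (private) queue, never in both at once */
    typedef struct Flow_ {
        /* ... */
        struct Flow_ *next;   /* hash or queue linkage */
    } Flow;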
Call Defrag and others only once per second. The Flow Manager may wake
up (much) more often when the flow engine is under resource pressure.
As this does not affect Defrag and the others, waking them equally
often only adds unnecessary load.
When the flow engine enters emergency mode, 3 things happen:
1. a different set of (lower) timeout values are applied
2. the flow manager runs more often
3. worker threads go get a flow directly from the hash table
Testing showed that performance went down significantly due to concurrency
issues:
1. worker threads would fight each other over the hash access
2. flow manager would get in the way of workers
This patch changes the behavior in 2 ways:
1. it makes the flow manager slightly less aggressive. It will still
   try to run ~3 times per second, but no longer 10 times. This should
   reduce the contention. At the same time, flows won't time out
   faster just because they are checked many times per second.
2. The 'get a used flow' logic optimizes the use of atomics by doing
   an atomic operation only once, reserving a slice of the hash per
   worker as it does so (sketched below).
   The worker will also give up much quicker, to avoid the overhead of
   hash walking and of taking and releasing locks.
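The slice reservation of point 2, sketched in plain C11 atomics
rather than the engine's macros (names and sizes illustrative):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t prune_idx;   /* shared scan position */

    /* one fetch_add reserves a disjoint slice of rows per worker */
    const uint32_t slice = 100;
    uint32_t start = atomic_fetch_add(&prune_idx, slice) % hash_size;
    for (uint32_t i = 0; i < slice; i++) {
        uint32_t row = (start + i) % hash_size;
        /* trylock the row; on success look for an evictable flow,
         * otherwise move on and give up quickly */
    }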
These combined changes show much better 'under stress' behavior,
especially on multi-NUMA systems.
Until now both atomic ints and pointers were initialized by
SC_ATOMIC_INIT by setting them to 0. However, C11's atomic pointer
type cannot be initialized this way without causing compiler warnings.
As a preparation to supporting C11's atomics, this patch introduces a
new macro to initialize atomic pointers and updates the relevant callers
to use it.
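A sketch of how such a macro pair can look (the real definitions live
in util-atomic.h):

    /* integers keep initializing to 0; pointers get NULL so C11's
     * _Atomic pointer types don't trigger conversion warnings */
    #define SC_ATOMIC_INIT(name)    (name) = 0
    #define SC_ATOMIC_INITPTR(name) (name) = NULL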
Until this point the SC_ATOMIC_ADD macro pointed to an 'add_fetch'
intrinsic. This patch changes it to a 'fetch_add'.
There are 2 reasons for this:
1. C11's stdatomic.h has only 'atomic_fetch_add' and no 'add_fetch',
   so this patch prepares for adding support for C11 atomics.
2. It was not consistent with SC_ATOMIC_SUB, which did use 'fetch_sub'
and not 'sub_fetch'.
Most callers are not using the return value, so these are unaffected.
The callers that do use the return value are updated.
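The difference, in plain C11 atomics:

    #include <stdatomic.h>

    _Atomic int x = 5;
    int before = atomic_fetch_add(&x, 3); /* before == 5, x is now 8 */
    /* the old add_fetch behavior returned the new value instead:
     * __atomic_add_fetch(&y, 3, __ATOMIC_SEQ_CST) would yield 8 */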
This patch addresses two problems.
First, various parts of the engine, but most notably the flow manager
(FM), use the minimum of the packet threads' notion of time. This did
not, however, take into account the scenario where one or more of
these threads are inactive for prolonged periods. As a result, the
time used by the FM could go stale.
This is addressed by keeping track of the last time the per thread packet
timestamp was updated, and only considering it for the 'minimum' when it
is reasonably current.
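As a sketch, with illustrative field names:

    /* only fold a thread's packet timestamp into the minimum when it
     * was updated recently enough to be trusted */
    if (now - t->last_update <= STALE_LIMIT && t->pkt_ts < min_ts)
        min_ts = t->pkt_ts;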
Second, there was a minor race condition at start up, where the FM
would already inspect the hash table(s) while the packet threads
weren't active yet. Since the FM gets its time from the packet
threads, it would use a bogus time of 0.
This is addressed by adding a wait loop to the start of the FM that waits
for 'time' to get ready.
This define is used to remove references to capture bypass in case no
capture method implementing it is active.
This patch also introduces CAPTURE_OFFLOAD_MANAGER, which is defined
if we need the flow bypass manager code.
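Usage is plain conditional compilation (illustrative):

    #ifdef CAPTURE_OFFLOAD
        /* capture bypass handling compiled in */
    #endif
    #ifdef CAPTURE_OFFLOAD_MANAGER
        /* flow bypass manager code compiled in */
    #endif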
This patch introduces and uses a new bypass strategy based on a
callback. The eBPF bypass implementation is updated to use this new
strategy.
Once the flow manager detects that a flow should be timed out, it asks
the capture method whether it has seen packets in the interval. If so,
the lastts of the flow is updated and the timeout is postponed.
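The callback could have a shape like this (hypothetical signature,
not the exact one):

    /* returns true when the capture method saw packets for the flow
     * since the last check, refreshing f->lastts as a side effect */
    typedef bool (*BypassCheckFunc)(Flow *f, struct timeval *ts);

    /* flow manager side, when a bypassed flow reaches its timeout */
    if (check != NULL && check(f, &ts)) {
        /* lastts was refreshed: postpone the timeout */
    } else {
        /* really idle: evict the flow */
    }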