Goals:
- reduce locking
- take advantage of 'hot' caches
- better locality
Locking reduction
New flow spare pool. The global pool is implemented as a list of blocks,
where each block holds 100 spare flows. Worker threads fetch a block at
a time, storing the block in thread-local storage.
Flow Recycler now returns flows to the pool in blocks as well.
Flow Recycler fetches all flows to be processed in one step instead of
one at a time.
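A rough sketch of the block-based pool idea (the struct and function names
here are invented for illustration, not the actual Suricata identifiers):

    #include <pthread.h>
    #include <stddef.h>

    /* Illustrative spare flow placeholder. */
    typedef struct Flow_ { struct Flow_ *next; } Flow;

    /* One block holds a fixed number of pre-allocated spare flows. */
    #define SPARE_BLOCK_SIZE 100

    typedef struct SpareBlock_ {
        struct SpareBlock_ *next;   /* next block in the global list */
        Flow *flows;                /* singly linked list of spare flows */
        int cnt;                    /* number of flows in this block */
    } SpareBlock;

    /* Global pool: a locked list of blocks. Workers take a whole block at a
     * time, so the lock is hit once per ~100 flows instead of once per flow. */
    static struct {
        pthread_mutex_t m;
        SpareBlock *head;
    } g_spare_pool = { PTHREAD_MUTEX_INITIALIZER, NULL };

    /* Called by a worker when its thread-local block runs dry. */
    SpareBlock *SparePoolGetBlock(void)
    {
        pthread_mutex_lock(&g_spare_pool.m);
        SpareBlock *b = g_spare_pool.head;
        if (b != NULL)
            g_spare_pool.head = b->next;
        pthread_mutex_unlock(&g_spare_pool.m);
        return b;
    }

    /* Called by the flow recycler to return a whole block of freed flows. */
    void SparePoolReturnBlock(SpareBlock *b)
    {
        pthread_mutex_lock(&g_spare_pool.m);
        b->next = g_spare_pool.head;
        g_spare_pool.head = b;
        pthread_mutex_unlock(&g_spare_pool.m);
    }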
Cache 'hot'ness
Worker threads now check the timeout of flows they evaluate during lookup.
The worker will have to read the flow into cache anyway, so the added
overhead of checking the timeout value is minimal. When a flow is considered
timed out, one of two things happens (see the sketch after this list):
- if the flow is 'owned' by the thread it is handled locally. Handling means
checking if the flow needs 'timeout' work.
- otherwise, the flow is added to a special 'evicted' list in the flow
bucket where it will be picked up by the flow manager.
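A minimal sketch of the check, with simplified flow/bucket structs and a
hypothetical thread_id ownership field standing in for the real ones:

    #include <stdbool.h>
    #include <stddef.h>
    #include <time.h>

    /* Illustrative flow/bucket shapes, not Suricata's actual structs. */
    typedef struct Flow_ {
        struct Flow_ *next;
        time_t lastts;          /* time of last activity */
        int timeout;            /* timeout in seconds for the flow's state */
        int thread_id;          /* worker that 'owns' the flow */
    } Flow;

    typedef struct FlowBucket_ {
        Flow *head;             /* flows hashing to this bucket */
        Flow *evicted;          /* timed-out flows parked for the flow manager */
    } FlowBucket;

    /* The worker reads the flow into cache during lookup anyway, so this
     * extra check is nearly free. */
    static bool FlowIsTimedOut(const Flow *f, time_t now)
    {
        return (f->lastts + f->timeout) <= now;
    }

    /* What a worker could do for each flow it walks past in a bucket. */
    void WorkerCheckFlow(FlowBucket *fb, Flow *prev, Flow *f,
                         int my_thread_id, time_t now)
    {
        if (!FlowIsTimedOut(f, now))
            return;

        /* unlink from the bucket's active list */
        if (prev != NULL)
            prev->next = f->next;
        else
            fb->head = f->next;

        if (f->thread_id == my_thread_id) {
            /* owned by this worker: check locally if 'timeout' work is
             * needed, then recycle the flow */
        } else {
            /* park on the evicted list for the flow manager */
            f->next = fb->evicted;
            fb->evicted = f;
        }
    }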
Flow Manager timing
By default the flow manager now tries to do passes of the flow hash in
smaller steps, where the goal is to do a full pass in 8 x the lowest timeout
value it has to enforce. So if the lowest timeout value is 30s, a full pass
will take 4 minutes. The goal here is to reduce locking overhead and not
get in the way of the workers.
In emergency mode each pass is full, and lower timeouts are used.
Timing of the flow manager also no longer relies on pthread condition
variables, as these generally wake up much sooner than the desired
timeout. Instead a simple (u)sleep loop is used.
Both changes reduce the number of hash passes a lot.
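A rough sketch of the pacing, assuming a fixed wakeup interval; the constants
and function names are illustrative only:

    #include <stdint.h>
    #include <unistd.h>

    /* Spread a full hash pass over 8 x the lowest timeout: with a lowest
     * timeout of 30s a full pass takes 8 * 30 = 240s (4 minutes). */
    #define PASS_FACTOR 8

    /* How many hash rows to scan per wakeup, given the hash size, the
     * lowest timeout (seconds) and the wakeup interval (microseconds). */
    static uint32_t RowsPerWakeup(uint32_t hash_size, uint32_t min_timeout,
                                  uint32_t sleep_usec)
    {
        uint64_t pass_usec = (uint64_t)min_timeout * PASS_FACTOR * 1000000ULL;
        uint64_t wakeups = pass_usec / sleep_usec;
        if (wakeups == 0)
            wakeups = 1;
        uint32_t rows = (uint32_t)(hash_size / wakeups);
        return rows ? rows : 1;
    }

    /* Simple sleep loop instead of a pthread condition variable, which
     * tends to wake up sooner than the requested timeout. */
    void FlowManagerLoopSketch(volatile int *run, uint32_t hash_size,
                               uint32_t min_timeout)
    {
        const uint32_t sleep_usec = 250 * 1000; /* 250 ms */
        uint32_t rows = RowsPerWakeup(hash_size, min_timeout, sleep_usec);
        uint32_t next_row = 0;

        while (*run) {
            /* ... scan 'rows' rows starting at next_row ... */
            next_row = (next_row + rows) % hash_size;
            usleep(sleep_usec);
        }
    }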
Emergency behavior
In emergency mode there are a number of changes to the workers. In this scenario
the flow memcap is fully used up and it is unavoidable that some flows won't
be tracked.
1. flow spare pool fetches are reduced to at most once per second (see the
sketch after this list). This avoids locking overhead at a point where the
chance of success is very low anyway.
2. getting an active flow directly from the hash skips flows that had very
recent activity to avoid the scenario where all flows get only into the
NEW state before getting reused. Rather allow some to have a chance of
completing.
3. TCP packets that are not SYN packets will not get a used flow, unless
stream.midstream is enabled. The goal here is again to avoid evicting
active flows unnecessarily.
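A minimal sketch of the once-per-second spare pool fetch limit from point 1;
the per-worker state struct and function are invented for the example:

    #include <stdbool.h>
    #include <time.h>

    /* In emergency mode only try the (locked) global spare pool once per
     * second per worker; more frequent attempts mostly add lock contention
     * while the pool is likely empty anyway. */
    typedef struct WorkerFlowState_ {
        time_t last_spare_fetch;     /* last second we hit the global pool */
    } WorkerFlowState;

    bool WorkerMaySpareFetch(WorkerFlowState *ws, time_t now, bool emergency)
    {
        if (!emergency)
            return true;
        if (now == ws->last_spare_fetch)
            return false;            /* already tried this second */
        ws->last_spare_fetch = now;
        return true;
    }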
Better Locality
Flow Manager now injects flows into the worker threads, instead of injecting
one or two (pseudo) packets per flow. The advantage of this is that the worker
threads can get packets from their local packet pools, avoiding the constant
overhead of packets returning to 'foreign' pools.
Counters
A lot of flow counters have been added and some have been renamed.
Overall the worker threads increment 'flow.wrk.*' counters, while the flow
manager increments 'flow.mgr.*'.
Additionally, with the exception of flow.memuse and flow.spare, the counters
are no longer snapshots: they all increment over time.
Misc
FlowQueue has been split into a FlowQueuePrivate (unlocked) and FlowQueue.
Flow no longer has 'prev' pointers and uses a unified 'next' pointer for
both hash and queue use.
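Roughly, the split looks like this (simplified structs, not the exact Suricata
definitions):

    #include <pthread.h>
    #include <stddef.h>

    /* The flow uses a single 'next' pointer for both hash bucket chaining
     * and queue membership; it is never on the hash and a queue at once. */
    typedef struct Flow_ {
        struct Flow_ *next;
    } Flow;

    /* Unlocked queue: used within a single thread, or while some other
     * lock is already held. */
    typedef struct FlowQueuePrivate_ {
        Flow *top;
        Flow *bot;
        unsigned int len;
    } FlowQueuePrivate;

    /* Locked queue: a private queue plus a mutex for cross-thread use. */
    typedef struct FlowQueue_ {
        FlowQueuePrivate priv;
        pthread_mutex_t m;
    } FlowQueue;

    /* Appending a whole private queue takes the lock once per batch
     * instead of once per flow. */
    void FlowQueueAppendPrivate(FlowQueue *q, FlowQueuePrivate *fqp)
    {
        if (fqp->top == NULL)
            return;
        pthread_mutex_lock(&q->m);
        if (q->priv.bot != NULL)
            q->priv.bot->next = fqp->top;
        else
            q->priv.top = fqp->top;
        q->priv.bot = fqp->bot;
        q->priv.len += fqp->len;
        pthread_mutex_unlock(&q->m);
        fqp->top = fqp->bot = NULL;
        fqp->len = 0;
    }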
Describe Changes
- Added ability to recursively read pcap directories
- src/suricata.c: addition of new command line parameter
--pcap-file-recursive
- src/source-pcap-file.c: parsing of the command line argument
- src/source-pcap-file-directory-helper.h: two thread vars tracking
directory depth and whether to recurse
- src/util-error.c / src/util-error.h:
Added new warning code "SC_WARN_PATH_READ_ERROR"
- Redmine ticket: https://redmine.openinfosecfoundation.org/issues/2363
Ticket: #2363
Previously each 'TmSlot' had its own packet queue that was passed
to the registered SlotFunc as an argument. This was used mostly for
tunnel packets by the decoders and by defrag.
This patch removes that in favor of a single queue in the ThreadVars:
decode_pq. This is the non-locked version of the queue as this is
only a temporary store for handling packets within a thread.
This patch removes the PacketQueue pointer argument from the API.
The new queue can be accessed directly through the ThreadVars
pointer.
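A simplified sketch of what this looks like from a decoder's point of view;
the struct layouts and the enqueue helper are illustrative:

    #include <stddef.h>

    typedef struct Packet_ {
        struct Packet_ *next;
    } Packet;

    /* Non-locked queue: only ever touched by the owning thread. */
    typedef struct PacketQueueNoLock_ {
        Packet *top;
        Packet *bot;
        unsigned int len;
    } PacketQueueNoLock;

    typedef struct ThreadVars_ {
        PacketQueueNoLock decode_pq;   /* temporary store, e.g. tunnel/defrag */
    } ThreadVars;

    /* Instead of receiving a PacketQueue argument, a decoder enqueues any
     * extra packets (tunnel, defrag) on tv->decode_pq directly. */
    void DecodeEnqueueSketch(ThreadVars *tv, Packet *p)
    {
        p->next = NULL;
        if (tv->decode_pq.bot != NULL)
            tv->decode_pq.bot->next = p;
        else
            tv->decode_pq.top = p;
        tv->decode_pq.bot = p;
        tv->decode_pq.len++;
    }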
When a BPF filter is given on the command line while reading a pcap
file, the BPF filter is not honored.
The regression has been introduced in:
commit 3ab9120821
Author: Dana Helwig <dana.helwig@protectwise.com>
Date: Thu Apr 27 11:17:16 2017 -0600
source-pcap-file: Pcap Directory Mode (Feature #2222)
Reported-By: Tim Colin <tcolin@et.esiea.fr>
Better handle the case where the pcap file receive thread fails to initialize.
Allow initialization to complete, but terminate the thread quickly. Delay
exiting the unix socket runmode until as late as possible.
https://redmine.openinfosecfoundation.org/issues/2417
Add an option to have pcap files deleted after they have been processed.
This option combines well with pcap-file continuous mode and with files
being streamed into the directory that is being processed.
https://redmine.openinfosecfoundation.org/issues/2451
When a missing (or empty-named) file is passed to source-pcap-file while
using the unix socket, the pcap processing thread will incorrectly be stopped
and will no longer be available for subsequent files.
https://redmine.openinfosecfoundation.org/issues/2394
Certain combinations of the delay and poll interval parameters could cause
newly added files in a directory to be missed. Clean up how time is handled
for files in a directory and fix which time is used for future directory
traversals. Add a mutex to make sure processing time is not optimized
away.
On OpenBSD 6.0 and 6.1 the following pcap gets a datalink type of
101 instead of our defined DLT_RAW.
File type: Wireshark/tcpdump/... - pcap
File encapsulation: Raw IP
File timestamp precision: microseconds (6)
Packet size limit: file hdr: 262144 bytes
Number of packets: 23
File size: 11 kB
Data size: 11 kB
Capture duration: 7,424945 seconds
First packet time: 2017-05-25 21:59:31,957953
Last packet time: 2017-05-25 21:59:39,382898
Data byte rate: 1536 bytes/s
Data bit rate: 12 kbps
Average packet size: 496,00 bytes
Average packet rate: 3 packets/s
SHA1: 120cff9878b93ac74b68fb9216027bef3b3c018f
RIPEMD160: 35fa287bf30d8be8b8654abfe26e8d3883262e8e
MD5: 13fe4bc50fe09bdd38f07739bd1ff0f0
Strict time order: True
Number of interfaces in file: 1
Interface #0 info:
Encapsulation = Raw IP (7/101 - rawip)
Capture length = 262144
Time precision = microseconds (6)
Time ticks per second = 1000000
Number of stat entries = 0
Number of packets = 23
On Linux, DLT_RAW is 12.
On the tcpdump/libpcap site the DLT_RAW is defined as 101:
http://www.tcpdump.org/linktypes.html
Strangely, on OpenBSD the DLT_RAW macro itself is defined as 14, as expected.
So for some reason, libpcap on OpenBSD uses 101, which seems to match
the tcpdump/libpcap documentation.
So this patch adds support for datalink 101 as RAW.
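Conceptually the change boils down to treating 101 like DLT_RAW in the
datalink dispatch; the sketch below uses placeholder decoder names, not the
actual Suricata functions:

    /* Placeholder packet type for the sketch. */
    typedef struct Packet_ { int dummy; } Packet;

    #ifndef DLT_RAW
    #define DLT_RAW 12          /* 12 on Linux, 14 on OpenBSD */
    #endif
    #define LINKTYPE_RAW 101    /* value from the tcpdump/libpcap list */

    static int DecodeRawSketch(Packet *p) { (void)p; return 0; }

    int DecodeByDatalinkSketch(int datalink, Packet *p)
    {
        switch (datalink) {
            case DLT_RAW:        /* the OS-defined macro */
    #if DLT_RAW != LINKTYPE_RAW
            case LINKTYPE_RAW:   /* what libpcap on OpenBSD reports */
    #endif
                return DecodeRawSketch(p);
            default:
                return -1;       /* unsupported datalink type */
        }
    }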
Set flags by default:
-Wmissing-prototypes
-Wmissing-declarations
-Wstrict-prototypes
-Wwrite-strings
-Wcast-align
-Wbad-function-cast
-Wformat-security
-Wno-format-nonliteral
-Wmissing-format-attribute
-funsigned-char
Fix minor compiler warnings for these new flags on gcc and clang.
When we run on live traffic, time handling is simple. Packets have a
timestamp set by the capture method. Management threads can simply
use 'gettimeofday' to know the current time. There should never be
any serious gap between the two or major differences between the
threads.
In offline mode, things are dramatically different. Here we try to keep
the time from the pcap, which means that if the packets are recorded in
2011 the log output should also reflect this. Multiple issues:
1. merged pcaps might have huge time jumps or time going backward
2. slowly recorded pcaps may be processed much faster than their
'realtime'
3. management threads need a concept of what the 'current' time is for
enforcing timeouts
4. due to (1) individual threads may have very different views on what
the current time is. E.g. T1 processes packet 1 with TS X, while T2
at the very same time processes packet 2 with TS X+100000s.
The changes in flow handling make the problems worse. The capture thread
no longer handles the flow lookup, yet it was still setting the global 'time'.
This meant that a thread may be working on Packet 1 with TS 1, while the
capture thread already saw packet 2 with TS 10000. Management threads
would take TS 10000 as the 'current time', considering a flow created by
the first thread as timed out immediately.
This was less of a problem before the flow changes as the capture thread
would also create a flow reference for a packet, meaning the flow
couldn't time out as easily. Packets in the queues between capture
thread and workers would all hold such references.
The patch updates the time handling to be as follows.
In offline mode we keep the timestamp per thread. If a management thread
needs the current time, it will get the minimum of the threads' values. This
is to avoid the problem that T2's time value might already trigger a flow
timeout: a flow last updated 100000s earlier would almost certainly be
considered timed out.
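A minimal sketch of the 'minimum of the per-thread timestamps' idea, using an
invented registry instead of Suricata's real thread bookkeeping:

    #include <pthread.h>
    #include <stdbool.h>
    #include <sys/time.h>

    #define MAX_THREADS 64

    /* In offline mode each packet thread stores the timestamp of the packet
     * it is currently working on. */
    static struct {
        pthread_mutex_t m;
        struct timeval ts[MAX_THREADS];
        bool set[MAX_THREADS];
        int nthreads;
    } g_thread_time = { .m = PTHREAD_MUTEX_INITIALIZER };

    void ThreadsRegisterCount(int count)
    {
        g_thread_time.nthreads = count;
    }

    void ThreadSetTimestamp(int id, const struct timeval *ts)
    {
        pthread_mutex_lock(&g_thread_time.m);
        g_thread_time.ts[id] = *ts;
        g_thread_time.set[id] = true;
        pthread_mutex_unlock(&g_thread_time.m);
    }

    /* Management threads call this instead of gettimeofday() when running
     * offline, so they never run ahead of the slowest packet thread. */
    void ThreadsGetMinimalTimestamp(struct timeval *out)
    {
        struct timeval min = { 0, 0 };
        bool have = false;

        pthread_mutex_lock(&g_thread_time.m);
        for (int i = 0; i < g_thread_time.nthreads; i++) {
            if (!g_thread_time.set[i])
                continue;
            if (!have || timercmp(&g_thread_time.ts[i], &min, <)) {
                min = g_thread_time.ts[i];
                have = true;
            }
        }
        pthread_mutex_unlock(&g_thread_time.m);
        *out = min;
    }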
This patch adds a new callback PktAcqBreakLoop() in TmModule to let
packet acquisition modules define "break-loop" functions to terminate
the capture loop. This is useful in case of blocking functions that
need special actions to take place in order to stop the execution.
Implement this for PF_RING.
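A rough sketch of the callback shape; TmModule/ThreadVars are heavily
simplified here and the actual PF_RING call is only indicated in a comment:

    #include <stddef.h>

    typedef enum { TM_ECODE_OK, TM_ECODE_FAILED } TmEcode;
    typedef struct ThreadVars_ ThreadVars;

    typedef struct TmModule_ {
        const char *name;
        TmEcode (*PktAcqLoop)(ThreadVars *, void *, void *);
        TmEcode (*PktAcqBreakLoop)(ThreadVars *, void *);   /* new callback */
    } TmModule;

    /* For PF_RING the break function would call pfring_breakloop() on the
     * ring handle stored in the capture thread's data. */
    static TmEcode ReceivePfringBreakLoopSketch(ThreadVars *tv, void *data)
    {
        (void)tv;
        if (data == NULL)
            return TM_ECODE_FAILED;
        /* pfring_breakloop(((PfringThreadVarsSketch *)data)->pd); */
        return TM_ECODE_OK;
    }

    void RegisterReceivePfringSketch(TmModule *tm)
    {
        tm->name = "ReceivePfring";
        tm->PktAcqBreakLoop = ReceivePfringBreakLoopSketch;
    }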
Implement LINKTYPE_NULL for pcap live and pcap file.
From: http://www.tcpdump.org/linktypes.html
"BSD loopback encapsulation; the link layer header is a 4-byte field,
in host byte order, containing a PF_ value from socket.h for the
network-layer protocol of the packet.
Note that ``host byte order'' is the byte order of the machine on
which the packets are captured, and the PF_ values are for the OS
of the machine on which the packets are captured; if a live capture
is being done, ``host byte order'' is the byte order of the machine
capturing the packets, and the PF_ values are those of the OS of
the machine capturing the packets, but if a ``savefile'' is being
read, the byte order and PF_ values are not necessarily those of
the machine reading the capture file."
Feature ticket #1445
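A sketch of what such a decoder has to cope with, including the byte-swap case
for savefiles captured on a machine with the other byte order; the PF_INET6
values checked here are examples, since the BSDs do not agree on that number:

    #include <stdint.h>
    #include <string.h>

    #define SKETCH_PF_INET 2    /* PF_INET is 2 on the relevant systems */

    enum NullNext { NULL_NEXT_IP4, NULL_NEXT_IP6, NULL_NEXT_UNKNOWN };

    static uint32_t ByteSwap32(uint32_t x)
    {
        return ((x & 0x000000ffU) << 24) | ((x & 0x0000ff00U) << 8) |
               ((x & 0x00ff0000U) >> 8)  | ((x & 0xff000000U) >> 24);
    }

    /* The first 4 bytes are a PF_ value in the capturing host's byte
     * order, so a savefile from the other endianness shows up swapped. */
    enum NullNext DecodeNullSketch(const uint8_t *pkt, uint32_t len)
    {
        if (len < 4)
            return NULL_NEXT_UNKNOWN;

        uint32_t family;
        memcpy(&family, pkt, sizeof(family));
        if ((family & 0xffff0000U) != 0)    /* PF_ values are small, so a */
            family = ByteSwap32(family);    /* large value means 'swapped' */

        if (family == SKETCH_PF_INET)
            return NULL_NEXT_IP4;
        /* PF_INET6 differs per OS (e.g. 24, 28, 30); check the known ones. */
        if (family == 24 || family == 28 || family == 30)
            return NULL_NEXT_IP6;
        return NULL_NEXT_UNKNOWN;
    }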
The global variable suricata_ctl_flags needs to be volatile, otherwise the
compiler might not generate code that actually reads the variable every time,
because it doesn't know other threads might write to it.
This was causing Suricata to not exit under some conditions.
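For illustration (the variable name and type here are chosen for the example,
not taken from the Suricata source):

    #include <signal.h>

    /* Without 'volatile' the compiler may keep the flag in a register
     * inside the loop and never observe a write from another thread or a
     * signal handler. */
    volatile sig_atomic_t ctl_flags_example = 0;

    void MainLoopSketch(void)
    {
        while (ctl_flags_example == 0) {
            /* do work; the flag is re-read from memory on every iteration
             * because it is declared volatile */
        }
    }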
Fix pcap packet acquisition methods passing 0 to pcap_dispatch.
Previously they passed the packet pool size, but the packet_q_len
variable had since been hardcoded to 0.
This patch sets packet_q_len to 64. If the packet pool is empty, we fall
back to direct allocation. As pcap_dispatch() is only called when the
packet pool is not empty, we allocate at most 63 packets.
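For reference, a minimal sketch of a capped pcap_dispatch() call; the callback
body is a stub:

    #include <pcap.h>
    #include <stddef.h>

    /* A positive 'cnt' caps how many packets a single pcap_dispatch() call
     * may deliver; passing 0 or a negative value removes the cap (libpcap
     * then processes a whole buffer, or the whole file). */
    #define PCAP_BATCH 64

    static void PacketCallbackSketch(u_char *user, const struct pcap_pkthdr *h,
                                     const u_char *bytes)
    {
        (void)user; (void)h; (void)bytes;
        /* hand the packet to the decode path ... */
    }

    /* Returns the number of packets processed, 0 on timeout/EOF,
     * negative on error or after pcap_breakloop(). */
    int CaptureOnceSketch(pcap_t *handle)
    {
        return pcap_dispatch(handle, PCAP_BATCH, PacketCallbackSketch, NULL);
    }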
Using a stack for free Packet storage causes recently freed Packets to be
reused quickly, while there is a higher likelihood of the data still being
in cache.
The new structure has a per-thread private stack for allocating Packets
which does not need any locking. Since Packets can be freed by any thread,
there is a second stack (return stack) for freeing packets by other threads.
The return stack is protected by a mutex. Packets are moved from the return
stack to the private stack when the private stack is empty.
Returning packets back to their "home" stack keeps the stacks from getting out
of balance.
The PacketPoolInit() function is now called by each thread that will be
allocating packets. Each thread allocates max_pending_packets, which is a
change from before, where that was the total number of packets across all
threads.
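A simplified sketch of the two-stack scheme (struct and function names
invented for the example):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Packet_ {
        struct Packet_ *next;
        struct PktPool_ *pool;      /* "home" pool of this packet */
    } Packet;

    typedef struct PktPool_ {
        Packet *head;               /* private stack, owner thread only */
        pthread_mutex_t return_m;
        Packet *return_head;        /* return stack, pushed by any thread */
    } PktPool;

    Packet *PacketPoolGetSketch(PktPool *pool)
    {
        if (pool->head == NULL) {
            /* private stack empty: grab the whole return stack at once */
            pthread_mutex_lock(&pool->return_m);
            pool->head = pool->return_head;
            pool->return_head = NULL;
            pthread_mutex_unlock(&pool->return_m);
        }
        Packet *p = pool->head;
        if (p != NULL)
            pool->head = p->next;
        return p;                   /* caller falls back to malloc if NULL */
    }

    void PacketPoolReturnSketch(PktPool *my_pool, Packet *p)
    {
        if (p->pool == my_pool) {
            /* freeing into our own pool: no lock needed */
            p->next = my_pool->head;
            my_pool->head = p;
        } else {
            /* 'foreign' packet: push it onto its home pool's return stack */
            pthread_mutex_lock(&p->pool->return_m);
            p->next = p->pool->return_head;
            p->pool->return_head = p;
            pthread_mutex_unlock(&p->pool->return_m);
        }
    }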
Flow-timeout code injects pseudo packets into the decoders, leading
to various issues. For a full explanation, see:
https://redmine.openinfosecfoundation.org/issues/1107
This patch works around the issues with a hack. It adds a check to
each of the decoder entry points to bail out as soon as a pseudo
packet from the flow timeout is encountered.
Ticket #1107.
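A sketch of what such a guard could look like at the top of a decoder entry
point; the flag name and signature are illustrative:

    typedef struct Packet_ {
        unsigned int flags;
    } Packet;

    #define PKT_PSEUDO_FLOW_TIMEOUT (1u << 0)   /* illustrative flag */

    typedef enum { TM_ECODE_OK, TM_ECODE_FAILED } TmEcode;

    TmEcode DecodeEthernetSketch(Packet *p, const unsigned char *pkt,
                                 unsigned int len)
    {
        /* hack: bail out immediately for flow-timeout pseudo packets */
        if (p->flags & PKT_PSEUDO_FLOW_TIMEOUT)
            return TM_ECODE_OK;

        (void)pkt; (void)len;
        /* ... normal ethernet decoding ... */
        return TM_ECODE_OK;
    }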