WARNING: The patches found in this directory are very experimental, and can (or will) cause data loss. As the Berkeley license on this code states, we're not responsible. Please don't use these patches unless you know what you're doing.

These patches are generated by:

    p4 diff2 -u //depot/vendor/freebsd/src/sys/... //depot/user/rwatson/netperf/sys/...

using the repository on perforce.FreeBSD.org. They are derived from the netperf_socket branch created by Scott Long to hold an updated and simplified set of patches from Sam Leffler's netperf+socket locking branch.

Credit for much of the work can be found on the FreeBSD SMPng web page, and the list of contributors is long! Early work, performed by Jonathan Lemon, Jennifer Yang, Jeffrey Hsu, and Sam Leffler, often adopted approaches prototyped by BSDi in a BSD/OS snapshot made available to the FreeBSD Project. BSDi's initial shepherding of the SMPng project was invaluable, and is much appreciated. Financial support from the FreeBSD Foundation also made it possible for Sam Leffler to finish up a number of areas of network stack locking left undone in early work, kicking off a second phase of network stack SMP work.

More recent work has been done by a larger set of developers, including: Max Laier, John-Mark Gurney, Brooks Davis, George Neville-Neil, Luigi Rizzo, Brian Feldman, Rick Macklem, Roman Kurakin, Don Lewis, Pawel Jakub Dawidek, Bosko Milekic, Julian Elischer, Gleb Smirnoff, Bruce Simpson, Mike Silbersack, Ed Maste, Nate Lawson, John Baldwin, and myself (Robert Watson).

Additional support has been provided by the FreeBSD Foundation, Sentex Communications, FreeBSD Systems, and other sponsors who have generously supported development and testing resources for this project. These sponsors are particularly thanked for their contribution of hardware and services for performance and stability testing.
The changes to disable Giant over the network stack by default have been merged to FreeBSD CVS; to re-enable Giant, build the kernel with options NET_WITH_GIANT, or set debug.mpsafenet=0 in loader.conf. Compiling or loading some network features, such as KAME IPSEC, will cause Giant to automatically be placed over the stack at boot. Compatibility code is present for network device drivers that are not locked down, but has a substantial performance impact.

20050809 log:

More months of silence, in which code is written, but logs are not updated. Primarily, on-going work has been in the areas of general cleanup, improving kernel memory status management, ifnet locking, and device driver activity. At a high level:

- Fix IPX support, which appears to have been broken by a combination of a Linux compatibility define and compiler padding changes. Merged to RELENG_5.
- Fix races that could cause soreceive() to return ENOTCONN rather than an EOF condition during disconnect. Merged to RELENG_5.
- Mark a number of system calls MPSAFE (some by me, some by jhb).
- Modify malloc(9) to improve ABI resistance, make use of per-CPU memory allocation statistics, and use critical sections to protect those statistics. Results in a substantial performance improvement for CPU-intensive paths using kernel memory allocation.
- De-spl UDP. Merged to RELENG_5.
- Fix an IPv6 UDP bug, in which proper locking was not performed around in6_pcbdetach(); this could result in race conditions leading to panics. To catch other related bugs, add a number of inpcb and pcbinfo lock assertions in in_pcb.c. Merged to RELENG_5.
- Additional TCP pcb and pcbinfo lock assertions hither and yon for similar reasons. Merged to RELENG_5.
- Acquire the inpcb lock properly for tcp_attach() failures in both IPv4 and IPv6, closing race conditions leading to panics. Merged to RELENG_5.
- Updated SMPng web page, netperf web page.
- RELENG_6 branch!
- Export memory allocator statistics streams from UMA(9) and malloc(9), and create a library for providing uniform kernel memory monitoring, libmemstat(3). Teach vmstat, netstat, et al, to use it. This closes race conditions in the monitoring of mbuf allocation. Generally move to 64-bit counters. Sample memtop(8), memstat(8), etc. http://www.watson.org/~robert/freebsd/libmemstat/ Merge to RELENG_6.
- A variety of UMA statistics improvements; netstat now knows about the size of the mbuf cache. vmstat(8) now works on core dumps again. netstat changes merged to RELENG_6.
- netnatm locking merged to CVS, marked as MPSAFE. Remove FreeBSD 2.2 kernel source compatibility. De-spl. Merged to RELENG_6.
- Tough love e-mail sent to arch@ about which device drivers have not been made MPSAFE. A nice flurry of locking activity, including if_de, if_pcn cleanups, if_dc cleanups, etc.
- Clean up and de-spl TCP, IP parts that have long since been made MPSAFE. Merged to RELENG_6.
- Split ifnet.if_flags into two fields: ifnet.if_flags (immutable and network stack owned flags) and ifnet.if_drv_flags (device driver owned flags). Rename IFF_OACTIVE to IFF_DRV_OACTIVE and move all references. Rename IFF_RUNNING to IFF_DRV_RUNNING and move all references. Identify incorrect use of IFF_ALLMULTI, and prod driver writers to fix them. Identify poor use of IFF_DEBUG, and prod driver writer to fix it. This will eliminate a host of very small races involving incremental updates of these fields, which could result in poor driver behavior, stalled network interfaces, etc. Document the locking and ownership of each flag and field in if.h.
- Add an IP multicast socket regression test, msocket.
- Add ifnet.if_addr_mtx to lock ifnet-related address lists. Add accessor macros. Allocate link layer addresses with M_NOWAIT so that mutexes can be held over if_resolvemulti(). Protect multicast address lists at the link layer using if_addr_mtx; involves rewriting if_addmulti() and if_delmulti().
De-spl. Modify network protocol layer consumers to acquire locks as appropriate. Modify all network device drivers to acquire if_addr_mtx when walking multicast address lists to program hardware address filters.
- Add new syscall_timing socket test cases.
- Add in_multi_mtx to protect IPv4 layer multicast address lists; add accessor macros. Acquire around updates to the multicast address lists, across calls to link layer multicast routines to keep the layers in sync, in iterations in ip_input() and ip_output(), and in IGMP. Define a lock order inpcb -> in_multi_mtx -> igmp_mtx -> if_addr_mtx.
- Michael Lucas has created an article on netperf locking work, based on my BSDCan presentation, for USENIX ;login:.

20050513 log:

After several months of quiet in the log, some recent status:

- My BSDCan 2005 netperf presentation is now online at http://www.watson.org/~robert/freebsd/netperf/20050513-bsdcan-netperf/
- UMA optimizations to use critical sections instead of mutexes for per-cpu caches have now been merged, resulting in a 2%-4% improvement in PPS for minimally sized UDP datagrams from user space. Larger improvements have been reported for kernel-only benchmarks, up to 20% for some packet sniffing environments. These and other per-cpu optimizations rely on critical section optimizations recently committed by John Baldwin, which substantially lower the cost of critical sections, to below the cost of mutexes on UP. On Xeon P4, they come to about 20-40 cycles, as compared to the cost of disabling interrupts previously.
- Malloc patches to use critical sections and per-cpu data instead of mutexes for per-malloc type data are in testing and will be merged soon, with feedback from bde and others. Two challenges in this work: adapting/maintaining user/kernel API/ABI for monitoring with vmstat -m, and how to maintain global high watermarks in the absence of global state.
- Prior to 5.4-RELEASE, netipx locking and mpsafety was merged from HEAD to RELENG_5.
- TCP send optimizations to reduce the scope of the global pcbinfo lock for TCP in tcp_usr_send(), submitted by Kazuaki Oda, were merged to CVS HEAD.

20050227 log:

Integrate the netperf branch from FreeBSD CVS HEAD, bringing in a broad set of changes:

- The possible race in sonewconn() relating to the modification of the so_state connection state has been eliminated by moving the state change to before the socket is exposed to other threads.
- kern_connect() has been de-spl'd.
- Some stylistic and content cleanup to uipc_usrreq.c, the UNIX domain socket implementation.
- netatalk now uses a callout rather than a timeout for the initial aarpprobe() call. There is a conflict between the at_ifaddr locking in the netperf branch and the aarp locking in the main tree, as the CVS version uses the aarp lock to protect the condition managed by the callout, whereas the netperf tree uses the at_ifaddr_mtx in this role. This will need further work.
- Various mbuf allocations in the netatalk code have been changed to use non-sleeping allocations, as they may be called from the netisr, ithreads, or other contexts in which mutexes may be held.
- The netatalk netisrs now run MPSAFE.
- The netgraph netisr now runs MPSAFE.
- Experimentation with per-CPU randomness state for the ip_id code is now in the netperf branch. However, this approach has problems due to poorly timed calls into the random code during a critical section. Further work required here.
- Split of solisten() logic out into solisten_proto_check() and solisten_proto() has eliminated a class of races against the socket and protocol layer state during transitions. See the 20050227 log for details on the general issue and specifics of the solution.
- Annotation that the ip6_forward_rt route cache for IPv6 packet forwarding needs synchronization (or elimination).
- IPX/SPX restructuring and locking has been merged to FreeBSD CVS, including the move to queue(9), PCB locking, etc.
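
The solisten() split in the list above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it uses pthread mutexes instead of kernel locks, and every name in it (model_socket, model_pcb, model_listen_check(), model_listen_commit(), the MODEL_* flags) is hypothetical and does not correspond to the kernel's actual structures or signatures. It shows the protocol taking its own lock, then the socket lock, and performing the check and the state change in both layers before dropping either lock.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace model; names and flag values are illustrative. */
#define MODEL_SO_ACCEPTCONN 0x1   /* socket accepts connections */
#define MODEL_SS_CONNECTED  0x2   /* socket is already connected */

struct model_socket {
    pthread_mutex_t so_lock;      /* stands in for the socket lock */
    int so_flags;
    int so_qlimit;
};

struct model_pcb {
    pthread_mutex_t pcb_lock;     /* stands in for the protocol lock */
    int listening;                /* stands in for the protocol listen state */
    struct model_socket *so;
};

/* Check step: may this socket move to the listen state? Caller holds so_lock. */
static int model_listen_check(struct model_socket *so)
{
    return (so->so_flags & MODEL_SS_CONNECTED) ? EINVAL : 0;
}

/* Commit step: push the transition into the socket. Caller holds so_lock. */
static void model_listen_commit(struct model_socket *so, int backlog)
{
    so->so_qlimit = backlog;
    so->so_flags |= MODEL_SO_ACCEPTCONN;
}

/* Protocol-driven listen: both layers change while both locks are held. */
static int model_proto_listen(struct model_pcb *pcb, int backlog)
{
    int error;

    pthread_mutex_lock(&pcb->pcb_lock);     /* protocol lock first */
    pthread_mutex_lock(&pcb->so->so_lock);  /* then the socket lock */
    error = model_listen_check(pcb->so);
    if (error == 0) {
        pcb->listening = 1;
        model_listen_commit(pcb->so, backlog);
    }
    pthread_mutex_unlock(&pcb->so->so_lock);
    pthread_mutex_unlock(&pcb->pcb_lock);
    return error;
}
```

In this model, no other thread that takes either lock can observe the pcb in the listen state without the socket also being marked as accepting connections.
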

Also relevant from looped back work in the main FreeBSD tree:

- callouts can now define a mutex that will be held over callout state transitions.
- mbuf cluster reference counting optimizations to avoid atomic operations on reference release in the common case (1 reference).
- M_VLANTAG avoids walking m_tag chains unnecessarily.
- netatm, netnatm are now marked NET_NEEDS_GIANT().
- Many netgraph modules have been converted to use ng_callout instead of timeout.
- Several netgraph modules have been fixed so as to properly initialize and destroy mutexes on load/unload.
- ICMPv6 pcb walking and insertion has been locked in the same form as IPv4 ICMP and raw sockets.
- The NFS code has been changed to avoid sleeping mbuf allocations in a variety of sticky situations.
- Bugs have been fixed in the handling of raw socket cb walking relating to PF_KEY and FAST_IPSEC.

20050227 log:

We've run into a number of cases in which related state is maintained at the socket and protocol layers in the network stack under different locks, resulting in races between protocol layers as the locks are not properly combined during multi-layer state transitions. The primary example of this has to do with socket layer connection state, maintained in the so_state and so_options socket layer fields. Protocols such as TCP maintain replicas of this state in the inpcb and tcpcb state of the protocol, such as the tp_state field.

In the current world order, state transitions are driven by the socket layer, which performs its checks and state modifications, then notifies the protocol layer it should do likewise. The ordering varies by state transition, but the result is significant races, including one that recently triggered under Peter Holm's network stress tests: the socket layer would notify the protocol layer that it should transition to the listening state (TCPS_LISTEN in the case of TCP); then it would update the socket to include SO_ACCEPTCONN in the socket options field (so_options).
If a TCP SYN packet came in after the tcpcb had transitioned to the listen state, but before the socket layer had set the SO_ACCEPTCONN flag, an assertion in the TCP code would fire because a listen socket was present without the socket being in the listen state.

To correct this class of races, we've been exploring moving the FreeBSD socket layer to act more as a library to the protocol, rather than having the protocol act as a library to the sockets layer -- since protocol locks fall ahead of socket locks in the lock order, this permits protocols to hold their locks over the socket state transition, which isn't currently possible with state transitions driven in the sockets layer. This also eliminates time-of-check / time-of-use races in the socket layer.

Another possible approach is to accept the asynchrony of changes between layers, rather than eliminating it. That is, to teach the protocol implementation to be aware that it may race with the socket code, and to act safely in the presence of a detected race. While in the case of the solisten() implementation this might be straightforward, I am concerned that this approach would make it hard for protocol implementers to reason effectively about their implementation, introducing the opportunity for a broad range of hard-to-debug state bugs.

The solution we're currently employing in FreeBSD CVS HEAD is to have the protocol invoke solisten_proto_check() when beginning a transition to the listen state, and solisten_proto() when the protocol wants to push the state transition into the socket layer. The protocol is responsible for holding its own locks and socket layer locks for the duration in order to ensure the state change is atomic between layers.

20050102 patch:

Integrate to CVS HEAD:

- Remove now-obsoleted "Unlocked read" annotations.
- Pick up UNIX domain socket locking fix from Alan Cox.
- Pick up BPF locking fixes from Brian Feldman and John-Mark Gurney.
- Pick up SPPP locking from Roman Kurakin.
- Merged ip_getmoptions(), ip_pcbopts(), ip_setmoptions() inpcb structure and locking improvements to FreeBSD CVS, making many IP-layer socket options more MPSAFE. There is some more work to do for multicast socket options.
- Merged substantial quantity of TCP locking assertions, annotations, fixes, etc, including TCP timewait locking, almost universal lock assertions in TCP, TCP timer locking, TCP ISN locking to FreeBSD CVS.
- Completed IPX/SPX queue(9) conversion for ipxpcb's, fixing a number of bugs, as well as eliminating quite a bit of dead or broken code, which simplified the conversion.
- Merged IPX/SPX queue(9) conversion to FreeBSD CVS.
- Merged constification of several IPX constants to FreeBSD CVS.
- Fixed bugs in IPX/SPX due to a lack of __packed for packet header data structures, resulting in on-wire data corruption with recent gcc compilers.
- Merged IPX/SPX __packed bug fixes to FreeBSD CVS.
- Corrected route locking bugs in IPX due to using rtfree() rather than RTFREE() on an unlocked route entry. Converted to using the rtalloc_ign() API over the obsoleted rtalloc() API.
- Merged IPX route lock bugfixes to FreeBSD CVS.
- Added "show alllocks" to DDB, which lists the locks visible to WITNESS that are owned by any running process/thread, not just the current one.
- Merged "show alllocks" to FreeBSD CVS.
- Merged many TCP locking fixes to RELENG_5.
- Trimmed many XXXRW comments from the socket code following review; merged trimmage to FreeBSD CVS.
- Merged netrate fixed-rate packet generation tool from HEAD to RELENG_5 at the request of Matthew George.

20041204 patch:

Integrate to CVS HEAD:

- Merged subr_mbufqueue from rwatson_dispatch to rwatson_netperf. This is a library of basic mbuf queue routines that can be used in passing around sets of mbufs to amortize costs.
- When the netisr thread runs, dequeue all available mbufs into a thread-local mbuf queue and then process them one at a time rather than repeatedly re-locking the queue for each queued mbuf.
- Merged synchronization micro-benchmarking from rwatson_percpu to rwatson_netperf.
- Merged if_start_mbuf() from rwatson_dispatch to rwatson_netperf, which allows network device drivers to implement a direct dispatch of an mbuf into the driver without first queueing it. This avoids several lock operations for each transmit, at the cost of code replication in the device driver. This change pushes additional handoff code into ifq_handoff(), and modifies if_handoff() to detect and use support for the entry point.
- Implement if_start_mbuf in if_em.
- Merged Giant-free close() for sockets/pipes/... to FreeBSD CVS. This reduces the overhead of close operations by about 0.5% in socket()/close() micro-benchmarks on SMP by avoiding locking Giant for each socket close.
- Removed now defunct "Unlocked read" annotations in pipe, socket, and netinet code. Merged in some cases to FreeBSD CVS.
- IF_DEQUEUE() and IF_DRAIN() now perform a lockless read before beginning work, in order to avoid locking when no work is to be performed. This should remove at least one lock operation pair from each run of the netisr, as well as each call into if_start.
- Substantial cleanup of TCP lock assertions, TCP locking, etc; see log entry for 20041128.
- Hold socket buffer lock over the duration of urgent pointer and window calculations in the TCP input path to avoid races in calculating available space.
- Merged udp_in, udp_in6, udp_ip6 UDP bug fixes for multi-threaded udp_input to FreeBSD CVS.

20041128 log:

The last week has seen work in the following areas:

- A review of locking throughout the TCP implementation, including identifying and documenting under-documented locking strategies, dribbling locking assertions throughout the implementation, and correcting a number of relatively minor locking bugs.
This includes the following changes in FreeBSD CVS:
  - Remove now outdated "Unlocked read" annotations from tcp_input.c.
  - Document that the tcbinfo lock protects ISN global variable state in TCP. Acquire the tcbinfo write lock in the ISN timer. Staticize ISN variables since they're used only in tcp_timer.c.
  - Additional TCP inpcb assertions in ICMP upcall notify handlers, including tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(), as well as timer tcp_timer_2msl(), and other TCP functions such as tcp_reass().
  - Litter more TCP assertions in tcp_input(), especially relating to the many labels and gotos.
  - Document and enforce locking of the TCP time wait structure and chains as using the tcbinfo lock to protect the chains, and inpcb lock to protect the contents. Assert locks as needed in tcp_twstart(), tcp_twrecycleable() (note that a bug exists here because it is called without the inpcb lock in some situations), tcp_twclose(), tcp_twrespond(), tcp_timewait(), tcp_timer_2msl_tw(), tcp_xmit_timer().
  - De-spl several timer functions, including tcp_slowtimo(), tcp_timer_delack(), tcp_timer_2msl() and handlers, tcp_timer_2msl_tw().
  - Expand coverage of the socket buffer lock in tcp_input() relating to urgent pointer checks and sets on the socket.
- Merged udp_input/udp_append "don't use globals" fix to RELENG_5, making it substantially safer to use net.isr.enable=1.
- Merged if_em promisc+vlan bug fix, although there are continued reports of problems that need to be investigated.
- Updated the FreeBSD.org Netperf page, http://www.freebsd.org/projects/netperf Also updated SMPng task lists and announcements to reflect some of the ongoing work.
- Implement the TCP_INFO socket option so that user processes can query TCP state information for a connection.
- Additional experimental changes in the netperf development branches:
  - Merged subr_mbufqueue.c + mbufqueue.h from rwatson_dispatch to rwatson_netperf.
  - Broke out mbq_enqueue_from_ifq() and mbq_enqueue_from_altq(), depending on whether the ifqueue passed in by the caller is in fact an ifaltq or not.
  - Modified netisr_processqueue() to drain all available mbufs to a local struct mbufqueue in a single O(1) operation, rather than reading them one by one, grabbing and releasing the lock for each.
  - Updated the version of critical section optimizations in rwatson_percpu to John Baldwin's latest patch. Combined with per-CPU UMA caches, this results in a substantial performance benefit for micro-benchmarks of the socket() and pipe() system calls. On UP, the improvements are small but noticeable (1% for pipe/close/close), but on SMP they are quite real (8% pipe, 7% socket). On the netblast slightly-more-macro benchmark, the difference was not measurable. These changes appear to come primarily from the UMA optimizations, which depend on the critical section changes, rather than the reduced critical section costs alone.
  - Kernel synchronization micro-benchmark was merged from rwatson_percpu to rwatson_netperf.
  - Clean up UMA optimization comments.

20041120 log:

Over the past week or two, I've primarily been chasing bugs in the IPv4 code, NFSv3 code, and if_em driver, with some success. A couple of new patches on my netperf web page include:

20041107-uma_critical_cache.diff - UMA using critical sections rather than mutexes to protect per-CPU caches.

20041119-uma-percpu-crit-opt.diff - UMA using critical sections rather than mutexes for per-CPU caches, combined with a version of John Baldwin's critical section optimization that removes the need to disable interrupts during critical sections. This results in the change to critical sections cutting several hundred cycles off each cached allocation.

20041111-nfs-server-locking-fix.diff - correct nits in the locking of the NFS server around access control checks, both correcting a bug where the wrong locks were held, as well as reducing the level of lock thrashing by introducing variants on the call that expect either the NFS server lock or the Giant lock. This has been committed and now merged to RELENG_5. It is a RELENG_5_3 candidate.

I have merged the socket reference count -> file descriptor reference count optimizations from HEAD to RELENG_5.

I've fixed some minor bugs in the TCP input locking, wherein the pcbinfo lock would be released prematurely, before all references to any pcb's looked up were released. Since the pcbinfo lock currently acts as a reference to prevent the pcbs from being garbage collected, there was the potential for a race condition.

I've also performed a number of performance measurements relating to IP forwarding on 4.x, 5.x, and with various optimizations. Results are typically comparable to 4.x, albeit with slightly higher latency, on the systems I'm using (back-to-back gig-e and multi-GHz P4's). My suspicion is that the cost per-packet is still substantially higher than in 4.x, so in more CPU-constrained environments, this would be a problem. Interestingly, 5.x with net.isr.enable (direct ithread dispatch of the network stack) on UP performs measurably better than 4.x; this suggests that the two biggest concerns right now remain latency to schedule the netisr for continued processing, and the cost of synchronization on SMP. An open question remains what downsides we'll see from possibly moving to direct ithread dispatch by default.

There is now a centralized netperf web page for the FreeBSD Project at http://www.freebsd.org/projects/netperf/, which describes strategies and approaches for optimizing network performance with our multi-threaded network stack.
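
As an aside on the netisr change noted in the 20041128 and 20041204 entries above (draining all queued mbufs into a thread-local queue in one operation instead of re-locking per mbuf), the idea can be sketched in userspace as follows. This is a hedged illustration, not the subr_mbufqueue code: struct pkt, struct pktq, and the pktq_*() and pkt_process_chain() functions are hypothetical names, and a pthread mutex stands in for the queue mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for an mbuf and a mutex-protected packet queue. */
struct pkt {
    struct pkt *next;
    int id;
};

struct pktq {
    pthread_mutex_t lock;
    struct pkt *head;
    struct pkt *tail;
    int len;
};

/* Producer side: append one packet under the queue lock. */
static void pktq_enqueue(struct pktq *q, struct pkt *p)
{
    p->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: detach the whole chain in O(1) with one lock/unlock pair. */
static struct pkt *pktq_drain(struct pktq *q)
{
    struct pkt *chain;

    pthread_mutex_lock(&q->lock);
    chain = q->head;
    q->head = q->tail = NULL;
    q->len = 0;
    pthread_mutex_unlock(&q->lock);
    return chain;
}

/* Walk a drained chain with no lock held; returns the packets handled. */
static int pkt_process_chain(struct pkt *chain)
{
    int n = 0;

    while (chain != NULL) {
        struct pkt *next = chain->next;  /* read before handing the packet off */
        /* real code would pass the packet to the protocol input routine here */
        chain = next;
        n++;
    }
    return n;
}
```

With N packets queued, the consumer performs one lock operation pair per run rather than N, at the cost of holding a private chain that producers can no longer see.
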

20041105 log:

Over the last 48 hours, I've been measuring the cost of synchronization primitives in 5.x. In particular, I've been using the cycle counter to measure the cost of enter/exit for each of our locking primitives, as well as for critical sections. The results are interesting, especially when viewed in light of the changes between the PIII and P4 processors. You can see a PDF including some results and two graphs here:

20041105-synchronization-costs.pdf

To create these measurements, I used the timing code now present in //depot/user/rwatson/percpu/test/test_synch_timing.c, which basically runs 50,000 loops of acquire/drop for each primitive and returns the results via a sysctl. Done carefully and with an awareness of interrupt behavior on the system, this appears to be fairly accurate.

Conclusions are interesting: our critical section cost is the same across UP and SMP as it currently consists of disabling interrupts; however, the relative cost of the critical section vs locking differs based on running on UP and SMP (i.e., if atomic operations are required to implement locks or not). On SMP, critical sections offer a performance advantage when used, but on UP they are more expensive than simple sleep mutexes. We can and should investigate optimizing critical sections, as well as alternative locking primitives. On the PIII, the costs of our synchronization primitives are "not too bad" on UP and SMP, but on the P4, the added cost of atomic operations is pretty large, and adds up quickly for primitives such as sx locks, which perform more than one mutex operation.

Mike Silbersack has suggested introducing a primitive to permit UP mutexes to be used with pinned threads on a specific CPU, which would substantially lower the cost of synchronization for per-CPU data structures.
This is not dissimilar to more classic use of a critical section to protect a per-CPU data structure, but doesn't rely on our having optimized critical sections further (and is therefore both easy to implement and fast). Alan Cox has noted that this could be used in a multi-CPU environment given IPI-based rendezvous or messaging, but my leaning is to avoid mixing per-CPU and inter-CPU synchronization in order to simplify the implementation and avoid scenarios such as priority inversion, deadlock, etc.

I've begun looking at removing mutex protection from the per-CPU UMA caches, but this is somewhat complicated as a result of the interlocking of mutexes in UMA (i.e., holding the per-CPU mutex over acquisition of the zone mutex). Bosko Milekic has similarly observed that we may need to eliminate some of this interlocking if we want to substitute critical sections (or other light-weight per-CPU primitives) for mutexes. I hope to have a basic prototype of some modifications to UMA that eliminate the mutexes within a week, permitting us to see the benefits of the earlier per-thread UMA caches without the memory overhead or questions about draining. I observe that the per-CPU cache mutexes are acquired during zone draining for release of a zone in UMA, and think this may be spurious locking anyway, since presumably you don't destroy a zone when it may still have operations pending.

20041103 patch:

Integrate to CVS HEAD.

A mutex to protect device polling state is introduced, pollmtx, and used to protect the global variables present in kern_poll.c. As a result, Giant is removed from the polling netisr code, and the #ifdef causing polling build to fail on SMP is removed. This has not been tested.

A missing call to ACCEPT_LOCK() in soaccept() is added; the fix is also merged to FreeBSD CVS.
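
As an aside on the 20041105 measurements above: the general shape of such a micro-benchmark (a fixed number of uncontended acquire/drop iterations, timed as a whole) can be sketched in userspace. This is not test_synch_timing.c, which runs in the kernel, uses the cycle counter, and reports via sysctl; the sketch below assumes a POSIX system, times a pthread mutex with clock_gettime(), and uses hypothetical names (now_ns(), time_mutex_pairs(), SYNCH_LOOPS). Only the loop count of 50,000 is taken from the log.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <time.h>

#define SYNCH_LOOPS 50000   /* loop count mentioned in the log entry */

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Time `loops` uncontended lock/unlock pairs on `m`.
 * Returns the elapsed nanoseconds; stores the pairs completed in *pairs_done.
 */
static uint64_t time_mutex_pairs(pthread_mutex_t *m, int loops, int *pairs_done)
{
    uint64_t start = now_ns();
    int i;

    for (i = 0; i < loops; i++) {
        pthread_mutex_lock(m);
        pthread_mutex_unlock(m);
    }
    *pairs_done = i;
    return now_ns() - start;
}
```

Dividing the returned total by the number of pairs gives a rough per-pair cost. As the log notes for the kernel version, results are only meaningful when the run is not disturbed by interrupts or scheduling, so repeated runs are advisable.
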

The socket-specific system calls, such as send(), recv(), and variants are modified to use the file descriptor reference count instead of the socket reference count, reducing the number of mutex operations during a socket system call by at least 4, if not 6. This has a measurable improvement in terms of per-system call overhead. Merged to FreeBSD CVS.

Further optimizations are made to the entropy code to avoid O(n) iterations over harvest event fifos, and to properly maintain counts of events in fifos. Merged to FreeBSD CVS.

Unnecessary local comments in the netatalk code have been removed.

The locking in tcp_input() is modified so that the pcbinfo lock is maintained longer, in order to eliminate some cases where the pcb reference was accessed after the lock acting as a reference was released. Assertions are similarly cleaned up.

The locking in tcp_output() is similarly modified to avoid accessing pcbs after the pcbinfo lock has been released.

The udp_in6 and udp_ip6 variables in udp_usrreq.c are made static, since they are not accessed from outside of that file. Further locking cleanups are required here to correct races that may occur during parallel execution of udp_input().

The if_dc driver will set IFF_NEEDSGIANT if not IS_MPSAFE.

20041019 patch:

Integrate to CVS HEAD.

Added and merged to FreeBSD CVS IFF_LOCKGIANT() and IFF_UNLOCKGIANT(), which conditionally lock and unlock Giant based on IFF_NEEDSGIANT on the passed interface. Employed in if.c around calls to ifp->if_ioctl() in the network interface ioctl() path to prevent entering the device driver's ioctl() handler without Giant held. Problem reported by John-Mark Gurney.

Modified and merged fix to sofree()/soclose() race to FreeBSD CVS, in which the network protocol and socket close() would race to free the socket as a result of the socket lock being released in sofree() in order to grab the accept mutex.
Pushed the accept mutex out of sofree() into the caller, which has the downside of broadening coverage of the mutex, but the upside of preventing the race from occurring. We will want to revisit the socket reference model, since it performs less well in threaded environments, and may still be a source of synchronization problems.

Merged the hard-coding of entropy harvesting spin locks in WITNESS to FreeBSD CVS.

Merged fix to FreeBSD CVS that moves entropy harvesting before ether_demux() in ether_input() so as to gather entropy from an un-free'd mbuf rather than a free'd one.

20041009 patch:

Integrate to CVS HEAD.

Merged /dev/random entropy harvesting locking strategy change to FreeBSD CVS. This substantially reduces the locking overhead of the entropy collection and processing subsystems.

Merged removal of GIANT_REQUIRED in kqueue_close() to FreeBSD CVS, as KQueue is now MPSAFE.

Merged addition of BPF locks to hard-coded WITNESS lock order to FreeBSD CVS.

Corrected a number of potential races in tcp_output() due to the socket buffer mutex not being held over a series of socket buffer operations. Merged fixes to FreeBSD CVS.

Annotate potential races in if_findindex() and if_grow(), wherein unit number allocation problems may occur during interface allocation. Extend the coverage of the IFNET_WLOCK() during interface allocation. Add IFNET_*LOCK() assertion macros.

Register AppleTalk netisrs as MPSAFE.

Register NetGraph netisr as MPSAFE.

Further refine locking in ip_ctloutput() to hold inpcb mutexes over ip_pcbopts(), clean up various annotations (and problems) associated with sooptcopyout() by making local copies rather than copying from shared data structures, and by caching values so as to use them in consistent (and stale) manners rather than inconsistent and stale manners. Likewise in multicast options. Some problems still remain in this code, and comprehensive testing is needed.
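
The "local copy" approach described for ip_ctloutput() can be sketched in userspace. This is an illustration under stated assumptions rather than the kernel code: struct shared_opts, struct opts_snapshot, opts_snapshot_get(), and opts_copyout() are hypothetical names, a pthread mutex stands in for the inpcb mutex, and memcpy() stands in for the copy to the caller. The point shown is that all fields are read under one lock hold into a local snapshot, and the copy-out then works only from that snapshot, so the caller sees values that may be stale but are mutually consistent.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared option block, standing in for per-pcb options. */
struct shared_opts {
    pthread_mutex_t lock;   /* stands in for the inpcb mutex */
    int ttl;
    int tos;
};

/* Private copy of the fields, owned by the calling thread. */
struct opts_snapshot {
    int ttl;
    int tos;
};

/* Read every field under a single lock hold so the values agree. */
static struct opts_snapshot opts_snapshot_get(struct shared_opts *so)
{
    struct opts_snapshot snap;

    pthread_mutex_lock(&so->lock);
    snap.ttl = so->ttl;
    snap.tos = so->tos;
    pthread_mutex_unlock(&so->lock);
    return snap;
}

/*
 * Stand-in for the copy-out step: runs with no lock held and touches only
 * the private snapshot. Returns 0 on success, -1 if dst is too small.
 */
static int opts_copyout(void *dst, size_t dstlen, const struct opts_snapshot *snap)
{
    if (dstlen < sizeof(*snap))
        return -1;
    memcpy(dst, snap, sizeof(*snap));
    return 0;
}
```

A later change to the shared block does not affect a snapshot already taken, which is the "consistent (and stale)" behavior the entry describes.
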

In tcp_subr.c, assert inpcb, pcbinfo locks when executing various timer events and other manipulation of pcb state. Also add assertions to utility functions, the syncache, et al.

Register the IPv6 netisr as MPSAFE.

20040908 patch:

Integrate to CVS HEAD.

Corrected a bug in sopoll() that could lead to races during the registration of poll()/select() to sleep: an event could be missed resulting in the appearance of a "wedge" by applications using poll() and select() on sockets on SMP. Merged sopoll() bug fix to FreeBSD CVS (HEAD and RELENG_5).

Annotate an issue in UNIX domain socket stream socket delivery where the return value of sbappend_locked() is not properly checked, which might (or might not) cause data loss.

Convert BPF descriptor implementation from custom-built linked lists to queue(3)-based lists, simplifying several aspects of the implementation. Merge BPF descriptor conversion to queue(3) to FreeBSD CVS.

Slightly re-sort bpf_attachd() to move hook-up of the si_drv1 field after more BPF fields have been initialized. Annotate a number of potential races in the BPF descriptor implementation.

Annotate possible races in if_attachdomain1() during ifnet attachment, as well as slide registration of an incompletely initialized ifnet later in ifnet initialization. Work around a race between if_afdata initialization and the IPv6 neighbor discovery timer, which could cause dereferencing of a NULL pointer.

Merge removal of the &thread0 references in ng_ksocket.c into FreeBSD CVS.

Merge removal of key_int_random and related randomization re-seed code in the KAME IPSEC implementation into FreeBSD CVS.

20040902 patch:

Integrate to CVS HEAD.

Merged NET_WITH_GIANT, NET_NEEDS_GIANT(), and default to MPSAFE network operation to FreeBSD CVS.

Merged annotation of IPX, ng_tty, KAME IPSEC as requiring Giant to FreeBSD CVS.

Merged marking of if_pcn, if_sf, if_ste, if_ti, if_tl, if_wb as IFF_NEEDSGIANT to FreeBSD CVS.
Fixed bug wherein the kernel linker would be called from the Netgraph socket send code without Giant held; merged directly to FreeBSD CVS. 20040828 patch: Integrate to CVS HEAD. Change the default setting for debug_mpsafenet to 1 from 0, which runs the network stack without Giant unless explicitly specified otherwise. Add "options NET_WITH_GIANT", a kernel option that sets the default back to 0. Rewrite locking found in /dev/random entropy collection to coalesce a number of locks into a single harvest_mtx, which will reduce the number of mutex operations when gathering entropy from 4 to 2, and reduce to O(2) the number of mutex operations in the entropy thread when processing gathered entropy from O(4N). While this potentially increases contention, it will dramatically decrease cost in the uncontended case. Add NET_NEEDS_GIANT("component") declaration that allows kernel network components to declare a dependence on Giant for correct operation. This declaration is checked for the kernel and modules during boot to determine whether debug.mpsafenet should be forced to 0 to ensure correct operation, and if so, a warning is displayed. If it's too late in the boot, we display a different warning and continue. Instrument ng_tty, IPX, KAME IPSEC to declare dependence on Giant. Remove unused random tick handling in KAME IPSEC to reduce need for synchronization. Merged removal of UNIX domain socket locks from unp_gc() to FreeBSD CVS repository. 20040825 patch: Integrate to CVS HEAD. Merged removal of conditional socket buffer locking in socket kqueue filters to FreeBSD CVS. Removed references to thread0 (for its credential) in ng_ksocket.c, as they are in practice unused in HEAD due to curthread always being defined as non-NULL. However, the use of a thread here is improper, and probably suggests replacing thread references with credential references at a number of points in the protocol API. 
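As a rough illustration of the NET_NEEDS_GIANT() idea from the 20040828 entry, components can register their dependence and a single boot-time check can then force the stack back under Giant. The sketch below is not the kernel implementation: demo_mpsafenet, demo_net_needs_giant() and demo_net_check_giant() are hypothetical names, the table is a fixed-size array, and the "too late in the boot" case is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define	DEMO_MAX_COMPONENTS	8

/* Hypothetical analogue of debug.mpsafenet: 1 means run without Giant. */
static int demo_mpsafenet = 1;

/* Names of components that declared a dependence on Giant. */
static const char *demo_giant_users[DEMO_MAX_COMPONENTS];
static size_t demo_giant_count;

/* Record a declaration; returns 0 on success, -1 if the table is full. */
static int
demo_net_needs_giant(const char *component)
{
	if (demo_giant_count >= DEMO_MAX_COMPONENTS)
		return (-1);
	demo_giant_users[demo_giant_count++] = component;
	return (0);
}

/*
 * Run once at "boot": if anything declared a dependence, force the
 * tunable to 0 and warn.  Returns the number of declarations seen.
 */
static size_t
demo_net_check_giant(void)
{
	size_t i;

	if (demo_giant_count > 0) {
		demo_mpsafenet = 0;
		for (i = 0; i < demo_giant_count; i++)
			fprintf(stderr, "WARNING: %s requires Giant\n",
			    demo_giant_users[i]);
	}
	return (demo_giant_count);
}
```

A component such as ng_tty or IPX would call the registration function once from its initialization path; the check runs after all registrations are in.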
Merged replacement of &thread0 with curthread in nfs_timer.c to prevent use of non-curthread for suser() by UDP6. Merged in6_prefix.[ch] and router renumbering removal from George Neville-Neil to FreeBSD CVS, which simplifies IPv6 locking. Merged marking of if_dc as IFF_NEEDSGIANT to FreeBSD CVS. While if_dc contains locking, it's disabled by default and has not been reviewed or adequately tested. Merged fix to an NFS server bug where Giant was improperly acquired instead of released in nfsrv_link(), resulting in an assertion failure with INVARIANTS (or a delayed failure for non-INVARIANTS in the event nfsd exited). 20040822 patch: Integrate to CVS HEAD. Merged removal of Giant assertion in setugidsafety() to FreeBSD CVS. Merged removal of Giant assertion in kqueue_close() to FreeBSD CVS. Removed conditional locking in socket kqueue filters, as the socket buffer mutex will now always be held at that point. Merged UDP link header mbuf allocation optimization to FreeBSD CVS. Merged in6_pcbnotify() bug fix to conditionally unlock based on return value of the notify routine to FreeBSD CVS. 20040819 patch: Integrate to CVS HEAD. Merged GIANT_REQUIRED for fwe_start() to FreeBSD CVS. Updated annotations of GIANT_REQUIRED in close()-related functions, including setugidsafety(), fdclosexec(), as KQueue now requires Giant less. Merged UNP_UNLOCK_ASSERT() to FreeBSD CVS. Merged assertion of INP_LOCK_ASSERT() in inp_rehashpcb() to FreeBSD CVS. Merged push-down of udp_send() locks into udp_output(), as well as avoidance of pcbinfo locking in the bound/connected/non-sendto send case to FreeBSD CVS. Possibly correct a bug involving TCP and in6_pcbnotify() where the return value of the notify routine was not being used to determine if the inpcb should be unlocked or not by the caller. Problem reported by Jun Kuriyama. Added initial task list for further netinet6 locking. 20040817 patch: Integrate to CVS HEAD: ipfw, dummynet, etc using PFIL_HOOKS. 6.x-CURRENT. 
20040816 patch: Integrate to CVS HEAD. Cleanup to Giant-reduced close() following KQueue locking integration. Merged UNIX domain socket lock over sotounpcb() changes to FreeBSD CVS. 20040815 patch: Integrate to CVS HEAD. Pick up kqueue locking from John-Mark Gurney (woo hoo!). This removes the requirement for Giant in the high level close() system call code. Still required in closef() for VFS, however. Introduce additional UNP_UNLOCK_ASSERT() calls to check that the UNIX domain socket subsystem lock is released after unp_detach(), as a substitute for NB's that it will be. Annotate some additional potential races in the UNIX domain socket code. Merge UNIX domain socket locking description to FreeBSD CVS. Introduce MP_WATCHDOG, which dedicates a CPU in an SMP system to act as a system watchdog, substituting for a lack of an NMI button on systems that don't have one. Merge MP_WATCHDOG to FreeBSD CVS. 20040814 patch: Integrate to CVS HEAD. Reformulation of UNIX domain socket locking to make sure that the UNP subsystem lock covers checks of so_pcb pointers, which will prevent a variety of races. Also, introduce additional tests to check for races between close() and connect() which are present even in non-mpsafe network stacks, and triggered by recent sched_ule changes. Annotate additional race possibilities. Query: should we be modifying reference handling and locking at the file descriptor layer to prevent a close() from finishing until all system calls outstanding on the file descriptor complete? Don't hold the UNP subsystem lock over the unp_gc() garbage collection pieces. Merged IFF_NEEDSGIANT flagging for many common non-MPSAFE network interface drivers to FreeBSD CVS. David Malone fixed a locking bug involving syncache access on IPv6. Merged move to non-blocking mbuf allocation for IPv6 raw socket sends to FreeBSD CVS. In if_dc, rely on debug_mpsafenet to determine if we should run MPSAFE rather than the IS_MPSAFE flag in the driver. 
Note: this requires much testing. In if_pcn, mark the interrupt handler as MPSAFE since the driver appears to be locked. Note: this requires much testing. 20040811 patch: Integrate to CVS HEAD. Merged IFF_NEEDSGIANT for if_fwe to FreeBSD CVS. Merged lockless read of entropy harvest fifo count to FreeBSD CVS. Merged IFF_NEEDSGIANT for USB network interfaces to FreeBSD CVS. Merged splnet()->locking reference in comment in sosend() to FreeBSD CVS. Merged inpcbinfo and inpcb locking assertions for in_pcbconnect() and in_pcbconnect_setup() to FreeBSD CVS. Merged udp_send() fix to free control mbufs on the so_pcb pointer being NULL to FreeBSD CVS. Reconstituted udp_output() to try to avoid locking the udbinfo structure when no rebinding is performed. Even when the lock is acquired, try to reduce the time it is held for. Locking is pushed down from udp_send() so that more information is available to make the locking decision. In udp_output(), include experimental code to provide additional mbuf storage on the front end of the user data to provide room for a link layer header to try to avoid additional mbuf allocation for the ethernet header on send. Merge raw_ip6 send M_PREPEND() change to M_DONTWAIT to avoid sleeping while holding the raw pcb mutex. Include necessary error handling. 20040810 patch: Integrate to CVS HEAD. Merged ADAPTIVE_GIANT in GENERIC to FreeBSD CVS; results in 30% improvement in MySQL benchmarks with SMP w/o debug.mpsafenet, and 6%+ improvement with debug.mpsafenet. Scott Long reports 16% performance improvement on buildworld on SMP. Merged KTR system call tracing for i386 to FreeBSD CVS. Merged Giant pushdown in fcntl() to FreeBSD CVS. MA_NOTOWNED assertions disabled by default due to use of Giant by Linux ABI wrapper. Removed use of atomic operations in the mbuf allocator for statistics. Merged narrowing of scope of uidinfo locking in sbchgsize() to FreeBSD CVS. Merged extension of KTR_PROC tracing in mi_switch() to FreeBSD CVS. 
Merged addition of KTR_CALLOUT and callout tracing to FreeBSD CVS. Started adding comments and annotation of AIO structures in preparation for starting locking for AIO. Merged GIANT_REQUIRED assertions for VFS operations, push-down of Giant in some VFS operations to FreeBSD CVS. Removed use of task queue in SLIP in if_start. Merged removal of GIANT_REQUIRED in netatalk to FreeBSD CVS. Began to push down inpcb references into various socket option processing routines, including ip_pcbopts(), ip_setmoptions(), ip_getmoptions(). This will permit these routines to acquire inpcb locks when needed, whereas current acquisition of the locks in the calling code will result in holding a mutex over potentially sleeping copyin and memory allocation routines. Annotate need for locking in these routines. Merge KTR_UMA and basic UMA allocation and free tracing to FreeBSD CVS. 20040806 patch: Integrate to CVS HEAD. Convert TIMEOUT_SAMPLING callout/timeout tracing to using KTR, which provides a much better vehicle for analysis with context. Now uses KTR_CALLOUT. Modify KTR tracing for thread_exit() to include more thread context for analysis with mi_switch() and fork_exit(). Additional inpcb/inpcbinfo locking assertions for the inpcb connect code. Less inpcb locking for retrieving local and peer addresses from an inpcb. This could use refinement. Merged in6_pcbnotify() cleanup and fixes to FreeBSD CVS. Remove INP_LOCK_ASSERT() from tcp_time_2msl_stop(), as it's called after the inpcb is disconnected from the time wait state (and resulted in a NULL pointer dereference). Merged UDP broadcast/multicast receive locking optimization to FreeBSD CVS. 20040805 patch: Integrate to CVS HEAD. Perform a lockless read when harvesting entropy to check that the entropy fifo is not full, avoiding the mutex cost if it is. Reduce the size of the harvesting fifo experimentally. Add KTR tracing for system calls on i386. Similar changes are needed for non-i386. 
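The lockless fifo-full check in the 20040805 entry follows a common pattern: peek at the count without the lock, bail out early if the fifo looks full, and re-check once the lock is held. A minimal userspace sketch, assuming C11 atomics and a pthread mutex in place of the kernel primitives; struct demo_fifo, demo_fifo_init() and demo_harvest() are hypothetical names, and lock_ops exists only to make the saved lock operation visible.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define	DEMO_FIFO_MAX	4

/* Hypothetical harvest fifo; count is only written with lock held. */
struct demo_fifo {
	pthread_mutex_t	lock;
	atomic_int	count;
	int		slots[DEMO_FIFO_MAX];
	int		lock_ops;	/* lock acquisitions, for illustration */
};

static void
demo_fifo_init(struct demo_fifo *f)
{
	pthread_mutex_init(&f->lock, NULL);
	atomic_init(&f->count, 0);
	f->lock_ops = 0;
}

/* Returns true if the sample was queued, false if it was dropped. */
static bool
demo_harvest(struct demo_fifo *f, int sample)
{
	bool queued = false;
	int n;

	/* Unlocked peek: if the fifo looks full, drop the sample cheaply. */
	if (atomic_load_explicit(&f->count, memory_order_relaxed) >=
	    DEMO_FIFO_MAX)
		return (false);

	pthread_mutex_lock(&f->lock);
	f->lock_ops++;
	/* Re-check under the lock; the unlocked value may have been stale. */
	n = atomic_load_explicit(&f->count, memory_order_relaxed);
	if (n < DEMO_FIFO_MAX) {
		f->slots[n] = sample;
		atomic_store_explicit(&f->count, n + 1, memory_order_relaxed);
		queued = true;
	}
	pthread_mutex_unlock(&f->lock);
	return (queued);
}
```

A stale unlocked read can only cause a sample to be dropped or the lock to be taken unnecessarily; it never corrupts the fifo, because every write happens under the mutex.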
Merged GIANT_REQUIRED in fdfree(), setugidsafety(), fdcheckstd(), _fgetvp(), conditional assertion in fgetsock() to FreeBSD CVS. Merged spl() removal from chsbsize() to FreeBSD CVS. Merged lockless reads of bif_dlist in BPF packet tap to FreeBSD CVS to avoid BPF locking cost if there are no listeners. Don't harvest entropy in ether_input(); the current harvesting is a bug (and costly, due to mutex operations). Merged inpcb locking assertions in the presence of IPv6 to FreeBSD CVS. Pass inpcbinfo to in6_pcbnotify() rather than inpcbhead, as it needs to iterate the list of pcb's, requiring it to hold the info lock, as well as acquire inpcb locks before notifying them of events. Update various consumers, including UDP, TCP, and raw IPv6. Assert TCP inpcbs in TCP timers relating to TIME_WAIT, as they are required. Annotate possible locking problem of the global time wait list. Don't acquire inpcb locks for UDP pcb's when searching for potentially matching broadcast/multicast sockets. Acquire the inpcb mutex only when we've found a potential match. This avoids 120+ mutex operations per broadcast packet received in my local configuration (ouch!). Merge uidinfo locking key to FreeBSD CVS. Add rudimentary UMA KTR tracing. 20040802 patch: Integrate to CVS HEAD. Giant becomes optional for a number of fcntl() operations. Still held over fo_ioctl(). Documentation of uidinfo locking strategy. Slight optimization by reducing lock coverage. Spl removal. Trimming of possibly unnecessary sched_lock lockage in sched_4bsd userret(). Return path simplification in ioctl() relating to Giant dropping. Merged accept filter registration locking to FreeBSD CVS. Merged IFF_NEEDSGIANT to FreeBSD CVS; tweaks to individual device drivers not merged, however. Merged IPv6 in6pcb lock to FreeBSD CVS. 20040725 patch: Integrate to CVS HEAD. Introduce IFF_NEEDSGIANT, an interface flag to indicate that calls to ifp->if_start require Giant to be held. 
Abstract calls from the network stack to ifp->if_start behind if_start(). If a network driver sets IFF_NEEDSGIANT, the network stack will defer the call to a task queue that holds Giant if Giant isn't already held. This will permit less MPSAFE network device drivers to coexist with debug.mpsafenet=1 more easily. 20040724 patch: Integrate to CVS HEAD. Giant is no longer acquired in fdrop_locked() before calling fo_close(), and implementations of fo_close() now acquire Giant if they need it (kqueue, VFS). This doesn't get Giant completely out of the close() path as closef() does VFS-specific things that require Giant. Mostly merged to FreeBSD CVS; some changes to annotate and use of Giant in file descriptor close paths not merged. Giant no longer acquired in fstat() before calling fo_stat(), letting individual implementations acquire it if they need it (VFS). Merged to FreeBSD CVS. KTR tweaked to avoid additional newlines. Merged to FreeBSD CVS. Pipe allocation optimized to avoid acquiring mutexes and locks before the pipe is shared. Merged to FreeBSD CVS. Merged M_DONTWAIT change in raw_ip to avoid blocking allocations in raw socket send while holding a mutex to FreeBSD CVS. NFS server timeout/callout modified to run MPSAFE if mpsafenet is set. Merged to FreeBSD CVS. Sampling of timeouts/callouts now placed behind options TIMEOUT_SAMPLING. Add netatalk locks to hard-coded WITNESS lock order. Some assertion cleanup to synchronize with FreeBSD CVS. UNIX domain socket subsystem lock pushed further into uipc_rcvd() since it's not required over such a broad code base. Annotate a potential interest in dropping the UNIX domain socket lock before waking up the socket buffer, but note that we can't do that without acquiring a reference to the socket which is more expensive than just doing the wakeup while holding the lock. Annotate that the AIO code is not safe with debug.mpsafenet=1 due to its use of a socket upcall and inadequate locking of AIO data structures. 
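The IFF_NEEDSGIANT deferral from the 20040725 entry (calls to ifp->if_start routed through if_start(), with a hand-off to a Giant-holding task when the driver needs Giant and the caller does not hold it) might be modelled as below. This is a single-threaded simulation only: demo_giant_held is a plain flag standing in for Giant ownership, the "task queue" is reduced to a pending bit, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	DEMO_IFF_NEEDSGIANT	0x1

/* Hypothetical, much-reduced interface structure. */
struct demo_ifnet {
	int	flags;
	int	started;	/* times the driver start routine ran */
	bool	deferred;	/* a deferred start is pending */
	void	(*start)(struct demo_ifnet *);
};

/* Stand-in for "the current thread owns Giant". */
static bool demo_giant_held;

/* Driver start routine; a non-MPSAFE driver expects Giant to be held. */
static void
demo_driver_start(struct demo_ifnet *ifp)
{
	assert(!(ifp->flags & DEMO_IFF_NEEDSGIANT) || demo_giant_held);
	ifp->started++;
}

/* Stack-side wrapper: call directly when safe, otherwise defer. */
static void
demo_if_start(struct demo_ifnet *ifp)
{
	if ((ifp->flags & DEMO_IFF_NEEDSGIANT) && !demo_giant_held) {
		ifp->deferred = true;	/* hand off to the Giant-holding task */
		return;
	}
	ifp->start(ifp);
}

/* Body of the deferred task: acquires "Giant", then runs the driver. */
static void
demo_if_start_deferred(struct demo_ifnet *ifp)
{
	if (!ifp->deferred)
		return;
	demo_giant_held = true;
	ifp->deferred = false;
	ifp->start(ifp);
	demo_giant_held = false;
}
```

The cost of this compatibility path is the extra hand-off for each deferred start, which is consistent with the performance caveat about non-locked-down drivers.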
Additional Giant requirement annotation with assertions in VFS. Make use of lockless reads of the interface BPF descriptor list to avoid the cost of locking the interface if there are no active BPF listeners on the interface. Start to annotate use of ifnet structures in if.c as relates to locking. First cut at ifaddr locking for netatalk, introducing a global mutex to protect iteration of the ifaddr list, and when releasing the mutex, making sure to add references to any ifaddrs used after that point. Holds mutexes over potentially blocking malloc calls, so needs work (especially restructuring at_control()). GIANT_REQUIRED removed from netatalk code. 20040714 patch: Integrate to CVS HEAD. In SLIP, use a task queue to defer slstart() from the netisr context to a task queue so that the tty code can run with Giant. Untested. Merge ng_eiface, ng_fec, ng_iface, ng_ppp, ng_pppoe, ng_tty global locking and comments to FreeBSD CVS. 20040712 patch: Integrate to CVS HEAD. Disable PREEMPTION in the rwatson_netperf branch due to it introducing a large number of hangs that are distracting from debugging actual netperf problems. Will re-enable once that is fixed. Introduce a new function, sockbuf_pushsync(), for use in soreceive() when it needs to push changes in socket buffer state back into struct sockbuf (resynchronizing the cached pointers used for optimization). Merge soreceive() locking for control mbufs, sockbuf_pushsync(), and a variety of other locking and consistency improvements to FreeBSD CVS. Merge netatalk at_rmx debugging routine buffer changes to avoid a shared global string buffer to FreeBSD CVS. Fix several bugs in netatalk DDP PCB locking, merge locking to FreeBSD CVS. When performing a raw IP send, don't do an 'M_TRYWAIT' M_PREPEND() of the IP header while holding the raw IP mutex. Merge additional tcp_input() inpcbinfo and inpcb lock state assertion additions to FreeBSD CVS. Merge constification of spx_backoff and rpc_backoff to FreeBSD CVS. 
20040711 patch: Integrate to CVS HEAD. Sanitize socket buffer lock assertions in soreceive() to occur largely at the beginnings of code blocks to assert the flow of locking over the function. Break-out of non-inline out-of-band handling in soreceive() into soreceive_rcvoob() in FreeBSD CVS. Add additional locking and flow annotations to soreceive() in FreeBSD CVS. Additional socket buffer/stack consistency work relating to nextrecord, et al, in soreceive(). Merge socket locking in NFS client nfs_connect() to FreeBSD CVS. if_xl in FreeBSD CVS is now MPSAFE, local MPSAFEty changes submerged. Thanks to Bruce Simpson! 20040703 patch: Integrate to CVS HEAD. Additional inpcb assertions for IPv6 pcb infrastructure. Fix nits in IPv6 inpcb locking that could result in panics when using raw IPv6 sockets. Begin to constify parts of the rpc tree. 20040627 patch: Integrate to CVS HEAD. Merge so_global_mtx protection of so_gencnt, numopensockets to FreeBSD CVS. Merge socket buffer lock over unp_scan() to FreeBSD CVS. Modify prsockaddr() in netatalk to use stack storage for its temporary buffer, and hold the buffer in the caller's stack. Lock down global unit registration lists in ng_eiface and ng_fec using a per-class mutex. This is the same strategy as used in ng_iface. Lock down access to ng_ppp_latencies in ng_ppp; this is ugly, but is a result of existing ugliness. Lock down global netgraph socket list in ng_socket. Annotate locking weakness in ng_pppoe: the eh_prototype isn't locked, but is mutable. Lock down some but not all globals in ng_tty.c. Annotate a weakness associated with ngt_nodeop_ok, which confuses me. Annotate segment as const in ng_frame_relay.c. Enable inpcb locking assertions regardless of inclusion of IPv6 in the kernel. Add additional inpcb and inpcbinfo locking assertions to tcp_input() after each label to maintain assertions regarding locking assumptions throughout. Initial inpcb locking for raw_ip6.c, udp_usrreq6.c. 
Diff reduction against mac_socket.c -- return (0), not (error). No semantic change. 20040626a patch: Integrate to CVS HEAD. Merge soabort() comment regarding locking to FreeBSD CVS. Merge lock coalescing between sbappend*() and sowakeup() to avoid extra unlocks/locks of socket buffer mutexes on wakeup following an append on the socket buffer. Likewise in soisdisconnecting() and soisdisconnected(). Add ng_iface_mtx to protect ng_iface unit allocation bitmap from unsafe concurrent access. Merge removal of spl's from tcp_usrreq.c to FreeBSD CVS. 20040626 patch: Integrate to CVS HEAD. When an sbappend*() variant is followed by a call to so[rw]wakeup() in a protocol, explicitly acquire the socket buffer lock in the protocol and use the locked variants of these calls, avoiding a socket buffer mutex unlock/lock for each instance. In heavy-contention mysql SMP test environment with UNIX domain sockets, this resulted in a 1.5% performance boost. This change affects both generic socket functions and protocol-specific send and receive routines. 20040624 patch: Integrate to CVS HEAD. Merge socket buffer locking of high and low watermark socket option setting via SO_SNDLOWAT and SO_RCVLOWAT to FreeBSD CVS. Merge socket buffer locking of sb_cc/so_oobmark in SPX to FreeBSD CVS. Merge socket buffer lock assertion in sowwakeup_locked() to FreeBSD CVS. 20040623a patch: Integrate to CVS HEAD. Merge portalfs socket, socket buffer locking to FreeBSD CVS. Merge coverage of selrecord() in sopoll() by socket buffer locks to FreeBSD CVS. Merge sbreserve_locked(), soreserve() locking to FreeBSD CVS. Merge use of socket lock and msleep() in kern_connect() to FreeBSD CVS. Merge ng_base.c locking fixes for ng_ID_hash and debugging sysctls for ng_allnodes and ng_allhooks to FreeBSD CVS. Merge conditional assertion of Giant when asserting other stack locks (inpcb, inpcbinfo, dummynet, ipfw, mrouter mfc) to FreeBSD CVS. Merge initial inpcb locking in ip_ctloutput() to FreeBSD CVS. 
Merge expansion of socket buffer locking to cover read-modify-write of sb_cc and sbdrop() in TCP ACK processing to FreeBSD CVS. Merge socket buffer locking of so_oobmark in TCP to FreeBSD CVS. Merge socket buffer locking of high and low watermark in tcp_mss() to FreeBSD CVS. Merge constification of natm send/receive space constants to FreeBSD CVS. Merge socket buffer locking of sb_flags in nfs_socket.c to FreeBSD CVS. Merge addition of mac_ifnet_mtx, protection of ifnet MAC labels, introduction of mpo_copy_ifnet_label(), and policy modifications to FreeBSD CVS. Merge annotation of so_error, so_oobmark locking to FreeBSD CVS. Merge annotation of sb_flags fields. 20040623 patch: Integrate to CVS HEAD: pick up TCP SACK support. Trim incorrect "Unlocked read" comment that is no longer accurate for fifofs. Acquire socket buffer lock around SO_SNDLOWAT and SO_RCVLOWAT socket option setting. For soo_poll(), acquire the socket buffer lock before performing selrecord() on the socket buffer. Additional annotations about socket connection state. Remove possible "coallesce locking here" comments regarding adjacent calls to sorwakeup() and sowwakeup(), as that's no longer accurate due to the wakeup calls dropping locks to avoid holding them over upcalls. Annotate a possible bug in so_state handling when propagating socket state bits from a listen socket to an accept socket in sonewconn(). Introduce sbreserve_locked() and sbrelease_locked() with the normal unlocked wrappers, assertions, etc. Hold the socket buffer locks throughout soreserve() to prevent races during various buffer size calculations, reservation operations, etc. Acquire send lock before receive lock. Acquire socket buffer mutexes around watermark calculations and socket buffer reservations in tcp_mss(). 20040622 patch: Integrate to CVS HEAD: pick up ifnet clone locking from Brooks Davis. Trim stale "Unlocked read" annotations. Merge trimming of a number of spl() statements to FreeBSD CVS. 
Introduce basic locking of so_state, socket buffer in portalfs. Lock down so_gencnt, numopensockets using a new global mutex so_global_mtx. Possibly atomic operations should be used instead. Merge move to using inpcb label where possible when performing MAC labeling in divert sockets to FreeBSD CVS. Merge socket buffer locking of some socket buffer fields and call to sbdrop() in tcp_input() to FreeBSD CVS. Receive socket buffer lock now used to protect so_oobmark field of socket structure, along with setting of SBS_RCVATMARK atomically. In KAME IPSEC and FAST_IPSEC, use rawcb_mtx to protect iteration over the raw socket list looking for pfkey sockets to deliver to. We may need to introduce use of a netisr here similar to that used in routing sockets to avoid lock order reversals. Socket buffer lock asserted in sowwakeup_locked() now, not just in sorwakeup_locked(). 20040620b patch: Integrate to CVS HEAD. Merge annotations of unlocked reads in socket ioctls and elsewhere to FreeBSD CVS. Merge reformulation, socket buffer locking, and addition of _locked variants of sowakeup(), sbrelease(), sbflush(), socantsendmore(), socantrcvmore(), sbappend(), sbdrop(), sbinsertoob(), and more to FreeBSD CVS. Merge conversion of if->panic in soclose() to KASSERT to FreeBSD CVS. Merge cleanup of sorflush(), including proper initialization and copying of the socket buffer to a temporary buffer to FreeBSD CVS. Merge locking of SO_LINGER socket option to FreeBSD CVS. Merge locking of opposing sockets in uipc_rcvd() and elsewhere in UNIX domain sockets to FreeBSD CVS. Merge not locking Giant in ip_mroute.c's socket_send left over from earlier debug.mpsafenet semantics. Merge inpcb assertion in tcp_input() for MAC Framework. Merge conditional initialization of TCP timers as CALLOUT_MPSAFE based on debug.mpsafenet to FreeBSD CVS. Merge soabort() locking/queueing updates for SPX to FreeBSD CVS. Merge annotations of NET_{LOCK,UNLOCK}_GIANT() to FreeBSD CVS. 
Merge annotation of so_state locking, sb_flags in socketvar.h to FreeBSD CVS. 20040620a patch: Integrate to CVS HEAD. Re-order socant{rcv,send}{,_locked}() functions to put locked variants first for consistency with other similar pairs. Combine all sowakeup(), sowakeup_locked(), and sowakeup_under_lock() into a single sowakeup() function. sowakeup() asserts the socket buffer lock on entry such that it may be called atomically with an sb_notify() check in sorwakeup(), sowwakeup(), and others. However, it now releases the socket buffer lock before performing upcalls (et al) to prevent lock order reversals when upcalls call back into the socket code. Modify sorwakeup(), sowwakeup(), and associated macros to take this into account. Modify socantrcvmore(), socantrcvmore_locked(), socantsendmore(), and socantsendmore_locked(), which call sowakeup() variants, to assume that the socket buffer lock is released on return; re-acquire if needed (this is generally safe because of the use of sblock() and existing assumptions about the lock being dropped). Add assertions to check that that is the case in several places. These changes result in better-defined locking for a variety of wakeups, including upcalls. It also coalesces some locking operations, reducing the overhead and complexity of locking. 20040620 patch: Integrate to CVS HEAD. Attempt to generally clean up locked vs. unlocked interfaces in the socket code, avoiding conditional locking in a number of additional situations: Add sbrelease_locked(), sbflush_locked(), sbappendcontrol_locked(), which all assert the socket buffer lock. The unlocked versions of these unconditionally acquire the socket buffer lock. No longer need to acquire the socket buffer lock in TCP before calling sbflush(). Remove conditional locking from sbappend(), sbappendstream(), sbappendrecord(), sbinsertoob(), sbappendaddr(), sbappendcontrol(). It is no longer required due to better defined locking conditions when these functions are called. 
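The locked/unlocked interface split described in the 20040620 entry (a _locked variant that asserts the socket buffer lock, plus an unlocked wrapper that unconditionally acquires it) can be sketched in userspace as follows. The names are hypothetical, and the "held" flag is a much weaker check than a kernel mutex assertion: it records only that somebody holds the lock, not that the calling thread does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, much-reduced socket buffer. */
struct demo_sockbuf {
	pthread_mutex_t	mtx;
	bool		held;	/* bookkeeping for the lock assertion */
	int		cc;	/* bytes queued */
};

static void
demo_sb_lock(struct demo_sockbuf *sb)
{
	pthread_mutex_lock(&sb->mtx);
	sb->held = true;
}

static void
demo_sb_unlock(struct demo_sockbuf *sb)
{
	sb->held = false;
	pthread_mutex_unlock(&sb->mtx);
}

#define	DEMO_SB_LOCK_ASSERT(sb)	assert((sb)->held)

/* Locked variant: the caller must already hold the lock. */
static void
demo_sbflush_locked(struct demo_sockbuf *sb)
{
	DEMO_SB_LOCK_ASSERT(sb);
	sb->cc = 0;
}

/* Unlocked wrapper: always acquires the lock, never conditionally. */
static void
demo_sbflush(struct demo_sockbuf *sb)
{
	demo_sb_lock(sb);
	demo_sbflush_locked(sb);
	demo_sb_unlock(sb);
}

/*
 * A caller that needs two operations holds the lock once and uses the
 * locked variant, avoiding an unlock/lock pair between the steps.
 * Returns the byte count observed before the flush.
 */
static int
demo_append_and_flush(struct demo_sockbuf *sb, int len)
{
	int peak;

	demo_sb_lock(sb);
	sb->cc += len;
	peak = sb->cc;
	demo_sbflush_locked(sb);
	demo_sb_unlock(sb);
	return (peak);
}
```

Because neither function takes a "needlock" argument, each call site states its locking assumption by the variant it picks, and the assertion catches a wrong choice at run time.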
In sbappend_locked(), assert the socket buffer lock before validating arguments to catch locking problems that might otherwise be missed. Modify panic()'s in _locked versions of calls to use the correct function name in their panic string. Clean up and annotate problems in sorflush() involving the local stack copy of the receive socket buffer during cleanup. This code has been present in the stack since early BSD, and is intended to allow continued access to the socket buffer during the disposal process, which may block. Make sure to zero and copy appropriate sections of the new socket buffer, in particular to zero the mutex so that new locking assertions in sbrelease() don't result in a panic (as the socket buffer is copied while the original has its mutex locked). Initialize and tear down the temporary socket buffer mutex so it can be used for locking. Correct a socket buffer lock leak in UDP which occurred during a race on UDP socket close. Whitespace synchronization. 20040619 patch: Integrate to CVS HEAD. Merge initial socket buffer locking of sosend(), soreceive(), sofree() to FreeBSD CVS. Also merge lock assertions in sb_lock() and sbwait(), move to using msleep() instead of tsleep() in those functions. Remove some spls. Note that the locking of sosend()/soreceive() is different in CVS than in the rwatson_netperf branch as it doesn't include some restructuring; it may include races not present in the branch as a result. Some structural changes looped into rwatson_netperf. Merge locking of so_options and SO_DONTROUTE in sosend(). Annotate a commented out sbwait() call that it will need socket buffer locking. Add some initial, and likely broken, calls to lock the inpcbs in ip_ctloutput(). There is more work to do here, as some of these functions may block allocating (annotated). Pick up a fix to not hold the inpcb lock over calls to ip_ctloutput() in tcp_ctloutput(). 20040617 patch: Integrate to CVS HEAD. 
Merge low-hanging-fruit locking of sb_flags, so_state, so_options using SOCKBUF_LOCK(), SOCK_LOCK() to FreeBSD CVS. In particular, protect SB_KNOTE and kqueue-related socket behavior. Merge conditional locking of the socket buffer in socket and fifofs kqueue filters; the socket buffer lock is already held if calling KNOTE() from the socket code, but if called from kqueue(), won't be. More consistently name needlock variables. Merge simplification of sodisconnect() logic now there are no spls. Merge raw_cb mutex protection of raw socket control blocks to FreeBSD CVS. Merge conversions of GIANT_REQUIRED to NET_ASSERT_GIANT to FreeBSD CVS. Merge struct socket annotations to FreeBSD CVS. Add sbdrop_locked() and sbdroprecord_locked() to avoid conditional locking of the socket buffer in sbdrop() and sbdroprecord(). Update consumers to call the right variant. Annotate the reason for conditional socket buffer lock acquisition in filt_so{read,write}(). Annotate situations where we acquire the socket lock, then the socket buffer lock, and in the future we can reduce locking overhead if we maintain the assumption that the socket lock and receive socket buffer lock are the same. 20040615 patch: Integrate to CVS HEAD. Partially merged locking of sb_state field for FreeBSD CVS. 20040614 patch: Integrate to CVS HEAD. Move of SS_{CANTRCVMORE,CANTSENDMORE,RCVATMARK} from so_state to sb_state (and rename to SBS_*) merged to FreeBSD CVS. 20040613a patch: Integrate to CVS HEAD. Correct merge-nit in subr_witness.c lock list. 20040613 patch: Integrate to CVS HEAD. Socket MAC label locking (so_label, so_peerlabel) merged back to FreeBSD CVS. Make thread-local copies of socket MAC labels before externalizing. Hard-coded witness lock orders for UNIX domain sockets and socket mutexes merged to FreeBSD CVS. Locking of socket reference count (so_count) using SOCK_LOCK(so). Acquire socket lock around so_state manipulation in unp_disconnect(). 
20040612 patch: Integrate to CVS HEAD (altq import). Merge presence of sb_mtx in struct sockbuf to FreeBSD CVS. Merge initialization/destruction/lock/unlock/assertion macros. Merge calls to initialize and destroy mutexes in soalloc()/sodealloc(). Merge locking of so_count using SOCK_LOCK(so) to FreeBSD CVS. Add several missing calls to SOCK_LOCK(so) before sotryfree() in less well reviewed sections of the stack, such as netatalk, netipx, netnatm, elsewhere. Whitespace cleanups in socketvar.h. 20040611 patch: Integrate to CVS HEAD. Change to conditional Giant acquisition in bpfwrite() before entering if_output() merged to FreeBSD CVS. Constification of raw_{send,recv}space merged to FreeBSD CVS. Merge of igmp_mtx protection of router_info (et al) in igmp.c to FreeBSD CVS. Merge removing Giant acquisition from divert_packet() in ip_divert.c to FreeBSD CVS. Remove socket buffer handling in TCP and UDP COMMON_START(), COMMON_END() since it's now unused as a result of not holding socket buffer locks over protocol downcalls. 20040610 patch: Integrate to CVS HEAD. Remove "unp head" from WITNESS lock order, as it's no longer used. Merge UNIX domain socket locking to FreeBSD CVS. 20040609 patch: Integrate to CVS HEAD. Introduce a netisr into the routing socket code when the kernel generates a routing message via rt_dispatch(), in order to avoid recursing into the socket code from the routing code. This avoids lock order issues between route locks/raw socket locks, as well as between socket locks when the socket code calls into the routing code. The sockaddr family is stapled onto the mbuf queued for the netisr using an m_tag. Reviewed by George Neville-Neil. Merged rtsock/netisr changes to FreeBSD CVS. sun_noname sockaddr constant in UNIX domain socket code made const, since it's immutable. Updated various references, modified sodupsockaddr() to accept a const sockaddr pointer as source, and sbappendaddr() to accept a const sockaddr. 
Modify mbuma to use atomic operations for statistics to prevent loss of accuracy of counters and help in debugging leaks. It would be nice to use pcpu stats here. soisconnected() compile fix for !INVARIANTS, submitted by Bosko Milekic. accept1() bugs that leak file descriptor references fixed. Bug spotted by Brian Feldman. Large-scale simplification of UNIX domain socket locking: the fine-grained per-unpcb locking was entirely masked by holding the UNIX domain socket head mutex. Rather than switch to only fine-grained locking now, remove the per-unpcb mutex and just use the subsystem lock. We will want to revisit this later, but it simplifies UNIX domain sockets a great deal and improves performance over what was previously present. Several bug fixes in UNIX domain socket locking relating to Giant being held, proper coverage, and invariants, reported by George Neville-Neil. Also annotate a potential race in unp_bind(). In MAC, copy the ifnet label while holding the ifnet label lock, then externalize that, so as to avoid copying out while holding the mutex or doing unserialized copying. 20040603 patch: Integrate to CVS HEAD. Some TCP/IP, routing, SLIP, and socket hard-coded lock order entries merged to FreeBSD CVS. Socket buffer locks are no longer held over calls to pru_rcvd(), greatly simplifying protocol lock handling implementing this protocol entry point. The protocols (specifically, TCP and UNIX domain sockets) no longer need to release and re-grab the socket buffer lock for ordering reasons. Annotate a potential race in soreceive() due to failing to re-check a condition before sleeping resulting in a potential hang. To be resolved shortly. Constify the sun_noname structure in UNIX domain sockets as it is immutable and hence requires no synchronization. Rename global unp_mtx in UNIX domain sockets to unp_head_mtx to prevent a name collision with the per-unpcb mutex, unp_mtx. 
Reformat some UNIX domain socket protocol entry points to match the structure in CVS, reducing diff size. When copying out the peer credential, don't perform a sooptcopyout() while holding a mutex. Instead make a stack copy which is copied out once the mutex is released. Update the UNIX domain socket count and generation number while holding unp_head_mtx when attaching a new UNIX domain socket to prevent inconsistency. Annotate some potential Giant-related problems in UNIX domain sockets. Make sure to hold Giant when calling vrele() on a detach. Annotate a test-and-set race relating to the unp_vnode pointer during unp_bind(). Having worked with Bosko to improve handling of inpcb locking in the raw IP code, integrate his locking fixes which remove an annotation about lock acquire and release for MAC checks. Begin a locking key for struct unpcb fields. 20040602 patch: Integrate to CVS HEAD. Socket accept() locking, including addition of accept_mtx, addition of ACCEPT_LOCK()/ACCEPT_UNLOCK() macros, use of the accept mutex to protect all queue-related fields in struct socket, and reorganizing of soclose(), sofree(), accept1(), soisconnected(), et al, to take into account new locking merged to FreeBSD CVS. VFS advisory lock handling in fdrop_locked() was pushed into vn_closefile() to get VFS-specific behavior out of the file descriptor layer "last close" function. Giant now asserted in some of the per-dtype close functions, and not others in preparation for pushing Giant acquisition down into the per-dtype fo_close() routine. Merged to FreeBSD CVS. Accept filter global list locking merged to FreeBSD CVS. SPX use of soabort() modified so that it removes incomplete sockets from the incomplete socket queue before aborting, as soabort() no longer does that. Annotate a possibly poor use of soabort() in tcp_syncache.c; since the socket is already fully connected, it will be left on the accept queue to be garbage collected in the listen queue consumer.
This may require revisiting. Fix socket buffer bug introduced in NFS server socket code that could result in lock recursion. 20040531a patch: Integrate to CVS HEAD, including Don Lewis's socket/fifofs changes to add MSG_NBIO, eliminating races in fifofs. fifofs locking simplified by MSG_NBIO, several XXXRW's about non-local races removed since SS_NBIO is no longer frobbed on each side of a socket operation. so_qstate, SQ_INCOMP, and SQ_COMP merged to FreeBSD CVS. 20040531 patch: Integrate to CVS HEAD, including mbuma. Additional locking of sb_flags using socket buffer lock in portalfs, NFS client, NFS server, poll(), AIO, bluetooth sockets, netgraph sockets. Don't hold socket buffer lock over pru_send() in do_sendfile() to avoid socket->protocol lock order problems. Introduce a global lock in the MAC Framework to protect security labels on network interfaces, since there isn't currently sufficient locking in the caller for us to piggy-back onto. This is a temporary solution. Assert the socket buffer lock in sbunlock(). 20040530 patch: Integrate to CVS HEAD. Socket MAC label use in svr4 stream emulation now protected by SOCK_LOCK(so). Likewise for socket label use in netatalk via the DDP code. Annotation of an issue in fifofs where SS_NBIO is set before and removed after an I/O on the socket: locking protects consistency of the field, but not interlaced use due to the lock release over the socket operation. Don Lewis has a WIP patch that introduces MSG_NBIO, which is what this code actually wants. Several NFS locking fixes, including a couple of places where Giant was not acquired before entering VFS from the NFS server. Also a fix to the NFS nfssvc() system call to actually run without Giant, as its modevent code overwrites the system call argument field, removing the MPSAFE flag. Additional assertions regarding Giant in the NFS server code to detect leaks of Giant, if any, from the NFS implementation.
Add GIANT_REQUIRED to vrele() to catch possible entry into VFS using vrele() without Giant held. Contents of rwatson_net2 merged to rwatson_netperf: (1) Move state from struct socket to struct sockbuf in order to take advantage of locking on the socket buffer. Move SS_CANTRCVMORE, SS_CANTSENDMORE, and SS_RCVATMARK to so_{rcv,snd}.sb_state and rename to SBS_whatever. Consistently protect this field with the socket buffer lock, which is frequently held at the right moment anyway (unlike the socket lock). (2) Attempt to more consistently lock read-modify-write on so_state, so_linger, and so_options with the socket lock. Note that the SS_.*CONNECT.* flags are the last remaining under-locked flags in the socket state field. (3) Add a dump_socket() function that I use during debugging. Needs to be GC'd at some point. (4) Reformulate locking in soclose() to avoid holding the socket lock over much of the body of the function, in particular for the call into the protocol via pru_detach(). This means that calls from the protocol back into the socket layer don't have to worry about lock orders or recursion. We may need to re-extend some of the influence of the socket lock over other bits of soclose() as locking on so_state becomes more mature. (5) In sosend(), drop the socket buffer lock earlier to avoid holding it over calls to pru_send(), simplifying inter-layer locking by allowing protocol send routines to grab socket layer locks without too much concern for lock order and recursion issues. (6) Annotate some calls to pru_rcvd() where we hold the socket or socket buffer lock over the call into the protocol layer as requiring similar attention to pru_send() locking. (7) Rewrite the accept filter attachment code to improve locking.
Socket layer upcalls from the protocol remain one of the stickiest issues requiring cleanup, but the new formulation should reduce possible races against accept filter attachment and removal, which previously could race (even without fine-grained locking due to poor use of the M_WAITOK flag). (8) Re-lock soisconnected() to assume it is not called with a socket lock held: remove conditional locking, and unconditionally acquire various locks as needed. Note the upcall issue, and try to call the upcall without any socket locks held. (9) Likewise for soisdisconnecting(). Toast conditional locking, use explicit and unconditional locking. (10) Likewise soisdisconnected(). Toast conditional locking, use explicit and unconditional locking. (11) Try to use socantsendmore() and friends in preference to manually setting the flag and repeating the wakeup logic in various places. (12) As a result of avoiding holding socket buffer and socket locks over more protocol entry points, we no longer need to perform lock re-ordering in several places -- specifically, UNIX domain sockets and TCP stack. This seems to substantially simplify several bits of code, and I need to follow this strategy to its natural extension. (13) Annotate locking strategy for sockets and socket buffers in socketvar.h. (14) Annotate additional unlocked reads and possibly incorrect locking or lack of locking. Especially some potential issues in solisten(), the need to call soabort() without holding socket locks. There remain issues here. 20040524 patch: Integrate to CVS HEAD. SOCK_LOCK(so) now consistently held across soref() to protect reference count (so_count). File descriptor lock added to lock order before socket locks, as the file descriptor lock must be held over soref() so as not to lose the socket in a race against the protocol layer. Integrate a sanitized version of accept locking from rwatson_net to rwatson_netperf so it can be merged to rwatson_netperf. 
This patch introduces a new mutex, accept_mtx, which will serialize access to the following fields across all sockets: so_qlen, so_incqlen, so_qstate, so_comp, so_incomp, so_list, so_head. In the previous world order, these fields were inconsistently or occasionally protected with various locks, largely the socket buffer receive lock on either the head socket or the member sockets. While the new locking doesn't have much granularity, it avoids lock order issues nicely, and appears to close several races. While here, rewrite soclose(), sofree(), soaccept(), and sonewconn() to add assertions, close additional races and address lock order concerns. In particular: - Reorganize the optimistic concurrency behavior in accept1() to always allocate a file descriptor with falloc() so that if we do find a socket, we don't have to encounter the "Oh, there wasn't a socket" race in the current code, which broke ordering, not to mention requiring backing out socket state changes in a way that raced with the protocol level. - In accept1(), soref() the socket while holding the accept lock so that the socket cannot be free'd in a race with the protocol layer. - In sonewconn(), loop waiting for the queue to be small enough to insert our new socket, or races can occur that cause the incomplete socket queue to overfill. - In sofree() and other functions, generally simplify the locking by losing many or all references to socket buffer locks. Insert the accept mutex into the lock order after the file descriptor lock but before the socket locks, since we must bump the socket reference count while holding the accept lock. Move the SS_COMP and SS_INCOMP flags from so_state to so_qstate, since they are synchronized differently from the other so_state flags. Rename the flags to SQ_COMP and SQ_INCOMP to prevent confusion and foot-shooting. 20040523b patch: Integrate to CVS HEAD. NFS server subsystem lock (nfsd_mtx) and constification merged back to FreeBSD CVS repository. Patch size -= 64k!
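The accept queue locking described in the 20040524 entry above can be sketched in userland C with pthreads. This is an illustration only, not the kernel code: struct sock, struct listener, sock_init(), enqueue_completed(), and dequeue_completed() are hypothetical stand-ins, and the queue is a simple LIFO list rather than the kernel's queue. What it tries to show is one global mutex covering all queue fields, and the reference being taken on the socket while the accept lock is still held, with the per-socket lock acquired after the accept lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userland stand-in; not the kernel's struct socket. */
struct sock {
	pthread_mutex_t	 so_mtx;	/* per-socket lock, protects so_count */
	int		 so_count;	/* reference count */
	struct sock	*so_next;	/* queue linkage, protected by accept_mtx */
};

/* Hypothetical listen socket holding only the queue-related fields. */
struct listener {
	struct sock	*so_comp;	/* completed queue, protected by accept_mtx */
	int		 so_qlen;	/* queue length, protected by accept_mtx */
};

/* One global mutex serializes every queue field on every socket. */
static pthread_mutex_t accept_mtx = PTHREAD_MUTEX_INITIALIZER;

static void
sock_init(struct sock *so)
{
	pthread_mutex_init(&so->so_mtx, NULL);
	so->so_count = 0;
	so->so_next = NULL;
}

/* Protocol side: place a connected socket on the completed queue. */
static void
enqueue_completed(struct listener *head, struct sock *so)
{
	pthread_mutex_lock(&accept_mtx);
	so->so_next = head->so_comp;	/* LIFO for brevity */
	head->so_comp = so;
	head->so_qlen++;
	pthread_mutex_unlock(&accept_mtx);
}

/*
 * Consumer side: remove a socket and take a reference before the accept
 * lock is dropped, so the socket cannot be freed in between.
 * Lock order: accept_mtx first, then the per-socket lock.
 */
static struct sock *
dequeue_completed(struct listener *head)
{
	struct sock *so;

	pthread_mutex_lock(&accept_mtx);
	so = head->so_comp;
	if (so != NULL) {
		head->so_comp = so->so_next;
		so->so_next = NULL;
		head->so_qlen--;
		pthread_mutex_lock(&so->so_mtx);
		so->so_count++;
		pthread_mutex_unlock(&so->so_mtx);
	}
	pthread_mutex_unlock(&accept_mtx);
	return (so);
}
```

A caller would dequeue, use the socket, and drop the reference later under the socket lock alone. The coarse global lock trades granularity for a simple, fixed lock order.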
20040523a patch: Integrate to CVS HEAD. In fdrop_locked(), use the file descriptor type (fp->f_type) to determine if we should acquire Giant before calling fo_close(). Don't acquire Giant for pipes or sockets, but do for all else. I'd like to push the Giant acquisition and VFS bits down a layer, but haven't done it yet. Add so_rcv to the hard-coded lock order for routing. Add SLIP mutexes to the hard-coded lock order. Merge the accept_filter_mtx changes from rwatson_net2, which synchronize access to the global accept_filtlsthd list. Note that the existing lack of refcounting (and therefore prohibition on accept filter unload) remains. Rename rawcb_mtx to rawcb for WITNESS to be consistent with other mutexes. Acquire SOCK_LOCK(so) in tcp_attach() when frobbing the socket state to prevent the inpcb detach from releasing the socket. Annotate the possible problems. Re-annotate NFS server locking to remove comments that no longer apply, clarify some current ones, and remove some #if 0'd code that is now definitely not needed. 20040523 patch: Integrate to CVS HEAD. MAC label locking on sockets using the socket lock; the socket lock is grabbed around calls to mac_check_socket_visible(), mac_check_socket_receive(), mac_check_socket_send(), mac_create_socket_from_socket(), mac_check_socket_listen(), mac_check_socket_connect(), mac_set_socket_peer_from_socket(), mac_create_inpcb_from_socket(), mac_inpcb_sosetlabel(), and mac_create_mbuf_from_socket(). The MAC Framework now asserts the socket locks in those entry points. Note that the socket lock is not held around the call to pru_sosetlabel() because it must be acquired after the PCB lock in the protocol layer in order to not violate the lock order. This means dropping the socket lock after a relabel. The MAC Framework now makes a copy of the socket label and socket peer label before externalizing when the label is retrieved using a system call or socket option, such that it can be externalized after the socket lock is released.
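The copy-then-externalize approach for socket labels can be sketched roughly as follows, in userland C with pthreads. The names here (struct label, struct sock, label_externalize(), sock_getlabel()) are hypothetical and are not the MAC Framework's interfaces; the point is only that the lock covers a cheap structure copy, and the potentially slow conversion runs on the private copy after the lock is released.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical label; real MAC labels are opaque policy storage. */
struct label {
	int	l_level;
	char	l_name[16];
};

/* Hypothetical socket carrying a label protected by its own lock. */
struct sock {
	pthread_mutex_t	so_mtx;
	struct label	so_label;	/* protected by so_mtx */
};

/*
 * Convert a label to text.  Stands in for work that may be slow or may
 * sleep (such as copying out to user space), so it runs without locks.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
label_externalize(const struct label *l, char *buf, size_t len)
{
	int n;

	n = snprintf(buf, len, "%s/%d", l->l_name, l->l_level);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/* Snapshot the label under the lock, then externalize the snapshot. */
static int
sock_getlabel(struct sock *so, char *buf, size_t len)
{
	struct label copy;

	pthread_mutex_lock(&so->so_mtx);
	copy = so->so_label;		/* short critical section */
	pthread_mutex_unlock(&so->so_mtx);

	return (label_externalize(&copy, buf, len));
}
```

The result may be stale by the time the caller sees it if a relabel races with the read, but it is always an internally consistent snapshot.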
mac_create_mbuf_from_inpcb() now widely preferred to mac_create_mbuf_from_socket() when below the socket layer to avoid acquiring socket locks in the protocol code. In one or two edge cases, I couldn't change it over because an inpcb was not used during send (ip_divert). Locked down the global accept filter list with accept_filter_mtx to maintain accept filter list consistency in the presence of multiple registrations and attachments in parallel. soisconnecting() now unconditionally grabs the socket lock before frobbing so_state. soisconnected() lock handling updated to simplify lock handling in the "not an accept socket" case for non-listening connections. Cleared addressed XXX's from netatalk/aarp.c. Bug fix to unlock the DDP list in ddp_connect(); otherwise in the EISCONN error case we leaked the lock. soref() now asserts the socket lock since it frobs the reference count. One of the changes integrated from CVS was a change from Don Lewis to avoid using the vnode interlock to synchronize fifo operations, which should correct a lock order reversal between the socket buffer lock and the vnode interlock in the rwatson_net2 work. NOTE: Additional work to address socket locking concerns is occurring in the rwatson_net2 branch, including breaking out so_state, re-locking the accept queues, and more. These changes will be merged to rwatson_netperf in the next few days as they mature. The current diff from rwatson_netperf to rwatson_net2 may be found in 20040523-rwatson_net2.diff. 20040503 patch: Integrate to CVS HEAD. Annotate possible unlocked or incorrectly locked use of so->so_state. Performed a survey of all so_state use and recorded results and annotations in notes-so_state.txt. Converted many instances of /* XXXRW: socket lock? */ to /* XXXRW: so_state locking? */. Added hard-coded WITNESS lock orders for a number of relevant network stack locks, including route, interface, socket, socket buffer, inpcb.
Generally modified the MAC Framework to use inpcbs as the source for mbuf labels rather than the socket to avoid the need for reaching up to the socket layer where avoidable, which would require socket layer locks. Annotate the fact that if_attach() has an ordering problem relative to adding the ifnet to global lists before properly initializing all ifnet fields. Review locking of netgraph ng_base.c and add annotations for mutexes in macros, as well as add locking to debugging routines called via sysctls. Notes on netgraph locking in notes-netgraph.txt. Still a work in progress. Annotate a possible lack of inpcb locking in raw IP; I added inpcb locking for the call into the MAC Framework but it may also be generally needed. 20040420 patch: Integrate to CVS HEAD. Lock Giant around calls into VFS in sendfile(). Submitted by Pawel Jakub Dawidek. Don't unlock Giant when it's not locked when a write is denied by the NFS server due to a read-only NFS mount or read-only underlying file system. Reported by Kris Kennaway. 20040418 patch: Integrate to CVS HEAD. Add if_ppp softc list locking from Maurycy Pawlowski-Wieronski. Merged to CVS. 20040416 patch: Integrate to CVS HEAD. NFS server panic fix from Pawel Dawidek. Garbage collect call to ipsec_getsocket() in ip_output(), which may have been a property of an earlier mismerge from some or another branch to netperf. 20040412 patch: Integrate to CVS HEAD. Close several race conditions in soreceive() relating to the stack local nextrecord variable, 'm->m_nextpkt' pointer, and socket buffer mbuf list fields: always synchronize these variables before releasing the socket lock, and after acquiring the socket lock, or sosend()/soreceive() may use inconsistent values leading to socket buffer inconsistency. Revert behavior to not potentially block when copying control mbufs with MSG_PEEK.
This corrects panics running with (and without) socket buffer consistency assertions enabled on my dual-processor Xeon box when running sshd in privsep mode due to races during simultaneous send and receive of control mbufs on a socket. 20040411 patch: Integrate to CVS HEAD. Various 0->NULL changes, in part integrated from CVS, some local. Bugfix in sbappendcontrol() to invoke sbappendcontrol_locked() instead of recursing. 20040408a patch: Integrate to CVS HEAD. In fifofs, replace XXXRW notes relating to access to so_rcv and so_snd with actual socket buffer locking. Unconditionally lock during KQ registration of fifos. Conditionally lock socket buffers in filt_fiforead() and filt_fifowrite() to match similar locking in socket code. Correct assertion typo in uipc_socket.c. Assert socket buffer lock in sowakeup_locked(). KNOTE() locking in socket code is generally right, subject to limitations of KQ itself, so remove XXX locking notes from KNOTE() calls. Do hold socket buffer locks in soreserve() when modifying various socket buffer parameters read-modify-write, since soreserve() can be called at various points. Add XXX's in socket wakeup code that calls to so_upcall() need more attention in the locking department. 20040408 patch: Integrate to CVS HEAD. Additional socket buffer lock assertions. filt_sowrite() now does conditional socket buffer locking due to variation in locks held on entering: when entering via kqueue system calls, socket buffer lock is not held, but when entering via KNOTE() from socket code, it is. This fixes a panic reported by Kris Kennaway. Socket buffer lock held around call to unp_scan() in UNIX domain socket code, since it accepts so->so_rcv.sb_mb, which might otherwise change out from under it. Merged netatalk AARP locking back to CVS. 20040407 patch: Integrate to CVS HEAD. Annotations of unlocked reads, possible future locking needs in the socket buffer code. Additional assertions. 20040406 patch: Integrate to CVS HEAD. 
filt_soread() now conditionally grabs the socket buffer lock based on whether it's already held. That way it can grab the lock when called via the kevent system call, but not grab it when called via KNOTE() from socket code already holding the lock. Integrated 'gif_called' removal by Ruslan Ermilov in CVS; XXX's removed as a result (now much more MPSAFE). Several warning fixes from last integration. Annotation of so_upcall() use to indicate that Giant is not available to the called function. Affected are nb_upcall(), nfsrv_rcv(), and ng_ksocket_incoming(). First attempt at fine-grained locking for the NFS server: add a global nfsd_mtx to act as a code/data lock for the time being. Push Giant down to VFS in the RPC path, and permit nfsrv_rcv() to run without Giant. Removal of Giant in the downward path is conditional on debug.mpsafenet. This approach is modeled on the BSD/OS approach, but is substantially different. This hopefully will eliminate NFS server hangs on SMP systems running with debug.mpsafenet=1. 20040404a patch: Integrate to CVS HEAD. Avoid lock order recursion in socket kqueue implementation by always calling KNOTE() on a socket buffer holding the socket buffer lock. Remove socket buffer locking from socket kqueue methods and simply assert the socket buffer lock. Fixes a panic reported by Kris Kennaway. 20040404 patch: Integrate to CVS HEAD. Annotate uses of so_upcall() with comments that Giant is not held in the upcalls; these exist in: ng_ksocket.c:ng_ksocket_incoming() netsmb/smb_trantcp.c:nb_upcall() nfsserver/nfs_srvsock.c:nfsrv_rcv() Annotate a nasty issue in the NFS server where we're entering the incoming NFS processing path from the socket code without holding Giant, resulting in a variety of races in the NFS server, which may explain the hangs that Kris Kennaway was experiencing. Convert a NET_ASSERT_GIANT() into a GIANT_REQUIRED, which will fail-stop the system if the NFS server is used (for now).
Convert another Giant assertion to non-conditional in the NFS system call code. This problem reported by Kris Kennaway. Constify various constant arrays and values in the NFS server code. 20040402a patch: Actually, it's spelt sowwakeup_locked(). 20040402 patch: Integrate to CVS HEAD. Bugfix: use sowakeup_locked() instead of sowwakeup(); this fixes a lock recursion panic reported by Kris Kennaway. Constify some stuff in netnatm. netnatm requires more attention. 20040401 patch: Integrate to CVS HEAD. Merged slisunitfree() to FreeBSD CVS. Add a per-softc mutex to SLIP interfaces (sc_mtx), which protects all elements of the softc except tty fields, ifnet fields, ifqueue fields, and sc_next (protected by slip_mtx). Rename sl_mtx to slip_mtx to reduce confusion with sc_mtx. Note: if_sl.c changes are untested. 20040331 patch: Integrate to CVS HEAD. Merged formatting/layout changes to uipc_socket.c to reduce diff size. Global variable locking for if_sl using sl_mtx; note that there is a problem here when a slip interface is renumbered using an ioctl(). 20040330a patch: Merge unp_connect2() -> uipc_connect2() and fifofs/portalfs changes to FreeBSD CVS. 20040330 patch: Integrate to CVS HEAD. portalfs and fifofs now use uipc_connect2() not unp_connect2() as unp_connect2() is an internal interface that assumes that the UNIX domain socket PCB mutex is already held. This corrects a fifofs panic (and likely a portalfs one also). However, there are some unlocked socket variable accesses in both of these that will need further attention (now annotated). Lots of 0's become NULL's in the UNIX domain socket code as I review locking there. 20040329b patch: Integrate to CVS HEAD. Merged if_tun per-softc locking. 20040329a patch: Integrate to CVS HEAD. Merged conditional CALLOUT_MPSAFE initialization of UNIX domain sockets. Merged conditional Giant assertions in the callouts.
Merged various structural changes to the socket sofree() and kqueue functions to make merging locking to the main tree easier. Removed dup gif_called variable that resulted from a mistake in merging. Merged tunmtx global variable locking in if_tun. 20040329 patch: Integrate to CVS HEAD. NET_LOCK_GIANT() associated with file descriptor socket operations moved above call into the MAC Framework since we haven't yet resolved locking for socket labels. 20040328b patch: Integrate to CVS HEAD. Merged NET_*_GIANT() replacement of mtx_*(&Giant) in sys_socket.c (socket operations via file descriptor), and uipc_syscalls.c (socket operations via socket system calls). 20040328a patch: Integrate to CVS HEAD. Merged inversion of debug.mpsafenet in NET_*_GIANT(). Merged NET_ASSERT_GIANT() in fputsock(). Merged BPFD_LOCK_ASSERT() also doing NET_ASSERT_GIANT(). Merged NET_LOCK_GIANT() removal on ip_protox switch since we now conditionally lock the whole stack, not just forwarding plane. 20040328 patch: Integrate to CVS HEAD. 20040322a patch: Integrate to CVS HEAD. if_gif: Merged global locking to the CVS tree, no longer in the patch. if_gre: Merged global locking to the CVS tree, no longer in the patch. 20040322 patch: Integrate to CVS HEAD. if_tun: missing tun_destroy() call restored to modevent unload. netatalk: AARPTAB_UNLOCK_ASSERT() added so that functions can assert that the aarptab lock *isn't* held. netatalk: lots of gratuitous whitespace cleanup, NULL use, rename of globals at_ifaddr and ddpcb to at_ifaddr_list and ddppcb_list to avoid confusion and make lock mechanical verification easier; merged to main tree. netatalk: ddpcb list lock assertions fixed. locking in ddp_search() fixed. lots of spls removed. socket space reserved before pcb attach rather than after to simplify locking, and match other protocols. ddp_send() locking fixed. 20040321 patch: Integrate to CVS HEAD. netatalk: AARP locking now uses macros rather than direct mutex operations. 
netatalk: AARP spls removed. netatalk: at_rmx comments updated; hexdump() is basically unused. netatalk: ddpcb locking prototype finished; global mutex for the pcb list, per-pcb mutexes for per-pcb data. ddp_send restructured so that global mutex isn't grabbed for a connected socket, only for unconnected. 20040320 patch: Integrate to CVS HEAD. if_tap: annotate si_flags and si_drv1 locking weaknesses. if_tap: Merge if_tap softc locking back to main tree, as well as fix a bug where clone state isn't destroyed if event handler registration fails. if_tun: annotate si_flags and si_drv1 locking weaknesses. if_tun: annotate possible removal of D_NEEDGIANT from cdev. if_tun: break out tunnel interface destruction into tun_destroy(). if_tun: replace tun_proc with tun_pid to avoid use of stale proc pointer if process opening if_tun exits while it's still in use. Note: the use of tun_pid is bogus. netatalk: at_ifaddr -> at_ifaddr_list rename. netatalk: ddp_global_mtx added to protect ddpcb list, ddp_ports. Constified ddp_sendspace, ddp_recvspace as they are immutable and don't require synchronization. First pass at DDP PCB locking (untested). netipx: annotate ipx_ifaddr, ipxstat, as requiring synchronization. constify ipx_broadnet, ipx_broadhost. Note that some other globals are immutable but non-const because they are initialized with the protosw. netipx: start to migrate from a home-brew PCB list to using queue(9), and avoid abusing a bogus first PCB on the list to hold cached port use state (ouch). Not yet finished, as there are cases where entries may be removed from the list during a timeout, and the queue(9) interface may not be able to handle this well. There appears to be a serious bug in the automatic port-binding code for netipx wherein if the port space is exhausted, the code will simply spin.
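For the netipx port-binding spin noted above, one possible shape of a fix is to bound the search to a single pass over the port range and report failure when nothing is free. The sketch below is not the netipx code: PORT_MIN, PORT_MAX, port_in_use, port_hint, and port_alloc() are all hypothetical, the range is deliberately tiny, and a real implementation would consult the PCB list under the appropriate lock instead of a static array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny automatic port range. */
#define	PORT_MIN	3000u
#define	PORT_MAX	3007u

static bool	port_in_use[PORT_MAX - PORT_MIN + 1];
static uint16_t	port_hint = PORT_MIN;	/* next candidate to try */

/*
 * Allocate a free port, or return 0 if the whole range is taken.
 * The loop visits each port at most once, so it always terminates,
 * even when the port space is exhausted.
 */
static uint16_t
port_alloc(void)
{
	unsigned int span = PORT_MAX - PORT_MIN + 1;
	unsigned int i;

	for (i = 0; i < span; i++) {
		uint16_t p = port_hint;

		/* Advance the hint, wrapping at the end of the range. */
		port_hint = (port_hint == PORT_MAX) ?
		    (uint16_t)PORT_MIN : (uint16_t)(port_hint + 1);
		if (!port_in_use[p - PORT_MIN]) {
			port_in_use[p - PORT_MIN] = true;
			return (p);
		}
	}
	return (0);	/* exhausted: caller reports an error instead of spinning */
}
```

A caller would map the zero return to an error for the bind request.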