From andy@siliconlandmark.com Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 02:42:03 -0400 (EDT)
From: Andre Guibert de Bruet
To: current@freebsd.org
Subject: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

Replying to myself... I made the cardinal sin of not including a uname -a...

The system in question is:

FreeBSD bling.home 6.0-CURRENT FreeBSD 6.0-CURRENT #1: Sat Sep 11 18:31:59 EDT 2004 root@bling.home:/usr/CURRENT/sys/i386/compile/BLING i386

I have a fresh dmesg up at http://bling.properkernel.com/dmesg.boot.txt
and a boot -v at http://bling.properkernel.com/boot-v.txt

Andy

| Andre Guibert de Bruet | Enterprise Software Consultant >
| Silicon Landmark, LLC. | http://siliconlandmark.com/    >

On Sat, 11 Sep 2004, Andre Guibert de Bruet wrote:

> Anyone have any ideas on this one?
>
> # cvsupdate [...]
> Updating collection src-all/cvs
> TreeList failed: Network write failure: ChannelMux.ProtocolError
> Will retry at 18:43:42
> ^C
>
> The machine in question is a dual Athlon with a custom SMP kernel config
> which can be found at http://bling.properkernel.com/BLING . Reverting to
> GENERIC does _not_ fix the problem. The interface in question is an
> nge-compatible 64-bit Linksys card detected as:
>
> nge0@pci0:8:0: class=0x020000 card=0x10641737 chip=0x0022100b rev=0x00 hdr=0x00
>     vendor = 'National Semiconductor'
>     device = 'DP83820/1 10/100/1000 Gigabit Ethernet Adapter'
>     class = network
>     subclass = ethernet
>
> I've also noticed data corruption in the form of failed CRCs (And hence
> dropped SSH connections) while transferring large amounts of data via SSH
> over gige to a machine on its subnet. These problems started occurring after
> the giant-less networking megacommit. Older kernels check out without any
> such issues.
>
> Additional information on the system can be found online:
> pciconf -vl: http://bling.properkernel.com/pciconf-vl.txt
> kernel config: http://bling.properkernel.com/BLING
> old dmesg (Can be updated): http://bling.properkernel.com/dmesg.boot.txt

_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

From kris@FreeBSD.org Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 06:44:16 +0000
From: Kris Kennaway
To: Andre Guibert de Bruet
Cc: current@freebsd.org
Subject: Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

On Sun, Sep 12, 2004 at 02:42:03AM -0400, Andre Guibert de Bruet wrote:

> >I've also noticed data corruption in the form of failed CRCs (And hence
> >dropped SSH connections) while transferring large amounts of data via SSH
> >over gige to a machine on its subnet. These problems started occurring
> >after the giant-less networking megacommit. Older kernels check out
> >without any such issues.

Does it go away if you turn off debug.mpsafenet? If not, it's
probably not related to that commit.

Kris

From andy@siliconlandmark.com Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 02:53:05 -0400 (EDT)
From: Andre Guibert de Bruet
To: Kris Kennaway
Cc: current@FreeBSD.ORG
Subject: Re: 6-CURRENT Network stack issues w/SMP?
 (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

On Sun, 12 Sep 2004, Kris Kennaway wrote:

> On Sun, Sep 12, 2004 at 02:42:03AM -0400, Andre Guibert de Bruet wrote:
>
>>> I've also noticed data corruption in the form of failed CRCs (And hence
>>> dropped SSH connections) while transferring large amounts of data via SSH
>>> over gige to a machine on its subnet. These problems started occurring
>>> after the giant-less networking megacommit. Older kernels check out
>>> without any such issues.
>
> Does it go away if you turn off debug.mpsafenet? If not, it's
> probably not related to that commit.

Setting debug.mpsafenet to 0 allows the SSH transfers to complete. The
MD5 checksums and sizes match. Where do we go from here?

Andy

| Andre Guibert de Bruet | Enterprise Software Consultant >
| Silicon Landmark, LLC. | http://siliconlandmark.com/    >

From rwatson@FreeBSD.ORG Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 10:57:45 -0400 (EDT)
From: Robert Watson
To: Andre Guibert de Bruet
Cc: Kris Kennaway, current@FreeBSD.ORG
Subject: Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

On Sun, 12 Sep 2004, Andre Guibert de Bruet wrote:

> On Sun, 12 Sep 2004, Kris Kennaway wrote:
>
> > On Sun, Sep 12, 2004 at 02:42:03AM -0400, Andre Guibert de Bruet wrote:
> >
> >>> I've also noticed data corruption in the form of failed CRCs (And hence
> >>> dropped SSH connections) while transferring large amounts of data via SSH
> >>> over gige to a machine on its subnet. These problems started occurring
> >>> after the giant-less networking megacommit. Older kernels check out
> >>> without any such issues.
> >
> > Does it go away if you turn off debug.mpsafenet?
> > If not, it's probably not related to that commit.
>
> Setting debug.mpsafenet to 0 allows the SSH transfers to complete. The
> MD5 checksums and sizes match. Where do we go from here?

I think I'd look at the following next:

- Does your network interface driver support checksum offload? If so,
  what happens if you disable that?

- Is the network interface driver marked as INTR_MPSAFE and/or not
  IFF_NEEDSGIANT? If either, try setting the driver to run with Giant by
  removing INTR_MPSAFE and adding IFF_NEEDSGIANT.

After that I think we want to try and produce a non-SSH reproduction
scenario using a very simple test program...

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research

From andy@siliconlandmark.com Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 12:25:49 -0400 (EDT)
From: Andre Guibert de Bruet
To: Robert Watson
Cc: current@FreeBSD.ORG
Subject: Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

On Sun, 12 Sep 2004, Robert Watson wrote:

> On Sun, 12 Sep 2004, Andre Guibert de Bruet wrote:
>> On Sun, 12 Sep 2004, Kris Kennaway wrote:
>>> On Sun, Sep 12, 2004 at 02:42:03AM -0400, Andre Guibert de Bruet wrote:
>>>
>>>>> I've also noticed data corruption in the form of failed CRCs (And hence
>>>>> dropped SSH connections) while transferring large amounts of data via SSH
>>>>> over gige to a machine on its subnet. These problems started occurring
>>>>> after the giant-less networking megacommit. Older kernels check out
>>>>> without any such issues.
>>>
>>> Does it go away if you turn off debug.mpsafenet? If not, it's
>>> probably not related to that commit.
>>
>> Setting debug.mpsafenet to 0 allows the SSH transfers to complete. The
>> MD5 checksums and sizes match. Where do we go from here?
>
> I think I'd look at the following next:
>
> - Does your network interface driver support checksum offload? If so,
>   what happens if you disable that?

It appears that it does, based on the options field reported by ifconfig:

nge0: flags=108843 mtu 1500
        options=13

I can still reproduce the problem after passing -rxcsum and -txcsum while
bringing the interface up.

> - Is the network interface driver marked as INTR_MPSAFE and/or not
>   IFF_NEEDSGIANT? If either, try setting the driver to run with Giant by
>   removing INTR_MPSAFE and adding IFF_NEEDSGIANT.

dev/nge/if_nge.c has the interface marked as IFF_NEEDSGIANT, with no trace
of INTR_MPSAFE. My dmesg confirms this: "nge0: [GIANT-LOCKED]"

> After that I think we want to try and produce a non-SSH reproduction
> scenario using a very simple test program...

Attempting to bring a local FreeBSD repo up-to-date causes the issue to
manifest itself. If portupgrade is run and execs a fetch for a large
tarball from a fast mirror (100KB/s+), the problem manifests itself as
well.

I cannot yet make any conclusive determination, but preliminary pattern
analysis seems to indicate that large bursts of network traffic on this
gige interface aid the reproduction of this condition. The machine in
question acts as a DNS resolver for my small home network and appears to
handle light amounts of traffic without any issues.

Thanks for the help,
Andy

| Andre Guibert de Bruet | Enterprise Software Consultant >
| Silicon Landmark, LLC.
| http://siliconlandmark.com/ >

From andy@siliconlandmark.com Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 13:07:35 -0400 (EDT)
From: Andre Guibert de Bruet
To: Robert Watson
Cc: current@freebsd.org
Subject: Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

Robert,

Using an rl-based network card, I am able to transfer data without any
problems. Any idea who the nge maintainer is?

Regards,
Andy

| Andre Guibert de Bruet | Enterprise Software Consultant >
| Silicon Landmark, LLC. | http://siliconlandmark.com/    >

From rwatson@freebsd.org Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 16:10:29 -0400 (EDT)
From: Robert Watson
To: Andre Guibert de Bruet
Cc: current@freebsd.org
Subject: Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

On Sun, 12 Sep 2004, Andre Guibert de Bruet wrote:

> Using an rl-based network card, I am able to transfer data without any
> problems. Any idea who the nge maintainer is?

I'm not sure we have an nge maintainer, but I'm also not sure it's
needed much maintenance (perhaps until now). Bill Paul wrote it, I
believe, however. I'm thinking there are a couple of things we should
try doing:

- First, we should confirm that Giant really is properly held in some
  strategic places in the driver. I.e., slap down GIANT_REQUIRED in a
  bunch of interesting looking places (perhaps the head of most of the
  functions). We could be entering the ioctl code w/o Giant, perhaps, or
  the watchdog.
- Attempt to identify whether or not the corruption corresponds with other
  failure modes that may be present, such as packet loss. Perhaps we're
  looking at a problem with reassembly and/or retransmission. It would be
  useful to know, for example, if the counters relating to TCP packet loss
  go up at about the time corruption occurs.

- We should probably build a test tool to characterize the corruption a
  bit better. We could potentially start out just by dd'ing a big file of
  zeros through netcat between two hosts using if_nge, and confirm that
  the zeros get there in one piece, and then try with more complex data
  patterns that would reveal improper ordering, etc.

- For grins, could you try running the same software with TCP SACK turned
  off and confirm that the problem is still present?

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research

From rwatson@freebsd.org Mon Sep 13 12:34:17 2004
Date: Sun, 12 Sep 2004 17:30:27 -0400 (EDT)
From: Robert Watson
To: Andre Guibert de Bruet
Cc: current@freebsd.org
Subject: Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

On Sun, 12 Sep 2004, Robert Watson wrote:

> On Sun, 12 Sep 2004, Andre Guibert de Bruet wrote:
>
> > Using an rl-based network card, I am able to transfer data without any
> > problems. Any idea who the nge maintainer is?
>
> I'm not sure we have an nge maintainer, but I'm also not sure it's
> needed much maintenance (perhaps until now). Bill Paul wrote it, I
> believe, however.
> I'm thinking there are a couple of things we should try doing:

Another thing to try is to use the ping command with a large packet size
(maybe just below MTU) and relatively rapid rate to see if it reports any
data corruption. That might help us confirm whether this is isolated to
TCP or not.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research
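
A minimal sketch of the kind of test tool Robert describes earlier in the thread (zeros first, then "more complex data patterns that would reveal improper ordering"). Nothing like this was posted to the list; it is an illustration only. It assumes Python 3 on both hosts, and the block layout, BLOCK_SIZE, and the names make_block, make_stream, and check_stream are all hypothetical choices made for this example. Each block carries its own index plus a CRC32 over index and payload, so a receiver can tell a damaged block from an intact block that arrived in the wrong place.

```python
import struct
import zlib

# Assumed layout (hypothetical): 8-byte block index, 4-byte CRC32, then payload.
BLOCK_SIZE = 1024
HEADER = struct.Struct("!QI")


def make_block(index, block_size=BLOCK_SIZE):
    """Build one self-describing block whose payload depends on its index."""
    payload_len = block_size - HEADER.size
    seed = struct.pack("!Q", index)
    payload = (seed * (payload_len // len(seed) + 1))[:payload_len]
    # The CRC covers the index as well, so a damaged header counts as corruption.
    crc = zlib.crc32(seed + payload)
    return HEADER.pack(index, crc) + payload


def make_stream(n_blocks, block_size=BLOCK_SIZE):
    """Concatenate n_blocks consecutive blocks into one byte string."""
    return b"".join(make_block(i, block_size) for i in range(n_blocks))


def check_stream(data, block_size=BLOCK_SIZE):
    """Return a list of (position, problem) tuples; an empty list means intact."""
    problems = []
    n_full, tail = divmod(len(data), block_size)
    for pos in range(n_full):
        block = data[pos * block_size:(pos + 1) * block_size]
        index, crc = HEADER.unpack_from(block)
        payload = block[HEADER.size:]
        if zlib.crc32(struct.pack("!Q", index) + payload) != crc:
            problems.append((pos, "corrupt"))
        elif index != pos:
            problems.append((pos, "misordered: carries index %d" % index))
    if tail:
        problems.append((n_full, "truncated"))
    return problems
```

One way to use it, under the same assumptions: write the output of make_stream() to a file on the sender, push that file through netcat over the nge interface as suggested above, and run check_stream() on the bytes the receiver saved. Comparing which block positions are reported against the TCP retransmission counters would speak to the packet-loss question raised in the same message.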