PDA

View Full Version : Error Creating Socket


MegaRock
1st July 2005, 17:40
This is an interesting error.

Running DNAS 1.9.5 under FreeBSD 5.3. The server starts fine, runs fine but it seems twice now it just quit responding alltogether and this is the only error in the log file:

<07/01/05@03:24:33> [yp_tch] error creating socket!
<07/01/05@03:29:33> [yp_tch] error creating socket!
<07/01/05@03:34:33> [yp_tch] error creating socket!
<07/01/05@03:39:33> [yp_tch] error creating socket!

The other day when it happened it seemed like the server stuck but showed an occasional connection make it through:

<06/29/05@09:41:33> [yp_add] error creating socket!
<06/29/05@09:41:34> [yp_add] error creating socket!
<06/29/05@09:41:35> [yp_add] error creating socket!
<06/29/05@09:41:36> [yp_add] error creating socket!
<06/29/05@09:41:37> [yp_add] error creating socket!
<06/29/05@09:41:38> [yp_add] error creating socket!
<06/29/05@09:41:38> [dest: 193.134.xxx.xx] connection closed (62 seconds) (UID: 12095)[L: 0]{Bytes: 32701}(P: 0)
<06/29/05@09:41:38> [dest: 67.79.x.xxx] starting stream (UID: 12096)[L: 1]{A: NSPlayer/8.0.0.4487}(P: 0)
<06/29/05@09:41:39> [yp_add] error creating socket!
<06/29/05@09:41:40> [yp_add] error creating socket!
<06/29/05@09:41:41> [yp_add] error creating socket!
<06/29/05@09:41:42> [yp_add] error creating socket!

Not sure why this one particular server is giving me trouble. Any ideas what causes this?

likearock
1st July 2005, 19:28
Try giving the box a reboot and start up your sc_serv again and see if it still does it?

I've seen FTP (proFTP) kick out socket errors under BSD 5x, and restarting the service fixes it temporarily, but restarting the box fixes it long term...

Not sure if that will actually do it...but its worth a shot eh?

djSpinnerCee
1st July 2005, 22:28
Sometimes the port range (user ports) that your box uses for outgoing requests (your end of a socket) will temporarily overlap a port you want to bind to a server..... typically the UNIX box selects user ports in a circular fashion, starting at 1025 and looping back after 65535, -- some Windows PCs seem to stop at 8192 and loop back -- in any event, if the port is in use when the server wants to bind it it will fail.

Another thing: You get the same issue if you have SrcIP/DestIP set to an IP address that the box does not "have" -- it should be Any for both -- and NEVER the address of any other box -- this must be an IP bound to a local interface.

MegaRock
2nd July 2005, 17:32
Still trying to figure out what happened. It's been well behaved since yesterday when I restarted although I can't figure out what's causing it and it's the second time it happened.

I am running 225 64k listeners under a AMD Sempron 2400 with 512MB RAM which I think should be sufficent since several of my other servers ran fine on 512MB considering it is both Linux and there are virtually almost no other services running (I have Proftpd and Webmin running, that's about it)

Guess I'll wait to see if it blows it's mind out again.

djSpinnerCee
2nd July 2005, 17:49
FTP could "do it" if it does passive mode (?) -- can you configure the Ephemeral (data) port range that the ftpd uses?

In any case, I would check with netstat -a when the issue is occurring -- it may give you some hints.

MegaRock
3rd July 2005, 16:33
Tried it this time (probably some time after the DNAS quit responding. All the SC server slots said 'CLOSE_WAIT'. FTP server was not up, HTTP was not up. SSH conneciton back into the server also worked fine. All the SC slots though said this and the DNAS server was not responding:

tcp4 386 0 sc03.http 80.87.83.117.57229 ESTABLISHED
tcp4 186 0 sc03.http adsl-67-66-159-6.2705 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2652 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2616 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2597 CLOSE_WAIT
tcp4 116 0 sc03.http 81-231-82-60-no3.63135 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2561 CLOSE_WAIT
tcp4 0 0 sc03.http ip127-92.cbn.net.55876 CLOSE_WAIT
tcp4 0 0 sc03.http ip127-91.cbn.net.55875 CLOSE_WAIT
tcp4 0 0 sc03.http ip127-94.cbn.net.55874 CLOSE_WAIT
tcp4 0 0 sc03.http 130.67-19-211.re.2846 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2519 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2483 CLOSE_WAIT
tcp4 252 0 sc03.http dialup-013.kamat.2299 CLOSE_WAIT
tcp4 251 0 sc03.http p54AF7A50.dip.t-.2548 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2431 CLOSE_WAIT
tcp4 186 0 sc03.http adsl-67-66-159-6.2378 CLOSE_WAIT

I noticed many of the ip's were repeated in the list but there were maybe 50 connections stuck like this - not an overly large number.

djSpinnerCee
3rd July 2005, 16:59
Your DNAS is not running on port 80 is it? The local port is http -- I usually see that when web clients support keepalive -- with CLOSE_WAIT, the local port is waiting for another request on the same port from the same client - The keepalive timeout and maxkeepalive requests per client can reduce "how long" ports stay in this state.

I'm not sure if this is your problem though, I'm not certain if a port in CLOSE_WAIT will prevent binding by a new thread (ie. counts as an "active" connection), and connections "lingering" on port 80 should not effect any other port at all, unless you have also reached the total max "sessions [sockets]" for your OS(?).

In any event, netstat -a will show all active and idle sockets and their states, including LISTENING and UDP ports -- that may give you more clues. I suspect a "recource" issue of some sort, but I can't put my finger on what exactly it may be.

MegaRock
4th July 2005, 18:12
It is indeed on port 80 at the moment. Think I might try switching ports to see if it makes any difference.

MegaRock
5th July 2005, 16:55
Well, finally got the system logs to catch something although it's a bit confusing. :

Jul 5 10:57:39 sc03 sm-mta[6724]: j65AvdmY006724: SYSERR(UID0): gatherq: cannot open "/var/spool/mqueue": Too many open files in system
Jul 5 10:57:44 sc03 kernel: kern.maxfiles limit exceeded by uid 25, please see tuning(7).
Jul 5 10:57:44 sc03 sm-msp-queue[6725]: j65AviqM006725: SYSERR(UID0): fill_fd: disconnect: fd 0 not open: Bad file descriptor
Jul 5 10:57:44 sc03 sm-msp-queue[6725]: j65AviqM006725: SYSERR(UID0): fill_fd: disconnect: cannot open /dev/null: Too many open files in system
Jul 5 10:58:04 sc03 kernel: kern.maxfiles limit exceeded by uid 0, please see tuning(7).
Jul 5 11:00:00 sc03 kernel: kern.maxfiles limit exceeded by uid 0, please see tuning(7).
Jul 5 11:02:06 sc03 last message repeated 2 times
Jul 5 11:11:10 sc03 last message repeated 14 times
Jul 5 11:22:09 sc03 last message repeated 501 times
Jul 5 11:27:38 sc03 last message repeated 329 times
Jul 5 11:27:39 sc03 syslogd: /dev/console: Too many open files in system: Too many open files in system
Jul 5 11:27:39 sc03 sm-mta[6751]: j65BRdmY006751: SYSERR(UID0): fill_fd: disconnect: fd 0 not open: Bad file descriptor
Jul 5 11:27:39 sc03 sm-mta[6751]: j65BRdmY006751: SYSERR(UID0): fill_fd: disconnect: cannot open /dev/null: Too many open files in system
Jul 5 11:27:39 sc03 kernel: kern.maxfiles limit exceeded by uid 0, please see tuning(7).
Jul 5 11:27:43 sc03 last message repeated 4 times
Jul 5 11:27:44 sc03 sm-msp-queue[6752]: j65BRiqM006752: SYSERR(UID0): fill_fd: disconnect: fd 0 not open: Bad file descriptor
Jul 5 11:27:44 sc03 sm-msp-queue[6752]: j65BRiqM006752: SYSERR(UID0): fill_fd: disconnect: cannot open /dev/null: Too many open files in system
Jul 5 11:27:44 sc03 kernel: kern.maxfiles limit exceeded by uid 0, please see tuning(7).
Jul 5 11:28:15 sc03 last message repeated 31 times
Jul 5 11:30:16 sc03 last message repeated 121 times
Jul 5 11:40:17 sc03 last message repeated 601 times
Jul 5 11:50:18 sc03 last message repeated 470 times
Jul 5 11:57:39 sc03 last message repeated 371 times
Jul 5 11:57:39 sc03 sm-mta[6771]: j65BvdmY006771: SYSERR(UID0): gatherq: cannot open "/var/spool/mqueue": Too many open files in system
Jul 5 11:57:44 sc03 kernel: kern.maxfiles limit exceeded by uid 25, please see tuning(7).
Jul 5 11:57:44 sc03 sm-msp-queue[6772]: j65BviqM006772: SYSERR(UID0): fill_fd: disconnect: fd 0 not open: Bad file descriptor
Jul 5 11:57:44 sc03 sm-msp-queue[6772]: j65BviqM006772: SYSERR(UID0): fill_fd: disconnect: cannot open /dev/null: Too many open files in system
Jul 5 11:57:59 sc03 kernel: kern.maxfiles limit exceeded by uid 0, please see tuning(7).
Jul 5 11:58:30 sc03 last message repeated 21 times
Jul 5 12:00:00 sc03 last message repeated 40 times
Jul 5 12:10:41 sc03 last message repeated 380 times
Jul 5 12:19:59 sc03 last message repeated 278 times
Jul 5 12:27:39 sc03 last message repeated 291 times

From what I'm reading on this I need to set up tuning parameters in the loader.conf fil but from what I've been reading on this it seems a bit confusing. It does seem that these are the errors causing the DNAS to stop responding while still being up and operational although out of six unix boxes this is the first time I've come across this error on a default configuration.

djSpinnerCee
5th July 2005, 17:17
Seems like those "too many files..." messages are caused by a mail server of some sort == are you relaying mail?

Maybe your mailserver (SMTP relay?) has been "discovered" and is being "used" to flood other servers.... ? consuming your file handles in the process... Sockets and streams also use "file handles"... so this could cause "bind" errors.

I would also "check" the system handles [kern.maxfiles] parameter -- it may have been "set" to too low a number for your use -- defaults are typically conservative.

To compare apples to oranges: On my WinXP Home PC, in Task Manager [Performance] -- I typically find around 8000 handles in use by only 36 processes.

MegaRock
5th July 2005, 17:33
Indeed. Sendmail was running but it was under a default configuration so I *don't think* it was open to relaying but since I am not using it on this machine I am going to remove it completely. I've also set some higher defaults in the loader.conf file to see if it makes any difference as it indeed was running on default with nothing in it. I've dumped the following in:

kern.ipc.maxsockets=16424
kern.maxproc=12328
kern.maxprocperuid=11094

Not sure if that's still being too conservative or not but seems a good starting point since in reality the only things that will run on this server is proftpd (for my admin use only), a backup Bind 9 server, snmpd, smokeping and Shoutcast.

djSpinnerCee
5th July 2005, 17:41
The best way to "know" what settings to use is to monitor the critical settings during runtime.

I don't know about that maxsockets param -- remember, just for IP, there are potentially 65535 TCP ports that can be in use, and another 65535 for UDP, and a "listen" can allow multiple connections on the same port [ie: DNAS maxListeners], so, watch that one......

MegaRock
5th July 2005, 18:39
Problem I'm having it seems is things run just fine. Once this happens it's too late. Can't get in through SSH or Webmin, can't find out what's running and choking things out and the only recourse is to reboot. I'm wondering if I need to be looking at a param that also controls how long a socket stays open or on hold once it's no longer being used because that kinda seems to be what the issue is - eventually to many files are left open and the choke hold ensues - if these files were closing properly doesn't seem it would happen.

I also noticed a sysctl kern.maxfiles resulted in 7048 which is lower than any of the other nix boxes so I set this up to 8000 just to see if it helps any. The previous errors were indeed sendmail but those were from the system trying to email errors to the root account which it could not.

Figure it would also be good to note I am running 225 slots on this box with 512 of RAM which seems sufficent (100 Mb of bandwidth) as I don't think the SC process takes too much memory for each slot and it's ran at max a few times with no issue.

Gee, where's a linux guru when you need one :)

Pc-Center
7th July 2005, 16:27
was port already used ???
was interface (if virtual eth0:x) really UP ?

MegaRock
8th July 2005, 18:06
Yep, everything was fine but I never figured this out. There was only one IP (non-virtual) on the machine but I have no idea what the hardware configuration was on the box as it's leased.

Since the colo would reinstall the OS for free I had them do that just in case I borked something myself and didn't know it. Ran fine for two days and then did it again and locked the whole box up. Had also tried raising the values via the loader.conf file for max open sockets and files. Ran nothing but the SC process and SSH. Nothing else at all running.

Came to the belief BSD don't like their hardware so I switched the server to Win 2003. Been running like a pearl since. Go figure, after all the m$ bashing I do MS solved the problem.