If you're interested in functional programming, you might also want to checkout my second blog which i'm actively working on!!

Wednesday, October 3, 2012

java.io.FileNotFoundException: Too many open files

This morning we noticed that our (S)FTP service was not sending files anymore. We monitor this process however and any error logmessages are stored per job. A few error logmessages were very deceiving however
File[] files = jobIdDirectory.listFiles();
if (null == files || files.length != 1) {
  throw new SharedFileValidationException(String.format(
  "The directory %s is expected to contain a single file inside, actual number of files is %d", jobIdDirectory.getAbsolutePath(), files != null ? files.length : -1));

Checking the logs for a job that failed:
2012-10-03 08:40:26,498 INFO  pool-35 com.nxp.spider2.ftpservice.service.impl.FileTransferPostProcessor - <<< Updating job 74058544 status to Sent Failed [item: BUK7540-100A, workflow: SPC-2-PWS, error message: The directory /appl/wpc/5.2/pxprod1/public_html/sharedFS/74058544 is expected to contain a single file inside, actual number of files is -1]
Checking the actual filesystem shows however that this is not the case

But luckily we found another logmessage which gave us more insight
Caused by: java.io.FileNotFoundException: /appl/../public_html/sharedFS/73526306/TRIMM_PRODUCTS/LPC1830FET256.zip (Too many open files)

So my next action was to get more info on how to resolve this kind of error and i found this useful StackOverflow question. It turned out that I could not use the lsof command.
lsof -p <pid of jvm>

But I managed to find the PID for the failing tomcat instance.
ps -ef | grep tomcat
So next I took a look at the following folder:
/proc/{pid}/fd>ls | wc -l
So it seems we hit the default sweet spot of open files. I shutdown the tomcat instance and the open file handles got closed. Actually the complete /proc/{pid} folder got deleted.
Next I restarted the tomcat instance and checked against the new pid. The number of open file handles grows and shrinks so it does not seem to be a code issue.
/proc/11395/fd>ls | wc -l
/proc/11395/fd>ls | wc -l
/proc/11395/fd>ls | wc -l
:/proc/11395/fd>ls | wc -l
/proc/11395/fd>ls | wc -l
I actually just checked and we have about 9000 jobs to be processed and this big backlog could potentially result in this issue.

No comments:

Post a Comment