Re: Meeting on Thursday, April 26, 2007 [message #4172 is a reply to message #4170]
Thu, 26 April 2007 06:39
Registered: June 2004
Location: GSI, Darmstadt
Dear Silvia and all,
I would like to comment on your points:
1. From the summary table one can see that GSI pledged 100 CPUs to Grid computing, which we have not fulfilled so far. But we will certainly come close to that within this year, especially when you convert those 100 CPUs into kSI2k based on the CPU ratings at the time the pledge was made.
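The CPU-to-kSI2k conversion mentioned above can be sketched as follows. Note that the per-CPU SPECint2000 rating used here is an illustrative assumption, not a figure from the pledge itself:

```python
# Hypothetical sketch of converting a CPU-count pledge into kSI2k.
# si2k_per_cpu is an ASSUMED per-CPU SPECint2000 rating chosen only for
# illustration; the actual rating depends on the hardware of the pledge era.
def cpus_to_ksi2k(n_cpus, si2k_per_cpu=1000):
    """Convert a CPU count into kSI2k using an assumed per-CPU rating."""
    return n_cpus * si2k_per_cpu / 1000.0

print(cpus_to_ksi2k(100))  # 100 CPUs at an assumed 1000 SI2k each -> 100.0
```

The point is simply that the same pledge of "100 CPUs" corresponds to fewer kSI2k when evaluated with older, slower per-CPU ratings.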
2. You reported nothing about failure rates at GSI.
What you reported is:
- the integrated CPU time at GSI compared to other centres.
This value corresponds more or less to the percentage site share, which so far has been 4 to 5% at Muenster and 0.5% at GSI.
The reason for this is that for quite some time we have been running only 3 Grid jobs continuously at GSI (most of them successfully and without failures!), since the high-memory queue did not provide more than 3 machines in the batch farm and the other batch machines could not cope with the memory consumption of ALICE jobs. Since we put the D-Grid machines into production we run 10 jobs in parallel on average, which is already a significant improvement.

We also tried running 2 jobs in parallel per machine, but job efficiency became definitely worse compared to only 1 job per machine. The reason might be found through continuous local job monitoring. According to our preliminary findings, the memory consumption peak, which occurs only briefly during reconstruction, may exceed the machine's capacity when two jobs run at the same time and hit their peaks at roughly the same time.
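The local job monitoring mentioned above could be sketched as below: a minimal Linux snippet, assuming the job's PID is known, that reads the current and peak resident memory from `/proc`. Sampling this periodically for each job would reveal whether two concurrent jobs hit their reconstruction memory peaks simultaneously:

```python
# Minimal sketch of local job memory monitoring on Linux (an assumption:
# jobs are identified by PID and /proc is available). Reads VmRSS and
# VmPeak (both reported in kB) from /proc/<pid>/status.
def read_memory_kb(pid):
    """Return (VmRSS, VmPeak) in kB for a process, or None if unavailable."""
    fields = {}
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith(("VmRSS:", "VmPeak:")):
                    key, value = line.split(":")
                    fields[key] = int(value.split()[0])  # value is "<n> kB"
    except FileNotFoundError:
        return None
    return fields.get("VmRSS"), fields.get("VmPeak")
```

Calling this in a loop from a wrapper script (or a cron job) would give a per-job memory trace over time, from which the timing of the reconstruction peaks can be read off.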
In any case, if the integrated CPU time at GSI is consistent with the percentage share of jobs running at GSI, then the failure rate at GSI is also comparable to that of the other sites.
Memory consumption of ALICE jobs on 64-bit machines used to be significantly higher than on non-64-bit machines with former AliRoot versions. This has improved slightly with newer AliRoot versions.
If you compare the memory of GSI batch machines with other machines, please consider that our machines rarely have swap and rely mainly on the on-board memory. The Linux group believes it does not make sense to use swap, since this would slow down the jobs too much and put excessive strain on the local disk.
The GSI bandwidth is not a cause of job failures; it only limits the number of jobs we can run in parallel. The Suedhessennetz is not yet in production for our Grid jobs due to the mentioned political problems, which are not yet solved.
xrootd cluster: I agree with the problem analysis of Horst Goeringer, and I also agree that we should follow this up more closely.
Cheers and have fun,