Re: Computing Meeting, Thursday, November 15 - Minutes [message #5426 is a reply to message #5392]
Fri, 16 November 2007 10:34
Minutes of the Alice Computing meeting, 15.11.2007
Silvia reported a 100% failure rate on her jobs, traced to errors when writing output. Kilian noticed that the disks were full and proposed redirecting output to the /tmp disks, which should solve the problem. The full disks are related to a recent bug in AliRoot that caused very large output; the bug is now fixed.
In the context of this problem, which had caused quite some lost time, Silvia proposed more comprehensive monitoring, in the hope of preventing such situations in the future. Peter had proposed issuing a warning to users when the system is not working (the question is how efficiently trouble can be detected, so monitoring is a crucial issue). Mail+Wiki will be used for such warnings.
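As an illustration of the kind of check discussed here, a minimal Python sketch of a disk-fill monitor follows. It is not what was deployed: the function names, the 90% threshold, and the monitored path are assumptions chosen for the example. Only `shutil.disk_usage` is a standard library call.

```python
import shutil
from collections import namedtuple

# Same field layout as the tuple returned by shutil.disk_usage
# (used here only to build fake readings for testing).
Usage = namedtuple("Usage", "total used free")


def fill_fraction(usage):
    """Return the used fraction (0.0 to 1.0) of a disk."""
    return usage.used / usage.total if usage.total else 0.0


def full_disk_warnings(paths, threshold=0.9, usage_fn=shutil.disk_usage):
    """Return one warning string per path whose fill level reaches the threshold.

    The threshold (hypothetical default: 90%) and usage_fn are parameters so
    the check can be tuned, and tested without touching real disks.
    """
    warnings = []
    for path in paths:
        frac = fill_fraction(usage_fn(path))
        if frac >= threshold:
            warnings.append(f"{path}: {frac:.0%} full")
    return warnings


if __name__ == "__main__":
    # Hypothetical path list; a real setup would list the output disks.
    for line in full_disk_warnings(["/tmp"]):
        print(line)
```

Run periodically, the returned strings could be forwarded by mail or posted to the Wiki, which are the channels mentioned above.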
Jobs are currently running at a healthy rate: 2 (1) jobs on large (small) machines (+5 local jobs on large machines).
Proof is not stable, and xrootd is unstable (a reset command kills the master).
Marian proposed temporarily stopping all other jobs (Grid & local) to check Proof. To do this, it was decided to restrict Proof to the old machines (where the data is stored), where it would run exclusively (à la CAF).
Another problem is that Proof crashes when one machine is dead. This may be due to an overloaded system (Grid+Proof+local) and will be clarified by the test configuration described above. In any case, Marian proposed checking machine status regularly and removing dead machines from the configuration. Another type of problem is a strange "1/2 stalled" state on some machines, in which jobs continue to run but ssh does not work.
It was noted that the whole xrootd setup sits on NFS; the question is how much this contributes to our problems.
In view of these problems, it was discussed that the planned tutorial may be postponed until Proof is under better control.
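The regular status check could be sketched roughly as below. This is an assumption-laden illustration, not the procedure agreed at the meeting: the host names are hypothetical, and probing the ssh port (22) over TCP is only one possible liveness test. A machine in the "1/2 stalled" state might still accept the TCP connection while logins hang, so a real check may need to go further.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def split_by_status(hosts, probe=ssh_port_open):
    """Split hosts into (alive, dead) lists according to the probe function."""
    alive, dead = [], []
    for host in hosts:
        (alive if probe(host) else dead).append(host)
    return alive, dead


if __name__ == "__main__":
    # Hypothetical worker names; the real list would come from the configuration.
    workers = ["node01", "node02", "node03"]
    alive, dead = split_by_status(workers)
    print("keep in configuration:", alive)
    print("remove from configuration:", dead)
```

Only the machines in the first list would then be written back to the configuration; how that file is laid out is not covered in these minutes and is left out of the sketch.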
It was decided to reserve 2 fileservers (Mischa's playground) for
Mischa's results from merged files show about 500 ev/s (compared to about 370 ev/s on local disk). A nice result is the apparent scaling with the number of jobs. These preliminary tests were done in parallel with Hades,
Next week we will have a dedicated Alice test. This needs batch, which means rebooting machines (users will be warned). 1-2 days will be devoted to "coherent" checks, the rest to "chaotic" usage.
5 fileservers are currently in Lustre, 2 more are foreseen.
It was proposed to move the meeting to Wednesday.