Re: Computing Meeting, Thursday, November 15 - Minutes [message #5426 is a reply to message #5392]
Fri, 16 November 2007 10:34
Anton Andronic
Messages: 18 Registered: March 2004 Location: GSI
Minutes of the Alice Computing meeting, 15.11.2007
==================================================
Grid:
------
Silvia reported 100% failure rates on her jobs, traced to errors when writing output. Kilian noticed that disks were full and proposed redirecting output to the /tmp disks, which should solve the problem. This is related to a recent bug in AliRoot that caused very large output and is now fixed.
In the context of this problem, which had cost quite some time, Silvia proposed more comprehensive monitoring, in order to prevent such situations in the future. Peter had proposed issuing a warning to users when the system is not working (the question is how efficiently trouble can be detected, so monitoring is crucial). Mail+Wiki will be used for such warnings.
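As an illustration of the kind of check discussed above, here is a minimal sketch that flags nearly full disks. The function name, the 90% threshold and the paths are assumptions made for this example, not something agreed at the meeting; sending the result by mail or posting it to the Wiki is left out.

```python
import shutil


def full_disk_warnings(paths, threshold=0.9, usage_fn=shutil.disk_usage):
    """Return one warning string per path whose used fraction >= threshold.

    usage_fn must return an object with .total and .used (in bytes), as
    shutil.disk_usage does; it is a parameter only so the check can be
    exercised without touching real disks.
    """
    warnings = []
    for path in paths:
        usage = usage_fn(path)
        if usage.total == 0:
            # Nothing sensible to report for an empty or pseudo filesystem.
            continue
        fraction = usage.used / usage.total
        if fraction >= threshold:
            warnings.append(f"WARNING: {path} is {fraction:.0%} full")
    return warnings


if __name__ == "__main__":
    # Hypothetical path list; replace with the output areas actually in use.
    for line in full_disk_warnings(["/tmp"]):
        print(line)
```

Run periodically (e.g. from cron), such a check would only catch the "disk full" case; other failure modes would need their own probes.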
Jobs are currently running at a healthy rate: 2 (1) jobs on large (small) machines, plus 5 local jobs on large machines.
Proof:
------
Not stable; xrootd is unstable (a reset command kills the master).
Marian proposed to temporarily stop all other jobs (Grid & local) to check Proof. To do this, it was decided to restrict Proof to the old machines (where the data is stored), where it would run exclusively (à la CAF).
Another problem is that Proof crashes when one machine is dead. This may be due to an overloaded system (Grid+Proof+local) and will be clarified by the check configuration described above. In any case, Marian proposed to check machine status regularly and remove dead machines from the configuration. A further problem is a strange "1/2 stalled" state on some machines, in which jobs continue to run but ssh does not work.
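The regular machine check proposed above could look roughly like the following sketch. It assumes that a machine counts as usable when its ssh port accepts a TCP connection within a timeout; the host names are hypothetical, and writing the surviving list into the actual Proof configuration is not shown. Note that a TCP connect may still succeed on a "1/2 stalled" machine, so a real check would probably need to run a command over ssh as well.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def split_workers(hosts, probe=ssh_port_open):
    """Split hosts into (alive, dead) lists according to probe(host)."""
    alive, dead = [], []
    for host in hosts:
        (alive if probe(host) else dead).append(host)
    return alive, dead


if __name__ == "__main__":
    # Hypothetical worker names, for illustration only.
    alive, dead = split_workers(["worker01", "worker02"])
    print("keep in configuration:", alive)
    print("remove from configuration:", dead)
```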
It was discussed that the whole of xrootd rides on NFS; the question is how much this contributes to our problems.
In view of these problems, it was discussed that the planned tutorial may be postponed until Proof is under better control.
It was decided to reserve 2 fileservers (Mischa's playground) for safe copy.
Lustre cluster:
---------------
Mischa's results on merged files show about 500 ev/s (compared to about 370 ev/s on local disk). A nice result is the apparent scaling with the number of jobs. These preliminary tests were done in parallel with Hades.
Next week we will have a dedicated Alice test. This needs batch, which means rebooting machines (users will be warned). 1-2 days will be devoted to "coherent" checks, the rest to "chaotic" usage.
5 fileservers are currently in Lustre, 2 more are foreseen.
Org:
----
It was proposed to move the meeting to Wednesday.
Anton