GSI Forum: Bugs, Fixes, Releases » Memory leaks in digitization (TPC!)


Memory leaks in digitization (TPC!) [message #11011]

Mon, 20 September 2010 17:49

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *to.infn.it

Dear all,
I have run valgrind on the digi macro and found many memory-leak reports, in particular from TPC (which is the winner) but also from EMC and MVD/DSD.
I would like the detector experts to take a look and try to fix them, for better stability of the code (and also to improve performance).

This is the log:

The last one from the TPC Cluster Finder seems a big guy!

Many thanks


Re: Memory leaks in digitization (TPC!) [message #11012 is a reply to message #11011]

Tue, 21 September 2010 13:33

Felix Boehmer
Messages: 149
Registered: May 2007
Location: Munich

first-grade participant

From: *natpool.mwn.de

Hello Stefano,

thanks for pointing this out. It's being investigated.

Cheers

Felix


Re: Memory leaks in digitization (TPC!) [message #11025 is a reply to message #11012]

Thu, 23 September 2010 18:50

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *to.infn.it

Hi,
I have seen some changes during these days. I have cleaned up emc a bit, Ralf has fixed sds and mvd, and Felix has cleaned up something for tpc.
Here is what remains with the svn trunk as of 18:30 today (digi is still crashing):

INIT (TpcPadPlane):

EXEC: (several TPC still)


Re: Memory leaks in digitization (TPC!) [message #11028 is a reply to message #11025]

Fri, 24 September 2010 12:37

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *to.infn.it

I have commented out almost everything, and I have found that the task producing the crash is PndTpcElectronicsTask.

I suppose the guilty line is 176:

if(padmap[id]==NULL)padmap[id]=new std::vector<PndTpcSignal*>;
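The thread does not show the surrounding code or the fix that was later checked in, so the following is only a minimal sketch of one way to avoid the manual allocation in a line like this. It assumes padmap is a std::map keyed by pad ID; Signal and buildPadMap are hypothetical stand-ins, not PandaRoot names. Storing the vectors by value lets the map own them, so the NULL check and the matching delete both disappear.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for PndTpcSignal; only the pad ID matters here.
struct Signal {
    unsigned int padId;
    double amplitude;
};

// The map holds its vectors by value, so there is no `new` to pair with
// a `delete`. operator[] default-constructs an empty vector the first
// time a pad ID is seen, which replaces the explicit NULL check.
typedef std::map<unsigned int, std::vector<const Signal*> > PadMap;

// Group signals by pad. The pointers are non-owning views into `signals`,
// which must outlive the returned map.
PadMap buildPadMap(const std::vector<Signal>& signals) {
    PadMap padmap;
    for (std::size_t i = 0; i < signals.size(); ++i) {
        padmap[signals[i].padId].push_back(&signals[i]);
    }
    return padmap;  // all storage is released when the map goes out of scope
}
```

With this layout a per-event padmap can simply be a local variable of the task's Exec function; whether that matches the real task's design is an assumption.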


Re: Memory leaks in digitization (TPC!) [message #11029 is a reply to message #11028]

Fri, 24 September 2010 14:23

Felix Boehmer
Messages: 149
Registered: May 2007
Location: Munich

first-grade participant

From: *natpool.mwn.de

Hi Stefano,

this is indeed an absurdly stupid bug. I checked in a quick fix. Thank you for your time!

Cheers

Felix


Re: Memory leaks in digitization (TPC!) [message #11032 is a reply to message #11029]

Fri, 24 September 2010 17:14

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *to.infn.it

Hi,
the message has disappeared from the valgrind output, and running the digi I see no crash after 3k events (not yet, at least).
If I check valgrind, it seems there is still something that can be improved for tpc, from PndTpcClusterFinder:

==22081== 10,083,067 bytes in 547,009 blocks are possibly lost in loss record 457 of 461
==22081==    at 0x4004790: operator new(unsigned) (vg_replace_malloc.c:164)
==22081==    by 0xC17971: std::string::_Rep::_S_create(unsigned, unsigned, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.3)
==22081==    by 0xC19DF4: (within /usr/lib/libstdc++.so.6.0.3)
==22081==    by 0xC19F01: std::string::string(char const*, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.3)
==22081==    by 0x9909D3D: padprocessor::reset() (TORPadProcessor.cxx:99)
==22081==    by 0x990F622: PndTpcSectorProcessor::reset() (PndTpcSectorProcessor.cxx:176)
==22081==    by 0x98F4FC3: PndTpcClusterFinder::process(std::vector<PndTpcDigi*, std::allocator<PndTpcDigi*> >&) (PndTpcClusterFinder.cxx:119)

==22081== 8,448 (2,940 direct, 5,508 indirect) bytes in 245 blocks are definitely lost in loss record 363 of 461
==22081==    at 0x4004790: operator new(unsigned) (vg_replace_malloc.c:164)
==22081==    by 0x990CC1D: ppstate_output::heartbeat() (TORPPState_Output.cxx:40)
==22081==    by 0x9909DED: padprocessor::heartbeat() (TORPadProcessor.cxx:105)
==22081==    by 0x990F3CD: PndTpcSectorProcessor::process() (PndTpcSectorProcessor.cxx:117)
==22081==    by 0x98F4FA8: PndTpcClusterFinder::process(std::vector<PndTpcDigi*, std::allocator<PndTpcDigi*> >&) (PndTpcClusterFinder.cxx:118)

==22081== 48 bytes in 1 blocks are possibly lost in loss record 188 of 461
==22081==    at 0x4004790: operator new(unsigned) (vg_replace_malloc.c:164)
==22081==    by 0x98E06DF: __gnu_cxx::new_allocator<PndTpcDigi*>::allocate(unsigned, void const*) (new_allocator.h:81)
==22081==    by 0x98E04BC: std::_Vector_base<PndTpcDigi*, std::allocator<PndTpcDigi*> >::_M_allocate(unsigned) (stl_vector.h:113)
==22081==    by 0x990D036: std::_Vector_base<PndTpcDigi*, std::allocator<PndTpcDigi*> >::_Vector_base(unsigned, std::allocator<PndTpcDigi*> const&) (stl_vector.h:100)
==22081==    by 0x990CE99: std::vector<PndTpcDigi*, std::allocator<PndTpcDigi*> >::vector(std::vector<PndTpcDigi*, std::allocator<PndTpcDigi*> > const&) (stl_vector.h:221)
==22081==    by 0x990CC38: ppstate_output::heartbeat() (TORPPState_Output.cxx:40)
==22081==    by 0x9909DED: padprocessor::heartbeat() (TORPadProcessor.cxx:105)
==22081==    by 0x990F3CD: PndTpcSectorProcessor::process() (PndTpcSectorProcessor.cxx:117)
==22081==    by 0x98F4FA8: PndTpcClusterFinder::process(std::vector<PndTpcDigi*, std::allocator<PndTpcDigi*> >&) (PndTpcClusterFinder.cxx:118)

==22081== 12 bytes in 1 blocks are possibly lost in loss record 81 of 461
==22081==    at 0x4004790: operator new(unsigned) (vg_replace_malloc.c:164)
==22081==    by 0x990CC1D: ppstate_output::heartbeat() (TORPPState_Output.cxx:40)
==22081==    by 0x9909DED: padprocessor::heartbeat() (TORPadProcessor.cxx:105)
==22081==    by 0x990F3CD: PndTpcSectorProcessor::process() (PndTpcSectorProcessor.cxx:117)
==22081==    by 0x98F4FA8: PndTpcClusterFinder::process(std::vector<PndTpcDigi*, std::allocator<PndTpcDigi*> >&) (PndTpcClusterFinder.cxx:118)

The first should be the most... important, but it does not seem to crash digitization (I am waiting for the end of 20k dpm events... who knows).
Thanks for the "Electronic" fix.
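The heartbeat() records above all start at an operator new called from TORPPState_Output.cxx:40, and one of them goes through the std::vector copy constructor. The source of that file is not shown in this thread, so what follows is only a sketch of the general pattern such a trace often points to: an object created with new and handed to a caller that never deletes it. Digi, Packet, makePacketRaw and makePacket are hypothetical names, and std::unique_ptr assumes a C++11 compiler, which the gcc versions mentioned in this thread do not provide.

```cpp
#include <memory>
#include <vector>

struct Digi { int pad; };  // hypothetical stand-in for PndTpcDigi

// Hypothetical output packet holding a copy of a digi list. The static
// counter tracks live instances so that a leak becomes observable.
struct Packet {
    static int live;
    std::vector<const Digi*> digis;
    explicit Packet(const std::vector<const Digi*>& d) : digis(d) { ++live; }
    ~Packet() { --live; }
    Packet(const Packet&) = delete;
    Packet& operator=(const Packet&) = delete;
};
int Packet::live = 0;

// Raw-pointer style: every caller has to remember to delete the result.
Packet* makePacketRaw(const std::vector<const Digi*>& d) {
    return new Packet(d);
}

// Owning style: the returned smart pointer deletes the packet itself.
std::unique_ptr<Packet> makePacket(const std::vector<const Digi*>& d) {
    return std::unique_ptr<Packet>(new Packet(d));
}

// Create n packets in a loop and report how many are still alive afterwards.
int liveAfterOwnedLoop(int n) {
    std::vector<const Digi*> buffer;
    for (int i = 0; i < n; ++i) {
        std::unique_ptr<Packet> p = makePacket(buffer);
    }  // p is destroyed at the end of each iteration
    return Packet::live;
}
```

On a pre-C++11 compiler the same effect needs an explicit delete at the point where the packet is consumed, or a packet returned by value.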


Re: Memory leaks in digitization (TPC!) [message #11034 is a reply to message #11032]

Sat, 25 September 2010 14:58

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *54-82-r.retail.telecomitalia.it

Last update:

I have tried to run 20k events with only tpc, up to TpcElectronicTask, and up to TpcClusterFinderTask.

The one up to the electronic task now runs fine, without errors.
The one up to the cluster task crashes, with the following error:

Event number 3979.
PndTpcClusterizer:: 5502 clusters created
9906 electrons arriving at readout
Aggregating drifted electrons into avalanches  finished.
9906 Avalanches created
0 aggregations done.
18401 Signals created
PndTpcElectronicsTask::Exec
Building up padmap ...finished. 1298 pads hit
...........
1694 Digis created
PndTpcClusterFinderTask::Exec
252 cluster created  containing 1694 digis from 1694
Error: Symbol #include is not defined in current scope  run_digi2_tpccombi.C:145:
Error: Symbol exception is not defined in current scope  run_digi2_tpccombi.C:145:
Syntax Error: #include <exception> run_digi2_tpccombi.C:145:
Error: Symbol G__exception is not defined in current scope  run_digi2_tpccombi.C:145:
Error: type G__exception not defined FILE:/d/panda02/spataro/pandaroot/macro/pid/test/./run_digi2_tpccombi.C LINE:145
*** Interpreter error recovered ***
dpm_digi2.txt lines 52084-52132/52132 (END)

This means that the fixes in PndTpcClusterFinderTask for the valgrind warnings (previous message) are really needed.


Crash in TPC digitization [message #11093 is a reply to message #11011]

Wed, 13 October 2010 17:55

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *to.infn.it

After some tries, I surrender; I hope that somebody else can take a look.

I have produced 10k events with dpm, using macro/pid/run_sim_tpccombi_dpm.C ... everything fine.
When I run macro/pid/run_digi_tpccombi.C, I get a crash.

Error: Symbol #include is not defined in current scope  run_digi_tpccombi.C:148:
Error: Symbol exception is not defined in current scope  run_digi_tpccombi.C:148:
Syntax Error: #include <exception> run_digi_tpccombi.C:148:
Error: Symbol G__exception is not defined in current scope  run_digi_tpccombi.C:148:
Error: type G__exception not defined FILE:/d/panda02/spataro/pandaroot/macro/pid/./run_digi_tpccombi.C LINE:148
*** Interpreter error recovered ***

In order to isolate it, I have commented out some stuff.
If I run, as tasks, only:

PndTpcClusterizerTask* tpcClusterizer = new PndTpcClusterizerTask();
if(mcMode=="TGeant3") tpcClusterizer->SetMereChargeConversion();
tpcClusterizer->SetPersistence();
fRun->AddTask(tpcClusterizer);

PndTpcDriftTask* tpcDrifter = new PndTpcDriftTask();
tpcDrifter->SetPersistence();
tpcDrifter->SetDistort(false);
fRun->AddTask(tpcDrifter);

PndTpcGemTask* tpcGem = new PndTpcGemTask();
tpcGem->SetPersistence();
fRun->AddTask(tpcGem);

PndTpcPadResponseTask* tpcPadResponse = new PndTpcPadResponseTask();
tpcPadResponse->SetPersistence();
fRun->AddTask(tpcPadResponse);

I can run 10k events (please note that I have turned on the persistency).
If I add the PndTpcElectronicsTask, or if I run a macro with only the following task:

PndTpcElectronicsTask* tpcElec = new PndTpcElectronicsTask();
tpcElec->SetPersistence();
fRun->AddTask(tpcElec);

I get the crash again.
I have filled the code with cout statements, and I have found that the crashing line is in FairRootManager::ForceFill():

fOutTree->Fill();

So it should be a problem with the data written into the file, and not with the tasks themselves.

After a discussion with Mohammad, it seems that this problem can arise when the data objects have no default constructor, or if there are some uninitialized variables.
I tried to update PndTpcPrimaryCluster, PndTpcDriftedElectrons, PndTpcAvalanche, PndTpcSignal, PndTpcSample, PndTpcDigi, but without any success. I have seen, however, that there are some uninitialized variables: numbers, but also std::vector and pointers.
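As an illustration of the two points raised here, the sketch below shows a data class with a default constructor that gives every member a defined value. It is deliberately ROOT-free so that it compiles on its own; the class name and all member names are hypothetical and not taken from the PandaRoot sources. ROOT's I/O generally expects streamed classes to have a default constructor, but whether a missing one is the cause of this particular crash is not established in this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a TPC data object.
class Signal {
public:
    // Default constructor: numbers are zeroed, the pointer is NULL and the
    // vector is empty, so a default-built object never carries garbage.
    Signal()
        : fTime(0.0), fAmplitude(0.0), fPadId(0), fMother(NULL), fSamples() {}

    // Full constructor; still initializes every member explicitly.
    Signal(double time, double amplitude, unsigned int padId, const Signal* mother)
        : fTime(time), fAmplitude(amplitude), fPadId(padId),
          fMother(mother), fSamples() {}

    double time() const { return fTime; }
    double amplitude() const { return fAmplitude; }
    unsigned int padId() const { return fPadId; }
    const Signal* mother() const { return fMother; }
    std::size_t sampleCount() const { return fSamples.size(); }

private:
    double fTime;
    double fAmplitude;
    unsigned int fPadId;
    const Signal* fMother;         // non-owning; NULL when unset
    std::vector<double> fSamples;  // empty by default
};
```

Listing every member in each initializer list makes a forgotten one easy to spot in review, and gcc's -Weffc++ option can warn about members missing from an initializer list.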

I hope that some TPC expert could take a look, at least to reproduce the crash and then to investigate.

I give up.


Re: Crash in TPC digitization [message #11102 is a reply to message #11093]

Mon, 18 October 2010 11:29

Johannes Rauch
Messages: 41
Registered: September 2010
Location: TUM

continuous participant

From: *natpool.mwn.de

Hi Stefano,

I tried to reproduce the problem, and the run_sim_tpccombi_dpm macro finished successfully with 10k events.

I hadn't turned persistency on, so I will do another check and also investigate the uninitialised variables and constructor problems you mentioned.


Re: Crash in TPC digitization [message #11103 is a reply to message #11102]

Mon, 18 October 2010 11:36

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *aspublic.wlan.sinica.edu.tw

Hi,
run_sim_tpccombi_dpm.C does not crash anymore after the changes of one month ago. What is crashing now is run_digi_tpccombi.C.

If you meant that "run_digi..." is not crashing, could you please tell which machine/linux distribution/gcc version you are using?
Maybe it could be that it crashes only in some distributions.
For sure it crashes in SL4.6 (gcc 3.4.6) and at GSI (lenny64 - gcc 4.3.2).

[Updated on: Mon, 18 October 2010 11:51]


Re: Crash in TPC digitization [message #11104 is a reply to message #11103]

Mon, 18 October 2010 11:48

Johannes Rauch
Messages: 41
Registered: September 2010
Location: TUM

continuous participant

From: *natpool.mwn.de

Hi,

yes, I meant run_digi_tpccombi.C. It finished without a crash.

gcc --version is: gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

We will now try to run it on a 64bit machine and see if it crashes there.

EDIT: The macro successfully ran on a 64bit machine.

But we discovered many memory leaks that we will have to fix in the next days.

[Updated on: Mon, 18 October 2010 17:18]


Re: Crash in TPC digitization [message #11111 is a reply to message #11104]

Thu, 21 October 2010 15:42

Johannes Rauch
Messages: 41
Registered: September 2010
Location: TUM

continuous participant

From: *natpool.mwn.de

Hello everybody,

I fixed some really big memory leaks and just committed the changes. Some memory leaks still remain, but the leakage went down by at least one order of magnitude.

@Stefano, maybe the crash was due to these memory leaks; could you try your test again?
We also experienced program crashes after some 5k events with our macros, probably because the memory was full.


Re: Crash in TPC digitization [message #11119 is a reply to message #11111]

Mon, 25 October 2010 09:57

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *ihep.ac.cn

Hi,
unfortunately my DPM simulation has crashed, so I have to regenerate the files to test your changes.
Meanwhile, I have tried to run muons, and I got no crash... but muons are easier to handle.

Running valgrind, I still see the following leak:

==18001== 10,082,772 bytes in 546,990 blocks are possibly lost in loss record 451 of 455
==18001==    at 0x4004790: operator new(unsigned) (vg_replace_malloc.c:164)
==18001==    by 0xC17971: std::string::_Rep::_S_create(unsigned, unsigned, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.3)
==18001==    by 0xC19DF4: (within /usr/lib/libstdc++.so.6.0.3)
==18001==    by 0xC19F01: std::string::string(char const*, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.3)
==18001==    by 0x9947B5D: padprocessor::reset() (TORPadProcessor.cxx:99)
==18001==    by 0x994D4C2: PndTpcSectorProcessor::reset() (PndTpcSectorProcessor.cxx:187)
==18001==    by 0x9932E20: PndTpcClusterFinder::process(std::vector<PndTpcDigi*, std::allocator<PndTpcDigi*> >&) (PndTpcClusterFinder.cxx:122)
==18001==    by 0x99415EB: PndTpcClusterFinderTask::Exec(char const*) (PndTpcClusterFinderTask.cxx:151)
==18001==    by 0x4193924: TTask::ExecuteTasks(char const*) (in /home/spataro/jan10/tools/root/lib/libCore.so.5.26)
==18001==    by 0x4193720: TTask::ExecuteTask(char const*) (in /home/spataro/jan10/tools/root/lib/libCore.so.5.26)
==18001==    by 0x90140B7: FairRunAna::Run(int, int) (FairRunAna.cxx:278)
==18001==    by 0x905995D: G__FairDict_792_0_5(G__value*, char const*, G__param*, int) (FairDict.cxx:11420)

This is the last one from the old series I reported some time ago. I think it should also be fixed.
I will let you know once the dpm simulation ends.


Re: Crash in TPC digitization [message #11120 is a reply to message #11119]

Mon, 25 October 2010 16:06

Johannes Rauch
Messages: 41
Registered: September 2010
Location: TUM

continuous participant

From: *natpool.mwn.de

Dear Stefano,

I just fixed this one.

regards,

Johannes


Re: Crash in TPC digitization [message #11123 is a reply to message #11120]

Wed, 27 October 2010 03:45

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *vpn.unito.it

Hi,
I have tried and it is still crashing. Now I have the crash earlier, between PndTpcDriftTask and PndTpcGemTask (before it was between PndTpcPadResponseTask and PndTpcElectronicsTask).
This means that the problem is something else...


Re: Crash in TPC digitization [message #11173 is a reply to message #11123]

Fri, 05 November 2010 22:01

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *0-87-r.retail.telecomitalia.it

Hi,
I have done a local change in my code just trying to explore possible reasons of this crash.
In particular, I have changed the objects PndTpcDigi, PndTpcAvalanche, PndTpcSignal, PndTpcSample, PndTpcPrimaryCluster, removing the inheritance from FairMultiLinkedData and moving back to the good old TObject.
In this case, I can run digitization up to the end.

Unfortunately, if I also run the cluster finder, I get the crash again, because I cannot "simply" change the inheritance of PndTpcCluster from FairHit (which inherits from FairMultiLinkedData) without also changing the rest of the code for other detectors.

Summarizing, it seems that the problems are coming mainly from adding the links to the data objects, even if they are not filled.
Something should be explored in this direction.


Re: Crash in TPC digitization [message #11174 is a reply to message #11173]

Sat, 06 November 2010 08:12

Tobias Stockmanns
Messages: 489
Registered: May 2007

first-grade participant

From: *netcologne.de

Dear PandaRooters,

At the moment I am exploring the possibility of getting rid of the dependence on FairMultiLinkedData and finding an alternative way to store the link data. With the support of Mohammad (to upload my changes to Fair base) I can remove the dependence on Monday.

Still, it is very strange that these links seem to work if you test them on a small sample but cause so many problems when you do larger simulations. I guess we have to learn that the use of TClonesArrays is not as easy as it seems.

Cheers,

Tobias


Re: Crash in TPC digitization [message #11175 is a reply to message #11174]

Sat, 06 November 2010 09:42

StefanoSpataro
Messages: 2736
Registered: June 2005
Location: Torino

first-grade participant

From: *0-87-r.retail.telecomitalia.it

I am not sure that the links are the real reason for the crashes. Maybe they are just making the objects bigger, so that the crash appears earlier than it would without them. But in theory these links are not filled; I had already commented out all the SetLinks calls.

It could also be, I fear, that the crashes are coming from somewhere else, and we will find again the crash running 100k events...
