bu_cms_history/DCC_Monitoring

DCC Monitoring Software Blog

2007-10-08 (''jmsj'')

Working on cmsmoe6 in Arjan's SiPM lab.

Attempt to separate in Monitoring Tables the LRB-channel-level quantities from the LRB-level quantities

Modified in DCCItem.cc:

  1. getMonTableUniqueID() to return unique ID #'s (Unique among all LRB channels in the system, if it's an LRB Channel quantity)
  2. getLRB_ChannelNumber() new: returns HardwareNumber (the LRB number 0-4) multiplied by 3 and added to the channel number. ''This probably should be translated into a DCC Input Spigot Number, and intelligently.''
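
The numbering in item 2 can be sketched as below. This is only an illustration: the free function `lrbChannelNumber` and the constants are hypothetical names, and passing a negative channel to mean "not an LRB-channel quantity" is an assumption. Only the arithmetic (LRB number times 3 plus channel) and the -1 sentinel come from this entry.

```cpp
#include <cassert>

// From the entry above: LRB numbers run 0-4, with three channels per LRB.
const int kNumLRBs = 5;
const int kChannelsPerLRB = 3;

// Hypothetical stand-in for DCCItem::getLRB_ChannelNumber().
// Returns lrb*3 + channel, or -1 when the item is not an LRB-channel
// quantity (signalled here, by assumption, with a negative channel).
int lrbChannelNumber(int lrb, int channel) {
  if (channel < 0) return -1;
  assert(lrb >= 0 && lrb < kNumLRBs);
  assert(channel < kChannelsPerLRB);
  return lrb * kChannelsPerLRB + channel;
}
```

With five LRBs this yields indices 0 through 14, one per LRB channel on a DCC.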

When I additionally modify

  1. getMonTableName() to append " by LRB Channel" to the string it returns if (getLRB_ChannelNumber() != -1),

certain LRB tables cease to appear at all. They reappear with the appended string in their names. Next thing to do is to get rid of the NaN entries by making a single column out of each of the ones named ".... in Channel X" where X is 0,1,2. But now it's time to try out the Commissioning and DQM meetings.
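
One possible first step toward merging the ".... in Channel X" columns is to split each column name into a base name and a channel number. The sketch below assumes the suffix is exactly " in Channel " followed by a single digit 0-2; `ChannelColumn` and `splitChannelColumn` are hypothetical names, not part of the hcal code.

```cpp
#include <cassert>
#include <string>

struct ChannelColumn {
  std::string baseName;  // column name with the channel suffix removed
  int channel;           // 0, 1 or 2; -1 if the name has no channel suffix
};

// Splits "<base> in Channel X" into its base name and channel number.
// Names that do not end in the assumed suffix are returned unchanged
// with channel == -1.
ChannelColumn splitChannelColumn(const std::string& name) {
  static const std::string marker = " in Channel ";
  ChannelColumn out;
  out.baseName = name;
  out.channel = -1;

  const std::string::size_type pos = name.rfind(marker);
  if (pos == std::string::npos) return out;
  // Require exactly one character after the marker.
  if (pos + marker.size() + 1 != name.size()) return out;
  const char digit = name[name.size() - 1];
  if (digit < '0' || digit > '2') return out;

  out.baseName = name.substr(0, pos);
  out.channel = digit - '0';
  assert(out.channel >= 0 && out.channel <= 2);
  return out;
}
```

Columns that share a base name could then be written into one column, with the channel number folded into the row key instead of the column name.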

By the way, the presence and functionality of oslb.exe in hcalDCC/tool/ should be documented at the CountingHouse Repository. I just got my access to those files; I will add it soon.

2007-09-10 (''jmsj'') ''Eureka!''

The hcalDCCMonitoring method CreateTables() now publishes to the DCCManager infospace monitoring tables which share the names and contents of the ExpertView tables delineated in the Xilinx-chip fw-specific csv files. All entries have their own columns, however, so that the rows correspond to DCCs. The present contents of the DCCManager infospace are: (statically-allocated tables in parentheses)

 Config                          DCC
 Counters                        DCC
 DCC State Timers                DCC
 DCC Status Register             DCC
 (DCCWorld)
 Debug                           DCC
 Error                           DCC
 Error control configuration     DCC 
 (ErrorLRB)
 Firmware Revisions              DCC
 HTR Error Counters                   HTR
 HTR Status                           HTR
 LRB_CSR                                   LRB
 LRB_SR                                    LRB
 LRB_count                                 LRB
 LRB_err                                   LRB
 LRB_mon                                   LRB
 TTS_State                       DCC

After a phone conference with J. Mans, this will change. The goal (after E. Hazen's input has been folded in, as well) will be to make these tables still more like those of the ExpertView:

Each of these tables will be named as in the ExpertView but with their DCC's fed id number appended to the table name, to make global monitoring possible. Additionally they will be assigned key columns of unique row identifier numbers (values), where

  1. DCC-level tables use the fed id for this,
  2. HTR-level tables use a formula based on DCC and spigot/HTR number, and
  3. LRB-level tables use a formula based on DCC and LRB number.

Implementation details:

The 2c.csv table names will have to change. What will become of the EkthpertView?

Need to add the following methods to the DCCItem class:

 std::string getMonTableRow();     // returns the row number for that item
 std::string getMonTableColumn();  // returns the column name for that item
 std::string getMonTable();        // returns the table name for that item
 // If the table name contains "LRB" or "HTR", implement those hashes.
 //  otherwise only implement the crateid and FED id hash.
 int getHWMonitoringUniqueId(int crate, int FEDid, std::string MonTableName) 
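
A rough sketch of how the key scheme above might look follows. The actual formulas are not settled in these notes, so everything here is an assumption: the name `monTableRowKey`, the extra `unit` argument (spigot/HTR or LRB number), the strides, and the omission of the crate id. Only the idea of dispatching on "HTR" or "LRB" in the table name and using the FED id for DCC-level tables is taken from the text.

```cpp
#include <cassert>
#include <string>

const int kLRBsPerDCC = 5;      // LRB numbers 0-4, per the 2007-10-08 entry
const int kSpigotsPerDCC = 15;  // assumed: 5 LRBs x 3 channels

// Hypothetical row-key formula. `unit` is the spigot/HTR number or the
// LRB number, and is ignored for DCC-level tables.
int monTableRowKey(int fedId, const std::string& tableName, int unit) {
  if (tableName.find("HTR") != std::string::npos) {
    assert(unit >= 0 && unit < kSpigotsPerDCC);
    return fedId * kSpigotsPerDCC + unit;
  }
  if (tableName.find("LRB") != std::string::npos) {
    assert(unit >= 0 && unit < kLRBsPerDCC);
    return fedId * kLRBsPerDCC + unit;
  }
  return fedId;  // DCC-level tables are keyed by the FED id itself
}
```

Because each table has its own key column, the keys only need to be unique within one table level, which is why the three branches can use different strides.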

2007-08-10 (''jmsj'')

Successfully added one dynamically created table and published it to the infospace. Next: check to see if it works for more than one table. (Almost certainly it does.) Then check to see that I can add values to those tables' rows, and that they show up in the infospace, too. Also that I can add updateListeners. If all those pieces come together, it's a very small step to render live, automatically updating monitoring tables from the DCC-appropriate csv files used for the Expert View.

2007-08-09 (''jmsj'')

Accomplished: Using ExpertView technology, we are now able to generate a vector of custom-made structs. Each contains a pointer to a unique DCC (from the map used elsewhere in hcalDCCMonitoring) and a vector of DCCItems filled from the FW-appropriate csv file and hardware reads. The code is general and should work for n DCCs, but has only been tested on a single crate with a single DCC. I am able to output reasonable-looking values and strings from these DCCItems to the logging stream.
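
The shape of that data structure is roughly as sketched below. The `DCC` and `DCCItem` types here are minimal stand-ins, and `DCCItemSet` and `buildItemSets` are hypothetical names. The real code takes its DCCs from the existing map and fills the items from the csv file and hardware reads, neither of which is reproduced here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-ins for the real hcal classes (assumptions for illustration).
struct DCC { int fedId; };
struct DCCItem { std::string name; unsigned int value; };

// One entry per DCC: a non-owning pointer to the DCC plus its items.
struct DCCItemSet {
  DCC* dcc;
  std::vector<DCCItem> items;
};

// Builds one DCCItemSet per DCC. Each set gets a single placeholder item
// where the real code would load items from the csv file and the hardware.
std::vector<DCCItemSet> buildItemSets(std::vector<DCC>& dccs) {
  std::vector<DCCItemSet> sets;
  for (std::vector<DCC>::size_type i = 0; i < dccs.size(); ++i) {
    DCCItemSet s;
    s.dcc = &dccs[i];
    DCCItem placeholder;
    placeholder.name = "placeholder";
    placeholder.value = 0;
    s.items.push_back(placeholder);
    sets.push_back(s);
  }
  assert(sets.size() == dccs.size());
  return sets;
}
```

The pointers are only valid while the `dccs` vector is neither resized nor destroyed, which mirrors the constraint of pointing into a map owned elsewhere.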

On the Horizon: Could the appropriate monitoring tables be dynamically allocated from the DCCItems? One for each ExpertView table, for instance. I will spend half of one day exploring this question. If, as I expect, I get nothing but frustration for my trouble, the fallback plan is to make hard-coded tables which reflect the present structure of the ExpertView.

2007-07-30 (''jmsj'')

Q. What are the valid types to declare when adding a column? "Int" and what else?

 m_perDCC.addColumn("String for name", "int") 

A. They are in the sparse comments in the files declaring these types, in hcal/hcalBase/interface/hcal/monitor/. There seems to be something I can use as a 32-bit unsigned type, which should suffice for every item in the DCC, even the bits, nibbles, and (if treated in separate halves, which is eye-friendly) the 64-bit counters.
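
Treating a 64-bit counter as two 32-bit halves is plain bit arithmetic, sketched here. The function names are hypothetical and nothing in this sketch depends on the hcal monitoring types.

```cpp
#include <cassert>
#include <stdint.h>

// Lower 32 bits of a 64-bit counter.
uint32_t lowerHalf(uint64_t counter) {
  return static_cast<uint32_t>(counter & 0xFFFFFFFFu);
}

// Upper 32 bits of a 64-bit counter.
uint32_t upperHalf(uint64_t counter) {
  return static_cast<uint32_t>(counter >> 32);
}

// Reassembles the original counter from its two halves.
uint64_t joinHalves(uint32_t upper, uint32_t lower) {
  return (static_cast<uint64_t>(upper) << 32) | lower;
}

// Sanity check: splitting and rejoining must give back the same value.
void checkRoundTrip(uint64_t counter) {
  assert(joinHalves(upperHalf(counter), lowerHalf(counter)) == counter);
}
```

Each half then fits in a 32-bit unsigned column, at the cost of two columns per counter.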

2007-07-24 (''jmsj'')

Now able to create monitorable tables from non-hardware-reading values. Created a base of understanding and elucidation on monitoring for the beginner: The_Anatomy_of_Monitoring.

What's the best way to proceed?

2007-07-21 (''jmsj'')

I can add the file hcal/config/monitor/MonitorDCCWorld.flash and runcontrol will still run (it ignores the file). I also have that same file in ~/dist/etc/flash, and reference this copy in a new line of hcal/config/profile/profile-teststand.xml, modeled directly on the lines above it for ~/dist/etc/flash/MonitorLRBError.flash. After this, runcontrol operated with no change, and after a make install it was still unchanged (it's XML, so it is probably evaluated at run time rather than at compile time). Error messages would be more helpful than this; I hoped the modifications to profile-teststand.xml would publish the DCCWorld table to the DCCManager infospace. Nothing has changed in the HyperDAQ. What gives?

Never mind -- I re-did the make install, and copied over the libraries. Same seg fault, same type of spew to the log file, which is a link where I hope an expert can examine it.

2007-07-20 (''jmsj'')

Starting from a safe, working directory tree, runcontrol still executes successfully if I add only one column to m_perDCC, even though I tried to set values to that one (copied from Jeremy's code) and another (a uint32_t of my own).

 m_perDCC.addColumn(label_FED, "int");
 ...

2007-07-19 (''jmsj'')

Tried adding the file hcal/config/monitor/MonitorDCCWorld.flash. It is a copy of MonitorDCCLRBError.flash, with simple replacement changes:

 1. The filename change.
 2.  goes to 
    
 3.  goes to 
    
 4.  goes to 
    

And I modified the hcalDCCMonitoring constructor:

 hcalDCCMonitoring::hcalDCCMonitoring() 
 : m_perLRB("ErrorLRB"), m_perDCC("DCCWorld") {
  .... logging instantiation ....
 }

Somehow this causes runcontrol to gracefully stackdump to the logfile just before completing the setup. Here's the last 'good' part of the logger output and all the nasty stuff, too:

 19 Jul 2007 15:16:1184876189 94120880
 INFO  localhost.p:40000.hcalDCCManager.instance(0).DCC12 <> - Active : Spigot0, Spigot1
 19 Jul 2007 15:16:1184876189 94120880
 INFO  DCCMonitoring_by_jason at setup <> - hcalDCCMonitoring
 V2718 firmware : 2.0
 A2818 firmware : 0.6
 VMELibRelease  : 2.3d
 Process 1690
 on node cms2 terminates abnormally at 2007-07-19T11:16:29.105116-04:00. Caught signal 6 at address 0x69a

 Stacktrace follows:
 /home/daqowner/dist/lib/libtoolbox.so(_ZN7toolbox10stacktraceEiRSo+0x59) 0x1e3b75

 /home/daqowner/dist/lib/libtoolbox.so(_ZN7toolbox18signalSEGVCallbackEiP7siginfoPv+0x1f7) 0x1e3ff1

 /lib/tls/libpthread.so.0 0x8baf80

 /lib/tls/libc.so.6(abort+0x1d5) 0xcc3705

 /usr/lib/libstdc++.so.5 0x7004f7

 /usr/lib/libstdc++.so.5 0x700544

 /usr/lib/libstdc++.so.5 0x700567

 /usr/lib/libstdc++.so.5(__cxa_call_unexpected+0x45) 0x7003f5

 /home/daqowner/dist/lib/libhcalHW.so(_ZN14hcalDCCManager8coldInitEv+0) 0x1ff308a

 /home/daqowner/dist/lib/libhcalBase.so(_ZN4hcal11Application9steerInitEPN7toolbox4task8WorkLoopE+0x39) 0x7388d57

 /home/daqowner/dist/lib/libhcalBase.so(_ZN7toolbox4task6ActionIN4hcal11ApplicationEE6invokeEPNS0_8WorkLoopE+0x36) 0x73930dc

 /home/daqowner/dist/lib/libtoolbox.so(_ZN7toolbox4task15WaitingWorkLoop7processEv+0x38) 0x22078c

 /home/daqowner/dist/lib/libtoolbox.so(_ZN7toolbox4task8WorkLoop3runEv+0x2a) 0x21d69c

 /home/daqowner/dist/lib/libtoolbox.so(_ZN7toolbox4task11thread_funcEPv+0xc0) 0x21c740

 /lib/tls/libpthread.so.0 0x8b4dd8

 /lib/tls/libc.so.6(__clone+0x5a) 0xd76fca

(The entire log file can be found here, for now.)

I don't get much out of that. Tried grepping for LRBError from $XDAQ_ROOT, and found hcal/config/profile/profile-teststand.xml lists the flashlists available at startup. Won't be able to see if that's having the expected (hoped-for) effect until I can get a successful setup to execute.

Tried this: cleaned out the Emacs backup files

 rm config/monitor/*~

And recompiled with make install. Killed the run process and jobcontrol, restarted jobcontrol (see Job Control deets in the DCCSoftware entry of 2007-07-02.) Same kind of failure, with stackdump written in the last moments of setup. Backing out the new files to check that otherwise runcontrol is fine. Also removing new flashlist files from ~/dist/etc/flash/.

''To check:'' Was the Emacs backup file in ~/dist/etc/flash, with its .flash~ suffix, the cause of the problem? Nope, removing it didn't fix the problem, and neither did removing the file itself. Stored both new .flash files to daqowner/DCC_flashlists/. Couldn't get runcontrol to set up successfully until I removed my attempts to set values in the monitorable table object m_perDCC.

2007-07-18 (''jmsj'') Observations:

After closing runcontrol and the HyperDAQ web browser windows, although runcontrol is done, monitoring continues timed executions of DCC::updateLRBData(), which can be seen in the logging, labeled as localhost.p:40000.hcalDCCManager.instance(0).DCC12. Each of these logging messages is followed by the logging message from hcalDCCMonitoring::updatePerLRB(), so it must be called then somehow. (It is not called explicitly in DCC::updateLRBData().) These updatePerLRB logger entries are labeled "DCCMonitoring_by_jason"; they are the messages ESH and I put in to see when these were called.

What is calling DCC::updateLRBData() when the Collect button is clicked? Is it a special effect of the xdata types in the table? How does that result in a call to updatePerLRB?

This looks promising:

  static const std::string UpdateLRB("CSR_read_LRB_monitor_data");
  static const std::string CommandStatusReg("command_and_status_register");
  void DCC::updateLRBData()  throw (hcalDCC::Exception){
    try {
      // request an update
      logicBoard.dev->setBit(UpdateLRB);
      // wait for the update to be done
      uint32_t csr;
      int loop_count = 0;
      do {
        logicBoard.dev->read(CommandStatusReg,&csr);
        ++loop_count;
      } while ((csr&0xC00)!=0);
      LOG4CPLUS_INFO(m_logger,toString("DCC::updateLRBdata() - updated LRB monitoring after %d loops csr=0x%04x",loop_count, csr));
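
The excerpt as shown polls the status register with no upper bound on the number of reads. As a side note for experimenting without hardware, here is a minimal sketch of the same wait loop with a timeout. The `StatusReader` interface, `FakeReader`, and `waitForLRBUpdate` are hypothetical and are not part of HAL or the hcal code. Only the 0xC00 mask is taken from the excerpt.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical minimal interface standing in for the device used above.
struct StatusReader {
  virtual ~StatusReader() {}
  virtual uint32_t readCSR() = 0;
};

const uint32_t kUpdateBusyMask = 0xC00;  // same mask as in the excerpt

// Polls until the busy bits clear or maxLoops reads have been made.
// Returns the number of reads performed, or -1 on timeout.
int waitForLRBUpdate(StatusReader& dev, int maxLoops) {
  assert(maxLoops > 0);
  for (int loops = 1; loops <= maxLoops; ++loops) {
    if ((dev.readCSR() & kUpdateBusyMask) == 0) return loops;
  }
  return -1;
}

// Fake device for testing: reports busy for the first busyReads reads.
struct FakeReader : StatusReader {
  int remaining;
  explicit FakeReader(int busyReads) : remaining(busyReads) {}
  uint32_t readCSR() { return remaining-- > 0 ? 0x400u : 0u; }
};
```

A fake that stays busy for three reads makes `waitForLRBUpdate` return on the fourth read, and a `maxLoops` smaller than the busy period produces the -1 timeout result.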

Seems to be based on HAL names for the registers. I can get to that. The logicBoard object is initialized in DCC.cc as follows:

    /// set up the logic board
    LOG4CPLUS_DEBUG(m_logger,"DCC::initialize() setup the logic board");
    snprintf(lookingFor,32,"log3_fmem");
    if (names.find(lookingFor)==names.end()) {
      LOG4CPLUS_ERROR(m_logger,toString("Did not find '%s' as needed in the master map",lookingFor));
      return false;
    }
    loc = base + vat->getGeneralHardwareAddress(lookingFor).getAddress();
    if (!logicBoard.setup(p_busAdapter,m_logicboardAddrTableFile,loc,m_logger,m_swap)) {
      LOG4CPLUS_ERROR(m_logger,toString("Error setting up logic board on base 0x%08x, with addressTable '%s'",loc,m_logicboardAddrTableFile.c_str()));
      return false;
    }

How does it know about the method setup(HAL::VMEBusAdapterInterface*, ..., bool m_swap), and does it do anything I should know about? logicBoard is declared an hcal::VMEDeviceBundle, so it has this definition.

''Separately'', a list of flash lists exists in config, like

  hcal / config / monitor / MonitorFPGA.flash 

but for now I just want the monitoring collector to pick up a new monitorable table with EvB, the counter of Events Built at DCC register 0xBC0.

The monitorable type table is declared in hcal / hcalBase / interface / hcal / monitor / Table.hh

2007-07-14 (''jmsj'') Understanding dawns.

Nothing seems to be the matter with the original version of the 3_9_4 code. Copied my versions of hcalHW directories back to hcal/hcalHW, then

 cd $XDAQ_ROOT/hcal; make clean; make install

My changes to the LRB monitoring code are now visible on the HyperDAQ Monitoring pages.

After deleting all numbered log files in /tmp/ and restarting runcontrol, it is clear that ''two log files are opened and being written to.'' They are created in a fixed order upon selection of the run type. Of these two log files:

''A make install (and copying in the libraries again) seems to be required in order for monitoring code modifications to affect the data published to the monitoring HyperDAQ pages.'' Is it because it loads the subpackages and remakes the libdirectory?

2007-07-13 (''jmsj'')

After intervention by J. Mans, all seemed well. After a major logging and xdaq.exe/jobcontrol hiccup this afternoon, restored order by saving experimental versions of the hcal/hcalDCC and hcal/hcalHW branches to daqowner/hcalXXY_jmsj files and clobbering the tree back into shape. As we'd discovered a missing call to m_perLRB.addUpdateListener, I replicated that. Small test changes (see snip below for the manual cerr pedestal of 10 I inserted) failed to propagate to monitoring tables published to the HyperDAQ as seen on localhost:40000. I would have expected them to show up, once the monitorable table's Update Listener had been added.

 In hcalDCCMonitoring.cc:
 void hcalDCCMonitoring::updatePerLRB() {
 
       xdata::Integer xi_cerr(cerrlrbc+10);
       xdata::Integer xi_uerr(uerrlrbc);
       xdata::Integer xi_badid(badidlrbc);
  

Interestingly, there is also a HyperDAQ at port 40001, where the only major difference seems to be that the Collection is not enabled by default, although once enabled it works just as well as the original one. This is a result of some change Jeremy made with me watching, which I neither understood nor could replicate. In fact, all the logging is now going out that port:

 13 Jul 2007 21:56:1184381787 3044924336
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - New state is:Init
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Preparing to parse /home/daqowner/dist/etc/monvis/hcal-standard.xml
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column collectiontime
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:Id
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:Crate
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:Slot
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:Fiber
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:TopBottom
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:BZeroBCN
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:BZeroOrbit
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:LinkErrors
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalHTRManager:PerFiber:SignalDetect
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:id
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:crateId
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:slot
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column collectiontime
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:OpticalLOSCount
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:QPLLErrorCount
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:QPLLUnlockCount
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:TTCSingleErrorCount
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - Adding column hcalTTCFanout:TTCDoubleErrorCount
 13 Jul 2007 21:56:1184381787 31947696
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - New state is:Ready
 13 Jul 2007 21:57:1184381831 3044924336
 INFO  localhost.p:40001.hcalHyperMonVis.instance(0) <> - New state is:Active

Clearly, something is still out of whack. Next move: Try to replicate the logging victories of last night, Col. Mans leading the charge.

2007-07-11 (''jmsj'')

Stored original directory trees in daqowner/hcalHW_before_jmj and daqowner/hcalDCC_before_jmsj, respectively. These should be safe from future hcal software upgrades (which is good), but will fall behind should there be an upgrade released (which is bad).