ME0 Segment Finding

Segment Finding Algorithm
Testbench
Presentations
Source Code

Segment Finding Algorithm

Each ME0 chamber is composed of 6 layers of GEMs.

Each layer is composed of 8 partitions in eta, each of which is split into 384 strips in phi.

The trigger output of a GEM chamber is a binary readout called an “S-bit”, which is the OR of two adjacent strips. The resolution of a ME0 chamber in the trigger path is therefore 1/2 of the fundamental resolution, or 8x192 strips per layer.

In summary, and ME0 Chamber has:

6 layers
each layer has 8 eta partitions
each eta partition has 384 strips (192 s-bits)
1536 sbits / layer / bx * 6 layers = 9216 sbits / bx

The orientation of the chamber relative to the interaction point is:

                 ly0 -> ly5  .
            prt0 |‾‾‾‾‾‾‾‾|.
            prt1 |      . |
            prt2 |   .    |
            prt3 |.       |
            prt4 |        |
            prt5 |        |
           .prt6 |        |
        .   prt7 |________|
     .
--IP-----------------------

\ 16        8         0  /    prt0
 \ 17       9        1  /     prt1
  \ 18      10      2  /      prt2
   \ 19     11     3  /       prt3
    \ 20    12    4  /        prt4
     \ 21   13   5  /         prt5
      \ 22  14  6  /          prt6
       \ 23 15 7  /           prt7

     strip 191 --> 0

The basic goal of segment finding in ME0 is that we want to:

Identify multi-layer segments
Do some refined angular/position calculation using multi-layer hit data

The mechanism to do this segment identification is broadly something like:

Identify segments locally
- Create “segment candidates” with some sortable key field
- Done in parallel, many times
Sort segment candidates into a smaller number of output candidates
- Development of sorting machinery can be somewhat decoupled from the exact segment identification mechanism
Perform some post-processing on the segment candidates to produce a refined measurement (some kind of fit)

Input Hits Processing

This segment finder is designed to find multi-layer segments in ME0.

The pattern finder works by looking at the chamber in two dimensional slices, where bending in the phi direction can be observed. Each slice corresponds to 6 layers of the ME0 chamber for a given eta partition.

For example, a cross-sectional view of a muon track may look something like this (chamber is truncated between /// marks).

    0                                          --->                                              191
ly0 -------------+-----------------------------------------///--------------------------------------
ly1 --------------+----------------------------------------///--------------------------------------
ly2 ---------------+---------------------------------------///--------------------------------------
ly3 ----------------+--------------------------------------///--------------------------------------
ly4 -----------------+-------------------------------------///--------------------------------------
ly5 ------------------+------------------------------------///--------------------------------------

                   ^ track

Our goal in segment finding is to:

Identify segments
Measure the position of the segment
Measure the bending angle of the segment

Cross-partition data sharing

Approximately 10% efficiency loss is incurred if the segment finder only looks for tracks within single partitions. To capture the segments which cross from partition to partition, the segment finder does cross-partition data sharing.

This is done by creating what we have dubbed ’virtual partitions’.

There are eight ’true’ partitions in the chamber. Segment finders are instantiated for each of these eight partitions. In addition to this though, virtual partitions are formed that are the OR of some layers of one partition, and some from the neighbor.

For example, the logic implemented in the firmware does the ORing as follows:

partition_or(0) <=                  sbits(I/2 + 1)(0);
partition_or(1) <=                  sbits(I/2 + 1)(1);
partition_or(2) <= sbits(I/2)(2) or sbits(I/2 + 1)(2);
partition_or(3) <= sbits(I/2)(3) or sbits(I/2 + 1)(3);
partition_or(4) <= sbits(I/2)(4);
partition_or(5) <= sbits(I/2)(5);

The partitions are numbered so that 0 = straight, 1 = angled, 2 = straight, 3 = angled, etc. This means that in the output segments there are 15 possible partitions, 8 of which are straight and 7 of which are angled. Even numbered partitions are straight, so the LSB of the partition word can be used as an angled flag.

Chamber staggering

Are the chambers staggered? if so we need de-staggering

Alignment ??

Are alignment corrections required?

Segment Candidate Identification

Centered around each strip is a pattern identification block called a “pattern unit”.

A pattern unit looks at a subset of the partition’s data, and identifies segment candidates which are roughly centered around a given strip.

To minimize latency, each bunch crossing a number of independent pattern unit modules operate in parallel to find pattern candidates for each of the strips in the chamber.

For example, in the following diagram we show the “window” that a pattern unit centered on strip #20 might look at:

    0                   20                     --->                                              191
      ┌────────────────────────────────────┐
ly0 --│--------------+---------------------│-------------------///--------------------------------------
ly1 --│---------------+--------------------│-------------------///--------------------------------------
ly2 --│----------------+-------------------│-------------------///--------------------------------------
ly3 --│-----------------+------------------│-------------------///--------------------------------------
ly4 --│------------------+-----------------│-------------------///--------------------------------------
ly5 --│-------------------+----------------│-------------------///--------------------------------------
      └────────────────────────────────────┘
                       ^ track

Within a pattern unit we look only at the boxed area, and can consider looking at a slice of data ranging within +- some range (the range is determined from the width of the maximally wide pattern). Here we use +- 18 as an example.

┌────────────────────────────────────┐
│--------------+---------------------│
│---------------+--------------------│
│----------------+-------------------│
│-----------------+------------------│
│------------------+-----------------│
│-------------------+----------------│
└────────────────────────────────────┘
-18     <--       0      -->       18

Each pattern unit looks for patterns (or roads) which are centered around a given strip, at different bending angles.

For example, two patterns with different bending angles might look something like:

Pattern #14

ly0 xxxx-----
ly1 -xxxx----
ly2 ---xxx---
ly3 ---xxx---
ly4 ----xxxx-
ly5 -----xxxx

Pattern #1

ly0 ----------------------------xxxxxxxxx
ly1 ------------------------xxxxxxxxx----
ly2 ----------------xxxxxxxxxxxx---------
ly3 ---------xxxxxxxx--------------------
ly4 ----xxxxxxxxx------------------------
ly5 xxxxxxxxx----------------------------

If we were to overlay the above pattern #14 onto the data shown above, we would see this:

┌─────────────────────────────────────┐
│-------------xx+x--------------------│
│--------------xx+x-------------------│
│----------------x+x------------------│
│----------------xx+------------------│
│------------------x+xxx--------------│
│-------------------x+xxx-------------│
└─────────────────────────────────────┘
-18     <--       0       -->       18

In this case, we can see that 6 hits (designated by +) fall within the pattern mask (designated by x), so we say that for this pattern the layer count is 6.

The high layer count (6) is because this pattern is a good match for the actual data.

If instead we overlay pattern #1 on this data we see

┌─────────────────────────────────────┐
│---------------+------------xxxxxxxxx│
│----------------+-------xxxxxxxxx----│
│----------------x+xxxxxxxxxx---------│
│---------xxxxxxxx-+------------------│
│----xxxxxxxxx------+-----------------│
│xxxxxxxxx-----------+----------------│
└─────────────────────────────────────┘
-18     <--       0       -->       18

In this case only one of the hits falls within the pattern mask, so the layer count is only 1. Following this example, we can use the layer count for each pattern as a metric for the quality of the pattern.

For the entire collection of patterns in a pattern unit, a sorting tree looks through this collection of data and returns a single pattern which is determined to have the highest “quality”.

Hit count: for sorting within a pattern unit, the simple metric of {layer count, id} was found to be inadequate due to a “narrowing” bias that was discovered by Jade. Because the pattern unit prefers higher pattern IDs (straighter patterns), the sorting mechanism will select the straightest possible pattern that still maintains the layer count, while discarding hits at the edges. To avoid throwing away this data biasing the pattern finder toward erroneously narrow patterns, a hit count was introduced.

Sorting secondarily on the number of hits means that the pattern unit will not narrow patterns in a foolhardy attempt to increase the pattern ID.

In implementing this in the firmware, however, a full 6 layer hit count was found to be very costly in terms of resources. A simplified approach was taken that considers only outer layers, and a simplified code for counting the number of hits. For implementation details, refer to hit_count.vhd.

The sorting metric is:

Choose the pattern with the highest layer count
If multiple patterns have the same layer count: choose the one with the highest hit count
If multilpe patterns have the same layer count + hit count: choose the highest pattern id (higher pattern IDs correspond to straighter patterns, or higher momentum particles).

The single pattern that is chosen for each pattern unit

Layer count (the number of layers hit in the pattern)
Pattern ID (a unique number representing the pattern; higher pattern IDs are straighter)

Edge Padding

(Some pattern units run off the left and right edges of the chamber and need to be zero-padded)

Deadtime

Deadtime is added to prevent pulse-extended strips from re-firing in time. It is a simple mechanism wherein each strip has a counter, and goes dead for a specified number of bunch crossings after firing.

Ghost Cancellation

Because of wide patterns, the same hits will produce multiple strips that trigger. A mechanism must cancel off these neighboring “ghosts” by selecting only the best segment in a group of neighboring strips.

Ghost cancellation must also act between partitions, but this is not yet implemented in the firmware.

Partition Pre-Sorting

For each partition, every bunch crossing a collection of 192 segments is produced in the pattern units.

This will be later sent into “true” bitonic sorters, but to reduce the computational difficulty of the bitonic sorting stage, the segments are first passed through a crude sorting tree.

This takes advantage of the fact that generally we are less interested in multiple segments appearing in neighboring partitions, since they will often simply be ghosts of one another.

Thus, we restrict the chamber to only accept one segment for every N strips, reducing the # of segments that need to be sorted by a factor of N.

Chamber Segment Selection

Choose from (8 partitions * 192/N segments) -> M segments, where N is the pre-sorting factor from the previous stage.

Centroid Coordinate Transformation

f (pattern id, 6x centroids) → 6x hit position

Fitting

A standard linear fit follows a formula:

\[ \overline{X}=\frac{\sum{}{}x_i}{n} \]

\[ \overline{Y}=\frac{\sum{}{}y_i}{n} \]

\[ m = \frac {\sum{}{}(x_i - \overline{X})(y_i-\overline{Y})}{\sum{}{} (x_i-\overline{X})^2} \]

\[ b = \overline{Y} - m\overline{X} \]

To reduce the computational difficulty in an FPGA, we modify this formula in a few ways.

To defer a division by n and continue more of the calculation in signed arithmetic (rather than fixed point), we can instead calculate \(m\) as:

\begin{align*} m =& \frac {\sum{}{}(x_i - \overline{X})(y_i-\overline{Y})}{\sum{}{} (x_i-\overline{X})^2} \\ =& \frac {\sum{}{}(x - \frac{\sum{}{}x_i}{n})(y-\frac{\sum{}{}x_i}{n})}{\sum{}{} (x-\frac{\sum{}{}x_i}{n})^2} \\ =& \frac{n}{n} \frac {\sum{}{}(x_i - \frac{\sum{}{}x_i}{n})(y_i - \frac{\sum{}{}x_i}{n})}{\sum{}{} (x_i - \frac{\sum{}{}x_i}{n})^2} \\ =& \frac {\sum{}{}(nx_i - \sum{}{}{x_{i}})(ny_i-\sum{}{}{y_i})} {\sum{}{} (nx_{i}-\sum{}{}x_{i})^2} \end{align*}

And we can calculate \(b\) as:

\begin{align*} b =& \overline{Y} - m\overline{X} \\ =& \frac{\sum{}{}y_i}{n} - m \frac{\sum{}{}x_i}{n} \\ =& \frac{1}{n} \cdot \left(\sum{}{}y_i - m \sum{}{}x_i\right)\\ \end{align*}

To make this even simpler, we take advantage of the fact that the range of the divisor in the above equation (for the caclulation of \(m\)) is limited to a maximum value of 630. This is because the x values represent the layer count, which is simply the set of layers hit (0, 1, 2, 3, 4, 5) where not all layers are necessarily hit. So it has a very restricted set of possibilities that can be easily evaluated.

The upper bound on this number then is just:

\[ 630 = 6^2 \times ( (0 - 2.5)^2 + (1 - 2.5)^2 + (2 - 2.5)^2 + (3 - 2.5)^2 + (4 - 2.5)^2 + (5 - 2.5)^2 ) \]

This was found by brute force exhausting the entire possibility of combinations, but could probably be proven with a small amount of thought.

Since the divisor only has a small range of values possible, the division is re-written as a multiplication by the reciprocal of the number. All possible values of this are encoded in a lookup table as a fixed point value. This transforms a division into a fixed point multiplication, which is computationally much simpler.

\[ m = \sum{}{}(nx - \sum{}{}{x_{i}})(ny_i-\sum{}{}{y_i}) \times reciprocal(\sum{}{} (nx_{i}-\sum{}{}x_{i})^2) \]

With this, the slope is still expressed by the same formula as above, with a multiplication by a factor of 1/n (stored in a lookup table) used in place of a division by n.

\[ b = \overline{Y} - m\overline{X} = \frac{\sum{}{}y_i}{n} - m \frac{\sum{}{}x_i}{n} = \frac{\sum{}{}y_i - m \sum{}{}x_i}{n} = (\sum{}{}y_i - m \sum{}{}x_i) \times reciprocal(n) \]

This intercept (b) represents the intercept in a coordinate system where the layers are numbered (0 1 2 3 4 5) and so the intercept is the strip along the edge of the chamber (on the 0th layer).

To better represent the pattern and make the intercept in the center of the chamber, we do a simple coordinate transformation:

\[ strip = m \times 2.5 + b \]

The number 2.5 is chosen so that the center of the chamber is at 0, with the layers at ±0.5, ±1.5, ±2.5. The output of the fit module is therefore the strip and slope centered in the midpoint of the chamber. Both numbers are output as fixed point numbers.

e.g., for the strip output,

the number is composed of an integer and decimal part. The integer part represents the strip in integer units. Since the patterns are constructed such that the tracks are always centered around the midpoint of the pattern unit, the integer part of this need only be a few bits to represent the offset from center.
the fractional part is such that
- fractional bit 0 = 1/2 strip
- fractional bit 1 = 1/4 strip
- fractional bit 2 = 1/8 strip
- and so on..

For the slope output, it is similarly represented in fixed point format, with units of strips/layer. A slope of 0 is a straight track, and a slope of 7 is extremely angled.

The fit operates in a relatively high resolution output by default, but the resolution can be truncated for sending upstream by simply truncating off fractional bits to achieve the desired bandwidth.

Studies are needed to determine the optional (and achievable) resolution from this fit.

Quality of Fit

how to calculate??

Pipelined Multiplication

Some steps of the fit (multiplications) are pipelined into multiple (2) clock cycles.

The basic scheme of the pipelined multiplication is to split the numbers into most-significant and least-significant parts, and multiply the parts independently, summing their products at the end.

e.g. consider the multiplication of two 16 bit numbers, A and B:

\begin{align*} A \times B =& A[15:0] \times B[15:0] \\ =& (A[15:8] \cdot 2^8 + A[7:0]) \times (B[15:8] \cdot 2^8 + B[7:0]) \\ =& (A_{HI} \cdot 2^8 + A_{LO}) \times (B_{HI} \cdot 2^8 + B_{LO}) \\ =& (A_{HI} \cdot 1^8 + A_{LO}) \times (B_{HI} \cdot 2^8 + B_{LO}) \\ =& (A_{HI} \cdot B_{HI} \cdot 2^{16}) + (A_{HI} \cdot B_{LO} \cdot 2^8) + (A_{LO} \cdot B_{HI} \cdot 2^8) + (A_{LO} \cdot B_{LO}) \\ =& (A_{HI} \cdot B_{HI} << 16) + (A_{HI} \cdot B_{LO} \cdot << 8) + (A_{LO} \cdot B_{HI} \cdot << 8) + (A_{LO} \cdot B_{LO}) \\ \end{align*}

So, we are able to split a 16x16 bit multiplication into two steps: (1) four 8x8 bit multiplications w/ bitshifting (2) three additions. This allows us to pipeline the multiplication into two clock cycles and achieve timing at 320 MHz. Bitshifting is “free” in the FPGA (just zero padding) so this can easily meet timing.

Reciprocal

The lookup table for the reciprocal LUT is calculated through a simple python script:

print("".join([
    "  function reciprocal (x : integer; nbits : integer) return sfixed is\n",
    "  begin\n",
    "    if (x<1 or x> 2047) then \n",
    "      assert false report \"invalid reciprocal lookup x=\" & integer\'image(x) severity error;\n",
    "      return to_sfixed(0, 1, -nbits);\n",
    "".join(list(map(lambda i :
                     "    elsif (x=%d) then return to_sfixed(%.20f, 1, -nbits);\n" % (i, 1/i),
                     range(1, 2048)))),
    "    end if;\n",
    "  end;\n"]))

Post-Fit Coordinate Transformation

Transform from local to global coordinates (this is just addition of strip + fit_offset)

f (pat_unit_strip, fit_offset) -> strip

Output Data Format

The currently proposed output data format is:

Field	Bits	Notes
Eta	4	16 eta positions (stubs can’t cross more than 2 partitions)
Phi	10	768 phi positions (“half strip” resolution)
Bend	9	512 different bend angles
Quality	4	16 different quality levels
Total	27	Bits per Segment

Eta

Phi

Phi is encoded as a number from 0-768

The resolution of the trigger primitive is in the OR of two adjacent strips (0-191) so this is a factor of 4 increase in nominal resolution.

Bend

The bend angle is encoded as a fixed point number

4 bits integer bend (units of strips/layer, twos complement signed number)
5 bits fractional bend

In fixed point fashion, the interpretation of the fractional bend is that

Bit	Resolution
bit0	1/2 strip/layer
bit1	1/4 strip/layer
bit2	1/8 strip/layer
bit3	1/16 strip/layer
bit4	1/32 strip/layer

The resolution may be reduced later after further studies are done.

Quality

Testbench

The firmware is tested against a standalone Python emulator, using CocoTB as the test runner and Questasim (or modelsim, or Aldec Riviera) as the simulator.

The test bench is based around a Python program that is meant to be a 1:1 identical implementation of the firwmare. The CI ensures that they are functionally compatible to the best extent possible, although some features are not (yet) fully implemented and tested, such as pulse extension and deadzoneing.

The python program can be run in a stanalone way, is written in pure python, and has very minimal depedencies (it should require only built-in python packages). It is rather slow but uses multiprocessing for a fairly signficant speedup on a multi-core machine.

The input to the top level function (process_chamber) is quite simple. It requires a 2D list of BigInts that specify the hits in the partition.

e.g. an empty chamber is:

[[0 for _ in range(6)] for _ in range(8)]

Hits are simply bits in the BigInt, e.g. a hit on strip 10 is represented by 1 << 10.

process_chamber also takes in a Config struct that controls some of the configurable parameters of the segment finder.

Presentations

Source Code

Firmware: https://github.com/andrewpeck/me0sf/