
156 IEEE CANADIAN JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING, VOL. 44, NO. 2, SPRING 2021

Fast Mapping and Updating Algorithms for a Binary CAM on FPGA

Mappage rapide et mise à jour des algorithmes pour un CAM binaire sur FPGA

Azhar Qazi, Zahid Ullah, Member, IEEE, and Abdul Hafeez

Abstract— Content-addressable memories (CAMs) are used in a variety of applications, such as IP filtering, data compression, and artificial neural networks, due to their high-speed lookup. Fast field-programmable gate arrays (FPGAs) are nowadays used to emulate CAMs. These CAM emulations make use of either logical resources or memory blocks on FPGAs. However, such CAM emulation suffers from slower mapping and updating mechanisms, which results in an unacceptable response in real-time applications. The slower response of the update mechanism is proportionate to the CAM depth in these schemes. In this article, fast mapping and updating algorithms for a binary CAM (FMU-BiCAM) are presented, which efficiently utilize lookup tables, slice registers, and block random access memories (RAMs) on a Xilinx FPGA to emulate CAMs with faster mapping and updating. The advantage of the proposed work lies in directly applying the CAM key as an address, which helps in updating the contents of the memory units. CAMs in the literature exhaust the entire CAM depth in remapping the CAM words along with the updating word, which leads to higher update latency. The proposed algorithms are implemented on a Xilinx Virtex-6 FPGA, and the results show that the proposed method brings the update latency down to only two clock cycles.
Résumé— Les mémoires adressables de contenu (CAMs) sont utilisées dans une variété d’applications, telles
que le filtrage IP, la compression des données et les réseaux de neurones artificiels en raison de sa recherche
à haute vitesse. Les réseaux de portes rapides programmables in situ (FPGAs) sont aujourd’hui utilisées pour
émuler les CAMs. Ces émulations CAM utilisent des ressources logiques ou des blocs de mémoire sur les FPGA
pour émuler des CAMs. Cependant, une telle émulation CAM souffre d’un mappage et de mécanismes mise
à jour lents, ce qui entraîne une réponse inacceptable dans les applications en temps réel. La réponse lente
dans le mécanisme de mise à jour est proportionnelle à la profondeur CAM dans les schémas. Dans cet article,
des algorithmes de mappage et de mise à jour rapides pour un CAM binaire (FMU-BiCAM) sont présentés,
qui utilisent efficacement les tables de recherche, les registres de tranche et bloc de mémoires vives (RAM)
sur Xilinx FPGA pour émuler un mappage et une mise à jour plus rapides des CAMs. L’avantage du travail
proposé réside dans l’application directe de la clé CAM comme adresse, ce qui aide à mettre à jour le contenu
des unités de mémoire. Les CAM dans la littérature épuisent toute la profondeur de CAM en remappant les
mots CAM avec le mot de mise à jour, ce qui conduit à un délai de mise à jour plus élevée. Les algorithmes
proposés sont implémentés sur Xilinx Virtex-6 FPGA et les résultats montrent que la méthode proposée crée
un délai qu’à deux cycles d’horloge lors de la mise à jour.
Index Terms— Content-addressable memory (CAM), fast mapping algorithm, fast updating algorithm, random
access memory (RAM)-based CAM.

I. INTRODUCTION

CONTENT-ADDRESSABLE memory (CAM) plays a pivotal role in the present era of information access. Search operation can be performed in random access memory (RAM) by iteratively comparing the entire RAM entries for every search request, but this leads to a significant delay because of the sequential searching. On the contrary, CAM is designed to allow comparison of its entire entries with the searched contents in a deterministic time. This unique feature of high-speed comparison makes CAM a fetching device in applications such as artificial intelligence, pattern recognition, file storage, and database management [1], [2]. The advantage of deterministic comparison in CAM through its inbuilt comparison circuitry is a big plus. Though fast in comparison to RAM, CAM suffers from the drawbacks of low scalability, high cost, and high power consumption. Despite improvements in the circuitry and architecture of conventional CAMs, the massive hardware parallelism and rapid prototyping capabilities of modern field-programmable gate arrays (FPGAs) make them an attractive choice for implementing CAM on reconfigurable platforms. Unfortunately, FPGA devices have no support for conventional CAM architecture; therefore, researchers opt for CAM emulation on FPGA by using distributed or block RAMs, lookup tables, and slice registers.

Manuscript received January 1, 2020; revised May 16, 2020 and July 13, 2020; accepted September 11, 2020. Date of current version March 4, 2021. (Corresponding author: Zahid Ullah.)
Azhar Qazi is with the Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar 25000, Pakistan (e-mail: [email protected]).
Zahid Ullah is with the Department of Electrical and Computer Engineering, Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology, Haripur 22620, Pakistan (e-mail: [email protected]).
Abdul Hafeez is with the Department of Computer Science and IT, University of Engineering and Technology, Jalozai Campus 24280, Pakistan (e-mail: [email protected]).
Associate Editor managing this article's review: Maher Bakri-Kassem.
Digital Object Identifier 10.1109/ICJECE.2020.3025198
2694-1783 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on May 14,2021 at 23:56:59 UTC from IEEE Xplore. Restrictions apply.
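The Introduction's contrast between sequential RAM search and deterministic CAM matching can be sketched behaviorally in Python. This is an illustrative model added here, not part of the paper; the stored words and function names are assumptions made for the sketch.

```python
# Behavioral contrast (illustrative only): searching among stored words.

def ram_search(ram, word):
    """Sequential RAM search: compares entries one by one, O(N) time."""
    for addr, stored in enumerate(ram):
        if stored == word:
            return addr
    return None

def cam_search(cam, word):
    """CAM-style search: hardware compares all entries in parallel, so the
    lookup time is deterministic; a dict keyed by content models that here."""
    return cam.get(word)

ram = ["1011", "0101", "1110", "0010"]
cam = {w: a for a, w in enumerate(ram)}   # content -> address

assert ram_search(ram, "1110") == cam_search(cam, "1110") == 2
assert ram_search(ram, "0000") is None and cam_search(cam, "0000") is None
```

Both searches return the same address; the difference the paper emphasizes is the time behavior, sequential in RAM versus deterministic in CAM.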
QAZI et al.: FMU-BICAM ON FPGA 157

Static RAM (SRAM)-based CAM architecture implemented on Xilinx FPGA [3] can be summarized by analyzing progress in three directions: 1) development of CAM designs for use in parallel search operations; 2) enhancement of the mapping technique for low resource utilization; and 3) reduction of power consumption. A number of architectures, such as [4]–[6], have surfaced in this connection. Some of them have achieved significant scalability as well as throughput, but almost all of them lead to higher latency in updating their contents. The update mechanism involves two subprocesses, i.e., erasing the old contents and writing the new contents. The update of data contents is a crucial factor, and the present SRAM-based CAMs possess limitations in high-performance applications. In the related literature, an update takes at least O(N) clock cycles, where N is the CAM depth. This motivates us to develop a high-speed update strategy that is independent of the arrangement of CAM words before mapping them to the desired locations. The proposed scheme has been implemented to validate its efficiency in fast updating (adding or removing entries) for use in high-speed and low-latency contents matching in networking appliances, including routers, switches, firewalls, security, storage, and high-performance computing. Time-consuming and sequentially dependent updates would not be acceptable in most networking applications, such as packet classification and pattern recognition.

A. Key Contributions of the Research Work

The key contributions are listed as follows.
1) The proposed fast mapping and updating algorithms for a binary CAM (FMU-BiCAM) are independent of the CAM depth and consume only two clock cycles, unlike related prior work in which the update mechanism depends on the entire CAM depth, hence resulting in a faster update mechanism compared with the prior work.
2) The development of the research work enables the user to configure the hardware implementation as per CAM depth requirements.
3) Since mapping/updating implies write operations, the proposed strategies may result in a significant reduction in power consumption compared with the state-of-the-art CAMs.

The rest of this article is organized as follows. Section II narrates the related work. The proposed fast mapping and updating mechanisms for a binary CAM are presented and elaborated in Section III. Implementation and performance evaluation of the proposed mechanisms are given in Section IV. Section V concludes this article. Table I presents the notation used in this article.

TABLE I
List of Notations Used

II. RELATED WORK

Modern FPGAs, having unprecedented logic density along with digital signal processing blocks, clocking, embedded processors, and high-speed searching at reasonable prices, are preferred in digital system designs. These advantages, along with their reconfigurability, motivated designers to implement CAM architectures on the FPGA platform. The first RAM-based CAM implemented on FPGA is the vertically partitioned CAM [3], which logically divides the conventional CAM table columnwise. The subsequent columns are named subtables of the CAM, with n entries each, and are processed to be stored in their corresponding SRAM blocks. Ullah et al. [3] effectively incorporated parallelism in the matching process of large CAM words, but its mapping and updating process depends on O(N) (where N is the number of CAM words), and this raises the update latency to a very high value.

The approach adopted in [7] maximizes the parallelism in mapping and searching of CAM words by incorporating horizontal partitioning along with vertical partitioning, whereas the approach adopted in [5] removes the entire process of sequentially calculating the last index used in [3] and [7]. The designs in [3], [5], and [7] focus on reducing the latency of the mapping and lookup processes, but at the cost of using redundant memory resources, which can be avoided. In the proposed work, we eliminated all such redundant memory resources and directly used the CAM word as an address to the SRAM blocks, which results in efficient mapping and updating mechanisms.

The SRAM-based CAM designs presented in [4] and [8] provide deterministic searching in contents matching, are independent of data type, and support search of arbitrarily large words, but lag in the update mechanism, as the update process is sequentially dependent on the entire CAM depth. In our proposed mechanism, the update latency is independent of the CAM depth. The low-power RAM-based CAM architectures in [9]–[11] are designed to reduce the power consumption of SRAM-based CAMs. Memory blocks are arranged in a hierarchy of high- and low-priority blocks. If the search word is found in a high-priority block, the low-priority blocks remain deactivated. This power reduction in CAM is achieved by compromising latency and throughput, and the reduction is still not guaranteed in the worst case scenario (e.g., activation of all priority blocks). The dependence on the entire CAM depth for contents update in [12]–[17] also leads to high update latency.

Scalable CAM [18] is based on dividing the input CAM word into P subwords and deploys SRAM for matching by ANDing the outputs of all corresponding memory blocks, which are connected to the encoder. The entire process is based on a tradeoff between throughput and resource utilization. On one side, high throughput is ensured, but the excessive use of RAMs increases memory resources and power consumption. Despite all this, the update mechanism is still dependent on arranging its contents before mapping them to appropriate locations. The architecture in [19] uses a single memory block to emulate CAM on a Kintex-7 FPGA. The method though


achieves optimization of memory resources but at the cost of degraded scalability. In addition, virtual partitioning in a single memory block and the use of validation memory (VM) trigger successive processes in the single memory block and reduce speed significantly. The update mechanism in [19] again depends on O(N). Lee et al. [20] adopted the technique of bundled updates, and the update mechanism in that architecture is proportionate to the width of the configured CAM rather than the depth. The design in [20] has compromised speed and scalability to achieve optimization in memory resources but still depends on the width of the configured CAM.

The SRAM-based CAM architectures in [12], [13], [21], and [22] address the issue of the update mechanism through dedicated circuitry, yet the update latency is dependent on CAM depth. The logic-based schemes in [6], [23], and [24] simplify the update procedure at the cost of degradation in scalability and speed. Update time is reduced to some extent in [24] and [25], but at the cost of either speed or resource utilization. The dependence of the update mechanism on CAM depth in the literature motivated us to resolve the issue of slower updating and mapping for SRAM-based CAM. We develop an algorithm that supports fast run-time updating of the mapped contents. The design has the advantage that the user can configure it as per CAM size requirements to avoid problems in run-time updating. For example, to implement a CAM handling a 36-bit key, the configured CAM could be of size 512 × 36 or 1024 × 36, and to handle a 72-bit key, the configured CAM could be of size 512 × 72 or 1024 × 72.

III. PROPOSED MAPPING AND UPDATING ALGORITHMS

Recent work shows that the comparison process is exhaustive in the worst case scenario. This is due to the fact that all SRAM-based CAM architectures use an indirect searching approach, which utilizes a lot of resources, such as the generation of temporary tables for storing interim data and the generation of VM modules for checking and storing the presence of subwords. Once the complicated circuitry is of no importance in the worst case scenario, it can be removed and a direct searching approach may be incorporated. The removal of such complex circuitry significantly reduces the resource utilization and power consumption of SRAM-based CAMs and ultimately leads to optimization in other parameters as well.

Updating of data contents discussed in the literature is solely dependent on the positioning of the updated CAM word relative to the existing stored contents. Mapping the updated CAM word to the appropriate location involves iterative steps, which lead to higher latency. Such limitations in the update mechanism of SRAM-based CAMs motivated the development of faster mapping and updating algorithms that are independent of the CAM depth. The proposed fast mapping and updating mechanisms/algorithms for a binary CAM are shown in Fig. 1.

The generalized fast mapping and updating mechanisms using block RAMs are proposed for a BiCAM, which is composed of L layers, each layer having cascaded SRAM blocks, as shown in Fig. 1. It is important to mention that we used the smaller 18k block RAM compared with the 36k block RAM used in most of the related work. This is done to reduce the number of block RAMs, e.g., k, in each layer. Decreasing k increases the total number of layers in the architecture for successfully emulating the target 512 × 36 CAM, but experimental results suggested that increasing the number of layers and decreasing the number of SRAM blocks in a layer help to reduce power consumption during update. Hence, instead of using 32 block RAMs of 36k size, 64 block RAMs of 18k size are used. The first layer cascades blocks from SRAM(1,1) to SRAM(1,k), whereas the last layer cascades blocks from SRAM(L,1) to SRAM(L,k). Upon fetching the required address location on the address line, the control logic on top of each layer selects the corresponding layer. Mapping of the CAM word starts from dividing the CAM word into subwords, where the number of CAM subwords depends on the number of SRAM blocks in each layer, or vice versa. The number of addresses a layer can link depends on the width N of the SRAM block, where N refers to the valid informative column bits of the SRAM block.

In every layer, the corresponding significant bit SB of each memory block verifies the presence of the linked subwords. Concurrently, the correlated SBs of all SRAM blocks in a layer are ANDed together to check the presence of the input word Cw. Equation (1) gives the CAM depth Cd:

Cd = 2^(Cw/k)    (1)

where Cw refers to the width of the CAM word and k refers to the number of SRAM blocks in each layer.

A. Mapping Mechanism

The mapping process is complex in [3], [5], [7], and [19] due to the use of temporary lookup tables. The schemes [3] and [7] divide the CAM word into subwords in time period t1, make a bit position table (BPT) from the corresponding subwords in t2, build an address position table address generator (APTAG) from the corresponding BPT in t3, and finally make an address position table (APT) from the corresponding APTAG table in t4. The process differs somewhat in [5] and [19] by using VM modules in place of the BPT and original address table address generators (OATAGs) in place of the APTAG, yet these consume a number of clock cycles for mapping contents to the desired location. The proposed mapping mechanism, however, uses only two clock cycles: t1 for generation of the CAM subwords and t2 for mapping these subwords to the corresponding rows in the SRAM blocks by using them as the row address to set the corresponding bits of the SRAM blocks.

The proposed mapping mechanism is independent of arranging the mapped words in any order. Algorithm 1 demonstrates the mapping process. To map the queued CAM word Cw into the desired location, Cw is divided into k same-sized subwords, Csw(0) to Csw(k). The strength of the proposed work lies in incorporating control logic for the mapping process that, unlike other architectures, activates only the layer of the addressed location. The mapping CAM word is also used as the target address location in the SRAM blocks where it will be stored. This additional feature makes it a unique approach. The subwords from Csw(0) to Csw(k) set only the corresponding significant


Fig. 1. Proposed fast mapping and updating mechanisms for a BiCAM.

bits of the addressed SRAM blocks, leaving the other bits of the row at low level. The process thus avoids activating all layers for contents writing and pulls down the energy consumption during the mapping process. The queued CAM words can be mapped individually into the desired locations in random order until the depth of the CAM architecture is exhausted.

1) Mapping Example: For mapping the CAM words shown in Fig. 2, we configure an 8 × 4 BiCAM architecture. For this particular example, we select L = 2 layers of cascaded SRAM blocks having depth Rd = 8, and each block links N/L addresses, where N refers to the informative column bits in the SRAM block. The aim is to map CAM word 101110 to address location "1" and 010100 to address location "5."

After partitioning the first CAM word into k = 2 subwords, the least significant subword Csw(0) = 110 is mapped to SRAM(1,1) by setting the corresponding significant bit of the located address "1," while the most significant subword Csw(1) = 101 is mapped to the corresponding location of SRAM(1,2). Similarly, for the second CAM word, the least significant subword Csw(0) = 100 is mapped to SRAM(2,1), while the most significant subword Csw(1) = 010 is mapped to SRAM(2,2) at address "5." It is important to mention that the mapping control logic activates only layer 1 for address "1," as it is linked to layer 1, and activates only layer 2 for address "5," as it is linked to layer 2. The strength of the proposed mapping mechanism lies in directly mapping CAM words by using their subwords as row addresses to the SRAM blocks, without arranging them in ascending order, unlike other designs.

B. Lookup Process

The lookup process is initiated concurrently in all the layers. Algorithm 2 demonstrates the lookup process. The input CAM word Cw is divided into k same-sized subwords, Csw(0) to Csw(k). The CAM subwords are applied concurrently in all the layers to their corresponding SRAM blocks. The subwords from Csw(0) to Csw(k) identify the addressed locations in the


Fig. 2. Mapping process in configured 8 × 4 architecture of a BiCAM with L and k both equal to 2.
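The mapping example above (words 101110 and 010100 placed into the 8-deep, two-layer, k = 2 arrangement) can be modeled behaviorally. The following Python sketch is illustrative only and is not the authors' hardware description; the list-of-lists SRAM model and helper names are assumptions made for this sketch.

```python
# Behavioral sketch of the proposed mapping (Algorithm 1), assuming the
# paper's example geometry: depth 8, L=2 layers, k=2 SRAM blocks per layer,
# 3-bit subwords, so each block has 2**3 = 8 rows and 8/L = 4 columns.
L, K, SUB_BITS, DEPTH = 2, 2, 3, 8
COLS = DEPTH // L                      # addresses linked per layer

# blocks[layer][j][row][col] models one bit of an SRAM block
blocks = [[[[0] * COLS for _ in range(2 ** SUB_BITS)] for _ in range(K)]
          for _ in range(L)]

def subwords(word):
    """Split a K*SUB_BITS-bit word into K integer subwords, LSB first."""
    return [(word >> (SUB_BITS * j)) & (2 ** SUB_BITS - 1) for j in range(K)]

def map_word(word, addr):
    """The subword itself is the row address; only the layer owning `addr`
    is activated, and one significant bit per block is set."""
    layer, col = divmod(addr, COLS)
    for j, sw in enumerate(subwords(word)):
        blocks[layer][j][sw][col] = 1

# Equation (1): Cd = 2**(Cw/k), here 2**(6/2) = 8 = DEPTH
assert 2 ** ((K * SUB_BITS) // K) == DEPTH

map_word(0b101110, 1)   # paper's example: word 101110 -> address 1
map_word(0b010100, 5)   # paper's example: word 010100 -> address 5
```

After the two calls, the bit for subword 110 is set in column 1 of SRAM(1,1) and the bit for subword 101 in SRAM(1,2), matching the figure's description.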

Algorithm 1 Proposed Mapping Algorithm

1: procedure (This algorithm is used to map the CAM words into desired locations of SRAM blocks)
2: [Divide the CAM word into Subwords];
3: for i = 0 to k do \\execute in parallel for all SRAMs
4:   Csw(i) = Cw(i)/k
5:   Next i
6: end for
7: [Find the Address and insert CAM word];
8: for i = 1 to L do \\execute in parallel for all layers
9:   for j = 1 to K do
10:    if SRAM(i,j) = Csw(i,j) then
11:      SB of SRAM(i,j) = Cw
12:      Exit
13:    end if
14:    Next i,j
15:  end for
16: end for
17: end procedure

Algorithm 2 Lookup Operation

1: procedure (This algorithm is used to get the address of matched contents from SRAM blocks)
2: [Divide the CAM word into Subwords];
3: for i = 0 to k do \\execute in parallel for all SRAMs
4:   Csw(i) = Cw(i)/k
5:   Next i
6: end for
7: [Comparison];
8: for i = 1 to L do \\execute in parallel for all layers
9:   for j = 1 to K do
10:    SRAM(i,j) = Csw(i,j)
11:    Next i,j
12:  end for
13: end for
14: [Address Making];
15: for i = 1 to L do
16:   for j = 1 to K do
17:     Temp = Temp + SB(i,j)
18:     if Temp == 1 then
19:       Address = Temp
20:     end if
21:     Next i,j
22:   end for
23: end for
24: [Get Address];
25: Display Address
26: end procedure

corresponding SRAM blocks, and the entire row data of those particular locations is selected. The row data of all the SRAM blocks in each layer are bitwise ANDed, which results in identifying the only set output, linked to the mapped address location of the searched CAM word. The lookup process is parallelized through the cascaded SRAM blocks in each layer and consumes only two clock cycles to match the searched word with the entire stored contents and identify the required location.

1) Lookup Example: The lookup process in the configured 8 × 4 BiCAM is shown in Fig. 3. To search the CAM word 101011, the process is initiated by applying the least significant subword Csw(0) = 011 to the least significant memory module SRAM(1,1) and the most significant subword Csw(1) = 101 to the most significant memory block SRAM(1,2). Concurrently, in the second layer, the least significant subword Csw(0) = 011 is applied to SRAM(2,1) and the most significant subword Csw(1) = 101 is applied to SRAM(2,2). The addressed row bits of both SRAM blocks in each layer are bitwise ANDed, which results in identifying the only matched output among both layers. In this case, the selected address location is 1, which is linked to layer 1. The output from the rest of the AND gates is at low level.

C. Update Mechanism

The SRAM-based CAMs in [3], [5], [7], and [19] have focused on bringing down the lookup latency and resource utilization of the CAM and did not cater for update latency, which is too high in all these designs. The update latency in [6], [12], and [13] depends on the number of ternary bits per selected subword. However, the update stage did not consider the erasing process, and the worst-case update latency remains O(N). In our proposed update mechanism, by using the control logic, the CAM words are applied directly as row addresses to activate the corresponding SRAM blocks in the selected layer, avoiding activation and rewriting of the contents of all layers.
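The erase-then-write update mechanism just outlined can be sketched behaviorally in Python. This is an illustrative model under the same assumed 8-deep, two-layer, k = 2 example geometry, not the authors' RTL; the model structure and names are assumptions.

```python
# Behavioral sketch of the proposed two-cycle update (Algorithm 3),
# assuming DEPTH=8, L=2 layers, K=2 blocks per layer, 3-bit subwords.
L, K, SUB_BITS, DEPTH = 2, 2, 3, 8
COLS = DEPTH // L
blocks = [[[[0] * COLS for _ in range(2 ** SUB_BITS)] for _ in range(K)]
          for _ in range(L)]

def subwords(word):
    """Split the word into K subwords, least significant first."""
    return [(word >> (SUB_BITS * j)) & (2 ** SUB_BITS - 1) for j in range(K)]

def write_bit(word, addr, value):
    """Use each subword as a row address in its block; touch only the
    layer that owns `addr` (its column range)."""
    layer, col = divmod(addr, COLS)
    for j, sw in enumerate(subwords(word)):
        blocks[layer][j][sw][col] = value

def update(old_word, new_word, addr):
    """Cycle 1 clears the old subword bits; cycle 2 maps the new subwords.
    Only the addressed layer is activated, so latency is depth-independent."""
    write_bit(old_word, addr, 0)   # clock cycle 1: delete
    write_bit(new_word, addr, 1)   # clock cycle 2: insert

write_bit(0b010100, 5, 1)          # initial mapping: 010100 at address 5
update(0b010100, 0b111000, 5)      # paper's example: replace with 111000
```

Regardless of the CAM depth, the update touches exactly two write cycles, which mirrors the two-clock-cycle claim of the proposed mechanism.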


Fig. 3. Lookup process in configured 8 × 4 architecture of a BiCAM with L and k both equal to 2.
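The lookup flow shown in Fig. 3, where each subword addresses one row per SRAM block and the selected rows are bitwise ANDed per layer, can be modeled in Python as follows. This is an illustrative sketch under the paper's example geometry; the model and names are assumptions, not the authors' implementation.

```python
# Behavioral sketch of the lookup (Algorithm 2), assumed geometry:
# DEPTH=8, L=2 layers, K=2 blocks per layer, 3-bit subwords.
L, K, SUB_BITS, DEPTH = 2, 2, 3, 8
COLS = DEPTH // L
blocks = [[[[0] * COLS for _ in range(2 ** SUB_BITS)] for _ in range(K)]
          for _ in range(L)]

def subwords(word):
    return [(word >> (SUB_BITS * j)) & (2 ** SUB_BITS - 1) for j in range(K)]

def map_word(word, addr):
    layer, col = divmod(addr, COLS)
    for j, sw in enumerate(subwords(word)):
        blocks[layer][j][sw][col] = 1

def lookup(word):
    """All layers work concurrently: each subword addresses one row per
    block, the rows are bitwise ANDed, and a surviving 1 gives the match."""
    for layer in range(L):
        rows = [blocks[layer][j][sw] for j, sw in enumerate(subwords(word))]
        for col in range(COLS):
            if all(row[col] for row in rows):
                return layer * COLS + col
    return None                     # no set bit survives: mismatch

map_word(0b101110, 1)
map_word(0b010100, 5)
assert lookup(0b101110) == 1 and lookup(0b010100) == 5
assert lookup(0b111111) is None
```

The two loops model what the hardware does in parallel; in the emulated CAM, the row reads and the AND reduction fit in the two-clock-cycle lookup described above.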

Algorithm 3 elaborates the proposed update mechanism. To update the contents of the desired location, the existing CAM word Cw is divided into k equal-sized subwords, Csw(0) to Csw(k), and applied to the corresponding SRAM blocks of the addressed layer. In the first write cycle, the corresponding addressed bits, selected through the existing subwords at the desired location in the corresponding SRAM blocks, are cleared; this is followed by mapping the updating CAM subwords into the corresponding bit positions of the already addressed location.

Algorithm 3 Proposed Updating Algorithm

1: procedure (This algorithm is used to update the contents mapped to the addressed location at SRAM blocks)
2: [Divide the CAM word into Subwords];
3: for i = 0 to k do \\execute in parallel for all SRAMs
4:   Csw(i) = Cw(i)/k
5:   Next i
6: end for
7: [Find the Address and delete contents];
8: for i = 1 to L do \\execute in parallel for all layers
9:   for j = 1 to K do
10:    if SRAM(i,j) = Csw(i,j) then
11:      Del (SBs of SRAM(i,j))
12:      Exit
13:    end if
14:    Next i,j
15:  end for
16: end for
17: [Insert new contents];
18: SB of SRAM(i,j) = New-word
19: Exit
20: end procedure

The process thus avoids activating all layers for contents updating. Unlike all other SRAM-based CAM schemes, the proposed update mechanism consumes only two clock cycles for all sorts of updates. The proposed direct mapping and sequence-independent update procedure significantly reduce the energy consumption during both processes compared with all the reviewed SRAM-based CAM architectures.

1) Update Example: The update mechanism in the configured 8 × 4 BiCAM architecture is shown in Fig. 4. For this particular example, we select L = 2 layers of SRAM blocks having depth Rd = 8, and each block links N/L addresses, where N refers to the informative column bits of the SRAM block and L is the number of layers of cascaded SRAM blocks.

Updating the contents of address location "5" is required. In the erasing step, we concurrently provide the address of location "5" through the update control logic to all layers, which results in activating layer 2 only, as address "5" is linked to layer 2. Now, the existing least significant CAM subword Csw(0) = 100 clears the corresponding bit position of address "5" in SRAM(2,1), while the existing most significant CAM subword Csw(1) = 001 clears the corresponding bit position of location "5" in SRAM(2,2).

This deletion is followed by insertion: setting the corresponding bit position of address location "5" through the least significant updated CAM subword Csw(0) = 000 in SRAM(2,1) and setting the corresponding bit position of address location "5" through the most significant updated CAM subword Csw(1) = 111 in SRAM(2,2). The update is hence achieved through a depth-independent mechanism in exactly two clock cycles: deleting the existing mapped contents 010100 from address "5" in the first cycle, followed by mapping the updating CAM word 111000 to the same location in the second clock cycle.

IV. IMPLEMENTATION AND PERFORMANCE EVALUATION

The proposed algorithms for a BiCAM of size 64 × 36 and 512 × 36 are implemented on the Xilinx Virtex-6 FPGA device XC6VLX760, using Xilinx ISE Design Suite 14.5. Comparative analysis of update latency, speed, resource utilization, and time consumed during contents update with


TABLE II
Performance Comparison of the Proposed Mapping and Update Mechanism With the Reviewed Literature

Fig. 4. Updating process in configured 8 × 4 architecture of a BiCAM with L and k both equal to 2.

Fig. 5. Resources comparison of related work with the proposed work [P] in terms of speed (represented in MHz with red bars and black digits) and update latency (represented in clock cycles with blue bars and red digits).

Fig. 6. Resources comparison of related work with the proposed work [P] in terms of update time (represented in µs).

the published results of the latest FPGA-based CAM architectures [4]–[8], [13], [18]–[20], [22], [25], [26] is presented in Table II. The performance comparison supports our claim that the proposed fast mapping and updating mechanisms for a BiCAM significantly reduce the update latency (and obviously the power consumption) compared with all other SRAM-based architectures, as summarized in the graphs in Figs. 5 and 6.

2) Update Latency: The proposed update mechanism, unlike all other SRAM-based CAM schemes [3]–[26], avoids the activation of all layers for contents updating and consumes exactly two clock cycles for all sorts of updates, as shown in Table II and in the graphic comparison of speed and update latency in Fig. 5. All other referred architectures in Table II depend on O(N), where N is the CAM depth, whereas we incorporated the update mechanism into the CAM architecture such that it applies the CAM subwords directly as row addresses to the SRAM blocks in each layer, which eliminates the CAM depth factor from the update process. Mostly, the update time in the referred architectures [3]–[26] is based on (2), whereas the update time in the proposed algorithm is based on (3), which proves that the proposed update mechanism is independent of the CAM depth. The comparison of the time consumed for a single update, shown in the graph of Fig. 6, summarizes the story:

Ut = (1/Cs) × (2^(Cw/k) + 1)    (2)

Ut* = (1/Cs) × 2    (3)

where Cw is the CAM width, k is the number of SRAM blocks in each layer, and Cs is the speed of the SRAM-based CAMs.

3) Energy Consumption and Resource Utilization: Reduction of power consumption during contents update lies in the minimum number of write operations. The proposed update mechanism incorporates an algorithm for the update process that activates only the target layer in the CAM architecture


Only the layer where an update is required is activated, leaving all other layers deactivated, unlike the related schemes. Collectively, both factors result in a reduced power consumption during an update. Despite achieving a significant reduction in the update latency and energy consumption of the update stage, we have kept the memory resource utilization at a reasonable level, as evident from the performance comparison in Table II.

V. CONCLUSION AND FUTURE WORK

SRAM-based CAM architectures play a pivotal role in artificial intelligence, pattern recognition, file storage, and networking routers. Designers have proposed several such architectures on reconfigurable hardware, i.e., FPGAs. The state-of-the-art CAMs suffer from a high update latency during contents updating because their update mechanism depends on rearranging the entire contents along with the updated contents. This dependence on the entire CAM depth during the update stage also leads to significant power consumption in the update process.

This research work presents a different direction: a sequence-independent update mechanism that does not depend on the CAM depth. The proposed algorithm selects at most one layer of SRAM blocks for a contents update at any location, rather than activating the entire memory blocks, and ultimately consumes less energy during the update process. Thus, the proposed mapping and updating algorithms speed up the table setup and reduce the energy consumption.

Our future work includes optimizing the FPGA resource utilization of the mapping and updating algorithms and extending their scope to TCAM.

REFERENCES

[1] A. Madhavan, T. Sherwood, and D. B. Strukov, "High-throughput pattern matching with CMOL FPGA circuits: Case for logic-in-memory computing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 12, pp. 2759–2772, Dec. 2018.
[2] R. Govindaraj and S. Ghosh, "Design and analysis of STTRAM-based ternary content addressable memory cell," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, p. 52, 2017.
[3] Z. Ullah, M. K. Jaiswal, Y. C. Chan, and R. C. C. Cheung, "FPGA implementation of SRAM-based ternary content addressable memory," in Proc. IEEE 26th Int. Parallel Distrib. Process. Symp. Workshops PhD Forum, May 2012, pp. 383–389.
[4] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits, Syst., Signal Process., vol. 33, no. 10, pp. 3123–3144, Oct. 2014.
[5] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung, "Z-TCAM: An SRAM-based architecture for TCAM," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 2, pp. 402–406, Feb. 2015.
[6] M. Irfan, Z. Ullah, and R. C. C. Cheung, "Zi-CAM: A power and resource efficient binary content-addressable memory on FPGAs," Electronics, vol. 8, no. 5, p. 584, May 2019.
[7] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 12, pp. 2969–2979, Dec. 2012.
[8] Z. Ullah, M. K. Jaiswal, R. C. C. Cheung, and H. K. H. So, "UE-TCAM: An ultra efficient SRAM-based TCAM," in Proc. IEEE Region 10 Conf. (TENCON), Nov. 2015, pp. 1–6.
[9] S.-H. Yang, Y.-J. Huang, and J.-F. Li, "A low-power ternary content addressable memory with pai-sigma matchlines," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 10, pp. 1909–1913, Oct. 2012.
[10] B.-D. Yang, Y.-K. Lee, S.-W. Sung, J.-J. Min, J.-M. Oh, and H.-J. Kang, "A low power content addressable memory using low swing search lines," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 12, pp. 2849–2858, Dec. 2011.
[11] V. S. Satti and S. Sriadibhatla, "Hybrid self-controlled precharge-free CAM design for low power and high performance," Turkish J. Electr. Eng. Comput. Sci., vol. 27, no. 2, pp. 1132–1146, 2019.
[12] I. Ullah, Z. Ullah, U. Afzaal, and J.-A. Lee, "DURE: An energy- and resource-efficient TCAM architecture for FPGAs with dynamic updates," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1298–1307, Jun. 2019.
[13] I. Ullah, Z. Ullah, and J.-A. Lee, "EE-TCAM: An energy-efficient SRAM-based TCAM on FPGA," Electronics, vol. 7, no. 9, p. 186, Sep. 2018.
[14] Y.-J. Chang and Y.-H. Liao, "Hybrid-type CAM design for both power and performance efficiency," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 965–974, Aug. 2008.
[15] D. B. Grover, R. J. Stephani, and C. D. Browning, "Low power content addressable memory hitline precharge and sensing circuit," U.S. Patent 13 456 419, Oct. 31, 2013.
[16] D. Jothi and R. Sivakumar, "Design and analysis of power efficient binary content addressable memory (PEBCAM) core cells," Circuits, Syst., Signal Process., vol. 37, no. 4, pp. 1422–1451, Apr. 2018.
[17] Y.-J. Chang, K.-L. Tsai, and H.-J. Tsai, "Low leakage TCAM for IP lookup using two-side self-gating," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 6, pp. 1478–1486, Jun. 2013.
[18] W. Jiang, "Scalable ternary content addressable memory implementation using FPGAs," in Proc. 9th ACM/IEEE Symp. Archit. Netw. Commun. Syst. (ANCS), Piscataway, NJ, USA, Oct. 2013, pp. 71–82.
[19] A. Ahmed, K. Park, and S. Baeg, "Resource-efficient SRAM-based ternary content addressable memory," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1583–1587, Apr. 2017.
[20] D.-Y. Lee, C.-C. Wang, and A.-Y. Wu, "Bundle-updatable SRAM-based TCAM design for OpenFlow-compliant packet processor," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1450–1454, Jun. 2019.
[21] I. Ullah, Z. Ullah, and J.-A. Lee, "Efficient TCAM design based on multipumping-enabled multiported SRAM on FPGA," IEEE Access, vol. 6, pp. 19940–19947, 2018.
[22] F. Syed, Z. Ullah, and M. K. Jaiswal, "Fast content updating algorithm for an SRAM-based TCAM on FPGA," IEEE Embedded Syst. Lett., vol. 10, no. 3, pp. 73–76, Sep. 2018.
[23] H. Mahmood, Z. Ullah, O. Mujahid, I. Ullah, and A. Hafeez, "Beyond the limits of typical strategies: Resources efficient FPGA-based TCAM," IEEE Embedded Syst. Lett., vol. 11, no. 3, pp. 89–92, Sep. 2019.
[24] P. Reviriego, A. Ullah, and S. Pontarelli, "PR-TCAM: Efficient TCAM emulation on Xilinx FPGAs using partial reconfiguration," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1952–1956, Aug. 2019.
[25] I. Ullah, J.-S. Yang, and J. Chung, "ER-TCAM: A soft-error-resilient SRAM-based ternary content-addressable memory for FPGAs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 4, pp. 1084–1088, Apr. 2020.
[26] Z. Qian and M. Margala, "Low power RAM-based hierarchical CAM on FPGA," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2014, pp. 1–4.

Azhar Qazi received the B.Sc. degree (Hons.) and the M.S. degree in electrical engineering (communication) from the University of Engineering and Technology, Peshawar, Pakistan, in 2006 and 2014, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar.
His research area includes designing fast updating and mapping algorithms for static random access memory (SRAM)-based content-addressable memories (CAMs) on field-programmable gate array (FPGA).


Zahid Ullah (Member, IEEE) received the B.Sc. degree (Hons.) in computer system engineering from the University of Engineering and Technology, Peshawar, Pakistan, in 2006, the M.S. degree in electronic, electrical, control, and instrumentation engineering from Hanyang University, Seoul, South Korea, in 2010, and the Ph.D. degree in electronic engineering from the City University of Hong Kong, Hong Kong, in 2014.
He was an Associate Professor and the Chairman of the Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar. He is currently an Assistant Professor and the Head of the Department of Electrical and Computer Engineering, Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology, Haripur, Pakistan. He has authored prestigious journal and conference papers and holds patents in his name in the field of field-programmable gate array (FPGA)-based TCAM. His research interests include low-power/high-speed content-addressable memory (CAM) design on FPGA, reconfigurable computing, pattern recognition, embedded systems, and image processing.

Abdul Hafeez received the Ph.D. degree from Virginia Tech, Blacksburg, VA, USA, in 2014, with a focus on high-performance computing and machine learning.
During his Ph.D., he collaborated with The University of Texas at Arlington, Arlington, TX, USA, IBM Almaden, San Jose, CA, USA, and IBM Dublin, Dublin, Ireland, on how to leverage parallel computing and machine learning for bionano sensing and protein simulations. He worked as an Adjunct Faculty Member with the Department of Computer Science, Virginia Tech. In 2015, he was a Post-Doctoral Fellow with Georgia Tech, Atlanta, GA, USA, where he focused on materials informatics to establish an e-collaboration platform for data scientists, material scientists, and manufacturing experts and worked as a Principal Investigator on the GT-FIRE project. Following his Post-Doctoral Fellowship, he joined the Department of Computer Systems Engineering, University of Engineering and Technology, Peshawar, Pakistan, as an Assistant Professor.
