An Efficient PCI Based Neural Net Engine


The Neural Net Engine (NNE) card is a PCI compatible module which implements the computationally intensive aspects of a neural net recognition algorithm. This card was specifically designed for cost effective execution of our client's proprietary neural networks which are used for optical character recognition. The input format uses 256 binary (single bit) pels to represent the unknown character. Characters may be input via the PCI bus as eight 32 bit words, or in bit serial fashion using a dedicated mezzanine bus attached to other hardware in the system. The output consists of a series of 16 bit scores representing the degree of correlation to each character in the target character set.
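As a rough illustration of the PCI input format, the sketch below packs a 256 pel character cell into eight 32 bit words and unpacks it again. The bit ordering (pel 0 in the least significant bit of word 0) and the function names are assumptions made for this example; the text above does not specify how pels map onto word bits.

```python
PELS_PER_CHARACTER = 256
BITS_PER_WORD = 32
WORDS_PER_CHARACTER = PELS_PER_CHARACTER // BITS_PER_WORD  # 8 words


def pack_pels(pels):
    """Pack 256 binary pels into eight 32 bit words.

    Assumption: pel 0 goes into bit 0 (LSB) of word 0.
    """
    if len(pels) != PELS_PER_CHARACTER:
        raise ValueError("expected exactly 256 pels")
    words = []
    for w in range(WORDS_PER_CHARACTER):
        word = 0
        for b in range(BITS_PER_WORD):
            if pels[w * BITS_PER_WORD + b]:
                word |= 1 << b
        words.append(word)
    return words


def unpack_pels(words):
    """Recover the 256 pels from eight 32 bit words (same assumed ordering)."""
    return [(words[i // BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1
            for i in range(PELS_PER_CHARACTER)]
```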

The Neural Net Engine flexibly implements neural net algorithms which require a single layer of hidden nodes. There is sufficient local storage for a large number of coefficients, permitting multiple character sets to be loaded and held for eventual execution. Each character set may have a different number of hidden nodes per character, ranging from 16 to 256. This permits a very compact (and fast) representation for certain character sets such as OCR-A or OCR-B, and a more complex representation for multi-font machine print or handprint.

The client's initial implementation of the Neural Net Engine used an array of up to 512 DSP computers, each operating at 20 MHz. Assuming 100% efficiency of the DSP code, the computer array could execute 10 billion neural net computations per second. At a cost of more than $40,000 this was an expensive computational resource. A single NNE card can deliver 4 billion operations per second for just a few percent of this cost. This equates to 8,000 characters per second in a typical handprint recognition application, which far exceeds the capabilities of the paper transport and video scanners. For very high speed applications, up to 8 NNE cards can be operated in parallel for a total of 32 giga-ops per second.

A block diagram of the Neural Net Engine is shown in Figure 1. The circuit is implemented on a PCI card and uses the PLX9050 device as an interface to the PCI bus. An XC9536 CPLD provides some of the local bus controls and facilitates loading of the three FPGAs via the PCI bus. An identical pair of XC4010 devices provides the first stage of the neural net computation (node scoring). Note that each of the node processors is attached to a very wide coefficient memory (128 bits comprising four 32 bit devices). Together, the two memories deliver 256 bits of coefficient data on every tick of the clock, allowing 32 hidden node computations to operate in parallel for increased throughput. An XC4013 contains the final scoring circuit and the overall control structure.



The Neural Net Algorithm

The neural net computation for handprint is based on 128 hidden nodes and a character cell of 256 binary (single bit) pels. Each hidden node examines the 256 input pels in sequence. For every ONE in the character cell, the corresponding 6 bit hidden node weight is added to an accumulator. For a ZERO, the corresponding weight is subtracted. When all 256 pels have been processed in this manner, the accumulator contains a Hidden Node Score. The handprint data set contains 128 hidden nodes for each possible character. Thus there are 256 x 128 = 32,768 coefficients per character. A numeric-only character set would require approximately 320,000 coefficients and a full alphanumeric character set would require at least 2.6 million coefficients.
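The accumulation rule described above can be sketched in a few lines of software. This is a minimal model only: the function name is hypothetical, and weights are treated as ordinary signed integers because the text does not state how the 6 bit weights are encoded.

```python
PELS = 256
HIDDEN_NODES_PER_CHARACTER = 128

# Coefficients needed for one character of the handprint set.
COEFFICIENTS_PER_CHARACTER = PELS * HIDDEN_NODES_PER_CHARACTER  # 32,768


def hidden_node_score(pels, weights):
    """Add the weight for every ONE pel, subtract it for every ZERO pel."""
    if len(pels) != len(weights):
        raise ValueError("one weight is required per pel")
    accumulator = 0
    for pel, weight in zip(pels, weights):
        if pel:
            accumulator += weight
        else:
            accumulator -= weight
    return accumulator
```

In the hardware each hidden node runs this loop over all 256 pels; the function accepts shorter lists only to make small hand checks easy.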

In order to gain computational throughput without raising the clock rate, Presco modified the neural net algorithm to process 4 pels of the unknown character during each computational cycle. This means that a hidden node computation can be accomplished in 64 cycles instead of 256, raising the effective computational rate by 4x. Each time we process a four pel group, we fetch a "super-coefficient" which represents the sum/difference of the four original coefficients matching the input pels. For each four pel group we have to store 16 of these super-coefficients to represent all possible combinations of the four input pels. The data processing architecture is shown below:
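As a software illustration of the four pel lookup (not the hardware data path referred to above), the sketch below precomputes the 16 super-coefficients for each group of four weights and checks that the table-driven score matches the pel-by-pel score. The function names, the bit order within a group, and the symmetric weight range used in the usage note are assumptions for this example.

```python
GROUP_SIZE = 4
PATTERNS_PER_GROUP = 1 << GROUP_SIZE  # 16 combinations of four pels


def build_super_coefficients(weights):
    """Return one 16-entry table per four pel group.

    Entry `pattern` holds the sum/difference of the four weights, adding a
    weight where the matching pattern bit is 1 and subtracting it where 0.
    Assumption: the first pel of a group maps to bit 0 of the pattern.
    """
    tables = []
    for start in range(0, len(weights), GROUP_SIZE):
        group = weights[start:start + GROUP_SIZE]
        table = []
        for pattern in range(PATTERNS_PER_GROUP):
            total = 0
            for bit, weight in enumerate(group):
                total += weight if (pattern >> bit) & 1 else -weight
            table.append(total)
        tables.append(table)
    return tables


def score_with_tables(pels, tables):
    """Accumulate one super-coefficient per four pel group (64 lookups)."""
    accumulator = 0
    for g, table in enumerate(tables):
        pattern = 0
        for bit in range(GROUP_SIZE):
            pattern |= pels[g * GROUP_SIZE + bit] << bit
        accumulator += table[pattern]
    return accumulator


def score_direct(pels, weights):
    """Reference: the original one-pel-per-cycle accumulation (256 steps)."""
    return sum(w if p else -w for p, w in zip(pels, weights))
```

For 256 weights the first function yields 64 tables of 16 entries each, so storage grows by a factor of four (1,024 entries instead of 256 per hidden node) in exchange for a fourfold reduction in cycles. If the 6 bit weights are taken to span -31 to +31, every super-coefficient falls between -124 and +124.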

The NNE card utilizes 32 hidden node processors (16 per XC4010 LCA) each operating at 33 MHz (PCI bus clock rate) and processing four pels per clock tick. This produces an effective computational rate of 4.2 billion node cycles per second. The 4 bit character segments are presented to the SDRAM lookup tables as part of the memory address (higher address bits represent position within the character plus character within the coefficient set). The 256 bit x 512K SDRAM provides sufficient storage for 128 handprint character masks.
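The figures in this paragraph can be checked with a little arithmetic, sketched below. The address layout is an assumption consistent with the description (pel pattern in the low four bits, position and character above it); in particular, where the group of 32 parallel nodes sits within the address is not stated in the text, and the function name is hypothetical.

```python
NODE_PROCESSORS = 32          # 16 per XC4010
CLOCK_HZ = 33_000_000         # nominal PCI bus clock
PELS_PER_TICK = 4

# Effective rate: 4,224,000,000 node cycles per second (about 4.2 billion).
NODE_CYCLES_PER_SECOND = NODE_PROCESSORS * CLOCK_HZ * PELS_PER_TICK

MEMORY_WIDTH_BITS = 256
MEMORY_DEPTH_WORDS = 512 * 1024
# 256 bits shared by 32 node processors leaves 8 bits per node per tick.
BITS_PER_NODE_PER_TICK = MEMORY_WIDTH_BITS // NODE_PROCESSORS

HIDDEN_NODES = 128            # handprint
GROUPS_PER_CHARACTER = 256 // PELS_PER_TICK        # 64 four pel positions
PATTERNS_PER_GROUP = 16
NODE_BATCHES = HIDDEN_NODES // NODE_PROCESSORS     # 4 passes of 32 nodes

WORDS_PER_CHARACTER = PATTERNS_PER_GROUP * GROUPS_PER_CHARACTER * NODE_BATCHES
CHARACTER_MASKS = MEMORY_DEPTH_WORDS // WORDS_PER_CHARACTER  # 128


def coefficient_address(character, node_batch, group, pattern):
    """Form a 19 bit word address (assumed layout, low bits = pel pattern)."""
    address = character
    address = address * NODE_BATCHES + node_batch
    address = address * GROUPS_PER_CHARACTER + group
    address = address * PATTERNS_PER_GROUP + pattern
    return address
```

Under these assumptions each handprint character occupies 4,096 memory words, and the highest address, for character 127, fills the 512K depth exactly.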

Summary

The Neural Net Engine is a space efficient, low cost PCI card that replaces a large array of DSP computers costing over $40,000. By placing the computational algorithm in dedicated FPGAs, we are able to outperform the DSP computers by a wide margin. A key element for achieving a cost effective high speed implementation was to recast the original neural net algorithm to operate on four pel groups instead of single pels. This algorithmic improvement exploits the fact that SDRAM is inexpensive and that earlier designs did not use the full memory depth. Today, the Neural Net Engine is arguably the world's fastest and most accurate handprint recognizer. It has also been used successfully for OCR-A, OCR-B, and multi-font machine print.