An Efficient PCI Based Neural Net Engine
The Neural Net Engine (NNE) card is a PCI-compatible module
that implements the computationally intensive aspects of
a neural net recognition algorithm. This card was specifically
designed for cost-effective execution of our client's proprietary
neural networks, which are used for optical character recognition.
The input format uses 256 binary (single-bit) pels to represent
the unknown character. Characters may be input via the PCI
bus as eight 32-bit words, or in bit-serial fashion using
a dedicated mezzanine bus attached to other hardware in the
system. The output consists of a series of 16-bit scores representing
the degree of correlation to each character in the target
character set.
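As an illustration of the PCI input format, the sketch below packs a
256-pel character cell into eight 32-bit words. The bit ordering (pel 0
in the least significant bit of word 0) and the function names are
assumptions made for this example; the source does not specify them.

```python
def pack_pels(pels):
    """Pack 256 binary pels into eight 32-bit words.

    Assumption: pel 0 maps to the least significant bit of word 0.
    The real card's bit ordering is not given in the text.
    """
    if len(pels) != 256:
        raise ValueError("a character cell must contain exactly 256 pels")
    words = []
    for w in range(8):
        word = 0
        for b in range(32):
            if pels[32 * w + b]:
                word |= 1 << b
        words.append(word)
    return words


def unpack_pels(words):
    """Inverse of pack_pels: recover the 256 pels from eight words."""
    if len(words) != 8:
        raise ValueError("expected eight 32-bit words")
    return [(words[i // 32] >> (i % 32)) & 1 for i in range(256)]
```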
The Neural Net Engine flexibly implements neural net algorithms
which require a single layer of hidden nodes. There is sufficient
local storage for a large number of coefficients, permitting
multiple character sets to be loaded and held for eventual
execution. Each character set may have a different number
of hidden nodes per character, ranging from 16 to 256. This
permits a very compact (and fast) representation for certain
character sets such as OCR-A or OCR-B, and a more complex
representation for multi-font machine print or handprint.
The client's initial implementation of the Neural Net Engine
used an array of up to 512 DSP computers, each operating at
20 MHz. Assuming 100% efficiency of the DSP code, the computer
array could execute 10 billion neural net computations per
second. At a cost of more than $40,000 this was an expensive
computational resource. A single NNE card can deliver 4 billion
operations per second for just a few percent of this cost.
This equates to 8,000 characters per second in a typical handprint
recognition application, which far exceeds the capabilities
of the paper transport and video scanners. For very high speed
applications, up to 8 NNE cards can be operated in parallel
for a total of 32 giga-ops per second.
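The figures in this comparison can be checked with a few lines of
arithmetic. The sketch assumes one neural net computation per DSP per
clock tick (the 100% efficiency case quoted above); the per-character
operation budget is derived here and is not a figure stated in the text.

```python
# DSP array: up to 512 processors at 20 MHz, one operation per clock
# (the 100% efficiency assumption quoted in the text).
DSP_COUNT = 512
DSP_CLOCK_HZ = 20_000_000
dsp_array_ops = DSP_COUNT * DSP_CLOCK_HZ

# NNE: 4 billion operations per second per card, up to 8 cards.
NNE_CARD_OPS = 4_000_000_000
MAX_CARDS = 8
multi_card_ops = NNE_CARD_OPS * MAX_CARDS

# Derived: operations available per character at 8,000 characters/s.
CHARS_PER_SECOND = 8_000
ops_per_character = NNE_CARD_OPS // CHARS_PER_SECOND

print(f"DSP array:         {dsp_array_ops:,} ops/s")
print(f"8 NNE cards:       {multi_card_ops:,} ops/s")
print(f"Ops per character: {ops_per_character:,}")
```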
A block diagram of the Neural Net Engine is shown in Figure
1. The circuit is implemented on a PCI card and uses the PLX9050
device as an interface to the PCI bus. An XC9536 CPLD provides
some of the local bus controls and facilitates loading of
the three FPGAs via the PCI bus. An identical pair of XC4010
devices provides the first stage of the neural net computation
(node scoring). Each of the two node processor devices is attached
to its own very wide coefficient memory (128 bits, comprising four
32-bit devices). Together, the pair can fetch 256 bits
of coefficient data on every tick of the clock, allowing
32 hidden node computations to operate in parallel for increased
throughput. An XC4013 contains the final scoring circuit and
the overall control structure.

The Neural Net Algorithm
The neural net computation for handprint is based on 128 hidden
nodes and a character cell of 256 binary (single-bit) pels. Each
hidden node examines the 256 input pels in sequence. For every
ONE in the character cell, a corresponding hidden node weight
of 6 bits is added to an accumulator. For a ZERO, the corresponding
weight is subtracted. When all 256 pels have been processed
in this manner, the accumulator contains a Hidden Node Score.
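A minimal software sketch of this accumulation is given below. The
function name is illustrative, and weights are treated as plain
integers; the text only says each weight is 6 bits and does not say
whether it is signed.

```python
def hidden_node_score(pels, weights):
    """Accumulate one Hidden Node Score.

    For every ONE pel the matching weight is added; for every ZERO
    pel it is subtracted. Weights are plain integers here (assumption:
    the 6-bit encoding used by the hardware is not modeled).
    """
    if len(pels) != len(weights):
        raise ValueError("need exactly one weight per pel")
    score = 0
    for pel, weight in zip(pels, weights):
        score += weight if pel else -weight
    return score
```

For example, pels `[1, 0, 1, 0]` with weights `[3, 5, 7, 2]` give
3 - 5 + 7 - 2 = 3. A real character cell would supply 256 pels and
256 weights.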
The handprint data set contains 128 hidden nodes for each
possible character. Thus there are 256 x 128 = 32,768 (roughly
32,000) coefficients per character. A numeric-only character set
would require approximately 320,000 coefficients and a full
alphanumeric character set would require at least 2.6 million
coefficients.
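The coefficient counts follow directly from the cell size and node
count. In the sketch below, the ten-character numeric set is an
assumption (digits 0 to 9), and the implied size of the alphanumeric
set is an inference from the 2.6 million figure, not a number given in
the text.

```python
PELS_PER_CHARACTER = 256
HIDDEN_NODES = 128

# One coefficient per pel per hidden node.
coeffs_per_character = PELS_PER_CHARACTER * HIDDEN_NODES

# Assumption: a numeric-only set holds the ten digits 0-9.
numeric_set_coeffs = 10 * coeffs_per_character

# Inference: how many characters does "2.6 million" correspond to?
implied_alphanumeric_chars = 2_600_000 / coeffs_per_character

print(coeffs_per_character)                   # 32768
print(numeric_set_coeffs)                     # 327680
print(round(implied_alphanumeric_chars, 1))   # 79.3
```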
In order to gain computational throughput without raising
the clock rate, Presco modified the neural net algorithm to
process 4 pels of the unknown character during each computational
cycle. This means that a hidden node computation can be accomplished
in 64 cycles instead of 256, raising the effective computational
rate by 4x. Each time we process a four-pel group, we fetch
a "super-coefficient" which represents the sum/difference
of the four original coefficients matching the input pels.
For each four-pel group we have to store 16 super-coefficients,
one for every possible combination of the four input pels
(2^4 = 16). The data processing
architecture is shown below:
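In software terms, the four-pel lookup can be sketched as follows. The
function names and the bit order within a group (first pel in bit 0)
are assumptions for illustration; the sketch only shows that the
table-driven score equals the pel-by-pel score.

```python
def direct_score(pels, weights):
    """Reference: add the weight for a ONE pel, subtract for a ZERO."""
    return sum(w if p else -w for p, w in zip(pels, weights))


def build_super_coefficients(weights):
    """Build one 16-entry table per group of four weights.

    Entry `pattern` holds the sum/difference of the four weights for
    that combination of pels (assumption: first pel of the group is
    bit 0 of the pattern).
    """
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of four")
    tables = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        table = []
        for pattern in range(16):
            total = 0
            for bit, weight in enumerate(group):
                total += weight if (pattern >> bit) & 1 else -weight
            table.append(total)
        tables.append(table)
    return tables


def super_coefficient_score(pels, tables):
    """Score a character using one table lookup per four-pel group."""
    score = 0
    for group_index, table in enumerate(tables):
        pattern = 0
        for bit in range(4):
            if pels[4 * group_index + bit]:
                pattern |= 1 << bit
        score += table[pattern]
    return score
```

With 256 pels this performs 64 lookups per hidden node instead of 256
add/subtract steps, at the cost of storing 16 table entries in place of
every four original coefficients.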

The NNE card utilizes 32 hidden node processors (16 per XC4010 LCA)
each operating at 33 MHz (PCI bus clock rate) and processing
four pels per clock tick. This produces an effective computational
rate of 4.2 billion node cycles per second. The 4-bit character
segments are presented to the SDRAM lookup tables as part
of the memory address (higher address bits represent position
within the character plus character within the coefficient
set). The 256-bit x 512K SDRAM provides sufficient storage
for 128 handprint character masks.
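The rate and capacity figures above can be reproduced with the sketch
below. The address field layout is hypothetical: the text only says
that the 4-bit pel segment forms part of the address and that higher
bits select position and character. The 2-bit "node pass" field (128
hidden nodes handled 32 at a time) is an inference, as is the 8 bits
per node per fetch.

```python
NODE_PROCESSORS = 32
CLOCK_HZ = 33_000_000           # nominal PCI bus clock
PELS_PER_TICK = 4
node_cycles_per_second = NODE_PROCESSORS * CLOCK_HZ * PELS_PER_TICK

MEMORY_WIDTH_BITS = 256
bits_per_node_per_fetch = MEMORY_WIDTH_BITS // NODE_PROCESSORS

# Hypothetical address fields, low to high.
PATTERN_BITS = 4   # 16 combinations of a four-pel group
GROUP_BITS = 6     # 64 four-pel groups per 256-pel character
PASS_BITS = 2      # 128 hidden nodes / 32 in parallel = 4 passes
CHAR_BITS = 7      # 128 character masks

WORDS_PER_CHARACTER = 1 << (PATTERN_BITS + GROUP_BITS + PASS_BITS)
SDRAM_WORDS = 512 * 1024
characters_that_fit = SDRAM_WORDS // WORDS_PER_CHARACTER


def sdram_address(char_index, node_pass, group, pattern):
    """Form a word address from the hypothetical fields above."""
    fields = [
        (char_index, CHAR_BITS),
        (node_pass, PASS_BITS),
        (group, GROUP_BITS),
        (pattern, PATTERN_BITS),
    ]
    address = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value out of range")
        address = (address << width) | value
    return address
```

Under these assumptions each character mask occupies 4,096 words of
256 bits, so the 512K-word memory holds exactly 128 masks, which
matches the capacity quoted in the text.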
Summary
The Neural Net Engine is a space-efficient, low-cost PCI card that
replaces a large array of DSP computers costing over $40,000. By placing
the computational algorithm in dedicated FPGAs, we are able
to outperform the DSP computers by a wide margin. A key element
in achieving a cost-effective, high-speed implementation was
to recast the original neural net algorithm to operate on
four-pel groups instead of single pels. This algorithmic improvement
exploits the fact that SDRAM is inexpensive, yet
earlier designs failed to use the full memory depth. Today,
the Neural Net Engine is arguably the world's fastest and
most accurate handprint recognizer. It has also been used
successfully for OCR-A, OCR-B, and multi-font machine print.