
Copyright:
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/

DOI link to article:
http://dx.doi.org/10.1109/TBCAS.2015.2460232

Date deposited:
12/10/2015
Real-Time Simulation of Passage-of-Time Encoding in Cerebellum Using a Scalable FPGA-Based System

Junwen Luo, Graeme Coapes, Terrence Mak, Tadashi Yamazaki, Chung Tin, and Patrick Degenaar

Abstract—The cerebellum plays a critical role in sensorimotor control and learning. However, dysmetria or delays in movement onset resulting from damage to the cerebellum cannot currently be cured completely. Neuroprosthesis is an emerging technology that can potentially substitute for such a motor control module in the brain. A prerequisite for this to become practical is the capability to simulate the cerebellum model in real time, with low timing distortion for proper interfacing with the biological system. In this paper, we present a frame-based network-on-chip (NoC) hardware architecture for implementing a bio-realistic cerebellum model with ~100 000 neurons, which has been used for studying timing control or passage-of-time (POT) encoding mediated by the cerebellum. The simulation results verify that our implementation properly reproduces the POT representation by the cerebellum. Furthermore, our field-programmable gate array (FPGA)-based system demonstrates excellent computational speed: it can complete 1 s of real-world activity within 25.6 ms. It is also highly scalable, such that it can maintain approximately the same computational speed even if the neuron number increases by one order of magnitude. Our design is shown to outperform three alternative approaches previously used for implementing spiking neural network models. Finally, we show a hardware electronic setup and illustrate how the silicon cerebellum can be adapted as a potential neuroprosthetic platform for future biological or clinical applications.

Index Terms—Cerebellum, field-programmable gate array (FPGA), network on chip (NoC), neural-rehabilitation, passage-of-time (POT).
A hardware platform with versatile I/Os would prove valuable, offering noise-free, scalable communication bandwidth and precise timing management features. A scalable hardware platform that can be tailor-designed and takes advantage of highly parallel computing capability would be greatly preferred. Such a system would be a powerful tool to help explore the POT mechanism and related disease mechanisms in the cerebellum. Future neuroprosthetic developments could also benefit from an efficient hardware platform for implementing a large-scale spiking network model for real-time computation.

In general, CPU-based processing platforms are limited by their sequential computational architecture. The large latency makes them difficult to use in real-time brain-machine interfaces (BMI). GPUs, on the other hand, are capable of parallel computing but are constrained by memory and communication bandwidth issues [13]. Models can be implemented directly in CMOS [14], [15], but a single implementation can be time consuming. Field-programmable gate arrays (FPGAs) are a versatile reconfigurable digital computational platform which can be used both for direct computational implementation and as a stepping stone to compact low-power CMOS chip implementation. They contain massive flexible programmable logic with concurrent operation, allowing direct use in bench-top in vitro and constrained in vivo systems. If designs are then translated to CMOS, the subsequent chips can be applied to implantable neuroprosthetic devices. In recent years, FPGAs have been extensively used in the modeling and simulation of large-scale biologically realistic neural systems [16]–[19].

Hardware implementations of cerebellar neural networks for neuroprosthesis have already attracted the interest of neuroscientists and engineers. Bamford et al. [15] have designed a VLSI field-programmable mixed-signal array that reproduces eyeblink conditioning performance by modeling the cerebellar system. This has been fabricated as a core on a chip prototype intended for use in an implantable closed-loop prosthetic system aimed at rehabilitation of associated behavior. While they have demonstrated a successful proof of concept, their work uses a highly simplified neural model with abstract modeling of cerebellar information processing. Such simplification is convenient for hardware implementation, but lacks direct physiological correspondence for quantitative comparison with the biological system. In contrast, Yamazaki and Tanaka's model [11] is more biologically realistic and pays specific attention to the role of the granular-Golgi layer in timing and gain control by the cerebellar cortex in order to reproduce experimental results. However, this comes at the cost of a significant increase in the size and complexity of the computational model needed to produce robust system behavior. As such, an efficient implementation is required to overcome these computational challenges, especially when real-time application is required.

Previously we presented the concept of an FPGA-based network-on-chip (NoC) hardware architecture for implementing the granular layer of a random projection cerebellum model [20], [21]. It produced a network behavior of POT representation consistent with the simulation results presented in the original paper by Yamazaki and Tanaka [11]. In this work we present a more in-depth investigation of the implementation details and an analysis of system performance. The system contains ~100 000 granule cells and ~1000 Golgi cells, using a conductance-based, leaky integrate-and-fire neuron model. The parameter values all have an experimental basis, such that the network model produces realistic firing behavior. In particular, three accomplishments are highlighted in this paper: 1) we have reproduced the granular layer firing patterns for representation of POT in real time under normal as well as pharmacologically perturbed conditions; 2) our architecture allows for efficient scalability to 100 000 neurons and beyond and can be used for more complex biological neural network applications; and 3) we have eliminated multiplexing timing errors and allow for network profiling at key time points.

II. THE PASSAGE-OF-TIME COMPUTATIONAL MODEL

The cerebellar granular layer consists of two main cell types, namely the granule cells and Golgi cells. The input signal from the pre-cerebellar nucleus to the granule cells is conveyed by mossy fibers (MFs) (Fig. 1). The spiking network of the cerebellar granular layer proposed in [11] is modelled as a 1 mm² virtual sheet composed of a square lattice arrangement of 32×32 Golgi cells and glomeruli, and 320×320 granule cells. The same network with minor changes is used in this paper. Fig. 2 describes the topology between Golgi and granule cells.

Fig. 2(a) illustrates the topology of our granular layer model, which contains 1024 granule-cell clusters and Golgi cells; the different colors represent communities of closely connected cells within the network. Each granule-cell cluster contains 100 granule cells. Each dot represents one granule-cell cluster and one Golgi cell, as shown in Fig. 2(b), and the size of each circle is proportional to the number of other clusters to which it is connected. Every Golgi cell receives excitatory input from its nearest granule-cell cluster, while Golgi cells project randomly to nearby granule-cell clusters such that each granule-cell cluster receives inhibitory inputs from ~8 Golgi cells on average. The probability distribution of the number of synaptic connections from Golgi cells to a granule-cell cluster is shown in Fig. 2(c).
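The random Golgi-to-cluster projection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the neighbourhood radius, the per-target connection probability and the wrap-around boundary are hypothetical choices made here so that the mean number of inhibitory inputs per cluster comes out near 8; they are not taken from the paper.

```python
import random

def build_golgi_to_cluster(n=32, radius=4, p=0.1, seed=0):
    """Map each Golgi cell (gx, gy) on an n x n lattice to a list of target clusters.

    Assumptions (hypothetical): each Golgi cell considers the (2*radius+1)^2
    clusters around it, connects to each with probability p, and the lattice
    wraps around at the edges.
    """
    rng = random.Random(seed)
    targets = {}
    for gx in range(n):
        for gy in range(n):
            dests = []
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    if rng.random() < p:
                        dests.append(((gx + dx) % n, (gy + dy) % n))
            targets[(gx, gy)] = dests
    return targets

def mean_in_degree(targets, n=32):
    """Average number of Golgi inputs per granule-cell cluster."""
    total_edges = sum(len(dests) for dests in targets.values())
    return total_edges / (n * n)
```

With these illustrative values each Golgi cell has 81 candidate targets, so the expected number of inputs per cluster is 81 × 0.1 = 8.1.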
The equations for modeling the neurons and analysis have been detailed in [11] and we briefly repeat the key ones here. The granule and Golgi cells were modelled as conductance-based, leaky integrate-and-fire units, as described in (1)

\[
C\frac{dV(t)}{dt} = g_{\text{leak}}(E_{\text{leak}} - V(t)) + g_{\text{exc:AMPA}}(t)(E_{\text{exc}} - V(t)) + g_{\text{NMDA}}(t)(E_{\text{NMDA}} - V(t)) + g_{\text{inh}}(t)(E_{\text{inh}} - V(t)) + g_{\text{ahp}}(t - \hat{t})(E_{\text{ahp}} - V(t))
\]

where \(V(t)\) and \(C\) are the membrane potential at time \(t\) and the capacitance, respectively, the \(E\)'s are the reversal potentials and \(\hat{t}\) denotes the last firing time of the neuron. The membrane potential depends on five types of currents: the \(\alpha\)-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid (AMPA) receptor-mediated current, the N-methyl-D-aspartate (NMDA) receptor-mediated current, the leak current, the inhibitory current and the after-hyperpolarization current. The conductances \(g(t)\) are calculated by convolving the alpha function \(\alpha(t)\) with the spike events \(\delta_j(t)\) of presynaptic neuron \(j\) at time \(t\) as follows:

\[
g_j(t) = \bar{g}_j \sum_{j} w_{ij} \int_{-\infty}^{t} \alpha(t - s) \delta_j(s) ds
\]

where \(\bar{g}_j\) is the maximum conductance and \(w_{ij}\) is the synaptic weight from presynaptic neuron \(j\). A neuron fires a spike at time \(t\) (\(\delta_j(t) = 1\)) when its membrane potential exceeds a threshold \(\vartheta\), after which the after-hyperpolarization follows. The conductance for the after-hyperpolarization is given by

\[
g_{\text{ahp}}(t - \hat{t}) = \exp(-(t - \hat{t})/\tau_{\text{ahp}}).
\]
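To make the neuron dynamics concrete, the membrane equation and the after-hyperpolarization term can be integrated with a simple forward-Euler step. The sketch below is not the authors' implementation: all parameter values in `P`, the time step, and the scaling factor `g_ahp_bar` on the after-hyperpolarization conductance are hypothetical and were picked only so that the example runs stably.

```python
import math

# Hypothetical parameters for illustration only (not the values used in [11]).
P = {"C": 1.0, "g_leak": 0.05, "E_leak": -60.0, "E_exc": 0.0, "E_nmda": 0.0,
     "E_inh": -75.0, "E_ahp": -80.0, "g_ahp_bar": 5.0, "tau_ahp": 5.0,
     "theta": -40.0}

def lif_step(v, g, since_spike, p, dt):
    """One forward-Euler step of the conductance-based membrane equation.

    g holds the instantaneous AMPA, NMDA and inhibitory conductances;
    since_spike is the time since the last spike, or None if none yet.
    """
    g_ahp = 0.0
    if since_spike is not None:
        # Exponentially decaying after-hyperpolarization conductance.
        g_ahp = p["g_ahp_bar"] * math.exp(-since_spike / p["tau_ahp"])
    dv = (p["g_leak"] * (p["E_leak"] - v)
          + g["ampa"] * (p["E_exc"] - v)
          + g["nmda"] * (p["E_nmda"] - v)
          + g["inh"] * (p["E_inh"] - v)
          + g_ahp * (p["E_ahp"] - v)) / p["C"]
    v_new = v + dt * dv
    return v_new, v_new > p["theta"]

def simulate(g_ampa, steps, p=P, dt=0.1):
    """Drive one neuron with a constant AMPA conductance; return spike times."""
    v, last, spikes = p["E_leak"], None, []
    for k in range(steps):
        t = k * dt
        since = None if last is None else t - last
        v, fired = lif_step(v, {"ampa": g_ampa, "nmda": 0.0, "inh": 0.0},
                            since, p, dt)
        if fired:
            spikes.append(t)
            last = t
    return spikes
```

With no input the membrane rests at `E_leak` and never fires, whereas a sufficiently large constant excitatory conductance produces repeated spikes separated by the after-hyperpolarization.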

We followed the same analysis procedures as in [11] for evaluating the POT behavior produced by the simulation model. We first computed \(z_i(t)\) which represents the average activity of a granule-cell cluster \(i\).

\[
z_i(t) = \frac{1}{\tau} \sum_{s=0}^{t} \exp \left(-\frac{t-s}{\tau}\right) \left(\frac{1}{N_{gr}} \sum_{j} \delta_j(s)\right)
\]

where \(\delta_j(s)\) is the spike event in the granule cell \(j\) in the cluster at time \(s\), \(N_{gr}\) is the number of granule cells in a cluster (100 in this case) and \(\tau\) is the decay time constant, which was set at 8.3 ms.
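The exponentially weighted sum for \(z_i(t)\) can be evaluated recursively, one time step at a time. The following is a minimal sketch assuming 1 ms time steps and a spike raster stored as nested lists; the function name and data layout are chosen here for illustration.

```python
import math

def cluster_activity(spikes, tau=8.3):
    """Return z_i(t) for one granule-cell cluster.

    spikes[t][j] is 1 if granule cell j of the cluster fired at step t, else 0.
    Time steps of 1 ms are assumed, so tau is expressed in steps.
    """
    decay = math.exp(-1.0 / tau)
    z, trace = 0.0, []
    for row in spikes:
        mean_spikes = sum(row) / len(row)   # (1/N_gr) * sum_j delta_j(s)
        z = decay * z + mean_spikes / tau   # recursive form of the weighted sum
        trace.append(z)
    return trace
```

The recursion is equivalent to the explicit sum because each earlier contribution is multiplied by \(\exp(-1/\tau)\) once per elapsed step.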

How the activity patterns of granule-cell clusters evolve over time is evaluated using the similarity index, \(S(\Delta t)\). We first computed the autocorrelation of the activity pattern between time \(t\) and \(t + \Delta t\) as follows:

\[
C(t, t + \Delta t) = \frac{\sum_i z_i(t)z_i(t + \Delta t)}{\sqrt{\sum_i z_i(t)^2} \sqrt{\sum_i z_i(t + \Delta t)^2}}.
\]

\(C(t, t + \Delta t)\) takes a value between 0 and 1 since \(z_i(t)\) is always non-negative. It is 1 if the activity pattern vectors \(z_i(t)\) and \(z_i(t + \Delta t)\) are identical, and 0 when they are orthogonal, indicating that the activity patterns have no overlap. The similarity index is then computed as the time average of (5) over the CS duration, \(T\), as follows:

\[
S(\Delta t) = \frac{1}{T} \sum_{t=0}^{T} C(t, t + \Delta t).
\]

\(S(\Delta t)\) represents how strongly two activity patterns separated by \(\Delta t\) are correlated, on average. If the similarity index decreases as \(\Delta t\) increases, the activity pattern evolves over time into uncorrelated patterns.
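The correlation and its time average translate directly into code. The sketch below assumes the cluster activities are available as a list of vectors, one per time step, and averages over all pairs of steps that fit inside the recording; that handling of the window boundary is an assumption made here, as are the function names.

```python
import math

def correlation(a, b):
    """Normalized correlation (cosine similarity) of two activity vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def similarity_index(z, lag):
    """Average correlation between patterns separated by `lag` steps.

    z[t] is the vector of cluster activities z_i(t) at time step t.
    """
    n_pairs = len(z) - lag
    return sum(correlation(z[t], z[t + lag]) for t in range(n_pairs)) / n_pairs
```

For a pattern that alternates between two orthogonal states, for example, the index is 0 at a lag of one step and 1 at a lag of two steps.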

We further computed the reproducibility index \(R(t)\) as follows:

\[
R(t) = \frac{\sum_i z_i^{(1)}(t)z_i^{(2)}(t)}{\sqrt{\sum_i z_i^{(1)}(t)^2} \sqrt{\sum_i z_i^{(2)}(t)^2}}
\]

where \(z_i^{(1)}(t)\) and \(z_i^{(2)}(t)\) are the activity patterns of granule-cell cluster \(i\) at time \(t\) for two different input signals. The reproducibility index quantifies how activity patterns elicited by two different input signals differ from each other over time and serves as a measure for the robustness of the POT representation by the network model.

III. HARDWARE ARCHITECTURE DESIGN

To implement the POT model, we propose a frame-based network on chip (NoC) hardware architecture on FPGA. The conceptual structure is shown in Fig. 3.

In Fig. 3, the left side shows the \(n\) by \(m\) frame-based NoC system, whose size can be adjusted as needed. The architecture consists of three main components: the neural processor, the router, and the global controller. In this work, we implemented a NoC system containing 48 processors, which calculate the neural activities. Each processor implements 2000 granule cells and 20 Golgi cells with a connection ratio of 100:1. The router is used for implementing the inhibitory connections from Golgi cells to granule-cell clusters.

Fig. 3. A conceptual FPGA-based network-on-chip hardware architecture. The left side shows the scalable $n$ by $m$ structure of the frame-based network-on-chip system. It contains $n \times m$ neural processors, $n \times m$ routers and one global controller. This architecture can be scaled up depending on the required model. In this paper, we implemented a network-on-chip system which contains 48 processors. The right side shows the detailed structure of a module. The neural processor calculates the neural activity, with each processor implementing 2000 granule cells and 20 Golgi cells with a connection ratio of 100:1. The router implements the connections from Golgi cells to granule-cell clusters. The interface modules packetize spike events received from the processor, ready for transmission through the network. When the interface modules receive packets, the message is decoded and transmitted to the required cells within the neural processor. Finally, a frame master is proposed to coordinate the neural and communication processing periods.

A. Neural Computing

The neural processor data path is shown in Fig. 4. Two types of neurons are implemented in the processor, the granule cell (GR) and the Golgi cell (GO). Both models use the same hardware architecture but with different parameters. Each granule-cell cluster, containing 100 granule cells, connects to one Golgi cell. The activities (1 or 0) of all 100 granule cells are calculated first, while an accumulator adds them together and, at the 100th clock cycle, sends the summed value to the Golgi cell model as an excitatory input [Fig. 4(a)].
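The accumulate-and-forward behaviour described above can be modelled in software as a running sum that is emitted and cleared every 100 cycles. This is a behavioural sketch only, with names chosen here for illustration; it does not describe the actual register-level design.

```python
def cluster_accumulator(spike_bits, cluster_size=100):
    """Yield the summed granule-cell activity once per `cluster_size` cycles.

    spike_bits is a stream of 0/1 values, one granule cell per clock cycle.
    """
    acc = 0
    for cycle, bit in enumerate(spike_bits, start=1):
        acc += bit
        if cycle % cluster_size == 0:
            yield acc   # value passed to the Golgi cell as excitatory input
            acc = 0
```

For instance, a stream in which the first 100 cells all fire and the next 100 are silent yields the sums 100 and 0.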

Fig. 4(b) details the data path inside the neural model, which takes two computing stages: ion channel computing and integration. Each stage takes 4 clock cycles. Because computation is performed in parallel, the latency in each individual path has to be consistent; therefore appropriate delay blocks (the rectangular blocks) are added as necessary.

Fig. 4(c) and (d) show the sub-component circuits, including the inhibition and excitation circuits and the FIFO-based delay circuits. Since each neural processor implements 2000 granule cells and 20 Golgi cells, a pipelining technique is applied to reduce hardware resources. A long pipeline stage is required for storing the intermediate values of the granule cell calculations, so a first-in first-out (FIFO) based delay circuit is designed to achieve these long computational stages.
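A FIFO-based delay element simply returns each value a fixed number of cycles after it was written. The class below is a software analogy under that assumption; the class name, the depth and the initial fill value are illustrative and not taken from the design.

```python
from collections import deque

class FifoDelay:
    """Delay line: push() returns the value pushed `depth` calls earlier."""

    def __init__(self, depth, fill=0):
        # Pre-fill so that the first `depth` outputs are the fill value.
        self.buf = deque([fill] * depth, maxlen=depth)

    def push(self, value):
        oldest = self.buf[0]
        self.buf.append(value)  # maxlen discards the oldest entry automatically
        return oldest
```

Pushing the values 1 to 5 through a depth-3 delay returns three fill values followed by 1 and 2.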

B. Network-on-Chip

To manage the transmission of action potentials from Golgi cells to granule-cell clusters we have developed a NoC infrastructure. This system allows for arbitrary connectivity between Golgi cells and granule-cell clusters. Each processing element is connected to a router through which the action potentials are communicated. The routers are connected together in a mesh topology [22] as shown in Fig. 5(b).

When a Golgi cell produces an action potential, the interface fetches a list of destination granule-cell clusters from memory, and an individual packet is generated and sent to each of these destinations within the network. The connectivity of the neural network can be updated by adjusting the contents of this memory, which a user can do by injecting configuration packets into the network. This can be done at start-up or, if required, part way through a simulation by halting the system using the global frame master.

The packet format is shown in Table I. Packets are classified by a 2-bit type identifier. The generated spike packet contains the address of the targeted granule-cell cluster (Core and Cluster ID), allowing the routers to direct the packet to the correct processing element. Each granule-cell cluster sums the packets it receives, and this summed value is used as the inhibitory input to the granule-cell cluster. Packets are transmitted between routers using a 4-phase asynchronous protocol [23] and a parallel data bus. The routers are output buffered using a 2-deep FIFO memory element. To allow the state of the model to be inspected, the network-on-chip is also responsible for transmitting information externally. When a Golgi cell produces an action potential, a ‘Golgi Message’ packet is also transmitted to a specialist processing element. This processing element buffers all received packets and transmits them to a PC, which allows a user to review the state of each Golgi cell at any time.
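As an illustration of how such a packet might be encoded, the sketch below packs a 2-bit type identifier together with a core ID and a cluster ID into a single word, and generates one unicast packet per destination. The field widths (6 bits for the core, 5 bits for the cluster), the field order and the numeric type codes are assumptions made here; the actual layout is the one given in Table I.

```python
# Hypothetical field widths and type codes; only the 2-bit type field is
# stated in the text.
TYPE_BITS, CORE_BITS, CLUSTER_BITS = 2, 6, 5
SPIKE, CONFIG, GOLGI_MESSAGE = 0, 1, 2

def pack(ptype, core, cluster):
    """Pack the three fields into one integer word (type in the top bits)."""
    for value, bits in ((ptype, TYPE_BITS), (core, CORE_BITS), (cluster, CLUSTER_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    return (ptype << (CORE_BITS + CLUSTER_BITS)) | (core << CLUSTER_BITS) | cluster

def unpack(word):
    """Inverse of pack(): return (ptype, core, cluster)."""
    cluster = word & ((1 << CLUSTER_BITS) - 1)
    core = (word >> CLUSTER_BITS) & ((1 << CORE_BITS) - 1)
    ptype = word >> (CORE_BITS + CLUSTER_BITS)
    return ptype, core, cluster

def spike_packets(destinations):
    """One unicast spike packet per (core, cluster) destination of a Golgi cell."""
    return [pack(SPIKE, core, cluster) for core, cluster in destinations]
```

Under these assumed widths, 6 bits are enough to address the 48 cores and 5 bits are enough to address the 20 clusters handled by each core.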

C. Frame Master

In order to maintain synchronicity within the system a frame master is used. The master is responsible for ensuring that all packets are transmitted to their destination before the processing elements start to process the next time step. This ensures that the granule-cell clusters receive all their updates within the correct time period.

For example, as shown in Fig. 6, the time required for network communication depends on the load of the network, which is determined by the spiking frequency of the Golgi cells and the network topology. This varies from frame to frame. In each frame, once the first Golgi cell spike event is released (at time $t_1$), the router starts to process the corresponding synaptic packets. After all 20 Golgi cell spike events are computed (at time $t_3$), the processor's duty in frame 1 is finished, and the neural processor needs to start computing the next 20 Golgi cell activities for frame 2. However, at the end of frame 1 (time $t_3$), the network has not yet completed its communication for the current 20 Golgi cells, so extra time is allocated for the network to finish this task before frame 2 begins. As a result, the frame master generates a low-level signal that disables the processor clock from $t_3$ to $t_4$, until the network has completed its routing task. The frame master then re-enables the processor so that it can start computing again (time $t_4$).
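The stall logic of the frame master amounts to extending a frame whenever the network finishes later than the processors. A minimal behavioural sketch, assuming both times are measured in clock cycles from the start of the frame (the function and argument names are chosen here for illustration):

```python
def frame_schedule(compute_cycles, network_done_cycles):
    """Return (stall_cycles, frame_length) for each frame.

    compute_cycles: cycles the processors need per frame (fixed).
    network_done_cycles: per-frame cycle at which the last packet is delivered.
    """
    schedule = []
    for net_done in network_done_cycles:
        # The processor clock is gated only if the network is still busy.
        stall = max(0, net_done - compute_cycles)
        schedule.append((stall, compute_cycles + stall))
    return schedule
```

A frame whose communication finishes within the compute time incurs no stall, while a frame whose communication overruns is lengthened by exactly the overrun.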

IV. RESULTS

A. Hardware Simulation Results for Passage-of-Time (POT)

Fig. 7 shows a comparison of the membrane potential of a single granule (1) neuron model simulated by the FPGA neural processor and by software (implemented in C with a floating-point data type). A 40-bit fixed-point format with 22 fractional bits is employed in this FPGA system; this word length was selected to guarantee that each operation has sufficient precision to avoid data overflows and mismatch. The same inputs (a 30 Hz Poisson spike train) were given to both simulations.

The two simulations produce essentially identical results, with very minor differences due to hardware truncation errors. This validates the hardware implementation of the neural model. Increasing the word length can further reduce truncation errors, but at the cost of additional hardware resources.
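The effect of the chosen word length can be explored with a small fixed-point model. The sketch below assumes a signed two's-complement representation and truncation toward negative infinity; both are assumptions made for illustration, since the text only specifies 40 bits with 22 fractional bits.

```python
import math

WORD_BITS = 40   # total word length stated in the text
FRAC_BITS = 22   # fractional bits stated in the text

def to_fixed(x):
    """Quantize a real value to the fixed-point grid by truncation."""
    raw = math.floor(x * (1 << FRAC_BITS))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    if not lo <= raw <= hi:
        raise OverflowError("value does not fit in the assumed signed format")
    return raw

def from_fixed(raw):
    """Convert a raw fixed-point integer back to a float."""
    return raw / (1 << FRAC_BITS)

def fixed_mul(a, b):
    """Multiply two raw values and truncate back to FRAC_BITS fractional bits.

    No overflow check is performed on the result in this sketch.
    """
    return (a * b) >> FRAC_BITS
```

Under these assumptions the resolution is \(2^{-22}\) (about \(2.4 \times 10^{-7}\)) and the representable range is roughly \(\pm 2^{17}\), which gives a feel for the size of the truncation error per operation.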

We then investigated the simulation results of the complete network model. The hardware POT simulation results are summarized in Fig. 8. Poisson spikes were fed into the simulated network to represent CS inputs through MFs. Each MF of the simulated network was first fed with 5-Hz Poisson spikes for 300 ms to bring the network to a steady state; then 30-Hz Poisson spikes, preceded by 5 ms of 200-Hz spikes, were given to excite the network.
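The stimulus protocol above can be sketched as a piecewise-constant firing rate sampled with a Bernoulli approximation of a Poisson process. The 1 ms time step, the seed handling and the function names are assumptions made for this illustration.

```python
import random

def rate_hz(t_ms):
    """Piecewise MF firing rate: 5 Hz baseline, 5 ms at 200 Hz, then 30 Hz."""
    if t_ms < 300:
        return 5.0
    if t_ms < 305:
        return 200.0
    return 30.0

def mossy_fiber_train(total_ms, seed=0, dt_ms=1.0):
    """Return a 0/1 spike train for one mossy fiber, one entry per time step.

    Each step fires with probability rate * dt, which approximates a Poisson
    process when that probability is small.
    """
    rng = random.Random(seed)
    train = []
    for step in range(int(total_ms / dt_ms)):
        p_spike = rate_hz(step * dt_ms) * dt_ms / 1000.0
        train.append(1 if rng.random() < p_spike else 0)
    return train
```

Using a different seed per mossy fiber gives independent input trains that share the same rate profile.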

Fig. 8(a) shows the spike patterns of 40 granule cells randomly chosen from different granule-cell clusters. These granule cells show different temporal activity patterns. Specifically, they show a random repetition of transitions between bursting and silent states. These bursts are sustained for tens of milliseconds or longer.


B. Effects of Blocking NMDA Channel on POT Representation

To further verify our hardware simulation results, we also investigated the effect of blocking NMDA channels, which play a critical role in delayed eyeblink conditioning [24], in our simulations. The hardware and software simulation results are summarized in Fig. 8(d)–(f). When NMDA channels are blocked in either granule cells or Golgi cells, granule cells lose the temporal structure in their firing; instead, they fire spikes in a rather regular manner [Fig. 8(d)]. The similarity index becomes flat except for $\Delta t$ smaller than $\sim 30$ ms. Within a time scale of 30 ms, there is a very limited number of spikes with which to encode a robust temporal structure for POT. Moreover, 30 ms is too short for physiologically relevant POT in a classic delayed eyeblink conditioning experiment. Hence, the granule-cell firing pattern after NMDA-R blockade cannot capture a temporal structure at a time scale of physiological relevance. The disruption of POT encoding following NMDA channel blockade is reflected by both the software [Fig. 8(e)] and the hardware simulation [Fig. 8(f)]. The results (both software and hardware) are consistent with those presented in [11].

C. Network on Chip Performance

To investigate the performance of the NoC infrastructure, we replaced the processing elements with configurable random packet generators. Packets were then injected into the network at a defined rate, and the latency and throughput of these packets were analyzed. The results for a 48-core system involving 960 Golgi cells, identical to the networks described in the sections above, are shown in Fig. 10. As the mean firing rate of the Golgi cells increases, the median and range of the latencies increase slightly. However, all packets are transmitted in under 100 processor clock cycles, which is the time it takes to update the state of a single Golgi cell. No packets are lost at any of the measured input frequencies, indicating that the frame master should not be required except when the Golgi firing rate exceeds expectations or when the user intervenes. Within typical cerebellar systems, the Golgi cells fire at a rate of between 40 and 60 Hz, which is within the defined network performance characteristics.

D. FPGA-Based Granular Layer for Neural-Rehabilitation

A hypothetical in vivo experimental setup for closed-loop prosthetic application using our FPGA granular layer system is illustrated in Fig. 9(a). Biological neuronal spike signals would be recorded from the pontine nucleus or from the mossy fibers using a multichannel neural recording system, and would then be used as inputs to the silicon granular layer model. These neuronal spikes would be processed by the silicon granular layer, which would then generate appropriately timed discrete output spikes to trigger the stimulation to be injected into the animal. Fig. 9(b) shows an electronic system setup to demonstrate such an experiment. A Virtex-5 board is employed to simulate the neural spike inputs conveyed by MFs, which are delivered to the FPGA cerebellum model via four single-bit wires. The discrete input spikes are modeled as two 5 Hz and two 30 Hz Poisson spike trains in a 4-bit signal. The proposed silicon granular layer is implemented on the Virtex-7 board, with the I/O interface used for displaying the system output on the oscilloscope in real time [Fig. 9(c)]. The displayed GR spikes were taken from three neural processors. The frame-based signal, which is used to monitor and verify the system processing behavior, is also shown. When the task of each frame is finished, the frame-based signal changes to a high level, and each frame uses 25.6 µs (the time between X1 and X2) to mimic 1 ms of real-world activity. Hence, this setup can complete 1 s of real-world activity in 25.6 ms at full speed, as shown in Fig. 11. The system specifications are summarized in Table II.
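As a quick check of the timing figures quoted above, and assuming one frame per 1 ms of simulated time (so that 1 s corresponds to 1000 frames), the wall-clock time and the resulting speed-up over real time work out as follows. The symbols \(T_{\text{frame}}\), \(N_{\text{frames}}\) and \(T_{\text{wall}}\) are introduced here only for illustration.

\[
T_{\text{wall}} = N_{\text{frames}} \, T_{\text{frame}} = 1000 \times 25.6\ \mu\text{s} = 25.6\ \text{ms}, \qquad \frac{1\ \text{s}}{25.6\ \text{ms}} \approx 39
\]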

V. DISCUSSION

A. Scalability of Different Platforms

In Fig. 12 we compare the performance of our design with three alternative approaches previously proposed for implementing spiking neural networks (CPU, GPU and multi-core bus). In addition to its higher computational speed, our FPGA-based NoC approach scales clearly better than the other approaches: the computation time remains almost constant even if the network size increases by an order of magnitude.

An alternative is to use GPU processors, which can supplement or even replace CPUs for parallelizable code. The rise of GPU languages such as CUDA and OpenCL has simplified their use enormously. Modern GPUs exceed 5000 cores and can increase processing speed by orders of magnitude for parallelizable tasks [25], [26], [12]. Additionally, GPUs offer extremely high raw memory bandwidth, though this is difficult to achieve in practice and requires adhering to strict memory access patterns [25]. Nevertheless, with sufficient power, it is possible to implement spiking neural networks for high-speed computation on a massively multi-core GPU. However, desktop systems have relatively large power consumption and are not scalable to prosthetic devices. Mobile GPU systems found in typical mobile phones are significantly more power efficient, but have fewer cores, and shifting incoming and outgoing data via the CPU would significantly reduce their effectiveness. We therefore chose an FPGA platform with large numbers of I/Os for potential in vitro and in vivo operation.

Fig. 9. The overall system experimental setup. (a) The closed-loop prosthetic system: the hypothetical in vivo closed-loop experimental setup for cerebellum rehabilitation. (b) The electronic experimental setup, which demonstrates the feasibility of the in vivo experiment. A Virtex-5 board (brain simulator) is employed to simulate the biological spikes conveyed by MFs, which are delivered to the FPGA cerebellum model via four single-bit wires. The discrete input spikes are modeled as two 5 Hz and two 30 Hz Poisson spike trains in a 4-bit signal. The proposed silicon granular layer is implemented on the Virtex-7 board, with the I/O interface used for displaying the system output on the oscilloscope in real time. (c) The real-time input/output discrete spikes and the frame-based signal. When interfaced with an animal in a behavioral experiment, the output of the FPGA could be linked to stimulators for delivering timed electrical pulse stimulation to the brain. Alternatively, the outputs can be linked to model Purkinje cells, which are then linked to the stimulators.

One key difference between our FPGA platform and processor based implementations is that we utilize distributed, localized memory banks that avoid sharing of global memory resources. This avoids delays associated with accessing global memory and reduces power consumption by minimizing the size and operating frequency of channels between processors and memory.

The memory usage grows linearly with the number of cells. The major memory consumption is the storage of connectivity information between Golgi cells and granule-cell clusters. On average, each Golgi cell is connected to 8 granule-cell clusters, so the memory requires only 8 words per Golgi cell to store this connectivity information. For each additional Golgi cell, another 8 words are required.
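The linear growth can be written out explicitly. Taking \(N_{Go}\) as the number of Golgi cells and \(M_{\text{conn}}\) as the connectivity storage in words (both symbols are introduced here for illustration), and using the average fan-out of 8 stated above, the 48-core system with 960 Golgi cells would need roughly:

\[
M_{\text{conn}} \approx 8\, N_{Go} = 8 \times 960 = 7680\ \text{words}
\]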

A further difference from previous work is the use of frame-based encoding. One issue with real-time NoC systems is that spiking information encoded in latency or frequency can be prone to distortion due to congestion [27]–[29]. In contrast, we utilize a stop-start approach whereby all the neural spikes are processed and the processors are then stopped, whenever necessary, to allow full transmission around the network. This is akin to biology, whereby synaptic transmission, dendritic signal integration and action potential initiation can take time, but transmission itself is very fast [30]. In addition to low distortion, this approach also allows us to compare computational models easily: we can simply extract a specific frame $N$ of the simulations in all cases for detailed comparison.

An alternative digital implementation to a NoC is perhaps a bus between processing cores. However, increases in firing frequency will lead to distortion of the information, which will limit the system performance. Alternatively, some of these effects can be alleviated using traffic management via hierarchical address-event representation (AER) architectures [31].

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Fig. 10. Network-on-chip performance: latency of packet transmission against the rate of packet injection into the network. The packet injection rate has only a minor impact on latency within the expected range of operation (40–60 Hz). The data plotted in red show box plots of the latency of packets through the network for varying degrees of background traffic.

Fig. 11. Comparison of the computation time of CPU, GPU and FPGA for simulating 1 s of activity. The CPU and GPU results are cited from previous work [12].

TABLE II
FPGA-BASED GRANULAR LAYER SPECIFICATIONS

<table>
<thead>
<tr>
<th>Timing</th>
<th>Max. clock frequency</th>
<th>Minimum period</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>121.945 MHz</td>
<td>8.2 ns</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Hardware resource utilization</th>
<th>Processor</th>
<th>Router</th>
<th>Module</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Slice registers</td>
<td>2884</td>
<td>792</td>
<td>3676</td>
<td>176424 (29%)</td>
</tr>
<tr>
<td>Slice LUTs</td>
<td>4379</td>
<td>1213</td>
<td>5592</td>
<td>268455 (88%)</td>
</tr>
<tr>
<td>Block RAM/FIFO</td>
<td>20</td>
<td>0</td>
<td>20</td>
<td>960 (93%)</td>
</tr>
<tr>
<td>DSP48E1s</td>
<td>48</td>
<td>0</td>
<td>48</td>
<td>2304 (82%)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Power consumption</th>
<th>Processor</th>
<th>Router</th>
<th>Module</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dynamic power</td>
<td>-</td>
<td>-</td>
<td>60 mW</td>
<td>2.88 W</td>
</tr>
</tbody>
</table>

Fig. 12. Scalability of four different approaches. The dotted lines represent our estimation of system performances, whereas solid lines represent our measurements. The FPGA-based NoC computation time remains constant due to its parallel nature and the efficient communication system.

Using a NoC infrastructure as opposed to a bus also reduces power consumption within the design, as it allows for a much reduced clock frequency. Using Xilinx XPower Analyzer, we estimate that when implemented on a Virtex-7 VC707 Evaluation Kit (XC7VX485T-2FFG1761C), each module, containing a processor, router and interface, consumes 60 mW of dynamic power, equating to a total dynamic power consumption of 2.88 W for the 48 modules when running at full speed.

B. Comparison With Other Hardware Implementation Techniques

There are several possible alternative techniques to our frame-based network-on-chip architecture. To date, SpiNNaker [32], Neurogrid [33] and IBM SyNAPSE [34] are projects that build custom chips or systems for efficient large-scale simulation of general neural network models. These systems are powerful and innovative; however, they may not be optimal for the system that we are implementing in this paper. Based on the NoC analysis, it was found that, with the unique network connectivity of the granular layer model, unicasting was more efficient in terms of memory resources by a factor of approximately two. The bandwidth for both approaches remained approximately equal. The results of this investigation, alongside the reduced complexity of configuration with unicasting, made it the preferred choice. Neurogrid employs a smart approach that combines analogue circuits for mimicking neural processes with digital circuits for implementing routing components. It can potentially save a significant amount of energy. However, dimensionless models based on analogue circuits are not ideal for mapping the conductance-based leaky integrate-and-fire neurons in the POT model. IFAT [35] is also a well-established platform for real-time operation of brain networks, but its analogue integrate-and-fire array may not provide good scalability.

We are seeking to further optimize our system and to use it for other applications. Cassidy et al. [36], [37] proposed a neuro-array architecture for general large-scale neuromorphic systems, with corresponding analysis. Their design principles, including the external SRAM technique, can provide new insight for optimizing our system. Applying our proposed silicon granular layer to pattern recognition would be another application, similar to that of the new IBM chip TrueNorth [34].

In fact, our proposed frame-based network-on-chip architecture is general for spiking neural networks, although implementing other models requires modifying the components appropriately for the target model. For instance, in this work the routing components (transmitter, router and receiver) are custom designed for implementing the recurrent random network connections of the POT model, and the neural processor architecture is specifically designed for mapping the connections from granule cells to Golgi cells. Further system tweaking will be required to optimize the performance for a different target model.

C. Neuro-Prosthesis Applications

For translation into neuroprosthesis, our architecture lends itself readily to electrical [38] or optical [39], [40] stimulation methodologies. The FPGA-based granular model can robustly predict POT responses and can thus be used to interface with in vivo and in vitro experiments. Furthermore, it is straightforward to translate generated spikes directly to tissue, as each will be encoded with a destination address.
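To illustrate the last point, the sketch below routes address-tagged spike events to stimulation channels. The event layout, the `channel_map` lookup table, and all names are hypothetical and are not taken from the hardware design; the sketch only shows that, once each spike carries a destination address, the spike-to-electrode (or spike-to-emitter) mapping reduces to a table lookup.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SpikeEvent:
    """Hypothetical address-tagged spike: when it fired and where it goes."""
    time_ms: float
    dest_address: int


def route_to_channels(events: Iterable[SpikeEvent],
                      channel_map: Dict[int, int]) -> List[Tuple[float, int]]:
    """Translate spikes into (time, stimulation channel) pairs.

    Events whose destination address has no mapped channel are dropped.
    The result is sorted by time, then by channel.
    """
    pulses = []
    for ev in events:
        channel = channel_map.get(ev.dest_address)
        if channel is not None:
            pulses.append((ev.time_ms, channel))
    return sorted(pulses)


# Usage with made-up addresses and channels.
channel_map = {0x10: 0, 0x11: 1}
events = [SpikeEvent(2.0, 0x11), SpikeEvent(1.0, 0x10), SpikeEvent(1.5, 0x7F)]
pulses = route_to_channels(events, channel_map)
print(pulses)
```

In this example the spike addressed to `0x7F` has no mapped channel and is discarded, while the other two are delivered in time order.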

For long-term neuro-prosthesis experiments, this design can be translated directly to an ASIC platform in order to increase portability and to reduce power consumption.

VI. Conclusion

The goal of this work has been to implement a real-time cerebellar granular layer model on an FPGA hardware platform using a NoC hardware architecture. Our design achieves faster-than-real-time operation for a system of 1000 Golgi cells and 100,000 granule cells on a single FPGA board. This is achieved through an efficient implementation of the mathematical models of the neurons and the use of a frame-based architecture, which eliminates congestion distortion of spike timing in multiplexed networks. Our design is also highly scalable: computation time remains almost unchanged for a much larger network model.

The major contributions of this paper are summarized as follows: 1) an efficient FPGA-based NoC hardware architecture is proposed for implementing a large-scale cerebellar granular-Golgi layer model for POT encoding; 2) our implementation is computationally efficient, completing a 1 s simulation in 25.6 ms, and the FPGA provides precise timing control; together, these allow our design to be readily adapted for real-time closed-loop in vitro or in vivo experiments; 3) our NoC architecture is highly scalable, so it is now possible to simulate the full-scale granular layer with a cell density of 1 million cells/mm², as in the real brain, which is 10 times the size of the current model; such simulation power can open up new possibilities for understanding the dynamics of the cerebellar network; and 4) our design can be a potential neuroprosthetic tool for future experimental and clinical applications owing to its high computational power, flexibility, high scalability and power efficiency.
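The timing claim in contribution 2) can be restated as a speed-up over real time. The short calculation below uses only the two figures quoted above; the 1 ms step of biological time is an illustrative assumption, chosen to show how much wall-clock time each such step would take at the reported rate.

```python
# Speed-up over real time implied by the reported timing figures.
SIMULATED_MS = 1000.0   # 1 s of biological time
WALL_CLOCK_MS = 25.6    # reported time needed to compute it

speedup = SIMULATED_MS / WALL_CLOCK_MS

# Wall-clock time needed to advance the model by one step of biological time.
STEP_MS = 1.0           # assumed step size, for illustration only
per_step_us = STEP_MS / speedup * 1000.0

print(f"{speedup:.2f}x real time, {per_step_us:.1f} us per {STEP_MS} ms step")
```

This gives a speed-up of roughly 39 times real time, leaving most of each real-time step available for interfacing in a closed-loop setting.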

REFERENCES


Chung Tin received the B.Eng. degree in mechanical engineering from the University of Hong Kong, Pokfulam, Hong Kong, in 2002, and the M.Sc. and Ph.D. degrees in mechanical engineering from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2004 and 2011 respectively.

Since 2012, he has been an Assistant Professor in the Department of Mechanical and Biomedical Engineering, City University of Hong Kong, Kowloon Tong, Hong Kong. His research interests include computational neuroscience, sensorimotor learning, and neuroprosthetic systems.

Dr. Tin was the recipient of the Croucher Foundation Scholarship (Hong Kong), American Heart Association Predoctoral Fellowship, and Early Career Award (Research Grant Council, Hong Kong).

Patrick Degenaar received the Bachelor’s (1st class) and M.Res. degrees in applied physics from Liverpool University, Merseyside, U.K., and the Ph.D. degree in bioelectronics from the Japan Advanced Institute of Science and Technology, Nomi, Japan.

Currently, he is a Reader in biomedical engineering at The School of Electrical and Electronic Engineering, Newcastle University, Newcastle upon Tyne, U.K. Until 2010, he held a senior lectureship at Imperial College London, London, U.K., where he also held an RCUK fellowship. His core interests lie in neuroprosthetics and bringing devices to clinical practice. He previously led the FP7 OptoNeuro consortium and is now the engineering lead on the CANDO project (http://www.cando.ac.uk) to bring next-generation optogenetic/optoelectronic implants for epilepsy to clinical practice.