Multi-standard Reconfigurable Motion Estimation Processor for Hybrid Video Codecs

Jose L. Nunez-Yanez, Trevor Spiteri, George Vafiadis

Abstract — This work presents a programmable and configurable motion estimation processor capable of performing motion

estimation across several state-of-the-art video codecs that include multiple tools to improve the accuracy of the calculated motion

vectors. The core can be programmed using a C-style syntax optimized to implement arbitrary block matching algorithms and

configured with different execution units depending on the selected codec, the available inter-coding options and required

performance. This flexibility means that the core can support the latest video codecs such as H.264, VC-1 and AVS at high-

definition resolutions and frame rates. The configuration and programming phases are supported by an integrated development

environment that includes a compiler and profiling tools enabling a designer without specific hardware knowledge to optimize the

microarchitecture for the selected codec standard and motion search technique leading to a highly efficient implementation.

Index Terms — Video coding, motion estimation, reconfigurable processor, H.264, VC-1, AVS, FPGA.

I. INTRODUCTION

The emergence of new advanced coding standards such as H.264 [1], VC-1 [2] and AVS [3] with their multiple coding tools

have introduced new complexity challenges during the motion estimation (ME) process used in inter-frame prediction. While

previous standards such as MPEG-2 could only vary a few options, H.264, VC-1 and AVS add the freedom of using quarter

pixel resolutions, multiple reference frames, multiple partition sizes and rate-distortion optimization as tools to optimize the

inter-prediction process. The potential complexity introduced by these tools operating on large reference area sizes makes

the full-search approach, which exhaustively tries each possible combination, less attractive from a performance and power

point of view. A flexible, reconfigurable and programmable motion estimation processor such as the one proposed in this

work is well poised to address these challenges by fitting the core microarchitecture to the inter-frame prediction tools and

algorithm for the selected encoding configuration. The concept was briefly introduced in [4] and it is further developed in

this work. The paper is organized as follows. Section II reviews significant work in the field of hardware architectures for

motion estimation. Section III establishes the need for architectures able to support fast motion estimation algorithms in

order to deliver high quality results in a power/area/time-constrained video processing platform. Section IV describes the

processor microarchitecture details while Section V presents the tools developed to program the core and explore the

software/hardware design space for advanced motion estimation. Section VI presents the multistandard hardware extensions

targeting VC-1 and AVS codecs. Finally section VII analyses and compares the complexity/performance of the proposed

solution and section VIII concludes the paper.


II. MOTION ESTIMATION HARDWARE REVIEW

A. Full Search ME Hardware

There are numerous examples of full-search motion hardware in the literature, and in this section we will review a few

relevant examples. Full-search algorithms have been the preferred option for hardware implementations due to their regular

dataflow, which makes them well suited to architectures using one-dimensional (1-D) or two-dimensional (2-D) systolic

array principles with simple control and high hardware utilization. This approach generally avoids global routing, resulting

in high clock frequencies. Practical implementations, however, need to consider the interfacing with the memories in which

the frame data resides. This can result in large memory data widths and port counts or a large number of registers needed to

buffer the pixel data. Data broadcasting techniques can be used to reduce the need for large memory bit widths, although this

can reduce the achievable clock frequency. These ideas are developed in the work presented in [5] which uses a broadcasting

technique to propagate partial sum of absolute differences (SAD) values and merges the partial results to obtain different

block sizes in parallel.
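The idea of merging partial SADs so that all block sizes are obtained in parallel can be illustrated with a short C reference model; the grid layout and function names below are illustrative only, not the architecture of [5].

#include <stdint.h>
#include <stdlib.h>

/* Reference model of partial-SAD reuse: the 16 SADs of the 4x4 sub-blocks of
 * a macroblock are computed once and then merged to obtain the costs of the
 * larger partitions.  Layout and names are illustrative assumptions. */
void merge_sads(const uint8_t cur[16][16], const uint8_t ref[16][16],
                uint32_t sad8x8[4], uint32_t *sad16x16)
{
    uint32_t sad4x4[4][4] = {{0}};

    /* Partial SADs of the sixteen 4x4 sub-blocks. */
    for (int by = 0; by < 4; by++)
        for (int bx = 0; bx < 4; bx++)
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++) {
                    int d = cur[by*4 + y][bx*4 + x] - ref[by*4 + y][bx*4 + x];
                    sad4x4[by][bx] += (uint32_t)abs(d);
                }

    /* Merge: each 8x8 cost is the sum of four 4x4 costs, and the 16x16 cost
     * is the sum of the four 8x8 costs; 8x16 and 16x8 merge similarly. */
    *sad16x16 = 0;
    for (int q = 0; q < 4; q++) {
        int by = (q / 2) * 2, bx = (q % 2) * 2;
        sad8x8[q] = sad4x4[by][bx] + sad4x4[by][bx+1] +
                    sad4x4[by+1][bx] + sad4x4[by+1][bx+1];
        *sad16x16 += sad8x8[q];
    }
}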

An improvement on this concept is shown in [6] which develops a 2-D SAD Tree architecture which operates on one

reference location at a time using a four-row array of registers for reference data, thereby removing the need for broadcasting

and also allowing a snake-scan search order which further increases data-reuse. A different approach to using an adder tree is

proposed in [7], which adds variable block size support to a 1-D systolic array by using each processing element (PE) to

accumulate the SAD for each of the 41 motion vectors required, with a shuffling technique. The authors report a latency of

4496 clock cycles to complete the full search on a 16×16 search area with 16 PEs working in parallel. The throughput in this

case is 13 frames per second (fps) in QCIF video resolution. A similar approach based on a 2-D architecture is presented in

[8]. The architecture includes a total of 256 PEs grouped into 16 4×4 arrays that can complete the matching of a candidate

macroblock in every clock cycle. The implementation reports results based on a search area of 32×32 pixels. The whole

computation takes around 1100 clock cycles to complete with a complexity of 23K logic elements (LUTs) implemented in

an Altera Excalibur EPXA10. An FPGA working frequency of 12.5 MHz is reported although the device works at 285 MHz

when targeting a TSMC 0.13 μm standard cell library.

Importantly, with these SAD reuse strategies, it can be seen that full-search implementations have a clear advantage in that

extending them to support variable block sizes requires only a small increase in gate count over their conventional fixed-

block counterparts, and has little or no bearing on throughput, critical path, or memory bandwidth. On the other hand, full

search invariably implies a large number of SAD operations, and even for reduced search areas, a number of optimizations

have been developed to make it more computationally tractable. For example, the work presented in [9] uses a most-

significant-bit first bit-serial architecture instead of a systolic implementation. This enables early termination when the SAD

of a particular motion vector candidate becomes larger than the current winner during the SAD computation. One of the


challenges facing the full-search approach in hardware is that throughput is determined by the search range which is

generally kept small at 16×16 pixels to avoid large increases in hardware complexity. Intuitively it is reasonable to expect

that the search ranges should be larger for high-definition video formats (the same object moves more pixels in high

resolution than in a lower resolution screen) and the increase in search range will have a large impact on complexity or

throughput since all the pixels must be processed, limiting the scalability of full search hardware. For example, the work

presented in [10] considers a relatively large search range of 63×48 pixels in their integer full-search architecture, which can

vary the number of pixel processing units. A configuration using 16 pixel processing units can support 62 fps of 1920×1080

video resolution clocking at 200 MHz, but it needs around 154K LUTs implemented in a Virtex5 XCV5LX330 device.

From this short review it can be concluded that full-search hardware architectures focus on integer-pel search while

fractional-pel search is not investigated since it is considered to take only a fraction of the time of the integer search. Also,

rate distortion optimization (RDO) using Lagrangian multipliers, which adds the cost of the motion vector to the SAD cost, is
generally ignored although it can have a large impact on coding efficiency of around 10% as shown in later sections. One

of the difficulties of adding RDO to the previous architectures is that all the motion vector costs need to be calculated in

parallel to avoid becoming a bottleneck, and the additional hardware needed to support these parallel computations with

multipliers involved will increase the complexity considerably.
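A minimal sketch of the Lagrangian motion vector cost referred to above is given below; the exp-Golomb-style rate model and the lambda parameter are placeholders, since the exact rate model is encoder-dependent.

#include <stdint.h>

/* Hedged sketch of rate-constrained motion vector selection:
 * J = SAD + lambda * R(mv - predicted_mv).  The rate model below (an
 * exp-Golomb-like code length) and lambda are illustrative only. */
static unsigned mv_bits(int d)
{
    unsigned v = (unsigned)(d <= 0 ? -2 * d : 2 * d - 1) + 1; /* signed->unsigned map */
    unsigned bits = 0;
    while (v >>= 1)
        bits++;
    return 2 * bits + 1;                 /* exp-Golomb code length */
}

uint32_t rd_cost(uint32_t sad, int mvx, int mvy, int pmvx, int pmvy,
                 uint32_t lambda)
{
    return sad + lambda * (mv_bits(mvx - pmvx) + mv_bits(mvy - pmvy));
}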

B. Fast Search ME Hardware

Architectures for fast ME algorithms have also been proposed [11]. The challenges the designer faces in this case include

unpredictable data flow, irregular memory access, low hardware utilization and sequential processing. Fast ME approaches

use a number of techniques to reduce the number of search positions, and this inevitably affects the regularity of the data

flow, eliminating one of the key advantages that systolic arrays have: their inherent ability to exploit data locality for reuse.

This is evident in the work done in [12] that compares a number of fast-motion algorithms mapped onto a systolic array and

discovers that the memory bandwidth required does not scale at anywhere near the same rate as the gate count.

A number of architectures have been proposed which follow the programmable approach by offering the flexibility of not

having to define the algorithm at design time. The application specific instruction-set processor (ASIP) presented in [13]

uses a specialized data path and a minimum instruction set similar to our own work. The instruction set consists of only eight

instructions operating on a RISC-like, register-register architecture designed for low-power devices. There is the flexibility

to execute any arbitrary block matching algorithm, and the basic SAD16 instruction computes the difference between two

sets of 16 pixels and in the proposed microarchitecture takes 16 clock cycles to complete using a single eight-bit SAD unit.

The implementation using a standard cell 0.13 μm ASIC technology shows that this processor enables real-time motion

estimation for QCIF, operating at just 12.5 MHz to achieve low power consumption. An FPGA implementation using a

Virtex-II Pro device is also presented with a complexity of 2052 slices and a clock of 67 MHz. In this work, scaling can be

achieved by varying the width of the SADU (ALU equivalent for calculating SADs), but due to its design, the maximum


parallelism achievable corresponds to computing the SAD for an entire 16-pixel row in a single clock cycle,
in a 128-bit SIMD manner.

The programmable concept is taken a step further in [14]. This ME core is also oriented to fast motion estimation

implementation and supports sub-pixel interpolation and variable block sizes. The interpolation is done on-demand using a

simplified 4-tap non-standard filter for the half-pel interpolation, which could cause a mismatch between the coder output

and a standard-compliant decoder. The core uses a technique to terminate the calculation of the macroblock SAD when this

value is larger than some previously calculated SAD, but it does not include a Lagrangian-based RDO technique to

effectively add the cost of coding the motion vector to the selection process. Scalability is limited since only one functional

unit is available, although a number of configuration options are available to match the architecture to the motion algorithm

such as algorithm-specific instructions. The SAD instruction, comparable to our own pattern instruction, operates on 16
pixel pairs simultaneously, and 16 instructions are needed to complete a macroblock search point, taking up to 20 clock

cycles. The processor uses 2127 slices in an implementation targeting a Virtex-II device with a maximum clock rate of 50

MHz. This implementation can sustain processing of 1024×750 frames at 30 fps.

There are also examples of ME processors in industry as reviewed in [11]. Xilinx has recently developed a processor

capable of supporting high definition 720p at 50 fps, operating at 225 MHz [15] in a Virtex-4 device with a throughput of

200,000 macroblocks per second. This Virtex-4 implementation uses a total of around 3000 LUTs, 30 embedded DSP48

blocks and 19 block-RAMs. The algorithm is fixed and based on a full search of a 4×3 region around 10 user-supplied initial

predictors for a total of 120 candidate positions, chosen from a search area of 112×128 pixels. The core contains a total of 32

SAD engines which continuously compute the cost of the 12 search positions that surround a given motion vector candidate.
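The candidate generation used by this core can be sketched as follows; only the counts (10 predictors, a 4×3 grid, 120 candidates) come from the description above, while the placement of the grid around each predictor is an assumption.

/* Sketch of predictor-centred candidate generation: a 4x3 grid of offsets is
 * placed around each of the 10 user-supplied predictors, giving 120
 * candidates.  The offset placement below is an assumption; only the counts
 * come from the description of [15]. */
typedef struct { int x, y; } mv_t;

int gen_candidates(const mv_t pred[10], mv_t out[120])
{
    int n = 0;
    for (int p = 0; p < 10; p++)
        for (int dy = -1; dy <= 1; dy++)          /* 3 rows */
            for (int dx = -2; dx <= 1; dx++)      /* 4 columns */
                out[n++] = (mv_t){ pred[p].x + dx, pred[p].y + dy };
    return n;  /* 120 candidate positions */
}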

III. THE CASE FOR FAST MOTION ESTIMATION HARDWARE ENGINES IN HIGH-DEFINITION VIDEO CODING

It has been shown in the literature [16] that the motion estimation process in advanced video coding standards can

represent up to 90% of the total complexity, especially when considering features such as multiple reference frames, multiple

partition sizes, large search ranges, multiple vector candidates and fractional-pel resolutions. A thorough evaluation of how

these options affect video quality is available in [16] although limited to QCIF (176×144) and CIF (352×288) formats. These

low-resolution formats are of limited application in current communications and multimedia products which aim at

delivering high quality video; even mobile phones already target resolutions of 800×480 pixels. Some of the conclusions

reached in [16] indicate that the impact of large search sizes on coding efficiency is limited, that most of the coding

efficiency is obtained from the first four block sizes (16×16, 16×8, 8×16, 8×8) and that the gains obtained thanks to RD-

Lagrangian optimization are substantial. Our research has confirmed that RD-Lagrangian optimization is of vital importance,
typically reducing bit rates by around 10% for the same video quality. Disabling this optimization reduces performance, especially for the

exhaustive search. The reason is that the selection of the winner based only on the SAD can introduce motion vectors that do


not follow the real motion with excessive ‘noise’ that hurts the bit rate. The conclusion is that RD-Lagrangian should be

present in any high quality motion estimation hardware processor or algorithm. In subsections A and B we re-examine the

effects of search range/sub-partitions using three high-definition 1920×1080 video sequences: tractor (complex chaotic

motion), pedestrian area (fast simple motion) and sunflower (simple slow motion) obtained from [17] using an open-source

implementation of H.264 [18]. In all these experiments we have kept enabled both the use of predicted motion vector candidates
to initialize the search and RD-Lagrangian optimization, since they consistently improve the quality of

the motion vectors.

A. Search range evaluation

For the search range evaluation we consider the well-known hexagonal search algorithm and the full search algorithm, both

as implemented in [18]. We consider the following search ranges: 8×8, 16×16, 32×32, 64×64, 128×128 and 256×256. The

results for the hexagonal case are shown in Figs. 1, 2 and 3. It is clear that the optimal search range varies depending on the

sequence, increasing from 32×32 for the sunflower sequence to 128×128 for the pedestrian sequence. The analysis with the

full-search algorithm indicates similar results so it is not included. Overall, a 16×16 search range as used in many full-search

hardware implementations is too restrictive. In our hardware we have increased the range to 96×112 (search window of

112×128) as a tradeoff between hardware efficiency and coding quality, as explained in the following sections.

B. Sub-partition analysis

Fig. 4 shows the results of enabling sub-partitions for the pedestrian area sequence for the hexagonal search algorithm.

The results show that the sub-partitions optimization fails to improve performance for the hexagonal search algorithm and

can actually degrade it. Figs. 5 and 6 correspond to the other two sequences and show equivalent results. Similar experiments

conducted with other search algorithms including full-search also confirm this behavior. It is apparent that the cost of coding

the additional motion vectors required by the sub-partitions can make the gains obtained during residual encoding negligible.

Additionally, if Lagrangian optimization is disabled then the effect of sub-partition coding is much more negative since the

costs of these additional motion vectors are being ignored. Sub-partitions are more effective with smaller resolutions (the

same object will move fewer pixels in CIF compared with HD, for example) and it is possible that different Lagrangian

techniques from the one used in x264 could yield a better performance, but careful analysis is needed when these techniques

are being used in high definition video coding. Otherwise they could result in an important computational overhead with no

apparent benefit. This means that one of the main advantages of full-search hardware, namely its ability to calculate in
parallel all the sub-partition costs by combining SAD results of the smaller sub-blocks, could be of limited benefit in high-
resolution video formats. On the other hand, the large search areas required by high definition, as seen in sub-section A, could


be problematic to implement in real time with full-search hardware, resulting in large hardware complexity and energy

consumption.

IV. PROCESSOR MICROARCHITECTURE

Section III has shown that for high-definition sequences large search ranges are required, so complex exhaustive search
algorithms will need very high throughputs to maintain real-time performance. On the other hand, fast motion estimation

algorithms can offer high quality of results by being able to track real motion better. A configurable/programmable

processor such as the one presented in this work can use different configurations depending on the video resolutions being

processed. Other configuration options that the hardware should support include the number of reference frames, fractional-

pel support and sub-partition sizes.

A programmable/configurable hardware solution can be designed to enable the trade-off between complexity and throughput

depending on the number and type of functional or execution units, so for example if a particular application does not benefit

from rate distortion optimization, this can be disabled, reducing the complexity and power consumption of the core.

According to this criterion, a configurable and programmable motion estimation processor has been designed which can

support H.264 motion estimation over a large range of search options and video resolutions, including high definition, using

different programs and hardware configurations. The microarchitecture of a sample hardware configuration using four

integer-pel execution units, one fractional-pel execution unit and one interpolation execution unit is illustrated in Fig. 7. The

main integer-pel pipeline must always be present to generate a valid processor configuration, but the other units are optional,

and are configured at compile time.

A. Integer-pel execution units (IPEU).

The main IPEU is shown in the middle of Fig. 7 and has as principal components: (1) the physical address calculator, which
transforms the motion vector offsets and vector candidates into addresses to the reference memory; (2) the vector alignment
unit, which aligns the two 64-bit words obtained from the reference memory into a single 64-bit word of valid pixels ready to
operate with the data from the current macroblock memory; (3) the SAD logic, which is formed by the SAD operator and
the adder tree to obtain and accumulate the SAD values; and (4) the Motion Vector Decision unit, which decides which SAD and
motion vector candidate are kept as current winners for the next iteration. Each functional unit uses a 64-bit wide word

and a deep pipeline to achieve a high throughput. All the accesses to reference and macroblock memory are done through

64-bit wide data buses and the SAD engine also operates on 64-bit data in parallel. The memory is organized in 64-bit words

and typically all accesses are unaligned, since they refer to macroblocks that start in any position inside this word. By

performing 64-bit read accesses in parallel to two memory blocks, the desired 64 bits inside the two words can be selected

inside the vector alignment unit. The number of integer-pel execution units is configurable from a minimum of one to a


maximum limited only by the available resources in the technology selected for implementation. Each execution unit has its

own copy of the point memory and processes 64 bits of data in parallel with the rest of the execution units. The point

memories are 256×16 bits in size and contain the x and y offsets of the search patterns. For example a typical diamond search

pattern with a radius of 1 will use four positions in the point memory with values [−1,0], [0, −1], [1,0], [0, 1]. Any pattern

can be specified in this way, and multiple instructions specifying the same pattern can point to the same position in the point

memory saving memory resources. Each integer-pel execution unit receives an incremented address for the point memory so

each of them can compute the SAD for a different search point corresponding to the same pattern. This means that the

optimal number of integer-pel execution units for a diamond search pattern is four, and for the hexagon pattern six. Further

optimization to avoid searching duplicated patterns can halve the number of search points for many regular patterns. In

algorithms which combine different search patterns, such as UMH, a compromise can be found to optimize the hardware and

software components. This illustrates the idea that the hardware configuration and the software motion estimation algorithm

can be optimized together to generate different processors depending on the software algorithm to be deployed.
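The use of the point memory and the distribution of pattern points over the execution units described above can be modelled with the following sketch; the 16-bit packing into two signed bytes and the round-robin issue order are assumptions consistent with the description, not the exact hardware encoding.

#include <stdint.h>

/* Sketch of the point memory and its use: each 16-bit entry holds the x and
 * y offsets of one search point (the packing into two signed bytes is an
 * assumption).  With 'eu' execution units, point i of a pattern is issued to
 * unit i % eu, so a pattern of ppp points needs ceil(ppp/eu) issue rounds. */
static uint16_t pack_point(int8_t dx, int8_t dy)
{
    return (uint16_t)(((uint8_t)dy << 8) | (uint8_t)dx);
}

int pattern_rounds(int ppp, int eu)
{
    return (ppp + eu - 1) / eu;          /* issue rounds per pattern check */
}

/* Example: the radius-1 diamond pattern occupies four point-memory entries,
 * so pattern_rounds(4, 4) == 1 and pattern_rounds(6, 4) == 2 (hexagon). */
static const int8_t diamond1[4][2] = { {-1, 0}, {0, -1}, {1, 0}, {0, 1} };

void load_diamond(uint16_t point_mem[256], int base)
{
    for (int i = 0; i < 4; i++)
        point_mem[base + i] = pack_point(diamond1[i][0], diamond1[i][1]);
}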

B. Fractional-pel Execution Unit (FPEU) and Interpolation Execution Unit (IEU).

The engine supports half- and quarter-pel motion estimation thanks to a fractional-pel execution unit (FPEU) and an

interpolation execution unit (IEU) that calculate the half- and quarter-pel values using interpolation filters as defined in the
standard. The FPEU is conceptually similar to the IPEU, although in this case two vector alignment units are required since the

quarter-pel interpolation is done on-demand by averaging the corresponding half-pel reference data obtained from the

original, diagonal, horizontal or vertical memories. A memory block storing the original pixels without interpolation is

included since it is needed for quarter-pel operations and in this way no accesses to the main reference memory are required

so that the integer-pel execution unit can operate in parallel. The number of IEUs is limited to one, but the number of FPEUs

can be configured at compile time. The IEU interpolates the 20×20 pixel area that contains the 16×16 macroblock

corresponding to the winning integer motion vector. The interpolation hardware is cycled three times to calculate first the

horizontal pixels, then the vertical pixels, and finally the diagonal pixels. The IEU calculates the half pels through a six-tap

Wiener filter as defined in the standard. The IEU is formed by a total of eight systolic 1-D interpolation processors with six

processing elements each. The objective is to balance the internal memory bandwidth with the processing power so that in

each cycle, a total of eight valid pixels are presented to one interpolator. The interpolator starts processing these eight pixels

producing one new half-pel sample after each clock cycle. In parallel with the completion of 1-D interpolation of the first

eight-pixel vector, the memory has already been read another seven times, and its output assigned to the other seven

interpolators. The data read during the ninth memory cycle can then be assigned back to the first interpolator, obtaining high

hardware utilization. The horizontally-interpolated area contains enough pixels for the diagonal interpolation to also

complete successfully. A total of 24 rows with 24 bytes each are read. Each interpolator is enabled nine times so that a total

of 72 eight-byte vectors are processed. Due to the effects of filling and emptying the systolic pipeline before the half-pel


samples are available, a total of 141 clock cycles are needed to complete half-pel horizontal interpolation. During this time,

the integer pipeline is stalled, since the memory ports for the reference memory are in use. Once horizontal interpolation

completes, and in parallel with the calculation of the vertical and diagonal half-pel pixels and the fractional-pel motion

estimation, the processing of the next macroblock or partition can start in the integer-pel execution units. Completion of the

vertical and diagonal pixel interpolation takes a further 170 clock cycles, after which the motion estimation using the

fractional pels can start. As mentioned previously, quarter-pel interpolation is done on-the-fly as required, simply by reading the

data from two of the four memories containing the half- and full-pel positions, and averaging according to the H.264

standard. The quarter-pel processing unit uses two vector alignment units to keep up with the stream of pixel data being
read from the half-pel memories. Two aligned 64-bit vectors are generated in parallel and averaging takes place on this data

in a single clock cycle. This maintains the same level of performance as for the integer pipeline. The processor is designed to

execute fractional and integer-pel motion estimation in parallel over different macroblocks so that while fractional-pel

refinement is being performed in macroblock n, the integer-pel parts are already calculating macroblock n+1. In general, for

most algorithms, the number of points searched for the fractional-pel refinement is lower than for the integer-pel

part, so it completes faster even if the additional overhead of calculating the half-pel pixels is taken into account. In any case,

the cycles needed by each part depend on the type of search strategies deployed at the integer-pel and fractional-pel levels

and will be dependent on the algorithm. An optimal solution will balance the two parts to avoid having one part idle while

the other one is busy.
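For reference, the half- and quarter-pel operations that the IEU and FPEU implement in H.264 mode can be summarised by the scalar sketch below; it follows the standard six-tap filter and averaging, but it is not the systolic implementation, and the diagonal half-pel position (which filters unrounded intermediate values in the standard) is omitted.

#include <stdint.h>

/* Scalar reference for the H.264-mode interpolation (not the systolic
 * implementation).  A horizontal or vertical half-pel sample is obtained from
 * six neighbouring integer pixels with the standard (1,-5,20,20,-5,1)/32
 * Wiener filter; the diagonal position is omitted for brevity. */
static uint8_t clip255(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

uint8_t half_pel(uint8_t e, uint8_t f, uint8_t g, uint8_t h, uint8_t i, uint8_t j)
{
    int sum = e - 5 * f + 20 * g + 20 * h - 5 * i + j;
    return clip255((sum + 16) >> 5);
}

/* Quarter-pel samples are the rounded average of the two nearest integer or
 * half-pel samples, which is what the FPEU computes on demand. */
uint8_t quarter_pel(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}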

V. DEVELOPMENT AND ANALYSIS TOOLS

To facilitate access to the hardware without needing specific knowledge of the microarchitecture, a high-level C-like

language called EstimoC and a compiler have been developed. The EstimoC language is aimed at designing a broad range of

block-matching algorithms. The code can be developed and compiled in an integrated development environment (IDE) for

motion estimation. The language contains a preprocessor for macro facilities that include conditional (if) and loop (for,

while, do) statements. The language also has facilities directly related to the motion estimation processor instruction set

(ISA) such as checking the SAD of a pattern consisting of a set of points, and conditional branching depending on which

point from the last pattern check command had the best SAD. The target ISA for the compiler is simple and it is formed by a

total of 8 instructions as illustrated in Fig. 8. There are two arithmetic instructions (opcodes 0, 1), three jump instructions
(opcodes 2, 3, 4) and three compare instructions (opcodes 5, 6, 7).

The compiler converts the program to assembly and then to a program memory file containing instructions and a point

memory file containing patterns that can then be written directly to the firmware memories of the hardware core. Fig. 8

shows an example block-matching algorithm written in EstimoC and excerpts from the target files.


The algorithm is a diamond search pattern executed up to five times for each radius of 8, 4, 2 and 1 pixels, followed by a

small full search at fractional pixel level. The first set of check() and update() commands create the first search pattern,

which consists of five points. Each check() command adds a point to the search pattern being constructed and the update()

command completes the pattern. This pattern is compiled into the instruction at program address 00, which uses the five

points available in the point memory at addresses 00–04. The preprocessor goes through the do while loop three times, with s

taking the values 4, 2 and 1. Each time, a four-point pattern is checked up to five times. The #if (WINID == 0) #break;

syntax ensures that if a pattern search does not improve the motion vector estimate, it is not repeated. The final lines create a

25-point fractional pattern search. Fig. 10 provides two further examples of how the EstimoC language can be used

to implement arbitrary block-matching motion estimation algorithms. The full-search example in Fig.10 corresponds to the

classical exhaustive search algorithm and it is included as a reference. It is not recommended due to its high computing

requirements. Notice that the algorithm has an initial check() command followed by the full-search loop itself. This is done

to be able to test all the motion vector candidates first and then conduct the full-search only around the winner. The hardware

will execute the first instruction once per motion vector candidate available. The compiled machine code shown below the

source code contains only two instructions: one for the first check() and one for the loop with a total of 225 search points

(15×15). Notice that the parallelism is expressed without the need for any specific constructs such as PAR normally used in

hardware extensions of general purpose programming languages. The compiler recognizes all the search points that can be

computed in parallel and combines them in a single instruction with the help of the update command that tells the compiler

to stop looking for further checks to combine in a single instruction. The hardware will execute each instruction in a

different number of iterations depending on the available number of execution units, since each execution unit can process

a check point in parallel.

The UMH example in Fig. 10 contains sections from the source code corresponding to the sophisticated UMH (Uneven
Multi-Hexagon) search used in H.264 reference encoders. The source code starts with the definition of the search patterns

so that they can be called directly using the check command simplifying the program. The UMH algorithm performs

different pattern checks and uses a cost function (SAD or Lagrangian optimized SAD) to decide the type of patterns to

follow. For example, after the first check either a large or small cross can be used depending on the cost function. The COST

keyword is used by the compiler to generate the appropriate compare instructions followed by conditional jump instructions.

Finally, after the integer-pel search completes, a fractional refinement is conducted around the winner using two half-pel

diamond iterations and a final quarter-pel square pattern. The binary for UMH contains a total of 27 instructions and has

not been included for simplicity reasons. These examples illustrate that multiple programming options are available for

block-matching motion estimation algorithms and the difficulties of programming these algorithms without appropriate

compiler and tool support. On the other hand, the simple programming model and ISA of the proposed processor mean that

no overheads are introduced by using the compiler instead of writing assembly or machine code directly.
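As a plain-C reference for the kind of control flow expressed in EstimoC, the sketch below models the coarse-to-fine diamond search described above (radii 8, 4, 2 and 1, up to five iterations per radius, stopping when the centre wins); sad_at() is a hypothetical cost callback, and the code is a simplified model of the Fig. 8 listing, not a translation of it.

/* Simplified reference model (plain C, not EstimoC) of the coarse-to-fine
 * diamond search: radii 8, 4, 2 and 1, up to five iterations per radius,
 * recentring on the winner and stopping early if the centre is still the
 * best point.  sad_at() is a hypothetical callback returning the (optionally
 * Lagrangian-weighted) cost of one candidate. */
typedef unsigned (*cost_fn)(int mvx, int mvy, void *ctx);

void diamond_search(int *mvx, int *mvy, cost_fn sad_at, void *ctx)
{
    static const int radii[] = { 8, 4, 2, 1 };
    unsigned best = sad_at(*mvx, *mvy, ctx);

    for (int r = 0; r < 4; r++) {
        int s = radii[r];
        for (int iter = 0; iter < 5; iter++) {
            const int pts[4][2] = { {-s, 0}, {0, -s}, {s, 0}, {0, s} };
            int win = -1;
            for (int p = 0; p < 4; p++) {
                unsigned c = sad_at(*mvx + pts[p][0], *mvy + pts[p][1], ctx);
                if (c < best) { best = c; win = p; }
            }
            if (win < 0)            /* centre still wins: stop this radius */
                break;
            *mvx += pts[win][0];    /* recentre on the winning point */
            *mvy += pts[win][1];
        }
    }
}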


Once the program has been compiled into machine code, it can be complicated and time consuming to use the actual

motion estimation hardware for performing the analysis and configuration. The tasks required include synthesizing and

implementing a processor with some specific configuration in the FPGA, which will take around one hour per configuration

due to place&route running times. A cycle-accurate simulator can speed up the development cycle significantly by reducing

the number of tasks required to perform the analysis of a particular configuration. Additionally, the designer does not need

access to the actual hardware when using the simulator. A cycle-accurate model of the processor was developed as part of

the toolset. x264, a free software library for encoding H.264, was modified to use the cycle-accurate model when searching

for the motion vectors during motion estimation. The cycle-accurate simulator can be used directly from the developed IDE.

The designer can design a motion estimation algorithm and test it using different processor configuration parameters. The

simulator takes several parameters as inputs. The inputs which determine the processor configuration are: the program and

point memories generated by the Estimo compiler, the number of integer and fractional execution units, the minimum size

for block partitioning, whether to use motion vector cost optimization, and whether to use multiple motion vector candidates.

The simulator takes other options which do not affect the processor configuration itself, which are: the video file to process

and its resolution, the maximum number of frames to process, the quantization parameter (QP), and the number of reference

frames to use. The simulator will then process the video file using the selected search algorithm and processor configuration.

The output of the simulator includes: the bit rate of the compressed video, the peak signal-to-noise ratio (PSNR), the

number of frames that can be processed per second, the number of clock cycles required per macroblock, and the energy

requirements per macroblock. The designer can also generate plots of the results until he or she is satisfied with a particular

configuration and algorithm. At this point, the tools can be used to generate a VHDL configuration file which can be used

together with the rest of the hardware library to implement the motion estimation processor using standard FPGA synthesis

and place&route tools. Fig.11 shows the overall design flow from algorithm conception to processor implementation.
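The simulator inputs enumerated above could be grouped as in the following sketch; the structure and field names are hypothetical and simply restate the options listed in the text.

/* Hypothetical grouping of the simulator inputs; the field names are
 * illustrative, only the list of options comes from the text above. */
typedef struct {
    /* Processor configuration */
    const char *program_mem_file;   /* output of the Estimo compiler   */
    const char *point_mem_file;     /* search-pattern offsets          */
    int         num_ipeu;           /* integer-pel execution units     */
    int         num_fpeu;           /* fractional-pel execution units  */
    int         min_partition;      /* smallest block partition size   */
    int         use_mv_cost;        /* Lagrangian MV cost on/off       */
    int         use_mv_candidates;  /* multiple MV candidates on/off   */
    /* Stimulus (does not affect the processor configuration) */
    const char *video_file;
    int         width, height;
    int         max_frames;
    int         qp;                 /* quantization parameter          */
    int         num_ref_frames;
} sim_config_t;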

VI. MULTI-STANDARD HARDWARE EXTENSIONS

The previous sections have concentrated on the hardware and software support for motion estimation in H.264, arguably the

most popular advanced video coding standard. The two additional codecs considered in this work are

VC-1, the standard based on WMV-9 developed by Microsoft, and AVS, the Chinese video coding standard. Together with

H.264 these two codecs are state-of-the-art standards being deployed across a wide range of applications such as high

definition broadcast television, optical disc storage (Blu-ray) and broadband video streaming over the internet. They are all

transform and block-based codecs and share a lot of similarities. They process pixel groups in blocks of varying sizes in two

main modes: intra-mode in which a frame is compressed independently of other frames, and inter-mode in which temporal

redundancies among frames are exploited using motion estimation and compensation techniques. The transform used is


DCT-based and it is applied either to the pixels themselves during intra coding or to the residuals resulting from the motion

estimation phase during inter coding. The final stage is the entropy coding stage, which is generally based on variable-length

codes (VLC) although H.264 can use arithmetic coding in the form of CABAC. The motion estimation is critical from the

performance and complexity point of view for all of them. Since these standards do not specify the motion estimation

technique deployed, the video engineer is free to develop new algorithms trading speed and quality as long as the generated

output bitstream can be handled by the standard decoders. These features are well-suited to a configurable and

programmable processor such as the one presented in the previous sections, which can exploit the similarities between the

different standards to support all of them with few hardware changes.

A. Motion estimation differences in the three standards.

All three codecs can operate at a maximum resolution of quarter-pel and support block sizes from 16×16 down to 4×4 for

AVS (4×4 available in AVS-M for mobile multimedia applications) and H.264 standards and down to 8×8 for the VC-1

standard. From the motion estimation point of view, the most significant differences lie in how the half-pel and quarter-pel

pixels are calculated. H.264 uses a six-tap filter for the half-pel pixels and simple averaging for the quarter-pel pixels, as

seen in the previous sections. On the other hand VC-1 and AVS use four-tap filters for half- and quarter-pel interpolation

with different coefficients. This is summarized in Table 1 that also shows the dividing factors used to scale the result of the

filter operation. The table also shows that VC-1 treats ¾ interpolation as a special case of quarter-pel interpolation,
corresponding to pixels located as shown in Fig. 12. In this case, although the coefficient

values are equivalent, they are applied to opposite input pixels. Also from the table it can be seen that all the coefficient

multiplications can be obtained with simple shifts and adds, except for the multiplication by 53, which needs a multiplier

structure. The complexity cost of this multiplier will be lower than that of a full multiplier since the multiplicand is fixed, so it can

be implemented with logic. VC-1 also includes an optional lower complexity mode that uses a simple two-tap bilinear filter

for the ¼ positions which has not been included in the table.
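The claim that the filter coefficients reduce to shifts and adds can be checked directly; for example, the only coefficient requiring more than two terms is 53.

#include <stdint.h>

/* Constant multiplication by 53 decomposed into shifts and adds:
 * 53 = 32 + 16 + 4 + 1, so 53*x = (x<<5) + (x<<4) + (x<<2) + x.
 * A fixed-coefficient structure like this is much cheaper than a
 * general-purpose multiplier. */
static inline int32_t mul53(int32_t x)
{
    return (x << 5) + (x << 4) + (x << 2) + x;
}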

To add support for these two new standards in our processor, the half-pel and quarter-pel processing functions need to be

modified to accommodate these new coefficients. Half-pel interpolation is done exhaustively over the 20×20 pixel area surrounding the

winning integer-pel motion vector. In H.264 this is done for performance reasons to be able to efficiently process the

calculation of the quarter-pel pixels that need diagonal half-pel pixels. Notice that to obtain diagonal pixels, we need to

obtain horizontal half-pel pixels first. If interpolation were instead done on demand, performance would be negatively affected, since the processor would need first to obtain the

half-pel horizontal pixels, then the half-pel diagonal pixels and finally the quarter-pel pixels, interrupting the pipeline while

this extra processing is taking place. This will be even more challenging for VC-1 and AVS that need not just two, but four

half-pel pixels to obtain the correct value of the quarter-pel pixels. In the H.264 solution, all half-pel pixels have been

already calculated and can be read in parallel to obtain the quarter-pel pixels, maintaining the pipeline fully utilized. The


following subsections describe how the half-pel and quarter-pel processing functions have been extended to support VC-1

and AVS.

B. Half-pel pixel processing extensions.

Half-pels are calculated in eight systolic processors that contain six PEs each. Supporting VC-1 and AVS is straightforward
since the number of coefficients involved is only four, so two PEs are set to zero to disable them when they are not needed.

Data still flows through the disabled PEs, so the control logic that reads integer pixels from memory and inputs them into the

processing elements does not need to be modified. Fig.13 shows the simplified diagram of the systolic processor. We have

opted for a simple systolic structure with a global adder since it obtains the required throughput and does not become a

bottleneck in the processor. An output register temporarily buffers eight output pixels before they are written to the half-pel

memories. The performance of the half-pel processing unit is equivalent in all three possible modes. The complexity of the

systolic processor in Fig.13 rises from 383 logic cells to 450 logic cells when it is configured in multi-standard mode, which

is a modest increase.
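The shared six-tap structure with zeroed outer taps can be sketched as follows; only the H.264 coefficients and scaling are from the standard, while the VC-1/AVS coefficient sets and divisors would come from Table 1 and are left as placeholders.

#include <stdint.h>

/* Sketch of the multi-standard half-pel datapath: one six-tap structure is
 * shared by all standards; for the four-tap VC-1/AVS filters the two outer
 * processing elements are given zero coefficients, so data still flows
 * through them but they contribute nothing.  Clipping to [0,255] is omitted. */
int filter6(const uint8_t px[6], const int coef[6], int divisor)
{
    int acc = 0;
    for (int i = 0; i < 6; i++)
        acc += coef[i] * px[i];           /* zeroed taps add nothing */
    return (acc + divisor / 2) / divisor; /* rounding and scaling */
}

/* H.264 mode: all six taps active, result scaled by 32. */
static const int h264_halfpel[6] = { 1, -5, 20, 20, -5, 1 };
/* VC-1/AVS mode (illustrative): outer taps disabled, i.e. { 0, a, b, c, d, 0 }
 * with a..d and the divisor taken from Table 1. */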

C. Quarter-pel pixel processing extensions.

Compared with the half-pel extensions, the quarter-pel processing unit needs more attention. The first challenge is that both

VC-1 and AVS use a similar four-tap filter instead of the two-tap averaging filter used in H.264. This means that four half or

integer pixels are needed to obtain each quarter-pel pixel. In H.264 mode the data is obtained from accessing two of the

half-pel memories, aligning as required by the motion vector, and averaging. To guarantee that eight valid pixels are

available after the alignment, two 64-bit memory locations are read from each memory using dual-port memories (two reads

and one write). The alignment units then select the right eight bytes out of the 16 bytes read. In VC-1 and AVS, the number

of ports needed to get the four eight-byte vectors is doubled. This means that the number of BRAMs used to store the half-

pel pixels must be doubled and the number of vector alignment units must also be increased from two to four. This will

double the logic needed to support fractional-pel searches in VC-1 and AVS. Instead of doubling the logic, we opt for

halving the performance of the fractional-pel pipelines when operating in VC-1 or AVS modes. The memories maintain the

number of ports as in H.264 but are read twice to obtain two eight-byte aligned vectors each time. To perform the filtering

itself we have designed a quarter-pel interpolator as shown in Fig.14. Similarities exist among the shift and add operations

needed to obtain the interpolated pixels for the three standards. These similarities can be exploited to reduce the complexity

of this unit, as shown in Fig. 14. The multiplexers are controlled by signals defined by the active standard. Pipelines a and b in

Fig.14 process two eight-pixel vectors. Operator sharing among the standards means that the additional logic needed to

support VC-1 and AVS is reduced. The limitation in fractional-pel performance in VC-1 and AVS modes is only a problem

if the fractional-pel searches take longer than the integer-pel searches, which is not generally the case. If they do, the

core can be configured with more fractional-pel execution units. The complexity of the quarter-pel pixel processing block in


H.264 mode is approximately 82 logic cells, since simple pixel averaging is involved. This value increases to 508 logic cells

in multi-standard mode since more complex operations and a more complex state machine are needed to cycle the logic

twice and interpolate four half- and full-pel pixels.

VII. HARDWARE PERFORMANCE EVALUATION AND IMPLEMENTATION

For the implementation, we have selected the Virtex-4 SX35 device included in the ML402 development platform. This

device offers a medium level of density inside the Virtex-4 family and can be considered mainstream, being fabricated using

90 nm CMOS technology. The results of implementing the processor with different numbers and types of execution units are

illustrated in Table 2. The basic configuration is relatively small, using 7.4% of the available logic resources and 21 memory
blocks. The size of the block RAMs in Virtex-4 devices enables two reference search areas of size 112×128 to fit in this

memory as previously described. Each new execution unit adds around 1600 logic cells and 17 embedded memory blocks to

the complexity. The fractional and integer execution units have been carefully pipelined and all the configurations can

achieve a clock rate of 200 MHz. Obtaining a performance value in terms of macroblocks per second is not as
straightforward as in the full-search case, which always computes the same number of SADs for each macroblock. In this case,

the amount of motion in the video sequence, the type of algorithm and the hardware implementation affect the number of

macroblocks per second that the engine can process. Equation (1) can be used to estimate the cycle count needed to process

a single 16×16 partition and a single reference frame. The variables in the equation are ppp (search points per pattern), eu

(execution units available), ppm (average search points per macroblock).

(1)
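The body of equation (1) is not reproduced in this transcript; the sketch below therefore only restates the cycle model explained in the following paragraph (33 cycles per search point, points shared among the execution units, and an 11-cycle pipeline-drain overhead per pattern iteration), and the exact published form may differ.

/* Sketch of the cycle estimate behind equation (1): each search point costs
 * 33 cycles (32 eight-byte SADs plus one read/write/decision cycle), the
 * points of a pattern are shared among the 'eu' execution units, and each
 * pattern iteration pays an 11-cycle pipeline-drain overhead.  The mapping of
 * ppm to pattern iterations (ppm/ppp) is an assumption. */
unsigned cycles_per_partition(unsigned ppm, unsigned ppp, unsigned eu)
{
    unsigned iterations      = (ppm + ppp - 1) / ppp;   /* pattern iterations */
    unsigned rounds_per_iter = (ppp + eu - 1) / eu;     /* issue rounds       */
    return iterations * (rounds_per_iter * 33u + 11u);
}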

The equation takes into account that the SAD calculation of a whole macroblock at each search point takes 32 cycles (32 SADs of eight bytes
each) plus one cycle of memory read/write overhead and motion vector decision. There is also an 11-cycle overhead

representing the time needed to empty the integer pipeline before the best motion vector can be found in each pattern

iteration and the next pattern started from the current winning position. Also, the current microarchitecture always uses 33

cycles per search point and it does not try to terminate the SAD calculations earlier if the current cost becomes larger than

the cost obtained during a previous calculation. There are two reasons why this optimization based on monitoring the SAD

during the search point calculation has not been used: firstly, since the core uses multiple execution units it is very important

that all the execution units are maintained in synchrony so that a single control unit can issue the same control signals to all

the execution units. Execution units terminating at different clock cycles would violate this requirement. Secondly, the

detection of the cost has to be done at the bottom of the pipeline after the addition of the motion vector cost to the

accumulated SAD value. This means that all the SAD calculations already in flight in the pipeline must be invalidated,


introducing a bubble in the pipeline with a cost of 11 clock cycles. Unless the early detection saves more than 11 clock

cycles (which is not the case in our experiments, since the costs of nearby points tend to be close in value), there will be no

performance gains obtained from this technique but a considerable overhead in terms of logic. In order to use equation (1)

we need to estimate the average number of points searched per macroblock. These values have been measured using the

x264 software implementation configured with the diamond, hexagon and UMH algorithms, using the previous high

definition sequences with varying degrees of motion complexity. The values measured indicate that 15 points are tested for

the diamond search algorithm, 22 for the hexagon search algorithm, and 70 for the UMH search algorithm, for the 16×16

partition case, and these values increase by a factor of 1.5 if the 8×8 partitions are considered as well. The hardware can

overlap the fractional-pel search with the integer-pel search working on different macroblocks in parallel, so as long as the

fractional-pel completes faster than the integer part, it will not affect the performance of the core. In order to validate this

statement we have further analyzed the number of search points evaluated by the x264 algorithms. Of the three integer-pel
algorithms available in x264, the simple diamond search is the one with the highest probability of
completing before the fractional search part. We must also note that x264 only offers the diamond search for the fractional

refinement, consisting of up to 2 half-pel followed by up to 2 quarter-pel iterations. Table 3 shows the effects in search

complexity and bit rate reduction of adding the fractional-pel refinement to the integer diamond search for the three high-

definition videos tested. Table 3 shows that the number of search points tested when both half-pel and quarter-pel refinements are enabled is roughly
equivalent to that of the integer-pel part for the Sunflower and Tractor video sequences and smaller for the Pedestrian area

sequence. It also shows that the reduction in bit rate thanks to the fractional-pel search is very important. With this data we

can reasonably assume that the fractional-pel search will complete in approximately the same amount of time as the integer-

pel search and that it can be performed in parallel without increasing the total number of clock cycles. Table 4 shows the

performance figures obtained for the three motion estimation algorithms analyzed as a function of the number of integer-pel

execution units. The table shows the macroblocks per second for different configurations and also the largest video format

that each configuration can support. Table 4 only shows the optimal hardware configuration for each algorithm, taking into

account that, for example, a diamond search pattern does not benefit from more than four IPEUs and that a configuration

with three IPEUs will need the same number of cycles as the two-IPEU case. The reason for this is that while the first

iteration will use all three IPEUs, a second iteration will still be required to complete the pattern instruction, when only one

IPEU will be used. Finally, Table 5 compares the performance and complexity figures of the base configuration of our

processor against the ASIP cores proposed in [13] and [14]. The figures measured on a
general-purpose P4 processor with all assembly optimizations enabled are also presented as a reference, although the power

consumption and cost of this general-purpose processor are not suitable for the embedded applications this work targets.

These types of comparisons are difficult since the features of each implementation vary. For example, our base configuration

does not support fractional pel searches and the addition of the interpolator and fractional pel execution unit in parallel with


the integer pel execution unit increases complexity by a factor of three. The core presented in [14] does support fractional pel

searches, although with a non-standard interpolator, and both searches must run sequentially. Overall, Table 5 shows that our

core offers a similar level of integer performance in terms of clock cycles for the diamond search algorithm to the ASIP

developed in [14] with one execution unit, and performance almost doubles if the configuration instantiates two execution

units as shown in the last row. For these experiments, our core was retargeted to a Virtex-II device to obtain a fair

comparison, since this is the technology used in [13] and [14]. The pipeline of the proposed solution can clock at double the

frequency as shown in the table, and this helps to justify why our solution with a single execution unit can support 1080p HD

formats while the solution presented in [13] is limited to 720p HD formats. The measurements of cycles per macroblock

were obtained processing the same CIF sequences as used in [14].

The diamond search used in this experiment corresponds to the implementation available in x264 that includes up to 8

diamond iterations followed by a square refinement, using a single reference frame and a single macroblock size (16×16).

It is also noticeable that 287 cycles per macroblock will generate a throughput of almost 2000 CIF frames per second,

enabling the addition of extra reference frames and sub-partitions while still operating in real-time.
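As a quick consistency check of this figure (assuming the 200 MHz clock reported for the Virtex-4 configurations; the Virtex-II retarget may differ), a CIF frame contains 22×18 = 396 macroblocks, so 287 cycles per macroblock gives roughly 1760 frames per second.

#include <stdio.h>

/* Back-of-the-envelope check of the CIF throughput claim: 396 macroblocks per
 * CIF frame at 287 cycles each, with an assumed 200 MHz clock. */
int main(void)
{
    const int mb_per_frame = (352 / 16) * (288 / 16);      /* 396    */
    const double fps = 200e6 / (287.0 * mb_per_frame);     /* ~1760  */
    printf("%.0f CIF frames per second\n", fps);
    return 0;
}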

VIII. CONCLUSIONS

The proposed processor exploits both reconfigurability and programmability in an innovative way to support motion

estimation in three state-of-the-art video coding standards, seamlessly trading complexity, throughput and quality of results,

and matching the architecture to the workload requirements. The processor has been named LiquidMotion to reflect its

adaptability and the toolset, which enables the designer to create new algorithms and hardware processors without specific

knowledge of the microarchitecture, is available for download at http://sharpeye.borelspace.com/. Support for older

standards such as MPEG-2 can easily be added, since in this case, sub-pixel resolution is limited to half-pel using an

averaging filter. The advanced features available in these codecs are supported, including a parallel Lagrangian optimization

hardware unit which yields 10% lower bit rates for the same quality without affecting throughput. The research also shows

the importance of using relatively large search windows at 128×128 pixels for high definition video samples although further

increases of the search window do not bring noticeable benefits. The multi-standard hardware extensions targeting VC-1 and

AVS increase the flexibility of the processor with a relatively small hardware cost. Future work involves adding the core as

part of a dynamically reconfigurable multiFPGA array targeting multimedia applications.

REFERENCES

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer and T. Wedi, "Video coding with H.264/AVC: tools, performance and complexity," IEEE Circuits and Systems Magazine, vol. 4, pp. 7-28, 2004.
[2] S. Srinivasan et al., "Windows Media Video 9: overview and applications," Signal Processing: Image Communication, vol. 19, pp. 851-875, 2004.
[3] L. Yu et al., "An overview of AVS-Video: tools, performance and complexity," Visual Communications and Image Processing 2005, Proc. SPIE, vol. 5960, p. 596021, 2006.
[4] J. L. Nunez-Yanez, E. Hung and V. Chouliaras, "A configurable and programmable motion estimation processor for the H.264 video codec," Proc. International Conference on Field Programmable Logic and Applications (FPL), pp. 149-154, Sept. 2008.
[5] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh and L.-G. Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," Proc. ISCAS, May 2003.
[6] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang and L.-G. Chen, "Analysis and architecture design of variable block-size motion estimation for H.264/AVC," IEEE TCSVT, vol. 53, no. 3, pp. 578-593, March 2006.
[7] S. Y. Yap and J. V. McCanny, "A VLSI architecture for advanced video coding motion estimation," Proc. ASAP, pp. 293-301, June 2003.
[8] C.-Y. Kao and Y.-L. Lin, "An AMBA-compliant motion estimator for H.264 advanced video coding," Proc. IEEE International SoC Conference (ISOCC), Seoul, Korea, October 2004.
[9] B. M. Li and P. H. Leong, "Serial and parallel FPGA-based variable block size motion estimation processors," Journal of Signal Processing Systems, vol. 51, no. 1, pp. 77-98, April 2008.
[10] T. Moorthy and A. Ye, "A scalable computing and memory architecture for variable block size motion estimation on Field-Programmable Gate Arrays," Proc. FPL 2008, pp. 83-88.
[11] Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen and L.-G. Chen, "Survey on block matching motion estimation algorithms and architectures with new results," Journal of VLSI Signal Processing, vol. 42, no. 3, pp. 297-320, March 2006.
[12] S.-C. Cheng and H.-M. Hang, "A comparison of block-matching algorithms mapped to systolic-array implementation," IEEE TCSVT, vol. 7, no. 5, pp. 741-757, Oct. 1997.
[13] T. Dias, S. Momcilovic, N. Roma and L. Sousa, "Adaptive motion estimation processor for autonomous video devices," EURASIP Journal on Embedded Systems, vol. 2007, no. 1, pp. 41-41, January 2007.
[14] K. Babionitakis et al., "A real-time motion estimation FPGA architecture," Journal of Real-Time Image Processing, vol. 3, no. 1-2, pp. 3-20, March 2008.
[15] Information available at http://www.xilinx.com/products/ipcenter/DO-DI-H264-ME.htm
[16] S. Saponara, K. Denolf, G. Lafruit, C. Blanch and J. Bormans, "Performance and complexity co-evaluation of the advanced video coding standard for cost-effective multimedia communications," EURASIP J. Appl. Signal Process., no. 2, pp. 220-235, Feb. 2004.
[17] 1080p HD sequences obtained from http://nsl.cs.sfu.ca/wiki/index.php/Video_Library_and_Tools#HD_Sequences_from_CBC
[18] Information available at http://www.videolan.org/developers/x264.html


Fig. 1. Search range analysis for sunflower sequence


Fig. 2. Search range analysis for pedestrian area sequence


Fig. 3. Search range analysis for tractor sequence


Fig. 4. Sub-partitions analysis for pedestrian area sequence


Fig. 5. Sub-partitions analysis for sunflower sequence


Fig. 6. Sub-partitions analysis for crowdrun sequence


[Block diagram: a program memory feeds an instruction fetch/decode/issue unit; point memories and physical address calculators generate addresses for the reference memories, whose outputs pass through vector alignment logic into SAD adder trees; a motion vector decision unit and a GP register file hold motion vector candidates, the best motion vector and the best SAD/cost; Lagrangian cost logic (MV/MVP adders, quantization, MVCOST) is attached to the SAD outputs; half-pel interpolator systolic processors produce horizontal, vertical and diagonal HP pixels, and a quarter-pel pixel-processing stage with its own pipeline control completes the fractional-pel path; the units are organised into a main integer-pel pipeline, auxiliary integer-pel pipelines and a fractional-pel pipeline.]

Fig. 7. Processor microarchitecture with a total of six execution units


Fig. 8. Instruction set architecture (ISA)

[Instruction formats: 16-bit words with a 4-bit opcode followed by Field A and Field B. 0000: integer pattern instruction (pattern address, number of points); 0001: fractional pattern instruction (pattern address, number of points); 0010: conditional jump to label (immediate8) if the winner field equals the winner id encoded in the instruction (winner id = 0 means no winner in the pattern, otherwise it identifies the winning execution unit); 0011: unconditional jump to label (immediate8); 0100: conditional jump to label (immediate8) if the condition bit is set; 0101/0110/0111: compare instructions with a register and an immediate14 operand that set the condition bit on a less-than, greater-than or equal comparison.]


S = 8; // Initial step size

check(0, 0); check(0, S); check(0, -S); check(S, 0); check(-S, 0); update;

do {
    S = S / 2;
    for(i = 0 to 4 step 1) {
        check(0, S); check(0, -S); check(S, 0); check(-S, 0); update;
        #if( WINID == 0 ) #break;
    }
} while( S > 1 );

for(i = -0.5 to 0.5 step 0.25)
    for(j = -0.5 to 0.5 step 0.25)
        check(i, j);
update;

Compiled output (chk = integer check pattern instruction, chkjmp = conditional jump instruction, chkfr = fractional check pattern instruction):

0   0 05 00   chk     NumPoints: 5    startAddr: 0
1   0 04 05   chk     NumPoints: 4    startAddr: 5
2   2 00 0B   chkjmp  WIN: 0  goto: 11
3   0 04 05   chk     NumPoints: 4    startAddr: 5
...
11  0 04 0A   chk     NumPoints: 4    startAddr: 9
12  2 00 15   chkjmp  WIN: 0  goto: 21
...
21  0 04 0D   chk     NumPoints: 4    startAddr: 13
22  2 00 1F   chkjmp  WIN: 0  goto: 31
...
31  1 19 11   chkfr   NumPoints: 25   startAddr: 17

Fig. 9. Programming example


/* UMH algorithm */

Pattern(diamondhp){ check(0,0.5) check(0,-0.5) check(0.5,0) check(-0.5,0) }
Pattern(diamondqp){ check(0,0.25) check(0,-0.25) check(0.25,0) check(-0.25,0) }
/* other pattern definitions */

check(0,0); update;
check(diamond);

#if (COST > 2000){
    // large cross
    // horizontal cross
    for(i = -17 to 17 step 2) { check(i,0); }
    // vertical cross
    for(i = -7 to 7 step 2) { check(0,i); }
    update;
}
#else{
    // small cross
    // horizontal cross
    for(i = -5 to 5 step 2) { check(i,0); }
    // vertical cross
    for(i = -3 to 3 step 2) { check(0,i); }
    update;
}

/* large hexagon, small full search, hexagon and final square refinement as in UMH */

// hp refinement
for(loop = 1 to 2 step 1){
    check(diamondhp);
    #if( WINID == 0 ) #break;
}

// qp refinement
check(squareqp);
// End

// full search example

// first point
check(0,0); update;

// full search
for(i = -7 to 7 step 1)
    for(j = -7 to 7 step 1)
        check(i, j);
update;

0 0 01 00 chk NumPoints: 1   startAddr: 0
1 0 E1 01 chk NumPoints: 225 startAddr: 1

Fig. 10. Programming example for the full-search and UMH algorithms.


[Flow diagram: a new ME algorithm, written as high-level ME algorithm code, is compiled by the SharpEye compiler into assembly code and then assembled and linked by the SharpEye assembler/linker into a program binary and a point binary; a cycle-accurate simulator/configurator evaluates the result against energy, throughput, quality and area constraints, iterating over the number and type of functional units (integer and fractional pel, Lagrangian, motion vector candidates, etc.) until the constraints are met; the configuration RTL file is then generated from the RTL component library and processed with standard synthesis and place-and-route FPGA tools to produce the processor bitstream for the new hardware configuration.]

Fig. 11. Processor programming and configuration workflow.


Table 1. Fractional-pel interpolation filter coefficients


Fig. 12. Fractional pixel locations

[Datapath diagram: six processing elements (PE1 to PE6) form a systolic chain; 8-bit pixels and a mode signal enter the chain, 13-bit partial results are passed between stages, and a final add-and-shift stage produces the interpolated pixel output.]

Fig. 13. Half-pel systolic processor unit


Virtex-4 SX35

Configuration      LUTs used / LUTs available   Memory blocks used / available / minimum memory bits   Critical path (ns) / logic levels
1 IPEU / 0 FPEU    2259/30720 (7.4%)            21/192 (10%) / 95 Kbits                                4.976 / 8
2 IPEU / 0 FPEU    3805/30270 (12.6%)           38/192 (19%) / 179 Kbits                               5.040 / 8
3 IPEU / 0 FPEU    5571/30270 (18.4%)           55/192 (28%) / 263 Kbits                               5.032 / 7
1 IPEU / 1 FPEU    9143/30270 (30.2%)           31/192 (16%) / 95+42 Kbits                             4.986 / 6
2 IPEU / 1 FPEU    10985/30270 (36.2%)          48/192 (39%) / 179+84 Kbits                            4.996 / 9

Table 2. Processor complexity in H.264 mode


[Datapath diagram: the quarter-pel unit combines the half-pel pixel streams a and b through shift-and-add stages whose effective coefficients (×1, ×3, ×18, ×53, ×(-3), ×(-4)) are selected according to the target standard (H.264, VC-1 or AVS) and pipelined before rounding to produce the quarter-pel pixels.]

Fig. 14. Quarter-pel processor unit


Average fractional-pel search points per macroblock and bit-rate reduction (%) by type of fractional-pel search:

1080p video sequence   None (FP points / reduction)   Half-pel only (HP points / reduction)   Half-pel and quarter-pel (HP&QP points / reduction)
Pedestrian area        15.7 / 0%                      5.4 / 6%                                9.3 / 12%
Sunflower              11.3 / 0%                      5.0 / 5%                                9.1 / 11.5%
Tractor                11.1 / 0%                      6.2 / 16%                               10.2 / 25%

Table 3. Evaluation of fractional-pel search


Throughput in macroblocks per second at 200 MHz (16×16 only / 16×16 and 8×8 partitions):

Integer execution units   Diamond (4 ppp)                             Hexagon (6 ppp)                             UMH (16 ppp)
1                         372,960 (1080p@30) / 233,918 (720p@50)      260,983 (1080p@30) / 173,988 (720p@30)      84,813 / 56,542
2                         692,640 (1080p@50) / 461,760 (1080p@50)     495,867 (1080p@50) / 330,578 (1080p@30)     166,233 (720p@30) / 110,822
3                         -                                           708,382 (1080p@50) / 472,255 (1080p@50)     -
4                         1,212,121 (1080p@50) / 808,080 (1080p@50)   -                                           319,680 (1080p@30) / 213,120 (720p@50)
6                         -                                           1,239,669 (1080p@50) / 826,446 (1080p@50)   -
8                         -                                           -                                           593,692 (1080p@50) / 395,794 (720p@30)
16                        -                                           -                                           1,038,961 (1080p@50) / 692,640 (1080p@50)

Table 4. Performance analysis of the configurable processor


Processor implementation                     Cycles per MB (diamond search)   FPGA complexity (slices)   FPGA clock (MHz, Virtex-II)   Memory (BRAMs)
Intel P4 assembly                            ~3,000                           N/A                        N/A                           N/A
Dias et al. [13]                             4,532                            2,052                      67                            4 (external reference area)
Babionitakis et al. [14]                     660                              2,127                      50                            11 (1 reference area of 48×48 pixels)
Proposed, one integer-pel execution unit     510                              1,231                      125                           21 (2 reference areas of 112×128 pixels)
Proposed, two integer-pel execution units    287                              2,051                      125                           38 (2 reference areas of 112×128 pixels)

Table 5. Performance/complexity comparison