FPGA-based fully pipelined and high-quality image rotation

Gongchao Tang, Luxin Yan
(Huazhong University of Science and Technology, IPRAI)

Abstract: This paper focuses on a low-latency, fully pipelined implementation of image rotation with reduced memory on an FPGA board. The method used is based on the three-pass algorithm and cubic convolution interpolation. The difficulty of the work lies in constructing the full pipeline, because it is hard to obtain the positional relationship of pixels after shearing without storing them. In this article, mechanisms for judging the row offset and the column offset are presented to select pixels in the same row or the same column of a template for interpolation, which removes the bottleneck of the pipeline. Finally, the rotation results, a summary of the memory consumption in the FPGA, and a comparison with an existing method are presented.

Key words: Image rotation; three-pass algorithm; cubic convolution interpolation; FPGA pipelined implementation

0 Introduction

Digital image rotation is widely used in the military, aerospace and medical imaging fields. Traditional image rotation is performed in 2-D space. Since the coordinates of pixels after
shearing are not integers, we must use 2-D interpolation, and an n-degree ($n \ge 3$) interpolation is needed for higher quality. All of this makes the operation complicated and time-consuming, so a software implementation takes too long to compute and cannot run in real time. For a hardware implementation, the trade-off between image quality and the complexity of the approach should be considered. The author of [1] implemented image rotation on a DSP and tested several interpolation methods, but the throughput is only 25 frames per second for video images, a large amount of memory is needed, and the calculation speed limits high-quality processing. For higher speed, an FPGA is therefore needed. Existing FPGA implementations tend to use external memory to store the temporary results of pixel reorientation [2], which wastes much time on reading and writing data and reduces the efficiency of the FPGA.

The three-pass algorithm decomposes the whole rotation into an appropriate sequence of 1-D signal translations. To assure the quality of the result, we use cubic convolution interpolation after every shearing, which has much better performance
on quality than most other interpolations and is simplified to a 1-D approach by the three-pass algorithm. In this paper, a fully pipelined implementation of rotation based on the three-pass algorithm and cubic convolution interpolation is proposed, which needs to store only 7 rows of the image, and only significant pixels are processed in the flow. The delay of the system is negligible.

1 Algorithm of image rotation

The basic 2-D image rotation operation can be described using the following equation:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (1)$$

Thus, a gray-level pixel at position $(x, y)$ in the original image is mapped to the position $(x', y')$ in the destination image following a rotation of angular magnitude $\theta$.

Brief author introduction: Gongchao Tang, male, master degree, IC design.
Corresponding author: Luxin Yan, 1978, male, Associate Professor, IC and computer. E-mail: super686

The 2-D formula above can be decomposed into [3, 4]:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & -\tan\frac{\theta}{2} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \sin\theta & 1 \end{bmatrix} \begin{bmatrix} 1 & -\tan\frac{\theta}{2} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (2)$$

Each elementary transformation in (2) corresponds to a shearing of the image in the x or y direction. The processing is one-dimensional in nature, since each row is simply translated by an offset $\Delta x = -y\tan\frac{\theta}{2}$, or each column is translated by an
offset $\Delta y = x\sin\theta$. We limit $\theta$ to $[-45^\circ, 45^\circ]$ because we can first rotate the image by 90°, 180° or 270° without any loss when the angle is beyond this range [5].

As we use cubic convolution interpolation after shearing, the corresponding 2-D interpolation formula for computing the value is of the form [6]:

$$f(x, y) = \sum_i \sum_j f(x_i, y_j)\, s(x - x_i)\, s(y - y_j) \qquad (3)$$

where $f(x_i, y_j)$ is the gray-level value at $(x_i, y_j)$ after shearing, $x_i$ and $y_j$ are non-integer, and $f(x, y)$ is the new interpolated value at the integer coordinate $(x, y)$. The cubic interpolation function $s(\cdot)$ is given by [6]:

$$s(v) = \begin{cases} \frac{3}{2}|v|^3 - \frac{5}{2}|v|^2 + 1, & 0 \le |v| < 1 \\ -\frac{1}{2}|v|^3 + \frac{5}{2}|v|^2 - 4|v| + 2, & 1 \le |v| < 2 \\ 0, & 2 \le |v| \end{cases} \qquad (4)$$

Corresponding to the decomposability of the rotation matrix, (3) can be simplified to:

$$f(x, y) = \sum_i f(x_i, y_j)\, s(x - x_i) \qquad (5)$$

or

$$f(x, y) = \sum_j f(x_i, y_j)\, s(y - y_j) \qquad (6)$$

From (4) we find that the interpolation needs 4 pixels in the same row or in the same column at a time. Conversely, as long as we have 4 pixels in the same row or column, the pixel $(x, y)$ must lie at the integer coordinate between the second pixel and the third pixel.

Here we define a function of $\theta$:

$$\mathrm{sign}(\theta) = \begin{cases} 0, & \text{counter-clockwise rotation} \\ 1, & \text{clockwise rotation} \end{cases}$$

Figure 1. Architecture view of design

2 Implementation on FPGA board

The architecture of the design is shown in Figure 1. The whole method consists of 3 shearings; modules such as coordinate calculation, offset calculation and interpolation contain only simple calculations, which can be easily implemented in an FPGA. As we do not use
memory to store the results of each shearing, the data flow may not be arranged by coordinates (especially in the 3rd shearing), so the bottleneck of the pipeline is judging the positions of the pixels used for interpolation. Here we give a mechanism by which, when a pixel flows in, we can determine its neighbors (in row or column) within one clock cycle. The judging details differ in every shearing. The shapes of the image after each shearing are shown in Figure 2.

Figure 2. Results after shearing

A. First shearing

The structure of the 1st shearing is shown in Figure 3. Pixels flow through registers a4, a3, a2 and a1 so that 4 adjacent pixels in the same row are selected for the first interpolation. The 4 pixels lie at fractional coordinates after the 1st shearing; they are used to calculate the pixel at the integer coordinate between a2 and a3.

The edge judgment module lets us build the right template at the edge of the image. We calculate the coordinates in the original image with the module "coordinate calculate 1", from which we calculate the row offset and the value of the cubic interpolation function $s(\cdot)$. The row offset is used to calculate the new pixel's integer coordinate between a2
and a3 after the 1st shearing, and the values of the cubic interpolation function together with a4, a3, a2 and a1 are used to calculate the value of the new pixel.

Figure 3. Structure of the 1st shearing

B. Second shearing

The structure of the 2nd shearing is shown in Figure 4; the image size is M×N. When the result image of the 1st shearing flows through the registers a41, a42, a43, a12, a13 and the 3 FIFOs, we build up a template (with a32 as the center), shown in Figure 5 and corresponding to Figure 2(2). Here we introduce a new module, "row offset judge", to find 4 adjacent pixels in the same column of the template.

Figure 4. Structure of the 2nd shearing (template registers and FIFO1 to FIFO3, each of depth M−3, followed by the edge judgement, row offset judge, column offset calculate, s(|y − y_i|) calculate, interpolation and coordinate calculate modules)

We know that the offset of a row in the 1st shearing is $\Delta x = -y\tan\frac{\theta}{2}$, which means the difference between the offsets of any two adjacent rows is $\tan\frac{\theta}{2}$. As $\theta \in [-45^\circ, 45^\circ]$, we have $\max(\tan\frac{\theta}{2}) = \tan 22.5^\circ = 0.414$, and $2\tan 22.5^\circ = 0.828$. So the difference among the offsets of any three adjacent rows is no more than 1 pixel. We conclude that a row must either shift by 1 pixel or hold still relative to its neighbor in the 1st shearing.

Figure 5. Template in the 2nd shearing

According to the analysis above, to select 4 pixels a1, a2, a3, a4 in the same column, we judge the row offsets in the 1st shearing. Without question, a3 = a32. Take a1 for example. If offset_row3 == offset_row1, row3 and row1 hold still relative to each other in the 1st shearing, so a32 and a12 are still in the same column and we take a1 = a12. Else, if offset_row3 != offset_row1, row1 shifts 1 pixel relative to row3. Further, if sign(θ) = 0, row1 shifts 1 pixel left, so after the 1st shearing a13 and a32 are in the same column and a1 = a13; otherwise a1 = a11.

Following the same rule of selection, a2 and a4 are given by:

    a2 = (offset_row3 == offset_row2) ? a22 : (sign(θ) ? a21 : a23);
    a4 = (offset_row3 == offset_row4) ? a42 : (sign(θ) ? a43 : a41);

C. Third shearing

The flow of the 3rd shearing is similar to that of the 2nd shearing, but selecting a1, a2, a3 and a4, which must be adjacent and in the same row, is more complicated. Following Figure 2(3), Figure 6 shows the template (with a33 as the center) built in the 3rd shearing.

Figure 6. Template built in the 3rd shearing

Here we use the row offset in the 1st shearing and the column offset in the 2nd shearing to find a1, a2, a3 and a4. As $\max(\sin\theta) = 0.7071$ in the range $-45^\circ \le \theta \le 45^\circ$, the difference between the offsets of any two adjacent columns is no more than 1 pixel.

We take a33 as the center of the template. First we assign the third pixel a3 = a33, and the selections of the
remaining 3 pixels are based on the offsets of the rows or columns they are in, compared with a33. We take the selection of a1 as an example.

1) If offset_column3 == offset_column1, column 1 and column 3 in Figure 6 hold still relative to each other in the 2nd shearing. As there is only row shift in the 1st shearing, a31 is in the same row as a33 all the time, and a31 is the right choice for a1.

2) Else if |offset_column3 − offset_column1| == 1, column 1 moves 1 pixel relative to column 3, up or down, so a1 must be chosen from row2 or row4. If sign(θ) = 0, the image is rotated counter-clockwise and column 1 must be shifted down relative to column 3, so a pixel of row2 is the right choice. If offset_row3 == offset_row2, row3 and row2 hold still relative to each other in the 1st shearing, so a21 moves to the position of a31 after the 2nd shearing, and a1 = a21. If offset_row3 != offset_row2, a22 moves to the position of a21 relative to a33 in the 1st shearing and then to the position of a31 in the 2nd shearing, so a1 = a22. When sign(θ) = 1, the image is rotated clockwise and column 1 must be shifted up relative to column 3, so a pixel of row4 is the right choice for a1. The complete judgment in this condition is similar to the
condition of sign(θ) = 0.

3) Else if |offset_column3 − offset_column1| == 2, column 1 shifts 2 pixels relative to column 3, up or down, so a1 must be chosen from row1 or row5, and the complete judgment is similar to 2).

Based on this rule of selection, we select a2 as follows:

    if (offset_column3 == offset_column2) a2 = a32;
    else if (offset_column3 != offset_column2 & sign(θ) == 0) {
        if (offset_row3 == offset_row2) a2 = a22;
        else if (offset_row3 != offset_row2) a2 = a23;
    }
    else if (offset_column3 != offset_column2 & sign(θ) == 1) {
        if (offset_row3 == offset_row4) a2 = a42;
        else if (offset_row3 != offset_row4) a2 = a43;
    }

The method for getting a4 is similar to, or can even be seen as the same as, the method for a2.

Values of $\sin\theta$ and $\tan\frac{\theta}{2}$ are stored in a look-up table with 0.01° resolution, so we can get the value of a trigonometric function within 1 clock cycle.

3 Results and discussion

We simulate the method with a 256×256 8-bit image in ModelSim; the processing time and memory consumption are summarized in Table 1. It can be seen that the proposed method costs less time and fewer memory bits. The method is validated on a PCI card; the pixel frequency is 30 MHz, and the FPGA is an EP1S20 of the Stratix series. For the 128×128 8-bit image, the processing delay is only about 0.014 ms, and the whole rotation takes about 0.56 ms. The total memory consumption is 7168 bits, which means we need to store the pixels of only 7 rows in all. The processing time and memory requirement depend only on the image size and the pixel frequency (see Table 2). Furthermore, for an image of size M×N×k bit and a pixel frequency f (MHz), the latency of every shearing is shown in Table 3, from which we conclude that the latency of the method is $T_d = (38 + 3M)/f$ (µs), and the whole processing time is $T = (38 + 3M + MN)/f = (38 + M(N + 3))/f$ (µs).

Table 1. Comparison of the 3 methods

    Method            Rotation time   Angle   Memory consumption   Image size       Frequency
    [2]               17.6 ms         45°     2×512k×8 bit         256×256×8 bit    20 MHz
    Proposed method   3.32 ms         45°     7×256×8 bit          256×256×8 bit    20 MHz

Table 2. Statistics of images of different sizes

    Image size         Latency (ms)   Rotation time (ms)   Memory (kbits)   Frequency (MHz)
    256×256×8 bit      0.016          1.32                 14               50
    512×512×8 bit      0.031          5.28                 28               50
    1024×1024×8 bit    0.062          21.04                56               50

Table 3. Statistics of latency (counted in clock cycles)

    Stage          Template built   Selection of pixels   Row or column offset calculate   Interpolation
    1st shearing   1                0                     3                                6
    2nd shearing   M+1              3                     3                                6
    3rd shearing   2M+1             5                     3                                6

4 Conclusion

In this paper, we have presented an implementation of image rotation and validated the method with a 128×128 image. The whole rotation is implemented without storing large-scale image data, and the latency of the full pipeline is only the pixel cycles of several rows. The rotation time is independent of the angle $\theta$, which means it takes the same time to rotate the image by any angle. The method also adapts to images of larger size. And the cubic convolu
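
As a cross-check of the timing model in Section 3, the following Python sketch evaluates $T_d = (38 + 3M)/f$ and $T = (38 + 3M + MN)/f$ together with the 7-row buffer size. It is a software model only, not part of the FPGA design. The function names are hypothetical, and the per-stage cycle counts are an assumption based on reading Table 3 as four columns per stage.

```python
def latency_cycles(m: int) -> int:
    """Pipeline latency in pixel-clock cycles for an image with M pixels per row.

    Per-stage counts are an assumed reading of Table 3
    (template built + pixel selection + offset calculation + interpolation).
    """
    first = 1 + 0 + 3 + 6              # 1st shearing: 10 cycles
    second = (m + 1) + 3 + 3 + 6       # 2nd shearing: M + 13 cycles
    third = (2 * m + 1) + 5 + 3 + 6    # 3rd shearing: 2M + 15 cycles
    return first + second + third      # equals 38 + 3M


def latency_us(m: int, f_mhz: float) -> float:
    """Latency T_d in microseconds at pixel frequency f (MHz)."""
    return latency_cycles(m) / f_mhz


def rotation_time_us(m: int, n: int, f_mhz: float) -> float:
    """Whole processing time T in microseconds for an M x N image."""
    return (latency_cycles(m) + m * n) / f_mhz


def buffer_bits(m: int, k: int) -> int:
    """Memory for 7 buffered rows of M pixels with k bits per pixel."""
    return 7 * m * k


if __name__ == "__main__":
    # 128 x 128 x 8-bit image at a 30 MHz pixel clock
    print(round(latency_us(128, 30), 2), "us latency")
    print(round(rotation_time_us(128, 128, 30), 1), "us total")
    print(buffer_bits(128, 8), "bits buffered")
```

For the 128×128 image at 30 MHz this gives about 14 µs of latency, about 560 µs in total and 7168 buffered bits, in line with the figures reported in Section 3.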

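
The decomposition in (2) and the kernel in (4) can also be checked numerically. The sketch below, again in Python and with hypothetical helper names, multiplies the three shear matrices and compares the product with the rotation matrix in (1), then applies the four-tap kernel $s(v)$ to one row of samples. The tap ordering a1 to a4, with the output located between a2 and a3, follows the description of the first shearing; the parameter `frac` (distance of the output position from a2) is an assumption introduced for illustration.

```python
import math


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rotation(theta):
    """Rotation matrix of equation (1); theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def three_pass(theta):
    """Product of the three shear matrices of equation (2)."""
    shear_x = [[1.0, -math.tan(theta / 2)], [0.0, 1.0]]  # row shift
    shear_y = [[1.0, 0.0], [math.sin(theta), 1.0]]       # column shift
    return matmul2(matmul2(shear_x, shear_y), shear_x)


def cubic_kernel(v):
    """Cubic convolution kernel s(v) of equation (4)."""
    v = abs(v)
    if v < 1:
        return 1.5 * v**3 - 2.5 * v**2 + 1
    if v < 2:
        return -0.5 * v**3 + 2.5 * v**2 - 4 * v + 2
    return 0.0


def interpolate_1d(a1, a2, a3, a4, frac):
    """Value at distance frac (0 <= frac < 1) past a2, toward a3.

    a1..a4 are four consecutive samples with unit spacing (assumed layout).
    """
    taps = (a1, a2, a3, a4)
    distances = (1 + frac, frac, 1 - frac, 2 - frac)
    return sum(p * cubic_kernel(d) for p, d in zip(taps, distances))


if __name__ == "__main__":
    theta = math.radians(30)
    r, t = rotation(theta), three_pass(theta)
    err = max(abs(r[i][j] - t[i][j]) for i in range(2) for j in range(2))
    print("max difference between (1) and (2):", err)
    print("midpoint of a linear ramp:", interpolate_1d(0, 1, 2, 3, 0.5))
```

The three shears reproduce the rotation matrix up to floating-point rounding, and the four kernel weights sum to 1, so a constant or linearly varying row of pixels is left unchanged by the interpolation.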