All ATmega, AT90CAN and AT90PWM devices have an on-board hardware multiplier that performs 8-by-8-bit multiplications in only two clock cycles. So whenever you have to multiply and you are sure that the software never needs to run on an AT90S or ATtiny chip, you can make use of this hardware feature. This page shows how to do it.

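Multiplying two 8-bit binaries is simple and straightforward. If the two binaries are in R16 and R17, a single instruction does the job: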

mul R16,R17

As the product of two 8-bit binaries might be up to 16 bits long, the result is placed in the registers R1 (most significant byte) and R0 (least significant byte). That's all there is to it.

A simulation in the Studio demonstrates this. The program multiplies decimal 250 (hex FA) in R16 by decimal 100 (hex 64) in R17.
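The source of that test program is not reproduced here. A minimal sketch of what it might look like, assuming both factors are loaded as constants with ldi (the header comment and layout are illustrative, not the original source):

```
;
; Test hardware multiplication 8-by-8-bit (sketch)
;
ldi R16,250 ; first factor, hex FA
ldi R17,100 ; second factor, hex 64
mul R16,R17 ; 16-bit product goes to R1:R0
```

Note that ldi only works with the registers R16 to R31, which is why the factors are placed there.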

The registers R0 (LSB) and R1 (MSB) hold the result hex 61A8 or decimal 25,000.

And yes, that requires only two clock cycles, or 2 microseconds at a 1 MHz clock.


Multiplying a 16-bit binary by an 8-bit binary takes a little more. First the math: a 16-bit binary is simply two 8-bit binaries, where the most significant one of the two is multiplied by decimal 256 or hex 100. For those who need a reminder: the decimal 1234 is simply (12 multiplied by 100) plus 34, or (1 multiplied by 1000) plus (2 multiplied by 100) plus (3 multiplied by 10) plus 4. So the 16-bit binary m1 is equal to 256*m1M plus m1L, where m1M is the MSB and m1L is the LSB. Multiplying m1 by the 8-bit binary m2 is, mathematically formulated:

m1 * m2 = (256*m1M + m1L) * m2, or 256*m1M*m2 + m1L*m2

So we just need to do two multiplications and add both results. If you see three asterisks in the formula: the multiplication by 256 doesn't require any hardware at all in the binary world, because it is a simple move to the next higher byte. It works just like multiplication by 10 in the decimal world, which simply moves the number one digit to the left and writes a zero to the least significant digit.

So let's go to a practical example. First we need some registers to

- load the numbers m1 and m2,
- provide space for the result, which might be up to 24 bits long.

```
;
; Test hardware multiplication 16-by-8-bit
;
; Register definitions:
;
.def Res1 = R2
.def Res2 = R3
.def Res3 = R4
.def m1L = R16
.def m1M = R17
.def m2 = R18
```

First we load the numbers:
```
;
; Load Registers
;
.equ m1 = 10000
;
ldi m1M,HIGH(m1) ; upper 8 bits of m1 to m1M
ldi m1L,LOW(m1) ; lower 8 bits of m1 to m1L
ldi m2,250 ; 8-bit constant to m2
```

The two numbers are loaded into R17:R16 (dec 10000 = hex 2710) and R18 (dec 250 = hex FA). Then we multiply the LSB first:

```
;
; Multiply
;
mul m1L,m2 ; Multiply LSB
mov Res1,R0 ; copy result to result register
mov Res2,R1
```

The multiplication of the LSB of m1, hex 10, by m2, hex FA, yields hex 0FA0, written to the registers R0 (LSB, hex A0) and R1 (MSB, hex 0F). The result is copied to the lower two bytes of the result register, R3:R2. Now the multiplication of the MSB of m1 with m2 follows:

```
mul m1M,m2 ; Multiply MSB
```

The multiplication of the MSB of m1, hex 27, with m2, hex FA, yields hex 2616 in R1:R0. Now two steps are performed at once: the multiplication by 256 and the addition to the previous result. This is done by adding R1:R0 to Res3:Res2 instead of Res2:Res1. R1 can just be copied to Res3, because Res3 has not been used so far. R0 is then added to Res2. If the carry is set after adding, the next higher byte Res3 is increased by one.

```
mov Res3,R1 ; copy MSB result to result byte 3
add Res2,R0 ; add LSB result to result byte 2
brcc NoInc ; if not carry, jump
inc Res3
NoInc:
```

The result in R4:R3:R2 is hex 2625A0, which is decimal 2,500,000 and therefore correct. The cycle counter of the multiplication points to 10, which at a 1 MHz clock is a total of 10 microseconds. That is very much faster than software multiplication!
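The conditional jump around inc can also be avoided by adding a zero with carry. The following is only a sketch of that variant, assuming the register definitions from above plus one spare register; the name zero and the choice of R5 are hypothetical and not part of the original program:

```
;
; Multiply 16-by-8-bit without a conditional jump (sketch)
;
.def zero = R5 ; hypothetical helper register, assumed to be unused
;
clr zero ; zero = 0, only used to add the carry
mul m1L,m2 ; Multiply LSB
mov Res1,R0 ; copy result to result bytes 1 and 2
mov Res2,R1
mul m1M,m2 ; Multiply MSB
mov Res3,R1 ; copy MSB result to result byte 3
add Res2,R0 ; add LSB result to result byte 2
adc Res3,zero ; add zero plus the carry, if any, to result byte 3
```

Counted by hand, this variant also takes 10 clock cycles, so it is not faster. Its advantage is that it needs neither a label nor a jump.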


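Multiplying two 16-bit binaries m1 and m2 follows the same scheme, with both numbers split into their MSB and LSB:

m1 * m2 = (256*m1M + m1L) * (256*m2M + m2L) = 65536*m1M*m2M + 256*m1M*m2L + 256*m1L*m2M + m1L*m2L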

Obviously four multiplications now. We start with the first and the last as the two easiest ones: their results are simply copied to the correct result register positions. The results of the two multiplications in the middle of the formula have to be added to the middle of our result registers, with possible carry overflows to the most significant byte of the result. To do that, you will see a simple trick that is easy to understand. The software:

```
;
; Test Hardware Multiplication 16 by 16
;
; Define Registers
;
.def Res1 = R2
.def Res2 = R3
.def Res3 = R4
.def Res4 = R5
.def m1L = R16
.def m1M = R17
.def m2L = R18
.def m2M = R19
.def tmp = R20
;
; Load input values
;
.equ m1 = 10000
.equ m2 = 25000
;
ldi m1M,HIGH(m1)
ldi m1L,LOW(m1)
ldi m2M,HIGH(m2)
ldi m2L,LOW(m2)
;
; Multiply
;
clr tmp ; clear for carry operations
mul m1M,m2M ; Multiply MSBs
mov Res3,R0 ; copy to MSW Result
mov Res4,R1
mul m1L,m2L ; Multiply LSBs
mov Res1,R0 ; copy to LSW Result
mov Res2,R1
mul m1M,m2L ; Multiply 1M with 2L
add Res2,R0 ; Add to Result
adc Res3,R1
adc Res4,tmp ; add carry
mul m1L,m2M ; Multiply 1L with 2M
add Res2,R0 ; Add to Result
adc Res3,R1
adc Res4,tmp
;
; Multiplication done
;
```

Simulation shows the following steps. Loading the two constants 10000 (hex 2710) and 25000 (hex 61A8) to the registers in the upper register space ...

Multiplying the two MSBs (hex 27 and 61) and copying the result in R1:R0 to the two upper result registers R5:R4 ...

Multiplying the two LSBs (hex 10 and A8) and copying the result in R1:R0 to the two lower result registers R3:R2 ...

Multiplying the MSB of m1 with the LSB of m2 and adding the result in R1:R0 to the result register's two middle bytes, no carry occurred ...

Multiplying the LSB of m1 with the MSB of m2 and adding the result in R1:R0 to the result register's two middle bytes, no carry occurred. The result is hex 0EE6B280, which is 250000000 and obviously correct ...

The multiplication needed 19 clock cycles, which is very much faster than software multiplication. Another advantage here: the required time is ALWAYS exactly 19 cycles. It depends neither on the input numbers (as is the case with software multiplication) nor on carry occurrences (thanks to our small trick of adding zero with carry). So you can rely on this ...
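The two pairs of mov instructions can be replaced by one movw each, which copies a register pair in a single cycle. The following is only a sketch, assuming that the target device supports movw (check the instruction set summary in its data sheet) and that the assembler accepts the pair notation R5:R4. movw needs pairs starting at an even register, which holds for Res2:Res1 (R3:R2) and Res4:Res3 (R5:R4) as defined above:

```
;
; Multiply 16 by 16, copying the products with movw (sketch)
;
clr tmp ; clear for carry operations
mul m1M,m2M ; Multiply MSBs
movw R5:R4,R1:R0 ; copy to Res4:Res3 (MSW Result)
mul m1L,m2L ; Multiply LSBs
movw R3:R2,R1:R0 ; copy to Res2:Res1 (LSW Result)
mul m1M,m2L ; Multiply 1M with 2L
add Res2,R0 ; Add to Result
adc Res3,R1
adc Res4,tmp ; add carry
mul m1L,m2M ; Multiply 1L with 2M
add Res2,R0 ; Add to Result
adc Res3,R1
adc Res4,tmp
```

Counted by hand, this version needs 17 instead of 19 clock cycles, and the execution time still does not depend on the input numbers.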


```
; Hardware Multiplication 16 by 24 bit
.include "m8def.inc"
;
; Register definitions
.def a1 = R2 ; define 16-bit register
.def a2 = R3
.def b1 = R4 ; define 24-bit register
.def b2 = R5
.def b3 = R6
.def e1 = R7 ; define 40-bit result register
.def e2 = R8
.def e3 = R9
.def e4 = R10
.def e5 = R11
.def c0 = R12 ; help register for adding
.def rl = R16 ; load register
;
; Load constants
.equ a = 10000 ; factor a, hex 2710
.equ b = 1000000 ; factor b, hex 0F4240
ldi rl,BYTE1(a) ; load a
mov a1,rl
ldi rl,BYTE2(a)
mov a2,rl
ldi rl,BYTE1(b) ; load b
mov b1,rl
ldi rl,BYTE2(b)
mov b2,rl
ldi rl,BYTE3(b)
mov b3,rl
;
; Clear registers
clr e1 ; clear result registers
clr e2
clr e3
clr e4
clr e5
clr c0 ; clear help register
;
; Multiply
mul a2,b3 ; term 1
add e4,R0 ; add to result
adc e5,R1
mul a2,b2 ; term 2
add e3,R0
adc e4,R1
adc e5,c0 ; (add possible carry)
mul a2,b1 ; term 3
add e2,R0
adc e3,R1
adc e4,c0
adc e5,c0
mul a1,b3 ; term 4
add e3,R0
adc e4,R1
adc e5,c0
mul a1,b2 ; term 5
add e2,R0
adc e3,R1
adc e4,c0
adc e5,c0
mul a1,b1 ; term 6
add e1,R0
adc e2,R1
adc e3,c0
adc e4,c0
adc e5,c0
;
; done.
nop
; Result should be hex 02540BE400
```

The complete execution requires
- 10 clock cycles for loading the constants,
- 6 clock cycles for clearing registers, and
- 33 clock cycles for multiplication.

©2008 by http://www.avr-asm-tutorial.net