Assembler introduction => Div by 10
(Diese Seite in Deutsch:
Beginner's introduction to AVR assembler language
Dividing binaries by 10 in assembler language
Very often needed: dividing a binary by 10. Dividing by 8 is simple (just three right shifts),
but how to 10?
Now: there are two methods:
- One can simply subtract ten from the number, and counts that until a carry
happens. With small numbers up to 100 this is manageable, but with large numbers
(16 bit or even larger) this a method to send the controller into useless and
endless loops. Do that only if your controller has nothing else to do.
- With the second method it is irrelevant how large the numbers are: 16 bit
are only twice as time-consuming as 8 bits are, not 256 times like in the first
case. Also, the calculation lasts always the same time and it doesn't matter
how large or small the number is. You'll get this exclusively here, designed
by the website's owner personally.
Classical: Dividing by 10 with a counting loop
To do this, you'll need an additional counter register, if your number is 8 bit.
This register is set to 0xFF with the instruction ldi rCount,0xFF or
with ser rCount or, if it is R0 to R15, with clr rCount
and dec rCount. In a loop this is counted up with
inc rCount. Then the number is subtracted with 10,
subi rNmbr,10. If after that the carry flag is clear, the loop is
restarted and cotinues counting and subtracting. If not, the result of the division
by ten is in the counter register.
Each run through the loop needs four clock cycles. The last, extra cycle, needs
only three. So the total number of cycles is
ncycles = 4 * Number / 10 + 3
With 255 these are 103 clock cycles.
With 16 bit wide numbers you'll need a 16 bit number storage and a 16 bit counter.
The counter can be any two registers. If you'll select R25:R24 or XH:XL or YH:YL
or ZH:ZL for those, you can count the counter up with ADIW R24,1
and you can subtract with SBIW XL,10).If you need all your 16-bit
pairs for other purposes, you can either push and pop them or you can use other
registers from R16 upward with subi rNmbrL,10 for the LSB and
sbci rNmbrH,0 for the MSB.
With ADIW and SBIW the loop lasts six clock cycles per execution, the last extra
loop needs five. For dividing 65,535 by 10 you'll need
ncycles = 6 * 65535 / 10 + 5 = 39,323 clock cycles
and your controller needs several milliseconds to do that. With 24 bit numbers
you'll be completely lost in loops.
So there is a need for a different and faster approach.
Also classical: binary division
The classical division by 10 works as follows. The registers to be divided (here
16 bit wide) are associated with an addiditional byte, here rN2.
Division by 10 starts with loading the number to be divided into
rN1:rN0 and clearing rN2. Then bit 15 is shifted into
rN2 by shifting and rotating these three bytes to the left. The
additional byte is then subtracted by 10. If that leads to a set carry bit,
subtraction is undone by subtracting minus 10 and a zero results (carry is cleared).
If the carry was not set, because the additional byte was equal or larger than 10,
subtraction is not undone and the carry flag is set to one.
Now two additional result registers rR1:rR0 come into play. For
16-bit-numbers two additional bytes are required. Those are cleared at the beginning,
but their least significant bit is set to one (we will see later on why this is
The determined carry from above is now rotated into those two registers from the right
side to the left. If the most significant bit that is rotated to the carry is clear,
the whole is repeated until all bits in the number have been shifted into the
rN2 register, have been subtracted by 10 and their result been
shifted into the result registers. That rotating now uses the carry bit to determine
if additional bits have to be treated: if a one (the one that was put into the least
significant bit at start-up) rotates out of the result registers the division is done.
Register rN2 is then used for rounding. If it is five or higher, a
one is added to the result (in 16 bit mode). This result is then moved to
rN1:rN0. If you need the added three registers for other purposes,
too, you'll have to save them at the beginning on the stack and to restore those at
the end now.
The source code for this is here. The results for those
numbers show that rounding is exact. The execution times are depending from the
numbers of ones in the result, because those process a little bit faster (no undo
necessary, one jump less). But all are in the range between 204 and 218 clock
cycles. This is more than the first method, if eight bits only are considered, but
extremely faster than the respective 16-bit-method with repeated subtraction shown
Dividing by 10 after Schmidt
Even though binary division is simple, my own new method is the fastest. It is
based on simple CPU operations such as right shifting and subtraction (AVR: LSR,
ROR, SUB and SBC) only, does not use multiplication and so can be performed on
The first approach is to divide the number by 8. If you divide the number by eight
instead of ten, you'll already have a good first neighbor to the correct result.
So e.g. 100 / 8 = 12 (more exact: 12.5), already near to the result of 10.
Which further dividers (by 16, by 32, by 64, etc.) have to be subtracted from
that to get closer to the result?
We see on that row, that we are already at ten if we subtract Number / 64. If the
number would be 255, we would need to additionally subtract Number / 128 to be
at the correct result. Larger numbers need more right shifts and subtractions.
Interesting is the row of numbers in column 3 of the table. These say if the
number divided by the two's power has to be subtracted (1) or not (0). Starting
from 16 two divided numbers are not subtracted (16 and 32), but the two following
(64 and 128) have to be subtracted. This 0011 row also appears at the
higher powers of two. Even if we go far beyond the 8-bit horizon we'll see this
To prove that here is the same for 55,000. Now we'll have to divide the number
eight times more often, in total 23 times.
The same 0011 row appears over the subtracted dividings. I leave it to the
mathematicians to find out why this row re-appears over and over. For me it is
sufficient that it works correct. And it makes a perfect, simple and aesthetic
algorithm, perfect for an AVR.
To divide the number by the powers of two needs an extra register at 8 bit and
two additional registers at 16 bit. Your divider register adds further two resp.
four registers to that. In total you'll need three times the number of registers
that your number has.
Because dividing the number by additional two's and subtracting are always the same,
execution times are also the same, no matter how small or large a number is. If
you use a loop instead of divisioning all by single instructions, your dependancy
on number sizes increases a little bit.
With this assembler source code I have made the three
diffent versions for 8 as well as for 16 bits executable and tested all in the
avr_sim simulator. Here are the results:
When simulating I have saved all additional registers on the stack and restored
their content after dividing. This is for clear comparison only, you do not need
that if you have plenty registers available.
Correct rounding with Schmidt's method
One issue in dividing is correct rounding: if the rest of the division is equal
or larger than 0.5 the result should be rounded up.
A simple and cheap method would be to add 5 to the number. As this would lead to
an overflow if the 16-bit number is 65,531 or larger, the 5 can only be added after
the number has already been divided by 8. Therefore the adder at this point is
5 / 8, which is hexadecimal 0x0000A000 and 0xA0 is added to byte 1 of
the divided result.
By experiments I found that up to an adder between 5.0 and 5.9988 the rounding
works correct. A value of 5.5 gives the best distance to the limits, which is
hexadecimal 0x0000AFFB. This is also applied in the source code.
Here you'll find the source code for this method
for 16 bit numbers. Because the row 0011 appears over and over, I've
made a loop for that. The loop ends, if nothing is left to subtract. Smaller
numbers reach that earlier (after four or five loop executions, large numbers
need up to eight executions und all divider/subtractor registers are empty.
The code consists of only 54 instruction words and is very compact. The reason
why I decided to check that all three lower registers are empty comes from
the fact that in very rare cases the two least significant registers can be
empty, while the third still isn't empty. That can happen with very special
rounding adders. The priorization of the three registers is oriented on their
Even more rounding
To do rounding even more exact, the repeated division by 2 has to consider
transfers to the carry flag following the ror rN0 instruction:
This should round the result up, too. As this rounding up of the last byte
can lead to additional actions, if it leads to zero, the next upper bytes
can be affected from that, too. Practice has shown that this rounding up can
lead to a slight overrun of the subtraction, so that the result is rounded
up from 6 on upward. To correct this, 6 / 8 or hexadecimal 0xC0
is added instead of 5+.
The source code for this is here. On the top you can
add the number to be divided as constant, the result comes in register pair R17:R16.
The code has been optimized in that it uses the loop, just like in version 1.
Note that the relative branching at the end of the loop is already at its limit,
so do not add more than a few words in between the loop.
The basic decision to add the source code for dividing by four without subtraction
and by another four with subtraction the consequence that smaller numbers need less
execution clock cycles. This can be seen from the table: the shortest execution time
is 182 clock cycles for dividing zero, the longest are 364 clock cycles for
dividing 65,535. Each loop repetition adds roughly 47 cycles.
What can also be seen from the table: rounding is absolutely correct, no matter how
small or large the numbers are!
Due to the optimization the source code is only 90 words long, slightly longer
than the previous version, of course. That fits also into very small flash memory.
Note that saving and re-storing needed registers is not included here. If you need
it, add some pushes on top and the same number of pops at the end (and, of course,
initiation of the stack on top).
Note that execution times now are longer than for the respective binary division
and with the previous version 1. So there is a lot of optimization potential.
- For numbers smaller than 100 and 8 bit wide only the repeated subtraction of 10
is the fastest.
- In all other cases my new method is the fastest, binary division is a little
bit slower than my method.
To the page top
©2021 by http://www.avr-asm-tutorial.net