Path: Home => AVR-EN => Assembler introduction => Div by 10    (Diese Seite in Deutsch: Flag DE) Logo

Beginner's introduction to AVR assembler language

Dividing binaries by 10 in assembler language

Very often needed: dividing a binary by 10. Dividing by 8 is simple (just three right shifts), but how to 10?

Now: there are two methods:
  1. One can simply subtract ten from the number, and counts that until a carry happens. With small numbers up to 100 this is manageable, but with large numbers (16 bit or even larger) this a method to send the controller into useless and endless loops. Do that only if your controller has nothing else to do.
  2. With the second method it is irrelevant how large the numbers are: 16 bit are only twice as time-consuming as 8 bits are, not 256 times like in the first case. Also, the calculation lasts always the same time and it doesn't matter how large or small the number is. You'll get this exclusively here, designed by the website's owner personally.

Classical: Dividing by 10 with a counting loop

To do this, you'll need an additional counter register, if your number is 8 bit. This register is set to 0xFF with the instruction ldi rCount,0xFF or with ser rCount or, if it is R0 to R15, with clr rCount and dec rCount. In a loop this is counted up with inc rCount. Then the number is subtracted with 10, subi rNmbr,10. If after that the carry flag is clear, the loop is restarted and cotinues counting and subtracting. If not, the result of the division by ten is in the counter register.

Each run through the loop needs four clock cycles. The last, extra cycle, needs only three. So the total number of cycles is
ncycles = 4 * Number / 10 + 3

With 255 these are 103 clock cycles.

With 16 bit wide numbers you'll need a 16 bit number storage and a 16 bit counter. The counter can be any two registers. If you'll select R25:R24 or XH:XL or YH:YL or ZH:ZL for those, you can count the counter up with ADIW R24,1 and you can subtract with SBIW XL,10).If you need all your 16-bit pairs for other purposes, you can either push and pop them or you can use other registers from R16 upward with subi rNmbrL,10 for the LSB and sbci rNmbrH,0 for the MSB.

With ADIW and SBIW the loop lasts six clock cycles per execution, the last extra loop needs five. For dividing 65,535 by 10 you'll need
ncycles = 6 * 65535 / 10 + 5 = 39,323 clock cycles

and your controller needs several milliseconds to do that. With 24 bit numbers you'll be completely lost in loops.

So there is a need for a different and faster approach.

Also classical: binary division

Registers for classical division The classical division by 10 works as follows. The registers to be divided (here 16 bit wide) are associated with an addiditional byte, here rN2. Division by 10 starts with loading the number to be divided into rN1:rN0 and clearing rN2. Then bit 15 is shifted into rN2 by shifting and rotating these three bytes to the left. The additional byte is then subtracted by 10. If that leads to a set carry bit, subtraction is undone by subtracting minus 10 and a zero results (carry is cleared). If the carry was not set, because the additional byte was equal or larger than 10, subtraction is not undone and the carry flag is set to one.

Result registers for classical division Now two additional result registers rR1:rR0 come into play. For 16-bit-numbers two additional bytes are required. Those are cleared at the beginning, but their least significant bit is set to one (we will see later on why this is useful).

The determined carry from above is now rotated into those two registers from the right side to the left. If the most significant bit that is rotated to the carry is clear, the whole is repeated until all bits in the number have been shifted into the rN2 register, have been subtracted by 10 and their result been shifted into the result registers. That rotating now uses the carry bit to determine if additional bits have to be treated: if a one (the one that was put into the least significant bit at start-up) rotates out of the result registers the division is done.

Register rN2 is then used for rounding. If it is five or higher, a one is added to the result (in 16 bit mode). This result is then moved to rN1:rN0. If you need the added three registers for other purposes, too, you'll have to save them at the beginning on the stack and to restore those at the end now.

Division by 10 - the classical division method The source code for this is here. The results for those numbers show that rounding is exact. The execution times are depending from the numbers of ones in the result, because those process a little bit faster (no undo necessary, one jump less). But all are in the range between 204 and 218 clock cycles. This is more than the first method, if eight bits only are considered, but extremely faster than the respective 16-bit-method with repeated subtraction shown above.

Dividing by 10 after Schmidt

Even though binary division is simple, my own new method is the fastest. It is based on simple CPU operations such as right shifting and subtraction (AVR: LSR, ROR, SUB and SBC) only, does not use multiplication and so can be performed on any CPU.

The first approach is to divide the number by 8. If you divide the number by eight instead of ten, you'll already have a good first neighbor to the correct result. So e.g. 100 / 8 = 12 (more exact: 12.5), already near to the result of 10. Which further dividers (by 16, by 32, by 64, etc.) have to be subtracted from that to get closer to the result?

Division of 100 by 10 We see on that row, that we are already at ten if we subtract Number / 64. If the number would be 255, we would need to additionally subtract Number / 128 to be at the correct result. Larger numbers need more right shifts and subtractions.

Interesting is the row of numbers in column 3 of the table. These say if the number divided by the two's power has to be subtracted (1) or not (0). Starting from 16 two divided numbers are not subtracted (16 and 32), but the two following (64 and 128) have to be subtracted. This 0011 row also appears at the higher powers of two. Even if we go far beyond the 8-bit horizon we'll see this row re-appearing.

Division of 55,000 by 10 To prove that here is the same for 55,000. Now we'll have to divide the number eight times more often, in total 23 times.

The same 0011 row appears over the subtracted dividings. I leave it to the mathematicians to find out why this row re-appears over and over. For me it is sufficient that it works correct. And it makes a perfect, simple and aesthetic algorithm, perfect for an AVR.

To divide the number by the powers of two needs an extra register at 8 bit and two additional registers at 16 bit. Your divider register adds further two resp. four registers to that. In total you'll need three times the number of registers that your number has.

Because dividing the number by additional two's and subtracting are always the same, execution times are also the same, no matter how small or large a number is. If you use a loop instead of divisioning all by single instructions, your dependancy on number sizes increases a little bit.

Execution times

With this assembler source code I have made the three diffent versions for 8 as well as for 16 bits executable and tested all in the avr_sim simulator. Here are the results:

BitsMethodNumberTime (cycles)
8Repeated subtraction10056
Binary division10099
Schmidtdoesn't matter59
16Repeated subtraction10.0006.025
Binary division10.000220
Schmidtdoesn't matter144

When simulating I have saved all additional registers on the stack and restored their content after dividing. This is for clear comparison only, you do not need that if you have plenty registers available.

Correct rounding with Schmidt's method

One issue in dividing is correct rounding: if the rest of the division is equal or larger than 0.5 the result should be rounded up.

A simple and cheap method would be to add 5 to the number. As this would lead to an overflow if the 16-bit number is 65,531 or larger, the 5 can only be added after the number has already been divided by 8. Therefore the adder at this point is 5 / 8, which is hexadecimal 0x0000A000 and 0xA0 is added to byte 1 of the divided result.

By experiments I found that up to an adder between 5.0 and 5.9988 the rounding works correct. A value of 5.5 gives the best distance to the limits, which is hexadecimal 0x0000AFFB. This is also applied in the source code.

Here you'll find the source code for this method for 16 bit numbers. Because the row 0011 appears over and over, I've made a loop for that. The loop ends, if nothing is left to subtract. Smaller numbers reach that earlier (after four or five loop executions, large numbers need up to eight executions und all divider/subtractor registers are empty.

The code consists of only 54 instruction words and is very compact. The reason why I decided to check that all three lower registers are empty comes from the fact that in very rare cases the two least significant registers can be empty, while the third still isn't empty. That can happen with very special rounding adders. The priorization of the three registers is oriented on their relevance.

Even more rounding

To do rounding even more exact, the repeated division by 2 has to consider transfers to the carry flag following the ror rN0 instruction: This should round the result up, too. As this rounding up of the last byte can lead to additional actions, if it leads to zero, the next upper bytes can be affected from that, too. Practice has shown that this rounding up can lead to a slight overrun of the subtraction, so that the result is rounded up from 6 on upward. To correct this, 6 / 8 or hexadecimal 0xC0 is added instead of 5+.

Division of 16 bit numbers with correct rounding The source code for this is here. On the top you can add the number to be divided as constant, the result comes in register pair R17:R16.

The code has been optimized in that it uses the loop, just like in version 1. Note that the relative branching at the end of the loop is already at its limit, so do not add more than a few words in between the loop.

The basic decision to add the source code for dividing by four without subtraction and by another four with subtraction the consequence that smaller numbers need less execution clock cycles. This can be seen from the table: the shortest execution time is 182 clock cycles for dividing zero, the longest are 364 clock cycles for dividing 65,535. Each loop repetition adds roughly 47 cycles.

What can also be seen from the table: rounding is absolutely correct, no matter how small or large the numbers are!

Due to the optimization the source code is only 90 words long, slightly longer than the previous version, of course. That fits also into very small flash memory. Note that saving and re-storing needed registers is not included here. If you need it, add some pushes on top and the same number of pops at the end (and, of course, initiation of the stack on top).

Note that execution times now are longer than for the respective binary division and with the previous version 1. So there is a lot of optimization potential.

  1. For numbers smaller than 100 and 8 bit wide only the repeated subtraction of 10 is the fastest.
  2. In all other cases my new method is the fastest, binary division is a little bit slower than my method.

To the page top

©2021 by