﻿ Dividing binaries by N in AVR assembler Path: Home => AVR-EN => Assembler introduction => Div by N    (Diese Seite in Deutsch: )

# Beginner's introduction to AVR assembler language

## Dividing binaries by N in assembler language

Dividing by a constant can be performed by other methods than classical dividing. As already shown here for dividing by 10, one can use successive approximation for that, too. This involves repeated division by two and subtractions, something that controllers are best in.

### How it works

The Schmidt-method works as follows:
1. The number to be divided is multiplied by 256 or 65,536 and stored in the upper one or two bytes of the result. The additionally needed lower 8 or 16 bits are cleared.
2. These two (8-bit numbers) or four (16-bit numbers) bytes are copied to additional two or four bytes.
3. Now the copy of the number is repeatedly divided by two, until nothing is left over. Following each division by 2 the divided number is either subtracted from the result or not. Depending from the subtraction pattern applied, the divisor value is selected. As an example: one with subtraction and one without subtraction divides by three.
4. If the copied and divided number is empty, the result is in the two or four registers: if an 8-bit number was divided, the MSB of the result is the division result, if 16-bit were treated, the upper two of the four-byte result are the division result.
5. If the next lower byte is larger or equal 0x80, the result has to be rounded up by adding one.
If the number, divided by 2, is to be subtracted or not can be seen from a bit pattern. This bit pattern is generated using the spreadsheet calculation here (LibreOffice calc file), in the sheet "8bit-row".

The number, here 255, is divided by the divisor, which yields the target value.

Now the number we started with is divided by two, four, eight and sixteen, etc. in the first and second row. The third row determines if the divided number has to be subtracted: if the subtracted value is smaller than the target value, then this is not subtracted (0). If the subtracted value is larger or equal the target value, then subtraction has to be done. The fourth column holds the result. With each further subtraction the result value comes nearer to the target value (column delta).

If the lower byte of the result is equal or larger than 0.5, the result is to be rounded up.

The row of ones and zeros in the column 4 is always the same, no matter with which number we started with. It is a specific pattern for this divider. The start number must not be too small, otherwise you loose some lower bits at the end. 128 is in any case sufficient. As can be seen, the bit pattern for three is simply 10 and repeats over and over. It's period length is two. Not all bit pattern have such simple periods, as we will see later on.

The bit pattern can simply be applied in a very simple assembler source code. Only the two- or four-byte bit pattern is needed, together with two or four additional register bytes. The source code demonstrating this is here. Just add the bit pattern on top of the code, and it divides by another divider. Just copy the hex text from the LibreOffice sheet and run the program in a simulator, such as avr_sim.

After lots of steps, shifting and subtracting, the result is in rRH (R17). It should be correct. The time needed for that, 165 clock cycles, shows that the division is not very fast. But we'll see later on how we can accelerate that.

This here shows selected rows for dividing 8-bit numbers. Given are
1. the divider,
2. the bit pattern in binary and hexadecimal format,
3. the number of clock cycles when dividing 255, plus
4. the number of clock cycles that simple repeated subtraction of the divisor from the number would need.
You see that this software needs between 138 and 177 clock cycles, depending from the number of one's in the bit pattern and if and how often the divisor and the result needs rounding.

As additional example the source code for dividing by 1.1 is given here. The program assembles to 20 words flash code and needs 96 clock cycles for dividing 255 by 1.1.

Compared to the method with repetitive subtraction of the divider my method is faster for all dividers equal or below six. Only if accelerated by first dividing by a larger divider and then shifting the result right (/2, /4, /8, /16) plus rounding the repetitive subtraction method is faster.

The Schmidt-method is only attractive for cases where non-integer dividers have to be used or when no acceleration is possible for the repetitive subtraction method.

Or: if we accelerate the Schmidt-method.

### Acceleration of the Schmidt-method

We'll get it much faster with some optimization. And that goes as follows.

If the number to be divided is small, its division by two reaches zero after a few cycles. If the divided number is zero, we can end earlier and do not have to divide 16 times and subtract the zeros from the result. And further: if the upper byte is already zero and the lower byte of the subtractor is smaller than the lower byte of the result, any further subtraction cannot alter the MSB any more. This saves you a lot of dividing and subtracting without any effect on the result.

But this has the disadvantage that rounding the result is not possible at the early end, because the accelerated break leaves results with a too high LSB. So, rounding is not done at the end but from the early beginning on: we add little more than half the divisor to the number. That leaves an irrelevant LSB at the end, but a correct up-rounded MSB.

That adding the round-up from the beginning has the disadvantage that not the whole 8- or 16-bit range can be used. Depending from the divider, the upper part of the range gets unusable. I gues you can live with that in practice.

The optimized dividing can be tested in the spreadsheet here, in the sheet "8bit-optimized". Dividing by 3 means adding roughly 1.5 to the number, so the range of possible numbers does not include 255. So we start with 254. The exact adder can be copied from the spreadsheet directly into the source code file.

Now it can be seen in the first column that the two criteria for the break already stops the execution after 10 cycles, we don't have to do all 16 (even though we are starting with a high number). If the number is smaller, then the break appears even earlier.

As already mentioned above, the bit pattern of a three, as divider, repeats already after two divisions. Therefore we really do not need the whole pattern: 14 out of 16 bits are useless information. So we can skip the two registers for the bit pattern and the LSR/ROR instructions, too.

This is the algorithm that the source code div3_8.asm here demonstrates. With 19 instruction words and in ultrafast 56 clock cycles the whole procedure is done. And: with 254 as divident.

The picture to the right illustrates the steps that the software performs. When dividing 100, one further clock cycle is necessary. The
3. copying needs only two cycles,
4. twi right rotations and subtraction needs two clock cycles each,
5. a further right rotation without subtraction needs a little bit longer, because of the included checks if the MSB is empty and the LSB is smaller than the LSB of the result,
6. further executions after these conditions are met can be skipped.
This shortens the whole program execution by 19 clock cycles. This justifies the additional testing of the two conditions. The savings involve:
• three subtractions, and
• five divisions by 2, plus
• two further condition tests.
If we start with 50 instead of 100, we need only 47 clock cycles, at ten only 38. This is unbeatably fast and cannot be reached by tricky repeated subtraction any more.

The drawing is available in the graphic file here in LibreOffice draw format.

Promised too much? Nope.

Here are some further source codes with fixed dividers for your own experiments.
Divide byperiodsBinary bit patternSource codeCyclesWords
1.520b1010101010101010div1_5_85215
2.540b1001100110011001div2_5_85625
61+30b0100100100100101div6_84923
730b1011011011011011div7_85025

Much success with the new and ultrafast 8-bit division codes.

### 16-bit dividing after Schmidt

16-bit dividing after Schmidt goes like 8-bit, but the numbers get a little bit larger. Here an excerpt from the spreadsheet "16bit" in the LibreOffice-Calc file here: on the left side the bit pattern is determined for a given divider, on the right side the number 65,533 is divided using that bit pattern.

The bit pattern now has 32 bits. The start value for that determination is a large number, to avoid any rounding errors in the last bits. Following the last division by two, / 4.294.967.296, always zero results. Applying the same formula as in the cases above would lead to always one, the first binary digit of the bit pattern would be one. Because it doesn't matter if you subtract a zero from the result or not, this highest bit is insignificant. In later tables it is set to whatever value that fits best to the period length and pattern.

When dividing on the right side of the spreadsheet, the bit pattern on the left is applied. In a first step the rounding adder is determined. For numbers smaller than 10 this is the 0.51-fold of the divider, here that is 1.52999 or hexadecimal 0x000187AE. Just copy the hexadecimal formatted number to the source code's rounding adder cAdd. If the divider is larger than 10, cAdd is the 0.501-fold, then the 0.5001-fold, etc.

Those adders, and all result numbers, are down-rounded binaries, converted to decimals. When dividing by two and during subtraction no rounding up of the result is applied, to avoid these unnecessary rounding instructions. It works as it is, so rounding is unnecessary.

The following table lists the bit patterns and the rounding adders for the dividers 1.1, 1.5 and 2.5 plus those for 3 to 30 (except those for 4, 8 and 16). Those who need other dividers, such as 3.141592654, can use the spreadsheet here to determine the bit pattern and the rounding adder.

Divide
by
Bit pattern: row of binary dividers to be subtracted
0=no subtraction, 1=subtract
Rounding
binaryhexPeriod-
length
decimalhex
1.10b010011101000101110100010111010000x4E8BA2E8100.5610050x00008F9E
1.50b101010101010101010101010101010100xAAAAAAAA20.7649990x0000C3D7
2.50b100110011001100110011001100110010x9999999941.2749940x00014666
30b010101010101010101010101010101010x5555555521.5299990x000187AE
50b001100110011001100110011001100110x3333333342.5500030x00028CCD
70b110110110110110110110110110110110xDB6DB6DB33.5700070x000391EC
90b110001110001110001110001110001110xC71C71C764.5899960x0004970A
100b011001100110011001100110011001110x666666671 + 45.0099950x0005028F
110b010100010111010001011101000101110x51745D17105.5110020x000582D1
120b010101010101010101010101010101110x555555572+26.0119930x00060312
130b001101110010001101110010001101110x37237237126.5130000x00068354
140b101101101101101101101101101101110xB6DB6DB71+37.0140080x00070396
150b011101110111011101110111011101110x7777777747.5149990x000783D7
170b000011110000111100001111000011110x0F0F0F0F88.5169980x0008845A
180b010011100011100011100011100011110x4E38E38F1+69.0180050x0009049C
190b000001010011110101100001010011110x053D614F189.5189970x000984DD
200b110011001100110011001100110011110xCCCCCCCF2+410.0200040x000A051F
210b110011110011110011110011110011110xCF3CF3CF1210.5209960x000A8560
220b011000101110100010111010001011110x62E8BA2F1+1011.0220030x000B05A2
230b110010111101100101111011001011110xCBD97B2F1111.5229950x000B85E3
240b101010101010101010101010101011110xAAAAAAAF3+212.0240020x000C0625
250b001110101111000101000011101011110x3AF143AF2012.5249940x000C8666
260b001011100100011011100100011011110x2E46E46F1+1213.0260010x000D06A8
270b100001011011110100100001011011110x85BD216F1813.5269930x000D86E9
280b011011011011011011011011011011110x6DB6DB6F2+314.0280000x000E072B
290b111100101100010000110100111011110x72C434EF3214.5290070x000E876D
300b111011101110111011101110111011110xEEEEEEEF1+415.0299990x000F07AE

As you can see from the table, only 29 does not have a periodic pattern. With all others the period length is shorter than 32. In some cases, one or two steps prior need to be considered to have a period, e.g. in the cases 12, 14, 18, 22, 24, 26 and 30. Those cases have a "1 +" etc.

Again, the rounding adder is 0.501 rather than 0.51 for dividers from 10 upward.

The following table links assembler source code files for selected dividers and bit patterns, so you can use them as starting point for your own. The necessary clock cycles for dividing zero (minimum) and the largest possible (maximum) number to be divided. Also given are the number of instruction words.

DividerPeriodsSource codeClock cyclesInstruction
words
min.max.
32div3_p2_169520036
73div7_p3_1611220444
141+3div14_1p3_1616628152
54div5_p4_1610118248
div5_p4v_169921550
101+4div10_1p4_1610919056
div10_1p4v_1610520866
216div21_p6_1610819665
1110div11_p10_168822689
2932div29_p32_1644743
19 discretediv29_14_16110111
ClassicaldivN_16_821724

If two different source code files are given for the same divider, the second one has an extra break detection added in between, which causes extra clock cycles. As you see in the cases 5 and 10, such additional break detections make not much difference with small numbers, but add relevant execution time for larger numbers.

When dividing by 29, the first code linked packs the whole bit pattern 0x72C434EF into four registers and shifts those one position to the right, to determine whether to subract or not. If the lowest byte of those four registers is zero, the further execution is skipped. In that case all 32 divisions by two together with 17 subtractions are executed. Therefore this needs 447 clock cycles, the longest of all. But this is the one with the lowest number of instruction words of all Schmidt-divisions.

The second case for 29 demonstrates that there is lots of optimization potencial here. The 29 needs at maximum only 17 divisions by two, of which the last three do not need to subtract. So we are already done at 14 divisions by 2 and 10 subtractions. No need to check break conditions in between adds further execution speed. This yields the longest program with 111 instruction words, but it performs all this in very fast 110 clock cycles. And it needs these cycles independant of the number to be divided. One can see that there is much to win in the Schmidt-method with some intelligence.

### Conclusion

The table also lists the source code for the classical 16-by-8 division method. One can see, that
1. the classical method produces the shortest controller code (whoever needs to save flash uses this),
2. the execution times of the Schmidt-method are mostly shorter than with the classical method, except for large numbers and dividers of 11 and 14 (execution time in the classical style is always the same),
3. you can save lots of execution time if you fit the source code to exactly your own needs, as has been demonstrated with the division by 29.

To the page top