Binary calculations => 64bit binaries
(Diese Seite in Deutsch:
Binary calculations with 64 bit unsigned integers in AVR assembler
You need to handle large numbers, e.g. unsigned integers? This page demonstrates
the handling of 64-bit unsigned integer values with an 8-bit AVR.
An 8 bit controller can handle 8 bit numbers only? No, it can handle nearly
any length. As long there is enough memory space for storing and handling.
- The structure of 64 bit numbers in an 8 bit controller
- Converting decimal strings to binary
- Binary adding of two large numbers
- Subtracting a large number from another large number
- Multiplication of a large number by an 8 bit binary
- Listing decimals in binary
- Converting large numbers to a decimal string
- Counting long times with 64-bit counter
16 bits require two eight bit registers, e.g. the registers R0 and R1. In
assembler you have the freedom to place those two bytes to wherever you want
and in any row. Whether the most significant byte (MSB) of the two is in R0
or in R1 is up to the assembler programmer. By simple convention we write
this either as R1:R0 (if the MSB is in R1) or as R0:R1 (if
the MSB is in R0).
Extending the 16 bits to 64 bits is easy: just add more registers. The 64 bits
are e.g. stored like this: R7:R6:R5:R4:R3:R2:R1:R0. You can also use
R3:R7:R6:R0:R2:R1:R4:R3:R5, if you can remember the row. It is your
own choice to make life a little more complicated.
To write understandable code I named R7:R6:R5:R4:R3:R2:R1:R0 as N1 and
abbreviate this as R7:...:R0. If you need a second one of the same
64 bit size: R15:R14:R13:R12:R11:R10:R9:R8 would be a good choice. N2
or the abbreviation R14:...:R8 are simpler to remember.
As the AVRs have plenty registers available (not so in other controllers such
as PICs or in the 3GHz monsters in the PC), it is not necessary to store
those numbers in SRAM. If you run out of space in registers you can as well
copy those to the SRAM of the controller. If your desired location in SRAM
is sN1 (define this within the .DSEG data segment), you can perform
that with the following code:
If you want to copy this 64 bit number to a different location, e.g. sN2,
you'll have to copy all that and replace sN1 by sN2.
sts sN1,R0 ; Store register R0 on location sN1
As copying always involves 8 bytesin a row, you can choose another way of
doing that: by the use of pointers. Here we use the pointer registers
Pointing X to the register R0 is simple: just set the address to zero.
E.g. with the instructions clr XH and clr XL.
With the instruction ld R16,X+ the content of register R0 is
temporarily copied to register R16, with st Z+,R16 its content
is written to the address in SRAM that Z points to. The "+"
in X+ and Z+ means autoincrement the address: following load and store
the addresses in X and Z already point to the next register resp. SRAM
location. If X reaches 8, the copy action is over, otherwise additional
runs are done.
- Z (R31:R30) to point to the SRAM location sN1, and
- X (R27:R26) to point to the registers, starting with R0.
The source code here provides a routine called
CopyN1ToSramN1: that copies the content of the registers in N1
to a location sN1 in SRAM:
That routine is fast: at 1 MHh clock rate it consumes 68 µs
time (including an RCALL and RET instruction. And it can
be used for any SRAM location: just point Z to another location and
call CopyN1ToSram with that.
; Copy N1 to SRAM location
ldi ZH,High(sN1) ; Point Z to Sram N1, MSB
ldi ZL,Low(sN1) ; dto., LSB
clr XH ; Point X to N1, MSB
clr XL ; dto., LSB
ld rmp,X+ ; Read register
st Z+,rmp ; Write to SRAM
cpi XL,8 ; All 8 bytes copied?
Where did I get the 68 µs from? Now,
This was a short introduction to simulation. If you want to know more, consult
the handbook of avr_sim.
Such large numbers cannot be defined with an .EQU directive because the assembler
uses only 32-bit integers internally. How can we start with a number of e.g.
- install the simulator
on your computer (pre-compiled 64-bit Windows and Linux executables available,
otherwise compile your own from the source code provided using Lazarus),
- start avr_sim and open the source code 64bit.asm
by selecting Project and New from asm from the menu,
- assemble the source code with Assemble,
- then in the listing search the line rcall CopyN1ToSramN1 in the
Init code section of the listing, right-click on that line and
- then go to the next line and again set a breakpoint on that line,
- start Simulate, in the new window click on Run,
- after a while when simulation stops on the first breakpoint line, click
on Stopwatch in the Simulation status window and tag
SRAM in the Show internal hardware section,
- then let the simulator execute the subroutine by clicking Run,
check that the SRAM locations marked in yellow follow exactly the content
of R0 to R7 in the Register window,
- if the simulator stops on the next breakpoint line, read the Stopwatch's
time and you know exactly how long it lasted.
Now, the answer is simple: define a string in the source code
This places a string of ASCII characters into flash memory at location
ANumber1:, starting with a '1' as LSB and '2' as MSB, and so on.
The number string ends with a zero byte, so our number conversion routine
Dec2BinN1: recognizes the end of the number string. If the number
of characters on the .db line is odd, just add another 0 at the end (or
start the number with a blank). This routine also accepts commas, dots or
spaces as thousands separators and ignores those ASCII characters.
.db "123,456,789,012,345,678",0 ; A very large first number
Number conversion starts with clearing N1 (R7:...:R0). Pointer register
Z points to the first character at the doubled address of ANumber1:.
Doubled because the LPM instruction has to be able to read the LSB and
the MSB at that 16-bit address. The first ASCII character is read, which
is the '1' or hexadecimal 0x31. Only ASCII numbers between '0' and '9'
and separator characters are correct, other characters lead to a return
with the carry flag set, to signal an error to the calling routine.
The '1' is then converted to binary by subtracting 0x30 ('0') from it.
Now, the previous N1 has to be multiplied by 10 to add the next digit
to it. As N1 was zero, nothing will change with multiplicating by 10.
If the next following numbers are read, the multiplication by 10 goes
Now we can add the incoming new binary to N1: just add R0,R16.
To account for a carry, we load R16 with 0 (with LDI, not with CLR
because that would clear the carry flag, too) and add this zero to
all other registers too by using adc Rn,R16.
- copy the eight bytes of N1 to another location (either to
R15:...:R8 or to the SRAM),
- multiply N1 by two (left shifting R0, then rotate left R1 - with
the carry bit going into bit 0 of the next byte -, then R2,
and so on until R7),
- if the last rotate ends with a carry flag of one, an error has
occurred (overflow error, number is too large to fit into 64 bits),
- then again multiply by two, so that N1 is multiplied by four
now (check again for an overflow),
- now add the prevously stored number, by using the instruction
add R0,R8 and adc R1,R9 and so on to account for
any carry flags from the lower byte addition,
- if after the last adc R7,R15 the carry is set, signal an
- then multiply the five-fold of the previous number by two again
and yield the result of the multiplication by 10.
If the next ASCII character is 0x00, the number is on its end and
Dec2BinN1 can return with a cleared carry flag (no error condition).
To simulate the conversion of 123.456.789.012.345.678 to a 64 bit
binary just single-step through the code from the beginning. Note
that the conversion of that number consumes 2,235 Milliseconds
or needs 2,235 instructions to be executed.
Later on in the source code a smaller number is converted. The time
consumed is nearly proportional to the number of digits to be
converted, because the multiplication by 10 is the most time
All very fast for a controller handling 8 bits only.
That is a simple task: just add the LSB by add R0,R8, then
adc R1,R9 to add the carry also and continue until adc
R7,R15. If the last ADC ends with the carry flag set, an overflow
Adding takes only 16 µs and is very fast.
The source code then copies the result to
location sN3, so that the binaries N1 and N2 and the result N3 are
all listed in SRAM and can be viewed there when simulating. Place
a breakpoint there behind the rcall to stop at this point.
This is also simple: just use sub R0,R8 and sbc R1,R9
etc. to care for a carry. In that case an underflow occurs if N2
is larger than N1, as signalled by the carry flag being set.
Subtraction is performed in very fast 16 µs.
The result of the subtraction is also written to sN3 in the SRAM.
A little more sophisticated is multiplication. The first steps are to
copy N1 to N2 and to clear N1.
Then bit by bit is right-shifted out of R16 to the carry flag. If it is
a one, the number N2 is added to N1. Error checking is done after the
Then it is checked if R16 is already zero (no more to multiplicate).
If that is the case, the subroutine resturns with a clear carry flag.
If not, N2 is multiplied by two. Error checking is done after the
last ROL, signalling an overflow.
Multiplication with an 8-bit binary of 255 (all bits to be added)
needs very fast 205 µs. A hardware multiplicator would not
be very much faster than this software multiplication.
The result is also copied to sN3.
A convenient tool follows in routine ListDecimals:: it
calculates all binaries for decimals of 1, 100, 1000, etc., until
it does not fit into 64 bit any more, and lists those in SRAM.
The last decimal cannot be calculated using normal calculators
because those use signed 64 bit integers and interpret the 64th
bit as sign.
Calculation and listing needs 2.991 ms.
Now the master's discipline: converting 64 bit binaries back to
ASCII strings. Leading zeroes are blanked and thousands separators
added, so that 26 digits have to be handled.
This involves a DecimalTable:, as was generated from the
above listing. It starts with the binary of 1019, the
largest one to be handled in 64 bits, and goes down to
101. After that the table ends with a 0xFF to signal
The math of conversion of the 64 bit binary in N1 goes like this:
The last digit is converted to ASCII by adding 0x30 to it and
by writing the resulting character to the SRAM.
- the 64 bit binary representation of the decimal is read
from the flash table to N2, if the first byte is 0xFF the
end of conversion has been reached,
- it is subtracted from N1 until an underflow occurs,
- the number of successful subtractions is converted to
an ASCII digit subi R16,-'0' adds 0x30 to the digit,
- then it is checked if the T flag is set, which signals
blanking of leading zeroes, if not set the characters is
written directly to the SRAM buffer, if set it is checked
whether the digit is still a zero, if yes, a blank is
issued, if not, the T flag is cleared and the number is
written to SRAM,
- a counter is decreased (initial from two downwards,
later from three), if it reaches zero, a thousands
separator is added.
If you need the number left-adjusted just copy the X pointer
when the T flag is cleared. On the beginning set this pointer
to the last digit in SRAM to make sure that a single digit
number is covered.
Conversion to an ASCII string requires 2.682 ms for the
larger of the two numbers and 2.052 ms for the smaller
Of course, if you need extremely long times, a 64-bit-wide
counter is the correct solution. You can count any time between
some tens of microseconds and up to several hundreds of years
100 Years are
t [µs] = 100 [Years] * 365,25 [Days/Year] * 24 [Hours/Day] * 60 [Minutes/Day] * 60 [Seconds/Minute] * 1.000.000 [µs/second] = 9,467,280,000,000,000
Loop counter = t / 3 - t/3 DIV 256 - t/3 DIV 2562 - ... - t/3 DIV 2565 = 1,047,794,823,776,258 = 0x0003.B8F6.BE44.C802
and is by the 5,555-fold below the maximum number to be handled
in an eight byte wide loop counter. If only the battery is enough
longlived to work for 100 years!
more on long counting.
Handling of 64 bit unsigned integers is a rather simple task
over all. It involves some add/subtract with or without carry,
some shift/rotate instructions and lots of error detection.
Not at all operations that are beyond the intellect of a Java
or C programmer, just something he does not care about when he
Assembler provides fast (in the ms range) and efficient code
to do all those operations. No need to involve lengthy and
space consuming libraries or to change to a 16, 32, 48 or 64
One reason for that efficient code is that the AVRs have
plenty of registers and a very efficient instruction set.
Without that, the routines would be much longer and would
consume very much longer times to execute.
To the top of that page
(©)2019 by http://www.avr-asm-tutorial.net