Name                  Date         Size      #Lines  LOC
README                21-Oct-2017  9.2 KiB   282     205
add_n_sub_n.asm       21-Oct-2017  8.2 KiB   308     291
addmul_1.asm          21-Oct-2017  13 KiB    603     568
addmul_2.asm          21-Oct-2017  17.4 KiB  716     674
aors_n.asm            21-Oct-2017  20.5 KiB  853     819
aorsorrlsh1_n.asm     22-Aug-2017  1.5 KiB   49      38
aorsorrlsh2_n.asm     22-Aug-2017  1.5 KiB   49      38
aorsorrlshC_n.asm     22-Aug-2017  9.7 KiB   413     396
bdiv_dbm1c.asm        21-Oct-2017  9.1 KiB   517     483
cnd_aors_n.asm        22-Aug-2017  6.4 KiB   265     246
copyd.asm             21-Oct-2017  3.5 KiB   187     174
copyi.asm             21-Oct-2017  3.3 KiB   183     170
dive_1.asm            21-Oct-2017  6.6 KiB   237     213
divrem_1.asm          21-Oct-2017  10.6 KiB  478     458
divrem_2.asm          21-Oct-2017  6.4 KiB   281     264
gcd_11.asm            26-Sep-2020  2.6 KiB   111     95
gmp-mparam.h          16-Jan-2021  9.8 KiB   213     158
hamdist.asm           21-Oct-2017  8.2 KiB   366     333
ia64-defs.m4          21-Oct-2017  4.1 KiB   148     126
invert_limb.asm       21-Oct-2017  3 KiB     106     93
logops_n.asm          21-Oct-2017  7 KiB     293     266
lorrshift.asm         21-Oct-2017  6.9 KiB   359     324
lshiftc.asm           21-Oct-2017  8.2 KiB   464     440
mod_34lsub1.asm       21-Oct-2017  5 KiB     238     215
mode1o.asm            21-Oct-2017  11.1 KiB  343     304
mul_1.asm             21-Oct-2017  11.3 KiB  585     546
mul_2.asm             21-Oct-2017  14.9 KiB  626     588
popcount.asm          21-Oct-2017  4.4 KiB   201     172
rsh1aors_n.asm        21-Oct-2017  11.2 KiB  448     412
sec_tabselect.asm     22-Aug-2017  3.3 KiB   149     139
sqr_diag_addlsh1.asm  21-Oct-2017  4.6 KiB   157     142
submul_1.asm          21-Oct-2017  12.3 KiB  648     613

README

Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

          mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.


CHIP NOTES

The IA-64 ISA packs instructions three at a time into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' when there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
types too.

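The bundle layout can be made concrete with a small decoder.  The sketch
below assumes the standard IA-64 encoding (a 5-bit template in bits 0-4
followed by three 41-bit instruction slots, 5 + 3*41 = 128) and a bundle held
as two 64-bit halves.  The type and function names are hypothetical and are
not part of GMP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type: one IA-64 bundle as two 64-bit halves,
   lo = bits 0..63, hi = bits 64..127. */
typedef struct { uint64_t lo, hi; } ia64_bundle;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

/* Template field, bits 0..4.  It selects the unit type of each slot and
   where, if anywhere, a stop (";;") falls. */
static unsigned bundle_template(ia64_bundle b)
{
  return (unsigned) (b.lo & 0x1f);
}

/* Instruction slot i (0..2): 41 bits starting at bit 5 + 41*i. */
static uint64_t bundle_slot(ia64_bundle b, int i)
{
  assert(i >= 0 && i <= 2);
  switch (i)
    {
    case 0:  return (b.lo >> 5) & SLOT_MASK;                    /* bits 5..45   */
    case 1:  return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* bits 46..86  */
    default: return (b.hi >> 23) & SLOT_MASK;                   /* bits 87..127 */
    }
}
```

Slot 1 straddles the two halves, which is why it is assembled from the top 18
bits of lo and the bottom 23 bits of hi.
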
The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium 1 allows 4 of these 6
instructions to be integer operations, while the Itanium 2 allows all 6 to
be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instructions.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever the padding might be executed across.  That
macro suppresses any .align if the problem is detected by configure.  Lack of
alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ..."; the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and earlier versions of the HP assembler seemed to
accept it.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working off of 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

          ldf8 [up]
          xma.l
          xma.hu
          stf8  [wrp]

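As a reference for what the ldf8/xma.l/xma.hu/stf8 sequence computes, here is
a minimal C sketch.  It assumes 64-bit limbs and the GCC/Clang
unsigned __int128 extension; ref_xma_l, ref_xma_hu and ref_mul_1 are
hypothetical names for illustration, not GMP entry points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* Low and high halves of the 128-bit value a*b + c, modelling what the
   xma.l / xma.hu pair produces for unsigned 64-bit operands. */
static uint64_t ref_xma_l(uint64_t a, uint64_t b, uint64_t c)
{
  return (uint64_t) ((u128) a * b + c);
}

static uint64_t ref_xma_hu(uint64_t a, uint64_t b, uint64_t c)
{
  return (uint64_t) (((u128) a * b + c) >> 64);
}

/* {rp,n} = {up,n} * v, returning the most significant (carry-out) limb. */
static uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n,
                          uint64_t v)
{
  uint64_t cy = 0;
  assert(n >= 1);
  for (size_t i = 0; i < n; i++)
    {
      uint64_t lo = ref_xma_l(up[i], v, cy);   /* result limb */
      cy = ref_xma_hu(up[i], v, cy);           /* carry into the next limb */
      rp[i] = lo;
    }
  return cy;
}
```

The chain through cy is a serial dependency from one limb to the next.  The
blocked approach described above would run several such chains at separate
places in the operands so that their xma latencies overlap.
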
================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

          ld8  [rp]
          ldf8 [up]
          xma.l
          xma.hu
          getf.sig
          add+add+cmp+cmp
          st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 c/l, and with
the 4 cycle latency of xma, this means we need at least 3 blocks.  Using
ldfp8 we could approach 1.583 c/l.

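The following C sketch shows one plausible reading of the sequence above: the
high product half is carried through the multiply-add, and the rp[] limb and
the integer carry are added with two adds and two compares.  It assumes
64-bit limbs and the GCC/Clang unsigned __int128 extension; ref_addmul_1 is a
hypothetical name, and the mapping of C statements to instructions in the
comments is an assumption, not taken from the assembly source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* {rp,n} += {up,n} * v, returning the carry-out limb. */
static uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up, size_t n,
                             uint64_t v)
{
  uint64_t hp = 0;   /* high product half, fed back into the multiply-add */
  uint64_t cy = 0;   /* integer carry from the rp[] addition, 0 or 1 */
  assert(n >= 1);
  for (size_t i = 0; i < n; i++)
    {
      u128 p = (u128) up[i] * v + hp;   /* xma.l / xma.hu */
      uint64_t l = (uint64_t) p;        /* getf.sig */
      hp = (uint64_t) (p >> 64);
      uint64_t r = rp[i];               /* ld8 */
      uint64_t s = l + r;               /* add */
      uint64_t c1 = s < l;              /* cmp: carry out of l + r */
      uint64_t w = s + cy;              /* add */
      uint64_t c2 = w < s;              /* cmp: carry out of s + cy */
      cy = c1 | c2;                     /* at most one of the two is set */
      rp[i] = w;                        /* st8 */
    }
  return hp + cy;                       /* cannot overflow a limb */
}
```

The figures quoted above are consistent with simple issue-slot arithmetic: 10
instructions per limb at 6 per cycle is 10/6, about 1.667 c/l, and with ldfp8
loading two limbs per instruction the count drops to 9.5 per limb, about
1.583 c/l.  A 4 cycle xma latency divided by 1.667 c/l is about 2.4, hence at
least 3 independent blocks.
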
================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use on the other codes is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.

/* First load the 8 values from v */
          ldfp8               v0, v1 = [r35], 16;;
          ldfp8               v2, v3 = [r35], 16;;
          ldfp8               v4, v5 = [r35], 16;;
          ldfp8               v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb. */
          mov                 lc = un
Loop:     ldf8                u0 = [r33], 8
          ld8                 r0 = [r32]
          xma.l               lp0 = v0, u0, hp0
          xma.hu              hp0 = v0, u0, hp0
          xma.l               lp1 = v1, u0, hp1
          xma.hu              hp1 = v1, u0, hp1
          xma.l               lp2 = v2, u0, hp2
          xma.hu              hp2 = v2, u0, hp2
          xma.l               lp3 = v3, u0, hp3
          xma.hu              hp3 = v3, u0, hp3
          xma.l               lp4 = v4, u0, hp4
          xma.hu              hp4 = v4, u0, hp4
          xma.l               lp5 = v5, u0, hp5
          xma.hu              hp5 = v5, u0, hp5
          xma.l               lp6 = v6, u0, hp6
          xma.hu              hp6 = v6, u0, hp6
          xma.l               lp7 = v7, u0, hp7
          xma.hu              hp7 = v7, u0, hp7
          getf.sig            l0 = lp0
          getf.sig            l1 = lp1
          getf.sig            l2 = lp2
          getf.sig            l3 = lp3
          getf.sig            l4 = lp4
          getf.sig            l5 = lp5
          getf.sig            l6 = lp6
          add+cmp+add         xx, l0, r0
          add+cmp+add         acc0, acc1, l1
          add+cmp+add         acc1, acc2, l2
          add+cmp+add         acc2, acc3, l3
          add+cmp+add         acc3, acc4, l4
          add+cmp+add         acc4, acc5, l5
          add+cmp+add         acc5, acc6, l6
          getf.sig            acc6 = lp7
          st8                 [r32] = xx, 8
          br.cloop Loop

          49 insn at max 6 insn/cycle:            8.167 cycles/limb8
          11 memops at max 2 memops/cycle:        5.5 cycles/limb8
          16 fpops at max 2 fpops/cycle:          8 cycles/limb8
          21 intops at max 4 intops/cycle:        5.25 cycles/limb8
          11+21 memops+intops at max 4/cycle:     8 cycles/limb8

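The table above is a resource-bound estimate: each row divides an operation
count by an issue width, and the loop can run no faster than the largest
quotient.  A minimal C sketch of that calculation follows, with the counts and
widths copied from the table; the struct, function and array names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One issue resource: how many operations of this kind the loop body holds
   and how many the chip can issue per cycle. */
struct resource { const char *name; int ops; int per_cycle; };

/* Lower bound on cycles per iteration: the most heavily loaded resource. */
static double cycle_bound(const struct resource *r, size_t n)
{
  double bound = 0.0;
  assert(n >= 1);
  for (size_t i = 0; i < n; i++)
    {
      assert(r[i].per_cycle > 0);
      double c = (double) r[i].ops / r[i].per_cycle;
      if (c > bound)
        bound = c;
    }
  return bound;
}

/* Counts and issue widths copied from the table above. */
static const struct resource addmul_8_loop[] = {
  { "all insn",      49, 6 },
  { "memops",        11, 2 },
  { "fpops",         16, 2 },
  { "intops",        21, 4 },
  { "memops+intops", 32, 4 },
};
```

With these inputs the binding constraint is total issue width, 49/6 cycles
per iteration.  If "limb8" is read as eight limb products per iteration, that
is roughly 1.02 cycles per limb product.
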
================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accepts an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).

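To illustrate why a double-word shift produces one result limb per
instruction, here is a minimal C model.  It assumes 64-bit limbs; ref_shrp
and ref_lshift are hypothetical names.  In C the count is an ordinary runtime
value, whereas shrp takes it as an immediate, which is what forces one loop
per shift count in the scheme described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extract 64 bits from the 128-bit concatenation hi:lo, starting cnt bits
   above the bottom of lo.  Models shrp for 1 <= cnt <= 63. */
static uint64_t ref_shrp(uint64_t hi, uint64_t lo, unsigned cnt)
{
  assert(cnt >= 1 && cnt <= 63);
  return (hi << (64 - cnt)) | (lo >> cnt);
}

/* {rp,n} = {up,n} << cnt, returning the bits shifted out at the top.
   Walks from the high end down, so rp may overlap up when rp >= up. */
static uint64_t ref_lshift(uint64_t *rp, const uint64_t *up, size_t n,
                           unsigned cnt)
{
  assert(n >= 1 && cnt >= 1 && cnt <= 63);
  uint64_t out = up[n - 1] >> (64 - cnt);
  for (size_t i = n - 1; i > 0; i--)
    rp[i] = ref_shrp(up[i], up[i - 1], 64 - cnt);  /* left shift by cnt */
  rp[0] = up[0] << cnt;
  return out;
}
```

A left shift by cnt is expressed as a right funnel shift by 64 - cnt over two
adjacent limbs, so each output limb above the lowest costs a single
ref_shrp call.
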
================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2, but only for L1 cache
hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm
