Saturday, February 13, 2016

Using the MIPS DSP ASE with GCC

MIPS DSP ASE

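gcc compile options
  • Add "-mdsp" (24KE, 34K) or "-mdspr2" (74K, M14KE); otherwise the link fails with errors such as "undefined reference to `__ssaddhq3'"
  • Add "-mips32r2" to make the INS instruction available, which speeds up access to individual elements of SIMD variables
  • On the M14KE, DSP instructions can be assembled into microMIPS opcodes (add "-mmicromips")
C predefined macros
  • __mips_dsp or __mips_dsp_rev=1 (defined by -mdsp)
  • __mips_dspr2 or __mips_dsp_rev=2 (defined by -mdspr2)
  • __MIPSEL__ (little-endian target)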
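The macros above let one source file build both with and without the DSP ASE. The following is only a minimal sketch: the helper name sat_add_q31 is hypothetical, and the portable branch is an assumption about how one might emulate the saturating add when "-mdsp" is not given.

```c
#include <limits.h>

typedef int q31;

/* Hypothetical helper: saturating Q31 addition.
   With -mdsp or -mdspr2, GCC defines __mips_dsp and the intrinsic is used;
   otherwise a plain C fallback clamps the 64-bit sum. */
static q31 sat_add_q31(q31 a, q31 b)
{
#if defined(__mips_dsp)
    return __builtin_mips_addq_s_w(a, b);
#else
    long long sum = (long long)a + (long long)b;
    if (sum > INT_MAX) return INT_MAX;
    if (sum < INT_MIN) return INT_MIN;
    return (q31)sum;
#endif
}
```

Calling sat_add_q31(INT_MAX, 1) returns INT_MAX in the fallback branch instead of wrapping around.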
Data types and variable initialization
  • typedef short q15;
    q15 a = 0.1234 * 32768.0;
  • typedef int q31;
    q31 b = 0.2468 * 2147483648.0;
  • typedef long long a64;
  • typedef signed char v4i8 __attribute__ ((vector_size(4)));
    v4i8 a = {1, 2, 3, 4};
    v4i8 b;
    b = (v4i8) {5, 6, 7, 8};
  • typedef short v2q15 __attribute__ ((vector_size(4)));
    v2q15 a = {0x0fcb, 0x3a75};
    v2q15 b;
    b = (v2q15) {0.1234 * 32768.0, 0.4567 * 32768.0};
  • For endianness issues, see MD00485, sections 2.4 and 2.5
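Initializers such as 0.1234 * 32768.0 are truncated toward zero when converted to an integer (0.1234 becomes 4043, i.e. 0x0fcb), and a value of 1.0 does not fit. A possible set of conversion helpers that round to nearest and saturate is sketched below; the function names are hypothetical and not part of any MIPS or GCC API.

```c
typedef short q15;
typedef int   q31;

/* Hypothetical helper: real value -> Q15, rounded to nearest, saturated. */
static q15 q15_from_double(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0)  return 32767;
    if (scaled <= -32768.0) return -32768;
    return (q15)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Hypothetical helper: Q15 -> real value. */
static double q15_to_double(q15 v)
{
    return v / 32768.0;
}

/* Hypothetical helper: real value -> Q31, rounded to nearest, saturated. */
static q31 q31_from_double(double x)
{
    double scaled = x * 2147483648.0;
    if (scaled >= 2147483647.0)  return 2147483647;
    if (scaled <= -2147483648.0) return (q31)(-2147483647 - 1);
    return (q31)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}
```

With these helpers, q15_from_double(0.1234) gives 4044 rather than the truncated 4043, and q15_from_double(1.0) saturates to 32767.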
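C operators
  • For fractional data, + and - work the same as for integer data (addu, subu), but the result of * (multiplication) must be shifted to realign the binary point
  • SIMD variables support +, -, *, /, unary minus, ^, |, &, and ~, but only + and - map to single instructions (addu.qb, subu.qb, addq.ph, subq.ph); for the others GCC synthesizes a sequence of instructions.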
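Both points can be illustrated with plain C operators. This is only a sketch: q15_mul and v2q15_add are hypothetical names, the multiply truncates instead of rounding, and it relies on GCC's arithmetic right shift of negative values.

```c
typedef short q15;
typedef short v2q15 __attribute__ ((vector_size(4)));

/* Q15 * Q15 gives a Q30 product in 32 bits; shifting right by 15
   moves the binary point back to Q15. The result is truncated, and
   -1.0 * -1.0 overflows (not handled in this sketch). */
static q15 q15_mul(q15 a, q15 b)
{
    return (q15)(((int)a * (int)b) >> 15);
}

/* Element-wise addition through the C operator on a GCC vector type. */
static v2q15 v2q15_add(v2q15 a, v2q15 b)
{
    return a + b;
}
```

For example, q15_mul(16384, 16384) multiplies 0.5 by 0.5 and returns 8192, which is 0.25 in Q15.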
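Registers: three additional HI-LO pairs (four in total) and DSPControl
DSP Control Register
  • It has 6 fields:
    name    bits   mask  description
    CCOND   27:24  16    condition code
    OUFLAG  23:16  8     overflow/underflow; can only be cleared with wrdsp
                         16: HI-LO 0
                         17: HI-LO 1
                         18: HI-LO 2
                         19: HI-LO 3
                         20:
                         21:
                         22:
                         23:
    EFI     14     32    extract fail indicator
    C       13     4     carry bit, used by addsc and addwc
    SCOUNT  12:7   2     size count
    POS     5:0    1     position bits
  • Dedicated access instructions: rddsp/wrdsp
  • SCOUNT and POS are treated as global variables, so instructions or built-in functions that modify them are not optimized away; these include wrdsp, extpdp, extpdpv, and mthlip.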
  • For correctness, programmers must assume that a function call clobbers all fields of the DSP control register. That is, programmers cannot depend on the values in CCOND, OUFLAG, EFI or C across a function-call boundary. They must re-initialize the values of CCOND, OUFLAG, EFI or C before using them.
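The field layout and the rddsp/wrdsp mask values listed for each field can be written down as constants. The sketch below is an illustration only: the struct, enum, and function names are hypothetical, and the MIPS-only part uses GCC's __builtin_mips_rddsp and __builtin_mips_wrdsp built-ins, which take the mask as a constant.

```c
#include <stdint.h>

/* Mask values selecting a field in rddsp/wrdsp. */
enum {
    DSP_MASK_POS    = 1,
    DSP_MASK_SCOUNT = 2,
    DSP_MASK_C      = 4,
    DSP_MASK_OUFLAG = 8,
    DSP_MASK_CCOND  = 16,
    DSP_MASK_EFI    = 32
};

/* Hypothetical container for the decoded fields. */
struct dsp_control {
    unsigned ccond, ouflag, efi, c, scount, pos;
};

/* Split a raw DSPControl value into its six fields. */
static struct dsp_control dsp_control_decode(uint32_t v)
{
    struct dsp_control f;
    f.ccond  = (v >> 24) & 0xF;   /* bits 27:24 */
    f.ouflag = (v >> 16) & 0xFF;  /* bits 23:16 */
    f.efi    = (v >> 14) & 0x1;   /* bit 14 */
    f.c      = (v >> 13) & 0x1;   /* bit 13 */
    f.scount = (v >> 7)  & 0x3F;  /* bits 12:7 */
    f.pos    =  v        & 0x3F;  /* bits 5:0 */
    return f;
}

#if defined(__mips_dsp)
/* Only on a DSP-enabled target: read OUFLAG, then clear it with wrdsp. */
static unsigned read_and_clear_ouflag(void)
{
    uint32_t raw = (uint32_t)__builtin_mips_rddsp(DSP_MASK_OUFLAG);
    __builtin_mips_wrdsp(0, DSP_MASK_OUFLAG);
    return dsp_control_decode(raw).ouflag;
}
#endif
```

Because a function call may clobber these fields, a helper like read_and_clear_ouflag would be called right after the saturating operations whose overflow status is of interest, not after an intervening call.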
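Comparison: built-in functions (intrinsics) versus assembly-language macros
The compiler schedules built-in functions with pipeline latencies taken into account; it cannot do so for macros, which results in poorer code scheduling and more stalls.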

C built-in functions (intrinsics)
  • Q31
    • q31 __builtin_mips_addq_s_w (q31, q31);
      q31 __builtin_mips_subq_s_w (q31, q31);
    • q31 __builtin_mips_absq_s_w (q31);
    • q31 __builtin_mips_shll_s_w (q31, imm0_31);
      q31 __builtin_mips_shll_s_w (q31, i32);
    • q31 __builtin_mips_shra_r_w (q31, imm0_31);
      q31 __builtin_mips_shra_r_w (q31, i32);
    • a64 __builtin_mips_dpaq_sa_l_w (a64, q31, q31); // Q63 + Q31 * Q31 => Q63
      a64 __builtin_mips_dpsq_sa_l_w (a64, q31, q31); // Q63 - Q31 * Q31
    • q31 __builtin_mips_mulq_rs_w (q31, q31); // DSPR2
    • q31 __builtin_mips_mulq_s_w (q31, q31); // DSPR2
    • q31 __builtin_mips_addqh_w (q31, q31); // DSPR2
      q31 __builtin_mips_addqh_r_w (q31, q31); // DSPR2
    • q31 __builtin_mips_subqh_w (q31, q31); // DSPR2
      q31 __builtin_mips_subqh_r_w (q31, q31); // DSPR2
  • Q15
    • v2q15 __builtin_mips_addq_ph (v2q15, v2q15);
      v2q15 __builtin_mips_addq_s_ph (v2q15, v2q15);
      v2q15 __builtin_mips_subq_ph (v2q15, v2q15);
      v2q15 __builtin_mips_subq_s_ph (v2q15, v2q15);
    • v2q15 __builtin_mips_absq_s_ph (v2q15);
    • v2q15 __builtin_mips_shll_ph (v2q15, imm0_15);
      v2q15 __builtin_mips_shll_ph (v2q15, i32);
      v2q15 __builtin_mips_shll_s_ph (v2q15, imm0_15);
      v2q15 __builtin_mips_shll_s_ph (v2q15, i32);
    • v2q15 __builtin_mips_shra_ph (v2q15, imm0_15);
      v2q15 __builtin_mips_shra_ph (v2q15, i32);
      v2q15 __builtin_mips_shra_r_ph (v2q15, imm0_15);
      v2q15 __builtin_mips_shra_r_ph (v2q15, i32);
    • v2q15 __builtin_mips_mulq_rs_ph (v2q15, v2q15); // Q15 * Q15 => Q15
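    • a64 __builtin_mips_dpaq_s_w_ph (a64, v2q15, v2q15); // adds both products to the accumulator
      a64 __builtin_mips_dpsq_s_w_ph (a64, v2q15, v2q15); // subtracts both products from the accumulator
    • a64 __builtin_mips_mulsaq_s_w_ph (a64, v2q15, v2q15); // adds the first product and subtracts the second; endian-dependent
    • a64 __builtin_mips_maq_s_w_phl (a64, v2q15, v2q15); // accumulates only one of the two products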
      a64 __builtin_mips_maq_s_w_phr (a64, v2q15, v2q15);
      a64 __builtin_mips_maq_sa_w_phl (a64, v2q15, v2q15);
      a64 __builtin_mips_maq_sa_w_phr (a64, v2q15, v2q15);
    • q31 __builtin_mips_muleq_s_w_phl (v2q15, v2q15); // Q15 * Q15 => Q31, multiplies only one pair
      q31 __builtin_mips_muleq_s_w_phr (v2q15, v2q15);
    • Replicate a Fixed Half-word into Elements
      v2q15 __builtin_mips_repl_ph (imm_n512_511);
      v2q15 __builtin_mips_repl_ph (i32);
    • void __builtin_mips_cmp_eq_ph (v2q15, v2q15);
      void __builtin_mips_cmp_lt_ph (v2q15, v2q15);
      void __builtin_mips_cmp_le_ph (v2q15, v2q15);
    • v2q15 __builtin_mips_pick_ph (v2q15, v2q15);
    • v2q15 __builtin_mips_packrl_ph (v2q15, v2q15);
    • v2q15 __builtin_mips_mulq_s_ph (v2q15, v2q15); // DSPR2
    • v2q15 __builtin_mips_addqh_ph (v2q15, v2q15); // DSPR2
      v2q15 __builtin_mips_addqh_r_ph (v2q15, v2q15); // DSPR2
    • v2q15 __builtin_mips_subqh_ph (v2q15, v2q15); // DSPR2
      v2q15 __builtin_mips_subqh_r_ph (v2q15, v2q15); // DSPR2
    • a64 __builtin_mips_dpaqx_s_w_ph (a64, v2q15, v2q15); // DSPR2
      a64 __builtin_mips_dpaqx_sa_w_ph (a64, v2q15, v2q15); // DSPR2
      a64 __builtin_mips_dpsqx_s_w_ph (a64, v2q15, v2q15); // DSPR2
      a64 __builtin_mips_dpsqx_sa_w_ph (a64, v2q15, v2q15); // DSPR2
  • 8-bit
    • v4i8 __builtin_mips_addu_qb (v4i8, v4i8);
      v4i8 __builtin_mips_addu_s_qb (v4i8, v4i8);
      v4i8 __builtin_mips_subu_qb (v4i8, v4i8);
      v4i8 __builtin_mips_subu_s_qb (v4i8, v4i8);
    • i32 __builtin_mips_raddu_w_qb (v4i8);
    • v4i8 __builtin_mips_shll_qb (v4i8, imm0_7);
      v4i8 __builtin_mips_shll_qb (v4i8, i32);
      v4i8 __builtin_mips_shrl_qb (v4i8, imm0_7);
      v4i8 __builtin_mips_shrl_qb (v4i8, i32);
    • a64 __builtin_mips_dpau_h_qbl (a64, v4i8, v4i8);
      a64 __builtin_mips_dpau_h_qbr (a64, v4i8, v4i8);
      a64 __builtin_mips_dpsu_h_qbl (a64, v4i8, v4i8);
      a64 __builtin_mips_dpsu_h_qbr (a64, v4i8, v4i8);
    • v4i8 __builtin_mips_repl_qb (imm0_255);
      v4i8 __builtin_mips_repl_qb (i32);
    • void __builtin_mips_cmpu_eq_qb (v4i8, v4i8);
      void __builtin_mips_cmpu_lt_qb (v4i8, v4i8);
      void __builtin_mips_cmpu_le_qb (v4i8, v4i8);
      i32 __builtin_mips_cmpgu_eq_qb (v4i8, v4i8);
      i32 __builtin_mips_cmpgu_lt_qb (v4i8, v4i8);
      i32 __builtin_mips_cmpgu_le_qb (v4i8, v4i8);
      i32 __builtin_mips_cmpgdu_eq_qb (v4i8, v4i8); // DSPR2
      i32 __builtin_mips_cmpgdu_lt_qb (v4i8, v4i8); // DSPR2
      i32 __builtin_mips_cmpgdu_le_qb (v4i8, v4i8); // DSPR2
    • v4i8 __builtin_mips_pick_qb (v4i8, v4i8);
    • v4i8 __builtin_mips_absq_s_qb (v4i8); // DSPR2
    • v4i8 __builtin_mips_adduh_qb (v4i8, v4i8); // DSPR2
      v4i8 __builtin_mips_adduh_r_qb (v4i8, v4i8); // DSPR2
    • v4i8 __builtin_mips_shra_qb (v4i8, imm0_7); // DSPR2
      v4i8 __builtin_mips_shra_r_qb (v4i8, imm0_7); // DSPR2
      v4i8 __builtin_mips_shra_qb (v4i8, i32); // DSPR2
      v4i8 __builtin_mips_shra_r_qb (v4i8, i32); // DSPR2
    • v4i8 __builtin_mips_subuh_qb (v4i8, v4i8); // DSPR2
      v4i8 __builtin_mips_subuh_r_qb (v4i8, v4i8); // DSPR2
  • a64
    • i32 __builtin_mips_extr_w (a64, imm0_31);
      i32 __builtin_mips_extr_w (a64, i32);
      i32 __builtin_mips_extr_r_w (a64, imm0_31);
      i32 __builtin_mips_extr_r_w (a64, i32);
      i32 __builtin_mips_extr_rs_w (a64, imm0_31);
      i32 __builtin_mips_extr_rs_w (a64, i32);
    • i32 __builtin_mips_extr_s_h (a64, imm0_31);
      i32 __builtin_mips_extr_s_h (a64, i32);
    • i32 __builtin_mips_extp (a64, imm0_31);
      i32 __builtin_mips_extp (a64, i32);
    • i32 __builtin_mips_extpdp (a64, imm0_31);
      i32 __builtin_mips_extpdp (a64, i32);
    • a64 __builtin_mips_shilo (a64, imm_n32_31);
      a64 __builtin_mips_shilo (a64, i32);
    • a64 __builtin_mips_mthlip (a64, i32);
  • Precision Reduce/Expand
    • v4i8 __builtin_mips_precrq_qb_ph (v2q15, v2q15);
    • v4i8 __builtin_mips_precrqu_s_qb_ph (v2q15, v2q15);
    • v2q15 __builtin_mips_precequ_ph_qbl (v4i8);
      v2q15 __builtin_mips_precequ_ph_qbr (v4i8);
      v2q15 __builtin_mips_precequ_ph_qbla (v4i8);
      v2q15 __builtin_mips_precequ_ph_qbra (v4i8);
    • v2q15 __builtin_mips_preceu_ph_qbl (v4i8);
      v2q15 __builtin_mips_preceu_ph_qbr (v4i8);
      v2q15 __builtin_mips_preceu_ph_qbla (v4i8);
      v2q15 __builtin_mips_preceu_ph_qbra (v4i8);
    • v4i8 __builtin_mips_precr_qb_ph (v2i16, v2i16); // DSPR2
    • v2q15 __builtin_mips_precrq_ph_w (q31, q31);
    • v2q15 __builtin_mips_precrq_rs_ph_w (q31, q31);
    • q31 __builtin_mips_preceq_w_phl (v2q15);
      q31 __builtin_mips_preceq_w_phr (v2q15);
  • Int8 * Q15 => Q15
    • v2q15 __builtin_mips_muleu_s_ph_qbl (v4i8, v2q15);
      v2q15 __builtin_mips_muleu_s_ph_qbr (v4i8, v2q15);
  • Add and Set Carry/Add with Carry
    i32 __builtin_mips_addsc (i32, i32);
    i32 __builtin_mips_addwc (i32, i32);
  • Modular Subtraction on an Index Value
    i32 __builtin_mips_modsub (i32, i32);
  • Bit Reverse a Half-word
    i32 __builtin_mips_bitrev (i32);
  • Insert Bit Field Variable
    i32 __builtin_mips_insv (i32, i32);
  • Load Unsigned Byte/Halfword/Word Indexed
    i32 __builtin_mips_lbux (void *, i32);
    i32 __builtin_mips_lhx (void *, i32);
    i32 __builtin_mips_lwx (void *, i32);
  • Signed Multiply and Add
    a64 __builtin_mips_madd (a64, i32, i32);
  • Unsigned Multiply and Add
    a64 __builtin_mips_maddu (a64, ui32, ui32);
  • Signed Multiply and Subtract
    a64 __builtin_mips_msub (a64, i32, i32);
  • Unsigned Multiply and Subtract
    a64 __builtin_mips_msubu (a64, ui32, ui32);
  • Signed Multiply
    a64 __builtin_mips_mult (i32, i32);
  • Unsigned Multiply
    a64 __builtin_mips_multu (ui32, ui32);
  • Left Shift and Append Bits (DSPR2)
    i32 __builtin_mips_append (i32, i32, imm0_31); // DSPR2
  • Byte Align Contents from Two Registers (DSPR2)
    i32 __builtin_mips_balign (i32, i32, imm0_3); // DSPR2
  • Right Shift and Prepend Bits (DSPR2)
    i32 __builtin_mips_prepend (i32, i32, imm0_31); // DSPR2
  • 3.9, 3.10
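As one way to put the accumulator intrinsics to use, the sketch below computes a Q15 dot product with __builtin_mips_dpaq_s_w_ph. The helper names are hypothetical, and the non-MIPS branch is an assumed portable model of the instruction (doubled product, with -1.0 * -1.0 saturated to 0x7FFFFFFF) that does not model the OUFLAG side effect.

```c
#include <stddef.h>

typedef long long a64;
typedef short v2q15 __attribute__ ((vector_size(4)));

#if !defined(__mips_dsp)
/* Portable model of one Q15 x Q15 => Q31 product as formed by dpaq_s.w.ph. */
static a64 q15_mul_q31_model(short a, short b)
{
    if (a == -32768 && b == -32768)
        return 0x7FFFFFFFLL;      /* saturated result of -1.0 * -1.0 */
    return ((a64)a * b) * 2;      /* Q30 product doubled to Q31 */
}
#endif

/* Hypothetical helper: dot product of n packed Q15 pairs.
   The Q31 products are summed in a 64-bit accumulator. */
static a64 dot_q15(const v2q15 *x, const v2q15 *y, size_t n)
{
    a64 acc = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__mips_dsp)
        acc = __builtin_mips_dpaq_s_w_ph(acc, x[i], y[i]);
#else
        acc += q15_mul_q31_model(x[i][0], y[i][0]);
        acc += q15_mul_q31_model(x[i][1], y[i][1]);
#endif
    }
    return acc;
}
```

Since both products of each pair are accumulated, the result does not depend on element order and is therefore not affected by endianness, unlike mulsaq_s_w_ph. On a DSP target the accumulated value would typically be brought back to 32 bits with one of the extr intrinsics listed above.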
Other approaches:
  • Pure assembly language, or assembly embedded in C
  • Fixed-point data types (_Fract) combined with C operators
References:
  1. MD00485 (the main source)
  2. MD00086 The MIPS32® Instruction Set
  3. MD00374 The MIPS® DSP Application-Specific Extension to the MIPS32® Architecture
  4. GCC: Using Vector Instructions through Built-in Functions
Related posts:
  1. GCC Inline Assembly
