Re: [Question] Interpreting the data from a small cache-size measurement program

Board: C_and_CPP | Author: hsnuer1171 (humanforestQQ) | Time: 2019/08/24 15:07 | Score: 3 (3 push, 0 boo, 9 →)

12 comments, 6 participants; post 3 of 3 in this thread

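※ Quoting johnjohnlin ():
: A comment above mentioned the GPU black-box testing method from this paper,
: which can reveal the cache size, the number of ways, and so on (Figure 1)
: https://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
: I read it quite a while ago and my memory of it was hazy,
: so I reread it and figured I would write the reply myself.

Thanks for the reply. The paper was also very helpful.

Write-up: https://tinyurl.com/yxrczcyv

I originally assumed the 4K stride was the slowest only because of the collision misses that johnjohnlin mentioned. But I ran a few more experiments: when I reduced the array size to 16K, the 4K stride was still the slowest. At that size we only access 4 elements, which should all fit in Set 0, so why is it still the slowest?

After a lot of Googling I found the answer: it has to do with a mechanism Intel CPUs use to handle data hazards.

First, a quick review of what a data hazard is. In the example in the figure below, add r1,r2,r3 is immediately followed by sub r4,r1,r3, so the sub needs the value of r1 produced by the add. Because of pipelining, r1 has not been written back yet when sub enters the execute stage, so the result would be wrong. To solve this, CPUs use forwarding (the red line in the figure): the result of add is passed to sub right after the execute stage, so there is no need to wait for writeback.

Image source: https://webdocs.cs.ualberta.ca/~amaral/courses/429/webslides/Topic3-Pipelining/sld033.htm
Caption: Forwarding to avoid data hazard

The textbook examples only cover forwarding between registers, but in practice x64 instructions can access memory directly, for example add BYTE PTR [rdi+rcx], 10. In that case the CPU has to compute the effective address (rdi+rcx) before it knows whether the instruction has a hazard with an earlier one. In Intel's design, the memory order buffer that tracks these addresses initially compares only the low 12 bits of the address, which covers exactly 4 KB. So when we access array[4096], the CPU assumes we are accessing array[0] and tries to forward that value to the next add. Only after the address of array[4096] has been fully resolved does the CPU discover that the forwarding was wrong, and it has to load array[4096] again, which costs a 5-cycle delay.

So at the 4K stride we keep paying that 5-cycle delay. It is still small compared with the time for an L2 cache fetch (CPU : L2 Cache = 1:14), so the measured time rises only slightly.

Full text: https://software.intel.com/en-us/forums/intel-vtune-amplifier/topic/606846

When an earlier (in program order) load issued after a later (in program order) store, a potential WAR (write-after-read) hazard exists. To detect such hazards, the memory order buffer (MOB) compares the low-order 12 bits of the load and store in every potential WAR hazard. If they match, the load is reissued, penalizing performance. However, as only 12 bits are compared, a WAR hazard may be detected falsely on loads and stores whose addresses are separated by a multiple of 4096 (2^12).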
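As a rough illustration of the 12-bit comparison in the passage above, here is a minimal C sketch. It is only a software model, not how the hardware is implemented: the names PAGE_OFFSET_MASK, low12_match and is_false_4k_alias are made up for this example, and the model ignores access sizes and partial overlaps.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: the early check looks only at address bits [11:0]. */
#define PAGE_OFFSET_MASK ((uintptr_t)0xFFF)

/* True when two addresses have identical low 12 bits. */
static bool low12_match(uintptr_t a, uintptr_t b)
{
    return ((a ^ b) & PAGE_OFFSET_MASK) == 0;
}

/* True when the partial check matches even though the full addresses differ,
   i.e. the "falsely detected" case described in the quoted text. */
static bool is_false_4k_alias(uintptr_t store_addr, uintptr_t load_addr)
{
    return store_addr != load_addr && low12_match(store_addr, load_addr);
}
```

Under this model, a store to offset 0 and a load from offset 4096 count as a false alias, while offsets 0 and 64 do not, because their low 12 bits differ.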
This metric estimates the performance penalty of handling such falsely aliasing loads and stores. This occurs when a load is issued after a store and their memory addresses are offset by (4K). When this is processed in the pipeline, the issue of the load will match the previous store (the full address is not used at this point), so pipeline will try to forward the results of the store and avoid doing the load (this is store forwarding). Later on when the address of the load is fully resolved, it will not match the store, and so the load will have to be re-issued from a later point in the pipe. This has a 5-cycle penalty in the normal case, but could be worse in certain situations, like with un-aligned loads that span 2 cache lines.

--
※ Origin: 批踢踢實業坊 (ptt.cc), from: 1.171.25.61 (Taiwan)
※ Article URL: https://www.ptt.cc/bbs/C_and_CPP/M.1566630443.A.346.html
※ Edited: hsnuer1171 (1.171.25.61 Taiwan), 08/24/2019 15:10:19
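For anyone who wants to reproduce the access pattern from the post, here is a minimal C sketch of the strided read-modify-write loop. It is not the original measurement program: touch_with_stride is a hypothetical helper written for this example, and it does no timing itself. You would wrap the call in your own timer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: walk `buf` with a fixed byte stride and add 10 to every
   visited byte, the C-level shape of `add BYTE PTR [rdi+rcx], 10`.
   Returns how many elements a single pass touches (0 if passes == 0). */
static size_t touch_with_stride(volatile unsigned char *buf, size_t size,
                                size_t stride, size_t passes)
{
    assert(stride != 0);  /* a zero stride would never advance */
    size_t touched = 0;
    for (size_t p = 0; p < passes; ++p) {
        touched = 0;
        for (size_t i = 0; i < size; i += stride) {
            buf[i] += 10;  /* load, add, then store to the same address */
            ++touched;
        }
    }
    return touched;
}
```

With a 16 KiB buffer and a stride of 4096, one pass touches 4 elements (offsets 0, 4096, 8192 and 12288). Each load then follows a store whose address differs by exactly 4096, which is the situation the quoted text describes.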

Push johnjohnlin  08/24 16:34  1F
→    hsnuer1171   08/24 16:37  2F
Push sarafciel    08/26 08:47  3F