Re: [程式] 一題資料庫程式很簡單的

看板Statistics作者realtemper (一覺醒來傅鐘前)時間10年前 (2015/06/20 08:34)推噓0(0推 0噓 6→)

留言6則, 2人參與討論串3/3 (看更多)

※ 引述《cawaiilulu (across)》之銘言： (恕刪) : 沒關係這個都不用管假設就都兩位以下數字我主要是想問怎麼寫比較"快" 我嘗試寫了一段應用雙重 set 迴圈的 code 如下（及本文末，防連結失效） https://gist.github.com/anonymous/6f5b3f7f650a990bcfac 先講測試結果.... 測試條件：SAS 9.4 / win7 x64 / 2.9G 雙核心 / ramdisk 20G（SAS工作位置）以資料筆數 n=1k, m=500k 測試，耗時 real time 144s CPU time 133s 因此合理估計 n=100k, m=500k 時，將花去約13300s, 即3.7小時左右不曉得這個效率，是否符合您的期待？然後是最佳化的想法.... 首先您要求的任務，基本上時間複雜度一定是 >= O(nm) 的，這是因為您不但需要判定「較大變數個數之最大值 (絕大多數情況會是 5)」，還必須給出最大值的「發生次數」，所以每一筆 n 跟每一筆 m 都非得做過一次比較，才能取得發生次數之故。因此，能夠最佳化的部份，主要就是下面這些點了： 1.嘗試減少 I/O 次數（磁碟讀寫總是最慢的） 2.減少各種不必要的中間變數。 3.嘗試一邊讀取、一邊計算、一邊更新，僅用 n*m 次資料存取就解決問題。（i.e.就資料 I/O 的部份，把常數壓到 1）另外，這個 case 我認為 set 雙迴圈的作法一定遠遠優於 proc sql。因為這兩個表格的合併方式是不帶條件的 Cartesian join，並且用 proc sql 來處理的話，還需要製造 5 個 dummy variable，（case when then else end 的寫法，大概非這樣不可，至少我寫不出其他作法）所以一定會產生一個非常龐大的 (n*m) 中間資料集，產生大量不必要的I/O。相較之下，data step 的 set 雙迴圈，在條件判斷與資料輸出方面，就靈活許多。對 dataset b 的資料，可以輕易做到讀完即丟(見code)，除了結果之外，完全不需創造中間資料集，因此空間複雜度會永遠被壓縮在 1*1，而不是 n*m 當然如果我有想錯，也請諸位前輩不吝賜教～ m(_ _)m 附：測試程式碼 %let NA = 1000; * expect 100k; %let NB = 500000; * expect 500k; /* generate test dataset */ data a; ID = 0; do while (ID < &NA); ID + 1; V1 = ranuni(777); V2 = ranuni(777); V3 = ranuni(777); V4 = ranuni(777); V5 = ranuni(777); output; end; run; data b; ID = 0; * ID of b is of no use; do while (ID < &NB); ID + 1; V1 = ranuni(911); V2 = ranuni(911); V3 = ranuni(911); V4 = ranuni(911); V5 = ranuni(911); output; end; run; /* match */ data result(keep=ID CNT); set a; N = 0; * the current maximum value of "# of bigs"; CNT = 0; * the # of occurence of current N; i = 0; * pointer for random accessing dataset b; do while (i < NN); i+1; /* read i-th obs of dataset b */ set b(rename=(V1-V5=BV1-BV5) drop=ID) point=i nobs=NN; /* find "# of bigs" of this i-th obs */ N_this = 0; if V1 - BV1 > 0 then N_this = N_this + 1; if V2 - BV2 > 0 then N_this = N_this + 1; if V3 - BV3 > 0 then N_this = N_this + 1; if V4 - BV4 > 0 then N_this = N_this + 1; if V5 - BV5 > 0 then N_this = N_this + 1; /* take action based on the value of N_this */ * case 1: N_this < N -> discard all data of b (do nothing); if N_this < N then do; end; * case 2: N_this = N -> update CNT; else if N_this = N then CNT = CNT + 1; * case 3: N_this > N -> replace N by N_this and reset CNT value; else do; N = N_this; CNT = 1; end; /* output to see how N and CNT change gradually (should limit the value of &NA and &NB!) */ *output; end; run; -- ※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 114.39.232.118 ※ 文章網址: https://www.ptt.cc/bbs/Statistics/M.1434760473.A.B0A.html ※ 編輯: realtemper (114.39.232.118), 06/20/2015 08:42:07

→

MOONY135

06/20 11:17, , 1^F

06/20 11:17, 1^F

→

realtemper

06/20 17:28, , 2^F

06/20 17:28, 2^F

→

MOONY135

06/20 22:02, , 3^F

06/20 22:02, 3^F