[Question] How can I keep using the original data when computing clusters?

Board: Python, Author: (Beol), Time: 1 year ago (2023/04/18 14:06), edited, Push/boo score: 2 (206)
8 comments, 2 participants, latest 1 year ago, thread 1/1
Hello everyone,

I have some code that uses KNN to compute vector similarity and then clusters the results. However, I found that as soon as a point has been clustered with its nearest neighbor, the code uses the new averaged coordinates to evaluate the next-nearest neighbor. Sometimes 1 and 3 were similar to begin with, but once 1 has been clustered with 2 it is no longer similar to 3 (and the order of the data also changes the result QQ).

How should I modify the code so that every point i first finds its K nearest points using the original coordinates, and only after all the data has been processed are the clustering and coordinate averaging done in one go? QQ

The functions that might be at fault are below:

    def KNN(i, k, data_mid_point, tree):
        # Ask for k+1 neighbors because the query point itself is usually returned.
        dist, ind = tree.query(np.expand_dims(data_mid_point[i], axis=0), k=k+1)
        nearest_ids = list(ind[0])
        if i in nearest_ids:
            nearest_ids.remove(i)
        else:
            nearest_ids = nearest_ids[:-1]
        distances = []
        for j in nearest_ids:
            distance = ((data_mid_point[j][0] - data_mid_point[i][0])**2
                        + (data_mid_point[j][1] - data_mid_point[i][1])**2)**0.5
            distances.append(distance)
        print(f"The {k} nearest IDs to ID {i} are:")
        for j in range(len(nearest_ids)):
            print(f"ID: {nearest_ids[j]}, Distance: {(distances[j]/0.000009)} meters")
        return nearest_ids

    def calcClusterFlow(c, data):
        # Average of columns 0 to 3 over the cluster, weighted by column 8.
        ox = 0
        oy = 0
        dx = 0
        dy = 0
        for k in c:
            ox += data[k][0]*data[k][8]
            oy += data[k][1]*data[k][8]
            dx += data[k][2]*data[k][8]
            dy += data[k][3]*data[k][8]
        d = 0
        for k in c:
            d += data[k][8]
        ox /= d
        oy /= d
        dx /= d
        dy /= d
        return ox, oy, dx, dy

    # Compute similarity
    def flowSim(vi, vj, alpha):
        leni = math.sqrt((vi[0]**2+vi[1]**2))
        lenj = math.sqrt((vj[0]**2+vj[1]**2))
        dv = math.sqrt((vi[0] - vj[0]) ** 2 + (vi[1] - vj[1]) ** 2)
        if leni > lenj:
            return dv/(alpha*leni)
        else:
            return dv/(alpha*lenj)

    # Compute the similarity of the two clusters whose cluster IDs are ci and cj
    def clusterSim(i, j, ci, cj, data, alpha):
        # Reads columns 4 to 7 of the first member of each cluster.
        oix, oiy, dix, diy = data[ci[0]][4], data[ci[0]][5], data[ci[0]][6], data[ci[0]][7]
        ojx, ojy, djx, djy = data[cj[0]][4], data[cj[0]][5], data[cj[0]][6], data[cj[0]][7]
        vi = [dix-oix, diy-oiy]
        vj = [djx-ojx, djy-ojy]
        sim = flowSim(vi, vj, alpha)
        return sim

    # Merge two clusters
    def merge(c, ci_ID, cj_ID, l):
        # Keep the smaller cluster ID
        if ci_ID > cj_ID:
            ci_ID, cj_ID = cj_ID, ci_ID
        for l_ID in c[cj_ID]:
            l[l_ID] = ci_ID
            c[ci_ID].append(l_ID)
        c.pop(cj_ID)

The main computation is here:

    for i in tqdm(range(dataLen)):
        neighbors = KNN(i, K, data_mid_point, tree)
        for j in neighbors:
            # Skip neighbors farther away than Radius.
            if ((data_mid_point[i][0]-data_mid_point[j][0])**2
                    + (data_mid_point[i][1]-data_mid_point[j][1])**2
                    > (Radius*0.000009)**2):
                continue
            if l[i] != l[j]:
                if clusterSim(i, j, c[l[i]], c[l[j]], data, alpha) <= 1:
                    new_cluster_ID = min(l[i],l[j])
                    num_of_flow_in_cluster=0
                    merge(c, l[i], l[j], l)
                    for m in c[new_cluster_ID]:
                        num_of_flow_in_cluster+=data[m][8]
                    for m in c[new_cluster_ID]:
                        cox, coy, cdx, cdy = calcClusterFlow(c[new_cluster_ID],data)
                        # Columns 4 to 7 and 9 of every member are overwritten here.
                        data[m][4], data[m][5], data[m][6], data[m][7], data[m][9] = cox, coy, cdx, cdy, num_of_flow_in_cluster

Right now I suspect the problem is in the merge part. I asked ChatGPT, but it did not seem to understand the result I want.
Any help would be greatly appreciated QQ

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 122.117.51.82 (Taiwan)
※ Article URL: https://www.ptt.cc/bbs/Python/M.1681797969.A.54E.html

04/18 16:17, 1 year ago, 1F
Say you run KNN from point X1 and get the first-layer neighbors x_1j; you need to store the x_1j

04/18 16:17, 1 year ago, 2F
coordinates and pass them down to the second layer. So there is probably a mean somewhere; take it

04/18 16:17, 1 year ago, 3F
out, adjust things a little, and you should be fine.
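
The advice in 1F to 3F (keep the stored coordinates, drop the running mean) could be sketched as a two-pass version of the main loop. This is only a sketch under assumptions, not the poster's code: `UnionFind`, `k_nearest`, `cluster_two_pass`, and the `similar` callback are hypothetical names, points are plain (x, y) tuples, and a brute-force search stands in for the tree query. Pass 1 decides every pair from the original coordinates, pass 2 merges all accepted pairs at once, and the averaging happens only at the very end.

```python
import math
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure; merging never touches the point coordinates."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller ID as representative, like merge() in the post.
            self.parent[max(ra, rb)] = min(ra, rb)


def k_nearest(i, k, points):
    """Brute-force stand-in for the tree query: k nearest other points to i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def cluster_two_pass(points, k, similar):
    # Pass 1: decide every candidate pair from the ORIGINAL coordinates only.
    pairs = []
    for i in range(len(points)):
        for j in k_nearest(i, k, points):
            if similar(points[i], points[j]):
                pairs.append((i, j))

    # Pass 2: merge all accepted pairs at once.
    uf = UnionFind(len(points))
    for i, j in pairs:
        uf.union(i, j)

    clusters = defaultdict(list)
    for i in range(len(points)):
        clusters[uf.find(i)].append(i)

    # Only now compute the averaged coordinates of each cluster.
    centroids = {}
    for cid, members in clusters.items():
        n = len(members)
        centroids[cid] = (sum(points[m][0] for m in members) / n,
                          sum(points[m][1] for m in members) / n)
    return dict(clusters), centroids
```

Because the clusters here are the connected components of the accepted pairs, the grouping does not depend on the order of the data. The trade-off is chaining: if 1 is similar to 2 and 2 to 3, all three end up together even when 1 and 3 are not similar to each other.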

04/18 16:19, 1 year ago, 4F
Also, why are you implementing it with a queue… that seems odd.

04/18 20:40, 1 year ago, 5F
Thank you for the answer, but I don't understand it.... Right now I just can't track

04/18 20:40, 1 year ago, 6F
down where it also merges the coordinates after clustering TT
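
One possible reading of the posted code: the main loop writes the averaged values into columns 4 to 7 of every member row, and `clusterSim` reads exactly those columns, so every later comparison sees the averages rather than the raw data. Assuming columns 0 to 3 hold the original origin and destination coordinates (the posted code only reads them, never writes them), a similarity that takes those columns from rows i and j directly would stay on the original data. `pointSim` is a hypothetical name; `flowSim` follows the formula from the post.

```python
import math


def flowSim(vi, vj, alpha):
    # Same formula as in the post: vector difference relative to the longer vector.
    leni = math.sqrt(vi[0] ** 2 + vi[1] ** 2)
    lenj = math.sqrt(vj[0] ** 2 + vj[1] ** 2)
    dv = math.sqrt((vi[0] - vj[0]) ** 2 + (vi[1] - vj[1]) ** 2)
    return dv / (alpha * max(leni, lenj))


def pointSim(i, j, data, alpha):
    """Hypothetical replacement for clusterSim that ignores columns 4 to 7.

    Assumption: data[k][0:4] = (ox, oy, dx, dy) are the original coordinates
    of flow k and are never overwritten.
    """
    oix, oiy, dix, diy = data[i][0:4]
    ojx, ojy, djx, djy = data[j][0:4]
    vi = [dix - oix, diy - oiy]
    vj = [djx - ojx, djy - ojy]
    return flowSim(vi, vj, alpha)
```

As in the post, two zero-length vectors would cause a division by zero, so that case would need a guard in real data.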

04/18 20:40, 1 year ago, 7F
By the way, I am using a ball-tree.

04/19 08:03, 1 year ago, 8F
I was talking about brute force; if it is a ball-tree, I need to think about it.
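
On the brute-force versus ball-tree point: as long as the tree is built once from the original coordinates and never rebuilt from averaged ones, an exact tree query should return the same neighbors as a brute-force scan (apart from tie-breaking among equidistant points), so the choice would affect speed rather than the result. A minimal brute-force sketch for checking the tree output on a small sample follows; `brute_knn` is a hypothetical helper, and points are assumed to be (x, y) pairs.

```python
import heapq
import math


def brute_knn(i, k, points):
    """Return (distance, index) for the k points closest to points[i], excluding i."""
    candidates = (
        (math.dist(points[i], p), j)
        for j, p in enumerate(points)
        if j != i
    )
    return heapq.nsmallest(k, candidates)
```

For a spot check, the indices in `[j for _, j in brute_knn(i, K, data_mid_point)]` could be compared with what `KNN` from the post returns for the same `i`.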
Article ID (AID): #1aFZDHLE (Python)