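Overplotting is a common challenge in data visualization. When a dataset is large, or when rounding or limited measurement precision gives many observations exactly the same values, a plain scatter plot often fails to show what is going on. In Fundamentals of Data Visualization, Claus O. Wilke offers several approaches to this problem, and today I'd like to share my notes on them.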
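The most intuitive fix is to make the points partially transparent: where points overlap, they stack up and appear darker, so we can at least see differences in density.
A complementary technique is jittering: shifting each point by a small random amount along the axes so that points that were completely hidden become visible.
Jittering needs care, though: too large an offset distorts where the data actually sit and may mislead readers.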
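library(ggplot2)
# mpg ships with ggplot2; alpha adds transparency, position_jitter() adds small random offsets
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point(alpha = 0.4, position = position_jitter(width = 0.2, height = 0.2))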
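With very large datasets, transparency and jittering are often still not enough. In that case, the coordinate plane can be divided into small bins, with the number of points in each bin shown by color, as in a 2D histogram or hexagonal binning (hexbin).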
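The binning idea can be sketched in ggplot2 as follows. This is only a minimal illustration: the diamonds dataset (bundled with ggplot2) and the choice of 40 bins are my own assumptions, not something taken from the book, and the hexagonal variant additionally needs the hexbin package to be installed.

```r
library(ggplot2)

# diamonds has roughly 54,000 rows, enough for a plain scatter plot to overplot badly.
# geom_bin_2d() cuts the plane into rectangular bins and maps the count per bin to fill.
p_bins <- ggplot(diamonds, aes(carat, price)) +
  geom_bin_2d(bins = 40) +            # 40 bins per axis is an arbitrary choice
  scale_fill_viridis_c(name = "count")

# Hexagonal variant: geom_hex() relies on the 'hexbin' package,
# so it is only built when that package is available.
if (requireNamespace("hexbin", quietly = TRUE)) {
  p_hex <- ggplot(diamonds, aes(carat, price)) +
    geom_hex(bins = 40) +
    scale_fill_viridis_c(name = "count")
}

p_bins
```

A smaller number of bins gives a smoother but coarser picture, so it is worth trying a few values before settling on one.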
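Another strategy is to estimate the density distribution of the data and draw it as contour lines. This works especially well when the data vary smoothly, and helps readers quickly identify dense and sparse regions.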
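A density-contour plot along these lines might look like the sketch below. The faithful dataset (bundled with R) is my own choice for illustration, and geom_density_2d() computes its estimate through the MASS package, which I am assuming is installed, as it is in most R setups.

```r
library(ggplot2)

# faithful: eruption duration and waiting time for the Old Faithful geyser.
# The faint points keep the raw data visible; the contours mark the estimated 2D density.
p_contour <- ggplot(faithful, aes(eruptions, waiting)) +
  geom_point(alpha = 0.3) +
  geom_density_2d()

p_contour
```

Keeping the points underneath the contours is optional, but it lets readers check the smoothed estimate against the actual observations.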
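To compare groups (for example, samples of different genders), you can draw the contours in different colors. Note, however, that with too many groups the plot easily turns into a ball of yarn that is hard to read. Faceting is then the better choice.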
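The contrast between colored contours in one panel and faceting can be sketched like this. The iris dataset and its Species column stand in for the groups here; that is purely an assumption for illustration, since the text above does not name a dataset.

```r
library(ggplot2)

# One panel: each species gets its own contour color, and the contours overlap.
p_colour <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species)) +
  geom_density_2d()

# Faceted: the same contours, but each species is drawn in its own panel.
p_facets <- p_colour +
  facet_wrap(vars(Species)) +
  theme(legend.position = "none")   # the panel titles already identify each group

p_facets
```

With only three groups the single-panel version is still readable; the faceted version pays off as the number of groups grows.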
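What struck me most while working with this kind of data is that there is no perfect solution to overplotting; the choice has to follow from the size of the data and the purpose of the analysis.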
Overplotting is a frequent challenge in data visualization, particularly when datasets are large or when values are recorded with limited precision. In such cases, multiple observations share identical positions on the plot, making important details invisible. Several strategies can be applied to address this issue. For small to moderate datasets, applying partial transparency or adding jitter can help reveal hidden points by varying intensity or slightly displacing positions. For larger datasets, binning approaches such as 2D histograms or hexagonal binning (hexbin) are more effective, as they summarize density with color gradients. Alternatively, contour plots highlight regions of varying point density, making them particularly useful for smooth distributions or group comparisons. However, this technique requires careful use to avoid overly complex visualizations when groups overlap heavily. Ultimately, no single method works universally; the choice depends on dataset size and analytical purpose. The recent ggplot2 4.0.0 release further enhances these techniques by improving position adjustments and aesthetic mappings, giving users greater flexibility in tackling overplotting.
IT邦幫忙