[Academic Highlight] Effects of feature selection methods in estimating SO2 concentration variations using machine learning and stacking ensemble approach
Ecological Agriculture: Assessment of Forest Carbon Sink and Ecological Economy under Climate Change [Department of Geomatics, National Cheng Kung University / Professor Wu, Chih-Da]
Paper title: Effects of feature selection methods in estimating SO2 concentration variations using machine learning and stacking ensemble approach
Journal: Environmental Technology & Innovation
Year, volume, pages: 2025, 37, article no. 103996
Authors: Wong, Pei-Yi; Zeng, Yu-Ting; Su, Huey-Jen; Lung, Shih-Chun Candice; Chen, Yu-Cheng; Chen, Pau-Chung; Hsiao, Ta-Chih; Adamkiewicz, Gary; Wu, Chih-Da (吳治達)*
DOI: 10.1016/j.eti.2024.103996
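Abstract (translated from the Chinese): This study systematically compares how statistical and machine learning-based feature selection methods affect model performance in the spatial prediction of SO2 concentration variations. Daily SO2 observations from 1994 to 2018 were collected and combined with 428 geographic predictors covering land use/land cover, roads, landmarks, meteorological factors, and satellite imagery. Four feature selection methods (SelectKBest, stepwise regression, Elastic Net, and random forest) were each paired with gradient boosting, CatBoost, XGBoost, and a stacking ensemble model to build multiple prediction models, and SHAP was used to quantify each feature's contribution to the model. The stacking ensemble model performed best, and random forest feature selection gave the highest accuracy in the training model (R²=0.80), ahead of stepwise regression (R²=0.75), SelectKBest (R²=0.75), and Elastic Net (R²=0.72). Several validation tests also confirmed the robustness of the models. These results demonstrate the advantage of random forest feature selection for air pollution prediction and highlight the potential of combining multi-source geographic predictors with machine learning, providing a more interpretable and accurate tool for urban air pollution management and decision-making.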
Abstract (English): Statistical-based feature selection methods have been used for dimension reduction, but only a few studies have explored the impact of selected features on machine learning models. This study aims to investigate the effects of statistical and machine learning-based feature selection methods on spatial prediction models for estimating variations in SO2 concentrations. We collected daily SO2 observations from 1994 to 2018 along with predictor variables such as land-use/land cover allocations, roads, landmarks, meteorological factors, and satellite images, resulting in a total of 428 geographic predictors. Important features were identified using statistical-based feature selection methods including SelectKBest, stepwise feature selection, elastic net, and machine learning-based methods such as random forest. The selected features from the four feature selection methods were fitted to machine learning algorithms including gradient boosting, CatBoost, XGBoost, and stacking ensemble to establish prediction models for estimating SO2 concentrations. SHapley Additive exPlanations (SHAP) was applied to explain the contribution of each selected feature to the model's prediction capability. The results showed that the stacking ensemble model outperformed the three single machine learning algorithms. Among the four feature selection methods, the random forest method yielded the highest prediction accuracy (R²=0.80) in the training model, followed by stepwise selection (R²=0.75), SelectKBest (R²=0.75), and elastic net (R²=0.72) in the stacking ensemble model. These results were robust after several validation tests. Our findings suggested that the random forest feature selection method was more suitable for developing machine learning models for air pollution estimation. The identified features also provide important information for urban air pollution management.
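The workflow the abstract describes, random forest feature selection followed by a stacking ensemble, can be illustrated with a minimal scikit-learn sketch. This is not the authors' code. The data are synthetic (40 made-up predictors instead of the 428 geographic predictors), the median importance threshold is an arbitrary choice, and the base learners are scikit-learn's gradient boosting and random forest standing in for the paper's gradient boosting, CatBoost, and XGBoost, with a ridge regression assumed as the meta-learner.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the predictor table (assumption: not the study's data).
X, y = make_regression(
    n_samples=400, n_features=40, n_informative=8, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: random forest feature selection. Features whose impurity-based
# importance is at or above the median importance are kept (illustrative cutoff).
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=0),
    threshold="median",
)

# Step 2: stacking ensemble. Out-of-fold predictions of the base learners
# (5-fold CV) become the inputs of the ridge meta-learner.
stack = StackingRegressor(
    estimators=[
        ("gb", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=5,
)

# Chain selection and stacking so the selector is fitted on training data only.
model = make_pipeline(selector, stack)
model.fit(X_train, y_train)

n_selected = int(model.named_steps["selectfrommodel"].get_support().sum())
r2_test = r2_score(y_test, model.predict(X_test))
print(f"kept {n_selected} of {X.shape[1]} features, held-out R2 = {r2_test:.2f}")
```

Swapping `selector` for `SelectKBest` or an elastic net inside `SelectFromModel` would give a rough analogue of the comparison across feature selection methods, though the scores on synthetic data say nothing about the reported results.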
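Relevance of the published work to the Center's research themes: This work offers substantial benefits for sustainable agriculture. Sulfur dioxide (SO2) is an important pollutant affecting crop growth and farmland ecology; long-term exposure not only reduces crop yield and quality but also alters the microbial environment of soil and plants. By combining feature selection methods such as random forest with a stacking ensemble model, the study improves the accuracy of spatial SO2 concentration prediction and identifies the geographic factors that most influence air pollution. This allows agricultural decision-makers to pinpoint pollution sources and high-risk areas more precisely, and in turn to plan suitable crop planting locations, adjust agricultural management strategies, and promote low-pollution agricultural environments. The models and methods established in the study can also be extended to monitoring other pollutants or agriculture-related environmental factors, providing a scientific basis for reducing agricultural risk, improving food security, and achieving sustainable agricultural development.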