HU14 CISC 520 Data Analytics Final Project
Wikipedia Articles
1. Introduction/Background
1.1 Context and Motivation
Geolocated Wikipedia articles act as a new socioeconomic mirror, offering a real-time picture
of development trends in local areas. This project analyzes 100,000 geotagged Wikipedia pages
(2018-2025) to:
Test the hypothesis that article density and contributor activity are related to GDP
(deductive approach)
Uncover underlying patterns in regional economic-Wikipedia relationships (inductive
approach)
Critical Steps
1. Missing Data Treatment: 3% of values were missing and were filled with the median
2. Normalization: Min-Max scaling applied to all numerical features
3. Train-Test Split: 70-30, stratified by GDP tertile
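The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual pipeline: the function names (impute_median, min_max_scale, stratified_split), the use of None to mark a missing value, and the fixed random seed are all assumptions made for the example.

```python
import random
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def stratified_split(strata, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), taking train_frac of each stratum.

    `strata` holds one label per row (here: the GDP tertile), so every
    tertile keeps roughly the same share in both subsets.
    """
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    by_stratum = {}
    for i, label in enumerate(strata):
        by_stratum.setdefault(label, []).append(i)
    train, test = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)
        cut = round(train_frac * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

The report does not say whether the median and the min/max values were computed on the full data or on the training portion only. Computing them on the training rows and reusing them for the test rows is the more conservative choice, because it keeps test information out of the preprocessing.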
SVM Classification:
Split regions into Low/Medium/High GDP
Accuracy: 78%
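The Low/Medium/High target used by the classifier can be derived from GDP tertiles. The sketch below shows one way to do that labelling; the function name gdp_tertile_labels is hypothetical, and the exact cut-point convention (here the default of statistics.quantiles) is an assumption, since the report does not specify it. The SVM itself is not shown because its configuration is not given.

```python
from statistics import quantiles


def gdp_tertile_labels(gdp):
    """Label each region Low/Medium/High by the GDP tertile it falls in."""
    # quantiles(..., n=3) returns the two cut points that split the
    # data into three groups of (roughly) equal size.
    lower_cut, upper_cut = quantiles(gdp, n=3)
    labels = []
    for value in gdp:
        if value <= lower_cut:
            labels.append("Low")
        elif value <= upper_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels
```

The resulting labels can serve both as the classification target and as the strata for the 70-30 split.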
3. Results
3.1 Key findings
Cluster Analysis:
Cluster  Characteristics                    Examples
1        High GDP and high engagement       NYC, Tokyo
2        Medium GDP and medium engagement   Mumbai
3        Low GDP and low engagement         Africa
Predictive performance:
Model          Metric    Score
Random Forest  R²        0.82
SVM            Accuracy  0.78
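For reference, the two metrics in the table are defined as follows. This is a sketch of the standard definitions, assuming R² is the usual coefficient of determination and accuracy is the fraction of correctly classified regions; the function names are illustrative and the report's own evaluation code is not shown.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(labels_true, labels_pred):
    """Fraction of predictions that match the true class label."""
    correct = sum(t == p for t, p in zip(labels_true, labels_pred))
    return correct / len(labels_true)
```

Under these definitions, an R² of 0.82 means the regression model accounts for 82% of the variance in the target, and an accuracy of 0.78 means 78% of regions were assigned to the correct GDP class.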
4. Discussion
4.1 Interpretation
Confirmed hypothesis: Wikipedia statistics predict economic indicators
Surprising finding: Contributor activity exhibits a U-shaped relationship with
unemployment
4.3 Limitations
Wikipedia coverage is biased toward urban areas
Small sample size for the temporal analysis
Causality cannot be established from correlations alone
5. Conclusion
5.1 Contributions
Demonstrated that Wikipedia can be used as an economic proxy
Presented a reproducible analysis pipeline
5.2 Future Work
Short-term: Add Wikidata relationships
Long-term: Real-time dashboard monitoring
Appendices
Appendix A: Full Results
                       pce        pop    psavert    uempmed   unemploy
pce               1.000000  -0.144964  -0.441787   0.072213   0.119294
pop              -0.144964   1.000000   0.493400   0.619920   0.202564
psavert          -0.441787   0.493400   1.000000   0.108721  -0.134804
uempmed           0.072213   0.619920   0.108721   1.000000   0.143291
unemploy          0.119294   0.202564  -0.134804   0.143291   1.000000
contributors      0.090419  -0.241508  -0.020580  -0.196259   0.210101
article_density   0.116026   0.805527   0.304827   0.643407   0.260295
gdp               0.202035   0.275095   0.225200  -0.037835  -0.482935
Table 1: Full correlation matrix
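A table of this shape can be produced with a plain Pearson correlation over named columns. The sketch below assumes the entries of Table 1 are Pearson coefficients, which the report does not state explicitly; the helper names pearson and correlation_table are hypothetical. Because Table 1 has more rows than columns, the helper takes the row and column names separately.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_table(columns, row_names, col_names):
    """Build {row: {col: r}} from a dict mapping column name -> values."""
    return {
        row: {col: pearson(columns[row], columns[col]) for col in col_names}
        for row in row_names
    }
```

For example, calling correlation_table with all eight variable names as rows and the five economic variables as columns would give a table with the same layout as Table 1.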