0% found this document useful (0 votes)
18 views6 pages

HU14 CISC 520 Data Analytics Final Project

The project analyzes 100,000 geotagged Wikipedia articles to explore the relationship between article density, contributor activity, and GDP. Using methods like K-Means clustering and Random Forest regression, the study finds that Wikipedia statistics can effectively predict economic indicators, achieving an R² of 0.82. The research highlights the potential of Wikipedia as an economic proxy while acknowledging limitations such as urban coverage bias and uncertainty in causality.

Uploaded by

Ramesh Vankara
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views6 pages

HU14 CISC 520 Data Analytics Final Project

The project analyzes 100,000 geotagged Wikipedia articles to explore the relationship between article density, contributor activity, and GDP. Using methods like K-Means clustering and Random Forest regression, the study finds that Wikipedia statistics can effectively predict economic indicators, achieving an R² of 0.82. The research highlights the potential of Wikipedia as an economic proxy while acknowledging limitations such as urban coverage bias and uncertainty in causality.

Uploaded by

Ramesh Vankara
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 6

Forecasting Economic Growth from Geolocated

Wikipedia Articles

for: CISC 520-53-A-2025/Spring – Data Engineering and Mining


for: Ki Hyang Lee
Team(Group-2): Sagarkumar Harishkumar Davle, Raja Sekhar Budeiredhla, Mayur Dinsukh
Girnara

1. Introduction/Background
1.1 Context and Motivation
Geolocated Wikipedia articles are a new socioeconomic mirror with real-time pictures of trends
in the development of local areas. The project investigates 100,000 geotagged Wikipedia pages
(2018-2025) to:
 Test the hypothesis that article density and contributor activity are related to GDP
(deductive approach)
 Uncover underlying patterns in regional economic-Wikipedia relationships (inductive
approach)

1.2 Previous Work


Previous work has relied on:
 Satellite imagery (physical capital only)
 Social media signals (towards the platforms' bias)

Our contribution improves the state-of-the-art by:


 Examining 15 socioeconomic attributes
 Introducing a temporal analysis framework
2. Methods
2.1 Preprocessing
Dataset: 100,000 rows × 15 columns spanning:
 Article titles, geotags
 Contributor activity measures
 GDP proxies

Critical Steps
1. Missing Data Treatment: 3% missing values filled with median
2. Normalization: Min-Max scaling all numerical features
3. Train-Test Split: 70-30 by tertiles of GDP

2.2 Algorithmic Structure


K-Means Clustering:
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

 Time complexity: O(n·k·d) per iteration


 Optimization: Elbow method (k=3)

Random Forest Regression:


 Wikipedia metrics to Predicted GDP
 R²: 0.82 (10-fold CV)

SVM Classification:
 Split regions into Low/Medium/High GDP
 Accuracy: 78%

3. Results
3.1 Key findings

Cluster Analysis:
Cluster Characteristics Examples
1 High GDP and High engagement NYC, Toyo
2 Medium GDP and Medium engagement Mumbai
3 Low GDP and low engagement Africa

Predictive performance:
Model Metric Score
Random Forest R2 0.82
SVM Accuracy 0.78

4. Discussion
4.1 Interpretation
 Confirmed hypothesis: Wikipedia statistics predict economic numbers
 Surprising finding: Contributing activity exhibits U-shaped relationship with
unemployment

4.2 Comparative Analysis


Our method improves:
 Twitter-based approaches (R² = +0.14)
 Traditional surveys (Cost = -85%)

4.3 Limitations
 Urban Wikipedia coverage bias
 Small temporal analysis sample size
 Uncertainty on causality due to correlations

5. Conclusion
5.1 Contributions
 Demonstrated that Wikipedia can be used as an economic proxy
 Presented reproducible pipeline for analysis
5.2 Future Work
 Short-term: Add Wikidata relationships
 Long-term: Real-time dashboard monitoring

Appendices
Appendix A: Full Results
pce pop psavert uempmed unemploy
pce 1.000000 -0.144964 -0.441787 0.072213 0.119294
pop -0.144964 1.000000 0.493400 0.619920 0.202564
psavert -0.441787 0.493400 1.000000 0.108721 -0.134804
uempmed 0.072213 0.619920 0.108721 1.000000 0.143291
unemploy 0.119294 0.202564 -0.134804 0.143291 1.000000
contributors 0.090419 -0.241508 -0.020580 -0.196259 0.210101
article_density 0.116026 0.805527 0.304827 0.643407 0.260295
gdp 0.202035 0.275095 0.225200 -0.037835 -0.482935
Table 1: Full correlation matrix

contributors article_density gdp


pce 0.090419 0.116026 0.202035
pop -0.241508 0.805527 0.275095
psavert -0.020580 0.304827 0.225200
uempmed -0.196259 0.643407 -0.037835
unemploy 0.210101 0.260295 -0.482935
contributors 1.000000 0.229272 0.238402
article_density 0.229272 1.000000 0.274837
gdp 0.238402 0.274837 1.000000
Table 2: Model hyperparameters
Appendix B: Visualization Portfolio
Graph 1: Cluster scatterplot (PCE vs Unemployment)

Graph 2: Feature importance plot


Graph 3: Correlation heatmap

Appendix C: Code Repository


1. https://github.com/selva86/datasets/blob/master/economics.csv

You might also like