Xujia Wei - Data Science Portfolio
This portfolio highlights my experience in data analysis, statistical modeling,
and machine learning, applied to real-world challenges in urban planning,
finance, and predictive modeling. Through projects on housing market
analysis, economic inequality, and spam classification, I have developed
strong skills in Python, SQL, data visualization, and predictive analytics to
extract meaningful insights from complex datasets.
EXAMPLE #1
Bike Sharing Data Analysis and
Visualization
Summary
This project analyzes a bike-sharing dataset from Washington, DC, to understand user
behaviors, trends, and key factors affecting bike rentals. Through data wrangling, visualization,
and exploratory data analysis (EDA), insights were derived about rental patterns, peak usage
times, and seasonal variations.
Problem and Approach
Bike-sharing systems are widely used in urban environments, but understanding user behavior
and operational efficiency requires detailed data analysis. This project aimed to clean, process,
and visualize bike rental data to uncover trends and insights. The approach (sketched in code below) involved:
Data Cleaning: Processing raw data, handling missing values, and structuring it for
analysis.
Exploratory Data Analysis (EDA): Using statistical summaries and visualizations to identify
rental patterns.
Data Visualization: Creating informative plots to illustrate key trends, such as
daily/seasonal demand and correlations with weather conditions.
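A minimal sketch of this workflow in Pandas, assuming an hourly rentals file laid out like the common UCI Bike Sharing dataset (`hour.csv`, with columns such as `dteday`, `hr`, `season`, `weathersit`, and the rental count `cnt`; all names are illustrative, not the project's exact files):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the hourly rental data (file name and columns are assumed,
# following the common UCI Bike Sharing dataset layout).
df = pd.read_csv("hour.csv", parse_dates=["dteday"])

# Basic cleaning: drop duplicates and rows with missing counts.
df = df.drop_duplicates().dropna(subset=["cnt"])

# Average rentals by hour of day reveal the commuting peaks.
hourly = df.groupby("hr")["cnt"].mean()
hourly.plot(kind="bar", title="Average rentals by hour of day")
plt.xlabel("Hour")
plt.ylabel("Mean rentals")
plt.tight_layout()
plt.show()

# Seasonal variation: distribution of counts per season.
sns.boxplot(data=df, x="season", y="cnt")
plt.title("Rental counts by season")
plt.show()
```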
Contribution
Data Wrangling: Processed and transformed raw rental data using Pandas.
Visualization & EDA: Created multiple visualizations using Matplotlib and Seaborn to
analyze trends.
Statistical Insights: Identified key rental patterns, including peak usage times, seasonal
variations, and user preferences.
Critical Thinking: Answered open-ended analytical questions about the impact of various
factors on bike rental trends.
Results and Impact
Discovered that bike rental demand peaks during commuting hours, indicating a strong
usage pattern among working professionals.
Found a clear correlation between weather conditions and rental demand, with adverse
weather reducing usage (see the sketch after this list).
Highlighted the importance of weekend vs. weekday demand differences, providing
insights for bike-sharing system optimizations.
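Continuing under the same assumed column layout as the sketch above, a quick check of the weather and weekday effects described in this list:

```python
import pandas as pd

df = pd.read_csv("hour.csv")  # same assumed hourly rental file as above

# Correlation of weather variables with demand ('cnt'); negative values
# for humidity/windspeed would indicate adverse weather suppressing rentals.
print(df[["temp", "hum", "windspeed", "weathersit", "cnt"]].corr()["cnt"])

# Weekday vs. weekend demand, assuming a 0/1 'workingday' indicator column.
print(df.groupby("workingday")["cnt"].mean())
```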
EXAMPLE #2
COVID-19 Data Analysis and Estimation
Models
Summary
This project examines a dataset of daily COVID-19 cases across U.S. counties, incorporating
vaccination rates and related metadata to understand the factors influencing case trends. The
analysis involved statistical modeling techniques, including bootstrap sampling, bias-variance
tradeoff analysis, and multicollinearity detection. By leveraging these methods, the project aimed
to improve predictive accuracy and assess pandemic-related trends for data-driven insights.
Problem and Approach
Understanding COVID-19 trends and predicting case numbers is critical for public health planning.
However, estimating trends from noisy real-world data presents challenges such as bias,
variance, and data dependencies.
The approach involved:
Bootstrap Sampling: Generating resampled datasets to estimate the sampling distribution of a statistic (sketched below).
Bias-Variance Tradeoff: Evaluating predictive models to balance complexity and
generalization.
Multicollinearity Analysis: Identifying and mitigating redundant features in regression models (see the VIF sketch below).
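A minimal sketch of the bootstrap step, using a hypothetical array of daily case counts (values are illustrative, not from the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: daily new cases for one county (illustrative values).
cases = np.array([12, 18, 25, 30, 22, 40, 35, 28, 19, 45])

# Bootstrap: resample with replacement and recompute the statistic many
# times to approximate its sampling distribution.
boot_means = np.array([
    rng.choice(cases, size=cases.size, replace=True).mean()
    for _ in range(10_000)
])

# A 95% percentile confidence interval for the mean daily case count.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {cases.mean():.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```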
Contribution
Data Wrangling & Cleaning: Processed COVID-19 data using Pandas and NumPy.
Statistical Modeling: Applied SciPy and scikit-learn to perform bias-variance decomposition and
evaluate estimators.
Visualization & Insights: Used Matplotlib and Seaborn to create meaningful graphs explaining
pandemic trends.
Machine Learning Techniques: Explored regression models and feature selection to improve
prediction accuracy.
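For the multicollinearity analysis above, one standard diagnostic is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others. A sketch using scikit-learn, with deliberately redundant hypothetical predictors:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X: pd.DataFrame) -> pd.Series:
    """VIF per column: regress each column on the rest and invert 1 - R^2."""
    scores = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

# Hypothetical predictors; 'vax_rate' and 'fully_vax_rate' are
# deliberately near-duplicates to show a large VIF.
rng = np.random.default_rng(0)
vax = rng.uniform(0.3, 0.9, size=200)
X = pd.DataFrame({
    "vax_rate": vax,
    "fully_vax_rate": vax * 0.95 + rng.normal(0, 0.01, 200),
    "median_age": rng.normal(38, 5, 200),
})
print(vif(X).round(1))  # VIF >> 10 flags redundant predictors
```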
Results and Impact
Demonstrated how bootstrap resampling quantifies the sampling variability of estimators.
Quantified the bias-variance tradeoff and used it to tune model complexity for better generalization.
Identified multicollinearity issues in COVID-19 predictors, leading to better feature selection.
Developed key takeaways for public health decision-making based on case and vaccination
trends.
EXAMPLE #3
IMDb Data Analysis with SQL
Summary
This project utilizes SQL to analyze the Internet Movie Database (IMDb), extracting insights
into movies, actors, and ratings. The goal was to formulate and execute SQL queries to
explore trends, relationships, and anomalies in the dataset.
Contribution
SQL Query Development: Executed complex SQL queries using SQLite to extract insights
from IMDb data (see the sketch after this list).
Database Management: Leveraged SQLAlchemy and pandas to manipulate relational
data.
Visualization & Reporting: Used Matplotlib, Seaborn, and Plotly to present key trends in
movies and ratings.
Trend Analysis: Investigated factors affecting movie ratings, including actor
collaborations and genre patterns.
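A sketch of the kind of query used, assuming an SQLite snapshot with `titles` and `ratings` tables keyed on `title_id` (the file name and schema are assumptions, not the project's actual database):

```python
import sqlite3
import pandas as pd

# Open a local IMDb snapshot (file and schema are assumed: a 'titles'
# table joined to 'ratings' on title_id, as in common IMDb SQLite dumps).
conn = sqlite3.connect("imdb.db")

query = """
SELECT t.primary_title,
       t.premiered,
       r.rating,
       r.votes
FROM titles AS t
JOIN ratings AS r ON r.title_id = t.title_id
WHERE r.votes >= 10000          -- ignore thinly rated titles
ORDER BY r.rating DESC
LIMIT 10;
"""

top_movies = pd.read_sql_query(query, conn)
print(top_movies)
conn.close()
```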
EXAMPLE #4
Cook County Housing Market Analysis
Summary
This project examines housing data from Cook County to analyze market trends, identify
influential property factors, and build predictive models for housing prices. The first phase (A1)
focused on exploratory data analysis (EDA), understanding pricing patterns, and ensuring
fairness in valuation. The second phase (A2) applied machine learning techniques to predict
property prices based on real estate attributes.
Problem and Approach
The real estate market involves complex pricing structures influenced by location, property
size, economic conditions, and other factors. The project addressed this by:
Cleaning and processing real estate data to detect trends.
Identifying key price determinants, such as square footage, neighborhood, and location.
Evaluating fairness in valuation across different neighborhoods.
Developing regression models to estimate home prices (sketched after this list).
Performing feature engineering to improve model accuracy.
Evaluating model performance using statistical validation techniques.
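A minimal sketch of the modeling pipeline, with hypothetical file and column names (`sale_price`, `sqft`, `bedrooms`, `neighborhood`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the housing data (file and column names are illustrative).
df = pd.read_csv("cook_county_housing.csv")

# Feature engineering: log-transform the right-skewed sale price and
# one-hot encode the neighborhood indicator.
df["log_price"] = np.log(df["sale_price"])
X = pd.get_dummies(df[["sqft", "bedrooms", "neighborhood"]],
                   columns=["neighborhood"], drop_first=True)
y = df["log_price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Held-out RMSE in log-price units as a simple validation metric.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"test RMSE (log price): {rmse:.3f}")
```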
Contribution
Data Wrangling & Preparation: Processed and structured Cook County housing datasets
using Pandas and NumPy.
Exploratory Data Analysis (EDA): Analyzed housing market trends and influential property
features.
Machine Learning Models: Built linear regression models to predict home prices with high
accuracy.
Bias & Fairness Analysis: Assessed whether property valuations were equitable across
different regions.
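One simple way to probe valuation fairness is to compare relative prediction error across regions; a sketch with hypothetical test-set results:

```python
import pandas as pd

# Hypothetical test-set results: true prices, model predictions, and
# each property's region (all values are illustrative).
results = pd.DataFrame({
    "region":    ["north", "north", "south", "south", "west", "west"],
    "true":      [250_000, 310_000, 120_000, 140_000, 500_000, 450_000],
    "predicted": [260_000, 300_000, 100_000, 115_000, 505_000, 460_000],
})

# Relative error per property; systematic over/under-valuation in one
# region would suggest an inequitable model.
results["rel_error"] = (results["predicted"] - results["true"]) / results["true"]
print(results.groupby("region")["rel_error"].mean().round(3))
```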
Results and Impact
Uncovered price trends across Cook County, highlighting key pricing drivers.
Improved housing price prediction accuracy through feature engineering and model
refinement.
Identified valuation biases, ensuring fairness in predictive pricing models.
Demonstrated the power of data science in real estate valuation and investment analysis.
Machine Learning & Regression Models: Built predictive models (linear regression, feature engineering) to estimate housing prices.
Data Visualization & Statistical Analysis: Utilized Seaborn, Matplotlib, and Scikit-learn to analyze pricing patterns and model performance.
EXAMPLE #5
Spam Email Classification Using Machine Learning
Summary
This project develops a binary classification model to distinguish between spam (junk,
commercial, or bulk) emails and ham (regular emails). The first phase (B1) focuses on
exploratory data analysis, feature engineering, and initial logistic regression modeling, while
the second phase (B2) builds on this foundation to optimize classification models, perform
cross-validation, and analyze model performance using real-world email datasets.
Problem and Approach
Spam detection is a crucial application in cybersecurity and email filtering, requiring robust
machine learning techniques to identify patterns in text data. The approach (sketched in code after this list) involved:
Extracting features from email text using NLP techniques (word frequency, n-grams,
stopword filtering).
Applying logistic regression to develop a baseline spam classifier.
Evaluating model accuracy and initial performance using confusion matrices and
precision-recall metrics.
Implementing advanced classification models (e.g., Naïve Bayes, Random Forest, SVM).
Performing hyperparameter tuning and cross-validation to improve model performance.
Generating ROC curves and AUC scores to assess classifier effectiveness.
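A minimal sketch of the baseline pipeline, combining TF-IDF features, logistic regression, and cross-validated hyperparameter tuning (the toy emails and labels are illustrative stand-ins for the real dataset):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labeled data: raw email text and 0/1 spam labels.
emails = pd.Series([
    "Win a FREE prize now, click here!!!",
    "Meeting moved to 3pm, see agenda attached",
    "Cheap meds, limited time offer",
    "Can you review my pull request today?",
])
labels = pd.Series([1, 0, 1, 0])

# TF-IDF features feeding a logistic regression baseline, with the
# regularization strength tuned by cross-validated grid search.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]},
                    cv=2, scoring="roc_auc")
grid.fit(emails, labels)
print(grid.best_params_, grid.best_score_)
```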
Contribution
Feature Engineering for Text Data: Extracted relevant features from email text using TF-
IDF, bag-of-words, and tokenization techniques.
Supervised Machine Learning: Built and optimized classification models using Scikit-learn.
Model Performance Evaluation: Analyzed confusion matrices, precision-recall, and ROC-
AUC curves to assess model effectiveness (see the sketch after this list).
Overfitting Prevention & Validation: Applied cross-validation and hyperparameter tuning to
ensure model generalization.
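A sketch of the evaluation step with hypothetical classifier outputs:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical test labels and classifier scores in [0, 1].
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred  = (y_score >= 0.5).astype(int)  # threshold the scores

print(confusion_matrix(y_true, y_pred))          # [[TN FP], [FN TP]]
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))
```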
Results and Impact
Achieved high classification accuracy using optimized spam detection models.
Identified key patterns in spam emails based on word frequency and NLP-based feature
extraction.
Improved model precision-recall tradeoff to reduce false positives in email classification.
Provided a scalable framework for real-world spam detection in email filtering systems.
EXAMPLE #6
City Planning in San Francisco
Summary
This project investigates the relationship between urban planning, economic disparity, and
accessibility in San Francisco through a data-driven approach. By applying Marxist theory, the
study examines how capitalist-driven urban development has shaped economic stratification,
labor force trends, and infrastructure accessibility. The analysis is conducted using historical
labor force and income inequality data, spatial visualizations, and regression models to
highlight systemic inequalities and propose equitable planning strategies.
Problem and Approach
San Francisco's urban development reflects a growing divide between affluent communities
and marginalized populations, driven by economic cycles and city planning decisions. This
project explores:
Labor Force & Income Inequality Trends: Analyzed changes in the civilian labor force over
time and their correlation with wealth disparity.
Impact of Economic Crises on Employment: Studied how the 2008 financial crisis and the
COVID-19 pandemic disrupted employment trends and exacerbated inequality.
Spatial Analysis of Unemployment Rates: Mapped unemployment distribution by zip code
from 2019 to 2022 to highlight areas most affected by economic downturns.
Accessibility in Urban Design: Examined the distribution of curb ramps to assess whether
public infrastructure investments are equitably allocated.
Contribution
Data Collection & Processing: Cleaned and merged labor force, income inequality, and curb ramp datasets for analysis.
Statistical Analysis: Conducted regression modeling to quantify the relationship between
workforce participation and economic disparity (sketched below).
Geospatial Visualization: Created heatmaps of unemployment rates and accessibility
distributions using Python-based mapping tools.
Urban Policy Recommendations: Proposed equitable city planning strategies that balance
economic growth with social inclusion.
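A minimal sketch of the regression step with scipy.stats.linregress, using hypothetical yearly series (values are illustrative, not the project's actual data):

```python
import numpy as np
from scipy import stats

# Hypothetical yearly series: SF civilian labor force (thousands) and a
# Gini-style income inequality index.
labor_force = np.array([480, 495, 510, 530, 545, 540, 520, 555])
gini_index  = np.array([0.46, 0.47, 0.48, 0.49, 0.50, 0.50, 0.49, 0.51])

# Simple linear regression quantifies the association between workforce
# size and inequality; r and the p-value summarize its strength.
res = stats.linregress(labor_force, gini_index)
print(f"slope={res.slope:.5f}, r={res.rvalue:.2f}, p={res.pvalue:.3f}")
```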
Results and Impact
Confirmed a strong correlation between labor force changes and income inequality in San Francisco.
Identified unemployment disparities across zoning districts and advocated targeted interventions.
Found curb ramp distribution to be roughly uniform across neighborhoods, challenging the assumption that infrastructure investment favors wealthier areas.
Highlighted economic crises' impact on spatial inequalities, urging inclusive urban planning.