
Data Science Portfolio

Xujia Wei
This portfolio highlights my experience in data analysis, statistical modeling,
and machine learning, applied to real-world challenges in urban planning,
finance, and predictive modeling. Through projects on housing market
analysis, economic inequality, and spam classification, I have developed
strong skills in Python, SQL, data visualization, and predictive analytics to
extract meaningful insights from complex datasets.

My experience demonstrates a problem-solving mindset that combines
exploratory data analysis (EDA), business intelligence, and data-driven
decision-making to optimize processes and inform strategy. Whether through
building predictive models, analyzing trends, or presenting insights to
stakeholders, these projects showcase my ability to transform data into
meaningful business solutions, making me well-equipped for data and
business analyst roles.

XUJIA WEI PORTFOLIO ⸺ LAST UPDATED: FEB 2025

Page 01
EXAMPLE #1
Bike Sharing Data Analysis and
Visualization
Summary
This project analyzes a bike-sharing dataset from Washington, DC, to understand user
behaviors, trends, and key factors affecting bike rentals. Through data wrangling, visualization,
and exploratory data analysis (EDA), insights were derived about rental patterns, peak usage
times, and seasonal variations.
Problem and Approach
Bike-sharing systems are widely used in urban environments, but understanding user behavior
and operational efficiency requires detailed data analysis. This project aimed to clean, process,
and visualize bike rental data to uncover trends and insights. The approach involved the following steps (a minimal code sketch follows the list):
Data Cleaning: Processing raw data, handling missing values, and structuring it for
analysis.
Exploratory Data Analysis (EDA): Using statistical summaries and visualizations to identify
rental patterns.
Data Visualization: Creating informative plots to illustrate key trends, such as
daily/seasonal demand and correlations with weather conditions.
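
The wrangling and EDA steps above can be illustrated with a minimal sketch. It assumes a CSV file named bike_rentals.csv with columns datetime, weathersit, and cnt, loosely following the common Washington, DC bike-sharing schema; the portfolio does not specify the actual file or column names.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the raw rental data (hypothetical file and column names).
rentals = pd.read_csv("bike_rentals.csv", parse_dates=["datetime"])

# Data cleaning: drop exact duplicates and fill gaps in the weather column.
rentals = rentals.drop_duplicates()
rentals["weathersit"] = rentals["weathersit"].ffill()

# Feature extraction for EDA: hour of day and weekday vs. weekend flag.
rentals["hour"] = rentals["datetime"].dt.hour
rentals["is_weekend"] = rentals["datetime"].dt.dayofweek >= 5

# Average rentals per hour, split by day type.
hourly = (
    rentals.groupby(["hour", "is_weekend"])["cnt"]
    .mean()
    .reset_index()
)

# Visualize the commuting-hour peaks described in the results.
sns.lineplot(data=hourly, x="hour", y="cnt", hue="is_weekend")
plt.xlabel("Hour of day")
plt.ylabel("Average rentals")
plt.title("Average hourly bike rentals: weekday vs. weekend")
plt.show()
```
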
Contribution
Data Wrangling: Processed and transformed raw rental data using Pandas.
Visualization & EDA: Created multiple visualizations using Matplotlib and Seaborn to
analyze trends.
Statistical Insights: Identified key rental patterns, including peak usage times, seasonal
variations, and user preferences.
Critical Thinking: Answered open-ended analytical questions about the impact of various
factors on bike rental trends.
Results and Impact
Discovered that bike rental demand peaks during commuting hours, indicating a strong
usage pattern among working professionals.
Found a clear correlation between weather conditions and rental demand, where adverse
weather reduced usage.
Highlighted the importance of weekend vs. weekday demand differences, providing
insights for bike-sharing system optimizations.

Skills Gained and Tools Used

Data Analysis & EDA: Applied Pandas and NumPy to clean, process, and analyze large-scale bike-sharing data for meaningful insights.

Data Visualization: Utilized Matplotlib and Seaborn to create clear, informative charts highlighting trends in bike rental behaviors.

Statistical Thinking: Interpreted data trends, identified correlations between variables (e.g., weather conditions, time of day), and extracted business insights.

Python & Data Wrangling: Used Python libraries (Pandas, Matplotlib, Seaborn) to manipulate datasets, handle missing values, and structure data for analysis.
Page 02
EXAMPLE #1
Bike Sharing Data Analysis and Visualization (Continued)

Page 03
EXAMPLE #2
COVID-19 Data Analysis and Estimation
Models
Summary
This project examines a dataset of daily COVID-19 cases across U.S. counties, incorporating
vaccination rates and related metadata to understand the factors influencing case trends. The
analysis involved statistical modeling techniques, including bootstrap sampling, bias-variance
tradeoff analysis, and multicollinearity detection. By leveraging these methods, the project aimed
to improve predictive accuracy and assess pandemic-related trends for data-driven insights.
Problem and Approach
Understanding COVID-19 trends and predicting case numbers is critical for public health planning.
However, estimating trends from noisy real-world data presents challenges such as bias,
variance, and data dependencies.
The approach involved:
Bootstrap Sampling: Generating resampled datasets to estimate the distribution of statistics (see the sketch after this list).
Bias-Variance Tradeoff: Evaluating predictive models to balance complexity and
generalization.
Multicollinearity Analysis: Identifying and mitigating redundant features in regression models.
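
A minimal sketch of the bootstrap step, assuming a NumPy array of daily new-case counts for one county; the daily_cases series below is simulated for illustration, since the actual dataset and statistic are not reproduced in the portfolio.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical daily new-case counts for one county.
daily_cases = rng.poisson(lam=120, size=365)

# Bootstrap: resample with replacement and recompute the statistic
# (here, the mean) on each resample to approximate its sampling distribution.
n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(daily_cases, size=daily_cases.size, replace=True)
    boot_means[i] = resample.mean()

# 95% percentile confidence interval for the mean daily case count.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {daily_cases.mean():.1f}, 95% bootstrap CI = [{lo:.1f}, {hi:.1f}]")
```
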
Contribution
Data Wrangling & Cleaning: Processed COVID-19 data using Pandas and NumPy.
Statistical Modeling: Applied Scipy and Sklearn to perform bias-variance decomposition and
evaluate estimators.
Visualization & Insights: Used Matplotlib and Seaborn to create meaningful graphs explaining
pandemic trends.
Machine Learning Techniques: Explored regression models and feature selection to improve prediction accuracy (a multicollinearity check is sketched below).
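
One standard multicollinearity check is the variance inflation factor (VIF). Below is a minimal sketch with statsmodels, using hypothetical predictor names; the portfolio does not list the actual features, so two deliberately correlated vaccination columns stand in to show how VIF flags redundancy.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical county-level predictors; vaccination_rate and pct_fully_vaxed
# are constructed to be highly correlated.
rng = np.random.default_rng(0)
n = 500
vax = rng.uniform(0.2, 0.9, n)
X = pd.DataFrame({
    "vaccination_rate": vax,
    "pct_fully_vaxed": vax * 0.9 + rng.normal(0, 0.02, n),
    "median_age": rng.normal(38, 6, n),
})

# VIF above roughly 5-10 is a common rule of thumb for problematic collinearity.
Xc = sm.add_constant(X)
vifs = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vifs.round(1))
```
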
Results and Impact
Demonstrated how bootstrap resampling reduces variability in estimator calculations.
Highlighted the tradeoff between bias and variance, optimizing model performance.
Identified multicollinearity issues in COVID-19 predictors, leading to better feature selection.
Developed key takeaways for public health decision-making based on case and vaccination
trends.

Skills Gained and Tools Used

Statistical Modeling: Applied bias-variance tradeoff, regression, and probability estimators for COVID-19 case analysis.

Bootstrap Sampling: Implemented resampling techniques to assess uncertainty in statistical estimates.

Python for Data Science: Used Pandas, NumPy, Scipy, and Sklearn for data manipulation and statistical computation.

Data Visualization: Created Seaborn and Matplotlib plots to convey insights on case trends and model performance.
Page 04
EXAMPLE #2
COVID-19 Data Analysis and Estimation Models (Continued)

Page 05
EXAMPLE #3
IMDB DATA ANALYSIS WITH SQL
Summary
This project utilizes SQL to analyze the Internet Movie Database (IMDb), extracting insights
into movies, actors, and ratings. The goal was to formulate and execute SQL queries to
explore trends, relationships, and anomalies in the dataset.

Problem and Approach
The IMDb database contains vast amounts of structured data on movies, ratings, and actors,
but extracting meaningful insights requires querying relational databases efficiently. The
approach involved:
Database Querying: Writing SQL queries to retrieve relevant data (see the sketch after this list).
Exploratory Data Analysis: Analyzing patterns in movie ratings, genres, and actor
collaborations.
Data Visualization: Using Python tools to illustrate key findings.
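
A minimal sketch of the querying step, assuming a local SQLite file imdb.db with simplified titles(title_id, primary_title, genre, start_year) and ratings(title_id, average_rating, num_votes) tables; the real IMDb schema used in the project is not spelled out in the portfolio.

```python
import sqlite3
import pandas as pd

# Connect to a local SQLite copy of the IMDb data (hypothetical file name).
conn = sqlite3.connect("imdb.db")

# Average rating per genre, restricted to well-voted titles to reduce noise.
query = """
SELECT t.genre,
       ROUND(AVG(r.average_rating), 2) AS avg_rating,
       COUNT(*) AS n_titles
FROM titles AS t
JOIN ratings AS r ON r.title_id = t.title_id
WHERE r.num_votes >= 1000
GROUP BY t.genre
ORDER BY avg_rating DESC
LIMIT 10;
"""

# pandas turns the result set into a DataFrame ready for plotting.
top_genres = pd.read_sql_query(query, conn)
print(top_genres)
conn.close()
```
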

Contribution
SQL Query Development: Executed complex SQL queries using SQLite to extract insights
from IMDb data.
Database Management: Leveraged SQLAlchemy and pandas to manipulate relational
data.
Visualization & Reporting: Used Matplotlib, Seaborn, and Plotly to present key trends in
movies and ratings.
Trend Analysis: Investigated factors affecting movie ratings, including actor
collaborations and genre patterns.

Results and Impact
Identified highly-rated movie genres and their trends over time.
Analyzed actor collaborations and their impact on movie ratings.
Highlighted discrepancies in rating distributions, providing insights into IMDb rating
biases.
Demonstrated how SQL can efficiently extract valuable insights from structured
databases.

Skills Gained and Tools Used

SQL & Database Management: Developed and optimized SQL queries to retrieve and analyze structured IMDb data.

Data Visualization: Used Matplotlib, Seaborn, and Plotly to create insightful visualizations of IMDb trends.

Exploratory Data Analysis (EDA): Performed statistical analysis of IMDb movie ratings, genres, and actor collaborations.

Python for SQL Integration: Utilized SQLAlchemy, Pandas, and Jupyter Notebooks to interact with and manipulate database records.
Page 06
EXAMPLE #3
IMDb Data Analysis with SQL (Continued)

Page 07
EXAMPLE #4
COOK COUNTY HOUSING MARKET
ANALYSIS
Summary
This project examines housing data from Cook County to analyze market trends, identify
influential property factors, and build predictive models for housing prices. The first phase (A1)
focused on exploratory data analysis (EDA), understanding pricing patterns, and ensuring
fairness in valuation. The second phase (A2) applied machine learning techniques to predict
property prices based on real estate attributes.
Problem and Approach
The real estate market involves complex pricing structures influenced by location, property
size, economic conditions, and other factors. The project addressed this by (a modeling sketch follows the list):
Cleaning and processing real estate data to detect trends.
Identifying key price determinants, such as square footage, neighborhood, and location.
Evaluating fairness in valuation across different neighborhoods.
Developing regression models to estimate home prices.
Performing feature engineering to improve model accuracy.
Evaluating model performance using statistical validation techniques.
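
A minimal sketch of the price-modeling step with scikit-learn, using an illustrative DataFrame with columns sale_price, sqft, bedrooms, and neighborhood; the actual Cook County feature set is much richer than this.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical housing records; in the project these come from the
# cleaned Cook County assessment data.
housing = pd.DataFrame({
    "sqft": [850, 1200, 2300, 1600, 980, 2750],
    "bedrooms": [2, 3, 4, 3, 2, 5],
    "neighborhood": ["A", "A", "B", "B", "C", "C"],
    "sale_price": [150_000, 210_000, 420_000, 310_000, 175_000, 515_000],
})

X = housing.drop(columns="sale_price")
# Feature engineering example: model log(price) so errors are multiplicative.
y = np.log(housing["sale_price"])

# One-hot encode the categorical neighborhood feature; pass numerics through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"])],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("ols", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
model.fit(X_train, y_train)

# Statistical validation: held-out root mean squared error on log(price).
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"RMSE on log(price): {rmse:.3f}")
```
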
Contribution
Data Wrangling & Preparation: Processed and structured Cook County housing datasets
using Pandas and NumPy.
Exploratory Data Analysis (EDA): Analyzed housing market trends and influential property
features.
Machine Learning Models: Built linear regression models to predict home prices with high
accuracy.
Bias & Fairness Analysis: Assessed whether property valuations were equitable across
different regions.
Results and Impact
Uncovered price trends across Cook County, highlighting key pricing drivers.
Improved housing price prediction accuracy through feature engineering and model
refinement.
Identified valuation biases, ensuring fairness in predictive pricing models.
Demonstrated the power of data science in real estate valuation and investment analysis.

Skills Gained and Tools Used

Real Estate Market Analysis: Analyzed housing prices, valuation fairness, and market trends using data-driven methods.

Data Wrangling & Processing: Applied Pandas, NumPy, and SQL to clean and structure real estate data for analysis.

Machine Learning & Regression Models: Built predictive models (linear regression, feature engineering) to estimate housing prices.

Data Visualization & Statistical Analysis: Utilized Seaborn, Matplotlib, and Scikit-learn to analyze pricing patterns and model performance.
Page 08
EXAMPLE #4
Cook County Housing Market Analysis (Continued)

Page 09
EXAMPLE #5
SPAM EMAIL CLASSIFICATION USING
MACHINE LEARNING
Summary
This project develops a binary classification model to distinguish between spam (junk,
commercial, or bulk) emails and ham (regular emails). The first phase (B1) focuses on
exploratory data analysis, feature engineering, and initial logistic regression modeling, while
the second phase (B2) builds on this foundation to optimize classification models, perform
cross-validation, and analyze model performance using real-world email datasets.
Problem and Approach
Spam detection is a crucial application in cybersecurity and email filtering, requiring robust
machine learning techniques to identify patterns in text data. The approach involved (a baseline-classifier sketch follows the list):
Extracting features from email text using NLP techniques (word frequency, n-grams,
stopword filtering).
Applying logistic regression to develop a baseline spam classifier.
Evaluating model accuracy and initial performance using confusion matrices and
precision-recall metrics.
Implementing advanced classification models (e.g., Naïve Bayes, Random Forest, SVM).
Performing hyperparameter tuning and cross-validation to improve model performance.
Generating ROC curves and AUC scores to assess classifier effectiveness.
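
A minimal sketch of the baseline classifier and its cross-validated evaluation: TF-IDF features feeding logistic regression, tuned with grid search and scored by ROC-AUC. The toy corpus below stands in for the real labeled email dataset, which is not included here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the labeled email corpus (1 = spam, 0 = ham).
emails = [
    "win a free prize now", "cheap meds limited offer",
    "meeting agenda for monday", "lunch tomorrow?",
    "claim your free gift card", "quarterly report attached",
]
labels = [1, 1, 0, 0, 1, 0]

# TF-IDF features (unigrams and bigrams, English stopwords removed)
# feeding a logistic regression baseline.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning with cross-validation, scored by ROC-AUC.
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(emails, labels)
print(f"best C = {grid.best_params_['clf__C']}, CV ROC-AUC = {grid.best_score_:.2f}")
```
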
Contribution
Feature Engineering for Text Data: Extracted relevant features from email text using TF-
IDF, bag-of-words, and tokenization techniques.
Supervised Machine Learning: Built and optimized classification models using Scikit-learn.
Model Performance Evaluation: Analyzed confusion matrices, precision-recall, and ROC-
AUC curves to assess model effectiveness.
Overfitting Prevention & Validation: Applied cross-validation and hyperparameter tuning to
ensure model generalization.
Results and Impact
Achieved high classification accuracy using optimized spam detection models.
Identified key patterns in spam emails based on word frequency and NLP-based feature
extraction.
Improved model precision-recall tradeoff to reduce false positives in email classification.
Provided a scalable framework for real-world spam detection in email filtering systems.

Skills Gained and Tools Used

Natural Language Processing (NLP): Extracted and processed text-based features for classification using TF-IDF, bag-of-words, and n-grams.

Machine Learning & Classification Models: Built and optimized spam detection models using Logistic Regression, Naïve Bayes, and Random Forest.

Model Evaluation & Performance Metrics: Assessed classification effectiveness using ROC-AUC, confusion matrices, and precision-recall analysis.

Python for Data Science: Utilized Scikit-learn, NumPy, Pandas, and Seaborn for feature engineering, modeling, and visualization.
Page 10
EXAMPLE #5
Spam Email Classification Using Machine Learning (Continued)

Page 11
EXAMPLE #6
CITY PLANNING IN SAN FRANCISCO
Summary
This project investigates the relationship between urban planning, economic disparity, and
accessibility in San Francisco through a data-driven approach. By applying Marxist theory, the
study examines how capitalist-driven urban development has shaped economic stratification,
labor force trends, and infrastructure accessibility. The analysis is conducted using historical
labor force and income inequality data, spatial visualizations, and regression models to
highlight systemic inequalities and propose equitable planning strategies.
Problem and Approach
San Francisco's urban development reflects a growing divide between affluent communities
and marginalized populations, driven by economic cycles and city planning decisions. This
project explores (a regression sketch follows the list):
Labor Force & Income Inequality Trends: Analyzed changes in the civilian labor force over
time and its correlation with wealth disparity.
Impact of Economic Crises on Employment: Studied how the 2008 financial crisis and the
COVID-19 pandemic disrupted employment trends and exacerbated inequality.
Spatial Analysis of Unemployment Rates: Mapped unemployment distribution by zip code
from 2019 to 2022 to highlight areas most affected by economic downturns.
Accessibility in Urban Design: Examined the distribution of curb ramps to assess whether
public infrastructure investments are equitably allocated.
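
A minimal sketch of the trend regression with statsmodels, using simulated yearly series for labor force size and a Gini-style inequality index; the project used real historical San Francisco data, which is not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical yearly San Francisco series standing in for the
# historical labor force and income inequality data.
rng = np.random.default_rng(1)
years = np.arange(2000, 2023)
labor_force = 430_000 + 4_000 * (years - 2000) + rng.normal(0, 5_000, years.size)
gini = 0.45 + 0.002 * (years - 2000) + rng.normal(0, 0.005, years.size)

df = pd.DataFrame({"labor_force": labor_force, "gini": gini})

# OLS: quantify how income inequality moves with labor force size.
X = sm.add_constant(df["labor_force"])
model = sm.OLS(df["gini"], X).fit()
print(model.summary().tables[1])  # coefficients, std. errors, p-values
```
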
Contribution
Data Collection & Processing: Cleaned and analyzed datasets.
Statistical Analysis: Conducted regression modeling to quantify the relationship between
workforce participation and economic disparity.
Geospatial Visualization: Created heatmaps of unemployment rates and accessibility distributions using Python-based mapping tools (a mapping sketch follows the list).
Urban Policy Recommendations: Proposed equitable city planning strategies that balance
economic growth with social inclusion.
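
A minimal sketch of the zip-code mapping step with GeoPandas, assuming a local boundary file sf_zipcodes.geojson and a CSV of unemployment rates keyed by zip code; both file names and column names are hypothetical.

```python
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

# Zip-code boundaries and unemployment rates (hypothetical local files).
zips = gpd.read_file("sf_zipcodes.geojson")      # assumed to have a 'zip' column
rates = pd.read_csv("unemployment_by_zip.csv",   # assumed columns: zip, rate_2022
                    dtype={"zip": str})

# Join the tabular rates onto the geometries by zip code.
merged = zips.merge(rates, on="zip", how="left")

# Choropleth: darker areas mark higher 2022 unemployment.
ax = merged.plot(column="rate_2022", cmap="OrRd", legend=True,
                 missing_kwds={"color": "lightgrey"})
ax.set_axis_off()
ax.set_title("San Francisco unemployment rate by zip code (2022)")
plt.show()
```
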
Results and Impact
Confirmed strong correlation between labor force changes and income inequality in SF.
Identified unemployment disparities across zoning districts, advocating targeted
interventions.
Found the curb ramp distribution to be uniform, challenging assumptions of wealth-based favoritism.
Highlighted economic crises' impact on spatial inequalities, urging inclusive urban planning.

Skills Gained and Tools Used

Data Analysis & Visualization: Utilized Python (Pandas, Matplotlib, Seaborn, Plotly, Geopandas) to analyze historical labor force and income inequality data.

Geospatial Mapping: Created heatmaps and spatial visualizations to examine unemployment distribution and curb ramp accessibility.

Regression & Statistical Modeling: Developed predictive models to measure the relationship between workforce trends and economic inequality.

Urban Policy & Planning: Applied Marxist theory and economic analysis to critique city planning strategies and propose equitable solutions.
Page 12
EXAMPLE #6
City Planning in San Francisco (Continued)

Page 13
