CLUSTERING OF EMPLOYEE SALARIES BASED ON DEMOGRAPHIC DATA
AN INTERNSHIP REPORT
Submitted by
JEEVA P (620822243041)
for the award of the degree of
BACHELOR OF TECHNOLOGY
DECEMBER 2024
ANNA UNIVERSITY, CHENNAI 600 025
BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
We are grateful to the Almighty for the grace and sustained blessings throughout the project and for the immense strength given to us in executing the work successfully. We would like to express our heartfelt thanks to our beloved Chairman Dr. T. ARANGANNAL, Chairperson Mrs. P. MALALEENA, and Vice Chairperson Ms. MADHUVANTHINIE ARANGANNAL, Gnanamani Educational Institutions, Namakkal, for giving us the opportunity to undertake and complete this project.
We would like to express our deep sense of gratitude and profound thanks to our Principal, Dr. T. K. KANNAN, and our Academic Director, Dr. B. SANJAY GANDHI, Gnanamani College of Technology, Namakkal, for creating a wonderful atmosphere which inspired us to take up this summer internship.
[JEEVA P]
Gnanamani College of Technology
(An Autonomous Institution)
Accredited by NBA & NAAC with “A” grade
A.K. Samuthiram, Pachal (PO), Namakkal – 637 018
INSTITUTE VISION
Emerging as a technical institution of high standard and excellence to produce quality Engineers, Researchers, Administrators and Entrepreneurs with ethical and moral values to contribute to the sustainable development of the society.
INSTITUTE MISSION
We facilitate our students:
● To gain in-depth domain knowledge with analytical and practical skills in cutting-edge technologies by imparting quality technical education.
● To become industry-ready, multi-skilled personalities who transfer technology to industries and rural areas, by creating interest among students in Research and Development and Entrepreneurship.
DEPARTMENT VISION
To be a Centre of Artificial Intelligence and Data Science by imparting quality education,
promoting research and innovation with global relevance.
DEPARTMENT MISSION
To impart holistic education with niche technologies for the enrichment of knowledge and skills through an updated curriculum and inspired learning.
To impart value-based AI education to the students for developing intelligent systems and innovative products that address societal problems with ethical values.
To work in close liaison with industry to achieve socio-economic development.
ABSTRACT
In today's competitive job market, understanding the factors that influence employee salaries is
crucial for organizations aiming to attract and retain talent. This project focuses on the predictive
analysis and clustering of employee salaries using demographic data, including age, gender,
education, job title, and years of experience. By leveraging advanced statistical techniques and
machine learning algorithms, we aim to uncover patterns and relationships within the data that can
inform salary structuring and workforce planning.
The project begins with comprehensive data collection and preprocessing, followed by exploratory
data analysis (EDA) to identify trends and anomalies. We employ regression models, such as
Linear Regression and Random Forest, to predict employee salaries based on demographic
features, evaluating model performance through metrics like Mean Squared Error (MSE) and R²
score.
This study focuses on the predictive analysis and clustering of employee salaries based on
demographic data, aiming to uncover patterns and trends that influence wage distribution. By
analyzing variables such as age, education, experience, and job classification, we employ machine
learning techniques to develop predictive models that estimate salary levels and identify clusters of
employees with similar salary characteristics. Utilizing historical salary data alongside
demographic information, the research seeks to enhance understanding of salary determinants and
address wage inequality within organizations. The expected outcomes include improved
compensation strategies, identification of at-risk employee groups for turnover, and informed
decision-making for HR policies. Ultimately, this analysis provides valuable insights that can help
organizations create equitable salary structures and enhance talent management initiatives.
The research utilizes a comprehensive dataset that combines historical salary information with
demographic attributes, allowing for a nuanced understanding of the factors that drive salary
variations. Expected outcomes include the identification of salary determinants, insights into wage
inequality, and the recognition of employee groups at risk of turnover due to compensation issues.
Ultimately, this analysis seeks to empower organizations with data-driven insights that inform
equitable compensation strategies, enhance employee retention efforts, and support effective talent
management initiatives, fostering a more inclusive and fair workplace environment.
CHAPTER-1
INTRODUCTION
An internship is a professional learning experience that offers meaningful, practical work related to a student's field of study or career interest. An internship gives a student the opportunity for career exploration and development, and a chance to learn new skills. It offers the employer the opportunity to bring new ideas and energy into the workplace, develop talent and potentially build a pipeline for future full-time employees. It is an official program offered by organisations to help train and provide work experience to students and recent graduates. Although the idea of working as an intern has been around for a while, it has undergone significant change. The early internships were run by labourers who took on young people and trained them in their craft or profession. The trainee would consent to work for the labourer for a set period of time in return for being taught a skill. Even then, the goal of an internship, or better still an apprenticeship, was to acquire new skills in order to be able to find employment in the future.
ADDRESS
Fantasy Solution,
No. 16, Samnath Plazza,
Sweets, Melapudur,
Landline: 0431-4971630
Website: www.fantasysolution.in
Company profile:
Fantasy Solution, a leading IT solution and service provider, delivers innovative information technology-enabled solutions and services to meet the demands arising from social transformation, shaping new lifestyles for individuals and creating value for society.
Focusing on software technology, Fantasy Solution provides industry solutions and product engineering solutions, related software products and platforms, and services, through the seamless integration of software and services, software and manufacturing, as well as technology and industrial management capacity.
Fantasy Solution helps industry customers establish best practices in business development and management. The services Fantasy Solution offers include real-time projects, web designing, web hosting, software development and training, in many of which it has a leading market share. Notably, Fantasy Solution has participated in the formulation of many national IT standards and specifications.
Fantasy Solution has world-leading product engineering capabilities, ranging from consultation, design, R&D and integration to the testing of embedded software, in the fields of automotive electronics, smart devices, digital home products, and IT products. The software provided by Fantasy Solution runs in the products of a number of globally renowned brands.
Our services:
In this ever-changing environment, keeping a competitive edge means being able to anticipate and respond quickly to changing business conditions. Fantasy Solution is a global software development company providing IT solutions to enterprises worldwide. Combining proven expertise in technology with an understanding of emerging business needs, we deliver Electronic Health Records, CMS software, Payment Gateway solutions, Time and attendance tracking software, Debt collection software, Appointment Reminder Solutions, Medical Transcription Services, etc. We study, design, develop, enhance, customize, implement, maintain and support various aspects of information technology.
Our expertise covers development in .NET, PHP and Java, web designing and development, mobile application development on Android platforms, MATLAB for image processing, Network Simulator (NS2), data mining tools, and big data development using the R tool and Python. We aim to carve a position at the forefront, and it is our continuing goal to gain the trust of our clients. Our motto is to serve the purpose of our clients with perfection.
Mission
Providing high-quality software development services, professional consulting and development outsourcing that improve our customers' operations.
Gain experience:
Job listings often state that they prefer candidates with educational and job experience. If
you're new to the workforce or attending school, you may consider looking for an internship to
gain the experience required for most entry-level positions.
An internship can give you an authentic experience in a job role by providing you with an introductory experience of a career path, its duties and daily operations. If you enjoy your internship, it might indicate that you are on the right career path.
Strengthen a resume:
Internships can give you workplace experience before you actually enter the workforce.
They also may assist you with developing additional skills to list on your resume, which can
emphasize your value as a candidate.
Some internship opportunities may offer you college credit for your time as an intern. An
internship that offers you both college credit and experience can be ideal for those who are looking
to graduate with work experience.
1: Introduction to Data Science and Python
Key Learnings:
● Basics of Data Science.
● Introduction to Python programming for data analysis.
● Installing necessary libraries like Pandas, NumPy, and Matplotlib.
Data science is an interdisciplinary field that combines statistics, mathematics, programming, and
domain knowledge to extract insights from data.
Python is one of the most popular programming languages for data science due to its
simplicity, readability, and extensive libraries.
Data collection can involve various sources, including databases, APIs, web scraping, and
CSV files.
Data cleaning and preprocessing are crucial steps, involving handling missing values, removing
duplicates, and transforming data types.
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their
main characteristics, often using visual methods.
Understanding basic statistics, including measures of central tendency (mean, median, mode)
and variability (variance, standard deviation), is essential for data analysis.
Data visualization is key to communicating insights effectively, helping to identify
patterns, trends, and outliers in the data.
import pandas as pd

# Toy employee dataset (reconstructed from the printed output below)
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
print(df)

Output:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
1.1: Descriptive Statistics and Data Types
Key Learnings:

# Descriptive Statistics
print(df.describe())

Output:
             Age        Salary
count   4.000000      4.000000
mean   32.500000  65000.000000
std     6.454972  12909.944487
min    25.000000  50000.000000
25%    28.750000  57500.000000
50%    32.500000  65000.000000
75%    36.250000  72500.000000
max    40.000000  80000.000000
• Identifying missing values can be done using visualizations and summary statistics to detect
patterns in the dataset.
• Imputation involves filling in missing values using statistical methods, such as mean, median, or
mode imputation, as well as more advanced techniques like k-nearest neighbors or regression
imputation.
• Flagging can also be useful, where a new binary variable is created to indicate whether a value was missing, allowing for further analysis of the impact of missing data on the results.
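A minimal sketch of the flagging-plus-imputation approach described above, using a small hypothetical Salary column:

import numpy as np
import pandas as pd

# Hypothetical column with one missing salary
df_miss = pd.DataFrame({'Salary': [50000, np.nan, 70000, 80000]})

# Flag rows that were missing before imputing, so that information is not lost
df_miss['Salary_missing'] = df_miss['Salary'].isna().astype(int)

# Mean imputation: replace NaN with the column mean
df_miss['Salary'] = df_miss['Salary'].fillna(df_miss['Salary'].mean())
print(df_miss)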
Data preprocessing is a crucial step in the data analysis pipeline that prepares raw data for
modeling. It involves cleaning the data, handling missing values, and transforming features to
improve model performance.
Understanding the importance of data normalization and standardization is essential, as these
techniques help to bring different features onto a similar scale, which can enhance the
convergence of optimization algorithms.
Feature scaling methods include min-max scaling, which rescales the data to a fixed range,
typically [0, 1], and z-score standardization, which centers the data around the mean with a
unit variance.
Choosing the right scaling method depends on the specific characteristics of the data and the
algorithms being used.
For instance, algorithms like k-nearest neighbors and support vector machines are sensitive to the
scale of the data, while tree-based algorithms are generally not affected by feature scaling.
from sklearn.preprocessing import StandardScaler

# Standardize Age and Salary to zero mean and unit variance
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df)

Output:
      Name       Age    Salary
0    Alice -1.341641 -1.341641
1      Bob -0.447214 -0.447214
2  Charlie  0.447214  0.447214
3    David  1.341641  1.341641
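For comparison, a minimal sketch of the min-max scaling mentioned above, on a fresh copy of the raw columns (since df was standardized in place):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Fresh copy of the raw values
df_raw = pd.DataFrame({'Age': [25, 30, 35, 40],
                       'Salary': [50000, 60000, 70000, 80000]})
df_raw[['Age', 'Salary']] = MinMaxScaler().fit_transform(df_raw[['Age', 'Salary']])
print(df_raw)  # every value now lies in [0, 1]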
Data exploration is a critical phase in the data analysis process that involves examining datasets
to uncover patterns, trends, and relationships.
It helps in understanding the structure and characteristics of the data, which is essential for making
informed decisions about subsequent analysis and modeling.
Key learnings from data exploration include the importance of summarizing data through
descriptive statistics, such as mean, median, mode, and standard deviation, to gain insights into the
central tendency and variability of the data.
Visualizations play a vital role in data exploration, as they can reveal underlying patterns that may not be immediately apparent in raw data.
Techniques such as histograms, box plots, scatter plots, and heatmaps can help identify distributions, correlations, and potential outliers.
Understanding the distribution of variables is crucial for selecting appropriate statistical methods
and algorithms for analysis.
Identifying and handling missing values is another important aspect of data exploration. It is
essential to assess the extent of missing data and decide on strategies for imputation or removal, as
this can significantly impact the results of the analysis.
Additionally, exploring relationships between variables through correlation analysis can
provide insights into how features interact with one another, guiding feature selection for
modeling.
"Image:
A scatter plot to visualize the relationship between Age and Salary.
python
Copy code
# Scatter plot to analyze the relationship between Age and Salary sns.scatterplot(x='Age',
y='Salary', data=df)
• Supervised Learning: The model is trained on labeled data, meaning that both the input data and the corresponding output labels are provided. The algorithm learns the relationship between the inputs and outputs to make predictions on unseen data. Examples include linear regression, logistic regression, and decision trees.
• Unsupervised Learning: The model is trained on unlabeled data and must find hidden patterns or relationships in the data. Common examples are clustering and dimensionality reduction techniques like k-means clustering and PCA (Principal Component Analysis).
Machine learning (ML) is a branch of artificial intelligence that enables computers to learn from data and make predictions or decisions without being explicitly programmed. It encompasses various techniques and approaches that allow systems to improve their performance over time as they are exposed to more data.
There are three primary types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training models on labelled datasets, where the input-output pairs are known. This approach is commonly used for tasks such as classification and regression. In contrast, unsupervised learning deals with unlabelled data, aiming to identify patterns or groupings within the data, such as clustering or dimensionality reduction.
Understanding linear regression for predictive modelling:
# Simple linear regression example to predict Salary based on Age
X = df[['Age']]   # Feature
y = df['Salary']  # Target
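The fitting step itself is not shown here; a minimal sketch under the standard scikit-learn pattern follows (the names model and y_pred are assumptions carried into the evaluation snippet below):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit the Age -> Salary relationship and predict on the same data
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)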
Evaluating models using metrics like accuracy, mean squared error, and R-squared.
Use performance metrics such as accuracy, precision, recall, F1 score, and ROC-AUC to
assess model effectiveness.
Analyze the confusion matrix to summarize the model's performance, including true positives, true
negatives, false positives, and false negatives.
● Implement cross-validation techniques, like K-fold cross-validation, to evaluate how well the model generalizes to unseen data (see the sketch after the evaluation code below).
● Be aware of overfitting, where the model learns noise in the training data, and
underfitting, where the model is too simplistic to capture underlying patterns.
● Consider the trade-off between model complexity and performance, as more complex
models may lead to overfitting.
● Identify feature importance to understand which variables contribute most to the model's
predictions, aiding in feature selection.
● Optimize hyperparameters using methods like Grid Search or Random Search to improve
model performance.
● Utilize learning curves to visualize model performance on training and validation datasets as
the training set size varies.
● Understand the bias-variance tradeoff, balancing the error from oversimplification (bias)
and excessive complexity (variance).
# Evaluate model
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f'MSE: {mse}, R2: {r2}')
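A minimal sketch of the K-fold cross-validation mentioned above (cv=2 only because the toy dataset has four rows; real datasets would use cv=5 or more):

from sklearn.model_selection import cross_val_score

# 2-fold cross-validated R-squared scores for the same model
scores = cross_val_score(LinearRegression(), X, y, cv=2, scoring='r2')
print(scores, scores.mean())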
• The primary goal of K-Means is to divide the data into K clusters, where each data point is
assigned to the cluster with the nearest mean, known as the centroid.
• The algorithm operates through a series of iterative steps. Initially, K centroids are
chosen randomly from the dataset.
• In the assignment step, each data point is assigned to the nearest centroid, forming K clusters.
In the update step, the centroids are recalculated by taking the mean of all data points assigned
to each cluster.
• This process of assignment and updating continues until the centroids stabilize or a predetermined number of iterations is reached.
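A minimal K-Means sketch on hypothetical age/salary data (K=3 is chosen arbitrarily; the features are standardized first because K-Means is distance-based):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical demographic data: 200 employees with random ages and salaries
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(22, 60, 200),          # ages
                     rng.uniform(30000, 120000, 200)])  # salaries
X_scaled = StandardScaler().fit_transform(X)

# Assign each employee to the nearest of K=3 centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
print(labels[:10], kmeans.cluster_centers_)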
• Max depth of trees: Controls the maximum depth of individual trees in decision tree-based
models (e.g., Random Forest, XGBoost).
• Regularization strength: Controls how much the model is penalized for being too complex
(e.g., L1/L2 regularization in linear models).
• Batch size and epochs: In neural networks, these hyperparameters determine how the training
data is fed into the model and how many times the model is updated.
• Kernel type: In SVM (Support Vector Machine), the kernel type (linear, RBF, etc.) can
drastically change model performance.
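A minimal Grid Search sketch over two of the hyperparameters listed above (the synthetic data, model choice, and parameter values are assumptions for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the salary dataset
rng = np.random.default_rng(0)
X = rng.uniform(20, 60, size=(100, 1))              # ages
y = 2000 * X[:, 0] + rng.normal(0, 5000, size=100)  # noisy salaries

# Search over tree depth and forest size with 3-fold cross-validation
param_grid = {'max_depth': [2, 4, None], 'n_estimators': [50, 100]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)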
2.3: Feature Engineering
Key Learnings:
• Improves Model Performance: Well-engineered features can make patterns in the data
more accessible to machine learning algorithms, leading to better model performance.
• Reduces Model Complexity: By creating more relevant features, you can sometimes reduce the
number of features needed, simplifying the model.
• Helps in Handling Missing Values: Feature engineering can help you handle missing values in
a way that doesn’t hurt model performance.
• Better Interpretability: In some cases, feature engineering can make the results more
interpretable for humans (e.g., creating features that directly correspond to business logic).
Time Features:
In time-series problems, extracting time-related features such as day, month, year, and
weekday can provide valuable information. For example, breaking down a timestamp into
hour, minute, and day can provide temporal patterns.
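A minimal sketch of this kind of extraction with pandas (the timestamps are hypothetical):

import pandas as pd

# Hypothetical timestamps broken down into time-related features
ts = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-05 09:30',
                                                '2024-02-14 17:45'])})
ts['year'] = ts['timestamp'].dt.year
ts['month'] = ts['timestamp'].dt.month
ts['weekday'] = ts['timestamp'].dt.dayofweek  # 0 = Monday
ts['hour'] = ts['timestamp'].dt.hour
print(ts)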
• Time series analysis is a statistical technique used to analyze time-ordered data points to
identify trends, seasonal patterns, and cyclical behaviours over time.
• One of the key learnings in time series analysis is the importance of understanding the underlying
components of the data, which typically include trend (the long-term movement), seasonality
(regular patterns that repeat over specific intervals), and noise (random variations).
• It is also crucial to assess the stationarity of the time series, as many statistical methods assume that the data's statistical properties do not change over time; techniques like differencing can be used to achieve stationarity (see the sketch after this list).
a. Trend:
• The long-term direction of the data. It could be increasing, decreasing, or stable over time.
• A trend is not necessarily linear; it could also be non-linear, such as exponential or logistic growth.
b. Seasonality:
• Seasonal variations refer to periodic fluctuations that occur at regular intervals within the data, often caused by factors like weather, holidays, or business cycles.
c. Noise (Irregular/Residual):
• Irregularities or random variations that cannot be explained by the trend or seasonality.
• It is assumed to be unpredictable and not part of the underlying pattern.
d. Cyclic Patterns:
• Unlike seasonality, which has a fixed period, cyclical patterns are fluctuations that occur due to economic, business, or other factors but do not have a fixed period.
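A minimal differencing sketch with pandas, on a hypothetical upward-trending series; the first difference removes the linear trend:

import numpy as np
import pandas as pd

# Hypothetical series with a strong linear trend plus noise
rng = np.random.default_rng(1)
s = pd.Series(100 * np.arange(12) + rng.normal(0, 10, 12))

diff = s.diff().dropna()  # first difference: values now fluctuate around 100
print(diff)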
2.5: Advanced Data Visualization
Key Learnings:
• Advanced data visualization involves the use of sophisticated techniques and tools to
create insightful and interactive representations of complex datasets.
• One of the key learnings in this field is the importance of choosing the right visualization type
based on the data characteristics and the story you want to convey.
• For instance, while bar charts are effective for comparing categorical data, line graphs are
better suited for showing trends over time.
• Additionally, incorporating interactivity through tools like dashboards allows users to explore data dynamically, enabling them to filter, zoom, and drill down into specific areas of interest, which enhances user engagement and understanding.
• Another critical aspect is the use of color, shapes, and sizes to encode information effectively. Thoughtful color palettes can help highlight key insights and differentiate between categories; maintaining accessibility for color-blind users is also essential.
• Furthermore, advanced visualizations often leverage techniques such as heatmaps, scatter plots, and network graphs to reveal patterns and relationships that may not be immediately apparent in traditional charts.
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap for correlation
corr = df[['Age', 'Salary']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
2.6: Project Implementation and Final Review
Key Learnings:
• One of the key learnings during project implementation is the importance of effective
communication among team members and stakeholders.
• Clear communication helps ensure that everyone is aligned with project goals, timelines, and
responsibilities, which can prevent misunderstandings and delays.
Final Thoughts:
The internship was a valuable learning experience in data science. I gained a solid understanding of Python, data manipulation, statistical analysis, machine learning, and visualization techniques. The hands-on projects and examples helped reinforce my learning, and I look forward to applying these skills in my future work.
CONCLUSION
The conclusion of this Data Science Internship Report encapsulates a transformative journey marked by substantial learning and professional growth. Throughout the internship, I gained practical experience in data collection, cleaning, analysis, and visualization, using tools such as Python, R, and SQL. I had the opportunity to work on diverse projects that honed my problem-solving skills and deepened my understanding of machine learning algorithms and statistical methods. Collaborating with seasoned professionals and contributing to real-world projects enriched my knowledge and provided valuable insights into the dynamic field of data science. This internship has solidified my passion for data-driven decision-making and has equipped me with the essential skills and confidence to pursue a successful career in data science. As I conclude this report, I am grateful for the mentorship and experiences that have significantly shaped my professional trajectory.