0% found this document useful (0 votes)
5 views

DSciHomeworkAssignmentV4

The assignment requires the development of a salary prediction system for job postings that lack salary information, utilizing provided CSV files containing job metadata and salary data. The deliverables include a CSV file with predicted salaries, the code used for the prediction, and answers to specific questions regarding the methodology and data preparation. The accuracy of the predictions will be evaluated using root-mean-square error (RMSE).

Uploaded by

mahmutluck
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

DSciHomeworkAssignmentV4

The assignment requires the development of a salary prediction system for job postings that lack salary information, utilizing provided CSV files containing job metadata and salary data. The deliverables include a CSV file with predicted salaries, the code used for the prediction, and answers to specific questions regarding the methodology and data preparation. The accuracy of the predictions will be evaluated using root-mean-square error (RMSE).

Uploaded by

mahmutluck
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

Data Science Interview Assignment

Assignment Description
Being a job search engine, it’s helpful if we can suggest an approximate salary to job seekers
for a given job post. Unfortunately, not all job postings include the salary. This is where you
come in: your first task as an Indeed Data Scientist is to develop a salary prediction system.
The goal: provide estimated salaries for a new job posting.

Data Supplied
You are given three CSV (comma-separated) data files:
● train_features.csv​: Each row represents metadata for an individual job posting.
The “jobId” column represents a unique identifier for the job posting. The remaining
columns describe features of the job posting.
● train_salaries.csv​: Each row associates a “jobId” with a “salary”.
● test_features.csv​: Similar to ​train_features.csv​, each row represents
metadata for an individual job posting
The first row of each file contains headers for the columns. Keep in mind that the metadata
and salaries have been extracted by our aggregation and parsing systems. As such, it’s
possible that the data is dirty (may contain errors).

The Task
You must build a model to predict the salaries for the job postings contained in
test_features.csv​. The output of your system should be a CSV file entitled
test_salaries.csv​ where each row has the following format:
jobId,salary
As a reference, your output should mirror the format of ​train_salaries.csv​.

To judge the accuracy, we will compare your salary predictions to a ground-truth using the
root-mean-square error (RMSE).

As a guideline, you should expect to spend around 4 hours to complete this exercise
(including model training time). The assignment does not have to be completed all at once.
Please do not share the assignment, the data, or your solutions with anyone other than your
recruiter.

Deliverables
The following deliverables must be submitted to Indeed:
● Your ​test_salaries.csv​ file containing the salary predictions and job Ids for the
test data set (please use .zip or .gz compression).
● The code that you wrote to solve the problem
● Answers to the questions below [in ​.pdf​ or ​.txt​].
● Any related files such as figures, etc…

Please do not include your name in any of the deliverables, since evaluation is double blind.

Questions
Please answer the following questions.
1. How long did it take you to solve the problem?
2. What software language and libraries did you use to solve the problem? Why did you
choose these languages/libraries?
3. What steps did you take to prepare the data for the project? Was any cleaning
necessary?
4. a) What machine learning method did you apply?
b) Why did you choose this method?
c) What other methods did you consider?
5. Describe how the machine learning algorithm that you chose works.
6. Was any encoding or transformation of features necessary? If so, what
encoding/transformation did you use?
7. Which features had the greatest impact on salary? How did you identify these to be
most significant? Which features had the least impact on salary? How did you identify
these?
8. How did you train your model? During training, what issues concerned you?
9. a) Please estimate the RMSE that your model will achieve on the test dataset.
b) How did you create this estimate?
10. What metrics, other than RMSE, would be useful for assessing the accuracy of salary
estimates? Why?

You might also like