Data Engineering Lab
B.Tech. VI Semester
COURSE OBJECTIVES:
To explore and understand various data management and handling methods
To understand the concept of data engineering
To explore Hadoop framework and its components
To use Big Data tools and techniques for data processing
COURSE OUTCOMES: After completion of the course, the student should be able to
CO-1: Implement data management and handling methods
CO-2: Implement data engineering methods
CO-3: Acquire the skills to work with Hadoop framework activities
CO-4: Acquire the skills to work with NoSQL databases
CO-5: Use various Big Data tools and techniques for data management
CO-PO-PSO MAPPING:
      PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO-1   3   3   3   3   3   2   -   -   1   -    1    1    3    3    3
CO-2   3   3   3   3   3   2   -   -   1   -    1    1    3    2    2
CO-3   3   3   3   3   3   2   -   -   1   -    1    1    2    -    2
CO-4   3   3   3   3   3   2   -   -   1   -    1    1    3    -    3
CO-5   3   3   3   3   3   2   -   -   1   -    1    1    3    3    3
WEEK-1:
Basic Data Handling Commands:
1. Read data from a CSV file
2. Dimensions of the data
3. Display data (top 5 rows and the entire data set)
4. List the column names of a data frame
5. Rename the columns of a data frame
6. Display a specific single column or multiple columns of a data frame
7. Bind sets of rows of data frames
8. Bind sets of columns of data frames
9. Find missing values in the dataset
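A minimal pandas sketch of the commands above; the file name data.csv is a placeholder for the lab's own CSV file, and renamed_col is a hypothetical column name.

import pandas as pd

# 1. Read data from a CSV file (the file name is a placeholder)
df = pd.read_csv("data.csv")

# 2. Dimensions of the data: (number of rows, number of columns)
print(df.shape)

# 3. Display the top 5 rows, then the whole data frame
print(df.head())
print(df)

# 4. List the column names of the data frame
print(df.columns.tolist())

# 5. Rename a column (here the first column, to a hypothetical name)
df = df.rename(columns={df.columns[0]: "renamed_col"})

# 6. Display a single column, then multiple columns
print(df["renamed_col"])
print(df[df.columns[:2]])

# 7. Bind sets of rows (the frame is simply stacked on itself here)
rows_bound = pd.concat([df, df], axis=0, ignore_index=True)

# 8. Bind sets of columns
cols_bound = pd.concat([df, df], axis=1)

# 9. Find missing values per column
print(df.isnull().sum())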
WEEK-2:
1. Measures of central tendency: mean, median, mode
2. Measures of data spread
3. Dispersion of data: variance, standard deviation
4. Position of the different data values: quartiles, inter-quartile range (IQR)
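A minimal sketch of these measures with pandas; it assumes the same placeholder CSV file as Week-1 and simply takes the first numeric column.

import pandas as pd

df = pd.read_csv("data.csv")                          # placeholder file name
col = df.select_dtypes(include="number").iloc[:, 0]   # first numeric column

# Measures of central tendency
print("Mean:  ", col.mean())
print("Median:", col.median())
print("Mode:  ", col.mode().tolist())   # the mode can be multi-valued

# Spread / dispersion of the data
print("Range:             ", col.max() - col.min())
print("Variance:          ", col.var())
print("Standard deviation:", col.std())

# Position measures: quartiles and inter-quartile range (IQR)
q1, q2, q3 = col.quantile([0.25, 0.50, 0.75])
print("Q1, Q2 (median), Q3:", q1, q2, q3)
print("IQR:", q3 - q1)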
WEEK-3:
Basic Plots for Data Exploration (Use Iris dataset):
1. Generate a box plot for each of the four predictors.
2. Generate a box plot for a specific feature.
3. Generate a histogram for a specific feature.
4. Generate a scatter plot of petal length vs. sepal length.
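One way to produce these plots with matplotlib and seaborn, using the copy of the Iris dataset bundled with seaborn; the features chosen for the single-feature plots are just examples.

import matplotlib.pyplot as plt
import seaborn as sns

# Columns: sepal_length, sepal_width, petal_length, petal_width, species
iris = sns.load_dataset("iris")

# 1. Box plot for each of the four predictors
iris.drop(columns="species").plot(kind="box")
plt.show()

# 2. Box plot for a specific feature
sns.boxplot(y=iris["petal_length"])
plt.show()

# 3. Histogram for a specific feature
iris["sepal_width"].plot(kind="hist", bins=20)
plt.show()

# 4. Scatter plot of petal length vs. sepal length
plt.scatter(iris["sepal_length"], iris["petal_length"])
plt.xlabel("sepal length")
plt.ylabel("petal length")
plt.show()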
WEEK-4:
Data Pre-Processing Methods: (Use the Auto MPG dataset and perform the following tasks)
1. Removing outliers / missing values
2. Imputing standard values
3. Capping of values
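A sketch of these pre-processing steps; seaborn's bundled mpg dataset is used here as a stand-in for the Auto MPG data, and horsepower (which contains missing values) is the column being cleaned.

import seaborn as sns

mpg = sns.load_dataset("mpg")   # stand-in for the Auto MPG dataset

# 1a. Remove rows that contain missing values
no_missing = mpg.dropna()

# 1b. Remove outliers in horsepower using the 1.5 * IQR rule
q1, q3 = mpg["horsepower"].quantile([0.25, 0.75])
iqr = q3 - q1
no_outliers = mpg[mpg["horsepower"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 2. Impute a standard value (here the median) for missing horsepower entries
imputed = mpg.copy()
imputed["horsepower"] = imputed["horsepower"].fillna(imputed["horsepower"].median())

# 3. Cap extreme values at the 5th and 95th percentiles
capped = mpg.copy()
low, high = capped["horsepower"].quantile([0.05, 0.95])
capped["horsepower"] = capped["horsepower"].clip(lower=low, upper=high)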
WEEK-5:
Feature Construction: (Use packages that are applicable)
1. Dummy coding categorical (nominal) variables
2. Encoding categorical (ordinal) variables
3. Transforming numeric (continuous) features to categorical features
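A sketch of the three feature-construction tasks with pandas; the data frame, its column names and the category orderings are purely illustrative.

import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],      # nominal
    "size":   ["small", "large", "medium", "small"],  # ordinal
    "price":  [12.5, 64.0, 7.25, 40.0],               # continuous
})

# 1. Dummy coding of a nominal variable
dummies = pd.get_dummies(df, columns=["colour"])

# 2. Encoding an ordinal variable with an explicit order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# 3. Binning a continuous feature into categories
df["price_band"] = pd.cut(df["price"], bins=[0, 10, 50, 100],
                          labels=["low", "medium", "high"])

print(dummies)
print(df)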
WEEK-6:
Feature Extraction: (Use packages that are applicable)
1. Principal Component Analysis (PCA)
2. Singular Value Decomposition (SVD)
3. Linear Discriminant Analysis (LDA)
4. Feature Subset Selection
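A sketch of the four techniques with scikit-learn; the Iris data is used only as a convenient, already-available example.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# 1. Principal Component Analysis: unsupervised projection onto 2 components
X_pca = PCA(n_components=2).fit_transform(X)

# 2. Singular Value Decomposition (truncated to 2 components)
X_svd = TruncatedSVD(n_components=2).fit_transform(X)

# 3. Linear Discriminant Analysis: supervised projection onto 2 components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# 4. Feature subset selection: keep the 2 features with the highest ANOVA F-score
X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

print(X_pca.shape, X_svd.shape, X_lda.shape, X_sel.shape)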
WEEK-7:
HDFS (Storage)
A. Hadoop Distributed File System (HDFS)
Your first objective is to create a directory structure in HDFS using HDFS commands.
Create local files using Linux commands, move the files to an HDFS directory, and
vice versa.
i. Write a command to create the directory structure in HDFS.
ii. Write a command to move a file from the local Unix/Linux machine to HDFS.
Lab Instructions:
i. Your objective is to use HDFS commands to move data into HDFS for processing;
a sketch of these commands follows below.
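A minimal sketch that drives the HDFS command-line client from Python; it assumes a working Hadoop installation with hdfs on the PATH, and the directory and file names are placeholders.

import subprocess

def run(cmd):
    """Print and run a shell command, raising an error if it fails."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# i. Create a directory structure in HDFS
run("hdfs dfs -mkdir -p /user/student/lab7/input")

# Create a local file with an ordinary Linux command
run("echo 'hello hdfs' > sample.txt")

# ii. Copy the local file into HDFS (-moveFromLocal would move it instead)
run("hdfs dfs -put sample.txt /user/student/lab7/input/")

# ...and copy a file from HDFS back to the local file system
run("hdfs dfs -get /user/student/lab7/input/sample.txt copied_back.txt")

# List the HDFS directory to confirm the result
run("hdfs dfs -ls /user/student/lab7/input")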
WEEK-8 & 9:
MapReduce Programming (Processing Data)
The Hadoop MapReduce framework is developed in Java, but the framework also
allows you to write programs in other languages.
Word Count
The word count problem is the most famous MapReduce example. The same task can
be written as a plain Java program, but that takes a lot of time on a huge file, whereas
a MapReduce job processes even huge, distributed files in much less time. The
objective is to count the frequency of each word in a large text.
Lab Instructions:
Develop a MapReduce example program in a MapReduce environment to find out the
number of occurrences of each word in a text file (a sketch follows below).
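One way to avoid Java is Hadoop Streaming, which pipes the data through ordinary scripts; below is a sketch of a Python mapper and reducer for word count. The HDFS paths and the location of the streaming jar are placeholders that depend on the installation.

mapper.py:

#!/usr/bin/env python3
# Emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

reducer.py:

#!/usr/bin/env python3
# Sums the counts per word; the framework delivers the input sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Submit the job with Hadoop Streaming (the jar path varies by installation):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
    -input /user/student/lab7/input -output /user/student/lab9/wordcount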
Lab Instructions:
1. Create the table in Hive using a HiveQL statement.
2. Fill the table with sample data from one of the available sample databases.
3. Write a program that produces a list of properties with the minimum value (min_value),
the largest value (max_value) and the number of unique values. Before you start, execute
the prepare step to load the data into HDFS.
4. Generate a count per state.
5. Now that the properties have been extracted, calculate the number of records per state
(a sketch of these steps follows the list).
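A sketch of these steps driven from Python with the PyHive package (an assumption; the same HiveQL can be run directly from the Hive shell). The connection details, the table schema, the column names and the HDFS path are placeholders.

from pyhive import hive

# Placeholder connection to a running HiveServer2 instance
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# 1. Create the table with a HiveQL statement (the schema is a placeholder)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS properties (
        property_id INT,
        state STRING,
        property_value DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

# 2. Load sample data that was previously copied into HDFS
cursor.execute("LOAD DATA INPATH '/user/student/lab7/input/properties.csv' "
               "INTO TABLE properties")

# 3. Minimum value, maximum value and number of unique values
cursor.execute("""
    SELECT MIN(property_value) AS min_value,
           MAX(property_value) AS max_value,
           COUNT(DISTINCT property_value) AS unique_values
    FROM properties
""")
print(cursor.fetchall())

# 4/5. Number of records per state
cursor.execute("SELECT state, COUNT(*) AS record_count FROM properties GROUP BY state")
print(cursor.fetchall())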
Lab Instructions:
1. Write a program that lists the states and their counts from the input data.
TEXT BOOKS:
1. Machine Learning, Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das, Pearson
2. Big Data and Analytics, Seema Acharya, Subhashini Chellappan, Wiley
3. Machine Learning, Tom M. Mitchell, McGraw-Hill Education