Lab - ETI Manual
Preface
The primary focus of any engineering laboratory/field work in the technical education system
is to develop the much-needed, industry-relevant competencies and skills. With this in view,
MSBTE embarked on the innovative ‘I’ Scheme curricula for engineering Diploma
programmes with outcome-based education as the focus and, accordingly, a relatively large
amount of time is allotted to practical work. This underlines the great importance of
laboratory work, making every teacher, instructor and student realize that each minute of
laboratory time needs to be utilized effectively to develop these outcomes, rather than spent on
other mundane activities. Therefore, for the successful implementation of this
outcome-based curriculum, every practical has been designed to serve as a ‘vehicle’ to
develop the industry-identified competency in every student. Practical skills are difficult
to develop through ‘chalk and duster’ activity in a classroom situation. Accordingly, the ‘I’
Scheme laboratory manual development team designed the practicals to focus on outcomes,
rather than the traditional, age-old practice of conducting practicals to ‘verify the theory’
(which may become a byproduct along the way).
This laboratory manual is designed to help all stakeholders, especially the students, teachers
and instructors, to develop the pre-determined outcomes in the student. Each student is
expected, at least a day in advance, to thoroughly read the procedure of the practical to be
performed the next day and to understand the minimum theoretical background associated
with it. Every practical in this manual begins by identifying the competency, industry-relevant
skills, course outcomes and practical outcomes, which serve as the key focal points for doing
the practical. Students thereby become aware of the skills they will achieve through the
procedure given there and of the necessary precautions to be taken, which will help them
apply these skills to solving real-world problems in their professional life.
This manual also provides guidelines to teachers and instructors to effectively facilitate
student-centered lab activities through each practical exercise by arranging and managing the
necessary resources, so that the students follow the procedures and precautions
systematically, ensuring the achievement of the outcomes. This manual is intended
for the third-year students of the Artificial Intelligence and Machine Learning programme.
This manual typically contains practicals related to big data and various aspects related to
the subject for enhanced understanding. Students are advised to go through this manual
thoroughly rather than only the topics mentioned in the curriculum. This course is designed to
introduce and familiarize students of computer engineering with this popular environment
so that the respective skills can be developed.
Although all care has been taken to check for mistakes in this laboratory manual, it is
impossible to claim perfection, especially as this is the first edition. Any such errors noticed and
suggestions for improvement brought to our notice are highly welcome.
Programme outcomes to be achieved through the practicals of this course: out of the ten
programme outcomes and the Artificial Intelligence and Machine Learning programme
specific outcomes, the following programme outcomes are expected to be achieved
significantly through the practicals of the course on big data analytics.
PO 3. Experiments and practice: Plan to perform experiments and practices to use the results
to solve broad-based Computer related problems.
PO 5. The engineer and society: Assess societal, health, safety, legal and cultural issues and
the consequent responsibilities relevant to practice in the field of Computer engineering.
PO 8. Individual and teamwork: Function effectively as a leader and team member in diverse/
multidisciplinary teams.
PO 10. Life-long learning: Engage in independent and life-long learning activities in the
context of technological changes in the Computer engineering field and allied industry
Sr. No.  Title of the Practical  (Course Outcomes addressed: CO a., CO b., CO c., CO d., CO e.)
1. Case study on big data and big data analysis (Walmart, Uber, Netflix, eBay, etc.)
2. Write a Pandas program
The following points are provided for teachers and instructors so that the industry-relevant
skills of the competency addressed by this course are developed in the students by performing
the practicals of this laboratory manual.
1. Teacher shall explain the prior concepts to the students before starting each experiment.
2. For practicals requiring tools to be used, the teacher should provide a demonstration of the
practical, emphasizing the skills which the student should achieve.
4. Teachers should give students an opportunity for hands-on practice after the demonstration.
5. Assess the skill achievement of the students and the COs of each unit.
6. Teacher is expected to share the skills and competencies to be developed in the students.
7. Teacher should ensure that the respective skills and competencies are developed in the
students after the completion of the practical exercise.
8. Teacher may provide additional knowledge and skills to the students even though these may
not be covered in the manual but are expected from the students by the industries.
9. Teacher may suggest that the students refer to additional related literature from the reference
books/websites/seminar proceedings.
10. During assessment, the teacher is expected to ask questions to the students to assess their
knowledge and skills related to that practical.
Students shall read the points given below for understanding the theoretical concepts and
practical applications.
1. Students shall listen carefully to the lecture given by the teacher about the importance of the
subject, the learning structure and the course outcomes.
2. Students shall organize the work in groups of two or three members and make a record
of all observations.
3. Students shall understand the purpose of the experiment and its practical implementation.
5. Students should feel free to discuss any difficulty faced during the conduct of the practical.
7. Students shall attempt to develop the related hands-on skills and gain confidence.
8. Students shall refer to technical magazines and websites related to the scope of the subject
and update their knowledge and skills.
10. Students should develop the habit of submitting the write-ups on the scheduled dates and times.
Content Page
Practical no 1: Case Study on Big Data and Big Data Analysis: Uber, Walmart,
Netflix, eBay
Practical significance:
Big data refers to large, complex sets of data that traditional data-processing software cannot
handle efficiently. This concept is revolutionizing businesses by providing deeper insights,
improving decision-making, enhancing customer experiences, and optimizing operations.
Companies like Uber, Walmart, Netflix, and eBay are leveraging big data to stay competitive
in their respective industries.
Practical outcomes
Information about the various systems like Uber, Walmart, Netflix and eBay
Big Data Applications: Uber has revolutionized the transportation industry by using big data
analytics to optimize the ride-hailing experience. Uber's system generates massive volumes
of data from riders, drivers, and their interactions.
• Dynamic Pricing (Surge Pricing): Uber uses real-time data from traffic patterns,
weather, and demand to adjust pricing in specific areas. This ensures a balance
between rider demand and driver availability.
• Route Optimization: Uber uses data to determine the fastest and most efficient
routes for drivers, taking into account real-time traffic conditions and historical data.
• Driver and Rider Behavior Analysis: Uber tracks user behavior to improve services
and personalize the customer experience. For example, Uber can use ride history to
suggest preferred drivers or routes for users.
• Supply-Demand Prediction: By analyzing historical ride data, Uber can predict where
and when rider demand is likely to rise, so that driver supply can be positioned accordingly.
Big Data Applications: Walmart uses big data across its supply chain, inventory
management, and customer relationship management systems. The company collects data
from point-of-sale systems, customer interactions, and its global supply chain.
• Inventory Management: Walmart utilizes big data analytics to track sales in real
time and optimize stock levels. By analyzing data on customer buying patterns, the
company can predict what products will be in demand, where, and when.
• Predictive Analytics: The company uses predictive models to forecast demand and
ensure products are available when and where customers need them, reducing
stockouts and overstock situations.
• Personalized Marketing: Walmart analyzes customer purchase data to create
personalized offers and promotions. The insights gained from big data help to target
the right customer with the right message at the right time.
• Supply Chain Optimization: Big data helps Walmart streamline its supply chain by
providing real-time insights into shipments, deliveries, and inventory levels. This
helps reduce costs and improve the efficiency of deliveries.
Big Data Applications: Netflix, a leader in streaming services, uses big data to analyze user
behavior, optimize content delivery, and recommend personalized content.
Big Data Applications: eBay uses big data to enhance the marketplace experience, improve
decision-making, and optimize pricing and seller performance.
• Dynamic Pricing and Auction Optimization: eBay uses big data algorithms to
adjust auction prices in real time based on factors like demand, competition, and time
left in the auction. This ensures that buyers get competitive prices, and sellers
maximize revenue.
• Fraud Detection: Big data analytics help eBay identify and prevent fraudulent
activities by analyzing transaction patterns and behaviors that deviate from the norm.
• User Behavior Analysis: By analyzing user interactions, eBay can personalize
product recommendations, search results, and advertisements. The platform also uses
behavioral data to improve the user interface and make it more engaging.
• Seller Performance and Analytics: eBay provides sellers with insights into their
sales data, customer feedback, and market trends. This helps sellers optimize their
offerings and improve their customer service.
• Uber: Dynamic pricing and route optimization based on traffic and demand.
• Walmart: Predicting product demand to optimize inventory.
• Netflix: Recommending content based on user behavior.
• eBay: Pricing optimization and fraud detection.
Procedure:
1. Uber: Dynamic Pricing and Demand Forecasting
Uber uses data to forecast demand and apply dynamic pricing (surge pricing) based on real-
time data. Let's simulate this with Python.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated demand data: rides requested at each hour of the day (illustrative values)
hours = np.arange(0, 24).repeat(20).reshape(-1, 1)
rides = 50 + 3 * hours.ravel() + np.random.normal(0, 10, hours.size)
X_train, X_test, y_train, y_test = train_test_split(hours, rides, test_size=0.25)
model = LinearRegression().fit(X_train, y_train)

# Predict demand
y_pred = model.predict(X_test)
Walmart uses data to predict product demand to optimize inventory levels. Let's simulate this
prediction using Python.
# Simulated weekly sales data (reusing the imports above; values are illustrative)
n = 500
product_id = np.random.randint(1, 21, n)
week_of_year = np.random.randint(1, 53, n)
promotion = np.random.randint(0, 2, n)
historical_sales = np.random.randint(50, 500, n)
product_demand = historical_sales + 40 * promotion + np.random.normal(0, 20, n)

# DataFrame
df = pd.DataFrame({
    'product_id': product_id,
    'week_of_year': week_of_year,
    'promotion': promotion,
    'historical_sales': historical_sales,
    'product_demand': product_demand
})

# Feature selection
X = df[['product_id', 'week_of_year', 'promotion', 'historical_sales']]
y = df['product_demand']

# Train a regression model and predict demand for new (here: the first few) observations
model = LinearRegression().fit(X, y)
new_data = X.head(3)
predictions = model.predict(new_data)
print("Predicted Product Demand:", predictions)
Netflix analyzes user viewing and rating data to recommend content. Let's simulate a small
ratings dataset and split it for training and testing a recommendation model.
# Simulated user-movie ratings (illustrative values)
data = {'user_id': [1, 1, 2, 2, 3, 3], 'movie_id': [10, 20, 10, 30, 20, 30], 'rating': [5, 3, 4, 2, 5, 4]}
df = pd.DataFrame(data)

# Train-test split
trainset, testset = train_test_split(df, test_size=0.25)
eBay uses big data for fraud detection. We will simulate fraud detection using anomaly
detection techniques with Isolation Forest.
from sklearn.ensemble import IsolationForest

# Simulated transaction data (illustrative values)
transaction_amount = np.random.exponential(100, 300)
item_category = np.random.randint(1, 6, 300)
user_behavior = np.random.normal(0, 1, 300)
fraudulent_transactions = np.random.binomial(1, 0.05, 300)

# DataFrame
df = pd.DataFrame({
    'transaction_amount': transaction_amount,
    'item_category': item_category,
    'user_behavior': user_behavior,
    'fraudulent': fraudulent_transactions
})

# Feature selection
X = df[['transaction_amount', 'item_category', 'user_behavior']]

# Fit an Isolation Forest; predict() returns -1 for anomalies and 1 for normal points
model = IsolationForest(contamination=0.05).fit(X)
df['anomaly'] = model.predict(X)
Conclusion:
2. How could Uber further optimize its pricing model using big data insights during extreme
weather or major events?
3. What role does big data play in shaping Walmart's customer marketing strategies and product
placements?
Practical no 2: Write a Pandas program
d. To find the sum, mean, max and min values of a specific column of a given Excel file
f. To select the specified columns and rows from a given data frame
Practical outcomes
Information about importing Pandas and writing the program
Resource required
Below is a Pandas program that addresses the requirements listed above:
import pandas as pd

# Read the Excel file into a DataFrame (file name and column name are placeholders)
df = pd.read_excel('data.xlsx')
column_name = 'Sales'

sum_value = df[column_name].sum()
mean_value = df[column_name].mean()
max_value = df[column_name].max()
min_value = df[column_name].min()

print(f"Sum: {sum_value}")
print(f"Mean: {mean_value}")
print(f"Max: {max_value}")
print(f"Min: {min_value}")

# Select the specified columns and rows from the DataFrame (here: first five rows of the column)
df_selected = df.loc[0:4, [column_name]]
print(df_selected)
Conclusion:
2) How do you find the sum and mean values of a specific column using Pandas?
Practical no 3
The ETL (Extract, Transform, Load) process is a crucial part of data integration. In this
process, data is extracted from one or more sources, transformed (cleaned, validated and
reshaped) into a suitable format, and loaded into a target system or file.
Practical outcomes
Information for ETL
Resource required
Theoretical background
To perform the ETL process for this practical, the following Python script incorporates
these steps:
import os
import zipfile
import requests
import pandas as pd
from datetime import datetime
# Example usage
download_url = 'https://example.com/source.zip' # Replace with the actual URL to download the zip file
zip_file_path = 'source.zip' # Path to save the downloaded zip file
extract_to = './extracted_files' # Directory to extract the zip content
file_paths = ['./extracted_files/data1.csv', './extracted_files/data2.csv'] # Paths of extracted CSV files
target_file_path = './target_data.csv' # The target file where data will be loaded
1. Log Function: log() logs the progress at each phase (Download, Extract, Transform,
Load).
2. Download the Source File:
o The download_file() function downloads a zip file from the provided URL using
the requests library.
3. Extract the Zip File:
o The extract_zip() function extracts the contents of the downloaded zip file into
the specified directory using zipfile.ZipFile.
4. Set the Path for the Target Files:
o This is handled through the file_paths and target_file_path variables, where
file_paths holds the locations of the extracted CSV files to read, and
target_file_path is where the final data will be saved.
5. Extract Data:
o The extract_data() function loads CSV files into pandas DataFrames, combines
them, and returns a single DataFrame.
6. Transform Data:
o The transform_data() function performs transformations on the data, such as
removing missing values and renaming columns.
7. Load Data:
o The load_data() function saves the transformed data into a target CSV file.
8. ETL Pipeline:
o The etl_pipeline() function runs the entire ETL process by calling the other
functions in sequence: downloading, extracting, transforming, and loading. A minimal
sketch of these functions is given below.
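The following is a minimal sketch of these functions, reusing the example variables
(download_url, zip_file_path, extract_to, file_paths, target_file_path) defined above; the log
file name and the transformations shown (dropping missing values and lower-casing column
names) are only illustrative.
import zipfile
import requests
import pandas as pd
from datetime import datetime

def log(message):
    # Append a timestamped progress message to a log file
    with open('etl_log.txt', 'a') as f:
        f.write(f"{datetime.now()} - {message}\n")

def download_file(url, zip_path):
    log("Download phase started")
    response = requests.get(url)
    with open(zip_path, 'wb') as f:
        f.write(response.content)
    log("Download phase completed")

def extract_zip(zip_path, extract_dir):
    log("Extract phase started")
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall(extract_dir)
    log("Extract phase completed")

def extract_data(paths):
    # Read each extracted CSV into a DataFrame and combine them into one
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)

def transform_data(df):
    log("Transform phase started")
    df = df.dropna()  # remove rows with missing values (illustrative transformation)
    df.columns = [c.strip().lower() for c in df.columns]  # normalize column names
    log("Transform phase completed")
    return df

def load_data(df, target_path):
    log("Load phase started")
    df.to_csv(target_path, index=False)
    log("Load phase completed")

def etl_pipeline(url, zip_path, extract_dir, paths, target_path):
    download_file(url, zip_path)
    extract_zip(zip_path, extract_dir)
    df = extract_data(paths)
    df = transform_data(df)
    load_data(df, target_path)

etl_pipeline(download_url, zip_file_path, extract_to, file_paths, target_file_path)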
Conclusion :
2) What types of transformations might be needed on each dataset (e.g., cleaning, normalization,
joining)?
Practical no 4
Practical significance: Hadoop is a framework used for storing and processing large
datasets in a distributed computing environment. A common use case for Hadoop involves
processing massive amounts of data (like logs or transactional data) in parallel across a
cluster of machines.
In this example, we'll walk through a simple Hadoop use case: processing a large text file
(like a log file) to count the frequency of each word. We'll implement this using Python and
the Hadoop ecosystem.
Practical outcomes
Information about Hadoop system
Resource required
For this use case, we will write a Python program to count the frequency of each word in a
large text file using Hadoop's MapReduce framework. Hadoop's pydoop library is commonly
used to interact with the Hadoop Distributed File System (HDFS) from Python; the MapReduce
job itself can be run through Hadoop Streaming, which lets a plain Python script act as the
mapper and reducer.
Steps:
1. Setup Hadoop: Make sure you have a running Hadoop cluster (single-node or multi-
node).
2. Install Pydoop: This library allows Python to interact with Hadoop. Install it via pip:
pip install pydoop
3. Input File: Let's assume you have a large text file (input.txt) stored in HDFS. You can
use the following Python program to count the words.
We'll write a MapReduce program using Hadoop's MapReduce framework and Python.
1. Mapper: It will take each line of the text, split it into words, and emit each word with
a count of 1.
2. Reducer: It will aggregate the counts for each word and output the final count.
wordcount.py
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:                      # emit each word with a count of 1
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    pairs = (line.strip().split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):   # input is sorted by word
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Hadoop Streaming runs this script as "python wordcount.py map" or "python wordcount.py reduce"
    mapper() if sys.argv[1] == "map" else reducer()
Explanation:
• The mapper function processes each line of input and emits each word followed by a
count of 1.
• The reducer function receives the word and its associated counts and sums them up
to give the total count for that word.
2. Run the Hadoop Job: Execute the word count MapReduce job by running the
following command.
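For example, using the Hadoop Streaming jar (the jar location, the HDFS input/output paths
and the Python interpreter name below are illustrative and depend on your installation):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input.txt \
    -output /user/hadoop/wordcount_output \
    -mapper "python wordcount.py map" \
    -reducer "python wordcount.py reduce" \
    -file wordcount.py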
3. Check the Output: Once the job completes, check the output directory.
This will give you a list of words and their corresponding counts in the text file.
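For example, assuming the output directory used above:
hadoop fs -cat /user/hadoop/wordcount_output/part-*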
• -input: Specifies the HDFS input path (the location of the file to be processed).
• -output: Specifies the HDFS output path (where the results will be stored).
• -mapper and -reducer: Specify the mapper and reducer programs. In this case, we are
running the Python script wordcount.py as both the mapper and reducer.
Conclusion
2) How would you handle a situation where the HDFS is running out of space due to large datasets?
Practical no 5
Create Hive table: a. Create Hive External Table. b. Load data into Hive table. c. Create Hive
Internal Table.
Below are the steps to create Hive tables (both external and internal), and load data into them.
These steps assume you have a Hadoop ecosystem with Hive configured and running. You
can execute these commands from the Hive command line interface (CLI) or from a script.
Practical outcomes
Information for HIVE
Resource required
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. The Hive Metastore (HMS) provides a central repository of metadata that can
easily be analyzed to make informed, data-driven decisions, and therefore it is a critical
component of many data lake architectures. Hive is built on top of Apache Hadoop and supports
storage on S3, ADLS, GS, etc., in addition to HDFS. Hive allows users to read, write, and
manage petabytes of data using SQL.
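A minimal HiveQL sketch of the three tasks (a)-(c) is given below; the table names, columns,
field delimiter and HDFS/local paths are illustrative and should be adapted to your dataset.
-- (a) Create an external Hive table; the data stays at the given HDFS location
CREATE EXTERNAL TABLE IF NOT EXISTS your_external_table (
    id INT,
    name STRING,
    salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/your_external_table';

-- (b) Load data into the Hive table (from HDFS, or from the local filesystem with LOCAL)
LOAD DATA INPATH '/user/hive/data/file.csv' INTO TABLE your_external_table;
LOAD DATA LOCAL INPATH '/path/to/local/file.csv' INTO TABLE your_external_table;

-- (c) Create an internal (managed) Hive table; Hive manages both the data and the metadata
CREATE TABLE IF NOT EXISTS your_internal_table (
    id INT,
    name STRING,
    salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';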
After creating the external table, you can load data into it using the LOAD DATA INPATH
statement when the file is already in HDFS.
Alternatively, you can use the LOCAL keyword if your data is on the local filesystem.
An internal table (also known as a managed table) means Hive will manage both the data and
the metadata. The data will be stored in Hive's default warehouse directory unless specified
otherwise. If you drop the table, both the data and the metadata will be deleted.
Again, you can use LOCAL if the data resides on the local filesystem:
LOAD DATA LOCAL INPATH '/path/to/local/file.csv' INTO TABLE your_internal_table;
Conclusion
Practical no 06
To load data into a Hive table, there are multiple ways depending on where the data is stored
(local file system, HDFS, etc.) and the tools you use. Here are the steps for each of the
approaches you mentioned:
Practical outcomes
Information for HIVE
Resource required
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. The Hive Metastore (HMS) provides a central repository of metadata that can
easily be analyzed to make informed, data-driven decisions, and therefore it is a critical
component of many data lake architectures. Hive is built on top of Apache Hadoop and supports
storage on S3, ADLS, GS, etc., in addition to HDFS. Hive allows users to read, write, and
manage petabytes of data using SQL.
To load data from the local file system into a Hive table, you need to first copy the data to
HDFS (Hive typically uses HDFS to store the data). Here's the general process:
1. Copy the file from the local file system to HDFS: You first need to copy the file from the
local file system into HDFS, for example using the hadoop fs -put command:
hadoop fs -put /path/to/local/file /user/hive/warehouse/hive_table_name/
Here, /path/to/local/file is the path to the local file you want to load, and
/user/hive/warehouse/hive_table_name/ is the HDFS directory where you want to store the
data.
2. Load the data from HDFS to Hive Table: After uploading the file to HDFS, use the
LOAD DATA command in Hive to load the data into your table.
LOAD DATA INPATH '/user/hive/warehouse/hive_table_name/file' INTO TABLE your_hive_table;
This loads the file from the specified HDFS path into your Hive table.
If your data is already stored on HDFS, you can load it directly into a Hive table with the
LOAD DATA command:
LOAD DATA INPATH '/user/hive/warehouse/hive_table_name/file' INTO TABLE your_hive_table;
This command loads the data from the specified HDFS location into your Hive table.
If you want to move data directly into the Hive table's directory location (e.g., using the file
system or a shell), follow these steps:
1. Find the Hive table's storage location: When a Hive table is created, it typically has
a corresponding directory in HDFS under /user/hive/warehouse/ (unless you specified a
different location).
DESCRIBE FORMATTED your_hive_table;
This will show you the location of the table's directory in HDFS (under the Location
section).
2. Copy the data into this directory: After you know the location, you can use HDFS
commands to copy the data into it:
hadoop fs -put /path/to/local/file /user/hive/warehouse/your_hive_table/
Alternatively, you could copy the data using hadoop fs -copyFromLocal or use an HDFS
tool like hdfs dfs -cp.
If you want to import data from a relational database (e.g., MySQL, PostgreSQL) into a Hive
table, you can use Sqoop. Here's how to perform a Hive import using Sqoop:
sqoop import --connect jdbc:mysql://<hostname>:<port>/<db_name> \
--username <username> --password <password> \
--table <source_table_name> \
--hive-import \
--hive-table <hive_table_name> \
--create-hive-table
After the import completes, you can verify the imported data by querying the Hive table:
SELECT * FROM your_hive_table;
These methods cover different ways of loading data into Hive from local file systems, HDFS,
and relational databases using Sqoop. You can choose the one that best fits your use case.
Conclusion :
Practical no 07
To create Hive tables with the specified storage formats, you can define the tables using
different STORED AS clauses for each format. Below are the SQL statements for creating Hive
tables using the storage formats you requested.
Practical outcomes
Information for HIVE
Resource required
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. The Hive Metastore (HMS) provides a central repository of metadata that can
easily be analyzed to make informed, data-driven decisions, and therefore it is a critical
component of many data lake architectures. Hive is built on top of Apache Hadoop and supports
storage on S3, ADLS, GS, etc., in addition to HDFS. Hive allows users to read, write, and
manage petabytes of data using SQL.
-- Example: a table stored in the SequenceFile format (table and column names are illustrative)
CREATE TABLE table_seq (
    column1 STRING,
    column2 INT,
    column3 DOUBLE
)
STORED AS SEQUENCEFILE;
Conclusion
• For TextFile, we use the ROW FORMAT DELIMITED clause to specify how the data is
delimited (commonly comma-separated, but this can be changed); a sketch for this and other
formats is given after these notes.
• For SequenceFile, RCFile, Avro, ORC, and Parquet, no row format details are
needed since these formats define their structure internally.
• Ensure the appropriate libraries are loaded in your Hive environment for working with
Avro, Parquet, and ORC.
You can modify the column names and data types according to your actual dataset.
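For reference, a minimal sketch of CREATE TABLE statements for the TextFile, ORC and
Parquet formats is given below; the table and column names are illustrative.
-- TextFile: the row format must be specified explicitly
CREATE TABLE table_text (
    column1 STRING,
    column2 INT,
    column3 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- ORC and Parquet: these formats define their structure internally
CREATE TABLE table_orc (
    column1 STRING,
    column2 INT,
    column3 DOUBLE
)
STORED AS ORC;

CREATE TABLE table_parquet (
    column1 STRING,
    column2 INT,
    column3 DOUBLE
)
STORED AS PARQUET;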
Practical no 8
Write a Spark application to count the total number of WARN lines in the logs.txt file.
Practical outcomes
Information for SPARK
Resource required
To count the total number of WARN lines in the logs.txt file using a Spark application, we can
implement it in either Scala or Python. Both implementations are shown below.
You can use PySpark to read the logs.txt file, filter the lines that contain WARN, and then
count the number of those lines.
from pyspark.sql import SparkSession
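# A minimal sketch of the steps described below (the path to logs.txt is illustrative)
spark = SparkSession.builder.appName("Count WARN messages").getOrCreate()

# Read the log file as a DataFrame with a single 'value' column
logs_df = spark.read.text("logs.txt")

# Keep only the lines that start with WARN and count them
warn_count = logs_df.filter(logs_df.value.startswith("WARN")).count()
print(f"Total number of WARN lines: {warn_count}")

# Stop the Spark session
spark.stop()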
Explanation:
1. SparkSession is initialized to create a Spark context.
2. The logs.txt file is read as text using spark.read.text().
3. We use filter() to keep only the rows where the line starts with WARN.
4. The count() method is used to get the number of WARN lines.
5. Finally, the Spark session is stopped using spark.stop().
Scala:
import org.apache.spark.sql.SparkSession

object WarnLineCounter {
  def main(args: Array[String]): Unit = {
    // Initialize Spark session
    val spark = SparkSession.builder
      .appName("Count WARN messages")
      .getOrCreate()

    // Read the log file, keep the lines starting with WARN, and count them
    val logs = spark.read.textFile("logs.txt")
    val warnCount = logs.filter(line => line.startsWith("WARN")).count()
    println(s"Total number of WARN lines: $warnCount")

    spark.stop()
  }
}
Explanation:
1. SparkSession is created in a similar way to the Python example.
2. The logs.txt file is read using spark.read.text().
3. The filter() method is used to keep the lines that begin with WARN.
Conclusion
Practical no 09
Create a DataFrame from the created log file
Practical outcomes
Resource required
To implement the creation of the log file and then load it into a Spark DataFrame, you can
follow these steps:
You can write the given data to a log file using Python or Scala. Below is an example of how
to write the data to a CSV file (logdata.log).
Python Code:
# Writing the data to a CSV-style log file logdata.log
log_data = '''10:24:25,10.192.123.23,http://www.google.com/searchString,ODC1
10:24:21,10.123.103.23,http://www.amazon.com,ODC
10:24:21,10.112.123.23,http://www.amazon.com/Electronics,ODC1
10:24:21,10.124.123.24,http://www.amazon.com/Electronics/storagedevices,ODC1
10:24:22,10.122.123.23,http://www.gmail.com,ODC2
10:24:23,10.122.143.21,http://www.flipkart.com,ODC2
10:24:21,10.124.123.23,http://www.flipkart.com/offers,ODC1'''

with open("logdata.log", "w") as f:
    f.write(log_data)
The same file can also be created from Scala. After saving the logdata.log file, you can use
Spark to load it into a DataFrame. Below is the code to do that:
Python:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("LogData") \
    .getOrCreate()
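# Load the log file into a DataFrame and assign the column names (the file has no header
# row; the path assumes the logdata.log file created above)
df = spark.read.csv("logdata.log").toDF("Time", "IP Address", "URL", "Location")
df.show(truncate=False)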
Scala:
// Initialize SparkSession
val spark = SparkSession.builder()
  .appName("LogData")
  .getOrCreate()
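// Load the log file into a DataFrame and assign the column names (the path assumes the
// logdata.log file created above)
val df = spark.read.csv("logdata.log").toDF("Time", "IP Address", "URL", "Location")
df.show(false)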
Explanation: The CSV file is read into a Spark DataFrame and the four columns are named
Time, IP Address, URL and Location.
Expected Output:
+--------+-------------+-------------------------------------------------+--------+
|    Time|   IP Address|                                              URL|Location|
+--------+-------------+-------------------------------------------------+--------+
|10:24:25|10.192.123.23|              http://www.google.com/searchString|    ODC1|
|10:24:21|10.123.103.23|                            http://www.amazon.com|     ODC|
|10:24:21|10.112.123.23|                http://www.amazon.com/Electronics|    ODC1|
|10:24:21|10.124.123.24|http://www.amazon.com/Electronics/storagedevices|    ODC1|
|10:24:22|10.122.123.23|                             http://www.gmail.com|    ODC2|
|10:24:23|10.122.143.21|                          http://www.flipkart.com|    ODC2|
|10:24:21|10.124.123.23|                   http://www.flipkart.com/offers|    ODC1|
+--------+-------------+-------------------------------------------------+--------+
Practical no 11
To read and write data stored in Apache Hive through Spark SQL using either Scala or
Python, we need to use the Spark SQL API, which allows us to interact with Hive tables in a
Spark cluster. Below are the steps for both Scala and Python implementations.
Practical outcomes
Information for Spark
Resource required
Prerequisites: a running Hadoop/Hive installation and a Spark build with Hive support (with
hive-site.xml available to Spark) are required.
Scala Implementation
To use Spark SQL with Hive in Scala, you will need to import the necessary libraries and
configure your SparkSession to connect to Hive.
import org.apache.spark.sql.SparkSession
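// A minimal sketch: create a SparkSession with Hive support (the warehouse path is illustrative)
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()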
// Reading data from a Hive table into a DataFrame
val df = spark.sql("SELECT * FROM your_hive_table_name")
df.show()
// Writing a DataFrame to a Hive table (overwriting if the table exists)
df.write
.mode("overwrite") // Options: append, overwrite, ignore, error
.saveAsTable("your_hive_table_name")
// Creating a table in Hive if it doesn't exist
spark.sql("""
CREATE TABLE IF NOT EXISTS your_hive_table_name (
id INT,
name STRING,
age INT
) USING hive
""")
Python Implementation
To use Spark SQL with Hive in Python, you need to follow similar steps. You can use
PySpark's SparkSession to interact with Hive.
from pyspark.sql import SparkSession
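# A minimal sketch: create a SparkSession with Hive support (the warehouse path is illustrative)
spark = SparkSession.builder \
    .appName("Spark Hive Example") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()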
# Reading data from a Hive table into a DataFrame
df = spark.sql("SELECT * FROM your_hive_table_name")
df.show()
# Writing a DataFrame to a Hive table (overwriting if the table exists)
df.write \
    .mode("overwrite") \
    .saveAsTable("your_hive_table_name")  # mode options: append, overwrite, ignore, error
# Creating a table in Hive if it doesn't exist
spark.sql("""
CREATE TABLE IF NOT EXISTS your_hive_table_name (
id INT,
name STRING,
age INT
) USING hive
""")
spark.stop()
Conclusion :
• The configuration spark.sql.warehouse.dir refers to the location where Spark will store metadata
and data for Hive tables.
• The method enableHiveSupport() is critical to connect Spark with the Hive metastore.
• For reading and writing, you can also use different data formats like Parquet, ORC, Avro, etc.,
depending on how your Hive tables are set up (a short example is given after these notes).
• When writing data to Hive, you can specify the mode: overwrite, append, ignore, or error.
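For example, a DataFrame can be written out and read back in Parquet format. The following
is a sketch that reuses the df and spark objects from the Python example above (before
spark.stop() is called); the path is illustrative.
# Write the DataFrame as Parquet files and read them back
df.write.mode("overwrite").parquet("/tmp/your_hive_table_parquet")
parquet_df = spark.read.parquet("/tmp/your_hive_table_parquet")
parquet_df.show()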
With this setup, you can interact with Hive tables using Spark SQL in both Scala and Python,
making data processing and querying more efficient in big data environments.