
BIG DATA ANALYTICS 22684

Preface

The primary focus of any engineering laboratory/field work in the technical education system
is to develop the much-needed industry relevant competencies and skills. With this in view,
MSBTE embarked on the innovative 'I' Scheme curriculum for engineering diploma programmes with outcome-based education as the focus, and accordingly a relatively large amount of time is allotted to practical work. This underlines the importance of laboratory work and requires every teacher, instructor and student to realize that each minute of laboratory time needs to be utilized effectively to develop these outcomes rather than spent on other mundane activities. Therefore, for the successful implementation of this outcome-based curriculum, every practical has been designed to serve as a 'vehicle' to develop the industry-identified competency in every student. Practical skills are difficult to develop through 'chalk and duster' activity in the classroom. Accordingly, the 'I' scheme laboratory manual development team designed the practicals to focus on outcomes, rather than the traditional age-old practice of conducting practicals to 'verify the theory' (which may still emerge as a byproduct along the way).

This laboratory manual is designed to help all stakeholders, especially the students, teachers and instructors, to develop the pre-determined outcomes in the student. Each student is expected, at least a day in advance, to thoroughly read the procedure of the practical to be performed the next day and to understand the minimum theoretical background associated with it. Every practical in this manual begins by identifying the competency, industry relevant skills, course outcomes and practical outcomes, which serve as the key focal point for doing the practical. Students thereby become aware of the skills they will achieve through the procedure shown there and the necessary precautions to be taken, which will help them apply these skills in solving real-world problems in their professional life.

This manual also provides guidelines to teachers and instructors to effectively facilitate
student-centered lab activities through each practical exercise by arranging and managing
necessary resources in order that the students follow the procedures and precautions
systematically ensuring the achievement of outcomes in the students. This manual is intended
for the Third-Year students of artificial intelligence and machine learning.

This manual contains practicals related to big data and various aspects of the subject for enhanced understanding. Students are advised to go through the entire manual rather than only the topics mentioned in the curriculum. This course is designed to introduce and familiarize students with this popular environment so that the respective skills can be developed.

Although all care has been taken to check for mistakes in this laboratory manual, it is impossible to claim perfection, especially as this is the first edition. Any errors found and suggestions for improvement brought to our notice are highly welcome.

Programme Outcomes to be achieved through practicals of this course: Out of the ten programme outcomes and the artificial intelligence and machine learning programme specific outcomes, the following programme outcomes are expected to be achieved significantly through the practicals of the course on Big Data Analytics.


PO 1. Basic knowledge: Apply knowledge of basic mathematics, sciences and basic engineering to solve the broad-based Computer related problems.

PO 2. Discipline knowledge: Apply Computer engineering discipline-specific knowledge to solve core computer engineering related problems.

PO 3. Experiments and practice: Plan to perform experiments and practices and to use the results to solve broad-based Computer related problems.

PO 4. Engineering tools: Apply relevant Computer programming tools with an understanding of the limitations.

PO 5. The engineer and society: Assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to practice in the field of Computer engineering.

PO 6. Environment and sustainability: Apply Computer engineering solutions also for sustainable development practices in societal and environmental contexts and demonstrate the knowledge of and need for sustainable development.

PO 7. Ethics: Apply ethical principles for commitment to professional ethics, responsibilities and norms of the practice, also in the field of Computer engineering.

PO 8. Individual and teamwork: Function effectively as a leader and team member in diverse/multidisciplinary teams.

PO 9. Communication: Communicate effectively in oral and written form.

PO 10. Life-long learning: Engage in independent and life-long learning activities in the context of technological changes in the Computer engineering field and allied industry.


Practical – Course Outcome Matrix (CO)

a. Interpret features of Android operating system.
b. Configure Android environment and development tools.
c. Develop rich user interface by using layouts and controls.
d. Use user interface components for Android application development.
e. Create Android application using database.
f. Publish Android application.

Sr. No. | Title of the Practical | CO a | CO b | CO c | CO d | CO e

1. Case study on big data and big data analysis (Walmart, Uber, Netflix, eBay, etc.)
2. Write a Pandas program
3. Perform the Extract, Transform, Load (ETL) process
4. Study any one Hadoop use case
5. Create Hive table: a. Create Hive external table, b. Load data into Hive table, c. Create Hive internal table
6. Load the data into Hive table
7. Create Hive table with the following storage format specifications: a. Hive text format, b. Hive sequence file format, c. Hive RC file format
8. Consider the sample logs.txt shown in the figure; write a Spark application to count the total number of WARN lines in the logs.txt file
9. Implement using Scala/Python programming: a. Create the data as logdata.log with comma delimiters, b. Create a DataFrame of the created log file using spark.read.csv
10. Write and run Spark SQL queries programmatically
11. Read and write data stored in Apache Hive through Spark SQL


List of Industry Relevant Skills

The following industry relevant skills of the competency “Create simple Android
applications” are expected to be developed in you by performing practicals of this laboratory
manual.

1. Interpret features of Android operating system.

2. Configure Android environment and development tools.

3. Develop rich user Interfaces by using layouts and controls.

4. Use User Interface components for android application development.

5. Create Android application using database.

6. Publish Android applications.

Brief Guidelines to Teachers: Hints regarding strategies to be used

1. Teacher shall explain prior concepts to the students before starting each experiment.

2. For practicals requiring tools to be used, the teacher should provide a demonstration of the practical, emphasizing the skills which the student should achieve.

3. Involve students in the activities during the conduct of each experiment.

4. Teachers should give opportunity to students for hands-on after the demonstration.

5. Assess the skill achievement of the students and COs of each unit.

6. Teacher is expected to share the skills and competencies to be developed in the students.

7. Teacher should ensure that the respective skills and competencies are developed in the
students after the completion of the practical exercise.

8. Teacher may provide additional knowledge and skills to the students even though that may
not be covered in the manual but are expected from the students by the industries.

9. Teacher may suggest the students to refer additional related literature of the reference
books/websites/seminar proceedings.

10. During assessment teacher is expected to ask questions to the students to tap their
knowledge and skill related to that practical.


Instructions for Students

Students shall read the points given below to understand the theoretical concepts and practical applications.

1. Students shall listen carefully to the lecture given by the teacher about the importance of the subject, the learning structure and the course outcomes.

2. Students shall organize the work in groups of two or three members and make a record of all observations.

3. Students shall understand the purpose of the experiment and its practical implementation.

4. Students shall write the answers to the questions during the practical.

5. Students should feel free to discuss any difficulty faced during the conduct of the practical.

6. Students shall develop maintenance skills as expected by the industries.

7. Students shall attempt to develop related hands-on skills and gain confidence.

8. Students shall refer to technical magazines and websites related to the scope of the subject and update their knowledge and skills.

9. Students shall develop self-learning techniques.

10. Students should develop the habit of submitting the write-ups on the scheduled dates and times.


Content Page

List of Practicals and Progressive Assessment Sheet

Sr. No. | Title of the Practical | Page No. | Date of Performance | Date of Submission | Assessment Marks | Dated Sign | Remark

1. Case study on big data and big data analysis (Walmart, Uber, Netflix, eBay, etc.)
2. Write a Pandas program
3. Perform the Extract, Transform, Load (ETL) process
4. Study any one Hadoop use case
5. Create Hive table: a. Create Hive external table, b. Load data into Hive table, c. Create Hive internal table
6. Load the data into Hive table
7. Create Hive table with the following storage format specifications: a. Hive text format, b. Hive sequence file format, c. Hive RC file format
8. Consider the sample logs.txt shown in the figure; write a Spark application to count the total number of WARN lines in the logs.txt file
9. Implement using Scala/Python programming: a. Create the data as logdata.log with comma delimiters, b. Create a DataFrame of the created log file using spark.read.csv
10. Write and run Spark SQL queries programmatically
11. Read and write data stored in Apache Hive through Spark SQL


Practical no 1: Case Study on Big Data and Big Data Analysis: Uber, Walmart,
Netflix, eBay

Practical significance:

Big data refers to large, complex sets of data that traditional data-processing software cannot
handle efficiently. This concept is revolutionizing businesses by providing deeper insights,
improving decision-making, enhancing customer experiences, and optimizing operations.
Companies like Uber, Walmart, Netflix, and eBay are leveraging big data to stay competitive
in their respective industries.

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)


Write a case study on big data and big data analytics.

Practical outcomes
Information about the various systems like Uber, Walmart, Netflix and eBay.

Minimum theoretical background:

1. Uber: Optimizing Ride-Hailing with Big Data

Big Data Applications: Uber has revolutionized the transportation industry by using big data
analytics to optimize the ride-hailing experience. Uber's system generates massive volumes
of data from riders, drivers, and their interactions.

Key Uses of Big Data:

• Dynamic Pricing (Surge Pricing): Uber uses real-time data from traffic patterns,
weather, and demand to adjust pricing in specific areas. This ensures a balance
between rider demand and driver availability.
• Route Optimization: Uber uses data to determine the fastest and most efficient
routes for drivers, taking into account real-time traffic conditions and historical data.
• Driver and Rider Behavior Analysis: Uber tracks user behavior to improve services
and personalize the customer experience. For example, Uber can use ride history to
suggest preferred drivers or routes for users.
• Supply-Demand Prediction: By analyzing historical ride data, Uber can predict


demand spikes in different geographic areas and deploy drivers accordingly

2. Walmart: Leveraging Big Data for Inventory and Customer Insights

Big Data Applications: Walmart uses big data across its supply chain, inventory
management, and customer relationship management systems. The company collects data
from point-of-sale systems, customer interactions, and its global supply chain.

Key Uses of Big Data:

• Inventory Management: Walmart utilizes big data analytics to track sales in real
time and optimize stock levels. By analyzing data on customer buying patterns, the
company can predict what products will be in demand, where, and when.
• Predictive Analytics: The company uses predictive models to forecast demand and
ensure products are available when and where customers need them, reducing
stockouts and overstock situations.
• Personalized Marketing: Walmart analyzes customer purchase data to create
personalized offers and promotions. The insights gained from big data help to target
the right customer with the right message at the right time.
• Supply Chain Optimization: Big data helps Walmart streamline its supply chain by
providing real-time insights into shipments, deliveries, and inventory levels. This
helps reduce costs and improve the efficiency of deliveries.

3. Netflix: Content Recommendations and Viewer Engagement

Big Data Applications: Netflix, a leader in streaming services, uses big data to analyze user
behavior, optimize content delivery, and recommend personalized content.

Key Uses of Big Data:

• Personalized Recommendations: Netflix analyzes millions of data points, including


viewing history, ratings, and preferences, to provide tailored content
recommendations. It uses machine learning algorithms to improve the
recommendation system over time.
• Content Creation Decisions: Netflix leverages big data to determine which types of
content will resonate with audiences. By analyzing user data, Netflix can predict
trends and create original content with a high likelihood of success.
• Streaming Quality Optimization: Netflix uses data on network speeds, device types,
and other factors to ensure smooth streaming. Data analytics helps optimize the video
delivery process to ensure minimal buffering and the best quality.
• Churn Prediction: By analyzing user engagement metrics, Netflix can identify
patterns that lead to customer churn. This allows the company to take proactive
measures to retain subscribers, such as offering personalized content or promotions.


4. eBay: Big Data for Market Insights and Pricing Optimization

Big Data Applications: eBay uses big data to enhance the marketplace experience, improve
decision-making, and optimize pricing and seller performance.

Key Uses of Big Data:

• Dynamic Pricing and Auction Optimization: eBay uses big data algorithms to
adjust auction prices in real time based on factors like demand, competition, and time
left in the auction. This ensures that buyers get competitive prices, and sellers
maximize revenue.
• Fraud Detection: Big data analytics help eBay identify and prevent fraudulent
activities by analyzing transaction patterns and behaviors that deviate from the norm.
• User Behavior Analysis: By analyzing user interactions, eBay can personalize
product recommendations, search results, and advertisements. The platform also uses
behavioral data to improve the user interface and make it more engaging.
• Seller Performance and Analytics: eBay provides sellers with insights into their
sales data, customer feedback, and market trends. This helps sellers optimize their
offerings and improve their customer service.

Case Study Overview:

• Uber: Dynamic pricing and route optimization based on traffic and demand.
• Walmart: Predicting product demand to optimize inventory.
• Netflix: Recommending content based on user behavior.
• eBay: Pricing optimization and fraud detection.


Procedure:
1. Uber: Dynamic Pricing and Demand Forecasting

Uber uses data to forecast demand and apply dynamic pricing (surge pricing) based on real-
time data. Let's simulate this with Python.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulating data: Demand vs Time and Weather (for simplicity)


np.random.seed(42)
time_of_day = np.random.randint(0, 24, 100) # Hours of the day (0-23)
weather_condition = np.random.choice([0, 1], size=100) # 0: Bad weather, 1: Good weather
demand = (time_of_day * 2 + weather_condition * 10 + np.random.normal(0, 10, 100))  # Simplified demand formula

# DataFrame with the simulated data


df = pd.DataFrame({'time_of_day': time_of_day, 'weather_condition': weather_condition, 'demand': demand})

# Split the data into training and testing sets


X = df[['time_of_day', 'weather_condition']]
y = df['demand']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Predict demand
y_pred = model.predict(X_test)

# Visualize the results


plt.scatter(y_test, y_pred)
plt.xlabel('Actual Demand')
plt.ylabel('Predicted Demand')
plt.title('Uber Demand Prediction (Dynamic Pricing)')
plt.show()

# Evaluate the model


from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")


2. Walmart: Predicting Product Demand

Walmart uses data to predict product demand to optimize inventory levels. Let's simulate this
prediction using Python.

from sklearn.ensemble import RandomForestRegressor


import seaborn as sns

# Simulating data for product demand


np.random.seed(42)
product_id = np.random.randint(1, 10, 200)
week_of_year = np.random.randint(1, 53, 200)
promotion = np.random.choice([0, 1], size=200) # 1: Promotion, 0: No promotion
historical_sales = np.random.normal(500, 100, 200) # Sales based on week and promotion
product_demand = historical_sales + week_of_year * 3 + promotion * 200 + np.random.normal(0, 50, 200)

# DataFrame
df = pd.DataFrame({
'product_id': product_id,
'week_of_year': week_of_year,
'promotion': promotion,
'historical_sales': historical_sales,
'product_demand': product_demand
})

# Feature selection
X = df[['product_id', 'week_of_year', 'promotion', 'historical_sales']]
y = df['product_demand']

# Train a random forest regressor


model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Predict demand for a new set of data (simulation)


new_data = pd.DataFrame({
'product_id': [1, 2, 3],
'week_of_year': [5, 10, 15],
'promotion': [0, 1, 0],
'historical_sales': [480, 500, 530]
})

predictions = model.predict(new_data)
print("Predicted Product Demand:", predictions)


3. Netflix: Content Recommendation

Netflix uses collaborative filtering or content-based filtering for recommendation systems.


Below is a simplified example of collaborative filtering using the surprise library.

from surprise import SVD, Dataset, Reader


from surprise.model_selection import train_test_split
from surprise import accuracy

# Sample data: user ratings for movies (user, movie, rating)


data = {
'user_id': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
'movie_id': ['M1', 'M2', 'M3', 'M4', 'M1', 'M2', 'M3', 'M4'],
'rating': [5, 4, 3, 2, 4, 5, 2, 1]
}

df = pd.DataFrame(data)

# Prepare data for Surprise library


reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating']], reader)

# Train-test split
trainset, testset = train_test_split(data, test_size=0.25)

# Train SVD model (Matrix Factorization)


model = SVD()
model.fit(trainset)

# Predictions on the test set


predictions = model.test(testset)
accuracy.rmse(predictions)

# Example of predicting a rating for a new user-movie pair


predicted_rating = model.predict('A', 'M3')
print(f"Predicted rating for user A on movie M3: {predicted_rating.est:.2f}")


4. eBay: Fraud Detection

eBay uses big data for fraud detection. We will simulate fraud detection using anomaly
detection techniques with Isolation Forest.

from sklearn.ensemble import IsolationForest


import numpy as np

# Simulating transaction data (amount, item category, user behavior)


np.random.seed(42)
transaction_amount = np.random.normal(100, 50, 1000)
item_category = np.random.choice([1, 2, 3, 4], size=1000)
user_behavior = np.random.choice([1, 0], size=1000) # 1: Suspicious, 0: Normal

# Fraudulent transactions (simulated)


fraudulent_transactions = np.random.choice([0, 1], size=1000, p=[0.95, 0.05])

# DataFrame
df = pd.DataFrame({
'transaction_amount': transaction_amount,
'item_category': item_category,
'user_behavior': user_behavior,
'fraudulent': fraudulent_transactions
})

# Feature selection
X = df[['transaction_amount', 'item_category', 'user_behavior']]

# Apply Isolation Forest for anomaly detection (fraud detection)


model = IsolationForest(contamination=0.05)
model.fit(X)

# Predict anomalies (fraudulent transactions)


df['fraud_predicted'] = model.predict(X)

# Anomalies are marked as -1 (fraudulent) and 1 (normal)


fraud_transactions = df[df['fraud_predicted'] == -1]
print(f"Number of suspected fraudulent transactions: {len(fraud_transactions)}")

Conclusion:


Practical related questions:


1) How does Netflix's big data strategy affect its content production and user engagement?

2) How could Uber further optimize its pricing model using big data insights during extreme weather or major events?

3) What role does big data play in shaping Walmart's customer marketing strategies and product placements?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 2: Write a Pandas program


Practical significance: When using Pandas, a Python library widely used for data analysis and manipulation, practical significance refers to how effectively you can apply it to solve real-world problems. This can be understood as the actual value or impact that using Pandas has in different areas, such as business, science, finance, or social research.

a. To import given Excel data into a Pandas DataFrame.

b. To get the data types of the given Excel data fields.

c. To read specific columns from a given Excel file.

d. To find the sum, mean, max, min value of a specific column of a given Excel file.

e. To import some Excel data skipping some rows or columns.

f. To select the specified columns and rows from a given DataFrame.

g. To delete rows and columns from a DataFrame.

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)


Write various Pandas programs to import and work with Excel data.

Practical outcomes
Information about importing data using Pandas programs.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes


Below are Pandas programs that address the requirements listed above:

1. Importing Excel Data into a Pandas DataFrame


import pandas as pd

# a. Import the given excel data into a Pandas DataFrame


file_path = 'your_excel_file.xlsx'
df = pd.read_excel(file_path)

print(df.head()) # Print the first few rows to verify the data
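If no Excel file is at hand, a small sample file can be generated first so that the snippets in this practical run end to end. This is an optional sketch: the file name 'your_excel_file.xlsx' and the column names 'Column1', 'Column2', 'Column3' are placeholders matching the examples here, and writing .xlsx files with Pandas assumes the openpyxl package is installed (pip install openpyxl).

import pandas as pd

# Create a small sample DataFrame and save it as an Excel file
sample = pd.DataFrame({
    'Column1': [10, 20, 30, 40, 50],       # numeric column used in the sum/mean/max/min example
    'Column2': ['a', 'b', 'c', 'd', 'e'],  # text column
    'Column3': [1.5, 2.5, 3.5, 4.5, 5.5]   # additional numeric column
})
sample.to_excel('your_excel_file.xlsx', index=False)  # write without the index column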

2. Getting Data Types of Excel Data Fields


# b. Get the data types of the given excel data fields
print(df.dtypes)

3. Reading Specific Columns from the Excel File


# c. Read specific columns from the given excel file
columns_to_read = ['Column1', 'Column2'] # Replace with actual column names
df_specific_columns = pd.read_excel(file_path, usecols=columns_to_read)

print(df_specific_columns.head()) # Print the first few rows to verify the data

4. Finding the Sum, Mean, Max, Min Value of a Specific Column


# d. Find the sum, mean, max, and min value of a specific column
column_name = 'Column1' # Replace with the name of the column you're interested in

sum_value = df[column_name].sum()
mean_value = df[column_name].mean()
max_value = df[column_name].max()
min_value = df[column_name].min()

print(f"Sum: {sum_value}")
print(f"Mean: {mean_value}")
print(f"Max: {max_value}")
print(f"Min: {min_value}")

5. Importing Excel Data by Skipping Rows or Columns


# e. Import some excel data skipping some rows or columns
df_skipped = pd.read_excel(file_path, skiprows=3, usecols="A:D") # Skipping first 3 rows and reading only
columns A to D

print(df_skipped.head()) # Print the first few rows to verify the data


6. Selecting Specific Columns and Rows from a DataFrame


# f. Select the specified columns and rows from the DataFrame
# For example, selecting rows 1 to 5 and columns 'Column1' and 'Column2'
df_selected = df.loc[1:5, ['Column1', 'Column2']]

print(df_selected)

7. Deleting Rows and Columns from DataFrame


# g. Delete Rows and Columns from DataFrame
# Deleting specific rows (e.g., rows 2 to 4)
df_dropped_rows = df.drop([2, 3, 4], axis=0)

# Deleting specific columns (e.g., 'Column1' and 'Column2')


df_dropped_columns = df.drop(['Column1', 'Column2'], axis=1)

print("DataFrame after dropping rows:")


print(df_dropped_rows)

print("DataFrame after dropping columns:")


print(df_dropped_columns)

Conclusion:

Practical related questions:


1) Steps to write a Pandas program.

2) How to find the sum and mean values of a specific column using Pandas?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 3: Perform the Extract, Transform, Load (ETL) process

Practical significance: The ETL (Extract, Transform, Load) process is a crucial part of data integration.

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)


Write a program to perform the Extract, Transform, Load (ETL) process.

Practical outcomes
Information about the ETL process.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Theoretical background:

• Extract: Data is extracted from various source systems or files.


• Transform: The data is cleaned, filtered, and transformed to meet the desired format
or structure.
• Load: The transformed data is loaded into a destination storage or database (like a
data warehouse).

To perform the ETL process, the following Python script incorporates these steps:

1. Import the functions and required modules.


2. Download the source file.
3. Extract the zip file.
4. Set the path for the target files.
5. Use the extract() function to extract data from multiple sources.
6. Use the transform() function to transform the data.
7. Load the data into the target file.
8. Call the log() function for each phase (Extract, Transform, Load).


Here's an example script to implement this:

import os
import zipfile
import requests
import pandas as pd
from datetime import datetime

# Function to log the progress


def log(phase, message):
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print(f"[{timestamp}] {phase}: {message}")

# Function to download the source file


def download_file(url, target_path):
log("Download", f"Downloading file from {url}...")
response = requests.get(url)
with open(target_path, 'wb') as f:
f.write(response.content)
log("Download", f"File downloaded successfully to {target_path}.")

# Function to extract the zip file


def extract_zip(zip_file_path, extract_to):
log("Extract", f"Extracting zip file: {zip_file_path}...")
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
zip_ref.extractall(extract_to)
log("Extract", f"Extraction completed. Files are available in {extract_to}.")

# Function to extract data from CSV or other sources


def extract_data(file_paths):
log("Extract", "Extracting data from the source files...")
dfs = []
for file_path in file_paths:
df = pd.read_csv(file_path) # Assuming CSV files for simplicity
dfs.append(df)
combined_df = pd.concat(dfs, ignore_index=True)
log("Extract", f"Data extracted from {len(file_paths)} files.")
return combined_df

# Function to transform the data


def transform_data(df):
log("Transform", "Transforming data...")
# Example transformations (you can customize this)
df_cleaned = df.dropna() # Drop rows with missing values
df_transformed = df_cleaned.rename(columns={'old_column_name': 'new_column_name'})  # Rename columns
log("Transform", "Data transformation completed.")
return df_transformed

# Function to load the data into a target file


def load_data(df, target_file_path):
log("Load", f"Loading data into {target_file_path}...")
df.to_csv(target_file_path, index=False)
log("Load", f"Data loaded into {target_file_path} successfully.")

# ETL pipeline that runs the entire process


def etl_pipeline(download_url, zip_file_path, extract_to, file_paths, target_file_path):
# Step 1: Download the source file
download_file(download_url, zip_file_path)


# Step 2: Extract the zip file


extract_zip(zip_file_path, extract_to)

# Step 3: Extract data from multiple files


extracted_data = extract_data(file_paths)

# Step 4: Transform the data


transformed_data = transform_data(extracted_data)

# Step 5: Load the data into the target file


load_data(transformed_data, target_file_path)

# Example usage
download_url = 'https://example.com/source.zip' # Replace with the actual URL to download the zip file
zip_file_path = 'source.zip' # Path to save the downloaded zip file
extract_to = './extracted_files' # Directory to extract the zip content
file_paths = ['./extracted_files/data1.csv', './extracted_files/data2.csv'] # Paths of extracted CSV files
target_file_path = './target_data.csv' # The target file where data will be loaded

# Run the ETL pipeline


etl_pipeline(download_url, zip_file_path, extract_to, file_paths, target_file_path)

Explanation of Each Step:

1. Log Function: log() logs the progress at each phase (Download, Extract, Transform,
Load).
2. Download the Source File:
o The download_file() function downloads a zip file from the provided URL using
the requests library.
3. Extract the Zip File:
o The extract_zip() function extracts the contents of the downloaded zip file into
the specified directory using zipfile.ZipFile.
4. Set the Path for the Target Files:
o This is handled through the file_paths and target_file_path variables, where
file_paths holds the locations of the extracted CSV files to read, and
target_file_path is where the final data will be saved.
5. Extract Data:
o The extract_data() function loads CSV files into pandas DataFrames, combines
them, and returns a single DataFrame.
6. Transform Data:
o The transform_data() function performs transformations on the data, such as
removing missing values and renaming columns.
7. Load Data:
o The load_data() function saves the transformed data into a target CSV file.
8. ETL Pipeline:
o The etl_pipeline() function runs the entire ETL process by calling the other
functions in sequence: downloading, extracting, transforming, and loading.

Conclusion :


Practical related questions:


1) What tools or technologies would you use for extracting the data?

2) What types of transformations might be needed on each dataset (e.g., cleaning, normalization,
joining)?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 4: Study any one Hadoop Use Case.

Practical significance: Hadoop is a framework used for storing and processing large
datasets in a distributed computing environment. A common use case for Hadoop involves
processing massive amounts of data (like logs or transactional data) in parallel across a
cluster of machines.

In this example, we'll walk through a simple Hadoop use case: processing a large text file
(like a log file) to count the frequency of each word. We'll implement this using Python and
the Hadoop ecosystem.

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)


Study a use case of Hadoop.

Practical outcomes
Information about the Hadoop system.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Minimum theoretical background:

Use Case: Word Count with Python on Hadoop

For this use case, we will write a Python program to count the frequency of each word in a
large text file using Hadoop's MapReduce framework. The job is run with Hadoop Streaming, which lets Python scripts act as the mapper and reducer; the pydoop library can additionally be used to interact with the Hadoop Distributed File System (HDFS) from Python.

Steps:

1. Setup Hadoop: Make sure you have a running Hadoop cluster (single-node or multi-
node).


2. Install Pydoop: This library allows Python to interact with Hadoop. Install it via pip:

pip install pydoop

3. Input File: Let's assume you have a large text file (input.txt) stored in HDFS. You can
use the following Python program to count the words.

Word Count Python Program

We'll write a MapReduce program using Hadoop's MapReduce framework and Python.

1. Mapper: It will take each line of the text, split it into words, and emit each word with
a count of 1.
2. Reducer: It will aggregate the counts for each word and output the final count.

wordcount.py

import sys

# Hadoop Streaming word count: the same script acts as the mapper or the
# reducer depending on the first command-line argument ("map" or "reduce").

def mapper():
    # Read lines from standard input, split them into words and
    # emit each word with a count of 1
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop Streaming sorts the mapper output by key, so all counts for a
    # word arrive on consecutive lines; sum them up per word
    current_word, total_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        if word == current_word:
            total_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{total_count}")
            current_word, total_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{total_count}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()

Explanation:

• The mapper function reads each line of input from standard input and emits each word followed by a count of 1.
• The reducer function reads the sorted word/count pairs produced by the mappers and sums the counts to give the total count for each word.

Running the Program on Hadoop

1. Upload Input File to HDFS:

Before running the job, upload your text file to HDFS.

hadoop fs -put input.txt /user/hadoop/input


2. Run the Hadoop Job: Execute the word count MapReduce job by running the
following command.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file wordcount.py \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output \
-mapper "python wordcount.py map" \
-reducer "python wordcount.py reduce"

3. Check the Output: Once the job completes, check the output directory.

hadoop fs -cat /user/hadoop/output/part-00000

This will give you a list of words and their corresponding counts in the text file.
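The exact words and counts depend on the contents of input.txt. Purely as a hypothetical illustration, an input containing the words "big", "data" and "hadoop" might produce output such as:

big	2
data	3
hadoop	1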

Explanation of the Command:

• -file: Ships the wordcount.py script to the cluster nodes so the streaming job can execute it.
• -input: Specifies the HDFS input path (the location of the file to be processed).
• -output: Specifies the HDFS output path (where the results will be stored).
• -mapper and -reducer: Specify the mapper and reducer commands. Here the same script wordcount.py is run in "map" mode as the mapper and in "reduce" mode as the reducer.

Conclusion

Practical related questions:


1) Write the features of HDFS.

2) How would you handle a situation where the HDFS is running out of space due to large datasets?

3) How can you implement data security in a Hadoop ecosystem?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 5
Create Hive table: a. Create Hive External Table. b. Load data into Hive table. c. Create Hive
Internal Table.

Below are the steps to create Hive tables (both external and internal), and load data into them.
These steps assume you have a Hadoop ecosystem with Hive configured and running. You
can execute these commands from the Hive command line interface (CLI) or from a script.

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)


Create a Hive table.

Practical outcomes
Information about Hive.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Minimum theoretical background:

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. The Hive Metastore (HMS) provides a central repository of metadata that can easily be analyzed to make informed, data-driven decisions, and therefore it is a critical component of many data lake architectures. Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GS, etc., through HDFS. Hive allows users to read, write, and manage petabytes of data using SQL.


1. Creating a Hive External Table


An external table allows you to manage the data outside of Hive’s control, meaning Hive will not
manage the data location and you can access data from the external location (HDFS, local filesystem,
etc.).

-- a. Create Hive External Table


CREATE EXTERNAL TABLE external_table_name (
column1 STRING,
column2 INT,
column3 DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' -- Specify the delimiter (can be other characters like tab, space, etc.)
LINES TERMINATED BY '\n' -- Specify line terminator (can vary depending on your file format)
LOCATION '/user/hive/warehouse/external_data'; -- Specify the location of your external data (HDFS or local
path)

2. Loading Data into Hive External Table

After creating the external table, you can load data into it using one of the following methods:

-- b. Load data into Hive External Table


LOAD DATA INPATH '/user/hive/external_data.csv' INTO TABLE external_table_name;

Alternatively, you can use the LOCAL keyword if your data is on the local filesystem:

LOAD DATA LOCAL INPATH '/local/path/to/external_data.csv' INTO TABLE external_table_name;

3. Creating a Hive Internal Table

An internal table (also known as a managed table) means Hive will manage the data and the
metadata. The data will be stored in Hive's default warehouse directory unless specified
otherwise. If you drop the table, both the data and the metadata will be deleted.

-- c. Create Hive Internal Table


CREATE TABLE internal_table_name (
column1 STRING,
column2 INT,
column3 DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' -- Specify the delimiter (e.g., CSV format)
LINES TERMINATED BY '\n'; -- Specify the line terminator

4. Loading Data into Hive Internal Table


For an internal table, you can load data into it in the same way as an external table:

-- Load data into Hive Internal Table


LOAD DATA INPATH '/user/hive/warehouse/internal_data.csv' INTO TABLE internal_table_name;

Again, you can use LOCAL if the data resides on the local filesystem:

LOAD DATA LOCAL INPATH '/local/path/to/internal_data.csv' INTO TABLE internal_table_name;
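A quick way to see the difference between the two table types is to drop them and check what remains; this short illustrative sketch uses the table names created above.

-- Dropping an internal (managed) table removes both the metadata and the data files
DROP TABLE internal_table_name;

-- Dropping an external table removes only the metadata; the files under
-- '/user/hive/warehouse/external_data' remain in HDFS and can be reused by
-- recreating an external table over the same LOCATION
DROP TABLE external_table_name;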

Conclusion

Practical related questions:

1) How to create a Hive table?
2) Write the features of Hive.


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 06: Load the data into Hive table

To load data into a Hive table, there are multiple ways depending on where the data is stored (local file system, HDFS, etc.) and the tools you use. Here are the steps for each approach:

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)


Load the data into a Hive table.

Practical outcomes
Information about Hive.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Minimum theoretical background:

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. The Hive Metastore (HMS) provides a central repository of metadata that can easily be analyzed to make informed, data-driven decisions, and therefore it is a critical component of many data lake architectures. Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GS, etc., through HDFS. Hive allows users to read, write, and manage petabytes of data using SQL.

a. Load Data from Local File System

To load data from the local file system into a Hive table, you need to first copy the data to
HDFS (Hive typically uses HDFS to store the data). Here's the general process:

1. Copy the file from local file system to HDFS: You first need to copy the file from


your local system to HDFS using the hadoop fs command.

hadoop fs -put /path/to/local/file /user/hive/warehouse/hive_table_name/

Here, /path/to/local/file is the path to the local file you want to load, and
/user/hive/warehouse/hive_table_name/ is the HDFS directory where you want to store the
data.

2. Load the data from HDFS to Hive Table: After uploading the file to HDFS, use the
LOAD DATA command in Hive to load the data into your table.

LOAD DATA INPATH '/user/hive/warehouse/hive_table_name/file' INTO TABLE your_hive_table;

This loads the file from the specified HDFS path into your Hive table.

b. Load Data from HDFS File System

If your data is already stored on HDFS, you can load it directly into a Hive table with the
LOAD DATA command:

1. Load the data from HDFS to Hive Table:

LOAD DATA INPATH '/user/hive/warehouse/hive_table_name/file' INTO TABLE your_hive_table;

This command loads the data from the specified HDFS location into your Hive table.

c. Copy Data to Hive Table Location

If you want to move data directly into the Hive table's directory location (e.g., using the file
system or a shell), follow these steps:

1. Find the Hive table's storage location: When a Hive table is created, it typically has
a corresponding directory in HDFS under /user/hive/warehouse/ (unless you specified a
different location).

To find the table location:

DESCRIBE FORMATTED your_hive_table;

This will show you the location of the table's directory in HDFS (under the Location
section).

2. Copy the data into this directory: After you know the location, you can use HDFS


commands to copy your file to this location.

hadoop fs -put /path/to/local/file /user/hive/warehouse/your_hive_table/

Alternatively, you could copy the data using hadoop fs -copyFromLocal or use an HDFS
tool like hdfs dfs -cp.

d. Sqoop Hive Import to Import Table Data

If you want to import data from a relational database (e.g., MySQL, PostgreSQL) into a Hive
table, you can use Sqoop. Here's how to perform a Hive import using Sqoop:

1. Import Data from a Relational Database to Hive:

sqoop import --connect jdbc:mysql://<hostname>:<port>/<db_name> \
--username <username> --password <password> \
--table <source_table_name> \
--hive-import \
--hive-table <hive_table_name> \
--create-hive-table

o Replace <hostname>, <port>, <db_name>, <username>, <password>,


<source_table_name>, and <hive_table_name> with the appropriate values.
o --hive-import will import data into Hive, and --create-hive-table will create the table
in Hive if it doesn’t already exist.
2. Verify the Data in Hive:

After importing, you can check the data in Hive using:

SELECT * FROM your_hive_table;

These methods cover different ways of loading data into Hive from local file systems, HDFS,
and relational databases using Sqoop. You can choose the one that best fits your use case.

Conclusion :


Practical related questions:

1) How to load the data into HDFS?

2) How to load the data from the local file system?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 07: Create Hive table with the following storage format specifications

To create Hive tables with the specified storage formats, you can define the tables using different STORED AS clauses for each format. Below are the SQL statements for creating Hive tables using these storage formats.

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)


Create a Hive table.

Practical outcomes
Information about Hive.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Minimum theoretical background

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. The Hive Metastore (HMS) provides a central repository of metadata that can easily be analyzed to make informed, data-driven decisions, and therefore it is a critical component of many data lake architectures. Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GS, etc., through HDFS. Hive allows users to read, write, and manage petabytes of data using SQL.

1. Hive Text File Format
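A table definition for this format is sketched below; it follows the same column layout used for the other formats in this practical, and the delimiter clauses mirror the example from Practical 5 (adjust them to your data).

CREATE TABLE table_text_file (
column1 STRING,
column2 INT,
column3 DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;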

2. Hive Sequence File Format


CREATE TABLE table_sequence_file (
column1 STRING,
column2 INT,


column3 DOUBLE
)
STORED AS SEQUENCEFILE;

3. Hive RC File Format


CREATE TABLE table_rc_file (
column1 STRING,
column2 INT,
column3 DOUBLE
)
STORED AS RCFILE;

4. Hive AVRO File Format


CREATE TABLE table_avro (
column1 STRING,
column2 INT,
column3 DOUBLE
)
STORED AS AVRO;

5. Hive ORC File Format


CREATE TABLE table_orc (
column1 STRING,
column2 INT,
column3 DOUBLE
)
STORED AS ORC;

6. Hive Parquet File Format


CREATE TABLE table_parquet (
column1 STRING,
column2 INT,
column3 DOUBLE
)
STORED AS PARQUET;

Conclusion

• For TextFile, we use the ROW FORMAT DELIMITED clause to specify how the data is
delimited (commonly comma-separated, but can be changed).
• For SequenceFile, RCFile, Avro, ORC, and Parquet, no row format details are
needed since these formats define their structure internally.
• Ensure the appropriate libraries are loaded in your Hive environment for working with
Avro, Parquet, and ORC.

You can modify the column names and data types according to your actual dataset.


Practical related questions:


1) Which are the Hive file formats?
2) Write the advantages of Hive.
3) How to implement the Hive RC file format?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 8: Write a Spark application to count the total number of WARN lines in the logs.txt file

Relevant Program Outcome (POs):


PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)

Count the total number of WARN lines.

Practical outcomes
Information about Spark.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Minimum theoretical background

To count the total number of WARN lines in the logs.txt file using a Spark application, we can implement it using either Scala or Python. Both implementations are shown below.

Python (Using PySpark)

You can use PySpark to read the logs.txt file, filter the lines that contain WARN, and then
count the number of those lines.

from pyspark.sql import SparkSession

# Initialize Spark session


spark = SparkSession.builder \
.appName("Count WARN messages") \
.getOrCreate()

# Read the logs.txt file


logs = spark.read.text("logs.txt")


# Filter the lines that start with 'WARN'


warn_lines = logs.filter(logs['value'].startswith("WARN"))

# Count the number of WARN lines


warn_count = warn_lines.count()

print(f"Total number of WARN lines: {warn_count}")

# Stop the Spark session


spark.stop()

Explanation:
1. SparkSession is initialized to create a Spark context.
2. The logs.txt file is read as text using spark.read.text().
3. We use filter() to keep only the rows where the line starts with WARN.
4. The count() method is used to get the number of WARN lines.
5. Finally, the Spark session is stopped using spark.stop().

Scala (Using Apache Spark)

In Scala, the code would look similar. Here's an example:

import org.apache.spark.sql.SparkSession

object WarnLineCounter {
def main(args: Array[String]): Unit = {
// Initialize Spark session
val spark = SparkSession.builder
.appName("Count WARN messages")
.getOrCreate()

// Read the logs.txt file


val logs = spark.read.text("logs.txt")

// Filter the lines that start with 'WARN'


val warnLines = logs.filter(logs("value").startsWith("WARN"))

// Count the number of WARN lines


val warnCount = warnLines.count()

// Print the result


println(s"Total number of WARN lines: $warnCount")

// Stop the Spark session


spark.stop()
}
}

Explanation:
1. SparkSession is created in a similar way to the Python example.
2. The logs.txt file is read using spark.read.text().
3. The filter() method is used to keep the lines that begin with WARN.


4. The count() method is used to get the number of WARN lines.


5. Finally, the Spark session is stopped using spark.stop().

Sample Input (logs.txt)


WARN This is a warning message
ERROR This is an error message
WARN This is a warning message
ERROR This is an error message
ERROR This is an error message
WARN This warning message
WARN This is a warning message
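For this sample logs.txt, four lines begin with WARN, so both the PySpark and the Scala programs should print:

Total number of WARN lines: 4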

Conclusion

Practical related questions:

1) Write down the features of Spark.

2) How to count the WARN lines?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 09: Create a DataFrame from the created log file

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)

Create a DataFrame from a created log file.

Practical outcomes

Information about Spark.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Minimum theoretical background:

To implement the creation of the log file and then load it into a Spark DataFrame, you can
follow these steps:

Step 1: Create the log file (logdata.log)

You can write the given data to a log file using Python or Scala. Below is an example of how
to write the data to a CSV file (logdata.log).

Python Code:

# Writing the data to a CSV file logdata.log
log_data = '''10:24:25,10.192.123.23,http://www.google.com/searchString,ODC1
10:24:21,10.123.103.23,http://www.amazon.com,ODC
10:24:21,10.112.123.23,http://www.amazon.com/Electronics,ODC1
10:24:21,10.124.123.24,http://www.amazon.com/Electronics/storagedevices,ODC1
10:24:22,10.122.123.23,http://www.gmail.com,ODC2
10:24:23,10.122.143.21,http://www.flipkart.com,ODC2


10:24:21,10.124.123.23,http://www.flipkart.com/offers,ODC1'''

# Save the log data to logdata.log


with open('logdata.log', 'w') as file:
file.write(log_data)

print("Log data has been saved to logdata.log.")

Scala Code:

// Writing the data to a CSV file logdata.log in Scala


import java.io._

val logData = """10:24:25,10.192.123.23,http://www.google.com/searchString,ODC1
10:24:21,10.123.103.23,http://www.amazon.com,ODC
10:24:21,10.112.123.23,http://www.amazon.com/Electronics,ODC1
10:24:21,10.124.123.24,http://www.amazon.com/Electronics/storagedevices,ODC1
10:24:22,10.122.123.23,http://www.gmail.com,ODC2
10:24:23,10.122.143.21,http://www.flipkart.com,ODC2
10:24:21,10.124.123.23,http://www.flipkart.com/offers,ODC1"""

// Save the log data to logdata.log


val writer = new PrintWriter(new File("logdata.log"))
writer.write(logData)
writer.close()

println("Log data has been saved to logdata.log.")

Step 2: Create a DataFrame using Spark

After saving the logdata.log file, you can use Spark to load it into a DataFrame. Below is the
code to do that:

Python Code using PySpark:


from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
.appName("LogData") \
.getOrCreate()

# Read the log file into a DataFrame


log_df = spark.read.csv('logdata.log', header=False, inferSchema=True)

# Rename columns as per schema


log_df = log_df.withColumnRenamed('_c0', 'Time') \
.withColumnRenamed('_c1', 'IP Address') \
.withColumnRenamed('_c2', 'URL') \
.withColumnRenamed('_c3', 'Location')

# Show the DataFrame


log_df.show()

# Stop the Spark session


spark.stop()


Scala Code using Spark:


import org.apache.spark.sql.SparkSession

// Initialize SparkSession
val spark = SparkSession.builder()
.appName("LogData")
.getOrCreate()

// Read the log file into a DataFrame


val logDF = spark.read
.option("header", "false") // No header in the log file
.csv("logdata.log")

// Rename columns as per the schema


val renamedDF = logDF
.withColumnRenamed("_c0", "Time")
.withColumnRenamed("_c1", "IP Address")
.withColumnRenamed("_c2", "URL")
.withColumnRenamed("_c3", "Location")

// Show the DataFrame


renamedDF.show()

// Stop the Spark session


spark.stop()

Explanation:

1. Log File Creation:


o The log file is created with a simple string containing comma-delimited values
for the log entries.
2. Reading the Data in Spark:
o The Spark read.csv() method is used to read the CSV file.
o inferSchema=True automatically infers the data types, but we use header=False
since the log file doesn't contain column headers.
3. Renaming Columns:
o After loading the data, the columns are renamed to Time, IP Address, URL, and
Location to match the required schema.

Expected Output:

The output will display the DataFrame in Spark:

+--------+-------------+-----------------------------------------+--------+
| Time| IP Address| URL|Location|
+--------+-------------+-----------------------------------------+--------+
|10:24:25| 10.192.123.23| http://www.google.com/searchString| ODC1|
|10:24:21| 10.123.103.23| http://www.amazon.com| ODC|
|10:24:21| 10.112.123.23| http://www.amazon.com/Electronics| ODC1|
|10:24:21| 10.124.123.24| http://www.amazon.com/Electronics/storagedevices| ODC1|
|10:24:22| 10.122.123.23| http://www.gmail.com| ODC2|
|10:24:23| 10.122.143.21| http://www.flipkart.com| ODC2|
|10:24:21| 10.124.123.23| http://www.flipkart.com/offers| ODC1|


Practical related questions:

1) What are the features of Spark?

2) How to create a DataFrame using Spark?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:


Practical no 11: Read and write data stored in Apache Hive through Spark SQL

To read and write data stored in Apache Hive through Spark SQL using either Scala or
Python, we need to use the Spark SQL API, which allows us to interact with Hive tables in a
Spark cluster. Below are the steps for both Scala and Python implementations.

Relevant Program Outcome (POs):

PO1 – Basic knowledge

PO2 – Discipline knowledge

PO3 – Experiments and practice

PO10 – Life-long learning

Relevant Course Outcome(s)

Read and write data stored in Apache Hive through Spark SQL.

Practical outcomes

Information about Spark.

Resource required

Sr. No. | Instrument/Object | Specification | Quantity | Remark
01 | Desktop PC | Processor i3 | 1 | Yes
02 | Software | | 1 | Yes

Minimum theoretical background

Prerequisites:

1. Apache Spark is installed and configured with Hive support.


2. Hive Metastore is properly configured and available.
3. Ensure the necessary dependencies (like hive-site.xml) are available for your Spark session.

Scala Implementation

To use Spark SQL with Hive in Scala, you will need to import the necessary libraries and
configure your SparkSession to connect to Hive.

1. Set up the SparkSession:

import org.apache.spark.sql.SparkSession

// Initialize SparkSession with Hive support


// Location used by Spark to store data for managed tables (adjust as needed)
val warehouseLocation = "spark-warehouse"

val spark = SparkSession.builder()
  .appName("Hive Spark Integration")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // This enables Hive support
  .getOrCreate()

2. Reading Data from Hive Table:

// Reading data from a Hive table into a DataFrame
val df = spark.sql("SELECT * FROM your_hive_table_name")
df.show()

3. Writing Data to Hive Table:

// Writing a DataFrame to a Hive table (overwriting if the table exists)
df.write
.mode("overwrite") // Options: append, overwrite, ignore, error
.saveAsTable("your_hive_table_name")

4. Creating a Hive Table:

// Creating a table in Hive if it doesn't exist
spark.sql("""
CREATE TABLE IF NOT EXISTS your_hive_table_name (
id INT,
name STRING,
age INT
) USING hive
""")

5. Stopping Spark Session:
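As in the Python example that follows, the Scala session is closed once the work is done:

spark.stop()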

Python Implementation

To use Spark SQL with Hive in Python, you need to follow similar steps. You can use
PySpark's SparkSession to interact with Hive.

1. Set up the SparkSession:

from pyspark.sql import SparkSession

# Initialize SparkSession with Hive support


spark = SparkSession.builder \
.appName("Hive Spark Integration") \
.config("spark.sql.warehouse.dir", "warehouse_location") \
.enableHiveSupport() \
.getOrCreate()


2. Reading Data from Hive Table:

# Reading data from a Hive table into a DataFrame
df = spark.sql("SELECT * FROM your_hive_table_name")
df.show()

3. Writing Data to Hive Table:

# Writing a DataFrame to a Hive table (overwriting if the table exists)
# mode options: append, overwrite, ignore, error
df.write \
    .mode("overwrite") \
    .saveAsTable("your_hive_table_name")

4. Creating a Hive Table:

# Creating a table in Hive if it doesn't exist
spark.sql("""
CREATE TABLE IF NOT EXISTS your_hive_table_name (
id INT,
name STRING,
age INT
) USING hive
""")

5. Stopping Spark Session:

spark.stop()

Conclusion:

• The configuration spark.sql.warehouse.dir refers to the location where Spark will store metadata
and data for Hive tables.
• The method enableHiveSupport() is critical to connect Spark with the Hive metastore.
• For reading and writing, you can also use different data formats like parquet, orc, avro, etc.,
depending on how your Hive tables are set up.
• When writing data to Hive, you can specify the mode: overwrite, append, ignore, or error.

With this setup, you can interact with Hive tables using Spark SQL in both Scala and Python,
making data processing and querying more efficient in big data environments.

Practical related questions:

1) Write the data types in Hive.

2) How to read the data in Hive?


Marks obtained: Process Related (15) | Product Related (15) | Total (25)

Dated signature of teacher:
