Abhishek BDA File
Branch: CSE
Semester: VIII
Roll No.: 23005
Certificate
Certified that this practical file entitled “Big Data Analysis”, submitted by Abhishek Chauhan, Roll No. 23005,
student of the Computer Science Engineering Department, Dronacharya College of Engineering,
Gurugram, in partial fulfilment of the requirements for the award of the Bachelor of
Technology (Computer Science and Engineering) degree of MDU, Rohtak, is a record of the
student's own work carried out under my supervision and guidance.
1. Write a Python program that demonstrates a basic process of extracting value from big data using a machine learning model.
2. Program that works with Big Data on a distributed computing platform like Apache Spark and compares it with a Relational Database Management System (RDBMS).
3. Program for a Big Data Data Lake scenario, involving a distributed storage system and possibly a query language like SQL or tools like Apache Spark. A simplified example uses Python, PySpark, and the Spark SQL API to demonstrate working with data stored in a Data Lake.
4. Python program using PySpark, the Python API for Apache Spark, to demonstrate basic data processing.
5. Python program using SQLAlchemy, a popular SQL toolkit and Object-Relational Mapping (ORM) library, to model data and interact with a relational database.
6. Python program that demonstrates basic data operations using the pandas library. Pandas is a popular library for data manipulation and analysis in Python.
7. Program in Python and PySpark to demonstrate data ingestion into the Hadoop Distributed File System (HDFS) and Apache Kafka.
LAB EXPERIMENT 1
OBJECTIVE: Write a Python program that demonstrates a basic process of extracting value from big data using a machine learning model.
PRE-EXPERIMENT QUESTIONS:
2. What are the requirements for extracting value from big data using a machine learning model?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Load the dataset
data = pd.read_csv('big_data.csv')
# Preprocess data as needed (handle missing values, encode categorical variables, etc.)
# Split features and target (assumes the label column is named 'target'), then hold out a test set
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a model (a RandomForestClassifier here as a representative choice) and predict on the test set
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Sample predictions:", y_pred[:10])
# Depending on your business case, you might use the model predictions to make decisions or take actions.
# Additional steps for deployment, monitoring, and continuous improvement would be necessary in a real-world scenario.
Output:
LAB EXPERIMENT 2
OBJECTIVE: Program that works with Big Data on a distributed computing platform like Apache Spark and compares it with a Relational Database Management System (RDBMS).
PRE-EXPERIMENT QUESTIONS:
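No code for this experiment appears in the file, so the following is only a minimal sketch under stated assumptions: a hypothetical sales.csv file with category and amount columns is aggregated once with PySpark and once with SQLite (standing in for an RDBMS), so the two approaches can be compared side by side.
# Minimal sketch (assumed file "sales.csv" with columns "category" and "amount")
import sqlite3
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("SparkVsRDBMS").getOrCreate()
# Distributed approach: Spark reads and aggregates the data across the cluster
spark_df = spark.read.csv("sales.csv", header=True, inferSchema=True)
spark_df.groupBy("category").agg(F.sum("amount").alias("total_amount")).show()
# RDBMS approach: the same data is loaded into a single-node SQLite table and queried with SQL
conn = sqlite3.connect("example_rdbms.db")
pd.read_csv("sales.csv").to_sql("sales", conn, if_exists="replace", index=False)
print(pd.read_sql("SELECT category, SUM(amount) AS total_amount FROM sales GROUP BY category", conn))
conn.close()
spark.stop()
The point of the comparison is that the RDBMS executes the query on a single machine and is usually simpler for data that fits there, while Spark distributes the same aggregation across a cluster and scales to datasets that a single database server cannot hold.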
LAB EXPERIMENT 3
OBJECTIVE: Program for a Big Data Data Lake scenario, involving a distributed storage system and possibly a query language like SQL or tools like Apache Spark. Below is a simplified example using Python, PySpark, and the Spark SQL API to demonstrate working with data stored in a Data Lake.
PRE-EXPERIMENT QUESTIONS:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataLakeExample").getOrCreate()
# Read data from a file in a Data Lake (assuming a Parquet file for illustration)
data_lake_df = spark.read.parquet("data_lake_file.parquet")
# Register the DataFrame as a temporary view so it can be queried with Spark SQL
data_lake_df.createOrReplaceTempView("data_lake_table")
# Run a SQL query against the Data Lake data (placeholder query; adjust to your schema)
result = spark.sql("SELECT * FROM data_lake_table")
result.show()
# Write the result back to the Data Lake (assuming a Parquet file for illustration)
result.write.parquet("output_result.parquet")
spark.stop()
In this example:
• Data is read from a Parquet file stored in the Data Lake.
• The DataFrame is registered as a temporary view and queried with Spark SQL.
• The query result is written back to the Data Lake as a new Parquet file.
1. What is the output of the program? Explain Python, PySpark, and Apache Spark.
2. What is a Big Data Data Lake scenario?
LAB EXPERIMENT 4
OBJECTIVE: Program using PySpark, which is the Python API for Apache Spark, to demonstrate basic data processing.
PRE-EXPERIMENT QUESTIONS:
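The program itself is not reproduced in the file; the sketch below is one possible version consistent with the notes that follow ("big_data.csv" and "column_name" are placeholders to replace with your actual dataset and column).
# Possible sketch of the program described below ("big_data.csv" and "column_name" are placeholders)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("BasicDataProcessing").getOrCreate()
# Read the large dataset with PySpark
big_data_df = spark.read.csv("big_data.csv", header=True, inferSchema=True)
# Basic transformation: calculate the sum of one column
column_sum = big_data_df.agg(F.sum("column_name").alias("column_sum")).collect()[0]["column_sum"]
# Print the result
print("Sum of column_name:", column_sum)
spark.stop()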
• big_data.csv represents a large dataset, and the program reads the data using PySpark.
• The program performs a basic data transformation, calculating the sum of a column
('column_name' in this example).
• The result is then printed.
Please note that you would need to replace "big_data.csv" and "column_name" with your actual
dataset and column name. Additionally, in a real-world scenario, you would likely perform more
complex transformations and analyses on the data.
The output of this program would be the calculated sum of the specified column, as indicated by the
print statement. This is a basic example, and actual processing tasks would depend on the specific
requirements and nature of your big data.
1. What is a CSV file?
LAB EXPERIMENT 5
OBJECTIVE: Python program using SQLAlchemy, a popular SQL toolkit and Object-Relational Mapping (ORM) library, to model data and interact with a relational database.
PRE-EXPERIMENT QUESTIONS:
2. What is an Object-Relational Mapping (ORM) library, and how does it help model data and interact with a relational database?
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

DATABASE_URL = "sqlite:///example.db"
engine = create_engine(DATABASE_URL)
Base = declarative_base()

# Model a table with SQLAlchemy's declarative ORM
class BigDataModel(Base):
    __tablename__ = "big_data_table"
    id = Column(Integer, primary_key=True)
    column1 = Column(String)
    column2 = Column(Integer)

Base.metadata.create_all(bind=engine)

# Create a session to interact with the database, insert a row, and query it back
db_session = Session(engine)
new_data_entry = BigDataModel(column1="example_data", column2=42)
db_session.add(new_data_entry)
db_session.commit()
queried_data = db_session.query(BigDataModel).filter(BigDataModel.column1 == "example_data").first()
print(queried_data.column1, queried_data.column2)
db_session.close()
LAB EXPERIMENT 6
OBJECTIVE: Python program that demonstrates basic data operations using the pandas library. Pandas is a popular library for data manipulation and analysis in Python.
PRE-EXPERIMENT QUESTIONS:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 22, 35, 28],
'Salary': [50000, 60000, 45000, 70000, 55000]}
df = pd.DataFrame(data)
# Display the original DataFrame
print("Original DataFrame:")
print(df)
print("\n")
# Data Operations:
# 1. Selecting columns
selected_columns = df[['Name', 'Age']]
print("Selected Columns:")
print(selected_columns)
print("\n")
# 2. Filtering rows based on a condition (assumed condition: Age > 25)
filtered_data = df[df['Age'] > 25]
print("Filtered Data (Age > 25):")
print(filtered_data)
print("\n")
# 3. Sorting by a column
sorted_data = df.sort_values(by='Salary', ascending=False)
print("Sorted Data (by Salary):")
print(sorted_data)
print("\n")
# 4. Grouping data and calculating aggregates
grouped_data = df.groupby('Age')['Salary'].mean()
print("Grouped Data (Average Salary by Age):")
print(grouped_data)
print("\n")
# 5. Adding a new column
df['Bonus'] = df['Salary'] * 0.1
print("DataFrame with Bonus Column:")
print(df)
print("\n")
# 6. Deleting a column
df = df.drop('Bonus', axis=1)
print("DataFrame after removing the Bonus column:")
print(df)
print("\n")
# 7. Renaming columns
df = df.rename(columns={'Age': 'Years'})
print("DataFrame with 'Age' renamed to 'Years':")
print(df)
Make sure to install the pandas library before running this program if you haven't already:
pip install pandas
LAB EXPERIMENT 7
OBJECTIVE: Write a program in Python and PySpark to demonstrate data ingestion into the Hadoop Distributed File System (HDFS) and Apache Kafka.
PRE-EXPERIMENT QUESTIONS:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataIngestionToHDFS").getOrCreate()
input_file_path = "your_large_dataset.csv"
hdfs_output_path = "hdfs://localhost:9000/user/your_username/your_output_path"
# Read the local CSV file and write it to HDFS in Parquet format
data_df = spark.read.csv(input_file_path, header=True, inferSchema=True)
data_df.write.mode("overwrite").parquet(hdfs_output_path)
spark.stop()
# Ensure you have a running HDFS instance, and adjust the input_file_path and hdfs_output_path accordingly.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
input_file_path = "your_data_file.txt"
# Send each line of the input file to the Kafka topic
with open(input_file_path) as f:
    for line in f:
        producer.send('your_kafka_topic', value=line.encode('utf-8'))
producer.flush()
producer.close()
# Make sure you have a running Kafka broker and adjust the input_file_path and your_kafka_topic accordingly.
These are basic examples, and in real-world scenarios, you would likely deal with more
complex data formats, configurations, and error handling. Additionally, you may need to
consider tools like Apache NiFi for comprehensive data ingestion pipelines.
Ensure you have the required libraries installed before running the programs:
pip install pyspark kafka-python
1. What is Kafka?
2. What are the common operations performed on a stack?
LAB EXPERIMENT 8
OBJECTIVE: Program demonstrating a real-life application of big data. Real-life applications of big data span various industries, addressing challenges related to large-scale data processing, analysis, and the extraction of valuable insights.
PRE-EXPERIMENT QUESTIONS:
1. What is large-scale data processing?
2. How are valuable insights extracted from big data?
BRIEF DISCUSSION AND EXPLANATION:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Load a sample e-commerce dataset (products and user interactions)
ecommerce_data = pd.read_csv("ecommerce_data.csv")
# Perform data preprocessing (cleaning, handling missing values, etc.)
# Create a TF-IDF vectorizer to convert product descriptions into numerical features
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(ecommerce_data['product_description'].fillna(''))
# Calculate the cosine similarity between products based on their descriptions
cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
# Function to get personalized product recommendations for a given product ID
def get_recommendations(product_id, cosine_sim=cosine_similarity):
    idx = ecommerce_data.index[ecommerce_data['product_id'] == product_id].tolist()[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:6]
    product_indices = [i[0] for i in sim_scores]
    return ecommerce_data['product_name'].iloc[product_indices]
# Example: Get recommendations for a specific product (change product_id accordingly)
product_id_to_recommend_for = 12345
recommendations = get_recommendations(product_id_to_recommend_for)
# Display the recommendations
print(f"Top 5 Recommendations for Product ID {product_id_to_recommend_for}:\n")
for i, product_name in enumerate(recommendations, start=1):
    print(f"{i}. {product_name}")
LAB EXPERIMENT 9
OBJECTIVE: Python program that simulates a basic big data processing scenario using Apache Spark for querying data. It loads a large dataset (assuming a CSV file for simplicity), performs some data preprocessing, and then executes SQL-like queries on the data using Spark SQL.
PRE-EXPERIMENT QUESTIONS:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataQueryExample").getOrCreate()
# Load the large dataset (placeholder file name; adjust to your data)
big_data_df = spark.read.csv("big_data.csv", header=True, inferSchema=True)
print("Original DataFrame:")
big_data_df.show(truncate=False)
print("\n")
# Basic preprocessing step (placeholder; here rows with missing values are dropped)
processed_data = big_data_df.dropna()
print("Processed DataFrame:")
processed_data.show(truncate=False)
print("\n")
# Register a temporary view and run a SQL-like query (placeholder query; adjust to your schema)
processed_data.createOrReplaceTempView("processed_data_table")
query_result = spark.sql("SELECT * FROM processed_data_table")
print("Query Result:")
query_result.show(truncate=False)
spark.stop()
LAB EXPERIMENT 10
OBJECTIVE: Python program using Apache Beam, a popular open-source data processing SDK, to demonstrate a basic pipeline that counts events by type.
PRE-EXPERIMENT QUESTIONS:
import apache_beam as beam
data = ['click', 'view', 'click', 'purchase', 'view']  # sample events; replace with your actual data source
with beam.Pipeline() as pipeline:
    # Step 1: Read data from a source (replace 'data' with your actual data source)
    events = pipeline | 'ReadEvents' >> beam.Create(data)
    # Step 2: Map each event to (event_type, 1) and count occurrences per key
    event_counts = (
        events
        | 'MapEventType' >> beam.Map(lambda event: (event, 1))
        | 'CountByEventType' >> beam.CombinePerKey(sum))
    # Step 3: Write the results to an output (replace 'output' with your actual output destination)
    event_counts | 'WriteResults' >> beam.io.WriteToText('output')
• The ReadEvents step reads data from a source. Replace 'data' with your actual data source.
• The MapEventType step applies a transformation to convert each event into a key-value pair with the event type as the key and 1 as the value.
• The CountByEventType step uses CombinePerKey to count the occurrences of each
event type.
• The WriteResults step writes the final results to an output destination. Replace 'output'
with your actual output destination, which could be a file, database, or another storage
system.
LAB EXPERIMENT 11
OBJECTIVE: Python program that demonstrates basic analytical operations using Apache Spark (PySpark).
PRE-EXPERIMENT QUESTIONS:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("BigDataAnalyticsExample").getOrCreate()
# Load the dataset (placeholder file name; adjust to your data)
big_data_df = spark.read.csv("big_data.csv", header=True, inferSchema=True)
print("Original DataFrame:")
big_data_df.show(truncate=False)
print("\n")
# Analytical Operations:
# 1. Count the total number of rows
total_rows = big_data_df.count()
print(f"Total number of rows: {total_rows}")
print("\n")
# 2. Summary statistics for the numeric columns
print("Summary Statistics:")
big_data_df.describe().show(truncate=False)
print("\n")
# 3. Average value per category (assumes columns named 'category' and 'value')
average_by_category = big_data_df.groupBy("category").agg(avg("value").alias("average_value"))
print("Average value by category:")
average_by_category.show(truncate=False)
print("\n")
spark.stop()
Discussion: