Bachelor of Technology in Computer Science
and Engineering
Report
On
Data Analysis in Big Data Using Python
Name: Tushar Verma
Admission No: 21SCSE1310012
Under the Guidance of Dr. K Suresh
Introduction:
Big data analytics is the process of collecting, examining, and analyzing large
amounts of data to discover market trends, insights, and patterns that can help
companies make better business decisions. These insights become available
quickly and efficiently, so companies can be agile in crafting plans to
maintain their competitive advantage.
Objective:
Big data analytics describes the process of uncovering trends,
patterns, and correlations in large amounts of raw data to help make
data-informed decisions. These processes use familiar statistical
analysis techniques—like clustering and regression—and apply them
to more extensive datasets with the help of newer tools.
Technologies Used:
Python programming language
PySpark library for distributed data processing
Integrated Development Environment (IDE) such as PyCharm or
Jupyter Notebook
Implementation Details:
a. Setting up the Environment:
Install Python and the PySpark library.
Create a new Python script or project in your preferred IDE.
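As a quick sanity check that the environment is ready, the following sketch
imports PySpark and prints its version:
# Verify that PySpark is installed and importable
import pyspark
print("PySpark version:", pyspark.__version__)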
b. Importing Required Libraries:
Import the necessary libraries, including SparkSession from pyspark.sql and
the pyspark.ml modules used in the machine learning step.
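The examples in this report rely on the following imports, taken from the
Source Code section below:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans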
c. Initializing the SparkSession:
Create a SparkSession, setting the application name and any other
configuration options.
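A minimal sketch of creating the session is shown below; the shuffle-partition
setting is an assumption suited to small local runs, not a requirement:
# Create (or reuse) a SparkSession with an application name
spark = (SparkSession.builder
         .appName("BigDataAnalysis")
         .config("spark.sql.shuffle.partitions", "8")  # assumption: small local run
         .getOrCreate())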
d. Loading the Data:
Read the input file into a DataFrame. Spark can infer the schema
automatically, or you can supply one explicitly for faster, more
predictable loads.
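For large files, supplying an explicit schema avoids a full inference pass
over the data. The column names and types below are assumptions chosen to
match the placeholders used later in this report:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema matching the placeholder columns used below
schema = StructType([
    StructField("column_name", StringType(), True),
    StructField("numeric_column", DoubleType(), True),
])
data = spark.read.csv("path/to/bigdata.csv", header=True, schema=schema)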
e. Exploring the Data:
Inspect the schema and preview a few rows to confirm the data loaded as
expected. Count the rows to get a sense of the dataset's size.
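A few quick checks confirm the load succeeded:
data.printSchema()                      # column names and types
data.show(5)                            # preview the first five rows
print("Number of rows:", data.count())  # total row count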
f. Aggregating the Data:
Group the rows by a key column and compute summary statistics, such as the
sum of a numeric column. Define helper functions for any aggregations that
are run repeatedly.
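A sketch using the pyspark.sql.functions helpers, which allow named result
columns; the grouping and numeric columns are the same placeholders as in the
Source Code section:
from pyspark.sql import functions as F

# Sum a numeric column and count rows within each group
agg_result = data.groupBy("column_name").agg(
    F.sum("numeric_column").alias("total"),
    F.count("*").alias("rows"),
)
agg_result.show()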
g. Filtering the Data:
Select only the rows that satisfy a condition, such as a numeric column
exceeding a threshold. Filtering early keeps the later stages of the
pipeline small.
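A sketch of a compound filter; the threshold of 100 is illustrative, as in
the Source Code section:
# Keep rows with a large value in the numeric column and a non-null key
filtered_data = data.filter(
    (data["numeric_column"] > 100) & (data["column_name"].isNotNull())
)
filtered_data.show()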
h. Joining Datasets:
Combine the DataFrame with a second dataset on a common column.
Choose the join type (inner, left, or right) that matches the question
being asked.
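A sketch of an inner join against a hypothetical second file;
path/to/otherdata.csv and common_column are assumed placeholders:
# Load a second DataFrame and join on the shared key column
another_data = spark.read.csv("path/to/otherdata.csv", header=True, inferSchema=True)
joined_data = data.join(another_data, on="common_column", how="inner")
joined_data.show()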
i. Applying Machine Learning:
Assemble the feature columns into a single vector column, then fit a model
such as KMeans clustering. Inspect the resulting cluster assignments to
interpret the groups the model found.
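A sketch of the clustering step; feature1 and feature2 are placeholder
numeric columns, as in the Source Code section:
# Combine the feature columns into a single vector column
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
features_data = assembler.transform(data)

# Fit a two-cluster KMeans model and inspect the assignments
kmeans = KMeans(k=2, seed=0)
model = kmeans.fit(features_data)
model.transform(features_data).select("features", "prediction").show()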
j. Stopping the SparkSession:
Stop the SparkSession once the analysis is complete so that cluster
resources are released.
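Results can optionally be persisted before shutting down; the output path
below is an assumed placeholder:
# Write a result DataFrame to disk, then release cluster resources
agg_result.write.csv("path/to/output", header=True, mode="overwrite")
spark.stop()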
Challenges Faced:
During the development of the data analysis project, the following
challenges were encountered:
Inferring a correct schema for large or inconsistently formatted input files.
Managing memory and partitioning so that operations remain responsive on
data that does not fit on a single machine.
Debugging lazy transformations, since errors often surface only when an
action such as count() or show() is executed.
Conclusion:
The data analysis project successfully demonstrates the
development of a simple yet practical big data pipeline using Python and
PySpark. By following the implementation details outlined in this report,
users can build their own analysis workflow and extend it with additional
transformations, aggregations, and models. The project provides a solid
foundation for understanding big data concepts and Python programming
techniques.
Source Code:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
.appName("BigDataAnalysis") \
.getOrCreate()
# Load the big data file into a DataFrame
data = spark.read.csv("path/to/bigdata.csv", header=True, inferSchema=True)
# Perform data analysis operations
# Example 1: Count the number of rows in the DataFrame
row_count = data.count()
print("Number of rows:", row_count)
# Example 2: Perform aggregations
agg_result = data.groupBy("column_name").agg({"numeric_column": "sum"})
agg_result.show()
# Example 3: Apply filters
filtered_data = data.filter(data["column_name"] > 100)
filtered_data.show()
# Example 4: Perform joins
# another_data is a second DataFrame; the path below is a placeholder
another_data = spark.read.csv("path/to/otherdata.csv", header=True, inferSchema=True)
joined_data = data.join(another_data, data["common_column"] == another_data["common_column"], "inner")
joined_data.show()
# Example 5: Perform machine learning tasks (e.g., clustering, classification)
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
# Prepare features for clustering
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
features_data = assembler.transform(data)
# Apply KMeans clustering
kmeans = KMeans(k=2, seed=0)
model = kmeans.fit(features_data)
# Get cluster predictions
predictions = model.transform(features_data)
predictions.show()
# Stop the SparkSession
spark.stop()
Note that this code assumes you have PySpark installed and a Spark environment available; Spark can run
locally for development or on a cluster for larger workloads. Additionally, you may need to modify the
code based on your specific data and analysis requirements.