EDA Python for Data Analysis

The document provides a guide on using Apache Spark (PySpark) for data manipulation, including data loading, cleaning, analysis, visualization, and machine learning integration. It covers operations such as reading/writing different file formats, performing statistical analysis, and handling complex data types, and it also discusses performance optimization techniques and advanced features like window functions, graph analysis, and real-time data processing.

1. Data Loading

• Read CSV File:

df = spark.read.csv('filename.csv', header=True, inferSchema=True)

• Read Parquet File:

df = spark.read.parquet('filename.parquet')

• Read from JDBC (Databases):

df = spark.read.format("jdbc").options(url="jdbc_url", dbtable="table_name").load()

2. Show Data

• Display Top Rows: df.show()

• Print Schema: df.printSchema()

• Summary Statistics: df.describe().show()

• Count Rows: df.count()

• Display Columns: df.columns

3. Data Cleaning

• Drop Missing Values: df.na.drop()

• Fill Missing Values: df.na.fill(value)

• Drop Irrelevant Columns: df.drop('column_name')

• Rename Column: df.withColumnRenamed('old_name', 'new_name')

• Drop Duplicate Rows: df.dropDuplicates()

• Drop Duplicates on Specific Columns: df.dropDuplicates(['column1', 'column2'])


• Check for Outliers: see the IQR sketch below
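
One way to flag outliers, sketched here as a minimal example assuming a numeric column named 'column' (a placeholder), is the interquartile-range rule using approxQuantile:

# Approximate 25th/75th percentiles (last argument is the relative error)
q1, q3 = df.approxQuantile('column', [0.25, 0.75], 0.01)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df.filter((df['column'] < lower) | (df['column'] > upper))
outliers.count()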

6. Statistical Analysis

• Describe data: df.describe()

• Show Distribution (on a pandas copy, with seaborn): import seaborn as sns; sns.histplot(df.toPandas()['column'], bins=20, kde=True)

• Correlation Matrix (expects a single vector column; see the sketch at the end of this section):

from pyspark.ml.stat import Correlation; Correlation.corr(vec_df, 'features')

• Covariance: df.stat.cov('column1', 'column2')

• Frequency Items: df.stat.freqItems(['column1', 'column2'])
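
Correlation.corr works on one vector column, so a typical pattern, sketched here with placeholder column names 'col1' and 'col2', is to assemble the numeric columns first:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Pack the numeric columns into a single vector column, then compute the Pearson matrix
assembler = VectorAssembler(inputCols=['col1', 'col2'], outputCol='features')
vec_df = assembler.transform(df).select('features')
corr_matrix = Correlation.corr(vec_df, 'features').head()[0]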

7. Data Visualization

• Bar Chart: df.groupBy('column').count() (plot the result on a pandas copy; see the sketch at the end of this section)

• Histogram: df.select('column').rdd.flatMap(lambda x: x).histogram(10)

• Scatter Plot (on a pandas copy): df.select('column1', 'column2').toPandas().plot(kind='scatter', x='column1', y='column2')

• Box Plot: pandas_df[['column']].boxplot()

• ……………………
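
Spark has no plotting API of its own, so a common pattern, sketched below assuming matplotlib is installed and 'column' is a placeholder name, is to aggregate in Spark and plot the small result in pandas:

import matplotlib.pyplot as plt

# Aggregate in Spark, then hand the (small) result to pandas for plotting
counts = df.groupBy('column').count().toPandas()
counts.plot(kind='bar', x='column', y='count')
plt.show()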

8. Export Data in Python

• Convert to Pandas DataFrame: pandas_df = df.toPandas()


• Convert to CSV (Pandas): pandas_df.to_csv('path_to_save.csv', index=False)

• Write DataFrame to CSV: df.write.csv('path_to_save.csv')

• Write DataFrame to Parquet: df.write.parquet('path_to_save.parquet')


9. Advanced Data Processing

• Window Functions:

from pyspark.sql.window import Window; from pyspark.sql.functions import rank;
df.withColumn('rank', rank().over(Window.partitionBy('column').orderBy('other_column')))

• Pivot Table: df.groupBy('column').pivot('pivot_column').sum('sum_column')

• UDF (User Defined Functions; see the typed sketch below): from pyspark.sql.functions import udf;
my_udf = udf(my_python_function); df.withColumn('new_col', my_udf(df['col']))
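
By default udf() returns StringType; a minimal sketch of a typed UDF, using a hypothetical function and column names, looks like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Declare the return type explicitly so Spark does not cast the result to string
@udf(returnType=IntegerType())
def double_it(x):
    return None if x is None else x * 2

df.withColumn('new_col', double_it(df['col']))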

10. Performance Optimization

• Caching DataFrame: df.cache()

• Repartitioning: df.repartition(10)

• Broadcast Join Hint: from pyspark.sql.functions import broadcast; df.join(broadcast(df2), 'key', 'inner')

11. Exploratory Data Analysis Specifics

• Column Value Counts: df.groupBy('column').count().show()

• Distinct Values in a Column: df.select('column').distinct().show()

• Aggregations (sum, max, min, avg): df.groupBy().sum('column').show()
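
Several aggregates can also be computed in a single pass with agg; 'column' and 'value' below are placeholder names:

from pyspark.sql import functions as F

df.groupBy('column').agg(
    F.count('*').alias('rows'),
    F.avg('value').alias('avg_value'),
    F.min('value').alias('min_value'),
    F.max('value').alias('max_value'),
).show()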

12. Working with Complex Data Types

• Exploding Arrays: df.withColumn('exploded', explode(df['array_column']))

• Working with Structs: df.select(df['struct_column']['field'])

• Handling Maps: df.select(map_keys(df['map_column']))

13. Joins

• Inner Join: df1.join(df2, df1['id'] == df2['id'])

• Left Outer Join: df1.join(df2, df1['id'] == df2['id'], 'left_outer')


• Right Outer Join: df1.join(df2, df1['id'] == df2['id'], 'right_outer')
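
Joining on the expression df1['id'] == df2['id'] keeps both id columns in the result; a common alternative, sketched below, is to join on the column name so only one key column remains:

# Join on the column name to keep a single 'id' column in the output
joined = df1.join(df2, on='id', how='inner')
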
14. Saving and Loading Models

• Saving ML Model: model.save('model_path')

• Loading ML Model:

from pyspark.ml.classification import LogisticRegressionModel;
LogisticRegressionModel.load('model_path')

15. Handling JSON and Complex Files

• Read JSON: df = spark.read.json('path_to_file.json')

• Explode JSON Object: df.selectExpr('json_column.*')

16. Custom Aggregations

• Custom Aggregate Function:

from pyspark.sql import functions as F;
df.groupBy('group_column').agg(F.sum('sum_column'))

17. Working with Null Values

• Counting Nulls in Each Column:

from pyspark.sql import functions as F; df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

• Drop Rows with Null Values: df.na.drop()

18. Data Import/Export Tips

• Read Text Files: df = spark.read.text('path_to_file.txt')

• Write Data to JDBC:

df.write.format("jdbc").options(url="jdbc_url", dbtable="table_name").save()

19. Advanced SQL Operations

• Register DataFrame as Table: df.createOrReplaceTempView('temp_table')


• Perform SQL Queries: spark.sql('SELECT * FROM temp_table WHERE
condition')

20. Dealing with Large Datasets

• Sampling Data: sampled_df = df.sample(False, 0.1)

• Approximate Count Distinct:

from pyspark.sql.functions import approx_count_distinct;
df.select(approx_count_distinct('column')).show()

21. Data Quality Checks

• Checkpointing (persist to reliable storage and truncate lineage): df.checkpoint()

• Asserting Conditions: df.filter(df['column'] > 0).count() (compare with df.count(); see the sketch below)
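
A small sketch of turning such a check into a hard assertion ('column' is a placeholder name):

# Fail fast if any row violates the expected constraint
violations = df.filter(df['column'] <= 0).count()
assert violations == 0, f"{violations} rows have non-positive values in 'column'"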

22. Advanced File Handling

• Specify Schema While Reading:

from pyspark.sql.types import StructType; schema = StructType([...]);
df = spark.read.csv('file.csv', schema=schema)

• Writing in Overwrite Mode: df.write.mode('overwrite').csv('path_to_file.csv')

23. Debugging and Error Handling

• Collecting Data Locally for Debugging: local_data = df.take(5)

• Handling Exceptions in UDFs:

def safe_udf(my_udf):
    def wrapper(*args, **kwargs):
        try:
            return my_udf(*args, **kwargs)
        except Exception:
            return None
    return wrapper
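
A possible way to apply the wrapper above, assuming a hypothetical plain Python function parse_value and placeholder column names:

from pyspark.sql.functions import udf

# Wrap the (hypothetical) parse_value so bad rows yield NULL instead of failing the job
safe_parse = udf(safe_udf(parse_value))
df.withColumn('parsed', safe_parse(df['raw_column']))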

24. Machine Learning Integration

• Creating Feature Vector:

from pyspark.ml.feature import VectorAssembler;
assembler = VectorAssembler(inputCols=['col1', 'col2'], outputCol='features');
feature_df = assembler.transform(df)

25. Advanced Joins and Set Operations

• Cross Join: df1.crossJoin(df2)

• Set Operations (Union, Intersect, Minus): df1.union(df2); df1.intersect(df2); df1.subtract(df2)

26. Dealing with Network Data

• Reading Data from an HTTP Source: Spark's CSV reader has no "url" option and cannot read http:// paths directly; one simple workaround is to fetch the file with pandas and convert:

import pandas as pd; pdf = pd.read_csv('http://example.com/data.csv'); df = spark.createDataFrame(pdf)

27. Integration with Visualization Libraries

• Convert to Pandas for Visualization: pandas_df = df.toPandas(); pandas_df.plot(kind='bar')

28. Spark Streaming for Real-Time EDA

• Reading from a Stream: df = spark.readStream.format('source').load()

• Writing to a Stream: df.writeStream.format('console').start()
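
A minimal self-contained streaming sketch, using the built-in 'rate' test source (no external system needed) and a console sink:

# Generate a synthetic stream and print each micro-batch to the console
stream_df = spark.readStream.format('rate').option('rowsPerSecond', 5).load()
query = stream_df.writeStream.format('console').outputMode('append').start()
query.awaitTermination()  # blocks until query.stop() is called or the job is killed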

29. Advanced Window Functions

• Cumulative Sum:

from pyspark.sql.window import Window;
df.withColumn('cum_sum', F.sum('column').over(Window.partitionBy('group_column').orderBy('order_column')))

• Row Number: df.withColumn('row_num', F.row_number().over(Window.orderBy('column')))

30. Handling Complex Analytics

• Rollup: df.rollup('column1', 'column2').agg(F.sum('column3'))

• Cube for Multi-Dimensional Aggregation: df.cube('column1', 'column2').agg(F.sum('column3'))

31. Dealing with Geospatial Data

• Using GeoSpark for Geospatial Data:

from geospark.register import GeoSparkRegistrator;
GeoSparkRegistrator.registerAll(spark)

32. Advanced File Formats

• Reading ORC Files: df = spark.read.orc('filename.orc')

• Writing Data to ORC: df.write.orc('path_to_file.orc')

33. Dealing with Sparse Data

• Using Sparse Vectors:

from pyspark.ml.linalg import SparseVector; sparse_vec = SparseVector(size, {index: value})

34. Handling Binary Data

• Reading Binary Files:

df = spark.read.format('binaryFile').load('path_to_binary_file')

35. Efficient Data Transformation

• Using mapPartitions for Transformation:

rdd = df.rdd.mapPartitions(lambda partition: [transform(row) for row in partition])

36. Advanced Machine Learning Operations

• Using ML Pipelines:

from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[stage1, stage2]); model = pipeline.fit(df)

• Model Evaluation:

from pyspark.ml.evaluation import BinaryClassificationEvaluator;
evaluator = BinaryClassificationEvaluator(); evaluator.evaluate(predictions)

37. Optimization Techniques

• Broadcast Joins for Efficiency: from pyspark.sql.functions import broadcast; df.join(broadcast(df2), 'key')

• Using Accumulators for Global Aggregates:

accumulator = spark.sparkContext.accumulator(0);
rdd.foreach(lambda x: accumulator.add(x))

38. Advanced Data Import/Export

• Reading Data from Multiple Sources:

df = spark.read.format('format').option('option', 'value').load(['path1', 'path2'])

• Writing Data to Multiple Formats: df.write.format('format').save('path', mode='overwrite')

39. Utilizing External Data Sources

• Connecting to External Data Sources (e.g., Kafka, S3):

df = spark.read.format('kafka').option('kafka.bootstrap.servers', 'host1:port1').option('subscribe', 'topic_name').load()

40. Efficient Use of SQL Functions

• Using Built-in SQL Functions:

from pyspark.sql.functions import col, lit; df.withColumn('new_column', col('existing_column') + lit(1))

41. Exploring Data with GraphFrames

• Using GraphFrames for Graph Analysis:


from graphframes import GraphFrame; g = GraphFrame(vertices_df, edges_df)

42. Working with Nested Data

• Exploding Nested Arrays:

df.selectExpr('id', 'explode(nestedArray) as element')

• Handling Nested Structs: df.select('struct_column.*')

43. Advanced Statistical Analysis

• Hypothesis Testing:

from pyspark.ml.stat import ChiSquareTest; r = ChiSquareTest.test(df, 'features', 'label')

• Statistical Functions (e.g., mean, stddev):

from pyspark.sql.functions import mean, stddev; df.select(mean('column'), stddev('column'))

44. Customizing Spark Session

• Configuring SparkSession:

from pyspark.sql import SparkSession;
spark = SparkSession.builder.appName('app').config('spark.some.config.option', 'value').getOrCreate()
