PySpark - Read CSV file into DataFrame

Last Updated : 25 Oct, 2021

In this article, we are going to see how to read CSV files into a DataFrame. For this, we will use PySpark and Python.

Files used: authors.csv, book_author.csv, books.csv

Read CSV File into DataFrame

Here we are going to read a single CSV file into a PySpark DataFrame using spark.read.csv() and then convert it into a Pandas DataFrame using .toPandas().

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'Read CSV File into DataFrame').getOrCreate()

# Read a single CSV file into a PySpark DataFrame.
authors = spark.read.csv('/content/authors.csv', sep=',',
                         inferSchema=True, header=True)

# Convert to a Pandas DataFrame and show the first rows.
df = authors.toPandas()
df.head()

Output:

Here, we passed the path of our CSV file, authors.csv. Second, we passed the delimiter used in the CSV file; here the delimiter is a comma ','. Next, we set the inferSchema attribute to True, which makes Spark go through the CSV file and automatically infer a data type for each column of the PySpark DataFrame. With header=True, the first row is used for the column names. Finally, we converted the PySpark DataFrame to a Pandas DataFrame df using the toPandas() method.

Read Multiple CSV Files

To read multiple CSV files, we pass a Python list of the CSV file paths (as strings) to spark.read.csv().

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Read Multiple CSV Files').getOrCreate()

# List of CSV file paths to read into a single DataFrame.
path = ['/content/authors.csv', '/content/book_author.csv']

files = spark.read.csv(path, sep=',', inferSchema=True, header=True)

# display() renders DataFrames nicely in notebook environments such as Colab.
df1 = files.toPandas()
display(df1.head())
display(df1.tail())

Output:

Here, we read authors.csv and book_author.csv, both present in the current working directory, with a comma ',' as the delimiter and the first row as the header.

Read All CSV Files in Directory

To read all CSV files in a directory, we use the wildcard * so that every CSV file in the directory is matched.

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'Read All CSV Files in Directory').getOrCreate()

# The * wildcard matches every .csv file in /content/.
file2 = spark.read.csv('/content/*.csv', sep=',',
                       inferSchema=True, header=True)

df1 = file2.toPandas()
display(df1.head())
display(df1.tail())

Output:

This reads all the CSV files present in the current working directory, with a comma ',' as the delimiter and the first row as the header.
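Since all three examples rely on inferSchema=True, it can be useful to check what types Spark actually inferred before converting to Pandas. A minimal sketch, assuming the same /content/authors.csv file used above:

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Inspect Inferred Schema').getOrCreate()

# Same read as in the first example above.
authors = spark.read.csv('/content/authors.csv', sep=',',
                         inferSchema=True, header=True)

# printSchema() lists each column together with the type
# that inferSchema picked for it.
authors.printSchema()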
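Note that inferSchema=True requires an extra pass over the data, which can be slow on large files. An alternative is to pass an explicit schema to spark.read.csv(); in the sketch below, the column names and types are hypothetical and would need to match the real authors.csv:

Python3

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('Read CSV with Explicit Schema').getOrCreate()

# Hypothetical columns; replace with the actual columns of authors.csv.
schema = StructType([
    StructField('author_id', IntegerType(), True),
    StructField('name', StringType(), True),
])

# With an explicit schema, Spark skips the extra pass over the file
# that inferSchema=True would otherwise perform.
authors = spark.read.csv('/content/authors.csv', sep=',',
                         header=True, schema=schema)
authors.show(5)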