PySpark - Read CSV file into DataFrame

In this article, we are going to see how to read CSV files into a DataFrame using PySpark and Python.

Files used: authors, book_author, books

Read CSV File into DataFrame

Here we are going to read a single CSV file into a PySpark DataFrame using spark.read.csv() and then convert it to a Pandas DataFrame using .toPandas().

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'Read CSV File into DataFrame').getOrCreate()

authors = spark.read.csv('/content/authors.csv',
                         sep=',', inferSchema=True, header=True)

df = authors.toPandas()
df.head()
```

Output:

Here, we passed our CSV file, authors.csv. Second, we passed the delimiter used in the CSV file; here the delimiter is a comma ','. Next, we set the inferSchema attribute to True, which makes Spark scan the file and infer an appropriate data type for each column of the resulting DataFrame. Then, we converted the PySpark DataFrame to a Pandas DataFrame df using the toPandas() method. Note that toPandas() collects the entire dataset to the driver, so it is only suitable for data that fits in memory.

Read Multiple CSV Files

To read multiple CSV files, we pass a Python list of the CSV file paths as strings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Read Multiple CSV Files').getOrCreate()

path = ['/content/authors.csv',
        '/content/book_author.csv']

files = spark.read.csv(path, sep=',',
                       inferSchema=True, header=True)

df1 = files.toPandas()
display(df1.head())
display(df1.tail())
```

Output:

Here, we imported authors.csv and book_author.csv, both present in the current working directory, with a comma ',' as the delimiter and the first row as the header.

Read All CSV Files in Directory

To read all CSV files in a directory, we use the wildcard * so that every CSV file in the directory is considered.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'Read All CSV Files in Directory').getOrCreate()

file2 = spark.read.csv('/content/*.csv', sep=',',
                       inferSchema=True, header=True)

df1 = file2.toPandas()
display(df1.head())
display(df1.tail())
```

Output:

This reads all the CSV files present in the current working directory, with a comma ',' as the delimiter and the first row as the header.
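Because inferSchema=True triggers an extra pass over the data, it can be slow on large files. An alternative is to declare the schema up front. Below is a minimal sketch of this approach; the column names author_id and name are hypothetical placeholders, not the actual columns of the authors.csv used above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName(
    'Read CSV with Explicit Schema').getOrCreate()

# Hypothetical columns for illustration; replace with the real
# columns of your authors.csv.
schema = StructType([
    StructField('author_id', IntegerType(), True),
    StructField('name', StringType(), True),
])

# Passing schema= skips the inference pass entirely.
authors = spark.read.csv('/content/authors.csv', sep=',',
                         schema=schema, header=True)
authors.printSchema()
```

With an explicit schema, Spark reads the file only once and raises type problems early instead of silently inferring a different type.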
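The reads shown in this article can also be written with PySpark's generic format/option builder API, which is equivalent to passing keyword arguments to spark.read.csv(). A short sketch, assuming the same /content paths as above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'Read CSV via format and options').getOrCreate()

# Equivalent to spark.read.csv('/content/*.csv', sep=',',
#                              inferSchema=True, header=True)
file2 = (spark.read.format('csv')
         .option('sep', ',')
         .option('inferSchema', True)
         .option('header', True)
         .load('/content/*.csv'))

file2.show(5)
```

Which style to use is a matter of preference; the builder form is convenient when options are assembled programmatically.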