PySpark - Select Columns From DataFrame
Last Updated: 04 Aug, 2021

In this article, we will discuss how to select columns from a PySpark dataframe using the select() function.

Syntax: dataframe.select(parameter).show()

where:
- dataframe is the dataframe name
- parameter is the column(s) to be selected
- show() displays the selected column(s)

Let's create a sample dataframe:

```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"],
        ["5", "gnanesh", "iit"]]

# specify column names
columns = ['student ID', 'student NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print("Actual data in dataframe")

# show dataframe
dataframe.show()
```

Selecting a single column

With a column name, we can get the whole column from the dataframe.

Syntax: dataframe.select("column_name").show()

```python
# select a column by column name
dataframe.select('student ID').show()
```

Selecting multiple columns

With multiple column names, we can get several whole columns from the dataframe.

Syntax: dataframe.select(["column_name1", "column_name2", ..., "column_nameN"]).show()

```python
# select multiple columns by column name
dataframe.select(['student ID', 'student NAME', 'college']).show()
```

Selecting by column number

Here we select columns based on their position rather than their name. This can be done with the indexing operator: pass the column number as the index to dataframe.columns[].
Syntax: dataframe.select(dataframe.columns[column_number]).show()

```python
# select the column at index 1
dataframe.select(dataframe.columns[1]).show()
```

Selecting multiple columns by column number

Here we select multiple columns using the slice operator.

Syntax: dataframe.select(dataframe.columns[column_start:column_end]).show()

where column_start is the starting index and column_end is the ending index (exclusive, as in ordinary Python slicing).

```python
# select columns with the slice operator
dataframe.select(dataframe.columns[0:3]).show()
```
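Column-number selection works because dataframe.columns in PySpark is an ordinary Python list of column-name strings, so standard list indexing and slicing apply before the names are handed to select(). A minimal sketch using the sample column names from above (plain Python, no Spark session needed):

```python
# dataframe.columns in PySpark evaluates to a plain list of strings;
# here we reuse the same names as the sample dataframe above
columns = ['student ID', 'student NAME', 'college']

# indexing picks one column name, slicing picks a contiguous range
single = columns[1]        # the second column name
first_two = columns[0:2]   # the first two column names

print(single)     # student NAME
print(first_two)  # ['student ID', 'student NAME']
```

Whatever indexing or slicing produces here is exactly what gets passed to select(), which is why both `dataframe.columns[1]` and `dataframe.columns[0:3]` work as arguments.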
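To see what column selection produces without running a Spark session, here is a plain-Python sketch that applies the same by-name selection to the sample rows above. This is illustrative only: the pick_columns helper is hypothetical and not part of the PySpark API.

```python
# sample data and column names from the article above
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"],
        ["5", "gnanesh", "iit"]]
columns = ['student ID', 'student NAME', 'college']

def pick_columns(rows, names, wanted):
    """Keep only the listed columns, in the order given (hypothetical helper)."""
    idx = [names.index(w) for w in wanted]
    return [[row[i] for i in idx] for row in rows]

# roughly what dataframe.select(['student ID', 'college']) keeps from each row
selected = pick_columns(data, columns, ['student ID', 'college'])
for row in selected:
    print(row)
```

Each output row retains only the 'student ID' and 'college' values, which mirrors how select() projects a dataframe down to the requested columns.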