Essential PySpark Interview Questions & Answers (Explained Through Real-World Stories) – Part 2

6. What are the different file formats PySpark supports?

PySpark supports many file formats for reading and writing data. Some of the most common ones are listed below, with a short write example after the list:

  1. CSV (Comma-Separated Values): Text files where each line represents a row of data, and values are separated by commas. Example: data.csv

  2. JSON (JavaScript Object Notation): Data is stored as key-value pairs. Very useful for structured and semi-structured data. Example: data.json

  3. Parquet: A column-based storage format that is very fast and efficient. Commonly used in big data projects. Example: data.parquet

  4. ORC (Optimized Row Columnar): Another efficient columnar format used mostly with Hive. Example: data.orc

  5. Avro: A binary format often used for data exchange between systems. Example: data.avro

  6. Text Files: Simple files containing plain text data. Example: data.txt
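To make this concrete, here is a small sketch that writes the same DataFrame in several of these formats. The sample rows and output paths are invented for the example, and writing Avro requires the external spark-avro package, so that line is left as a comment.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FileFormatsExample").getOrCreate()

# Tiny invented DataFrame just for this example
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Same data, different formats (paths are illustrative)
df.write.mode("overwrite").csv("out/data_csv", header=True)
df.write.mode("overwrite").json("out/data_json")
df.write.mode("overwrite").parquet("out/data_parquet")
df.write.mode("overwrite").orc("out/data_orc")

# Avro needs the spark-avro package on the classpath:
# df.write.mode("overwrite").format("avro").save("out/data_avro")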


7. How do you read a CSV file into a PySpark DataFrame?

To read a CSV file in PySpark, use the DataFrameReader exposed as the read property of the SparkSession. Here’s how you can do it:

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("ReadCSVExample").getOrCreate()

# Read CSV file
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show data
df.show()

Explanation:

  • header=True means the first line of the file contains column names.
  • inferSchema=True tells PySpark to automatically detect the data types (like integer, string, etc.).
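Note that inferSchema=True makes Spark scan the data an extra time to guess the types. If you already know the columns, you can pass an explicit schema instead; the field names below are illustrative, not taken from any real file.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Illustrative schema; adjust names and types to match your CSV
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Reuses the spark session created above
df = spark.read.csv("data.csv", header=True, schema=schema)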

8. How do you write a DataFrame to a Parquet file?

Writing a DataFrame to a Parquet file is simple and efficient in PySpark:

df.write.parquet("output_data.parquet")

Explanation:

  • The above command saves your DataFrame in Parquet format.
  • Parquet files are faster to read and take up less storage compared to CSV files.
  • You can also use options like mode("overwrite") to replace existing files.

Example:

df.write.mode("overwrite").parquet("output_data.parquet")
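In practice, Parquet output is often partitioned by a column so that later reads can skip whole directories. A small sketch, assuming df has a "city" column (hypothetical here):

# Writes one sub-directory per distinct city value
df.write.mode("overwrite").partitionBy("city").parquet("output_data.parquet")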

9. What are the common DataFrame transformations?

Transformations are operations that create a new DataFrame from an existing one — without changing the original data.

Some common transformations are:

  • select(): selects specific columns. Example: df.select("name", "age")
  • filter() / where(): filters rows based on a condition. Example: df.filter(df.age > 18)
  • groupBy(): groups data for aggregation. Example: df.groupBy("city").count()
  • withColumn(): adds or modifies a column. Example: df.withColumn("age_plus_1", df.age + 1)
  • drop(): removes a column. Example: df.drop("address")
  • orderBy(): sorts data. Example: df.orderBy("age")

Note: Transformations are lazy, meaning they don’t run immediately — they wait until an action is performed.
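Putting a few of these together, here is a minimal runnable sketch; the sample rows are invented for the example, and nothing executes until the final show():

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TransformationsExample").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 25, "Delhi"), ("Bob", 17, "Mumbai"), ("Cara", 30, "Delhi")],
    ["name", "age", "city"],
)

# Chain transformations: this only builds a plan
adults_by_city = (
    df.filter(df.age > 18)
      .groupBy("city")
      .count()
      .orderBy("count")
)

# show() is an action, so the work actually runs here
adults_by_city.show()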


10. What are actions in PySpark? Give examples.

Actions are operations that trigger the actual execution of transformations and return a result.

Some common actions are:

  • show(): displays the DataFrame content. Example: df.show()
  • collect(): returns all rows to the driver as a list. Example: df.collect()
  • count(): returns the number of rows. Example: df.count()
  • first() / head(): returns the first row(s). Example: df.first()
  • take(n): returns the first n rows. Example: df.take(5)
  • write: saves data to a file or database (the save triggers execution). Example: df.write.csv("output.csv")

In simple words: Transformations prepare the data, while actions actually do the work (like displaying or saving results).
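To see this split in action, here is a small sketch reusing the invented DataFrame from the previous question:

# Transformations build a plan; no data is processed yet
filtered = df.filter(df.age > 18).select("name")

# Each action below triggers execution of that plan
print(filtered.count())   # number of matching rows
print(filtered.take(2))   # first 2 rows as a list of Row objects
filtered.show()           # pretty-printed table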

11. How do you show the top N rows of a DataFrame?

To display the top N rows of a PySpark DataFrame, use the show() function and specify the number of rows you want to see.

df.show(5)

Explanation:

  • This command displays the first 5 rows of the DataFrame in a readable table format.
  • If no number is given, PySpark shows 20 rows by default.
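show() also accepts a truncate argument, which is handy when cell values are long. For example:

# Show 5 rows without truncating long string values
df.show(5, truncate=False)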

12. How do you select specific columns from a DataFrame?

To pick specific columns from a DataFrame, use the select() function.

df_selected = df.select("name", "age")
df_selected.show()

Explanation:

  • This creates a new DataFrame with only the columns “name” and “age”.
  • The original DataFrame stays unchanged, since PySpark DataFrames are immutable.
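select() also accepts column expressions, not just names. A small sketch using the built-in col() helper; the derived column name is invented for the example:

from pyspark.sql.functions import col

# Select a column and a computed expression with an alias
df_selected = df.select(col("name"), (col("age") + 1).alias("age_next_year"))
df_selected.show()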

13. How do you rename a column in PySpark?

To rename a column, use the withColumnRenamed() function.

df_renamed = df.withColumnRenamed("oldName", "newName")
df_renamed.show()

Explanation:

  • The column “oldName” is renamed to “newName”.
  • This does not change the original DataFrame — it returns a new one.
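To rename several columns, the calls can simply be chained, since each returns a new DataFrame; the new names below are illustrative:

# Chained renames: each call produces a fresh DataFrame
df_renamed = (
    df.withColumnRenamed("name", "full_name")
      .withColumnRenamed("age", "age_years")
)
df_renamed.show()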

14. How do you drop columns from a DataFrame?

You can remove columns using the drop() function.

df_dropped = df.drop("address")
df_dropped.show()

Explanation:

  • The “address” column is removed from the DataFrame.
  • To drop multiple columns at once:
df_dropped = df.drop("address", "phone_number")
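A quick way to confirm the drop is to compare df.columns before and after; this assumes df actually has those columns:

# df.columns returns the list of column names
print(df.columns)          # before the drop
print(df_dropped.columns)  # after: "address" and "phone_number" are gone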

15. What is the difference between filter and where?

Both filter() and where() are used to select rows based on conditions. They behave identically; where() is simply an alias for filter(), so the only difference is the name.

Example using filter():

df.filter(df.age > 18).show()

Example using where():

df.where(df.age > 18).show()

Explanation:

  • Both commands show rows where age > 18.
  • They are interchangeable, so you can use either one depending on your preference.
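Both functions also accept SQL-style string conditions, which some people find more readable; the "city" column here is assumed from the earlier sample data:

# Equivalent string-based conditions
df.filter("age > 18").show()
df.where("age > 18 AND city = 'Delhi'").show()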