Basic DataFrame Operations, Part 1

What is a DataFrame in PySpark?

Imagine you're working at a startup, and your job is to analyze customer data. One morning, your manager drops a spreadsheet on your desk. It has names, ages, and purchase history of thousands of customers. Your job? Analyze trends, clean the data, and generate insights — fast.

But there’s a catch:
This isn’t just one file. It’s millions of records spread across multiple machines, and that old spreadsheet tool just won’t cut it anymore.
Enter the PySpark DataFrame.

At its core, a DataFrame is just a table — like what you’d see in Excel or a SQL database. It has:

  • Rows (each row = one record)
  • Columns (each column = a specific field, like "name", "age", or "email")

But unlike Excel or Pandas, a PySpark DataFrame is built for scale — designed to handle huge datasets across a cluster of computers.

A PySpark DataFrame is:

✅ Like a table in a database
✅ Easy to query and transform using Python
✅ Powerful enough to handle massive data
✅ Optimized for distributed computing

Step-by-Step Explanation of Basic Spark DataFrame Operations

SparkSession

from pyspark.sql import SparkSession

What is this?
This line imports SparkSession, which is the entry point to programming with DataFrames in PySpark.

SparkSession allows your program to connect to a Spark cluster and use all its functionalities (like reading data, transforming data, running SQL queries, etc.).
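
For a taste of what that looks like, here is a minimal sketch. It assumes the spark session created in the next step, and the file name customers.csv is hypothetical:

df = spark.read.csv("customers.csv", header=True, inferSchema=True)  # read data
df.createOrReplaceTempView("customers")                              # expose it to SQL
adults = spark.sql("SELECT name FROM customers WHERE age >= 18")     # run a SQL query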

SparkSession.builder.appName('Basics').getOrCreate()

spark = SparkSession.builder.appName('Basics').getOrCreate()

What is happening here?
SparkSession.builder starts the builder pattern to create a new SparkSession.

.appName('Basics') sets a name for your Spark application – useful for tracking/logging purposes.

.getOrCreate() does one of two things:

  • If a SparkSession already exists, it returns that.
  • If it doesn’t, it creates a new one.

In simple terms:
This line initializes the Spark engine so you can start using PySpark.
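
You can see the "get or create" behavior for yourself. This sketch assumes both calls run in the same Python process:

spark_a = SparkSession.builder.appName('Basics').getOrCreate()
spark_b = SparkSession.builder.appName('SomethingElse').getOrCreate()

print(spark_a is spark_b)  # True: the existing session is reused

Note that the second appName is ignored, because an active session already existed and was simply returned.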

Sample raw data

data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35}
]

df = spark.createDataFrame(data)

Explanation:

  • data is just a Python list of dictionaries (like rows in a table).

  • Each dictionary represents a record with keys as column names and values as row values.

  • spark.createDataFrame(data) converts this Python list into a Spark DataFrame.
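
Depending on your Spark version, building a DataFrame from plain dictionaries may print a deprecation warning. A common alternative, sketched here, is to use Row objects instead:

from pyspark.sql import Row

rows = [
    Row(name="Alice", age=30),
    Row(name="Bob", age=25),
    Row(name="Charlie", age=35)
]
df = spark.createDataFrame(rows)  # same DataFrame, built from Rows instead of dicts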

df.show()

You can inspect the contents with:

df.show()

Result:

+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+
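
show() also takes a few optional arguments worth knowing:

df.show(2)               # print only the first 2 rows
df.show(truncate=False)  # don't cut long values off at 20 characters
df.show(vertical=True)   # print one field per line, handy for wide tables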

🧠 Schema: Understanding the Structure of the DataFrame

You can check the schema (column names and data types) using:

df.printSchema()

Result:

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

  • name is of type string.
  • age is of type long (a 64-bit integer type in PySpark).
  • nullable = true means the column can contain null (empty) values.
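
If you don’t want Spark to infer these types, you can declare the schema yourself. Here is a minimal sketch using StructType (the nullable flags are illustrative choices):

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("name", StringType(), True),   # nullable = true
    StructField("age", LongType(), False)      # nullable = false: age is required
])

df = spark.createDataFrame([("Alice", 30), ("Bob", 25), ("Charlie", 35)], schema)
df.printSchema()  # now prints exactly the types and nullability declared above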

df.columns

df.columns

This returns a list of column names in the DataFrame.

Example: ['name', 'age']
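
Because df.columns is an ordinary Python list, you can use it anywhere a list works:

for col in df.columns:   # iterate over column names
    print(col)

df.select([c for c in df.columns if c != "age"]).show()  # keep every column except "age"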

df.describe().show()

df.describe().show()

It gives you a quick statistical summary of the columns in your DataFrame. The mean and stddev rows only apply to numeric columns; string columns like name still get count, min, and max (with min/max compared alphabetically).

Result:

+-------+-------+----+
|summary|   name| age|
+-------+-------+----+
|  count|      3|   3|
|   mean|   null|30.0|
| stddev|   null| 5.0|
|    min|  Alice|  25|
|    max|Charlie|  35|
+-------+-------+----+

Here’s what each row means:

  • count: How many values are present (non-null)
  • mean: The average
  • stddev: Standard deviation (how spread out the numbers are)
  • min / max: Minimum and maximum values
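
describe() also accepts specific column names, and its sibling summary() (available since Spark 2.3) adds percentiles. For example:

df.describe("age").show()                 # statistics for one column only
df.summary().show()                       # describe() plus 25%, 50%, 75% quartiles
df.summary("count", "min", "max").show()  # or pick just the statistics you want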

🔑 1-Minute Summary

Code                                      What it Does
from pyspark.sql import SparkSession      Imports the main entry point for PySpark
SparkSession.builder...getOrCreate()      Starts or retrieves a Spark session
spark.createDataFrame(data)               Converts a Python list of dictionaries to a Spark DataFrame
df.show()                                 Displays rows in the DataFrame
df.printSchema()                          Shows the structure (schema) of the DataFrame
df.columns                                Returns a list of column names in the DataFrame
df.describe().show()                      Provides a statistical summary of numeric columns