Basic DataFrame Operations: Part 1
What is a DataFrame in PySpark?
Imagine you're working at a startup, and your job is to analyze customer data. One morning, your manager drops a spreadsheet on your desk. It has names, ages, and purchase history of thousands of customers. Your job? Analyze trends, clean the data, and generate insights — fast.
But there’s a catch:
This isn’t just one file — it's millions of records spread across multiple machines. That old spreadsheet tool just won’t cut it anymore.
Enter the PySpark DataFrame.
At its core, a DataFrame is just a table — like what you’d see in Excel or a SQL database. It has:
- Rows (each row = one record)
- Columns (each column = a specific field, like "name", "age", or "email")
But unlike Excel or Pandas, a PySpark DataFrame is built for scale — designed to handle huge datasets across a cluster of computers.
A PySpark DataFrame is:
✅ Like a table in a database
✅ Easy to query and transform using Python
✅ Powerful enough to handle massive data
✅ Optimized for distributed computing
Step-by-Step Explanation of Basic Spark DataFrame Operations
SparkSession
from pyspark.sql import SparkSession
What is this?
This line imports SparkSession, which is the entry point to programming with DataFrames in PySpark.
SparkSession allows your program to connect to a Spark cluster and use all its functionalities (like reading data, transforming data, running SQL queries, etc.).
SparkSession.builder.appName('Basics').getOrCreate()
spark = SparkSession.builder.appName('Basics').getOrCreate()
What is happening here?
SparkSession.builder starts the builder pattern to create a new SparkSession.
.appName('Basics') sets a name for your Spark application – useful for tracking/logging purposes.
.getOrCreate() does one of two things:
- If a SparkSession already exists, it returns that.
- If it doesn’t, it creates a new one.
In simple terms:
This line initializes the Spark engine so you can start using PySpark.
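If you're experimenting locally, you can chain a few more builder options before getOrCreate(). Here's a minimal sketch (the local[*] master below is an illustrative choice for local runs, not a requirement):

from pyspark.sql import SparkSession

# Run Spark locally using all available cores ("local[*]" is illustrative;
# on a real cluster the master is usually set by spark-submit instead).
spark = (
    SparkSession.builder
    .appName('Basics')
    .master('local[*]')
    .getOrCreate()
)

# Calling getOrCreate() again returns the same session rather than a new one.
same_spark = SparkSession.builder.getOrCreate()
assert spark is same_spark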
Sample raw data
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35}
]
df = spark.createDataFrame(data)
Explanation:
- data is just a Python list of dictionaries (like rows in a table).
- Each dictionary represents a record, with keys as column names and values as row values.
- spark.createDataFrame(data) converts this Python list into a Spark DataFrame.
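Spark infers the column types from the dictionaries here (which is why age comes out as long below), and some PySpark versions warn that inferring a schema from dicts is deprecated. If you want explicit control over the types, you can pass a schema yourself; a minimal sketch:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema explicitly instead of letting Spark infer it.
schema = StructType([
    StructField('name', StringType(), True),   # True = nullable
    StructField('age', IntegerType(), True),
])

# Rows can be plain tuples when a schema is supplied.
rows = [('Alice', 30), ('Bob', 25), ('Charlie', 35)]
df_typed = spark.createDataFrame(rows, schema=schema)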
df.show()
You can inspect the contents with:
df.show()
Result:
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+
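show() also takes a few optional arguments worth knowing (by default it prints up to 20 rows and truncates cell values at 20 characters):

df.show(2)                # only the first 2 rows
df.show(truncate=False)   # don't cut long cell values short
df.show(vertical=True)    # print each row vertically; handy for wide tables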
🧠 Schema: Understanding the Structure of the DataFrame
You can check the schema (column names and data types) using:
df.printSchema()
Result:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
- name is of type string.
- age is of type long (Spark's 64-bit integer type).
- nullable = true means the column can contain null (empty) values.
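The schema is also available as a Python object, which is more convenient than the printed tree when you need it in code:

print(df.schema)   # a StructType with one StructField per column
print(df.dtypes)   # plain (name, type) pairs: [('name', 'string'), ('age', 'bigint')]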
df.columns
df.columns
This returns a list of column names in the DataFrame.
Example: ['name', 'age']
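Since df.columns is an ordinary Python list, you can use it anywhere a list works; for example:

# Guard against a missing column before selecting it
if 'age' in df.columns:
    df.select('age').show()

# Select every column except 'name' with a list comprehension
df.select([c for c in df.columns if c != 'name']).show()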
df.describe().show()
df.describe().show()
It gives you a quick statistical summary of the DataFrame's columns. Numeric columns get count, mean, stddev, min, and max; string columns get only count, min, and max (mean and stddev come back null).
Result:
+-------+-------+----+
|summary|   name| age|
+-------+-------+----+
|  count|      3|   3|
|   mean|   null|30.0|
| stddev|   null| 5.0|
|    min|  Alice|  25|
|    max|Charlie|  35|
+-------+-------+----+
Here’s what each row means:
- count: how many values are present (non-null)
- mean: the average
- stddev: standard deviation (how spread out the numbers are)
- min / max: minimum and maximum values
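You can also restrict describe() to particular columns, or use the related summary() method, which adds the 25%, 50%, and 75% percentiles on top of describe()'s statistics:

df.describe('age').show()   # statistics for just the age column
df.summary().show()         # count, mean, stddev, min, quartiles, max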
🔑 1-Minute Summary
| Code | What it Does |
|---|---|
| from pyspark.sql import SparkSession | Imports the main entry point for PySpark |
| SparkSession.builder...getOrCreate() | Starts or retrieves a Spark session |
| spark.createDataFrame(data) | Converts a Python list of dictionaries into a Spark DataFrame |
| df.show() | Displays rows of the DataFrame |
| df.printSchema() | Shows the structure (schema) of the DataFrame |
| df.columns | Returns a list of column names in the DataFrame |
| df.describe().show() | Provides a statistical summary of the DataFrame's columns |