
# One-Line PySpark Function Meanings

## 🧰 Basic utilities

| Function / Method | Meaning |
| --- | --- |
| `from pyspark.sql.functions import ...` | Imports built-in functions from the PySpark SQL module. |
| `.select([...])` | Selects specific columns from the DataFrame. |
| `.show()` after any transformation | Displays the result of the transformation for visual inspection. |

## 📦 Data Loading & Preview

| Method / Parameter | Meaning |
| --- | --- |
| `inferSchema=True` | Automatically detects the data types of columns. |
| `header=True` | Treats the first row of the file as column headers. |
| `df.show()` | Displays the first 20 rows of the DataFrame. |
| `df.head(1)` | Returns the first row as a list containing one `Row` object. |
| `df.printSchema()` | Displays the structure (schema) of the DataFrame. |

## 📅 Date Functions

| Function / Method | Meaning |
| --- | --- |
| `dayofmonth(df['Date'])` | Extracts the day (1–31) from a date. |
| `month(df['Date'])` | Extracts the month (1–12) from a date. |
| `year(df['Date'])` | Extracts the year (e.g., 2023) from a date. |
| `weekofyear(df['Date'])` | Extracts the week number of the year from a date. |
| `date_format(df['Date'], 'MMM')` | Formats the date with a custom pattern (e.g., Jan, Feb). |

## 🧮 Aggregations & Metrics

| Function / Method | Meaning |
| --- | --- |
| `withColumn('Year', year(df['Date']))` | Creates a new column 'Year' derived from the 'Date' column. |
| `groupBy('Year')` | Groups the rows by the 'Year' column. |
| `.mean()` or `.agg({'col':'mean'})` | Computes the average of one or more columns. |
| `.agg({'Sales':'sum'})` | Computes the total sum for the 'Sales' column. |
| `.agg({'Volume':'max'})` | Finds the maximum value in the 'Volume' column. |
| `round(mean(df['Close']), 2)` | Computes the mean and rounds it to 2 decimal places. |
| `max(df['Volume'])`, `min(df['Volume'])` | Gets the maximum or minimum volume across all rows. |
| `countDistinct(df['Sales'])` | Counts unique (distinct) values in the 'Sales' column. |

## 🧮 Column Math / Derived Columns

| Expression / Method | Meaning |
| --- | --- |
| `(df['ForecastUnits'] / df['ActualUnits'])` | Creates a new column with the ratio of forecast to actual units. |
| `.alias('Forecast_to_Actual')` | Renames the resulting column to a readable name. |

## 🧠 SQL Queries

| Function / Method | Meaning |
| --- | --- |
| `createOrReplaceTempView('table')` | Registers the DataFrame as a temporary SQL table. |
| `spark.sql('...')` | Executes a SQL query against registered temp tables. |
| `SELECT MAX(column) FROM table` | SQL syntax to find the maximum value in a column. |
| `WHERE ActualUnits = (SELECT MAX(ActualUnits)...)` | Filters rows that have the maximum actual units. |

## 🧰🧠 Utility & Optimization Methods

| Function / Method | One-Line Meaning |
| --- | --- |
| `distinct()` | Removes duplicate rows from the DataFrame. |
| `dropDuplicates(['col1', 'col2'])` | Removes duplicate rows based on specific columns. |
| `selectExpr("colA as newCol")` | Selects column(s) using SQL expressions with aliasing. |
| `withColumnRenamed("old", "new")` | Renames a column. |
| `cache()` | Caches the DataFrame in memory for faster access. |
| `persist()` | Stores the DataFrame with a specified storage level (memory, disk, etc.). |
| `repartition(4)` | Redistributes rows across a specified number of partitions. |
| `coalesce(1)` | Reduces the number of partitions, often to write a single output file. |
| `dropna()` | Drops rows with null values (alias for `na.drop()`). |
| `fillna(value)` | Fills null values with the specified value. |
| `isNull()`, `isNotNull()` | Filters rows with null or non-null values. |
| `when(...).otherwise(...)` (from `pyspark.sql.functions`) | Performs conditional logic like SQL `CASE WHEN`. |

## Joins

| Join Type | Why We Need It |
| --- | --- |
| `inner` | Keeps only matching rows (records common to both sides). |
| `left` | Keeps all rows from the left, matching where possible. |
| `right` | Keeps all rows from the right, matching where possible. |
| `outer` | Keeps everything from both sides. |
| `left_semi` | Filters to left rows that exist in the right. |
| `left_anti` | Filters to left rows that don't exist in the right. |
| `crossJoin()` | Produces all combinations (use cautiously: it can explode the row count). |