Skip to main content

One-Line PySpark Function Meanings

🧰 Basic utilities

Function / Method	Meaning
`from pyspark.sql.functions import ...`	Imports built-in functions from PySpark SQL module.
`.select([...])`	Selects specific columns from the DataFrame.
`.show()` after any transformation	Displays the result of the transformation for visual inspection.

📦Data Loading & Preview

Method / Parameter	Meaning
`inferSchema=True`	Automatically detects the data types of columns.
`header=True`	Treats the first row of the file as column headers.
`df.show()`	Displays the first 20 rows of the DataFrame.
`df.head(1)`	Retrieves the first row as a list of Row objects.
`df.printSchema()`	Displays the structure (schema) of the DataFrame.

📅Date Functions

Function / Method	Meaning
`dayofmonth(df['Date'])`	Extracts the day (1–31) from a date.
`month(df['Date'])`	Extracts the month (1–12) from a date.
`year(df['Date'])`	Extracts the year (e.g., 2023) from a date.
`weekofyear(df['Date'])`	Extracts the week number of the year from a date.
`date_format(df['Date'], 'MMM')`	Formats the date into a custom format (e.g., Jan, Feb).

🧮 Aggregations & Metrics

Function / Method	Meaning
`withColumn('Year', year(df['Date']))`	Creates a new column 'Year' derived from the 'Date' column.
`groupBy('Year')`	Groups the rows by 'Year' column.
`.mean()` or `.agg({'col':'mean'})`	Computes average of one or more columns.
`.agg({'Sales':'sum'})`	Computes the total sum for the 'Sales' column.
`.agg({'Volume':'max'})`	Finds the maximum value in the 'Volume' column.
`round(mean(df['Close']), 2)`	Computes the mean and rounds it to 2 decimal places.
`max(df['Volume'])`, `min(df['Volume'])`	Gets maximum or minimum volume across all rows.
`countDistinct(df['Sales'])`	Counts unique (distinct) values in the 'Sales' column.

🧮 Column Math / Derived Columns

Expression / Method	Meaning
`(df['ForecastUnits'] / df['ActualUnits'])`	Creates a new column with ratio of forecast to actual units.
`.alias('Forecast_to_Actual')`	Renames the resulting column to a readable name.

🧠 SQL Queries

Function / Method	Meaning
`createOrReplaceTempView('table')`	Registers DataFrame as a temporary SQL table.
`spark.sql('...')`	Executes SQL query on registered temp table.
`SELECT MAX(column) FROM table`	SQL syntax to find the maximum value in a column.
`WHERE ActualUnits = (SELECT MAX(ActualUnits)...)`	Filters rows that have the maximum actual units.

🧰🧠Utility & Optimization Methods

Function / Method	One-Line Meaning
`distinct()`	Removes duplicate rows from the DataFrame.
`dropDuplicates(['col1', 'col2'])`	Removes duplicate rows based on specific columns.
`selectExpr("colA as newCol")`	Selects column(s) using SQL expressions with aliasing.
`withColumnRenamed("old", "new")`	Renames a column.
`cache()`	Caches the DataFrame in memory for faster access.
`persist()`	Stores DataFrame with a specified storage level (memory, disk, etc.).
`repartition(4)`	Redistributes rows across a specified number of partitions.
`coalesce(1)`	Reduces the number of partitions, often to write a single output file.
`dropna()`	Drops rows with null values (alias for `na.drop()`).
`fillna(value)`	Fills null values with the specified value.
`isNull()`, `isNotNull()`	Filters rows with null or non-null values.
`when(...).otherwise(...)` (from `functions`)	Performs conditional logic like SQL `CASE WHEN`.

Joins

Join Type	Why We Need It
`inner`	Only matching rows (common records)
`left`	Keep all from left, match where possible
`right`	Keep all from right, match where possible
`outer`	Keep everything from both sides
`left_semi`	Filter left rows that exist in right
`left_anti`	Filter left rows that don’t exist in right
`crossJoin()`	All combinations (use cautiously — can explode rows)

🧰 Basic utilities
📦Data Loading & Preview
📅Date Functions
🧮 Aggregations & Metrics
🧮 Column Math / Derived Columns
🧠 SQL Queries
🧰🧠Utility & Optimization Methods
Joins