Dates & Timestamps in PySpark DataFrames

Working with dates and timestamps is essential for analytics, forecasting, time-series modeling, and demand planning.
PySpark provides powerful built-in functions to extract, transform, manipulate, and aggregate date-related fields.

This guide walks you step-by-step through date extraction, aggregations, SQL queries, and derived metrics.


Loading Data

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dates').getOrCreate()

df = spark.read.csv('/path/to/demand_planning.csv', header=True, inferSchema=True)
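
If you'd rather not rely on inferSchema (which costs an extra pass over the file), you can declare the schema up front. A minimal sketch using the sample data's column names; the file path is a placeholder:

from pyspark.sql.types import StructType, StructField, DateType, StringType, IntegerType

# Explicit schema: skips the extra scan that inferSchema performs and
# guarantees Date is parsed as DateType rather than string.
schema = StructType([
    StructField('Date', DateType(), True),
    StructField('Product', StringType(), True),
    StructField('ForecastUnits', IntegerType(), True),
    StructField('ActualUnits', IntegerType(), True),
    StructField('LeadTimeDays', IntegerType(), True),
])

df = spark.read.csv('/path/to/demand_planning.csv', header=True, schema=schema)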

Preview & Schema

df.show()

Example output:

+----------+-------+-------------+-----------+------------+
|      Date|Product|ForecastUnits|ActualUnits|LeadTimeDays|
+----------+-------+-------------+-----------+------------+
|2023-01-01|   Soap|          100|         95|           3|
|2023-01-02|   Soap|          120|        110|           2|
|2023-01-03|Shampoo|           80|         85|           4|
|2023-01-04|Shampoo|           90|         88|           3|
|2023-02-01|   Soap|          130|        125|           2|
|2023-02-02|Shampoo|           85|         82|           4|
|2023-03-01|   Soap|          140|        135|           3|
|2023-03-05|Shampoo|          100|        102|           3|
+----------+-------+-------------+-----------+------------+

df.printSchema()

Output:

root
|-- Date: date (nullable = true)
|-- Product: string (nullable = true)
|-- ForecastUnits: integer (nullable = true)
|-- ActualUnits: integer (nullable = true)
|-- LeadTimeDays: integer (nullable = true)

Note: depending on your Spark version and CSV options, the Date column may be read as a string instead. You can convert it with to_date, as shown below.
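
A minimal sketch of that conversion, with an explicit format string in case your dates don't follow the default yyyy-MM-dd pattern:

from pyspark.sql.functions import to_date, col

# Parse the string column into a proper DateType.
# The format argument is optional when the data is already yyyy-MM-dd.
df = df.withColumn('Date', to_date(col('Date'), 'yyyy-MM-dd'))
df.printSchema()  # Date should now show as: date (nullable = true)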


Extracting Date Components in PySpark

PySpark provides functions to extract day, month, year, and week numbers.

from pyspark.sql.functions import dayofmonth, month, year, weekofyear

df.select(
    dayofmonth(df['Date']).alias('Day'),
    month(df['Date']).alias('Month'),
    year(df['Date']).alias('Year'),
    weekofyear(df['Date']).alias('WeekOfYear')
).show()

Output:

+---+-----+----+----------+
|Day|Month|Year|WeekOfYear|
+---+-----+----+----------+
|  1|    1|2023|        52|
|  2|    1|2023|         1|
|  3|    1|2023|         1|
|  4|    1|2023|         1|
|  1|    2|2023|         5|
|  2|    2|2023|         5|
|  1|    3|2023|         9|
|  5|    3|2023|         9|
+---+-----+----+----------+

Note: weekofyear follows the ISO standard, so 2023-01-01 (a Sunday) still belongs to week 52 of 2022, and 2023-03-05 falls in week 9 alongside 2023-03-01.

✔️ What This Does

Extracts different date components for time-series analysis, forecasting, and reporting.
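
Beyond these four, pyspark.sql.functions also offers dayofweek, quarter, and date_format. A small sketch (not part of the original walkthrough):

from pyspark.sql.functions import dayofweek, quarter, date_format

df.select(
    df['Date'],
    dayofweek(df['Date']).alias('DayOfWeek'),            # 1 = Sunday ... 7 = Saturday
    quarter(df['Date']).alias('Quarter'),                # 1 through 4
    date_format(df['Date'], 'MMMM').alias('MonthName')   # e.g. 'January'
).show()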


Aggregation by Year

from pyspark.sql.functions import avg

df_with_year = df.withColumn('Year', year(df['Date']))

# .mean() averages every numeric column; the result columns are named avg(<col>)
df_with_year.groupBy('Year') \
    .mean() \
    .select('Year', 'avg(ActualUnits)') \
    .show()

Output:

+----+----------------+
|Year|avg(ActualUnits)|
+----+----------------+
|2023|          102.75|
+----+----------------+

✔️ Meaning

Computes yearly averages—useful for trend analysis and demand forecasting.
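
The same pattern extends to finer grains. For example, a sketch of a month-level average over the same columns:

from pyspark.sql.functions import year, month, avg

# Average actual demand per (Year, Month) bucket, sorted chronologically.
df.groupBy(year('Date').alias('Year'), month('Date').alias('Month')) \
    .agg(avg('ActualUnits').alias('AvgActual')) \
    .orderBy('Year', 'Month') \
    .show()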


Creating Ratios / Derived Metrics

df.select(
    (df['ForecastUnits'] / df['ActualUnits']).alias('Forecast_to_Actual')
).show()

Output:

+------------------+
|Forecast_to_Actual|
+------------------+
|1.0526315789473684|
|1.0909090909090908|
|0.9411764705882353|
|1.0227272727272727|
|              1.04|
|1.0365853658536585|
| 1.037037037037037|
|0.9803921568627451|
+------------------+

✔️ Meaning

Shows forecast accuracy by comparing forecast to actual units: values above 1 indicate over-forecasting, values below 1 indicate under-forecasting.
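
If you want a single accuracy score rather than row-level ratios, one common metric is mean absolute percentage error (MAPE). A sketch, assuming ActualUnits contains no zeros:

from pyspark.sql.functions import abs as abs_, avg

# Absolute percentage error per row, then averaged across the DataFrame.
# A zero in ActualUnits would yield a null APE (Spark returns null on
# division by zero by default), which avg() silently skips.
df.withColumn(
    'APE', abs_(df['ForecastUnits'] - df['ActualUnits']) / df['ActualUnits']
).agg((avg('APE') * 100).alias('MAPE_pct')).show()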


Running SQL Queries with Dates

df.createOrReplaceTempView('demand')

Query: Find Date(s) With Maximum Actual Units

spark.sql("""
    SELECT Date, ActualUnits
    FROM demand
    WHERE ActualUnits = (SELECT MAX(ActualUnits) FROM demand)
""").show()

Output:

+----------+-----------+
|      Date|ActualUnits|
+----------+-----------+
|2023-03-01|        135|
+----------+-----------+

✔️ Meaning

Returns the highest-demand day(s) using SQL subqueries.
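
Spark SQL exposes the same date functions, so grouping by date parts works directly in a query. A sketch against the same temp view:

# Monthly totals via SQL: year() and month() are built into Spark SQL.
spark.sql("""
    SELECT year(Date) AS Year,
           month(Date) AS Month,
           SUM(ActualUnits) AS TotalActual
    FROM demand
    GROUP BY year(Date), month(Date)
    ORDER BY Year, Month
""").show()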


More Aggregations & Statistics

# Note: these imports shadow Python's built-in round, max, and min in this session.
from pyspark.sql.functions import round, max, min

df.select(
    round(avg(df['ActualUnits']), 2).alias('AvgActual'),
    max(df['ForecastUnits']).alias('MaxForecast'),
    min(df['ForecastUnits']).alias('MinForecast')
).show()

Output:

+---------+-----------+-----------+
|AvgActual|MaxForecast|MinForecast|
+---------+-----------+-----------+
|   102.75|        140|         80|
+---------+-----------+-----------+

✔️ Meaning

Computes multiple KPIs in a single query.
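
The same idea works per group. A sketch computing the KPIs per Product in one aggregation pass (aliased imports avoid shadowing Python's built-ins):

from pyspark.sql.functions import round as round_, max as max_, min as min_, avg

df.groupBy('Product').agg(
    round_(avg('ActualUnits'), 2).alias('AvgActual'),
    max_('ForecastUnits').alias('MaxForecast'),
    min_('ForecastUnits').alias('MinForecast')
).show()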


🟦 1-Minute Summary — Dates & Time in PySpark

Code / Expression                                 What It Does
------------------------------------------------  -------------------------------------
dayofmonth(), month(), year(), weekofyear()       Extract components from a Date column
withColumn('Year', year(...))                     Adds a new Year column
groupBy('Year').mean()                            Aggregates metrics by year
ForecastUnits / ActualUnits                       Computes the forecast-to-actual ratio
createOrReplaceTempView()                         Registers the DataFrame for SQL
SELECT ... WHERE ActualUnits = (SELECT MAX(...))  Finds rows with maximum demand
round(avg(), 2), max(), min()                     Summary statistics in one query

Next, we’ll cover Spark SQL basics: registering temp views and running queries.
