Skip to main content

Sorting, Sampling & Limit in PySpark DataFrames β€” Efficient Data Retrieval at Scale

At NeoMart, data analysts often ask questions like:

  • β€œGive me the top 10 highest-selling products.”
  • β€œFetch a random 1% sample for model testing.”
  • β€œSort all orders by timestamp for fraud analysis.”

Though simple, operations like sort, sample, and limit can get expensive at scale.
This chapter teaches how to use them efficiently on large distributed datasets.


1. Sample DataFrame​

from pyspark.sql import functions as F

df = spark.createDataFrame([
(1, "Laptop", 1200),
(2, "Mouse", 25),
(3, "Monitor", 220),
(4, "Keyboard", 75),
(5, "Headset", 60)
], ["product_id", "name", "price"])

df.show()

Input Data​

product_idnameprice
1Laptop1200
2Mouse25
3Monitor220
4Keyboard75
5Headset60

2. Sorting DataFrames​

Sorting is done using:

βœ” orderBy() βœ” sort() (alias)

df_sorted = df.orderBy(F.desc("price"))
df_sorted.show()

Output​

product_idnameprice
1Laptop1200
3Monitor220
4Keyboard75
5Headset60
2Mouse25

Story​

NeoMart’s fraud-detection team sorts orders by price to catch unusually high transactions.


3. Multi-Column Sorting​

df.orderBy(F.desc("price"), F.asc("name")).show()

When keys tie, the next key breaks the tie.


4. Sorting Large Datasets β€” Performance Tips​

βœ” Sorting triggers a global shuffle βœ” Use partitioning to reduce sort costs βœ” Use sortWithinPartitions() when global order isn’t required

Example​

df_small_sort = df.sortWithinPartitions("price")

Useful for window functions or per-partition ordering.


5. Limit β€” Fetching Top N Rows​

limit() retrieves a fixed number of rows.

df.limit(3).show()

Output​

product_idnameprice
1Laptop1200
2Mouse25
3Monitor220

⚠ Important: limit() does not guarantee order. Combine with orderBy() for deterministic results.


6. Top-N Query (Sort + Limit)​

df.orderBy(F.desc("price")).limit(2).show()

Output​

product_idnameprice
1Laptop1200
3Monitor220

7. Sampling β€” Random Subset of Data​

Sampling helps NeoMart’s ML team train models faster.

7.1 Random Sampling​

df_sample = df.sample(withReplacement=False, fraction=0.4, seed=42)
df_sample.show()

Possible Output​

product_idnameprice
3Monitor220
5Headset60

7.2 Sampling With Replacement​

df_replace = df.sample(withReplacement=True, fraction=0.5)

Useful in bootstrapping algorithms.


7.3 Stratified Sampling (sampleBy)​

Use when each group must be sampled proportionally.

df_category = spark.createDataFrame([
("Electronics", 1),
("Electronics", 2),
("Fashion", 3),
("Fashion", 4)
], ["category", "id"])

fractions = {"Electronics": 0.5, "Fashion": 1.0}

df_stratified = df_category.sampleBy("category", fractions)
df_stratified.show()

Output (varies)​

categoryid
Electronics2
Fashion3
Fashion4

8. Limit vs Sample β€” When to Use What?​

RequirementUse
Get fixed top N rowslimit()
Need randomnesssample()
Need proportional samplingsampleBy()
Top N by specific metricorderBy().limit()

9. Using Sorting, Sampling & Limit in SQL​

Sorting​

SELECT * FROM products ORDER BY price DESC;

Sampling (Random rows)​

SELECT * FROM products TABLESAMPLE(50 PERCENT);

Limit​

SELECT * FROM products LIMIT 5;

10. Best Practices​

βœ” Avoid global sorting on multi-terabyte data βœ” Use sortWithinPartitions() when global order not required βœ” Keep seed fixed for reproducible sampling βœ” Always combine orderBy + limit for stable top-N queries βœ” Avoid sampling huge datasets repeatedly β†’ cache first


Summary​

You now know how to:

  • Sort DataFrames efficiently
  • Retrieve random or stratified samples
  • Limit output for fast previews
  • Optimize sorting and sampling operations
  • Balance correctness vs performance

Sorting, sampling, and limiting are essential for analytics, ML pipelines, data previewing, and debugging at NeoMart β€” and now you can use them expertly too.


Next Topic β†’ Partitioning & Bucketing for Large Datasets