Sorting, Sampling & Limit in PySpark DataFrames — Efficient Data Retrieval at Scale

At NeoMart, data analysts often ask questions like:

“Give me the top 10 highest-selling products.”
“Fetch a random 1% sample for model testing.”
“Sort all orders by timestamp for fraud analysis.”

Though simple, operations like sort, sample, and limit can get expensive at scale.
This chapter teaches how to use them efficiently on large distributed datasets.

1. Sample DataFrame

from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, "Laptop", 1200),
    (2, "Mouse", 25),
    (3, "Monitor", 220),
    (4, "Keyboard", 75),
    (5, "Headset", 60)
], ["product_id", "name", "price"])

df.show()

Input Data

product_id	name	price
1	Laptop	1200
2	Mouse	25
3	Monitor	220
4	Keyboard	75
5	Headset	60

2. Sorting DataFrames

Sorting is done using:

✔ orderBy() ✔ sort() (alias)

df_sorted = df.orderBy(F.desc("price"))
df_sorted.show()

Output

product_id	name	price
1	Laptop	1200
3	Monitor	220
4	Keyboard	75
5	Headset	60
2	Mouse	25

Story

NeoMart’s fraud-detection team sorts orders by price to catch unusually high transactions.

3. Multi-Column Sorting

df.orderBy(F.desc("price"), F.asc("name")).show()

When keys tie, the next key breaks the tie.

4. Sorting Large Datasets — Performance Tips

✔ Sorting triggers a global shuffle ✔ Use partitioning to reduce sort costs ✔ Use sortWithinPartitions() when global order isn’t required

Example

df_small_sort = df.sortWithinPartitions("price")

Useful for window functions or per-partition ordering.

5. Limit — Fetching Top N Rows

limit() retrieves a fixed number of rows.

df.limit(3).show()

Output

product_id	name	price
1	Laptop	1200
2	Mouse	25
3	Monitor	220

⚠ Important: limit() does not guarantee order. Combine with orderBy() for deterministic results.

6. Top-N Query (Sort + Limit)

df.orderBy(F.desc("price")).limit(2).show()

Output

product_id	name	price
1	Laptop	1200
3	Monitor	220

7. Sampling — Random Subset of Data

Sampling helps NeoMart’s ML team train models faster.

7.1 Random Sampling

df_sample = df.sample(withReplacement=False, fraction=0.4, seed=42)
df_sample.show()

Possible Output

product_id	name	price
3	Monitor	220
5	Headset	60

7.2 Sampling With Replacement

df_replace = df.sample(withReplacement=True, fraction=0.5)

Useful in bootstrapping algorithms.

7.3 Stratified Sampling (sampleBy)

Use when each group must be sampled proportionally.

df_category = spark.createDataFrame([
    ("Electronics", 1),
    ("Electronics", 2),
    ("Fashion", 3),
    ("Fashion", 4)
], ["category", "id"])

fractions = {"Electronics": 0.5, "Fashion": 1.0}

df_stratified = df_category.sampleBy("category", fractions)
df_stratified.show()

Output (varies)

category	id
Electronics	2
Fashion	3
Fashion	4

8. Limit vs Sample — When to Use What?

Requirement	Use
Get fixed top N rows	`limit()`
Need randomness	`sample()`
Need proportional sampling	`sampleBy()`
Top N by specific metric	`orderBy().limit()`

9. Using Sorting, Sampling & Limit in SQL

Sorting

SELECT * FROM products ORDER BY price DESC;

Sampling (Random rows)

SELECT * FROM products TABLESAMPLE(50 PERCENT);

Limit

SELECT * FROM products LIMIT 5;

10. Best Practices

✔ Avoid global sorting on multi-terabyte data ✔ Use sortWithinPartitions() when global order not required ✔ Keep seed fixed for reproducible sampling ✔ Always combine orderBy + limit for stable top-N queries ✔ Avoid sampling huge datasets repeatedly → cache first

Summary

You now know how to:

Sort DataFrames efficiently
Retrieve random or stratified samples
Limit output for fast previews
Optimize sorting and sampling operations
Balance correctness vs performance

Sorting, sampling, and limiting are essential for analytics, ML pipelines, data previewing, and debugging at NeoMart — and now you can use them expertly too.

Next Topic → Partitioning & Bucketing for Large Datasets

1. Sample DataFrame​

Input Data​

2. Sorting DataFrames​

Output​

Story​

3. Multi-Column Sorting​

4. Sorting Large Datasets — Performance Tips​

Example​

5. Limit — Fetching Top N Rows​

Output​

6. Top-N Query (Sort + Limit)​

Output​

7. Sampling — Random Subset of Data​

7.1 Random Sampling​

Possible Output​

7.2 Sampling With Replacement​

7.3 Stratified Sampling (sampleBy)​

Output (varies)​

8. Limit vs Sample — When to Use What?​

9. Using Sorting, Sampling & Limit in SQL​

Sorting​

Sampling (Random rows)​

Limit​

10. Best Practices​

Summary​

1. Sample DataFrame

Input Data

2. Sorting DataFrames

Output

Story

3. Multi-Column Sorting

4. Sorting Large Datasets — Performance Tips

Example

5. Limit — Fetching Top N Rows

Output

6. Top-N Query (Sort + Limit)

Output

7. Sampling — Random Subset of Data

7.1 Random Sampling

Possible Output

7.2 Sampling With Replacement

7.3 Stratified Sampling (sampleBy)

Output (varies)

8. Limit vs Sample — When to Use What?

9. Using Sorting, Sampling & Limit in SQL

Sorting

Sampling (Random rows)

Limit

10. Best Practices

Summary