Sorting, Sampling & Limit in PySpark DataFrames — Efficient Data Retrieval at Scale
At NeoMart, data analysts often ask questions like:
- “Give me the top 10 highest-selling products.”
- “Fetch a random 1% sample for model testing.”
- “Sort all orders by timestamp for fraud analysis.”
Though simple, operations like sort, sample, and limit can get expensive at scale.
This chapter teaches how to use them efficiently on large distributed datasets.
1. Sample DataFrame
from pyspark.sql import functions as F
df = spark.createDataFrame([
    (1, "Laptop", 1200),
    (2, "Mouse", 25),
    (3, "Monitor", 220),
    (4, "Keyboard", 75),
    (5, "Headset", 60)
], ["product_id", "name", "price"])
df.show()
Input Data
| product_id | name | price |
|---|---|---|
| 1 | Laptop | 1200 |
| 2 | Mouse | 25 |
| 3 | Monitor | 220 |
| 4 | Keyboard | 75 |
| 5 | Headset | 60 |
2. Sorting DataFrames
Sorting is done using:
✔ orderBy()
✔ sort() (an alias of orderBy(); equivalent forms are shown after the output below)
df_sorted = df.orderBy(F.desc("price"))
df_sorted.show()
Output
| product_id | name | price |
|---|---|---|
| 1 | Laptop | 1200 |
| 3 | Monitor | 220 |
| 4 | Keyboard | 75 |
| 5 | Headset | 60 |
| 2 | Mouse | 25 |
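Since sort() is just an alias, the calls below are interchangeable. This is a quick sketch using the df defined above, showing different ways to express the same descending sort.
# Three equivalent ways to sort by price in descending order
df.sort(F.desc("price")).show()
df.sort("price", ascending=False).show()
df.orderBy(df.price.desc()).show()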
Story
NeoMart’s fraud-detection team sorts orders by price to catch unusually high transactions.
3. Multi-Column Sorting
df.orderBy(F.desc("price"), F.asc("name")).show()
When the first sort key ties, the next key breaks the tie, as the sketch below shows.
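As a minimal sketch of the tie-breaking behaviour, df_ties below is hypothetical data (not part of the NeoMart catalog) with two products at the same price:
# Hypothetical rows with a deliberate price tie (25 vs 25)
df_ties = spark.createDataFrame([
    (6, "Mouse Pad", 25),
    (7, "Webcam Cover", 25),
    (8, "USB Hub", 40)
], ["product_id", "name", "price"])

# price descending is the primary key; name ascending breaks the 25-vs-25 tie
df_ties.orderBy(F.desc("price"), F.asc("name")).show()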
4. Sorting Large Datasets — Performance Tips
✔ Sorting triggers a global shuffle
✔ Use partitioning to reduce sort costs
✔ Use sortWithinPartitions() when global order isn’t required
Example
df_small_sort = df.sortWithinPartitions("price")
Useful for window functions or per-partition ordering.
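One common pattern, sketched below under the assumption that per-key ordering is all you need, is to repartition by a key and then sort only within each partition; this pays for one shuffle but avoids the extra global range sort that orderBy() would add:
# Shuffle once by key, then sort each partition locally (no global ordering)
df_local = (
    df.repartition("name")            # co-locate rows with the same name
      .sortWithinPartitions("price")  # cheap local sort inside each partition
)
df_local.show()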
5. Limit — Fetching Top N Rows
limit() retrieves a fixed number of rows.
df.limit(3).show()
Output
| product_id | name | price |
|---|---|---|
| 1 | Laptop | 1200 |
| 2 | Mouse | 25 |
| 3 | Monitor | 220 |
⚠ Important:
limit() does not guarantee which rows are returned or their order.
Combine it with orderBy() for deterministic results.
6. Top-N Query (Sort + Limit)
df.orderBy(F.desc("price")).limit(2).show()
Output
| product_id | name | price |
|---|---|---|
| 1 | Laptop | 1200 |
| 3 | Monitor | 220 |
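In recent Spark versions the optimizer usually rewrites this sort-plus-limit pattern into a single top-N operator (TakeOrderedAndProject in the physical plan), so the full dataset is never globally sorted; you can check what your version does with explain():
# Inspect the physical plan; a top-N query typically appears as
# TakeOrderedAndProject rather than a full Sort followed by a Limit
df.orderBy(F.desc("price")).limit(2).explain()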
7. Sampling — Random Subset of Data
Sampling helps NeoMart’s ML team train models faster.
7.1 Random Sampling
df_sample = df.sample(withReplacement=False, fraction=0.4, seed=42)
df_sample.show()
Possible Output
| product_id | name | price |
|---|---|---|
| 3 | Monitor | 220 |
| 5 | Headset | 60 |
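Keep in mind that fraction is a per-row inclusion probability, not an exact row count, so the sample size varies between seeds (and between runs if no seed is set). A quick way to see this:
# The returned row count is approximate; different seeds give different sizes
for s in (1, 2, 3):
    print(s, df.sample(withReplacement=False, fraction=0.4, seed=s).count())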
7.2 Sampling With Replacement
df_replace = df.sample(withReplacement=True, fraction=0.5)
With replacement, the same row can be selected more than once, which is useful in bootstrapping algorithms (sketched below).
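As a rough illustration of bootstrapping (the five iterations and the mean-of-price statistic are arbitrary choices for the sketch, not NeoMart's actual workflow):
# Draw several resamples with replacement and record the mean price of each
bootstrap_means = []
for i in range(5):  # number of iterations is arbitrary here
    resample = df.sample(withReplacement=True, fraction=1.0, seed=i)
    bootstrap_means.append(resample.agg(F.avg("price")).first()[0])
print(bootstrap_means)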
7.3 Stratified Sampling (sampleBy)
Use when each group (stratum) must be sampled at its own fraction.
df_category = spark.createDataFrame([
    ("Electronics", 1),
    ("Electronics", 2),
    ("Fashion", 3),
    ("Fashion", 4)
], ["category", "id"])
fractions = {"Electronics": 0.5, "Fashion": 1.0}
df_stratified = df_category.sampleBy("category", fractions)
df_stratified.show()
Output (varies)
| category | id |
|---|---|
| Electronics | 2 |
| Fashion | 3 |
| Fashion | 4 |
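Because sampleBy() is also probabilistic, pinning the seed makes the stratified sample reproducible across runs (the value 42 below is arbitrary):
# Same fractions, but reproducible thanks to the fixed seed
df_stratified_fixed = df_category.sampleBy("category", fractions, seed=42)
df_stratified_fixed.show()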
8. Limit vs Sample — When to Use What?
| Requirement | Use |
|---|---|
| Get fixed top N rows | limit() |
| Need randomness | sample() |
| Need proportional sampling | sampleBy() |
| Top N by specific metric | orderBy().limit() |
9. Using Sorting, Sampling & Limit in SQL
Sorting
SELECT * FROM products ORDER BY price DESC;
Sampling (Random rows)
SELECT * FROM products TABLESAMPLE(50 PERCENT);
Limit
SELECT * FROM products LIMIT 5;
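To run these statements from PySpark, register the DataFrame as a temporary view under the table name the queries assume (products):
# Expose the DataFrame to Spark SQL
df.createOrReplaceTempView("products")

spark.sql("SELECT * FROM products ORDER BY price DESC").show()
spark.sql("SELECT * FROM products TABLESAMPLE(50 PERCENT)").show()
spark.sql("SELECT * FROM products LIMIT 5").show()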
10. Best Practices
✔ Avoid global sorting on multi-terabyte data
✔ Use sortWithinPartitions() when global order not required
✔ Keep seed fixed for reproducible sampling
✔ Always combine orderBy + limit for stable top-N queries
✔ Avoid sampling huge datasets repeatedly → cache first (see the sketch below)
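A minimal sketch of that last tip (the fractions and seeds below are arbitrary): cache the DataFrame once, draw as many samples as you need, then release the memory.
# Mark df for caching; the first action materializes it, later samples reuse it
df.cache()
sample_a = df.sample(withReplacement=False, fraction=0.4, seed=1)
sample_b = df.sample(withReplacement=False, fraction=0.4, seed=2)
print(sample_a.count(), sample_b.count())  # both counts reuse the cached data
df.unpersist()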
Summary
You now know how to:
- Sort DataFrames efficiently
- Retrieve random or stratified samples
- Limit output for fast previews
- Optimize sorting and sampling operations
- Balance correctness vs performance
Sorting, sampling, and limiting are essential for analytics, ML pipelines, data previewing, and debugging at NeoMart — and now you can use them expertly too.
Next Topic → Partitioning & Bucketing for Large Datasets