Complex SQL Queries in PySpark
Learn how to write complex SQL queries in PySpark, including joins, subqueries, aggregations, and window functions for advanced analytics in Databricks.
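As a taste of what this covers, here is a minimal sketch, assuming hypothetical `customers` and `orders` tables, that combines a join, an aggregation, and a window function in a single Spark SQL statement:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for real tables.
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
orders = spark.createDataFrame(
    [(1, 100.0), (1, 250.0), (2, 80.0)], ["customer_id", "amount"])

customers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")

# Join + aggregation + window function in one query:
# rank customers by their total spend.
spark.sql("""
    SELECT c.name,
           SUM(o.amount)                             AS total_spend,
           RANK() OVER (ORDER BY SUM(o.amount) DESC) AS spend_rank
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.name
""").show()
```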
Learn how to create PySpark DataFrames from CSV, JSON, Parquet files, and Hive tables using Databricks and Spark best practices.
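A minimal sketch of all four sources; every file path and the Hive table name below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV with a header row, letting Spark infer column types.
df_csv = (spark.read.option("header", True)
          .option("inferSchema", True)
          .csv("/data/sales.csv"))

df_json = spark.read.json("/data/events.json")
df_parquet = spark.read.parquet("/data/metrics.parquet")

# Reading a Hive table requires Hive support on the session
# (enabled by default in Databricks).
df_hive = spark.table("sales_db.transactions")
```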
Learn how to write and run your first PySpark job with a hands-on “Hello World” example, and understand the end-to-end workflow in Spark.
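A minimal “Hello World” job might look like the sketch below; the app name is arbitrary:

```python
from pyspark.sql import SparkSession

# Create a session, build a tiny DataFrame, and trigger an action.
spark = SparkSession.builder.appName("HelloWorld").getOrCreate()

df = spark.createDataFrame([("Hello",), ("World",)], ["word"])
df.select("word").show()   # the action that actually runs the job

spark.stop()
```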
Learn how to handle missing or null data in PySpark DataFrames, including dropping nulls, filling values, conditional replacement, and computing null statistics, with practical Databricks examples.
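A short sketch of those operations on a hypothetical DataFrame containing nulls:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with missing values.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", None), (None, 29)], ["name", "age"])

df.na.drop().show()                                 # drop rows with any null
df.na.fill({"age": 0, "name": "unknown"}).show()    # fill per column
df.na.replace("Alice", "Alicia", subset=["name"]).show()  # replace values

# Count nulls per column as a simple missing-data statistic.
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c)
           for c in df.columns]).show()
```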
Step-by-step guide to install PySpark, set up your development environment, and run your first Spark job for big data processing.
Learn how to use key-value RDDs in Spark with reduceByKey, groupByKey, and aggregate operations, complete with real-world Databricks examples and performance tips.
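A minimal sketch on hypothetical (key, value) pairs; reduceByKey pre-aggregates within each partition before shuffling, which is why it tends to outperform groupByKey:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Per-key sums, combined locally before the shuffle.
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # e.g. [('a', 4), ('b', 6)]

# aggregateByKey builds (sum, count) per key to derive averages.
sum_count = pairs.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # fold a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge accumulators across partitions
print(sum_count.mapValues(lambda p: p[0] / p[1]).collect())
```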
Imagine you’re a data scientist in a high-tech lab, not just a data engineer. Your data isn’t sitting quietly in files; it’s streaming, growing, and changing constantly. You want to predict outcomes, classify users, or group behaviors, all at scale.
Understand the performance differences between PySpark DataFrame API and Spark SQL, with tips on when to use each approach for optimal performance in Databricks.
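One way to see why the two APIs usually perform the same: equivalent queries compile to the same optimized plan under Catalyst. A sketch on a toy DataFrame:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])
df.createOrReplaceTempView("t")

# The same logic via the DataFrame API and via Spark SQL...
api_result = df.groupBy("grp").agg(F.count("*").alias("n"))
sql_result = spark.sql("SELECT grp, COUNT(*) AS n FROM t GROUP BY grp")

# ...produces the same physical plan, which is why performance
# is typically identical; compare the two outputs.
api_result.explain()
sql_result.explain()
```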
Learn the fundamentals of RDDs in Apache Spark, including how to create them, apply transformations, trigger actions, and understand their importance in distributed data processing.
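The create-transform-act cycle in miniature:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 6))            # create
squared = rdd.map(lambda x: x * x)           # transformation (lazy)
evens = squared.filter(lambda x: x % 2 == 0) # another lazy transformation
print(evens.collect())                       # action triggers execution: [4, 16]
```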
Learn how Spark RDD caching and persistence work, why they matter for performance, and how to manage memory effectively in distributed data pipelines.
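A minimal caching sketch; the storage levels are from the standard PySpark API:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * 2)

rdd.cache()    # shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs
rdd.count()    # first action materializes the cache
rdd.sum()      # later actions reuse the cached partitions

rdd.unpersist()                             # release memory when done
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # or spill to disk if memory is tight
```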
Understand the differences between RDDs, DataFrames, and Datasets in PySpark, and learn when to use each for efficient big data processing.
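Note that typed Datasets exist only in Scala and Java; in PySpark you move between RDDs and DataFrames. A sketch of that round trip:

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Low-level RDD of Python objects: flexible, but no Catalyst optimization.
rdd = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=41)])

# DataFrame: schema-aware and optimized; the usual choice in PySpark.
df = spark.createDataFrame(rdd)
df.filter(df.age > 35).show()

# Drop back down when you need arbitrary Python logic per record.
names = df.rdd.map(lambda row: row.name).collect()
```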
Learn the core components of PySpark—SparkSession, SparkContext, and configurations—and how they form the foundation of big data processing.
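A minimal sketch wiring the three together; the app name and the shuffle-partition value are arbitrary examples:

```python
from pyspark.sql import SparkSession

# SparkSession is the single entry point; SparkContext and
# configuration hang off it.
spark = (SparkSession.builder
         .appName("config-demo")
         .config("spark.sql.shuffle.partitions", "64")
         .getOrCreate())

sc = spark.sparkContext                 # lower-level RDD entry point
print(sc.appName, sc.master)
print(spark.conf.get("spark.sql.shuffle.partitions"))
```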
Learn how to create and use PySpark UDFs (User Defined Functions) and UDAFs (User Defined Aggregate Functions) to implement custom logic and aggregations in Spark SQL and DataFrames.
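A sketch of both kinds. PySpark expresses user-defined aggregates as Series-to-scalar pandas UDFs (which require pyarrow); the function names here are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 3.0), ("bob", 4.0), ("alice", 5.0)],
                           ["name", "score"])

# Row-at-a-time UDF for custom scalar logic.
@udf("string")
def shout(s):
    return s.upper() + "!"

# Series-to-scalar pandas UDF acting as a user-defined aggregate.
@pandas_udf("double")
def geo_mean(v: pd.Series) -> float:
    return float(v.prod() ** (1.0 / len(v)))

df.select(shout("name")).show()
df.groupBy("name").agg(geo_mean("score").alias("gmean")).show()
```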
Learn how to use Spark SQL in PySpark by registering temporary views and running SQL queries on DataFrames in Databricks.
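The whole pattern fits in a few lines; the view name `people` is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 41)], ["name", "age"])

# Register the DataFrame as a temporary view, then query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 35").show()
```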
Learn how to use PySpark window functions for ranking, running totals, cumulative sums, and time-based analytics in Databricks.
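A sketch on hypothetical per-store daily sales, computing a rank and a running total over the same window:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("s1", "2024-01-01", 10.0), ("s1", "2024-01-02", 20.0),
     ("s2", "2024-01-01", 5.0)],
    ["store", "day", "amount"])

w = Window.partitionBy("store").orderBy("day")

df.select(
    "store", "day", "amount",
    F.row_number().over(w).alias("rank_in_store"),
    F.sum("amount").over(
        w.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    ).alias("running_total"),
).show()
```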
Complete guide to date and timestamp operations in PySpark, including extracting date components, aggregations, ratios, and SQL queries with real examples.
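A sketch of common extractions on a single hypothetical timestamp:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = (spark.createDataFrame([("2024-03-15 10:30:00",)], ["ts"])
      .withColumn("ts", F.to_timestamp("ts")))

df.select(
    F.year("ts").alias("year"),
    F.month("ts").alias("month"),
    F.dayofweek("ts").alias("dow"),
    F.date_format("ts", "yyyy-MM").alias("year_month"),
    F.datediff(F.current_date(), F.to_date("ts")).alias("days_ago"),
).show()
```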