Databricks DBFS — Internal File System Explained
Welcome back to ShopWave, our fictional retail company.
You’ve built notebooks, set up clusters, and secured access—but now a question pops up:
“Where do we actually store all these files and datasets inside Databricks?”
Enter DBFS — Databricks File System.
🗂️ What Is DBFS?
DBFS is Databricks’ built-in file system, a layer on top of cloud storage (AWS S3, Azure ADLS, or GCP GCS) that makes it look and behave like a local filesystem.
Think of it as:
“A Google Drive inside your Databricks workspace.”
It lets you:
- Read and write files
- Store datasets, models, and other files (notebooks themselves live in the workspace)
- Share files between clusters and notebooks
- Access cloud storage seamlessly
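For example, the dbutils.fs utility exposes these operations directly in a notebook. A minimal sketch (the path and file contents are made up):

# Write a small text file to DBFS (contents are illustrative)
dbutils.fs.put("/FileStore/datasets/hello.txt", "hello from ShopWave", True)  # True = overwrite

# Read it back; head() returns up to the first 64 KB as a string
print(dbutils.fs.head("/FileStore/datasets/hello.txt"))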
🔥 Why DBFS Matters
DBFS is important because:
- Unified access — Any cluster can access the same files.
- Seamless integration — Works with Spark, Python, R, Scala, SQL.
- Persistent storage — Files persist even if clusters are terminated.
- Organized structure — Personal workspace, shared workspace, temporary storage.
ShopWave stores raw sales data, cleaned datasets, ML models, and experiment outputs in DBFS to keep everything organized.
🗂️ DBFS Structure
Here’s how DBFS is organized:
/dbfs
├── /FileStore
│   ├── /datasets
│   ├── /models
│   └── /temp
├── /mnt
│   └── /external_cloud_storage_mounts
└── /tmp
    └── /temporary_files
- /FileStore → User-uploaded files and datasets
- /mnt → Mount points for external cloud storage
- /tmp → Temporary files created during execution
Example: ShopWave uploads their CSV sales file to /FileStore/datasets/sales.csv.
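Once uploaded, the same file is also visible under the local /dbfs mount point shown above, so plain Python file APIs work on clusters where that FUSE mount is available (a quick sketch):

# DBFS is FUSE-mounted at /dbfs on the driver, so standard Python I/O works
with open("/dbfs/FileStore/datasets/sales.csv") as f:
    print(f.readline())  # print the CSV header row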
💻 Accessing DBFS
1️⃣ Using Python / Spark
# Read a CSV file from DBFS (on Databricks, schemeless paths resolve to dbfs:/)
sales_df = spark.read.csv("/FileStore/datasets/sales.csv", header=True, inferSchema=True)
sales_df.show(5)

# Write the DataFrame back to DBFS in Parquet format
sales_df.write.parquet("/FileStore/datasets/sales_parquet")
2️⃣ Using SQL
-- Read a Delta table stored in DBFS
SELECT * FROM delta.`/FileStore/datasets/sales_delta`
3️⃣ Using CLI
# List files
databricks fs ls dbfs:/FileStore/datasets
# Copy local file to DBFS
databricks fs cp local_file.csv dbfs:/FileStore/datasets/
# Remove a file
databricks fs rm dbfs:/FileStore/datasets/old_file.csv
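The same operations are available from a notebook through dbutils.fs, which is convenient for automation (a sketch mirroring the CLI commands above; the backup destination is invented):

# List files
display(dbutils.fs.ls("dbfs:/FileStore/datasets"))

# Copy a file within DBFS (destination folder is illustrative)
dbutils.fs.cp("dbfs:/FileStore/datasets/sales.csv", "dbfs:/FileStore/temp/sales_backup.csv")

# Remove a file
dbutils.fs.rm("dbfs:/FileStore/datasets/old_file.csv")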
🔗 Mounting External Storage
DBFS can mount cloud storage, making it appear as part of the filesystem:
/mnt/s3_sales_data
/mnt/adls_customer_data
ShopWave mounts AWS S3 buckets containing raw sales and inventory data to /mnt, then accesses them through Spark without worrying about bucket paths each time.
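Creating a mount is a one-time operation with dbutils.fs.mount; after that, every cluster in the workspace sees the same path. A minimal sketch, assuming a hypothetical shopwave-raw-sales bucket whose credentials are already configured on the cluster (for example via an instance profile):

# Mount a hypothetical S3 bucket at /mnt/s3_sales_data (skip if already mounted)
dbutils.fs.mount(
    source="s3a://shopwave-raw-sales",   # bucket name is illustrative
    mount_point="/mnt/s3_sales_data",
)

# From here on, the bucket reads like any other DBFS folder
raw_sales = spark.read.csv("/mnt/s3_sales_data/sales.csv", header=True)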
🏢 Real Business Example — ShopWave
Scenario: ShopWave is building a recommendation engine.
- Engineers upload product and sales CSVs to /FileStore/datasets.
- Data scientists read these files from notebooks to train ML models.
- Transformed data is written back as Delta tables in /FileStore/models.
- BI dashboards access aggregated results from the same location.
DBFS ensures all teams work with the same files, avoiding duplication or version conflicts.
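In notebook terms, that flow could look like this (a condensed sketch; the column names are invented for illustration):

# 1. Read the uploaded sales CSV from the shared datasets folder
sales = spark.read.csv("/FileStore/datasets/sales.csv", header=True, inferSchema=True)

# 2. Transform: total revenue per product (product_id and amount are hypothetical columns)
per_product = sales.groupBy("product_id").sum("amount")

# 3. Write the result back as a Delta table for the BI dashboards
per_product.write.format("delta").mode("overwrite").save("/FileStore/models/product_revenue")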
🧠 Quick Tips
- Use /FileStore for shared files within Databricks.
- Use /mnt for mounted cloud storage.
- Use /tmp for temporary files during workflows.
- Always clean up unused files to save storage costs (see the sketch after this list).
- Leverage DBFS commands in notebooks or CLI for automation.
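For the cleanup tip, a small scheduled notebook can prune stale temp files automatically. A sketch, assuming a hypothetical /tmp/shopwave staging folder and noting that recent Databricks runtimes include a modificationTime field (milliseconds since epoch) on entries returned by dbutils.fs.ls:

import time

# Remove anything under /tmp/shopwave older than 7 days (cutoff is arbitrary)
cutoff_ms = (time.time() - 7 * 24 * 3600) * 1000
for f in dbutils.fs.ls("/tmp/shopwave"):
    if f.modificationTime < cutoff_ms:
        dbutils.fs.rm(f.path, True)  # True = recursive delete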
🏁 Quick Summary
- DBFS is Databricks’ internal file system, providing persistent storage for datasets, models, and other files.
- Organizes data in /FileStore, /mnt, and /tmp.
- Allows seamless integration with Python, SQL, R, Scala, Spark, and external cloud storage.
- Critical for collaboration across data engineers, analysts, and scientists.
- Makes workflows more efficient, organized, and scalable.
🚀 Coming Next
👉 **Databricks Pricing — How Clusters, SQL & Jobs Are Charged**