Databricks DBFS — Internal File System Explained
Welcome back to ShopWave, our fictional retail company.
You’ve built notebooks, set up clusters, and secured access—but now a question pops up:
“Where do we actually store all these files and datasets inside Databricks?”
Enter DBFS — Databricks File System.
🗂️ What Is DBFS?
DBFS is Databricks’ built-in file system, a layer on top of cloud storage (AWS S3, Azure ADLS, or GCP GCS) that makes it look and behave like a local filesystem.
Think of it as:
“A Google Drive inside your Databricks workspace.”
It lets you:
- Read and write files
- Store datasets, models, and other files (notebooks themselves live in the workspace)
- Share files between clusters and notebooks
- Access cloud storage seamlessly
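For example, the dbutils.fs utility exposes these operations directly in a notebook. A minimal sketch (the path and file contents are made up):

# Write a small text file to DBFS (contents are illustrative)
dbutils.fs.put("/FileStore/datasets/hello.txt", "hello from ShopWave", True)  # True = overwrite

# Read it back; head() returns up to the first 64 KB as a string
print(dbutils.fs.head("/FileStore/datasets/hello.txt"))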
🔥 Why DBFS Matters
DBFS is important because:
- Unified access — Any cluster can access the same files.
- Seamless integration — Works with Spark, Python, R, Scala, SQL.
- Persistent storage — Files persist even if clusters are terminated.
- Organized structure — Personal workspace, shared workspace, temporary storage.
ShopWave stores raw sales data, cleaned datasets, ML models, and experiment outputs in DBFS to keep everything organized.
🗂️ DBFS Structure
Here’s how DBFS is organized:
/dbfs
├── /FileStore
│   ├── /datasets
│   ├── /models
│   └── /temp
├── /mnt
│   └── /external_cloud_storage_mounts
└── /tmp
    └── /temporary_files
- /FileStore → User-uploaded files and datasets
- /mnt → Mount points for external cloud storage
- /tmp → Temporary files created during execution
Example: ShopWave uploads their CSV sales file to /FileStore/datasets/sales.csv.
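Once uploaded, the same file is also visible under the local /dbfs mount point shown above, so plain Python file APIs work on clusters where that FUSE mount is available (a quick sketch):

# DBFS is FUSE-mounted at /dbfs on the driver, so standard Python I/O works
with open("/dbfs/FileStore/datasets/sales.csv") as f:
    print(f.readline())  # print the CSV header row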
💻 Accessing DBFS
1️⃣ Using Python / Spark
# Read a CSV file from DBFS (on Databricks, schemeless paths resolve to dbfs:/)
sales_df = spark.read.csv("/FileStore/datasets/sales.csv", header=True, inferSchema=True)
sales_df.show(5)

# Write the DataFrame back to DBFS in Parquet format
sales_df.write.parquet("/FileStore/datasets/sales_parquet")
2️⃣ Using SQL
-- Read a Delta table stored in DBFS
SELECT * FROM delta.`/FileStore/datasets/sales_delta`
3️⃣ Using CLI
# List files
databricks fs ls dbfs:/FileStore/datasets
# Copy local file to DBFS
databricks fs cp local_file.csv dbfs:/FileStore/datasets/
# Remove a file
databricks fs rm dbfs:/FileStore/datasets/old_file.csv
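The same operations are available from a notebook through dbutils.fs, which is convenient for automation (a sketch mirroring the CLI commands above; the backup destination is invented):

# List files
display(dbutils.fs.ls("dbfs:/FileStore/datasets"))

# Copy a file within DBFS (destination folder is illustrative)
dbutils.fs.cp("dbfs:/FileStore/datasets/sales.csv", "dbfs:/FileStore/temp/sales_backup.csv")

# Remove a file
dbutils.fs.rm("dbfs:/FileStore/datasets/old_file.csv")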
🔗 Mounting External Storage
DBFS can mount cloud storage, making it appear as part of the filesystem:
/mnt/s3_sales_data
/mnt/adls_customer_data
ShopWave mounts AWS S3 buckets containing raw sales and inventory data to /mnt, then accesses them through Spark without worrying about bucket paths each time.
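Creating a mount is a one-time operation with dbutils.fs.mount; after that, every cluster in the workspace sees the same path. A minimal sketch, assuming a hypothetical shopwave-raw-sales bucket whose credentials are already configured on the cluster (for example via an instance profile):

# Mount a hypothetical S3 bucket at /mnt/s3_sales_data (skip if already mounted)
dbutils.fs.mount(
    source="s3a://shopwave-raw-sales",   # bucket name is illustrative
    mount_point="/mnt/s3_sales_data",
)

# From here on, the bucket reads like any other DBFS folder
raw_sales = spark.read.csv("/mnt/s3_sales_data/sales.csv", header=True)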
🏢 Real Business Example — ShopWave
Scenario: ShopWave is building a recommendation engine.
- Engineers upload product and sales CSVs to /FileStore/datasets.
- Data scientists read these files from notebooks to train ML models.
- Transformed data is written back as Delta tables in /FileStore/models.
- BI dashboards access aggregated results from the same location.
DBFS ensures all teams work with the same files, avoiding duplication or version conflicts.
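In notebook terms, that flow could look like this (a condensed sketch; the column names are invented for illustration):

# 1. Read the uploaded sales CSV from the shared datasets folder
sales = spark.read.csv("/FileStore/datasets/sales.csv", header=True, inferSchema=True)

# 2. Transform: total revenue per product (product_id and amount are hypothetical columns)
per_product = sales.groupBy("product_id").sum("amount")

# 3. Write the result back as a Delta table for the BI dashboards
per_product.write.format("delta").mode("overwrite").save("/FileStore/models/product_revenue")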
🧠 Quick Tips
- Use /FileStore for shared files within Databricks.
- Use /mnt for mounted cloud storage.
- Use /tmp for temporary files during workflows.
- Always clean up unused files to save storage costs (see the sketch after this list).
- Leverage DBFS commands in notebooks or CLI for automation.
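For the cleanup tip, a small scheduled notebook can prune stale temp files automatically. A sketch, assuming a hypothetical /tmp/shopwave staging folder and noting that recent Databricks runtimes include a modificationTime field (milliseconds since epoch) on entries returned by dbutils.fs.ls:

import time

# Remove anything under /tmp/shopwave older than 7 days (cutoff is arbitrary)
cutoff_ms = (time.time() - 7 * 24 * 3600) * 1000
for f in dbutils.fs.ls("/tmp/shopwave"):
    if f.modificationTime < cutoff_ms:
        dbutils.fs.rm(f.path, True)  # True = recursive delete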
🏁 Quick Summary
- DBFS is Databricks’ internal file system, providing persistent storage for datasets, models, and other files.
- Organizes data in /FileStore, /mnt, and /tmp.
- Allows seamless integration with Python, SQL, R, Scala, Spark, and external cloud storage.
- Critical for collaboration across data engineers, analysts, and scientists.
- Makes workflows more efficient, organized, and scalable.
🚀 Coming Next
👉 **Databricks Pricing — How Clusters, SQL & Jobs Are Charged**