
Linear Regression - Explained Simply

🚕 Predicting Ride Fares with Linear Regression

Imagine you’re part of a data science team at a ride-sharing company — something like Uber or Lyft.

Your product manager asks:

🧠 “Can we predict the fare price of a ride based on distance, traffic, time of day, and weather?”

This sounds like a natural job for Linear Regression, one of the simplest and most widely used algorithms for predicting continuous values such as prices, distances, or durations.


🧩 Understanding the Problem

Your company has collected millions of historical ride records.
Here’s a small sample of what that data might look like:

| Distance (km) | Time of Day | Traffic Level | Weather | Fare ($) |
|---|---|---|---|---|
| 3 | Morning | Low | Clear | 6.50 |
| 5 | Evening | High | Rainy | 12.00 |
| 10 | Afternoon | Medium | Clear | 18.00 |

Your goal is to train a model that can learn from this data and predict fares for future rides — even those it hasn’t seen before.

🎯 In short: given ride details like distance, traffic, and weather, the model should estimate a fair and accurate fare.

That’s where Linear Regression comes in — it’s a perfect first model to learn and apply.


📘 What Is Linear Regression?

Linear Regression is a supervised learning algorithm used to predict a numeric (continuous) value from one or more input variables (called features).

The intuition is simple:

The output (fare) changes linearly with the inputs (distance, time, traffic, etc.).


📐 The Formula

y = w1·x1 + w2·x2 + ... + wn·xn + b
| Term | Meaning |
|---|---|
| y | The value you want to predict (e.g., fare) |
| x1, x2, ..., xn | The input features (e.g., distance, time, traffic) |
| w1, w2, ..., wn | The weights: how much the prediction changes when each feature increases by one unit |
| b | The bias: the base value when all inputs are zero (like a booking fee) |

So, in our case:

fare = w1*distance + w2*time + w3*traffic + w4*weather + b

How It Works in Practice

Let’s assume your model is trained with these features:

  • x1 = distance (in km)
  • x2 = hour (time of day)
  • x3 = traffic level (numeric scale)
  • x4 = weather (rain = 1, clear = 0)

The trained model will then predict fares using:

fare = w1*distance + w2*hour + w3*traffic + w4*rain + b

This equation becomes your ride fare prediction engine.
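The sample table stores time of day, traffic, and weather as text, while the equation needs numbers. Here is a minimal sketch of how that encoding could look in plain Python. The lookup tables and the `encode_ride` helper are assumptions made for illustration (they happen to reproduce the numeric rows used in the PySpark example), not a fixed rule.

```python
# Hypothetical lookup tables: these numeric codes are assumptions for illustration.
HOUR_BY_TIME_OF_DAY = {"Morning": 8, "Afternoon": 14, "Evening": 18}
TRAFFIC_SCALE = {"Low": 1, "Medium": 2, "High": 3}


def encode_ride(distance_km, time_of_day, traffic_level, weather):
    """Turn one raw ride record into the numeric tuple (x1, x2, x3, x4)."""
    hour = HOUR_BY_TIME_OF_DAY[time_of_day]
    traffic = TRAFFIC_SCALE[traffic_level]
    rain = 1 if weather == "Rainy" else 0
    return (distance_km, hour, traffic, rain)


# The three rows from the sample table, without the fare column.
raw_rides = [
    (3, "Morning", "Low", "Clear"),
    (5, "Evening", "High", "Rainy"),
    (10, "Afternoon", "Medium", "Clear"),
]

encoded = [encode_ride(*ride) for ride in raw_rides]
print(encoded)  # [(3, 8, 1, 0), (5, 18, 3, 1), (10, 14, 2, 0)]
```

A real pipeline would likely take the hour from the ride timestamp rather than from a coarse label, but the idea is the same: every feature must become a number before the model can use it.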

Let’s Code It (with PySpark)

Using PySpark, you can easily train a Linear Regression model that scales to millions of rows.

Here’s a simple example:

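from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Create (or reuse) a Spark session; the app name is an arbitrary placeholder
spark = SparkSession.builder.appName("RideFarePrediction").getOrCreate()

# Sample training data: (distance_km, hour, traffic, rain, fare)
data = [
    (3, 8, 1, 0, 6.5),
    (5, 18, 3, 1, 12.0),
    (10, 14, 2, 0, 18.0),
]
df = spark.createDataFrame(data, ["distance", "hour", "traffic", "rain", "fare"])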

Step 1: Combine Features with VectorAssembler
PySpark’s machine learning models expect all input features to be combined into a single vector column, named features here. That’s exactly what the VectorAssembler does: it merges multiple columns (like distance, hour, etc.) into one feature vector.

assembler = VectorAssembler(
    inputCols=["distance", "hour", "traffic", "rain"],
    outputCol="features"
)
df_features = assembler.transform(df).select("features", "fare")

Step 2: Train the Linear Regression Model

lr = LinearRegression(featuresCol="features", labelCol="fare")
model = lr.fit(df_features)
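To build intuition for what fitting means, here is a minimal sketch of ordinary least squares in plain Python. It assumes a single feature (distance) and uses the closed-form solution. It is not Spark's internal solver, and the helper name `fit_simple_line` is hypothetical.

```python
def fit_simple_line(xs, ys):
    """Least-squares fit of y = w * x + b for one feature (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Distance and fare columns from the three sample rides.
distances = [3, 5, 10]
fares = [6.5, 12.0, 18.0]

w, b = fit_simple_line(distances, fares)
print(f"fare = {w:.2f} * distance + {b:.2f}")  # fare = 1.56 * distance + 2.82
```

With several features the same idea applies: the training step searches for the weights and bias that make the squared gap between predicted and actual fares as small as possible.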

Step 3: Make Predictions

predictions = model.transform(df_features)
predictions.show()
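Once predictions exist, a common next question is how far off they are. Below is a small sketch of root mean squared error (RMSE) in plain Python. The predicted values are made up for illustration and are not output from the model above. In PySpark, `RegressionEvaluator` from `pyspark.ml.evaluation` can compute the same metric on the predictions DataFrame.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual_fares = [6.5, 12.0, 18.0]
# Hypothetical predictions, chosen only to demonstrate the calculation.
hypothetical_predictions = [7.0, 11.0, 18.5]

error = rmse(actual_fares, hypothetical_predictions)
print(round(error, 3))  # 0.707
```

An RMSE of about 0.71 here would mean predictions are typically off by roughly 71 cents.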

Step 4: Inspect the Learned Model

Let’s check what the model learned — the weights (w) and bias (b):

print("Weights (w):", model.coefficients)
print("Bias (b):", model.intercept)

Example Output (illustrative numbers, not the exact result of fitting the three sample rows)

Weights (w): [1.5, 0.2, 2.0, 1.0]
Bias (b): 3.0

So your model equation is:

fare = 1.5*distance + 0.2*hour + 2.0*traffic + 1.0*rain + 3.0

Step 5: Test the Model on a New Ride

Let’s plug in a new ride:

  • Distance: 7 km
  • Hour: 17 (5 PM)
  • Traffic: 2 (Moderate)
  • Rain: 1 (Yes)
fare = 1.5*7 + 0.2*17 + 2.0*2 + 1.0*1 + 3.0
= 10.5 + 3.4 + 4.0 + 1.0 + 3.0
= $21.90

🎉 Predicted Fare: $21.90
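The same arithmetic can be checked with a few lines of plain Python. This is a sketch that assumes the illustrative weights and bias from the example output. The `predict_fare` helper is hypothetical and simply evaluates the model equation.

```python
def predict_fare(weights, features, bias):
    """Evaluate y = w1*x1 + ... + wn*xn + b for one ride."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Illustrative weights and bias from the example output above.
weights = [1.5, 0.2, 2.0, 1.0]  # distance, hour, traffic, rain
bias = 3.0

# New ride: 7 km, 5 PM, moderate traffic, raining.
new_ride = [7, 17, 2, 1]

fare = predict_fare(weights, new_ride, bias)
print(f"${fare:.2f}")  # $21.90
```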

Key Takeaways

  • Linear Regression is the foundation of many advanced ML models.

  • PySpark’s VectorAssembler combines columns into one feature vector for model training.

  • You can read the learned weights to see how each factor moves the price, although weights are only directly comparable when the features share a similar scale.

  • Great for beginners to understand predictive modeling at scale.
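To make the interpretation point concrete, here is a short sketch that splits the example fare into per-feature contributions (weight times value). It assumes the illustrative weights from the example output and the 7 km ride at 5 PM. The variable names are chosen for this example only.

```python
feature_names = ["distance", "hour", "traffic", "rain"]
weights = [1.5, 0.2, 2.0, 1.0]  # illustrative weights from the example output
ride = [7, 17, 2, 1]            # 7 km, 5 PM, moderate traffic, raining

# Contribution of each feature to the fare, in dollars.
contributions = {
    name: round(w * x, 2)
    for name, w, x in zip(feature_names, weights, ride)
}

# Largest contribution first.
ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
for name, amount in ranked:
    print(f"{name}: ${amount:.2f}")
```

For this ride, hour has the smallest weight (0.2) but still adds more to the fare than rain (weight 1.0), because hour takes much larger values. That is why the size of a weight alone does not tell you which factor matters most.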

🧭 1-Minute Recap: Linear Regression

| Section | Details |
|---|---|
| Use Case | Predict ride fare prices based on ride data (distance, time, traffic, weather). |
| Problem Type | Regression – predicting a continuous value (fare). |
| Algorithm | Linear Regression (Supervised Learning) |
| Model Formula | y = w1·x1 + w2·x2 + ... + wn·xn + b |
| Target (y) | Fare (price of the ride) |
| Features (x1...xn) | Distance, Hour, Traffic, Rain |
| Weights (w1...wn) | Importance of each feature |
| Bias (b) | Base fare (booking fee) |
| Goal of Model | Find the best weights and bias to minimize prediction error |
| Training Data Example | (distance=3, hour=8, traffic=1, rain=0, fare=6.5) |
| Learned Model | fare = 1.5*distance + 0.2*hour + 2.0*traffic + 1.0*rain + 3.0 |
| Prediction Example | Distance=7, Hour=17, Traffic=2, Rain=1 |
| Predicted Fare | $21.90 |
| You Learned | How to frame a regression problem, train a PySpark model, and interpret weights & bias |