Regression is one of the most useful parts of supervised machine learning because it answers a very common engineering question: given some input data, what numeric value should we expect?
A regression model can estimate a house price, predict revenue, forecast product quality, model disease progression, or estimate delivery time. The target is not a label such as approved or rejected. The target is a number that usually exists on a continuous scale.
This post explains regression from a developer's point of view. The goal is not only to know model names, but to understand how to build a practical regression workflow: define the prediction problem, choose a baseline model, detect overfitting, add regularization, compare metrics, and tune the model without fooling yourself.
The Problem
A regression system takes one or more input features and predicts a numeric output.
For example:
- Input: square footage, location score, number of rooms
- Output: estimated house price
Or:
- Input: process temperature, pressure, machine speed
- Output: expected product quality score
Or:
- Input: patient measurements
- Output: expected disease progression value
The practical goal is simple: learn a function that maps inputs to a continuous target.
Historical data
|
v
Feature matrix X + numeric target y
|
v
Train regression model
|
v
Predict numeric value for new input
|
v
Evaluate error and improve the model
A regression model is useful only when it generalizes. A model that performs well on training data but badly on new data is not reliable. Much of regression engineering is about preventing that problem.
What Regression Means in Practice
Regression is a supervised learning technique. That means the training data includes both:
- Features: the input values used for prediction
- Target: the numeric value the model should learn to predict
With one input feature, the problem is called simple regression. With several input features, it is called multiple regression.
Regression can also be linear or non-linear.
| Regression type | What it means | Example |
|---|---|---|
| Simple regression | One feature predicts the target | Square footage predicts house price |
| Multiple regression | Many features predict the target | Size, location, and age predict house price |
| Linear regression | Relationship is modeled as a straight-line combination | Price increases with size |
| Non-linear regression | Relationship needs curves or more complex patterns | Quality changes differently at low and high temperatures |
A good first habit is to ask this before modeling:
Is the target a number, and can prediction error be measured numerically?
If yes, regression may be the right family of algorithms.
Linear Regression as the Baseline
Linear regression is usually the best starting point. It is simple, fast, and interpretable. It assumes the target can be approximated as a weighted sum of the input features.
For one feature, the idea can be written as:
prediction = intercept + slope * feature
For several features, it becomes:
prediction = intercept
           + coefficient_1 * feature_1
           + coefficient_2 * feature_2
           + ...
           + coefficient_n * feature_n
The intercept is the predicted value when all features are zero. Each coefficient describes how much the prediction changes when that feature increases by one unit, assuming the other features stay constant.
That interpretability is one of the reasons linear regression is still valuable even when more complex models are available.
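To make the weighted-sum reading concrete, here is a minimal sketch on a tiny noise-free dataset. The feature values and the true coefficients (3.0, 2.0, -1.0) are made up for illustration. The only point is that the library prediction matches the intercept plus the coefficient-weighted features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: target = 3.0 + 2.0 * feature_1 - 1.0 * feature_2, no noise.
features = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 2.0],
])
target = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1]

model = LinearRegression().fit(features, target)

# Rebuild one prediction by hand: intercept + sum(coefficient_i * feature_i).
new_row = np.array([[10.0, 4.0]])
manual_prediction = model.intercept_ + new_row @ model.coef_
library_prediction = model.predict(new_row)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("Manual vs library:", manual_prediction, library_prediction)
```

Because the data has no noise, the fitted intercept and coefficients recover the values used to generate the target. On real data they are estimates, and the "one unit increase" reading only holds under the assumptions listed in the next section.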
Linear Regression Assumptions
Linear regression works best when these assumptions are reasonable:
- The relationship between features and target is roughly linear.
- Observations are independent from each other.
- Prediction errors have roughly constant variance.
- Errors are approximately normally distributed.
- Input features are not strongly duplicated or highly correlated with each other.
These assumptions do not need to be perfect for every practical use case, but ignoring them can lead to misleading coefficients and weak predictions.
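One of these assumptions, the absence of strongly correlated features, is easy to check before fitting. The sketch below uses synthetic features where `feature_b` is deliberately built from `feature_a`; the names and the 0.9 threshold are illustrative assumptions, not a universal rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: feature_b is deliberately derived from feature_a.
feature_a = rng.normal(0.0, 1.0, size=500)
feature_b = 0.9 * feature_a + rng.normal(0.0, 0.1, size=500)
feature_c = rng.normal(0.0, 1.0, size=500)
X = np.column_stack([feature_a, feature_b, feature_c])
names = ["feature_a", "feature_b", "feature_c"]

# Pairwise Pearson correlation between the feature columns.
correlation = np.corrcoef(X, rowvar=False)

threshold = 0.9  # illustrative cut-off
flagged_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(correlation[i, j]) > threshold
]
print("Highly correlated pairs:", flagged_pairs)
```

A flagged pair does not automatically mean a feature must be dropped. It is a signal that individual coefficients for those features may be unstable and that a regularized model may be worth comparing.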
A Small Linear Regression Example
The following example creates a small synthetic dataset where a numeric target depends mostly on one feature. The variables use delivery-cost names so the code reads like an application problem rather than a math exercise.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(42)
# One feature: delivery distance in arbitrary units.
distance = 2.0 * rng.random((120, 1))
# Target: delivery cost with a clear trend plus noise.
noise = rng.normal(0.0, 1.0, size=(120, 1))
cost = 5.0 + 2.8 * distance + noise
X_train, X_test, y_train, y_test = train_test_split(
    distance,
    cost.ravel(),
    test_size=0.2,
    random_state=42,
)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"Intercept: {regressor.intercept_:.3f}")
print(f"Slope: {regressor.coef_[0]:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R2: {r2:.3f}")
This example demonstrates the basic regression loop:
- Prepare features and target.
- Split data into training and testing sets.
- Fit the model on training data.
- Predict on test data.
- Measure error.
The split is important. Testing on the same data used for training gives an overly optimistic result. The model must be evaluated on data it did not use to learn its coefficients.
Polynomial Regression for Curved Relationships
Plain linear regression can only model a straight-line relationship between each input feature and the target. Some real problems need a curve.
Polynomial regression solves this by creating additional features such as powers of the original feature:
prediction = b0 + b1*x + b2*x^2 + b3*x^3
The model is still linear in its coefficients, but the input representation is richer. This lets the model fit curved patterns.
The danger is overfitting. A very high polynomial degree can bend too much and follow noise instead of the real pattern.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.default_rng(42)
# One feature with a curved relationship to the target.
load = 6.0 * rng.random((150, 1)) - 3.0
noise = rng.normal(0.0, 1.0, size=(150, 1))
latency = 1.5 + 0.8 * load + 0.6 * load**2 + noise
X_train, X_test, y_train, y_test = train_test_split(
    load,
    latency.ravel(),
    test_size=0.2,
    random_state=42,
)
# Fit one pipeline per polynomial degree and compare test error.
for degree in [1, 2, 3, 10]:
    model = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_model", LinearRegression()),
    ])
    model.fit(X_train, y_train)
    test_predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
    print(f"Degree {degree}: test RMSE = {rmse:.3f}")
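A degree of 1 is ordinary linear regression. Higher degrees can capture more complex shapes, but the best degree is not the highest one by default. The best degree is the one that lowers error on held-out data without making the model unstable. When several degrees are compared, pick the winner with a validation set or cross-validation so the final test score stays honest.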
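One way to choose the degree without touching the test set is to cross-validate on the training portion only. The following is a minimal sketch under the same synthetic setup as above; the dictionary name `cv_rmse_by_degree` and the choice of five folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
load = 6.0 * rng.random((150, 1)) - 3.0
noise = rng.normal(0.0, 1.0, size=(150, 1))
latency = (1.5 + 0.8 * load + 0.6 * load**2 + noise).ravel()

# Hold the test set back; only the training portion is used to pick the degree.
X_train, X_test, y_train, y_test = train_test_split(
    load, latency, test_size=0.2, random_state=42
)

cv_rmse_by_degree = {}
for degree in [1, 2, 3, 10]:
    model = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_model", LinearRegression()),
    ])
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    )
    # Scores are negated MSE values, so flip the sign before taking the root.
    cv_rmse_by_degree[degree] = float(np.sqrt(-scores.mean()))
    print(f"Degree {degree}: cross-validated RMSE = {cv_rmse_by_degree[degree]:.3f}")

best_degree = min(cv_rmse_by_degree, key=cv_rmse_by_degree.get)
print("Selected degree:", best_degree)
```

After the degree is selected this way, the test set is used once, at the end, to report how the chosen model performs.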
Regularization: Controlling Model Complexity
Regularization adds a penalty to the training objective. Instead of only minimizing prediction error, the model also tries to keep coefficients under control.
This helps when:
- The dataset has many features compared with the number of rows.
- Several features are strongly correlated.
- The model is flexible enough to memorize noise.
- Polynomial features make the input space much larger.
Ridge Regression
Ridge regression uses L2 regularization. It discourages large coefficients by adding a penalty based on squared coefficient values.
Ridge usually keeps all features in the model, but shrinks their coefficients. It is often useful when many features contain some signal and when features are correlated.
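The shrinking effect can be observed directly by refitting ridge with increasing penalty strength. This is a small sketch on synthetic, deliberately correlated features; the `alpha` values are arbitrary illustration points, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Two strongly correlated hypothetical features; only feature_a drives the target.
feature_a = rng.normal(size=200)
feature_b = feature_a + rng.normal(scale=0.1, size=200)
X = np.column_stack([feature_a, feature_b])
y = 3.0 * feature_a + rng.normal(scale=1.0, size=200)

alphas = [0.01, 1.0, 100.0, 10000.0]
coefficient_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    coefficient_norms.append(float(np.linalg.norm(ridge.coef_)))
    print(f"alpha={alpha:>8}: coefficients={ridge.coef_}, norm={coefficient_norms[-1]:.4f}")
```

The coefficient norm gets smaller as `alpha` grows, but the coefficients do not become exactly zero. Both features stay in the model.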
Lasso Regression
Lasso regression uses L1 regularization. It can shrink some coefficients all the way to zero.
That makes lasso useful when you suspect only a smaller subset of features is important. In practice, lasso can act as a built-in feature selection method.
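The feature selection behavior can be seen on a synthetic dataset where only two of ten features influence the target. The data, the feature count, and `alpha=0.5` are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Ten hypothetical features; only the first two carry signal.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# Indices of features whose coefficient was not driven to zero.
selected = np.flatnonzero(lasso.coef_)
print("Coefficients:", np.round(lasso.coef_, 3))
print("Selected feature indices:", selected)
```

The surviving coefficients are also shrunk toward zero compared with the values used to generate the data. That is the price of the L1 penalty, and it is one reason to tune `alpha` rather than accept a default.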
Elastic Net Regression
Elastic Net combines L1 and L2 regularization. It is useful when you want some feature selection behavior while still handling correlated features better than plain lasso.
| Model | Penalty style | Practical behavior |
|---|---|---|
| Ordinary least squares | No regularization | Simple baseline, but can overfit with complex features |
| Ridge | L2 | Shrinks coefficients, usually keeps all features |
| Lasso | L1 | Can set some coefficients to zero |
| Elastic Net | L1 + L2 | Balances coefficient shrinkage and feature selection |
Here is a practical comparison pattern using a standard scikit-learn pipeline.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(42)
# Several related features for a generic forecasting problem.
rows = 220
feature_a = rng.normal(10, 2, size=rows)
feature_b = feature_a * 0.7 + rng.normal(0, 1, size=rows)
feature_c = rng.normal(5, 3, size=rows)
feature_d = rng.normal(0, 1, size=rows)
X = np.column_stack([feature_a, feature_b, feature_c, feature_d])
y = 15 + 2.0 * feature_a - 1.2 * feature_c + rng.normal(0, 2, size=rows)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)
candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
# Every candidate gets the same scaling step and the same metrics.
for name, estimator in candidates.items():
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", estimator),
    ])
    pipeline.fit(X_train, y_train)
    predicted = pipeline.predict(X_test)
    mse = mean_squared_error(y_test, predicted)
    print(f"\n{name}")
    print(f"RMSE: {np.sqrt(mse):.3f}")
    print(f"MAE: {mean_absolute_error(y_test, predicted):.3f}")
    print(f"R2: {r2_score(y_test, predicted):.3f}")
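Scaling is included because regularized linear models penalize coefficient size, and coefficient size depends on the unit of each feature. Without consistent scales, a feature measured in thousands needs only a tiny coefficient and is barely penalized, while a feature measured in fractions needs a large coefficient and is penalized heavily.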
Combining Polynomial Features with Regularization
Polynomial features are useful for non-linear patterns, but they also increase model complexity quickly. A practical pattern is to combine polynomial feature generation with scaling and ridge regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
rng = np.random.default_rng(42)
machine_speed = 2.0 * rng.random((140, 1))
noise = rng.normal(0.0, 1.0, size=(140, 1))
quality_score = 4.0 + 3.0 * machine_speed - 0.8 * machine_speed**2 + noise
X_train, X_test, y_train, y_test = train_test_split(
    machine_speed,
    quality_score.ravel(),
    test_size=0.2,
    random_state=42,
)
model = Pipeline([
    ("polynomial_features", PolynomialFeatures(degree=3, include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=0.1)),
])
model.fit(X_train, y_train)
predicted = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predicted))
r2 = r2_score(y_test, predicted)
print(f"Polynomial Ridge RMSE: {rmse:.3f}")
print(f"Polynomial Ridge R2: {r2:.3f}")
This pipeline has three responsibilities:
- PolynomialFeatures creates curved input terms.
- StandardScaler normalizes the generated features.
- Ridge learns a regularized regression model.
This is a clean pattern because preprocessing and model training are packaged together. The same transformations are applied during training and prediction.
Robust and Quantile Regression
Standard linear regression can be sensitive to outliers. A small number of unusual data points can pull the fitted line in the wrong direction.
Robust regression reduces the influence of outliers. One common option is Huber regression, which behaves more like squared error for small mistakes and more like absolute error for large mistakes.
Quantile regression answers a different question. Instead of predicting the average target value, it predicts a selected quantile. For example:
- A 0.1 quantile model estimates a lower expected boundary.
- A 0.5 quantile model estimates the median.
- A 0.9 quantile model estimates an upper expected boundary.
This is useful when the spread of possible outcomes matters, not only the average.
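Quantile models are usually trained by minimizing the pinball loss, which charges over-prediction and under-prediction at different rates depending on the quantile. The helper below, `pinball_loss`, is a hypothetical name written for this sketch; it follows the standard definition of that loss.

```python
import numpy as np


def pinball_loss(actual, predicted, quantile):
    """Average pinball (quantile) loss for a single quantile level."""
    error = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    # Under-prediction (error > 0) costs quantile * error;
    # over-prediction (error < 0) costs (1 - quantile) * |error|.
    return float(np.mean(np.maximum(quantile * error, (quantile - 1.0) * error)))


actual = [1.0, 2.0, 3.0]
predicted = [3.0, 3.0, 3.0]  # every prediction is at or above the actual value

low_quantile_loss = pinball_loss(actual, predicted, 0.1)
high_quantile_loss = pinball_loss(actual, predicted, 0.9)

print(f"Loss at quantile 0.1: {low_quantile_loss:.3f}")
print(f"Loss at quantile 0.9: {high_quantile_loss:.3f}")
```

The same over-predictions are expensive for the 0.1 quantile and cheap for the 0.9 quantile. That asymmetry is what pushes a low-quantile model down and a high-quantile model up.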
import numpy as np
from sklearn.linear_model import HuberRegressor, QuantileRegressor
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(42)
input_value = 2.0 * rng.random((120, 1))
noise = rng.normal(0.0, 1.0, size=120)
target_value = 4.0 + 3.0 * input_value.ravel() + noise
X_train, X_test, y_train, y_test = train_test_split(
    input_value,
    target_value,
    test_size=0.2,
    random_state=42,
)
huber = HuberRegressor(epsilon=1.35, max_iter=100)
huber.fit(X_train, y_train)
print("Huber sample predictions:", huber.predict(X_test[:3]))
for quantile in [0.1, 0.5, 0.9]:
    # alpha=0.0 turns off the default L1 penalty (alpha=1.0), which can
    # otherwise shrink the slope to zero and produce constant predictions.
    quantile_model = QuantileRegressor(quantile=quantile, alpha=0.0, solver="highs")
    quantile_model.fit(X_train, y_train)
    print(f"Quantile {quantile} sample predictions:", quantile_model.predict(X_test[:3]))
Use robust regression when outliers should not dominate the fitted model. Use quantile regression when you need to understand different parts of the conditional distribution, not just the mean prediction.
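The difference between ordinary and robust fitting is easiest to see when outliers are actually present. The sketch below corrupts the last ten points of a synthetic line (true slope 2.0) and compares the fitted slopes. The size and position of the corruption are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Clean linear relationship: y = 1.0 + 2.0 * x plus small noise.
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)

# Corrupt the ten points with the largest x values.
y_corrupted = y.copy()
y_corrupted[-10:] += 40.0

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y_corrupted)
huber = HuberRegressor(max_iter=1000).fit(X, y_corrupted)

print(f"True slope:  2.000")
print(f"OLS slope:   {ols.coef_[0]:.3f}")
print(f"Huber slope: {huber.coef_[0]:.3f}")
```

Huber regression is not immune to outliers, but the linear penalty on large residuals limits how far a few extreme points can pull the fit.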
How Regression Models Learn
A regression algorithm needs a way to find good parameters. Two important ideas are ordinary least squares and gradient descent.
Ordinary Least Squares
Ordinary least squares, often shortened to OLS, fits a linear regression model by minimizing the sum of squared differences between actual and predicted values.
Conceptually:
error = actual_value - predicted_value
squared_error = error * error
objective = sum of squared_error for all training rows
OLS has a closed-form solution as long as the features are not perfectly collinear, so that the matrix X^T X can be inverted. This makes it efficient and elegant for standard linear regression.
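As a sketch of what the closed-form solution looks like, the code below solves the normal equations with NumPy and compares the result with scikit-learn. The synthetic data and the variable names (`design`, `beta_normal`, `beta_lstsq`) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 4.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Add a column of ones so the intercept is learned like any other coefficient.
design = np.column_stack([np.ones(len(X)), X])

# Normal equations: (design^T design) beta = design^T y.
beta_normal = np.linalg.solve(design.T @ design, design.T @ y)

# Least-squares solver: same answer, better behaved when columns are nearly collinear.
beta_lstsq, *_ = np.linalg.lstsq(design, y, rcond=None)

reference = LinearRegression().fit(X, y)
print("Normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
print("scikit-learn:    ", np.r_[reference.intercept_, reference.coef_])
```

In practice a least-squares solver is usually preferred over forming and inverting the matrix directly, because it degrades more gracefully when features are close to collinear.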
Gradient Descent
Gradient descent is an iterative optimization approach. It starts with initial parameter values, calculates how the loss changes with respect to those parameters, and updates them in the direction that reduces the loss.
Start with initial parameters
Repeat until the loss stops improving:
    Make predictions
    Calculate the loss
    Calculate the gradient of the loss
    Move parameters opposite the gradient
The learning rate controls the update size.
- If it is too high, the optimization can overshoot good values.
- If it is too low, training can be unnecessarily slow.
Gradient descent is especially important when the model is too complex for a simple closed-form solution.
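The loop described above can be sketched for one-feature linear regression with mean squared error as the loss. The learning rate of 0.1 and the fixed 2000 steps are assumptions that happen to work for this synthetic data; they are not general defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2.0 * rng.random(100)
y = 5.0 + 2.8 * x + rng.normal(0.0, 0.5, size=100)

# Start with initial parameters.
intercept, slope = 0.0, 0.0
learning_rate = 0.1

for step in range(2000):
    # Make predictions and measure the signed error.
    predicted = intercept + slope * x
    error = predicted - y

    # Gradient of the mean squared error with respect to each parameter.
    gradient_intercept = 2.0 * error.mean()
    gradient_slope = 2.0 * (error * x).mean()

    # Move parameters opposite the gradient.
    intercept -= learning_rate * gradient_intercept
    slope -= learning_rate * gradient_slope

# Closed-form least-squares fit for comparison (polyfit returns slope first).
closed_form_slope, closed_form_intercept = np.polyfit(x, y, deg=1)

print(f"Gradient descent: intercept={intercept:.4f}, slope={slope:.4f}")
print(f"Closed form:      intercept={closed_form_intercept:.4f}, slope={closed_form_slope:.4f}")
```

For plain linear regression both approaches land on the same parameters. Gradient descent earns its place when no closed form exists or when the dataset is too large to solve in one step.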
Evaluation Metrics for Regression
A regression model should never be judged from one number alone. Different metrics highlight different behavior.
Mean Squared Error
Mean squared error, or MSE, averages squared prediction errors.
MSE = average((actual - predicted)^2)
Because errors are squared, large mistakes are punished heavily. The downside is that the unit is squared, so the number may be harder to explain to business users.
Root Mean Squared Error
Root mean squared error, or RMSE, is the square root of MSE.
RMSE = sqrt(MSE)
RMSE is often easier to interpret than MSE because it has the same unit as the target.
Mean Absolute Error
Mean absolute error, or MAE, averages the absolute prediction error.
MAE = average(abs(actual - predicted))
MAE is easy to understand and less sensitive to large outliers than MSE and RMSE.
R2 Score
R2 describes how much target variance is explained by the model compared with a simple baseline that always predicts the mean target value.
A higher R2 is usually better, but it should not be used alone. R2 can look acceptable while error values are still too large for the business problem.
Adjusted R2
On training data, R2 never decreases when more features are added, even when those features are not useful. Adjusted R2 penalizes unnecessary features and is more useful when comparing models with different numbers of predictors.
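The metrics above can be worked through by hand on four made-up predictions and checked against scikit-learn. The adjusted R2 line uses the common formula 1 - (1 - R2) * (n - 1) / (n - p - 1), and assumes the model used a single predictor (p = 1).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 12.0])

error = actual - predicted                      # [1, 0, -1, -3]
mse = np.mean(error**2)                         # 11 / 4 = 2.75
rmse = np.sqrt(mse)                             # about 1.658
mae = np.mean(np.abs(error))                    # 5 / 4 = 1.25

ss_residual = np.sum(error**2)                  # 11
ss_total = np.sum((actual - actual.mean())**2)  # 20
r2 = 1.0 - ss_residual / ss_total               # 0.45

n_rows = len(actual)
n_predictors = 1  # assumption: the predictions came from a one-feature model
adjusted_r2 = 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)  # 0.175

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f} adjusted R2={adjusted_r2:.3f}")

# The library functions agree with the manual calculations.
print(mean_squared_error(actual, predicted), mean_absolute_error(actual, predicted), r2_score(actual, predicted))
```

The single error of 3 contributes 9 of the 11 squared-error units but only 3 of the 5 absolute-error units, which is why RMSE reacts more strongly to large mistakes than MAE.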
Cross-Validation
A single train-test split can be lucky or unlucky. Cross-validation gives a more stable view by training and evaluating the model multiple times on different folds.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_diabetes(return_X_y=True)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
folds = KFold(n_splits=5, shuffle=True, random_state=42)
negative_mse_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=folds,
    scoring="neg_mean_squared_error",
)
rmse_scores = np.sqrt(-negative_mse_scores)
print(f"RMSE per fold: {rmse_scores}")
print(f"Average RMSE: {rmse_scores.mean():.3f}")
For time series data, standard k-fold cross-validation can leak future information into training. In that case, use a split strategy where training data always comes before validation data.
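scikit-learn provides `TimeSeriesSplit` for this kind of ordering. The sketch below uses twelve placeholder observations, assumed to be sorted by time, and prints which indices land in each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder observations, assumed to be sorted from oldest to newest.
observations = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
splits = list(splitter.split(observations))

for fold, (train_index, validation_index) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_index.tolist()} validation={validation_index.tolist()}")
```

Each fold trains on an expanding window of earlier rows and validates on the rows that follow it. A `TimeSeriesSplit` instance can be passed as the `cv` argument of `cross_val_score` in place of `KFold`.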
A Practical Regression Workflow
A reliable regression workflow should be boring and repeatable. The following example uses a built-in medical regression dataset and compares several models using the same evaluation function.
The target is numeric, so the problem is regression. The workflow is the important part:
- Load features and target.
- Split into train and test sets.
- Build pipelines.
- Train each model.
- Compare RMSE, MAE, and R2.
- Tune the best candidate.
- Inspect model behavior.
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
np.random.seed(42)
medical_data = load_diabetes()
X = medical_data.data
y = medical_data.target
feature_names = medical_data.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

def build_scaled_pipeline(estimator):
    # Scale the features, then fit the given estimator.
    return Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", estimator),
    ])

def evaluate_regressor(name, model, X_train, X_test, y_train, y_test):
    # Fit on the training split, then score both splits to expose overfitting.
    model.fit(X_train, y_train)
    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)
    train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
    test_mae = mean_absolute_error(y_test, test_predictions)
    test_r2 = r2_score(y_test, test_predictions)
    return {
        "model": name,
        "train_rmse": train_rmse,
        "test_rmse": test_rmse,
        "test_mae": test_mae,
        "test_r2": test_r2,
    }

# Tree-based models do not need scaling; they share the pipeline only to keep the comparison uniform.
models = {
    "Linear Regression": build_scaled_pipeline(LinearRegression()),
    "Ridge": build_scaled_pipeline(Ridge(alpha=1.0)),
    "Lasso": build_scaled_pipeline(Lasso(alpha=0.1)),
    "Elastic Net": build_scaled_pipeline(ElasticNet(alpha=0.1, l1_ratio=0.5)),
    "SVR": build_scaled_pipeline(SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)),
    "Random Forest": build_scaled_pipeline(
        RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    ),
    "Gradient Boosting": build_scaled_pipeline(
        GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
    ),
}
rows = []
for model_name, model in models.items():
    rows.append(evaluate_regressor(model_name, model, X_train, X_test, y_train, y_test))
comparison = pd.DataFrame(rows).sort_values("test_rmse")
print(comparison)
print("Features:", feature_names)
This code intentionally evaluates training RMSE and test RMSE. If training RMSE is much better than test RMSE, the model may be overfitting. If both are poor, the model may be too simple, the features may not explain the target well, or the data may need better preparation.
Hyperparameter Tuning
After comparing baseline candidates, tune one promising model. Do not tune every possible model endlessly. Start with a reasonable candidate and a small parameter grid.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])
parameter_grid = {
    "regressor__n_estimators": [50, 100, 200],
    "regressor__learning_rate": [0.01, 0.05, 0.1],
    "regressor__max_depth": [2, 3, 4],
}
search = GridSearchCV(
    pipeline,
    parameter_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
test_predictions = best_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
test_r2 = r2_score(y_test, test_predictions)
print("Best parameters:", search.best_params_)
print(f"Tuned test RMSE: {test_rmse:.3f}")
print(f"Tuned test R2: {test_r2:.3f}")
The parameter names use the pipeline step name followed by a double underscore. For example, regressor__max_depth means the max_depth parameter of the estimator inside the regressor pipeline step.
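When it is unclear which names a grid can use, the pipeline can list them. This short sketch rebuilds the same two-step pipeline and filters `get_params()` for the `regressor` step; the variable name `regressor_parameters` is only for illustration.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

# Every tunable name that belongs to the "regressor" step.
regressor_parameters = sorted(
    name for name in pipeline.get_params() if name.startswith("regressor__")
)
print(regressor_parameters)

# The same naming scheme works for setting a value directly.
pipeline.set_params(regressor__max_depth=2)
print(pipeline.named_steps["regressor"].max_depth)
```

A misspelled name in a parameter grid raises an error when the search runs, so checking the list first can save a failed run.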
Interpreting the Result
For linear models, coefficients are often the first thing to inspect. A positive coefficient means the predicted value tends to increase when that feature increases, assuming other features stay fixed. A negative coefficient means the opposite.
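A minimal sketch of that inspection, using the same built-in medical dataset with a scaled ridge model, might look like this. Because the features are standardized inside the pipeline, each coefficient is expressed per standard deviation of its feature, which makes the magnitudes roughly comparable.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

medical_data = load_diabetes()
feature_names = np.array(medical_data.feature_names)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
])
pipeline.fit(medical_data.data, medical_data.target)

# Pull the fitted coefficients out of the final pipeline step.
coefficients = pipeline.named_steps["regressor"].coef_

# Print features from largest to smallest absolute coefficient, keeping the sign.
order = np.argsort(np.abs(coefficients))[::-1]
for index in order:
    print(f"{feature_names[index]}: {coefficients[index]:+.2f}")
```

When features are correlated, individual signs and magnitudes can shift noticeably between fits, so coefficients are best read alongside the correlation structure of the data.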
For tree-based models such as random forest and gradient boosting, feature importance can help show which features the model used heavily.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
medical_data = load_diabetes()
X = medical_data.data
y = medical_data.target
feature_names = np.array(medical_data.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)
model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.05,
    max_depth=2,
    random_state=42,
)
model.fit(X_train, y_train)
# Sort features from most to least important.
importance_order = np.argsort(model.feature_importances_)[::-1]
for index in importance_order:
    name = feature_names[index]
    importance = model.feature_importances_[index]
    print(f"{name}: {importance:.4f}")
Feature importance should not be treated as a perfect explanation of causality. It is a model inspection tool. It tells you what the model used, not what causes the real-world outcome.
Common Mistakes
Training and Testing on the Same Data
This is the fastest way to get misleading results. Always keep a test set that the model does not see during training.
Choosing the Most Complex Model First
Start simple. Linear regression, ridge, or lasso often provide useful baselines. Complex models should earn their place by improving test performance.
Ignoring Feature Scaling
Regularized linear models, support vector regression, and many distance-based techniques are sensitive to feature scale. Use a pipeline with scaling so preprocessing is applied consistently.
Using R2 Alone
R2 is helpful, but it does not tell the whole story. Always inspect error metrics such as RMSE and MAE. A model can explain some variance and still make errors that are too large for production use.
Overusing Polynomial Degrees
Polynomial features can capture curves, but high degrees can create unstable models. Combine polynomial features with regularization and evaluate on test data.
Forgetting Outliers
Outliers can strongly affect ordinary linear regression. If your domain contains unusual but valid cases, compare standard regression with robust alternatives such as Huber regression.
Applying Random Cross-Validation to Time Series
Time-dependent data requires time-aware validation. Do not allow future data to influence training for a past prediction.
Practical Checklist
Use this checklist when building a regression model:
- Confirm the target is numeric and continuous.
- Define what error size is acceptable for the business problem.
- Create a simple baseline model first.
- Split data into training and testing sets before evaluation.
- Use pipelines for preprocessing and modeling.
- Compare RMSE, MAE, and R2.
- Check the gap between training error and test error.
- Add regularization when the model is too flexible or features are correlated.
- Use polynomial features only when there is evidence of non-linear behavior.
- Use cross-validation for a more stable estimate of generalization.
- Tune hyperparameters only after baseline comparison.
- Inspect coefficients or feature importances carefully.
- Treat model interpretation as supporting evidence, not automatic truth.
Conclusion
Regression is the foundation for many prediction systems because it turns historical examples into numeric estimates. Linear regression gives a simple and interpretable baseline. Polynomial regression extends that baseline to curved relationships. Ridge, lasso, and Elastic Net help control complexity and reduce overfitting. Robust and quantile regression handle cases where ordinary mean prediction is not enough.
A practical regression workflow is not about choosing the fanciest algorithm. It is about building a repeatable process: define the target, prepare the data, train simple models first, evaluate with the right metrics, control overfitting, and tune carefully. When that process is followed, regression becomes a reliable tool for solving real prediction problems in software systems.