Building Reliable Regression Models with Python

Regression is one of the most useful parts of supervised machine learning because it answers a very common engineering question: given some input data, what numeric value should we expect?

A regression model can estimate a house price, predict revenue, forecast product quality, model disease progression, or estimate delivery time. The target is not a label such as approved or rejected. The target is a number that usually exists on a continuous scale.

This post explains regression from a developer's point of view. The goal is not only to know model names, but to understand how to build a practical regression workflow: define the prediction problem, choose a baseline model, detect overfitting, add regularization, compare metrics, and tune the model without fooling yourself.

The Problem

A regression system takes one or more input features and predicts a numeric output.

For example:

Input: square footage, location score, number of rooms
Output: estimated house price

Or:

Input: process temperature, pressure, machine speed
Output: expected product quality score

Or:

Input: patient measurements
Output: expected disease progression value

The practical goal is simple: learn a function that maps inputs to a continuous target.

Historical data
  |
  v
Feature matrix X + numeric target y
  |
  v
Train regression model
  |
  v
Predict numeric value for new input
  |
  v
Evaluate error and improve the model

A regression model is useful only when it generalizes. A model that performs well on training data but badly on new data is not reliable. Much of regression engineering is about preventing that problem.

What Regression Means in Practice

Regression is a supervised learning technique. That means the training data includes both:

Features: the input values used for prediction
Target: the numeric value the model should learn to predict

If you have one input feature, the problem is called simple regression. If you have several input features, it is called multiple regression.

Regression can also be linear or non-linear.

Regression type	What it means	Example
Simple regression	One feature predicts the target	Square footage predicts house price
Multiple regression	Many features predict the target	Size, location, and age predict house price
Linear regression	Relationship is modeled as a straight-line combination	Price increases with size
Non-linear regression	Relationship needs curves or more complex patterns	Quality changes differently at low and high temperatures

A good first habit is to ask this before modeling:

Is the target a number, and can prediction error be measured numerically?

If yes, regression may be the right family of algorithms.

Linear Regression as the Baseline

Linear regression is usually the best starting point. It is simple, fast, and interpretable. It assumes the target can be approximated as a weighted sum of the input features.

For one feature, the idea can be written as:

prediction = intercept + slope * feature

For several features, it becomes:

prediction = intercept
           + coefficient_1 * feature_1
           + coefficient_2 * feature_2
           + ...
           + coefficient_n * feature_n

The intercept is the predicted value when all features are zero. Each coefficient describes how much the prediction changes when that feature increases by one unit, assuming the other features stay constant.
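As a minimal sketch of the weighted-sum formula above, the snippet below computes one prediction by hand with NumPy. The intercept, coefficients, and feature values are made-up numbers used only for illustration.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula.
intercept = 50.0
coefficients = np.array([0.2, 10.0, -1.5])   # one weight per feature
features = np.array([120.0, 3.0, 8.0])       # a single input row

# prediction = intercept + sum(coefficient_i * feature_i)
prediction = intercept + np.dot(coefficients, features)
print(f"Prediction: {prediction:.1f}")
```

Raising the first feature by one unit while holding the others fixed would change this prediction by exactly its coefficient, 0.2.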

That interpretability is one of the reasons linear regression is still valuable even when more complex models are available.

Linear Regression Assumptions

Linear regression works best when these assumptions are reasonable:

The relationship between features and target is roughly linear.
Observations are independent from each other.
Prediction errors have roughly constant variance.
Errors are approximately normally distributed.
Input features are not strongly duplicated or highly correlated with each other.

These assumptions do not need to be perfect for every practical use case, but ignoring them can lead to misleading coefficients and weak predictions.
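Two of these assumptions are easy to spot-check in code. The sketch below uses synthetic data (an assumption, not one of the datasets in this post) to compare residual spread across the prediction range and to measure how strongly two features are correlated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: feature_b is deliberately built from feature_a.
feature_a = rng.normal(0.0, 1.0, size=200)
feature_b = 0.9 * feature_a + rng.normal(0.0, 0.3, size=200)
X = np.column_stack([feature_a, feature_b])
y = 1.0 + 2.0 * feature_a + rng.normal(0.0, 0.5, size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Residual spread in the lower and upper half of the predictions.
# Very different values hint at non-constant error variance.
order = np.argsort(fitted)
low_spread = residuals[order[:100]].std()
high_spread = residuals[order[100:]].std()

# Pairwise feature correlation; values near 1 or -1 signal duplication.
feature_correlation = np.corrcoef(X, rowvar=False)

print(f"Residual std (low half):  {low_spread:.3f}")
print(f"Residual std (high half): {high_spread:.3f}")
print(f"Correlation a vs b: {feature_correlation[0, 1]:.3f}")
```

These are rough screening checks rather than formal statistical tests. A residual plot usually tells you more than either number.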

A Small Linear Regression Example

The following example creates a small synthetic dataset where a numeric target depends mostly on one feature. The variables are named after a delivery-cost scenario so the code reads like an application problem rather than a math exercise.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# One feature: delivery distance in arbitrary units.
distance = 2.0 * rng.random((120, 1))

# Target: delivery cost with a clear trend plus noise.
noise = rng.normal(0.0, 1.0, size=(120, 1))
cost = 5.0 + 2.8 * distance + noise

X_train, X_test, y_train, y_test = train_test_split(
    distance,
    cost.ravel(),
    test_size=0.2,
    random_state=42,
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

predictions = regressor.predict(X_test)

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Intercept: {regressor.intercept_:.3f}")
print(f"Slope: {regressor.coef_[0]:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R2: {r2:.3f}")

This example demonstrates the basic regression loop:

Prepare features and target.
Split data into training and testing sets.
Fit the model on training data.
Predict on test data.
Measure error.

The split is important. Testing on the same data used for training gives an overly optimistic result. The model must be evaluated on data it did not use to learn its coefficients.
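To make that optimism concrete, here is a small sketch with a fully grown decision tree. The tree is an assumption chosen because it can memorize training rows, and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.random((200, 1)) * 10.0
y = 3.0 * X.ravel() + rng.normal(0.0, 4.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# An unrestricted tree can reproduce every training target exactly.
tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

train_r2 = r2_score(y_train, tree.predict(X_train))
test_r2 = r2_score(y_test, tree.predict(X_test))
print(f"Train R2: {train_r2:.3f}")
print(f"Test R2:  {test_r2:.3f}")
```

The training score looks perfect because the tree has stored the noise. Only the test score says anything about new data.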

Polynomial Regression for Curved Relationships

Plain linear regression can only model a straight-line relationship between the input features and the target. Some real problems need a curve.

Polynomial regression solves this by creating additional features such as powers of the original feature:

prediction = b0 + b1*x + b2*x^2 + b3*x^3

The model is still linear in its coefficients, but the input representation is richer. This lets the model fit curved patterns.
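A quick way to see the richer input representation is to look at what the transformer produces. This sketch expands two sample values of one feature; the values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])

# degree=3 without the bias column yields [x, x^2, x^3] per row.
expanded = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)
print(expanded)
```

Each row now has three columns, and a linear model fitted on them learns one coefficient per power of x.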

The danger is overfitting. A very high polynomial degree can bend too much and follow noise instead of the real pattern.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# One feature with a curved relationship to the target.
load = 6.0 * rng.random((150, 1)) - 3.0
noise = rng.normal(0.0, 1.0, size=(150, 1))
latency = 1.5 + 0.8 * load + 0.6 * load**2 + noise

X_train, X_test, y_train, y_test = train_test_split(
    load,
    latency.ravel(),
    test_size=0.2,
    random_state=42,
)

for degree in [1, 2, 3, 10]:
    model = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_model", LinearRegression()),
    ])

    model.fit(X_train, y_train)
    test_predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, test_predictions))

    print(f"Degree {degree}: test RMSE = {rmse:.3f}")

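A degree of 1 is ordinary linear regression. Higher degrees can capture more complex shapes. The best degree is not the highest one by default. The best degree is the one that improves performance on held-out data without making the model unstable. If you compare many degrees, choose among them with a validation split or cross-validation and keep the test set for the final check. Otherwise the reported test score is itself optimistic.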
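One way to pick a degree without leaning on the test set is cross-validation. The following is a sketch under the same synthetic setup as the example above. The use of 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
load = 6.0 * rng.random((150, 1)) - 3.0
latency = (1.5 + 0.8 * load + 0.6 * load**2).ravel() + rng.normal(0.0, 1.0, size=150)

cv_rmse = {}
for degree in [1, 2, 3, 10]:
    model = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_model", LinearRegression()),
    ])
    # scikit-learn returns negated RMSE so that "higher is better".
    scores = cross_val_score(
        model, load, latency, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[degree] = -scores.mean()
    print(f"Degree {degree}: CV RMSE = {cv_rmse[degree]:.3f}")

best_degree = min(cv_rmse, key=cv_rmse.get)
print("Degree with lowest CV RMSE:", best_degree)
```

Because the synthetic target is quadratic, degree 1 should score clearly worse than degree 2. The higher degrees may land close to degree 2, which is a reason to prefer the simpler model when scores are similar.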

Regularization: Controlling Model Complexity

Regularization adds a penalty to the training objective. Instead of only minimizing prediction error, the model also tries to keep coefficients under control.

This helps when:

The dataset has many features compared with the number of rows.
Several features are strongly correlated.
The model is flexible enough to memorize noise.
Polynomial features make the input space much larger.

Ridge Regression

Ridge regression uses L2 regularization. It discourages large coefficients by adding a penalty based on squared coefficient values.

Ridge usually keeps all features in the model, but shrinks their coefficients. It is often useful when many features contain some signal and when features are correlated.
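The shrinking effect can be observed directly. This sketch fits ridge with three penalty strengths on synthetic data and prints the size of the coefficient vector. The alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
X = rng.normal(0.0, 1.0, size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0.0, 1.0, size=100)

coefficient_norms = {}
for alpha in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Euclidean length of the coefficient vector.
    coefficient_norms[alpha] = np.linalg.norm(ridge.coef_)
    print(f"alpha={alpha}: coefficient norm = {coefficient_norms[alpha]:.3f}")
```

A larger alpha means a stronger penalty, so the norm goes down as alpha goes up, while the coefficients generally stay non-zero.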

Lasso Regression

Lasso regression uses L1 regularization. It can shrink some coefficients all the way to zero.

That makes lasso useful when you suspect only a smaller subset of features is important. In practice, lasso can act as a built-in feature selection method.
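Here is a small sketch of that selection behavior. The data is synthetic and only the first of five features carries signal. The alpha value is deliberately large so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(200, 5))
# Only the first feature carries signal in this synthetic target.
y = 3.0 * X[:, 0] + rng.normal(0.0, 1.0, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Features kept:", np.flatnonzero(lasso.coef_))
```

The surviving coefficient is also pulled below its true value of 3.0, which is the price of the L1 penalty.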

Elastic Net Regression

Elastic Net combines L1 and L2 regularization. It is useful when you want some feature selection behavior while still handling correlated features better than plain lasso.

Model	Penalty style	Practical behavior
Ordinary least squares	No regularization	Simple baseline, but can overfit with complex features
Ridge	L2	Shrinks coefficients, usually keeps all features
Lasso	L1	Can set some coefficients to zero
Elastic Net	L1 + L2	Balances coefficient shrinkage and feature selection

Here is a practical comparison pattern using a standard scikit-learn pipeline.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Several related features for a generic forecasting problem.
rows = 220
feature_a = rng.normal(10, 2, size=rows)
feature_b = feature_a * 0.7 + rng.normal(0, 1, size=rows)
feature_c = rng.normal(5, 3, size=rows)
feature_d = rng.normal(0, 1, size=rows)

X = np.column_stack([feature_a, feature_b, feature_c, feature_d])
y = 15 + 2.0 * feature_a - 1.2 * feature_c + rng.normal(0, 2, size=rows)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

for name, estimator in candidates.items():
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", estimator),
    ])

    pipeline.fit(X_train, y_train)
    predicted = pipeline.predict(X_test)

    mse = mean_squared_error(y_test, predicted)
    print(f"\n{name}")
    print(f"RMSE: {np.sqrt(mse):.3f}")
    print(f"MAE: {mean_absolute_error(y_test, predicted):.3f}")
    print(f"R2: {r2_score(y_test, predicted):.3f}")

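Scaling is included because regularized linear models penalize coefficient size, and coefficient size depends on the units of each feature. Without consistent scales, a feature measured in thousands needs only a tiny coefficient and is barely penalized, while a feature measured in fractions needs a large coefficient and is penalized heavily.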
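The unit dependence can be shown with a short experiment. In this sketch the same synthetic data is fitted twice, once as is and once with the first feature multiplied by 1000. The helper `fit_predict` is a name made up for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(100, 2))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, size=100)

# Same information, but the first feature uses a 1000x smaller unit.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

def fit_predict(features, use_scaler):
    """Fit ridge (optionally after scaling) and predict the training rows."""
    steps = [("scaler", StandardScaler())] if use_scaler else []
    steps.append(("ridge", Ridge(alpha=10.0)))
    return Pipeline(steps).fit(features, y).predict(features)

unscaled_gap = np.abs(fit_predict(X, False) - fit_predict(X_rescaled, False)).max()
scaled_gap = np.abs(fit_predict(X, True) - fit_predict(X_rescaled, True)).max()

print(f"Max prediction change without scaling: {unscaled_gap:.4f}")
print(f"Max prediction change with scaling:    {scaled_gap:.4f}")
```

Without the scaler, a change of units changes the model. With the scaler in the pipeline, the predictions match up to floating-point error.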

Combining Polynomial Features with Regularization

Polynomial features are useful for non-linear patterns, but they also increase model complexity quickly. A practical pattern is to combine polynomial feature generation with scaling and ridge regression.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

machine_speed = 2.0 * rng.random((140, 1))
noise = rng.normal(0.0, 1.0, size=(140, 1))
quality_score = 4.0 + 3.0 * machine_speed - 0.8 * machine_speed**2 + noise

X_train, X_test, y_train, y_test = train_test_split(
    machine_speed,
    quality_score.ravel(),
    test_size=0.2,
    random_state=42,
)

model = Pipeline([
    ("polynomial_features", PolynomialFeatures(degree=3, include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=0.1)),
])

model.fit(X_train, y_train)
predicted = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, predicted))
r2 = r2_score(y_test, predicted)

print(f"Polynomial Ridge RMSE: {rmse:.3f}")
print(f"Polynomial Ridge R2: {r2:.3f}")

This pipeline has three responsibilities:

PolynomialFeatures creates curved input terms.
StandardScaler normalizes the generated features.
Ridge learns a regularized regression model.

This is a clean pattern because preprocessing and model training are packaged together. The same transformations are applied during training and prediction.

Robust and Quantile Regression

Standard linear regression can be sensitive to outliers. A small number of unusual data points can pull the fitted line in the wrong direction.

Robust regression reduces the influence of outliers. One common option is Huber regression, which behaves more like squared error for small mistakes and more like absolute error for large mistakes.
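The switch between the two regimes is easiest to see in the loss function itself. Below is a sketch of the textbook Huber loss. The threshold name `delta` is this example's own. scikit-learn's HuberRegressor exposes a related `epsilon` parameter that is applied to scaled residuals, so the two are not interchangeable.

```python
import numpy as np

def huber_loss(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    absolute = np.abs(error)
    quadratic = 0.5 * absolute**2
    linear = delta * (absolute - 0.5 * delta)
    return np.where(absolute <= delta, quadratic, linear)

errors = np.array([0.5, 1.0, 3.0, 10.0])
print("Half squared error:", 0.5 * errors**2)
print("Huber loss:        ", huber_loss(errors))
```

For the error of 10, half the squared error is 50 while the Huber loss is 9.5, which is why one extreme point pulls the fit far less.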

Quantile regression answers a different question. Instead of predicting the average target value, it predicts a selected quantile. For example:

A 0.1 quantile model estimates a lower expected boundary.
A 0.5 quantile model estimates the median.
A 0.9 quantile model estimates an upper expected boundary.

This is useful when the spread of possible outcomes matters, not only the average.
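Quantile models are trained with the pinball loss, which charges under-prediction and over-prediction at different rates. The sketch below implements it in NumPy with made-up values. The function name `pinball_loss` is this example's own.

```python
import numpy as np

def pinball_loss(actual, predicted, quantile):
    """Average quantile (pinball) loss for one quantile level."""
    error = actual - predicted
    # Positive errors (under-prediction) cost quantile * error,
    # negative errors (over-prediction) cost (1 - quantile) * |error|.
    return np.mean(np.maximum(quantile * error, (quantile - 1.0) * error))

actual = np.array([10.0, 12.0, 15.0, 20.0])
predicted = np.full(4, 13.0)

for quantile in [0.1, 0.5, 0.9]:
    print(f"q={quantile}: loss = {pinball_loss(actual, predicted, quantile):.3f}")
```

At quantile 0.5 the loss is half the mean absolute error, which is why the 0.5 quantile model estimates the median.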

import numpy as np
from sklearn.linear_model import HuberRegressor, QuantileRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

input_value = 2.0 * rng.random((120, 1))
noise = rng.normal(0.0, 1.0, size=120)
target_value = 4.0 + 3.0 * input_value.ravel() + noise

X_train, X_test, y_train, y_test = train_test_split(
    input_value,
    target_value,
    test_size=0.2,
    random_state=42,
)

huber = HuberRegressor(epsilon=1.35, max_iter=100)
huber.fit(X_train, y_train)
print("Huber sample predictions:", huber.predict(X_test[:3]))

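for quantile in [0.1, 0.5, 0.9]:
    # alpha=0.0 turns off the default L1 penalty (alpha=1.0), which would
    # otherwise shrink the slope toward zero and flatten the predictions.
    quantile_model = QuantileRegressor(quantile=quantile, alpha=0.0, solver="highs")
    quantile_model.fit(X_train, y_train)
    print(f"Quantile {quantile} sample predictions:", quantile_model.predict(X_test[:3]))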

Use robust regression when outliers should not dominate the fitted model. Use quantile regression when you need to understand different parts of the conditional distribution, not just the mean prediction.
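The effect of outliers is easier to judge with a controlled experiment. This sketch corrupts five targets in a synthetic dataset with a known slope of 2.0 and compares the slope recovered by ordinary and Huber regression. The size and position of the outliers are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(11)
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)

# Corrupt the last five targets to simulate extreme outliers.
y_with_outliers = y.copy()
y_with_outliers[-5:] += 50.0

X = x.reshape(-1, 1)
ols_slope = LinearRegression().fit(X, y_with_outliers).coef_[0]
huber_slope = HuberRegressor().fit(X, y_with_outliers).coef_[0]

print("True slope: 2.0")
print(f"OLS slope:   {ols_slope:.3f}")
print(f"Huber slope: {huber_slope:.3f}")
```

If the two slopes are close on your own data, outliers are probably not driving the fit. A large gap is a signal to inspect the unusual rows.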

How Regression Models Learn

A regression algorithm needs a way to find good parameters. Two important ideas are ordinary least squares and gradient descent.

Ordinary Least Squares

Ordinary least squares, often shortened to OLS, fits a linear regression model by minimizing the sum of squared differences between actual and predicted values.

Conceptually:

error = actual_value - predicted_value
squared_error = error * error
objective = sum of squared_error for all training rows

OLS has a closed-form solution as long as the feature matrix has full column rank, which means no feature is an exact linear combination of the others. This makes it efficient and elegant for standard linear regression.
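As a sketch of the direct solution, the snippet below solves the least squares problem with NumPy and compares the result with scikit-learn. The data is synthetic, and `X_design` is simply this example's name for the feature matrix with an added column of ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.0, size=(50, 2))
y = 4.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=50)

# Prepend a column of ones so the first parameter acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least squares problem directly. lstsq is numerically safer
# than explicitly inverting X^T X.
parameters, *_ = np.linalg.lstsq(X_design, y, rcond=None)

reference = LinearRegression().fit(X, y)
print("Direct solution:", np.round(parameters, 3))
print("scikit-learn intercept:", round(reference.intercept_, 3))
print("scikit-learn coefficients:", np.round(reference.coef_, 3))
```

Both routes minimize the same objective, so the parameters agree up to floating-point error.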

Gradient Descent

Gradient descent is an iterative optimization approach. It starts with initial parameter values, calculates how the loss changes with respect to those parameters, and updates them in the direction that reduces the loss.

Start with initial parameters
Repeat until the loss stops improving:
    Make predictions
    Calculate the loss
    Calculate the gradient of the loss
    Move parameters opposite the gradient

The learning rate controls the update size.

If it is too high, the optimization can overshoot good values.
If it is too low, training can be unnecessarily slow.

Gradient descent is especially important when the model is too complex for a simple closed-form solution.
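The loop described above can be written in a few lines for a one-feature linear model. This is a minimal sketch on synthetic data with mean squared error as the loss. The learning rate and step count are arbitrary values that happen to work for this data.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(200)
y = 5.0 + 2.8 * x + rng.normal(0.0, 0.1, size=200)

intercept, slope = 0.0, 0.0   # initial parameters
learning_rate = 0.5           # hypothetical value; tune per problem

for step in range(2000):
    predictions = intercept + slope * x
    error = predictions - y
    # Gradients of the mean squared error with respect to each parameter.
    intercept_gradient = 2.0 * error.mean()
    slope_gradient = 2.0 * (error * x).mean()
    # Move opposite the gradient.
    intercept -= learning_rate * intercept_gradient
    slope -= learning_rate * slope_gradient

print(f"Intercept: {intercept:.3f}, slope: {slope:.3f}")
```

For this simple model the loop ends up at the same parameters as the closed-form least squares fit. The value of the iterative approach shows when no closed form exists.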

Evaluation Metrics for Regression

A regression model should never be judged from one number alone. Different metrics highlight different behavior.

Mean Squared Error

Mean squared error, or MSE, averages squared prediction errors.

MSE = average((actual - predicted)^2)

Because errors are squared, large mistakes are punished heavily. The downside is that the unit is squared, so the number may be harder to explain to business users.

Root Mean Squared Error

Root mean squared error, or RMSE, is the square root of MSE.

RMSE = sqrt(MSE)

RMSE is often easier to interpret than MSE because it has the same unit as the target.

Mean Absolute Error

Mean absolute error, or MAE, averages the absolute prediction error.

MAE = average(abs(actual - predicted))

MAE is easy to understand and less sensitive to large outliers than MSE and RMSE.
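The three error metrics above can be computed by hand on a tiny example. The numbers are made up so the arithmetic is easy to follow.

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 12.0])

errors = actual - predicted          # [1, 0, -1, -3]
mse = np.mean(errors**2)             # (1 + 0 + 1 + 9) / 4
rmse = np.sqrt(mse)
mae = np.mean(np.abs(errors))        # (1 + 0 + 1 + 3) / 4

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
```

The single error of 3 contributes 9 of the 11 squared-error units, which is why RMSE (about 1.658) ends up above MAE (1.25).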

R2 Score

R2 describes how much target variance is explained by the model compared with a simple baseline that always predicts the mean target value.

A higher R2 is usually better, but it should not be used alone. R2 can look acceptable while error values are still too large for the business problem.
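The comparison with the mean baseline is visible in the formula. This sketch computes R2 by hand for made-up values. The function name `r2_by_hand` is this example's own.

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 12.0])

def r2_by_hand(actual, predicted):
    """1 - (model squared error) / (mean-baseline squared error)."""
    residual_sum = np.sum((actual - predicted) ** 2)
    total_sum = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - residual_sum / total_sum

mean_baseline = np.full_like(actual, actual.mean())

print(f"Model R2:         {r2_by_hand(actual, predicted):.3f}")
print(f"Mean baseline R2: {r2_by_hand(actual, mean_baseline):.3f}")
```

The baseline scores exactly 0, and a model that does worse than the baseline gets a negative R2.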

Adjusted R2

On training data, R2 never decreases when more features are added to a least-squares model, even when those features are not useful. Adjusted R2 penalizes unnecessary features and is more useful when comparing models with different numbers of predictors.
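scikit-learn does not ship an adjusted R2 metric, so a small helper is a common approach. The sketch below uses the standard formula. The function name and the example numbers are made up for illustration.

```python
def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

# Same raw R2, different numbers of predictors (hypothetical values).
few_features = adjusted_r2(0.80, n_rows=101, n_features=5)
many_features = adjusted_r2(0.80, n_rows=101, n_features=50)

print(f"Adjusted R2 with 5 features:  {few_features:.4f}")
print(f"Adjusted R2 with 50 features: {many_features:.4f}")
```

With the same raw R2 of 0.80, the model that needed 50 features is credited with far less.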

Cross-Validation

A single train-test split can be lucky or unlucky. Cross-validation gives a more stable view by training and evaluating the model multiple times on different folds.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])

folds = KFold(n_splits=5, shuffle=True, random_state=42)
negative_mse_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=folds,
    scoring="neg_mean_squared_error",
)

rmse_scores = np.sqrt(-negative_mse_scores)

print(f"RMSE per fold: {rmse_scores}")
print(f"Average RMSE: {rmse_scores.mean():.3f}")

For time series data, standard k-fold cross-validation can leak future information into training. In that case, use a split strategy where training data always comes before validation data.
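scikit-learn provides TimeSeriesSplit for this case. The sketch below prints the index sets it generates for twelve rows, assuming the rows are already sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows assumed to be sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
splits = list(splitter.split(X))

for fold, (train_index, validation_index) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_index}, validation={validation_index}")
```

The splitter can be passed as the `cv` argument of `cross_val_score` in place of the shuffled `KFold` used above.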

A Practical Regression Workflow

A reliable regression workflow should be boring and repeatable. The following example uses a built-in medical regression dataset and compares several models using the same evaluation function.

The target is numeric, so the problem is regression. The workflow is the important part:

Load features and target.
Split into train and test sets.
Build pipelines.
Train each model.
Compare RMSE, MAE, and R2.
Tune the best candidate.
Inspect model behavior.

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

np.random.seed(42)

medical_data = load_diabetes()
X = medical_data.data
y = medical_data.target
feature_names = medical_data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

def build_scaled_pipeline(estimator):
    return Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", estimator),
    ])

def evaluate_regressor(name, model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)

    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)

    train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
    test_mae = mean_absolute_error(y_test, test_predictions)
    test_r2 = r2_score(y_test, test_predictions)

    return {
        "model": name,
        "train_rmse": train_rmse,
        "test_rmse": test_rmse,
        "test_mae": test_mae,
        "test_r2": test_r2,
    }

models = {
    "Linear Regression": build_scaled_pipeline(LinearRegression()),
    "Ridge": build_scaled_pipeline(Ridge(alpha=1.0)),
    "Lasso": build_scaled_pipeline(Lasso(alpha=0.1)),
    "Elastic Net": build_scaled_pipeline(ElasticNet(alpha=0.1, l1_ratio=0.5)),
    "SVR": build_scaled_pipeline(SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)),
    "Random Forest": build_scaled_pipeline(RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)),
    "Gradient Boosting": build_scaled_pipeline(GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)),
}

rows = []
for model_name, model in models.items():
    rows.append(evaluate_regressor(model_name, model, X_train, X_test, y_train, y_test))

comparison = pd.DataFrame(rows).sort_values("test_rmse")
print(comparison)
print("Features:", feature_names)

This code intentionally evaluates training RMSE and test RMSE. If training RMSE is much better than test RMSE, the model may be overfitting. If both are poor, the model may be too simple, the features may not explain the target well, or the data may need better preparation.

Hyperparameter Tuning

After comparing baseline candidates, tune one promising model. Do not tune every possible model endlessly. Start with a reasonable candidate and a small parameter grid.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

parameter_grid = {
    "regressor__n_estimators": [50, 100, 200],
    "regressor__learning_rate": [0.01, 0.05, 0.1],
    "regressor__max_depth": [2, 3, 4],
}

search = GridSearchCV(
    pipeline,
    parameter_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)

search.fit(X_train, y_train)

best_model = search.best_estimator_
test_predictions = best_model.predict(X_test)

test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
test_r2 = r2_score(y_test, test_predictions)

print("Best parameters:", search.best_params_)
print(f"Tuned test RMSE: {test_rmse:.3f}")
print(f"Tuned test R2: {test_r2:.3f}")

The parameter names use the pipeline step name followed by a double underscore. For example, regressor__max_depth means the max_depth parameter of the estimator inside the regressor pipeline step.
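The same naming rule works outside a grid search. This sketch sets two parameters through the pipeline and then lists the names a parameter grid could use. The values are arbitrary.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

# "<step name>__<parameter name>" reaches into the named step.
pipeline.set_params(regressor__max_depth=2, regressor__n_estimators=50)

inner = pipeline.named_steps["regressor"]
print(inner.max_depth, inner.n_estimators)

# get_params() lists every name that a parameter grid may use.
tunable = [name for name in pipeline.get_params() if name.startswith("regressor__")]
print(tunable[:5])
```

Checking `get_params()` first is a cheap way to catch a misspelled grid key before starting a long search.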

Interpreting the Result

For linear models, coefficients are often the first thing to inspect. A positive coefficient means the predicted value tends to increase when that feature increases, assuming other features stay fixed. A negative coefficient means the opposite.
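As a sketch of coefficient inspection, the snippet below fits a scaled ridge model on the built-in diabetes dataset and prints the coefficients ordered by magnitude. Ridge with alpha=1.0 is an arbitrary choice here; any linear estimator with a `coef_` attribute would work the same way.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

medical_data = load_diabetes()
X, y = medical_data.data, medical_data.target

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X, y)

coefficients = pipeline.named_steps["ridge"].coef_

# Largest absolute coefficients first. Because the features are
# standardized, magnitudes are comparable across features.
for index in np.argsort(np.abs(coefficients))[::-1]:
    print(f"{medical_data.feature_names[index]}: {coefficients[index]:+.2f}")
```

When features are correlated, individual coefficients can be unstable or change sign, so read them together rather than one at a time.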

For tree-based models such as random forest and gradient boosting, feature importance can help show which features the model used heavily.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

medical_data = load_diabetes()
X = medical_data.data
y = medical_data.target
feature_names = np.array(medical_data.feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.05,
    max_depth=2,
    random_state=42,
)
model.fit(X_train, y_train)

importance_order = np.argsort(model.feature_importances_)[::-1]

for index in importance_order:
    name = feature_names[index]
    importance = model.feature_importances_[index]
    print(f"{name}: {importance:.4f}")

Feature importance should not be treated as a perfect explanation of causality. It is a model inspection tool. It tells you what the model used, not what causes the real-world outcome.

Common Mistakes

Training and Testing on the Same Data

This is the fastest way to get misleading results. Always keep a test set that the model does not see during training.

Choosing the Most Complex Model First

Start simple. Linear regression, ridge, or lasso often provide useful baselines. Complex models should earn their place by improving test performance.

Ignoring Feature Scaling

Regularized linear models, support vector regression, and many distance-based techniques are sensitive to feature scale. Use a pipeline with scaling so preprocessing is applied consistently.

Using R2 Alone

R2 is helpful, but it does not tell the whole story. Always inspect error metrics such as RMSE and MAE. A model can explain some variance and still make errors that are too large for production use.

Overusing Polynomial Degrees

Polynomial features can capture curves, but high degrees can create unstable models. Combine polynomial features with regularization and evaluate on test data.

Forgetting Outliers

Outliers can strongly affect ordinary linear regression. If your domain contains unusual but valid cases, compare standard regression with robust alternatives such as Huber regression.

Applying Random Cross-Validation to Time Series

Time-dependent data requires time-aware validation. Do not allow future data to influence training for a past prediction.

Practical Checklist

Use this checklist when building a regression model:

Confirm the target is numeric and continuous.
Define what error size is acceptable for the business problem.
Create a simple baseline model first.
Split data into training and testing sets before evaluation.
Use pipelines for preprocessing and modeling.
Compare RMSE, MAE, and R2.
Check the gap between training error and test error.
Add regularization when the model is too flexible or features are correlated.
Use polynomial features only when there is evidence of non-linear behavior.
Use cross-validation for a more stable estimate of generalization.
Tune hyperparameters only after baseline comparison.
Inspect coefficients or feature importances carefully.
Treat model interpretation as supporting evidence, not automatic truth.

Conclusion

Regression is the foundation for many prediction systems because it turns historical examples into numeric estimates. Linear regression gives a simple and interpretable baseline. Polynomial regression extends that baseline to curved relationships. Ridge, lasso, and Elastic Net help control complexity and reduce overfitting. Robust and quantile regression handle cases where ordinary mean prediction is not enough.

A practical regression workflow is not about choosing the fanciest algorithm. It is about building a repeatable process: define the target, prepare the data, train simple models first, evaluate with the right metrics, control overfitting, and tune carefully. When that process is followed, regression becomes a reliable tool for solving real prediction problems in software systems.