Advanced workflows

This page highlights the optional steps available when training models. The advanced_demo.ipynb notebook demonstrates the commands.
Grid search

Run mlcls-train -g to search a wider range of parameters. The job records metrics for each candidate and stores the best estimator under artefacts/.

To run the grid search on the random-forest or gradient-boosting pipeline, pass the model option:

mlcls-train --model random_forest -g
mlcls-train --model gboost -g
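Under the hood, the -g flag presumably drives something like scikit-learn's GridSearchCV. A minimal sketch of the idea, assuming a random-forest pipeline with an illustrative parameter grid (the project's actual grid and scoring choice may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Every grid candidate is fitted with cross-validation and scored;
# the best estimator is refitted on the full data afterwards.
grid = GridSearchCV(
    pipe,
    param_grid={"rf__n_estimators": [50, 100], "rf__max_depth": [3, None]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
# joblib.dump(grid.best_estimator_, "artefacts/rf.joblib")  # persist the winner
```

GridSearchCV also keeps per-candidate metrics in `grid.cv_results_`, which is the kind of record the command writes out.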
Calibration

After training a model you can calibrate the predicted probabilities:

mlcls-eval --calibrate isotonic

This fits a calibration model on the validation fold and reports the Brier score in the output table. The calibrated estimator is saved with the suffix _calibrated.joblib.
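The isotonic option likely corresponds to scikit-learn-style isotonic regression. A self-contained sketch of calibrating a classifier and scoring it with the Brier score (the base model and data here are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Internal CV folds fit the isotonic mapping on held-out data, so the
# calibration is not learned on the same fit it is correcting.
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="isotonic", cv=3
)
calibrated.fit(X_train, y_train)

prob = calibrated.predict_proba(X_val)[:, 1]
brier = brier_score_loss(y_val, prob)  # mean squared error of the probabilities; lower is better
print(round(brier, 3))
```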
Run the standalone helper to draw reliability plots for both models:

python -m src.calibration

The script reads the saved models and generates *_calibration.png files under artefacts/.
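A reliability plot of this kind can be reproduced with scikit-learn's calibration_curve. A sketch that draws one such plot; the model, data, and output file name are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for scripts
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Bin predictions and compare the mean prediction in each bin with the
# observed fraction of positives; a calibrated model hugs the diagonal.
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
fig, ax = plt.subplots()
ax.plot(mean_pred, frac_pos, marker="o", label="model")
ax.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
ax.set_xlabel("mean predicted probability")
ax.set_ylabel("fraction of positives")
ax.legend()
fig.savefig("logreg_calibration.png")
```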
Fairness checks

Use the evaluation command with --group-col to compute group metrics such as statistical parity and equal opportunity. The summary table includes an equal_opp column showing the ratio of the worst to the best true-positive rate across groups:

mlcls-eval --group-col gender --group-col marital

This command prints parity ratios for each group and stores them in artefacts/group_metrics.csv. summary_metrics.csv records the equal_opp ratio for each model, along with eq_odds, the difference between the true- and false-positive-rate gaps.
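These group metrics are straightforward to compute by hand. A sketch on toy data; the exact definitions below (selection-rate ratio for statistical parity, worst-to-best TPR ratio for equal_opp, TPR gap minus FPR gap for eq_odds) are inferred from the text, not taken from the project's source:

```python
import pandas as pd

df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 1, 0, 0, 1, 1, 0],
    "gender": ["f", "f", "f", "f", "m", "m", "m", "m"],
})

def rates(g):
    """Per-group true-positive rate, false-positive rate, selection rate."""
    tpr = ((g.y_pred == 1) & (g.y_true == 1)).sum() / max((g.y_true == 1).sum(), 1)
    fpr = ((g.y_pred == 1) & (g.y_true == 0)).sum() / max((g.y_true == 0).sum(), 1)
    pos = (g.y_pred == 1).mean()
    return pd.Series({"tpr": tpr, "fpr": fpr, "pos_rate": pos})

per_group = pd.DataFrame({name: rates(g) for name, g in df.groupby("gender")}).T

parity = per_group.pos_rate.min() / per_group.pos_rate.max()   # statistical parity ratio
equal_opp = per_group.tpr.min() / per_group.tpr.max()          # worst-to-best TPR ratio
eq_odds = abs((per_group.tpr.max() - per_group.tpr.min())
              - (per_group.fpr.max() - per_group.fpr.min()))   # TPR gap vs FPR gap
print(parity, equal_opp, eq_odds)
```

A ratio of 1.0 means the groups are treated identically on that metric.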
Set a custom probability cutoff for these metrics with --threshold. When omitted, the tool chooses the cutoff that maximises the Youden J statistic:

mlcls-eval --group-col gender --threshold 0.6
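Youden's J picks the point on the ROC curve where TPR − FPR is largest. A small sketch of how that default cutoff can be computed (the scores and labels are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)  # index maximising Youden's J = TPR - FPR
print(thresholds[best])      # the chosen probability cutoff
```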
Select specific pipelines with --models. Pass multiple names to evaluate only those models:

mlcls-eval --models logreg random_forest svm
The advanced_demo.ipynb notebook walks through these steps and shows the additional plots.
SHAP values

Provide a DataFrame to logreg_coefficients or tree_feature_importances and pass shap_csv_path to store per-feature SHAP values:

from src.feature_importance import logreg_coefficients

shap_df = logreg_coefficients(
    "artefacts/lr.joblib",
    shap_csv_path="artefacts/logreg_shap_values.csv",
    X=X_test,
)

The helper function compute_shap_values creates the table with columns matching the input DataFrame.
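compute_shap_values is project-specific, but for a linear model the SHAP values have a simple closed form when features are treated as independent: coefficient times the feature's deviation from the background mean. A hand-rolled sketch of that table (feature names and data are illustrative, and this is not the project's implementation):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear SHAP: the value of feature j for row i is coef_j * (x_ij - mean_j),
# i.e. how far that feature pushes the log-odds from the background expectation.
shap_values = (X - X.mean()) * model.coef_[0]
print(shap_values.columns.tolist())  # columns match the input DataFrame
```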
SHAP plots

Use plot_shap_summary to visualise these values:

from src.feature_importance import plot_shap_summary

plot_shap_summary(
    "artefacts/lr.joblib",
    X=X_test,
    png_path="artefacts/logreg_shap.png",
)

The image logreg_shap.png appears under artefacts/.
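A summary-style plot can be approximated with plain matplotlib by ranking features on mean absolute SHAP value. The feature names and random values below are placeholders standing in for a table like the CSV described above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder per-feature SHAP values (rows = samples, columns = features).
shap_df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(100, 3)),
    columns=["age", "income", "tenure"],
)

# Mean |SHAP| per feature is the usual global-importance summary.
importance = shap_df.abs().mean().sort_values()
fig, ax = plt.subplots()
ax.barh(importance.index, importance.values)
ax.set_xlabel("mean |SHAP value|")
fig.tight_layout()
fig.savefig("shap_summary.png")
```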
Report artefacts

Gather recent metrics and plots with the report command:

mlcls-report

The tool copies the latest files into report_artifacts/. You can zip the folder and share it with collaborators.
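The copying step can be sketched with the standard library. The file patterns and the demo CSV below are assumptions for illustration, not the tool's actual behaviour:

```python
import shutil
from pathlib import Path

src = Path("artefacts")
dst = Path("report_artifacts")
src.mkdir(exist_ok=True)
dst.mkdir(exist_ok=True)

# Demo input so the sketch is runnable end to end.
(src / "summary_metrics.csv").write_text("model,auc\nlogreg,0.9\n")

# Copy every CSV and PNG artefact, preserving timestamps.
for pattern in ("*.csv", "*.png"):
    for f in src.glob(pattern):
        shutil.copy2(f, dst / f.name)

print(sorted(p.name for p in dst.iterdir()))
```

Zipping the result is then one call: shutil.make_archive("report", "zip", dst).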