API reference¶
The sections below list the main modules together with their public functions and classes.
src.dataprep¶
Utilities for loading and cleaning the Kaggle loan dataset.
- src.dataprep.clean(df: DataFrame) → DataFrame¶
Basic cleaning: drop duplicates/NA and normalise target column.
- src.dataprep.load_raw(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) → DataFrame¶
Return the raw dataset as a DataFrame.
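A minimal usage sketch for this module, assuming the default dataset path exists; only load_raw and clean come from the documented API:

```python
from src.dataprep import clean, load_raw

# Load the raw Kaggle CSV (defaults to data/raw/loan_approval_dataset.csv),
# then drop duplicate/NA rows and normalise the target column.
raw = load_raw()
df = clean(raw)
print(df.shape)
```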
src.features¶
Feature engineering utilities.
- class src.features.FeatureEngineer¶
Encapsulates feature engineering logic.
  - ASSET_COLS = ['residential_assets_value', 'commercial_assets_value', 'luxury_assets_value', 'bank_asset_value']¶
  - MARKET_APR = 0.09¶
  - transform(df: DataFrame) → DataFrame¶
  Return engineered feature DataFrame.
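A short sketch of the class; the no-argument constructor is an assumption (the reference above lists no constructor parameters):

```python
from src.dataprep import clean, load_raw
from src.features import FeatureEngineer

df = clean(load_raw())

# transform returns a new DataFrame with the engineered features;
# ASSET_COLS and MARKET_APR are the class-level constants listed above.
fe = FeatureEngineer()
engineered = fe.transform(df)
print(engineered.columns.tolist())
```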
src.preprocessing¶
ColumnTransformer helpers.
- src.preprocessing.build_preprocessor(num_cols: list[str], cat_cols: list[str]) → ColumnTransformer¶
Return basic preprocessing ColumnTransformer.
- src.preprocessing.make_preprocessor(num_cols: list[str], cat_cols: list[str], bool_cols: list[str] | None = None, *, include_cont: bool = True) → ColumnTransformer¶
Return ColumnTransformer for logistic or tree models.
- src.preprocessing.safe_transform(preprocessor: ColumnTransformer, X_new: DataFrame) → ndarray¶
Transform X_new dropping unseen columns.
- src.preprocessing.validate_prep(prep: ColumnTransformer, X: DataFrame, name: str, check_scale: bool = True) → None¶
Raise if prep produces NaNs or deviates from unit scale.
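A hedged sketch tying these helpers together; the column names are assumptions inferred from the asset columns listed under FeatureEngineer and may need adjusting:

```python
from src.dataprep import clean, load_raw
from src.preprocessing import build_preprocessor, safe_transform, validate_prep

df = clean(load_raw())

# Assumed column names from the Kaggle loan dataset; adjust as needed.
num_cols = ["income_annum", "loan_amount", "cibil_score"]
cat_cols = ["education", "self_employed"]

prep = build_preprocessor(num_cols, cat_cols)
X = df[num_cols + cat_cols]
prep.fit(X)

# Fail fast if the fitted transformer emits NaNs or badly scaled output,
# then transform, silently dropping columns the transformer never saw.
validate_prep(prep, X, name="basic-prep")
X_arr = safe_transform(prep, X)
```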
src.models.logreg¶
- src.models.logreg.build_pipeline(cat_cols: list[str], num_cols: list[str], sampler: SamplerMixin | None = None) → Pipeline¶
Create preprocessing and logistic regression pipeline.
- src.models.logreg.grid_train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → float¶
Grid-search logistic regression and return validation ROC-AUC.
- src.models.logreg.load_data(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) → DataFrame¶
Return cleaned and engineered DataFrame loaded from path.
- src.models.logreg.main(data_path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv'), sampler: SamplerMixin | None = None) → None¶
- src.models.logreg.train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → float¶
Train model on df and return validation ROC-AUC.
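A sketch of the two training entry points; the artefact filename is illustrative, not part of the API:

```python
from pathlib import Path

from src.models.logreg import grid_train_from_df, load_data, train_from_df

df = load_data()  # cleaned and engineered frame from the default CSV

# Single fit with default settings; both calls return validation ROC-AUC.
auc = train_from_df(df, target="loan_status")

# Grid search; passing artefact_path persists the best pipeline.
grid_auc = grid_train_from_df(df, artefact_path=Path("artefacts/logreg.joblib"))
print(f"plain={auc:.3f} grid={grid_auc:.3f}")
```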
src.models.cart¶
- src.models.cart.build_pipeline(cat_cols: list[str], num_cols: list[str], sampler: SamplerMixin | None = None) → Pipeline¶
Create preprocessing and decision-tree pipeline.
- src.models.cart.grid_train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → GridSearchCV¶
Return fitted GridSearchCV and optionally save best model. If artefact_path is provided, the best estimator is persisted.
- src.models.cart.load_data(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) → DataFrame¶
Return cleaned and engineered DataFrame loaded from path.
- src.models.cart.main(data_path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv'), sampler: SamplerMixin | None = None) → None¶
- src.models.cart.train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → float¶
Train model on df and return validation ROC-AUC.
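Note the asymmetry with src.models.logreg: here grid_train_from_df returns the fitted GridSearchCV rather than a score, so the search results stay inspectable. A minimal sketch:

```python
from src.models.cart import grid_train_from_df, load_data

df = load_data()

# The returned GridSearchCV exposes the chosen tree hyper-parameters
# and the cross-validated score directly.
search = grid_train_from_df(df, target="loan_status")
print(search.best_params_)
print(search.best_score_)
```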
src.train¶
- src.train.main(args: list[str] | None = None) → None¶
CLI entry point training the logistic and tree models.
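The CLI can also be driven programmatically; passing an empty list keeps argparse away from sys.argv, assuming every option has a usable default:

```python
from src.train import main

# Train both models with default options, e.g. from a notebook or test.
main([])
```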
src.evaluate¶
- src.evaluate.evaluate_models(df: DataFrame, target: str = 'Loan_Status', group_col: str | None = None, csv_path: Path = PosixPath('artefacts/summary_metrics.csv'), threshold: float | None = None, models: list[str] | None = None) → DataFrame¶
Return nested-CV metrics and write csv_path. threshold sets the probability cutoff used for group metrics; when it is None, the Youden J statistic is used instead. models selects which pipelines to run.
- src.evaluate.main(args: list[str] | None = None) → None¶
CLI entry point evaluating both models on the cleaned dataset.
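A sketch of a programmatic evaluation run. The default target casing is 'Loan_Status', which may differ from the cleaned frame's column name, so the target is passed explicitly here as an assumption:

```python
from src.evaluate import evaluate_models
from src.models.logreg import load_data

df = load_data()

# Nested-CV metrics for the default set of pipelines, also written to
# csv_path as a side effect. threshold=None means the Youden J statistic
# picks the probability cutoff used for the group metrics.
metrics = evaluate_models(df, target="loan_status", threshold=None)
print(metrics)
```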
src.predict¶
- src.predict.main(args: list[str] | None = None) → None¶
CLI entry point applying a trained model to new data.
src.calibration¶
- src.calibration.calibrate_model(estimator: ClassifierMixin, X: DataFrame, y: Series, method: str = 'sigmoid') → CalibratedClassifierCV¶
Return fitted calibration wrapper for estimator.
- src.calibration.main(args: list[str] | None = None) → None¶
CLI entry point calibrating saved models.
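A self-contained sketch of calibrate_model using synthetic data and a plain scikit-learn classifier as stand-ins; in practice the estimator and frames would come from the pipelines above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

from src.calibration import calibrate_model

# Synthetic stand-in data; real usage would use the loan dataset splits.
X_arr, y_arr = make_classification(n_samples=500, random_state=42)
X = pd.DataFrame(X_arr)
y = pd.Series(y_arr)

est = LogisticRegression(max_iter=1000).fit(X, y)

# Wrap the fitted estimator in a sigmoid (Platt-scaling) calibrator.
calibrated = calibrate_model(est, X, y, method="sigmoid")
print(calibrated.predict_proba(X)[:2])
```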
src.utils¶
- src.utils.dedup_pairs(old: Sequence[tuple], new: Sequence[tuple]) → list[tuple]¶
Merge old and new lists of 2-tuples, dropping duplicates.
- src.utils.is_binary_numeric(series: Series) → bool¶
Return True if numeric series contains only 0/1 values.
- src.utils.prefix(label: str) → str¶
Return the substring before '__' if present, else an empty string.
- src.utils.set_seeds(seed: int = 42) → None¶
Seed Python, NumPy and PYTHONHASHSEED.
- src.utils.zeros_like(index: Index) → Series¶
Return a zero-filled Series aligned with index.
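A quick tour of the helpers; the expected values in the comments follow the docstrings above and are assumptions where the docstring leaves detail open:

```python
import pandas as pd

from src.utils import dedup_pairs, is_binary_numeric, prefix, set_seeds, zeros_like

set_seeds(42)  # seeds random, NumPy and PYTHONHASHSEED in one call

pairs = dedup_pairs([("a", 1), ("b", 2)], [("b", 2), ("c", 3)])
# expected (assuming an order-preserving merge): [('a', 1), ('b', 2), ('c', 3)]

flags = pd.Series([0, 1, 1, 0])
print(is_binary_numeric(flags))       # expected: True

print(prefix("num__income_annum"))    # expected: 'num'
print(prefix("income_annum"))         # expected: '' (no '__' present)

baseline = zeros_like(flags.index)    # zero-filled Series on the same index
```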
src.summary¶
- src.summary.dataset_summary(df: DataFrame, target: str = 'Loan_Status') → str¶
Return a short dataset overview string.
- src.summary.main(args: list[str] | None = None) → None¶
CLI entry point printing dataset statistics.
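Finally, a one-liner sketch of the summary helper; the target is passed explicitly here on the assumption that the cleaned frame uses a lower-case column name:

```python
from src.models.logreg import load_data
from src.summary import dataset_summary

df = load_data()
print(dataset_summary(df, target="loan_status"))
```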