API reference

The sections below list the main modules. Each module entry lists its available functions and classes.

src.dataprep: Utilities for loading and cleaning the Kaggle loan dataset.

src.dataprep.clean(df: DataFrame) → DataFrame

Basic cleaning: drop duplicate and NA rows, then normalise the target column.

src.dataprep.load_raw(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) → DataFrame

Return the raw dataset as a DataFrame.
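
The documented behaviour of clean can be sketched in plain pandas. This is an illustration, not the actual implementation; the target column name 'loan_status' and its string labels are assumptions.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of src.dataprep.clean: drop duplicates/NA, normalise target."""
    out = df.drop_duplicates().dropna().copy()
    # Assumed target column and labels; the real dataset may use others.
    out["loan_status"] = (
        out["loan_status"].astype(str).str.strip().str.lower()
        .map({"approved": 1, "rejected": 0})
    )
    return out

raw = pd.DataFrame({
    "income": [50_000, 50_000, 60_000, None],
    "loan_status": [" Approved", " Approved", "Rejected", "Approved"],
})
cleaned = clean(raw)  # one duplicate and one NA row removed
print(len(cleaned), cleaned["loan_status"].tolist())  # → 2 [1, 0]
```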

src.features: Feature engineering utilities.

class src.features.FeatureEngineer

Encapsulates feature engineering logic.

ASSET_COLS = ['residential_assets_value', 'commercial_assets_value', 'luxury_assets_value', 'bank_asset_value']
MARKET_APR = 0.09
transform(df: DataFrame) → DataFrame

Return engineered feature DataFrame.
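
A minimal sketch of what transform might derive from the class constants: a total-assets column summed over ASSET_COLS and a loan-to-assets ratio. Only the constants come from the source; the engineered features shown (total_assets, loan_to_assets, and the loan_amount input column) are assumptions.

```python
import pandas as pd

ASSET_COLS = ["residential_assets_value", "commercial_assets_value",
              "luxury_assets_value", "bank_asset_value"]

class FeatureEngineer:
    """Sketch only: the real transform may build different features."""

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        # Aggregate all asset columns into one wealth measure.
        out["total_assets"] = out[ASSET_COLS].sum(axis=1)
        # Hypothetical feature: loan amount relative to total assets.
        out["loan_to_assets"] = out["loan_amount"] / out["total_assets"]
        return out

df = pd.DataFrame({
    "residential_assets_value": [100.0],
    "commercial_assets_value": [50.0],
    "luxury_assets_value": [25.0],
    "bank_asset_value": [25.0],
    "loan_amount": [100.0],
})
feat = FeatureEngineer().transform(df)
print(feat["total_assets"].iloc[0], feat["loan_to_assets"].iloc[0])  # → 200.0 0.5
```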

src.preprocessing: ColumnTransformer helpers.

src.preprocessing.build_preprocessor(num_cols: list[str], cat_cols: list[str]) → ColumnTransformer

Return a basic preprocessing ColumnTransformer.

src.preprocessing.make_preprocessor(num_cols: list[str], cat_cols: list[str], bool_cols: list[str] | None = None, *, include_cont: bool = True) → ColumnTransformer

Return a ColumnTransformer suited to logistic or tree models.
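
A plausible shape for build_preprocessor, shown as a sketch: scale numeric columns and one-hot encode categoricals. The actual transformers used by the project are not documented here and may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(num_cols: list[str], cat_cols: list[str]) -> ColumnTransformer:
    # Scale numerics; one-hot encode categoricals, tolerating categories
    # unseen at fit time.
    return ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

X = pd.DataFrame({"income": [1.0, 2.0, 3.0], "education": ["grad", "not_grad", "grad"]})
prep = build_preprocessor(["income"], ["education"])
Xt = prep.fit_transform(X)
print(Xt.shape)  # 1 scaled column + 2 one-hot columns → (3, 3)
```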

src.preprocessing.safe_transform(preprocessor: ColumnTransformer, X_new: DataFrame) → ndarray

Transform X_new, dropping columns the preprocessor has not seen.

src.preprocessing.validate_prep(prep: ColumnTransformer, X: DataFrame, name: str, check_scale: bool = True) → None

Raise if prep produces NaNs or deviates from unit scale.
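
The two documented checks (no NaNs, roughly unit scale) could be sketched like this; the tolerance and the exact checks performed by the real function are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

def validate_prep(prep, X, name, check_scale=True):
    """Sketch: fail fast if the preprocessor output looks wrong."""
    Xt = np.asarray(prep.fit_transform(X), dtype=float)
    if np.isnan(Xt).any():
        raise ValueError(f"{name}: transform produced NaNs")
    if check_scale and not np.allclose(Xt.std(axis=0), 1.0, atol=0.1):
        raise ValueError(f"{name}: columns deviate from unit scale")

X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})
prep = ColumnTransformer([("num", StandardScaler(), ["a"])])
validate_prep(prep, X, "demo")  # passes silently: scaled output has std 1
print("ok")
```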

src.models.logreg.build_pipeline(cat_cols: list[str], num_cols: list[str], sampler: SamplerMixin | None = None) → Pipeline

Create a pipeline combining preprocessing and logistic regression.

src.models.logreg.grid_train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → float

Grid-search logistic regression and return validation ROC-AUC.
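
The build-then-grid-search pattern can be sketched as follows. The pipeline structure, the column names, and the C grid are all assumptions for illustration; the project's real search space is not documented here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(cat_cols, num_cols):
    prep = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    return Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(size=200),
    "education": rng.choice(["grad", "not_grad"], size=200),
})
y = (df["income"] > 0).astype(int)  # toy target for the demo

# Hypothetical grid over regularisation strength, scored by ROC-AUC.
grid = GridSearchCV(build_pipeline(["education"], ["income"]),
                    {"clf__C": [0.1, 1.0, 10.0]}, cv=3, scoring="roc_auc")
grid.fit(df, y)
print(round(grid.best_score_, 2))
```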

src.models.logreg.load_data(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) → DataFrame

Return the cleaned and engineered DataFrame loaded from path.

src.models.logreg.main(data_path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv'), sampler: SamplerMixin | None = None) → None

CLI entry point training the logistic-regression model.

src.models.logreg.train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → float

Train model on df and return validation ROC-AUC.

src.models.cart.build_pipeline(cat_cols: list[str], num_cols: list[str], sampler: SamplerMixin | None = None) → Pipeline

Create a pipeline combining preprocessing and a decision tree.

src.models.cart.grid_train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → GridSearchCV

Return fitted GridSearchCV and optionally save best model.

If artefact_path is provided, the best estimator is persisted.
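
The documented flow (fit a GridSearchCV, then persist the best estimator when a path is given) can be sketched as below. The max_depth grid is an assumption, and the real code may serialise with joblib rather than the pickle module shown here.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.normal(size=200)})
y = (X["x"] > 0).astype(int)  # toy target for the demo

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [2, 4]}, cv=3, scoring="roc_auc")
grid.fit(X, y)

# Persist the best estimator, mirroring the artefact_path behaviour.
artefact_path = Path(tempfile.mkdtemp()) / "best_cart.pkl"  # hypothetical location
with artefact_path.open("wb") as fh:
    pickle.dump(grid.best_estimator_, fh)
print(artefact_path.exists())
```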

src.models.cart.load_data(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) → DataFrame

Return the cleaned and engineered DataFrame loaded from path.

src.models.cart.main(data_path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv'), sampler: SamplerMixin | None = None) → None

CLI entry point training the decision-tree model.

src.models.cart.train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) → float

Train model on df and return validation ROC-AUC.

src.train.main(args: list[str] | None = None) → None

CLI entry point training the logistic and tree models.

src.evaluate.evaluate_models(df: DataFrame, target: str = 'Loan_Status', group_col: str | None = None, csv_path: Path = PosixPath('artefacts/summary_metrics.csv'), threshold: float | None = None, models: list[str] | None = None) → DataFrame

Return nested-CV metrics and write csv_path.

The threshold argument sets the probability cutoff used for group metrics; when it is None, the cutoff maximising the Youden J statistic is used instead. The models argument selects which pipelines to run.
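
Selecting a cutoff by the Youden J statistic (J = TPR − FPR) can be sketched with sklearn's roc_curve; this is the standard construction, though the project's exact implementation is not shown here:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_prob):
    """Return the probability cutoff maximising J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    return thresholds[np.argmax(tpr - fpr)]

y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.4, 0.35, 0.8, 0.9])
t = youden_threshold(y_true, y_prob)
preds = (y_prob >= t).astype(int)  # classify at the chosen cutoff
print(t, (preds == y_true).mean())
```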

src.evaluate.main(args: list[str] | None = None) → None

CLI entry point evaluating both models on the cleaned dataset.

src.predict.main(args: list[str] | None = None) → None

CLI entry point applying a trained model to new data.

src.calibration.calibrate_model(estimator: ClassifierMixin, X: DataFrame, y: Series, method: str = 'sigmoid') → CalibratedClassifierCV

Return fitted calibration wrapper for estimator.
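
The wrapper presumably delegates to sklearn's CalibratedClassifierCV; a minimal usage sketch with the default 'sigmoid' (Platt scaling) method, on synthetic data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy target for the demo

# Platt scaling ("sigmoid"); "isotonic" is the usual alternative method.
calib = CalibratedClassifierCV(LogisticRegression(), method="sigmoid", cv=3)
calib.fit(X, y)
proba = calib.predict_proba(X)[:, 1]  # calibrated positive-class probabilities
print(proba.shape)
```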

src.calibration.main(args: list[str] | None = None) → None

CLI entry point calibrating saved models.

src.utils.dedup_pairs(old: Sequence[tuple], new: Sequence[tuple]) → list[tuple]

Merge old and new lists of 2-tuples, dropping duplicates.
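
A likely reading of this helper, sketched in plain Python: keep the first occurrence of each pair in order. Whether the real function preserves order is an assumption.

```python
def dedup_pairs(old, new):
    """Sketch: merge two lists of 2-tuples, keeping first occurrences in order."""
    seen, merged = set(), []
    for pair in list(old) + list(new):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged

print(dedup_pairs([("a", 1), ("b", 2)], [("b", 2), ("c", 3)]))
# → [('a', 1), ('b', 2), ('c', 3)]
```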

src.utils.is_binary_numeric(series: Series) → bool

Return True if the series is numeric and contains only 0/1 values.

src.utils.prefix(label: str) → str

Return the substring before '__' if present, otherwise an empty string.

src.utils.set_seeds(seed: int = 42) → None

Seed Python's random module, NumPy, and PYTHONHASHSEED.

src.utils.zeros_like(index: Index) → Series

Return a zero-filled Series aligned with index.
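
Plausible one-liner implementations of the three small helpers above, shown together as a sketch (the real bodies may differ in details such as dtype or NaN handling):

```python
import pandas as pd

def is_binary_numeric(series: pd.Series) -> bool:
    """Sketch: True only for numeric series whose values are all 0 or 1."""
    return pd.api.types.is_numeric_dtype(series) and set(series.dropna().unique()) <= {0, 1}

def prefix(label: str) -> str:
    """Sketch: part before '__' (e.g. 'num__income' -> 'num'), else ''."""
    return label.split("__", 1)[0] if "__" in label else ""

def zeros_like(index: pd.Index) -> pd.Series:
    """Sketch: zero-filled Series aligned with the given index."""
    return pd.Series(0.0, index=index)

print(is_binary_numeric(pd.Series([0, 1, 1])),   # → True
      prefix("num__income"),                     # → num
      zeros_like(pd.Index(["a", "b"])).tolist()) # → [0.0, 0.0]
```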

src.summary.dataset_summary(df: DataFrame, target: str = 'Loan_Status') → str

Return a short dataset overview string.
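
What such an overview string might contain, as a guess: shape plus target class counts. The actual format of the returned string is not documented here.

```python
import pandas as pd

def dataset_summary(df: pd.DataFrame, target: str = "Loan_Status") -> str:
    """Sketch: rows/columns plus target balance, as one short string."""
    counts = df[target].value_counts().to_dict()
    return f"{len(df)} rows x {df.shape[1]} cols; target counts: {counts}"

df = pd.DataFrame({"Loan_Status": [1, 0, 1], "income": [1.0, 2.0, 3.0]})
print(dataset_summary(df))
```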

src.summary.main(args: list[str] | None = None) → None

CLI entry point printing dataset statistics.