API reference¶
The sections below list the main modules; each entry documents the module's available functions and classes.
Utilities for loading and cleaning the Kaggle loan dataset.
- src.dataprep.clean(df: DataFrame) DataFrame¶
Basic cleaning: drop duplicate/NA rows and normalise the target column.
- src.dataprep.load_raw(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) DataFrame¶
Return the raw dataset as a DataFrame.
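The cleaning step described above can be sketched with plain pandas. The column names, the whitespace-padded label values, and the 1/0 mapping below are assumptions for illustration, not the actual src.dataprep implementation:

```python
import pandas as pd

# Hypothetical illustration of clean(): drop duplicates/NA and
# normalise the target column. Column names and label values are
# guesses, not the real API.
def clean_sketch(df: pd.DataFrame, target: str = "loan_status") -> pd.DataFrame:
    out = df.drop_duplicates().dropna().copy()
    # Normalise padded string labels such as " Approved" to 1/0.
    out[target] = (
        out[target].astype(str).str.strip().str.lower().eq("approved").astype(int)
    )
    return out

raw = pd.DataFrame({
    "loan_status": [" Approved", " Rejected", " Approved"],
    "income": [50_000.0, 60_000.0, None],
}).iloc[[0, 1, 2, 0]]  # repeat the first row to create a duplicate

cleaned = clean_sketch(raw)  # drops the duplicate and the NA row
```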
Feature engineering utilities.
- class src.features.FeatureEngineer¶
Encapsulates feature engineering logic.
- ASSET_COLS = ['residential_assets_value', 'commercial_assets_value', 'luxury_assets_value', 'bank_asset_value']¶
- MARKET_APR = 0.09¶
- transform(df: DataFrame) DataFrame¶
Return engineered feature DataFrame.
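Only ASSET_COLS and MARKET_APR are documented above; the derived features in this stand-in (a total-assets sum and a loan-to-assets ratio) are plausible guesses at what transform() produces, not the actual FeatureEngineer logic:

```python
import pandas as pd

ASSET_COLS = [
    "residential_assets_value", "commercial_assets_value",
    "luxury_assets_value", "bank_asset_value",
]
MARKET_APR = 0.09  # documented class constant

class FeatureEngineerSketch:
    """Illustrative stand-in for src.features.FeatureEngineer."""

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        # Sum the four documented asset columns into one total.
        out["total_assets"] = out[ASSET_COLS].sum(axis=1)
        # Hypothetical ratio feature; "loan_amount" is an assumed column.
        out["loan_to_assets"] = out["loan_amount"] / out["total_assets"]
        return out

df = pd.DataFrame({c: [100.0] for c in ASSET_COLS} | {"loan_amount": [200.0]})
engineered = FeatureEngineerSketch().transform(df)
```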
ColumnTransformer helpers.
- src.preprocessing.build_preprocessor(num_cols: list[str], cat_cols: list[str]) ColumnTransformer¶
Return basic preprocessing ColumnTransformer.
- src.preprocessing.make_preprocessor(num_cols: list[str], cat_cols: list[str], bool_cols: list[str] | None = None, *, include_cont: bool = True) ColumnTransformer¶
Return ColumnTransformer for logistic or tree models.
- src.preprocessing.safe_transform(preprocessor: ColumnTransformer, X_new: DataFrame) ndarray¶
Transform X_new dropping unseen columns.
- src.preprocessing.validate_prep(prep: ColumnTransformer, X: DataFrame, name: str, check_scale: bool = True) None¶
Raise if prep produces NaNs or deviates from unit scale.
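A minimal sketch of what build_preprocessor likely assembles: scaled numerics plus one-hot categoricals in a ColumnTransformer. The choice of StandardScaler and handle_unknown="ignore" is an assumption; the real function may configure the transformers differently:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor_sketch(num_cols, cat_cols) -> ColumnTransformer:
    # Scale numeric columns and one-hot encode categoricals.
    return ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

X = pd.DataFrame({"income": [1.0, 2.0, 3.0], "education": ["a", "b", "a"]})
prep = build_preprocessor_sketch(["income"], ["education"])
Xt = prep.fit_transform(X)  # one scaled column + two one-hot columns
```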
- src.models.logreg.build_pipeline(cat_cols: list[str], num_cols: list[str], sampler: SamplerMixin | None = None) Pipeline¶
Create preprocessing and logistic regression pipeline.
- src.models.logreg.grid_train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) float¶
Grid-search logistic regression and return validation ROC-AUC.
- src.models.logreg.load_data(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) DataFrame¶
Return cleaned and engineered DataFrame loaded from path.
- src.models.logreg.main(data_path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv'), sampler: SamplerMixin | None = None) None¶
- src.models.logreg.train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) float¶
Train model on df and return validation ROC-AUC.
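The shape of build_pipeline can be sketched as preprocessing chained with a logistic regression. The transformer choices are assumptions, and the real pipeline may insert the optional imbalanced-learn sampler between the two steps:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline_sketch(cat_cols, num_cols) -> Pipeline:
    # Preprocessing followed by a logistic-regression classifier.
    prep = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    return Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])

X = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0], "grade": ["a", "b", "a", "b"]})
y = pd.Series([0, 0, 1, 1])
pipe = build_pipeline_sketch(["grade"], ["income"]).fit(X, y)
proba = pipe.predict_proba(X)[:, 1]
```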
- src.models.cart.build_pipeline(cat_cols: list[str], num_cols: list[str], sampler: SamplerMixin | None = None) Pipeline¶
Create preprocessing and decision-tree pipeline.
- src.models.cart.grid_train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) GridSearchCV¶
Return fitted GridSearchCV and optionally save best model.
If artefact_path is provided, the best estimator is persisted.
- src.models.cart.load_data(path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv')) DataFrame¶
Return cleaned and engineered DataFrame loaded from path.
- src.models.cart.main(data_path: str | Path = PosixPath('data/raw/loan_approval_dataset.csv'), sampler: SamplerMixin | None = None) None¶
- src.models.cart.train_from_df(df: DataFrame, target: str = 'loan_status', artefact_path: Path | None = None, sampler: SamplerMixin | None = None) float¶
Train model on df and return validation ROC-AUC.
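Unlike its logreg counterpart, grid_train_from_df here returns the fitted GridSearchCV itself. A toy version of that pattern, with the preprocessing pipeline and the real parameter grid omitted (both are assumptions):

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid search over tree depth on separable toy data.
X = pd.DataFrame({"income": [1, 2, 3, 4, 5, 6, 7, 8]})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    cv=2,
    scoring="roc_auc",
)
grid.fit(X, y)
# As documented, the fitted GridSearchCV is returned; persisting
# grid.best_estimator_ (e.g. via joblib.dump) happens only when
# artefact_path is given.
```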
- src.train.main(args: list[str] | None = None) None¶
CLI entry point training the logistic and tree models.
- src.evaluate.evaluate_models(df: DataFrame, target: str = 'Loan_Status', group_col: str | None = None, csv_path: Path = PosixPath('artefacts/summary_metrics.csv'), threshold: float | None = None, models: list[str] | None = None) DataFrame¶
Return nested-CV metrics and write csv_path. threshold sets the probability cutoff used for group metrics; when it is None, the Youden J statistic is used instead. models selects which pipelines to run.
- src.evaluate.main(args: list[str] | None = None) None¶
CLI entry point evaluating both models on the cleaned dataset.
- src.predict.main(args: list[str] | None = None) None¶
CLI entry point applying a trained model to new data.
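src.predict.main presumably loads a persisted model artefact and scores new rows; that load-and-score round trip can be mimicked with a toy model (the joblib persistence format is an assumption about how artefacts are saved):

```python
from tempfile import NamedTemporaryFile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Train a toy model, persist it, reload it, and score "new" data.
X = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0]})
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

with NamedTemporaryFile(suffix=".joblib") as fh:
    joblib.dump(model, fh.name)
    loaded = joblib.load(fh.name)

preds = loaded.predict(X)  # identical to the in-memory model's output
```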
- src.calibration.calibrate_model(estimator: ClassifierMixin, X: DataFrame, y: Series, method: str = 'sigmoid') CalibratedClassifierCV¶
Return fitted calibration wrapper for estimator.
- src.calibration.main(args: list[str] | None = None) None¶
CLI entry point calibrating saved models.
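calibrate_model's default method='sigmoid' corresponds to Platt scaling via scikit-learn's CalibratedClassifierCV. A minimal sketch of that wrapping on synthetic data (the base estimator and cv value are arbitrary choices):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeClassifier

# Wrap a classifier in a sigmoid (Platt) calibrator, mirroring the
# documented calibrate_model(estimator, X, y, method="sigmoid") shape.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

calibrated = CalibratedClassifierCV(
    DecisionTreeClassifier(max_depth=2, random_state=0),
    method="sigmoid",
    cv=3,
).fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated probabilities
```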
- src.utils.dedup_pairs(old: Sequence[tuple], new: Sequence[tuple]) list[tuple]¶
Merge old and new lists of 2-tuples, dropping duplicates.
- src.utils.is_binary_numeric(series: Series) bool¶
Return True if numeric series contains only 0/1 values.
- src.utils.prefix(label: str) str¶
Return substring before '__' if present, else empty string.
- src.utils.set_seeds(seed: int = 42) None¶
Seed Python, NumPy and PYTHONHASHSEED.
- src.utils.zeros_like(index: Index) Series¶
Return a zero-filled Series aligned with index.
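The utils helpers are small enough that plausible one-liner implementations convey their contracts; these sketches match the docstrings above, though the real src.utils versions may handle more edge cases:

```python
import pandas as pd

def prefix(label: str) -> str:
    # Substring before '__' if present, else empty string.
    return label.split("__")[0] if "__" in label else ""

def is_binary_numeric(series: pd.Series) -> bool:
    # Numeric dtype and only 0/1 values (ignoring NaNs).
    return bool(
        pd.api.types.is_numeric_dtype(series)
        and set(series.dropna().unique()) <= {0, 1}
    )

def dedup_pairs(old, new):
    # Merge two lists of 2-tuples, keeping first occurrences in order.
    seen, out = set(), []
    for pair in list(old) + list(new):
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out

def zeros_like(index: pd.Index) -> pd.Series:
    # Zero-filled Series aligned with the given index.
    return pd.Series(0.0, index=index)
```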
- src.summary.dataset_summary(df: DataFrame, target: str = 'Loan_Status') str¶
Return a short dataset overview string.
- src.summary.main(args: list[str] | None = None) None¶
CLI entry point printing dataset statistics.
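What dataset_summary's "short overview string" contains is not documented; a minimal guess would be row/column counts plus the target's class balance, as in this hypothetical stand-in:

```python
import pandas as pd

# Hypothetical stand-in for src.summary.dataset_summary; the real
# overview string may include different statistics.
def dataset_summary_sketch(df: pd.DataFrame, target: str = "Loan_Status") -> str:
    counts = df[target].value_counts().to_dict()
    return f"{len(df)} rows x {df.shape[1]} cols; {target} counts: {counts}"

df = pd.DataFrame({"Loan_Status": [1, 0, 1], "income": [1, 2, 3]})
summary = dataset_summary_sketch(df)
```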