API

StepMix

class stepmix.stepmix.StepMix(n_components=2, *, n_steps=1, measurement='bernoulli', structural='gaussian_unit', assignment='modal', correction=None, abs_tol=1e-10, rel_tol=0.0, max_iter=1000, n_init=1, save_param_init=False, init_params='random', random_state=None, verbose=0, progress_bar=1, measurement_params=None, structural_params=None)[source]

Bases: BaseEstimator

StepMix estimator for Latent Class Analysis.

Multi-step EM estimation of latent class models with measurement and structural models. The measurement and structural models can be fit together (1-step) or sequentially (2-step and 3-step). This estimator implements the BCH and ML bias correction methods for 3-step estimation.

The measurement and structural models can be any of those defined in stepmix.emission. The measurement model can be used alone to effectively fit a latent mixture model.

This class was adapted from the scikit-learn BaseMixture and GaussianMixture classes.

New in version 0.00.

Parameters
  • n_components (int, default=2) – The number of latent classes.

  • n_steps ({1, 2, 3}, default=1) –

    Number of steps in the estimation. Must be one of:

    • 1: run EM on both the measurement and structural models.

    • 2: first run EM on the measurement model, then on the complete model, but keep the measurement parameters fixed for the second step. See Bakk, 2018.

    • 3: first run EM on the measurement model, assign class probabilities, then fit the structural model via maximum likelihood. See the correction parameter for bias correction.

  • measurement ({'bernoulli', 'bernoulli_nan', 'binary', 'binary_nan', 'categorical', 'categorical_nan', 'continuous', 'continuous_nan', 'covariate', 'gaussian', 'gaussian_nan', 'gaussian_unit', 'gaussian_unit_nan', 'gaussian_spherical', 'gaussian_spherical_nan', 'gaussian_tied', 'gaussian_diag', 'gaussian_diag_nan', 'gaussian_full', 'multinoulli', 'multinoulli_nan', dict}, default='bernoulli') –

    String describing the measurement model. Must be one of:

    • ’bernoulli’: the observed data consists of n_features bernoulli (binary) random variables.

    • ’bernoulli_nan’: the observed data consists of n_features bernoulli (binary) random variables. Supports missing values.

    • ’binary’: alias for bernoulli.

    • ’binary_nan’: alias for bernoulli_nan.

    • ’categorical’: alias for multinoulli.

    • ’categorical_nan’: alias for multinoulli_nan.

    • ’continuous’: alias for gaussian_diag.

    • ’continuous_nan’: alias for gaussian_diag_nan. Supports missing values.

    • ’covariate’: covariate model where class probabilities are a multinomial logistic model of the features.

    • ’gaussian’: alias for gaussian_unit.

    • ’gaussian_nan’: alias for gaussian_unit_nan. Supports missing values.

    • ’gaussian_unit’: each gaussian component has unit variance. Only fit the mean.

    • ’gaussian_unit_nan’: each gaussian component has unit variance. Only fit the mean. Supports missing values.

    • ’gaussian_spherical’: each gaussian component has its own single variance.

    • ’gaussian_spherical_nan’: each gaussian component has its own single variance. Supports missing values.

    • ’gaussian_tied’: all gaussian components share the same general covariance matrix.

    • ’gaussian_diag’: each gaussian component has its own diagonal covariance matrix.

    • ’gaussian_diag_nan’: each gaussian component has its own diagonal covariance matrix. Supports missing values.

    • ’gaussian_full’: each gaussian component has its own general covariance matrix.

    • ’multinoulli’: the observed data consists of n_features multinoulli (categorical) random variables.

    • ’multinoulli_nan’: the observed data consists of n_features multinoulli (categorical) random variables. Supports missing values.

    Models suffixed with _nan support missing values, but may be slower than their fully observed counterpart.

    Alternatively accepts a dict to define a nested model, e.g., 3 gaussian features and 2 binary features. Please refer to stepmix.emission.nested.Nested for details.

  • structural ({'bernoulli', 'bernoulli_nan', 'binary', 'binary_nan', 'categorical', 'categorical_nan', 'continuous', 'continuous_nan', 'covariate', 'gaussian', 'gaussian_nan', 'gaussian_unit', 'gaussian_unit_nan', 'gaussian_spherical', 'gaussian_spherical_nan', 'gaussian_tied', 'gaussian_diag', 'gaussian_diag_nan', 'gaussian_full', 'multinoulli', 'multinoulli_nan', dict}, default='gaussian_unit') – String describing the structural model. Same options as those for the measurement model.

  • assignment ({'soft', 'modal'}, default='modal') –

    Class assignments for 3-step estimation. Must be one of:

    • ’soft’: keep class responsibilities (posterior probabilities) as is.

    • ’modal’: assign 1 to the class with max probability, 0 otherwise (one-hot encoding).

  • correction ({None, 'BCH', 'ML'}, default=None) –

    Bias correction for 3-step estimation. Must be one of:

    • None : No correction. Run Naive 3-step.

    • ’BCH’ : Apply the empirical BCH correction from Vermunt, 2004.

    • ’ML’ : Apply the ML correction from Vermunt, 2010; Bakk et al., 2013.

  • abs_tol (float, default=1e-10) – The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.

  • rel_tol (float, default=0.00) – The convergence threshold. EM iterations will stop when the relative lower bound average gain is below this threshold.

  • max_iter (int, default=1000) – The maximum number of EM iterations to perform.

  • n_init (int, default=1) – The number of initializations to perform. The best results are kept.

  • save_param_init (bool, default=False) – Save the estimated parameters of all initializations to self.param_buffer_.

  • init_params ({'kmeans', 'random'}, default='random') –

    The method used to initialize the weights, the means and the precisions. Must be one of:

    • ’kmeans’ : responsibilities are initialized using kmeans.

    • ’random’ : responsibilities are initialized randomly.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

  • verbose (int, default=0) – Enable verbose output. If 1, will print detailed report of the model and the performance metrics after fitting.

  • progress_bar (int, default=1) –

    Display a tqdm progress bar during fitting.

    • 0 : No progress bar.

    • 1 : Progress bar for initializations.

    • 2 : Progress bars for initializations and iterations. This requires a nested tqdm bar and may not work properly in some terminals.

  • measurement_params ({dict, None}, default=None) – Additional params passed to the measurement model class. Particularly useful to specify optimization parameters for stepmix.emission.covariate.Covariate. Ignored if the measurement descriptor is a nested object (see stepmix.emission.nested.Nested).

  • structural_params ({dict, None}, default=None) – Additional params passed to the structural model class. Particularly useful to specify optimization parameters for stepmix.emission.covariate.Covariate. Ignored if the structural descriptor is a nested object (see stepmix.emission.nested.Nested).
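The difference between the two assignment options can be illustrated on a small responsibility matrix. This is a minimal sketch in plain Python, not StepMix's internal code; the helper name modal_assignment and the numbers are hypothetical.

```python
def modal_assignment(resp):
    """One-hot encode each row: 1 for the most probable class, 0 elsewhere."""
    one_hot = []
    for row in resp:
        best = max(range(len(row)), key=row.__getitem__)  # index of the max probability
        one_hot.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return one_hot


# Posterior class probabilities for three samples and two classes (made-up values).
resp = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]]

soft = resp                     # 'soft': keep the responsibilities as is
modal = modal_assignment(resp)  # 'modal': hard one-hot assignment
print(modal)
```

With soft assignment the third sample still contributes 45% of its weight to the second class; with modal assignment that uncertainty is discarded.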

weights_

The weight of each mixture component.

Type

ndarray of shape (n_components,)

_mm

Measurement model, including parameters and estimation methods.

Type

stepmix.emission.Emission

_sm

Structural model, including parameters and estimation methods.

Type

stepmix.emission.Emission

log_resp_

Initial log responsibilities.

Type

ndarray of shape (n_samples, n_components)

measurement_in_

Number of features in the measurement model.

Type

int

structural_in_

Number of features in the structural model.

Type

int

converged_

True when convergence was reached in fit(), False otherwise.

Type

bool

n_iter_

Number of steps used by the best fit of EM to reach convergence.

Type

int

lower_bound_

Lower bound value on the log-likelihood (of the training data with respect to the model) of the best fit of EM.

Type

float

lower_bound_buffer_

Lower bound values on the log-likelihood (of the training data with respect to the model) of all EM initializations.

Type

float

param_buffer_

Final parameters of all initializations. Only updated if save_param_init=True.

Type

list

x_names_

If input is a DataFrame, column names of X.

Type

list

y_names_

If input is a DataFrame, column names of Y.

Type

list

Notes

References

Bolck, A., Croon, M., and Hagenaars, J. Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political analysis, 12(1): 3–27, 2004.

Vermunt, J. K. Latent class modeling with covariates: Two improved three-step approaches. Political analysis, 18 (4):450–469, 2010.

Bakk, Z., Tekle, F. B., and Vermunt, J. K. Estimating the association between latent class membership and external variables using bias-adjusted three-step approaches. Sociological Methodology, 43(1):272–311, 2013.

Bakk, Z. and Kuha, J. Two-step estimation of models between latent classes and external variables. Psychometrika, 83(4):871–892, 2018.

Examples

from stepmix.datasets import data_bakk_response
from stepmix.stepmix import StepMix
# Soft 3-step
X, Y, _ = data_bakk_response(n_samples=1000, sep_level=.7, random_state=42)
model = StepMix(n_components=3, n_steps=3, measurement='bernoulli', structural='gaussian_unit', random_state=42, assignment='soft')
model.fit(X, Y)
model.score(X, Y)  # Average log-likelihood

# Equivalently, each step can be performed individually. See the code of the fit method for details.
model = StepMix(n_components=3, measurement='bernoulli', structural='gaussian_unit', random_state=42)
model.em(X) # Step 1
probs = model.predict_proba(X) # Step 2
model.m_step_structural(probs, Y) # Step 3
model.score(X, Y)  # Average log-likelihood

aic(X, Y=None)[source]

Akaike information criterion for the current model on the measurement data X and optionally the structural data Y.

Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

aic – The lower the better.

Return type

float

bic(X, Y=None)[source]

Bayesian information criterion for the current model on the measurement data X and optionally the structural data Y.

Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

bic – The lower the better.

Return type

float
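Both AIC and BIC penalize the log-likelihood by the number of free parameters. The sketch below uses the textbook definitions and assumes the total log-likelihood is the average log-likelihood (as returned by score) times n_samples, as in the scikit-learn code these methods are adapted from. The function names and numbers are illustrative, and StepMix's exact implementation may differ.

```python
import math


def aic(avg_ll, n_samples, n_parameters):
    """Akaike information criterion: -2 * log-likelihood + 2 * k."""
    return -2.0 * avg_ll * n_samples + 2.0 * n_parameters


def bic(avg_ll, n_samples, n_parameters):
    """Bayesian information criterion: -2 * log-likelihood + k * ln(n)."""
    return -2.0 * avg_ll * n_samples + n_parameters * math.log(n_samples)


# Made-up values: average log-likelihood of -2.5 over 1000 samples, 11 free parameters.
print(aic(-2.5, 1000, 11))
print(bic(-2.5, 1000, 11))
```

For any n_samples above 7, ln(n) exceeds 2, so BIC penalizes each extra parameter more heavily than AIC.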

bootstrap(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, sampler=None, identify_classes=True, progress_bar=True, random_state=None)[source]

Parametric or non-parametric bootstrap of this estimator.

Fit n_repetitions clones of the estimator on resampled datasets.

If identify_classes=True, repeated parameter estimates are aligned with the class order of the main estimator using a permutation search.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored if parametric=True.

  • n_repetitions (int) – Number of repetitions to fit.

  • parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.

  • sampler (estimator, default=None) – Another fitted estimator to use for sampling instead of the main estimator. Only used for parametric bootstrapping.

  • identify_classes (bool, default=True) – Run a permutation search to align the classes of the repetitions to the classes of the main estimator. This is required if inference on the model parameters is needed, but can be turned off to save computations if only the likelihood needs to be bootstrapped.

  • progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.

  • random_state (int, default=None) – If None, use self.random_state.

Returns

  • samples (DataFrame) – Parameter DataFrame for all repetitions. Follows the convention of StepMix.get_parameters_df() with an additional ‘rep’ column.

  • rep_stats (DataFrame) – Likelihood statistics of each repetition.
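Class labels are arbitrary in a mixture model, so each repetition may return the same classes in a different order. The following sketch shows one way a permutation search could align a repetition with the main estimator. The squared-distance cost and the helper name align_classes are assumptions for illustration; the source does not state which matching criterion StepMix uses.

```python
from itertools import permutations


def align_classes(reference, estimate):
    """Return the ordering of `estimate` rows that best matches `reference`.

    perm[i] is the index of the row in `estimate` matched to reference class i.
    """
    n_classes = len(reference)

    def cost(perm):
        # Sum of squared differences between matched parameter vectors.
        return sum(
            (r - e) ** 2
            for i, j in enumerate(perm)
            for r, e in zip(reference[i], estimate[j])
        )

    # Exhaustive search over all n_classes! orderings.
    return min(permutations(range(n_classes)), key=cost)


# Per-class parameters of the main estimator and of one repetition (made-up values).
reference = [[0.9, 0.8], [0.1, 0.2], [0.5, 0.5]]
estimate = [[0.12, 0.18], [0.52, 0.47], [0.88, 0.83]]

perm = align_classes(reference, estimate)
aligned = [estimate[j] for j in perm]
print(perm, aligned)
```

The exhaustive search grows factorially with the number of classes, which is why skipping the alignment saves computation when only likelihood statistics are needed.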

bootstrap_stats(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, progress_bar=True)[source]

Bootstrapping of a StepMix estimator. Obtain bootstrapped parameters and some statistics (mean and standard deviation).

If a covariate model is used in the structural model, the output keys “cw_mean” and “cw_std” are omitted.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

  • n_repetitions (int) – Number of repetitions to fit.

  • parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.

  • progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.

Returns

bootstrap_and_stats – Dictionary of DataFrames with the following keys:

  • ‘samples’: Parameters estimated by self.bootstrap in a long-form DataFrame.

  • ‘rep_stats’: Likelihood statistics of each repetition provided by self.bootstrap.

  • ‘cw_mean’: Bootstrapped means of the class weights.

  • ‘cw_std’: Bootstrapped standard deviations of the class weights.

  • ‘mm_mean’: Bootstrapped means of the measurement model parameters.

  • ‘mm_std’: Bootstrapped standard deviations of the measurement model parameters.

  • ‘sm_mean’: Bootstrapped means of the structural model parameters.

  • ‘sm_std’: Bootstrapped standard deviations of the structural model parameters.

Return type

dict
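The mean and standard deviation reported per parameter follow the usual non-parametric bootstrap recipe: resample rows with replacement, re-estimate, then summarize across repetitions. A minimal sketch in plain Python, where the mean of one binary feature stands in for a refitted model parameter; the helper name bootstrap_statistic is hypothetical.

```python
import random
import statistics


def bootstrap_statistic(data, statistic, n_repetitions, seed):
    """Resample `data` with replacement and recompute `statistic` each time."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_repetitions):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return estimates


# One binary feature observed on ten samples (made-up values).
data = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]

estimates = bootstrap_statistic(data, statistics.mean, n_repetitions=200, seed=42)
boot_mean = statistics.mean(estimates)   # analogous to the '*_mean' outputs
boot_std = statistics.stdev(estimates)   # analogous to the '*_std' outputs
print(boot_mean, boot_std)
```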

caic(X, Y=None)[source]

Consistent AIC.

References

Bozdogan, H. 1987. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 52: 345–370.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

caic – The lower the better.

Return type

float
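CAIC adds one extra unit of penalty per parameter on top of BIC. The sketch below uses the standard definition and assumes the total log-likelihood is the average log-likelihood times n_samples; the function name and numbers are illustrative and not taken from the StepMix source.

```python
import math


def caic(avg_ll, n_samples, n_parameters):
    """Consistent AIC: -2 * log-likelihood + k * (ln(n) + 1)."""
    return -2.0 * avg_ll * n_samples + n_parameters * (math.log(n_samples) + 1.0)


# Made-up values: average log-likelihood of -2.5 over 1000 samples, 11 free parameters.
print(caic(-2.5, 1000, 11))
```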

em(X, Y=None, sample_weight=None, freeze_measurement=False, log_emission_pm=None)[source]

EM algorithm to fit the weights, measurement parameters and structural parameters.

Adapted from the fit_predict method of the sklearn BaseMixture class to include (optional) structural model computations.

Setting Y=None will run EM on the measurement model only. Providing both X and Y will run EM on the complete model, unless otherwise specified by freeze_measurement.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

  • freeze_measurement (bool, default=False) – Run EM on the complete model, but do not update measurement model parameters. Useful for 2-step estimation and 3-step with ML correction.

  • log_emission_pm (ndarray of shape (n, n_components), default=None) – Log probabilities of the predicted class given the true latent class for ML correction.
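EM stops once the gain in the lower bound falls below the abs_tol or rel_tol thresholds described for the constructor. The sketch below shows one plausible form of that check. Combining the two tests with "or" and dividing by the absolute previous bound are assumptions; the helper name has_converged is hypothetical.

```python
def has_converged(prev_lower_bound, lower_bound, abs_tol=1e-10, rel_tol=0.0):
    """Return True when the absolute or the relative gain is below its tolerance."""
    change = lower_bound - prev_lower_bound
    if prev_lower_bound != 0:
        rel_change = change / abs(prev_lower_bound)
    else:
        rel_change = float("inf")
    return abs(change) < abs_tol or abs(rel_change) < rel_tol


# Lower-bound trace of a hypothetical EM run (made-up numbers).
trace = [-3.2, -2.9, -2.85, -2.85]

# First iteration at which the stopping rule fires.
n_iter = next(i for i in range(1, len(trace)) if has_converged(trace[i - 1], trace[i]))
print(n_iter)
```

With the default rel_tol=0.0 the relative test can never fire, so only abs_tol is active unless rel_tol is raised.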

entropy(X, Y=None)[source]

Entropy of the posterior over latent classes.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

entropy

Return type

float

fit(X, Y=None, sample_weight=None, y=None)[source]

Fit StepMix measurement model and optionally the structural model.

Setting Y=None will fit the measurement model only. Providing both X and Y will fit the full model following the self.n_steps argument.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features_structural-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • y (array-like of shape (n_samples, n_features_structural), default=None) – Alias for Y. Ignored if Y is provided.

  • sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_cw_df(x_names=None, y_names=None)[source]

Get class weights as DataFrame with classes as columns.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

get_mm_df(x_names=None, y_names=None)[source]

Get measurement model parameters as DataFrame with classes as columns.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

get_parameters()[source]

Get model parameters as a Python dictionary.

Returns

params – Nested dict with the following keys:

  • ‘weights’: Current class weights.

  • ‘measurement’: dict of measurement params.

  • ‘structural’: dict of structural params.

  • ‘measurement_in’: number of measurements.

  • ‘structural_in’: number of structural features.

Return type

dict

get_parameters_df(x_names=None, y_names=None)[source]

Get model parameters as a long-form DataFrame.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

get_sm_df(x_names=None, y_names=None)[source]

Get structural model parameters as DataFrame with classes as columns.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

m_step_structural(resp, Y, sample_weight=None)[source]

M-step for the structural model only.

Handy for 3-step estimation.

Parameters
  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in Y.

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

property n_parameters

Get number of free parameters.

permute_classes(perm)[source]

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters

perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

predict(X, Y=None)[source]

Predict the cluster/latent class/component labels for the data samples in X.

Optionally, an array-like Y can be provided to predict the labels based on both the measurement and structural models.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

labels – Component labels.

Return type

array, shape (n_samples,)

predict_Y(X)[source]

Call the predict method of the structural model to predict argmax P(Y|X) (Supervised prediction).

Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.

Parameters

X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

predictions – Y predictions.

Return type

array, shape (n_samples, n_columns)

predict_class(X, Y=None)[source]

Predict the cluster/latent class/component labels for the data samples in X.

Optionally, an array-like Y can be provided to predict the labels based on both the measurement and structural models.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

labels – Component labels.

Return type

array, shape (n_samples,)

predict_proba(X, Y=None)[source]

Predict the latent class probabilities for the data samples in X using the measurement model.

Optionally, an array-like Y can be provided to predict the posterior based on both the measurement and structural models.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

resp – P(class|X, Y) for each sample in X.

Return type

array, shape (n_samples, n_components)

predict_proba_Y(X)[source]

Call the predict method of the structural model to predict the full conditional P(Y|X).

Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.

Parameters

X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

conditional – P(Y|X).

Return type

array, shape (n_samples, n_columns)
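Inference over Y marginalizes over the latent classes: P(Y|X) is the posterior-weighted average of the class-conditional distributions of Y. A minimal sketch for a single categorical Y, written in plain Python. It illustrates the standard mixture identity rather than StepMix's internal code, and the helper name predict_proba_y is hypothetical.

```python
def predict_proba_y(resp, y_given_class):
    """P(Y=c | X) = sum_k P(class=k | X) * P(Y=c | class=k) for each sample."""
    n_outcomes = len(y_given_class[0])
    return [
        [
            sum(p_k * y_given_class[k][c] for k, p_k in enumerate(row))
            for c in range(n_outcomes)
        ]
        for row in resp
    ]


# Made-up values: two latent classes, a categorical Y with three outcomes.
resp = [[0.8, 0.2], [0.1, 0.9]]     # P(class | X) for two samples
y_given_class = [
    [0.7, 0.2, 0.1],                # P(Y | class 0)
    [0.1, 0.3, 0.6],                # P(Y | class 1)
]

cond = predict_proba_y(resp, y_given_class)
pred = [max(range(len(p)), key=p.__getitem__) for p in cond]  # argmax of P(Y | X)
print(cond, pred)
```

Taking the argmax of each row corresponds to the supervised prediction described for predict_Y.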

predict_proba_class(X, Y=None)[source]

Predict the latent class probabilities for the data samples in X using the measurement model.

Optionally, an array-like Y can be provided to predict the posterior based on both the measurement and structural models.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

resp – P(class|X, Y) for each sample in X.

Return type

array, shape (n_samples, n_components)

relative_entropy(X, Y=None)[source]

Scaled Relative Entropy of the posterior over latent classes.

Ramaswamy et al., 1993.

1 - entropy / (n_samples * log(n_components))

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

relative_entropy

Return type

float
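The formula above can be evaluated directly from a matrix of class posteriors. A minimal sketch in plain Python, assuming the entropy is summed over all samples (as the n_samples factor in the denominator suggests); the function names mirror the methods but are standalone illustrations.

```python
import math


def entropy(resp):
    """Total entropy of the class posteriors: -sum_i sum_k p_ik * log(p_ik)."""
    return -sum(p * math.log(p) for row in resp for p in row if p > 0.0)


def relative_entropy(resp):
    """Scaled relative entropy: 1 - entropy / (n_samples * log(n_components))."""
    n_samples, n_components = len(resp), len(resp[0])
    return 1.0 - entropy(resp) / (n_samples * math.log(n_components))


certain = [[1.0, 0.0], [0.0, 1.0]]  # perfectly separated classes
uniform = [[0.5, 0.5], [0.5, 0.5]]  # posteriors carry no class information
print(relative_entropy(certain), relative_entropy(uniform))
```

Values near 1 indicate well-separated classes, and values near 0 indicate posteriors close to uniform.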

report(X, Y=None, sample_weight=None)[source]

Print detailed report of the model and performance metrics.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

sabic(X, Y=None)[source]

Sample-size adjusted BIC.

References

Sclove SL. Application of model-selection criteria to some problems in multivariate analysis. Psychometrika. 1987;52(3):333–343.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

ssa_bic

Return type

float
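The sample-size adjusted BIC replaces n in the BIC penalty with (n + 2) / 24. The sketch below uses that common definition and assumes the total log-likelihood is the average log-likelihood times n_samples; the function name and numbers are illustrative, and StepMix's exact implementation may differ.

```python
import math


def sabic(avg_ll, n_samples, n_parameters):
    """Sample-size adjusted BIC: -2 * log-likelihood + k * ln((n + 2) / 24)."""
    adjusted_n = (n_samples + 2) / 24.0
    return -2.0 * avg_ll * n_samples + n_parameters * math.log(adjusted_n)


# Made-up values: average log-likelihood of -2.5 over 1000 samples, 11 free parameters.
print(sabic(-2.5, 1000, 11))
```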

sample(n_samples, labels=None)[source]

Sample method for fitted StepMix model.

Adapted from the sklearn BaseMixture sample method.

Parameters
  • n_samples (int) – Number of samples.

  • labels (ndarray of shape (n_samples,)) – Predetermined class labels. Will ignore class weights if provided.

Returns

  • X (array-like of shape (n_samples, n_columns)) – Measurement samples.

  • Y (array-like of shape (n_samples, n_columns_structural)) – Structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

score(X, Y=None, sample_weight=None)[source]

Compute the average log-likelihood over samples.

Setting Y=None will ignore the structural likelihood.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

Returns

avg_ll – Average log likelihood over samples.

Return type

float
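The average log-likelihood of a mixture sums, for each sample, the weighted class-conditional likelihoods and then averages the logs. A minimal sketch for a Bernoulli measurement model with Y=None, written in plain Python with a log-sum-exp for numerical stability. The function names and parameter layout are assumptions for illustration, not StepMix's internals.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def average_log_likelihood(X, weights, probs):
    """Average log p(x) under a Bernoulli mixture.

    probs[k][j] is P(feature j = 1 | class k).
    """
    total = 0.0
    for x in X:
        per_class = []
        for w, p in zip(weights, probs):
            log_emission = sum(
                math.log(p_j) if x_j == 1 else math.log(1.0 - p_j)
                for x_j, p_j in zip(x, p)
            )
            per_class.append(math.log(w) + log_emission)
        total += log_sum_exp(per_class)
    return total / len(X)


# Two samples, one binary feature, two equally weighted classes (made-up values).
X = [[1], [0]]
weights = [0.5, 0.5]
probs = [[0.8], [0.2]]

avg_ll = average_log_likelihood(X, weights, probs)
print(avg_ll)
```

Here each sample has marginal probability 0.5, so the average log-likelihood equals log(0.5).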

set_fit_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMix

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns

self – The updated object.

Return type

object

set_parameters(params)[source]

Set parameters.

Parameters

params (dict) – Same format as self.get_parameters().

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMix

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns

self – The updated object.

Return type

object

class stepmix.stepmix.StepMixClassifier(n_components=2, *, n_steps=1, measurement='bernoulli', structural='gaussian_unit', assignment='modal', correction=None, abs_tol=1e-10, rel_tol=0.0, max_iter=1000, n_init=1, save_param_init=False, init_params='random', random_state=None, verbose=0, progress_bar=1, measurement_params=None, structural_params=None)[source]

Bases: StepMix

StepMix Supervised Classifier

Identical to a StepMix estimator, but we remap predict and predict_proba to perform inference over Y instead of the latent class. This follows the supervised learning convention of sklearn.

We call this a classifier since inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.

Also works with the aliases ‘bernoulli’, ‘bernoulli_nan’, ‘multinoulli’ and ‘multinoulli_nan’.

aic(X, Y=None)

Akaike information criterion for the current model on the measurement data X and optionally the structural data Y.

Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

aic – The lower the better.

Return type

float

bic(X, Y=None)

Bayesian information criterion for the current model on the measurement data X and optionally the structural data Y.

Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

bic – The lower the better.

Return type

float
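
As a point of reference, the two criteria can be sketched from a total log-likelihood as follows. This assumes the conventional textbook definitions and is not copied from the StepMix source; the function names and the numbers are hypothetical. The total log-likelihood is recovered here from an average log-likelihood (as returned by score) multiplied by the number of samples.

```python
import math


def aic_from_ll(total_log_likelihood, n_parameters):
    # Conventional AIC: -2 * log-likelihood + 2 * number of free parameters.
    return -2.0 * total_log_likelihood + 2.0 * n_parameters


def bic_from_ll(total_log_likelihood, n_parameters, n_samples):
    # Conventional BIC: the penalty grows with log(n_samples).
    return -2.0 * total_log_likelihood + n_parameters * math.log(n_samples)


# Hypothetical values: average log-likelihood, sample size, free parameters.
avg_ll, n_samples, n_parameters = -3.2, 500, 11
total_ll = avg_ll * n_samples

print(aic_from_ll(total_ll, n_parameters))
print(bic_from_ll(total_ll, n_parameters, n_samples))
```

Lower values indicate a better trade-off between fit and model size under both criteria.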

bootstrap(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, sampler=None, identify_classes=True, progress_bar=True, random_state=None)

Parametric or non-parametric bootstrap of this estimator.

Fit n_repetitions clones of the estimator on resampled datasets.

If identify_classes=True, repeated parameter estimates are aligned with the class order of the main estimator using a permutation search.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored if parametric=True.

  • n_repetitions (int) – Number of repetitions to fit.

  • parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.

  • sampler (StepMix instance, default=None) – Another fitted estimator to use for sampling instead of the main estimator. Only used for parametric bootstrapping.

  • identify_classes (bool, default=True) – Run a permutation test to align the classes of the repetitions to the classes of the main estimator. This is required if inference on the model parameters is needed, but can be turned off to save computation if only the likelihood needs to be bootstrapped.

  • progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.

  • random_state (int, default=None) – If None, use self.random_state.

Returns

  • samples (DataFrame) – Parameter DataFrame for all repetitions. Follows the convention of StepMix.get_parameters_df() with an additional ‘rep’ column.

  • rep_stats (DataFrame) – Likelihood statistics of each repetition.
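
The non-parametric case can be pictured as drawing rows with replacement and refitting on each resampled dataset. The sketch below only illustrates the resampling step in plain Python; the helper name and the toy data are hypothetical, and how StepMix draws its resamples internally is not shown here.

```python
import random


def resample_indices(n_samples, rng):
    # Draw n_samples row indices with replacement (non-parametric bootstrap).
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)
X = [[0, 1], [1, 1], [0, 0], [1, 0]]  # toy measurement data

for rep in range(3):
    idx = resample_indices(len(X), rng)
    X_rep = [X[i] for i in idx]
    # A clone of the estimator would be fit on X_rep at this point.
    print(rep, idx)
```

In the parametric case, the resampled data would instead be generated by sampling from a fitted estimator.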

bootstrap_stats(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, progress_bar=True)

Bootstrapping of a StepMix estimator. Obtain bootstrapped parameters and some statistics (mean and standard deviation).

If a covariate model is used in the structural model, the output keys “cw_mean” and “cw_std” are omitted.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

  • n_repetitions (int) – Number of repetitions to fit.

  • parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.

  • progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.

Returns

bootstrap_and_stats – Dictionary of DataFrames { ‘samples’: Parameters estimated by self.bootstrap in a long-form DataFrame, ‘rep_stats’: Likelihood statistics of each repetition provided by self.bootstrap, ‘cw_mean’: Bootstrapped means of the class weights, ‘cw_std’: Bootstrapped standard deviations of the class weights, ‘mm_mean’: Bootstrapped means of the measurement model parameters, ‘mm_std’: Bootstrapped standard deviations of the measurement model parameters, ‘sm_mean’: Bootstrapped means of the structural model parameters, ‘sm_std’: Bootstrapped standard deviations of the structural model parameters }.

Return type

dict

caic(X, Y=None)

Consistent AIC.

References

Bozdogan, H. 1987. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 52: 345–370.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

caic – The lower the better.

Return type

float
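
A minimal sketch of the consistent AIC, assuming the usual form attributed to Bozdogan (1987), in which the BIC penalty is increased by one unit per free parameter. The function name and inputs are hypothetical and the StepMix implementation may differ in detail.

```python
import math


def caic_from_ll(total_log_likelihood, n_parameters, n_samples):
    # CAIC: -2 * log-likelihood + k * (log(n) + 1).
    return -2.0 * total_log_likelihood + n_parameters * (math.log(n_samples) + 1.0)


print(caic_from_ll(-1600.0, 11, 500))
```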

em(X, Y=None, sample_weight=None, freeze_measurement=False, log_emission_pm=None)

EM algorithm to fit the weights, measurement parameters and structural parameters.

Adapted from the fit_predict method of the sklearn BaseMixture class to include (optional) structural model computations.

Setting Y=None will run EM on the measurement model only. Providing both X and Y will run EM on the complete model, unless otherwise specified by freeze_measurement.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

  • freeze_measurement (bool, default=False) – Run EM on the complete model, but do not update measurement model parameters. Useful for 2-step estimation and 3-step with ML correction.

  • log_emission_pm (ndarray of shape (n, n_components), default=None) – Log probabilities of the predicted class given the true latent class for ML correction.
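
To make the role of the responsibilities concrete, here is a sketch of a generic mixture-model E-step in plain Python: class log-weights are added to the per-class log-likelihoods and normalized with a log-sum-exp. It illustrates the general technique only; the function names are hypothetical and the structural-model and correction terms handled by em are left out.

```python
import math


def logsumexp(values):
    # Numerically stable log(sum(exp(values))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def e_step(log_weights, log_likelihoods):
    """Return (average log-likelihood, log responsibilities).

    log_likelihoods holds one row per sample and one entry per latent class.
    """
    log_resp = []
    total = 0.0
    for row in log_likelihoods:
        joint = [lw + ll for lw, ll in zip(log_weights, row)]
        norm = logsumexp(joint)  # log-likelihood of this sample
        total += norm
        log_resp.append([j - norm for j in joint])
    return total / len(log_likelihoods), log_resp


log_weights = [math.log(0.5), math.log(0.5)]
log_lik = [[math.log(0.2), math.log(0.8)]]
avg_ll, log_resp = e_step(log_weights, log_lik)
print(avg_ll, [math.exp(v) for v in log_resp[0]])
```

The M-step then re-estimates the weights and emission parameters from these responsibilities, and the two steps alternate until the log-likelihood stops improving.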

entropy(X, Y=None)

Entropy of the posterior over latent classes.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

entropy

Return type

float

fit(X, Y=None, sample_weight=None, y=None)

Fit StepMix measurement model and optionally the structural model.

Setting Y=None will fit the measurement model only. Providing both X and Y will fit the full model following the self.n_steps argument.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • y (array-like of shape (n_samples, n_features), default=None) – Alias for Y. Ignored if Y is provided.

  • sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_cw_df(x_names=None, y_names=None)

Get class weights as DataFrame with classes as columns.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

get_mm_df(x_names=None, y_names=None)

Get measurement model parameters as DataFrame with classes as columns.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

get_parameters()

Get model parameters as a Python dictionary.

Returns

params – Nested dict {‘weights’: Current class weights, ‘measurement’: dict of measurement params, ‘structural’: dict of structural params, ‘measurement_in’: number of measurements, ‘structural_in’: number of structural features }.

Return type

dict

get_parameters_df(x_names=None, y_names=None)

Get model parameters as a long-form DataFrame.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

get_sm_df(x_names=None, y_names=None)

Get structural model parameters as DataFrame with classes as columns.

Parameters
  • x_names (List of str) – Column names of X.

  • y_names (List of str) – Column names of Y.

Returns

params

Return type

pd.DataFrame

m_step_structural(resp, Y, sample_weight=None)

M-step for the structural model only.

Handy for 3-step estimation.

Parameters
  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in Y.

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

property n_parameters

Get number of free parameters.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters

perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

predict(X)[source]

Call the predict method of the structural model to predict argmax P(Y|X) (Supervised prediction).

Parameters

X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

predictions – Y predictions.

Return type

array, shape (n_samples, n_columns)

predict_Y(X)

Call the predict method of the structural model to predict argmax P(Y|X) (Supervised prediction).

Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.

Parameters

X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

predictions – Y predictions.

Return type

array, shape (n_samples, n_columns)

predict_class(X, Y=None)

Predict the cluster/latent class/component labels for the data samples in X.

Optionally, an array-like Y can be provided to predict the labels based on both the measurement and structural models.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

labels – Component labels.

Return type

array, shape (n_samples,)

predict_proba(X)[source]

Call the predict_proba method of the structural model to predict the full conditional P(Y|X).

Parameters

X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

conditional – P(Y|X).

Return type

array, shape (n_samples, n_columns)

predict_proba_Y(X)

Call the predict_proba method of the structural model to predict the full conditional P(Y|X).

Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.

Parameters

X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

conditional – P(Y|X).

Return type

array, shape (n_samples, n_columns)

predict_proba_class(X, Y=None)

Predict the latent class probabilities for the data samples in X using the measurement model.

Optionally, an array-like Y can be provided to predict the posterior based on both the measurement and structural models.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

Returns

resp – P(class|X, Y) for each sample in X.

Return type

array, shape (n_samples, n_components)

relative_entropy(X, Y=None)

Scaled Relative Entropy of the posterior over latent classes.

Ramaswamy et al., 1993.

1 - entropy / (n_samples * log(n_components))

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

relative_entropy

Return type

float
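
The formula above can be checked by hand on a small posterior matrix. The sketch below assumes that entropy is the Shannon entropy of the responsibilities summed over all samples, which is consistent with the normalization by n_samples * log(n_components); the function names are hypothetical.

```python
import math


def posterior_entropy(resp):
    # Sum of -p * log(p) over all samples and classes, skipping zero entries.
    return -sum(p * math.log(p) for row in resp for p in row if p > 0.0)


def relative_entropy(resp):
    n_samples = len(resp)
    n_components = len(resp[0])
    return 1.0 - posterior_entropy(resp) / (n_samples * math.log(n_components))


certain = [[1.0, 0.0], [0.0, 1.0]]  # every sample assigned with certainty
uniform = [[0.5, 0.5], [0.5, 0.5]]  # no information about class membership

print(relative_entropy(certain))
print(relative_entropy(uniform))
```

Values close to 1 indicate well-separated classes, while values close to 0 indicate that the posterior is nearly uniform.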

report(X, Y=None, sample_weight=None)

Print detailed report of the model and performance metrics.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • sample_weight (array-like of shape(n_samples,), default=None) –

sabic(X, Y=None)

Sample-Size Adjusted BIC.

References

Sclove SL. Application of model-selection criteria to some problems in multivariate analysis. Psychometrika. 1987;52(3):333–343.

Parameters
  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

Returns

ssa_bic

Return type

float
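
A minimal sketch of the sample-size adjusted BIC, assuming the commonly used form in which n in the BIC penalty is replaced by (n + 2) / 24. The function name is hypothetical and the exact expression used by StepMix is not confirmed by this page.

```python
import math


def sabic_from_ll(total_log_likelihood, n_parameters, n_samples):
    # SABIC: BIC with n replaced by (n + 2) / 24 in the penalty term.
    return -2.0 * total_log_likelihood + n_parameters * math.log((n_samples + 2) / 24.0)


print(sabic_from_ll(-1600.0, 11, 500))
```

As with the other criteria, lower values are better.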

sample(n_samples, labels=None)

Sample method for fitted StepMix model.

Adapted from the sklearn BaseMixture sample method.

Parameters
  • n_samples (int) – Number of samples.

  • labels (ndarray of shape (n_samples,)) – Predetermined class labels. Will ignore class weights if provided.

Returns

  • X (array-like of shape (n_samples, n_columns)) – Measurement samples.

  • Y (array-like of shape (n_samples, n_columns_structural)) – Structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

score(X, Y=None, sample_weight=None)

Compute the average log-likelihood over samples.

Setting Y=None will ignore the structural likelihood.

Parameters
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).

  • sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

Returns

avg_ll – Average log likelihood over samples.

Return type

float

set_fit_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMixClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns

self – The updated object.

Return type

object

set_parameters(params)

Set parameters.

Parameters

params (dict) – Same format as self.get_parameters().

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMixClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns

self – The updated object.

Return type

object

Bootstrap

Utility functions for model bootstrapping and confidence intervals.

stepmix.bootstrap.blrt(null_model, alternative_model, X, Y=None, n_repetitions=30, random_state=42)[source]

Bootstrap likelihood ratio test (BLRT).

References

Dziak, John J., Stephanie T. Lanza, and Xianming Tan. “Effect size, statistical power, and sample size requirements for the bootstrap likelihood ratio test in latent class analysis.” Structural Equation Modeling: A Multidisciplinary Journal 21.4 (2014): 534-552.

Nylund, Karen L., Tihomir Asparouhov, and Bengt O. Muthén. “Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study.” Structural Equation Modeling: A Multidisciplinary Journal 14.4 (2007): 535-569.

Parameters
  • null_model (StepMix instance) – A StepMix model with k classes.

  • alternative_model (StepMix instance) – A StepMix model with k + 1 classes.

  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • n_repetitions (int, default=30) – Number of repetitions to fit.

  • random_state (int, default=42) –

Returns

p-value – Bootstrap p-value of the BLRT. A significant result indicates that the alternative model with k + 1 classes fits the data significantly better than the null model with k classes.

Return type

float
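
For intuition, one common convention for a bootstrap p-value is the share of bootstrap likelihood ratio statistics that are at least as large as the observed one. The sketch below follows that convention with hypothetical names and numbers; it is not taken from the StepMix source, which may use a different estimator.

```python
def likelihood_ratio(ll_null, ll_alternative):
    # Likelihood ratio statistic from two total log-likelihoods.
    return 2.0 * (ll_alternative - ll_null)


def bootstrap_p_value(observed_stat, bootstrap_stats):
    # Fraction of bootstrap statistics at least as extreme as the observed one.
    exceed = sum(1 for s in bootstrap_stats if s >= observed_stat)
    return exceed / len(bootstrap_stats)


observed = likelihood_ratio(ll_null=-1650.0, ll_alternative=-1647.5)
# Hypothetical statistics from datasets simulated under the null model.
simulated = [1.0, 2.0, 6.0, 3.0]
print(bootstrap_p_value(observed, simulated))
```

With only a few repetitions the p-value is coarse: with n_repetitions=30 it can only move in steps of 1/30 under this convention.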

stepmix.bootstrap.blrt_sweep(model, X, Y=None, low=1, high=5, n_repetitions=30, random_state=42, verbose=True)[source]

BLRT sweep.

Run the BLRT over a range of class counts. For example, with low=1 and high=4, the function returns the results of 3 tests: 1 vs 2, 2 vs 3, and 3 vs 4 classes.

Parameters
  • model (StepMix instance) – A StepMix model.

  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • low (int, default=1) – Minimum number of classes to test.

  • high (int, default=5) – Maximum number of classes to test.

  • n_repetitions (int, default=30) – Number of repetitions to fit.

  • random_state (int, default=42) –

  • verbose (bool, default=True) –

Returns

p-values – Bootstrap p-values of the BLRT test for each comparison.

Return type

DataFrame
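
The comparisons implied by low and high can be enumerated with a short helper. This is only an illustration of the pairing described above; the function name is hypothetical.

```python
def sweep_pairs(low, high):
    # Each test compares a k-class null model to a (k + 1)-class alternative.
    return [(k, k + 1) for k in range(low, high)]


print(sweep_pairs(1, 4))  # [(1, 2), (2, 3), (3, 4)]
```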

stepmix.bootstrap.bootstrap(estimator, X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, sampler=None, identify_classes=True, progress_bar=True, random_state=None)[source]

Parametric or non-parametric bootstrap of a StepMix estimator.

Fit n_repetitions clones of the estimator on resampled datasets.

If identify_classes=True, repeated parameter estimates are aligned with the class order of the main estimator using a permutation search.

Parameters
  • estimator (StepMix instance) – A fitted StepMix estimator. Used as a template to clone the bootstrap estimators.

  • X (array-like of shape (n_samples, n_features)) –

  • Y (array-like of shape (n_samples, n_features_structural), default=None) –

  • n_repetitions (int) – Number of repetitions to fit.

  • sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored if parametric=True.

  • parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.

  • sampler (StepMix instance, default=None) – Another fitted estimator to use for sampling instead of the main estimator. Only used for parametric bootstrapping.

  • identify_classes (bool, default=True) – Run a permutation test to align the classes of the repetitions to the classes of the main estimator. This is required if inference on the model parameters is needed, but can be turned off if only the likelihood needs to be bootstrapped to save computations.

  • progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.

  • random_state (int, default=None) –

Returns

  • samples (DataFrame) – DataFrame of all repetitions. Follows the convention of StepMix.get_parameters_df() with an additional ‘rep’ column.

  • rep_stats (DataFrame) – Likelihood statistics of each repetition, indexed by a ‘rep’ column. None if identify_classes=False.

  • stats (DataFrame) – Various statistics of bootstrapped estimators.

stepmix.bootstrap.find_best_permutation(reference, target, criterion=<function mse>)[source]

Find the permutation of the columns in target that minimizes a criterion computed against reference.

Parameters
  • reference (ndarray of shape (n_samples, n_columns)) – Reference array.

  • target (ndarray of shape (n_samples, n_columns)) – Target array.

  • criterion (callable, default=mse) – Callable returning a scalar used to find the permutation.
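
A brute-force version of this search can be sketched in plain Python: try every column ordering of target and keep the one with the lowest criterion. The helper names below are hypothetical and the sketch works on nested lists rather than ndarrays; it illustrates the idea and is not the library implementation.

```python
from itertools import permutations


def mean_squared_error(a, b):
    # Mean squared difference between two equally shaped nested lists.
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def best_column_permutation(reference, target, criterion=mean_squared_error):
    n_columns = len(reference[0])
    best_perm, best_score = None, float("inf")
    for perm in permutations(range(n_columns)):
        permuted = [[row[j] for j in perm] for row in target]
        score = criterion(reference, permuted)
        if score < best_score:
            best_perm, best_score = perm, score
    return best_perm


reference = [[0.9, 0.1], [0.2, 0.8]]
target = [[0.1, 0.9], [0.8, 0.2]]  # same columns, swapped
print(best_column_permutation(reference, target))  # (1, 0)
```

Because the number of permutations grows factorially with the number of columns, an exhaustive search like this one is only practical for a small number of classes.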

Emission Models

Categorical

Categorical emission models.

class stepmix.emission.categorical.Bernoulli(**kwargs)[source]

Bases: Emission

Bernoulli (binary) emission model.

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)

get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters as a long-form DataFrame.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)
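
As an illustration of the returned shape, the sketch below computes a Bernoulli log-likelihood per sample and per class in plain Python, assuming binary features that are conditionally independent given the latent class. The function name and the pis layout (one row of success probabilities per class) are assumptions made for this example.

```python
import math


def bernoulli_log_likelihood(X, pis):
    """pis[c][k] is assumed to be P(feature k = 1 | class c)."""
    ll = []
    for x in X:
        row = []
        for p_c in pis:
            # Sum the per-feature log-probabilities for this class.
            row.append(sum(math.log(p) if v == 1 else math.log(1.0 - p)
                           for v, p in zip(x, p_c)))
        ll.append(row)
    return ll


X = [[1, 0], [1, 1]]
pis = [[0.5, 0.5], [0.9, 0.2]]
for row in bernoulli_log_likelihood(X, pis):
    print(row)
```

Each output row has one entry per latent class, matching the (n_samples, n_components) shape documented above.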

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
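
For a Bernoulli model, the maximum likelihood update has a closed form: each class probability is the responsibility-weighted mean of the corresponding feature. The plain-Python sketch below shows that update under this assumption; the function name is hypothetical, and any clipping or smoothing the library may apply is omitted.

```python
def bernoulli_m_step(X, resp):
    n_components = len(resp[0])
    n_features = len(X[0])
    pis = []
    for c in range(n_components):
        # Effective number of samples assigned to class c.
        weight = sum(r[c] for r in resp)
        pis.append([
            sum(r[c] * x[k] for x, r in zip(X, resp)) / weight
            for k in range(n_features)
        ])
    return pis


X = [[1, 0], [1, 1], [0, 0]]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hard assignments for clarity
print(bernoulli_m_step(X, resp))  # [[1.0, 0.5], [0.0, 0.0]]
```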

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)[source]

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)[source]

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)
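
The conditional P(Y|X) follows from marginalizing over the latent class: P(Y=1|X) = sum over Z of P(Z|X) P(Y=1|Z). The sketch below applies this to binary outcomes in plain Python. The function names, the pis layout (one row per class), and the 0.5 decision threshold are assumptions for illustration; the exact output layout of the library methods is not confirmed here.

```python
import math


def predict_proba_binary(log_resp, pis):
    """Return P(Y_k = 1 | X) for each sample and each binary outcome k."""
    n_outcomes = len(pis[0])
    out = []
    for row in log_resp:
        resp = [math.exp(v) for v in row]  # back to P(Z | X)
        out.append([
            sum(r * pis[c][k] for c, r in enumerate(resp))
            for k in range(n_outcomes)
        ])
    return out


def predict_binary(log_resp, pis):
    # Pick the more probable value of each binary outcome.
    return [[1 if p > 0.5 else 0 for p in row]
            for row in predict_proba_binary(log_resp, pis)]


log_resp = [[math.log(0.25), math.log(0.75)]]
pis = [[0.8], [0.2]]  # P(Y = 1 | class 0) and P(Y = 1 | class 1)
print(predict_proba_binary(log_resp, pis))
print(predict_binary(log_resp, pis))
```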

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Add indent to print.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)[source]

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Class int.

  • n_samples (int) – Number of samples.

Returns

samples – Samples

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

class stepmix.emission.categorical.BernoulliNan(**kwargs)[source]

Bases: Bernoulli

Bernoulli (binary) emission model supporting missing values (Full Information Maximum Likelihood).

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)

get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters as a long-form DataFrame.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)
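
With missing values, a full-information approach evaluates each sample only on the features that were actually observed. The sketch below assumes missing entries are encoded as NaN and simply skips them when summing the per-feature log-probabilities; the function name and the pis layout are hypothetical, and this is an illustration of the idea rather than the library code.

```python
import math


def bernoulli_nan_log_likelihood(X, pis):
    """pis[c][k] is assumed to be P(feature k = 1 | class c)."""
    ll = []
    for x in X:
        row = []
        for p_c in pis:
            total = 0.0
            for v, p in zip(x, p_c):
                if math.isnan(v):
                    continue  # a missing feature contributes nothing
                total += math.log(p) if v == 1 else math.log(1.0 - p)
            row.append(total)
        ll.append(row)
    return ll


X = [[1.0, float("nan")], [0.0, 1.0]]
pis = [[0.5, 0.5], [0.9, 0.2]]
for row in bernoulli_nan_log_likelihood(X, pis):
    print(row)
```

A sample with every feature missing gets a log-likelihood of 0 for each class under this scheme, so it does not shift the posterior away from the class weights.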

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Add indent to print.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples to draw.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be in the same format as the dict returned by self.get_parameters.

class stepmix.emission.categorical.Multinoulli(n_components=2, random_state=None, integer_codes=True, max_n_outcomes=None, total_outcomes=None)[source]

Bases: Emission

Multinoulli (categorical) emission model.

Uses one-hot encoded features. Expected data formatting: X[n, k*L+l] = 1 if l is the observed outcome of the kth feature for data point n, where n indexes the observations, k ranges over the K = n_features features, and L = max_n_outcomes is the maximum number of outcomes of any single feature.

If integer_codes is set to True, the model will expect integer-encoded categories and will one-hot encode the data itself. In this case, max_n_outcomes and total_outcomes are inferred by the model.
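
The column layout X[n, k*L+l] can be made concrete with a small sketch. The function one_hot_encode below is a hypothetical plain-Python helper, not the library's encode_features method; it only shows where each integer code lands in the one-hot matrix.

```python
def one_hot_encode(X_int, max_n_outcomes):
    """Hypothetical helper mapping integer codes to the X[n, k*L + l] layout."""
    L = max_n_outcomes
    encoded = []
    for row in X_int:
        out = [0] * (len(row) * L)   # L columns reserved per feature
        for k, l in enumerate(row):
            out[k * L + l] = 1       # feature k, observed outcome l
        encoded.append(out)
    return encoded


X_int = [[0, 2], [1, 0]]  # 2 samples, 2 features, zero-indexed codes
X_onehot = one_hot_encode(X_int, max_n_outcomes=3)
print(X_onehot)  # [[1, 0, 0, 0, 0, 1], [0, 1, 0, 1, 0, 0]]
```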

Parameters
  • n_components (int, default=2) – The number of latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

  • integer_codes (bool, default=True) – Input X should be integer-encoded zero-indexed categories.

  • max_n_outcomes (int, default=None) – Maximum number of outcomes for a single categorical feature. Each column in the input will have max_n_outcomes associated columns in the one-hot encoding. If None and integer_codes=True, will be inferred from the data.

  • total_outcomes (int, default=None) – Total outcomes over all features. E.g., if we provide a categorical variable with two outcomes and another with 4 outcomes, total_outcomes = 6. If None and integer_codes=True, will be inferred from the data.
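
The two-outcome plus four-outcome example above can be reproduced with a short sketch. It assumes that the number of outcomes of a feature is its largest observed zero-indexed code plus one; the library's actual inference rule may differ, and infer_outcomes is a hypothetical name.

```python
def infer_outcomes(X_int):
    """Hypothetical helper returning (max_n_outcomes, total_outcomes)."""
    n_features = len(X_int[0])
    # Assumption: outcomes per feature = largest zero-indexed code + 1.
    per_feature = [max(row[k] for row in X_int) + 1 for k in range(n_features)]
    return max(per_feature), sum(per_feature)


X_int = [[0, 3], [1, 0], [0, 2]]  # feature 0 has 2 outcomes, feature 1 has 4
max_n_outcomes, total_outcomes = infer_outcomes(X_int)
print(max_n_outcomes, total_outcomes)  # 4 6
```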

pis[k*L+l, c] = P[X[n, k*L+l] = 1 | n belongs to class c]

check_parameters()

Validate class attributes.

encode_features(X)[source]
get_default_feature_names(n_features)
get_n_features()[source]
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)[source]

Expand the feature names since each feature may have up to max_n_outcomes outcomes.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)
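
A minimal sketch of a class-conditional log-likelihood for one-hot data, assuming the pis[k*L+l, c] layout given in the class description: for each sample and class, sum the log-probabilities of the observed outcomes. This is an illustration in plain Python, not the library's code, and multinoulli_log_likelihood is a hypothetical name.

```python
import math


def multinoulli_log_likelihood(X_onehot, pis):
    """Hypothetical helper returning ll[n][c] = log P(X[n] | class c)."""
    n_classes = len(pis[0])
    ll = []
    for row in X_onehot:
        ll.append([
            # Only columns set to 1 (the observed outcomes) contribute.
            sum(math.log(pis[j][c]) for j, x in enumerate(row) if x == 1)
            for c in range(n_classes)
        ])
    return ll


# One feature with two outcomes, two classes: pis has shape (K*L, n_classes).
pis = [[0.9, 0.2],
       [0.1, 0.8]]
X_onehot = [[1, 0], [0, 1]]
ll = multinoulli_log_likelihood(X_onehot, pis)
```

The result has one row per sample and one column per class, matching the documented (n_samples, n_components) return shape.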

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
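
For intuition, a maximum likelihood update of categorical probabilities under soft assignments is a responsibility-weighted frequency. The sketch below shows that textbook update in plain Python; it assumes one-hot input where every feature uses all L columns, and it is not taken from the StepMix source. The name multinoulli_m_step is hypothetical.

```python
def multinoulli_m_step(X_onehot, resp):
    """Hypothetical helper returning pis[j][c] as weighted frequencies."""
    n_cols, n_classes = len(X_onehot[0]), len(resp[0])
    # Effective number of samples assigned to each class.
    class_weight = [sum(r[c] for r in resp) for c in range(n_classes)]
    return [
        [sum(r[c] * x[j] for x, r in zip(X_onehot, resp)) / class_weight[c]
         for c in range(n_classes)]
        for j in range(n_cols)
    ]


X_onehot = [[1, 0], [1, 0], [0, 1]]           # one feature, two outcomes
resp = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # soft class assignments
pis = multinoulli_m_step(X_onehot, resp)
```

With these inputs, class 0 only ever sees outcome 0, while class 1 sees outcome 0 with weight 0.5 out of 1.5, giving a probability of 1/3.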

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.
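
A sketch of what remapping classes can look like, assuming the permutation is applied NumPy-style, so that position i of the result holds the parameters of old class perm[i]. That indexing convention is an assumption, and permute_rows is a hypothetical helper operating on a list of per-class parameter rows.

```python
def permute_rows(params, perm):
    """Hypothetical helper: reorder per-class rows so new[i] = old[perm[i]]."""
    if sorted(perm) != list(range(len(params))):
        raise ValueError("perm must be a permutation of range(n_classes)")
    return [params[i] for i in perm]


means = [[0.0, 0.1], [5.0, 5.1], [9.0, 9.1]]  # one row per latent class
permuted = permute_rows(means, [2, 0, 1])
print(permuted)  # [[9.0, 9.1], [0.0, 0.1], [5.0, 5.1]]
```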

predict(log_resp)[source]

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)[source]

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)[source]

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples to draw.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)
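
Class-conditional sampling can be sketched as one independent categorical draw per feature, using the probabilities stored for the chosen class. The example assumes the pis[k*L+l, c] layout from the class description and returns integer codes; the function name, its extra arguments, and the output encoding are all assumptions, not the library's behavior.

```python
import random


def sample_class(pis, class_no, n_samples, n_features, L, seed=0):
    """Hypothetical helper drawing integer-coded samples for one class."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        row = []
        for k in range(n_features):
            # Outcome probabilities of feature k under the chosen class.
            weights = [pis[k * L + l][class_no] for l in range(L)]
            row.append(rng.choices(range(L), weights=weights)[0])
        samples.append(row)
    return samples


# Two features with two outcomes each; class 0 always emits outcome 1.
pis = [[0.0, 0.5], [1.0, 0.5],
       [0.0, 0.5], [1.0, 0.5]]
samples = sample_class(pis, class_no=0, n_samples=3, n_features=2, L=2)
```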

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be in the same format as the dict returned by self.get_parameters.

class stepmix.emission.categorical.MultinoulliNan(**kwargs)[source]

Bases: Multinoulli

Multinoulli (categorical) emission model supporting missing values (Full Information Maximum Likelihood).

check_parameters()

Validate class attributes.

encode_features(X)
get_default_feature_names(n_features)
get_n_features()
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Expand the feature names since each feature may have up to max_n_outcomes outcomes.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)
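
One way to picture full information maximum likelihood for categorical data is that a missing entry simply contributes nothing to the sum of log-probabilities. The sketch below assumes integer-coded input with NaN marking missing values and the pis[k*L+l, c] layout; it is an illustration, and log_likelihood_fiml is a hypothetical name.

```python
import math


def log_likelihood_fiml(X_int, pis, L):
    """Hypothetical helper: per-class log-likelihood, skipping NaN entries."""
    n_classes = len(pis[0])
    ll = []
    for row in X_int:
        per_class = [0.0] * n_classes
        for k, code in enumerate(row):
            if isinstance(code, float) and math.isnan(code):
                continue  # missing value: no contribution for any class
            for c in range(n_classes):
                per_class[c] += math.log(pis[k * L + int(code)][c])
        ll.append(per_class)
    return ll


pis = [[0.9, 0.2], [0.1, 0.8],   # feature 0, outcomes 0 and 1
       [0.5, 0.5], [0.5, 0.5]]   # feature 1, outcomes 0 and 1
X_int = [[0, math.nan], [1, 1]]  # first sample is missing feature 1
ll = log_likelihood_fiml(X_int, pis, L=2)
```

The first sample is scored on feature 0 alone, while the second uses both features.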

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples to draw.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be in the same format as the dict returned by self.get_parameters.

Gaussian

Gaussian emission models.

class stepmix.emission.gaussian.Gaussian(n_components=2, covariance_type='spherical', init_params='random', reg_covar=1e-06, random_state=None)[source]

Bases: Emission

Gaussian emission model with various covariance options.

This class mimics the scikit-learn GaussianMixture class by reusing its attributes and calling its methods.

check_parameters()[source]

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()[source]

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters as a long-format dataframe.

Columns should be ["model_name", "param", "class_no", "variable", "value"].

Subclasses can either call self._to_df or implement a custom method.
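
To show what the long format means, the sketch below flattens an (n_components, n_features) parameter table into one record per class and variable, using the five column names listed above. It builds plain dicts rather than a dataframe, and to_long_records is a hypothetical helper, not the library's _to_df.

```python
def to_long_records(model_name, param, values, feature_names):
    """Hypothetical helper: one record per (class, variable) pair."""
    return [
        {"model_name": model_name, "param": param, "class_no": c,
         "variable": name, "value": v}
        for c, row in enumerate(values)          # rows are latent classes
        for name, v in zip(feature_names, row)   # columns are variables
    ]


means = [[0.0, 1.0], [5.0, 6.0]]  # 2 classes, 2 features
records = to_long_records("gaussian", "means", means, ["x1", "x2"])
```

Each record could then become one row of a dataframe, which can be pivoted on any of the identifier columns for display.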

initialize(X, resp, random_state=None)[source]

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)[source]

M step.

Adapted from the scikit-learn GaussianMixture class to accept responsibilities instead of log responsibilities.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.
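
As a rough picture of an M step driven by responsibilities, the sketch below computes responsibility-weighted means and per-feature variances for each class, adding reg_covar to keep variances positive. It covers only a diagonal covariance, is written in plain Python, and is not the scikit-learn or StepMix code; gaussian_m_step_diag is a hypothetical name.

```python
def gaussian_m_step_diag(X, resp, reg_covar=1e-6):
    """Hypothetical helper: weighted means and diagonal variances per class."""
    n_features, n_classes = len(X[0]), len(resp[0])
    means, variances = [], []
    for c in range(n_classes):
        w = [r[c] for r in resp]  # responsibility of class c for each sample
        total = sum(w)
        mu = [sum(wi * x[d] for wi, x in zip(w, X)) / total
              for d in range(n_features)]
        var = [sum(wi * (x[d] - mu[d]) ** 2 for wi, x in zip(w, X)) / total
               + reg_covar
               for d in range(n_features)]
        means.append(mu)
        variances.append(var)
    return means, variances


X = [[0.0], [2.0], [10.0]]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hard assignments for clarity
means, variances = gaussian_m_step_diag(X, resp)
```

With hard assignments the update reduces to per-class sample means and variances; the second class holds a single point, so its variance is just reg_covar.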

property n_parameters

Number of free parameters in the model.
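
For orientation, the usual textbook counts of free mean and covariance parameters for a K-component, D-feature Gaussian model are sketched below for the four scikit-learn covariance types. Whether this property returns exactly these numbers is an assumption; gaussian_free_parameters is a hypothetical function and mixture weights are not counted.

```python
def gaussian_free_parameters(n_components, n_features, covariance_type):
    """Hypothetical helper: textbook count of mean + covariance parameters."""
    K, D = n_components, n_features
    cov = {
        "spherical": K,                # one variance per class
        "diag": K * D,                 # one variance per class and feature
        "tied": D * (D + 1) // 2,      # one symmetric matrix shared by all
        "full": K * D * (D + 1) // 2,  # one symmetric matrix per class
    }[covariance_type]
    return K * D + cov                 # K * D mean parameters


counts = {t: gaussian_free_parameters(3, 2, t)
          for t in ("spherical", "diag", "tied", "full")}
print(counts)  # {'spherical': 9, 'diag': 12, 'tied': 9, 'full': 15}
```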

permute_classes(perm, axis=0)[source]

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)[source]

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples to draw.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(params)[source]

Set current parameters.

Parameters

params (dict) – Model parameters. Should be in the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianDiag(**kwargs)[source]

Bases: Gaussian

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)[source]

Return self.parameters as a long-format dataframe.

Columns should be ["model_name", "param", "class_no", "variable", "value"].

Subclasses can either call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)

M step.

Adapted from the scikit-learn GaussianMixture class to accept responsibilities instead of log responsibilities.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm, axis=0)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples to draw.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(params)

Set current parameters.

Parameters

params (dict) – Model parameters. Should be in the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianDiagNan(**kwargs)[source]

Bases: GaussianNan

Gaussian emission model with diagonal covariance supporting missing values (Full Information Maximum Likelihood).

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters as a long-format dataframe.

Columns should be ["model_name", "param", "class_no", "variable", "value"].

Subclasses can either call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples to draw.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be in the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianFull(**kwargs)[source]

Bases: Gaussian

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)[source]

Return self.parameters as a long-format dataframe.

Subclasses can either call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)

M step.

Adapted from the scikit-learn GaussianMixture class to accept responsibilities instead of log responsibilities.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm, axis=0)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['class_no', 'param'], columns=['model_name', 'variable'])[source]

Print parameters with class_no and variable swapped between index and columns, which reads better for full covariance matrices.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples to draw.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(params)

Set current parameters.

Parameters

params (dict) – Model parameters. Should be in the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianNan(debug_likelihood=False, **kwargs)[source]

Bases: Emission

Gaussian emission model supporting missing values (Full Information Maximum Likelihood).

This class assumes a diagonal covariance structure. The covariances are therefore represented as a (n_components, n_features) array.
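
With a diagonal covariance, the log-density factorizes over features, so missing entries can be handled by summing only over the observed ones. The sketch below illustrates that idea with (n_components, n_features) means and variances, as described above; it assumes NaN marks missing values and is not the library's implementation. The name gaussian_nan_log_likelihood is hypothetical.

```python
import math


def gaussian_nan_log_likelihood(X, means, variances):
    """Hypothetical helper: diagonal Gaussian log-density over observed features."""
    ll = []
    for x in X:
        row = []
        for mu, var in zip(means, variances):  # one (mu, var) pair per class
            total = 0.0
            for xd, m, v in zip(x, mu, var):
                if math.isnan(xd):
                    continue  # missing feature: skip its univariate term
                total += -0.5 * (math.log(2 * math.pi * v) + (xd - m) ** 2 / v)
            row.append(total)
        ll.append(row)
    return ll


X = [[0.0, math.nan]]      # second feature is missing
means = [[0.0, 5.0]]       # one class, two features
variances = [[1.0, 2.0]]
ll = gaussian_nan_log_likelihood(X, means, variances)
```

Only the first feature is scored here, so the result equals the standard normal log-density at zero.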

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters into a long dataframe.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

abstract property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)[source]

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianSpherical(**kwargs)[source]

Bases: Gaussian

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)[source]

Return self.parameters as a long dataframe.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement a custom method.
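The long format described above can be sketched with pandas. The helper to_long_df and the example parameters are hypothetical and only illustrate the expected column layout; they are not the library's _to_df.

```python
import numpy as np
import pandas as pd

COLUMNS = ["model_name", "param", "class_no", "variable", "value"]


def to_long_df(parameters, model_name, feature_names):
    """Flatten {param: array of shape (n_components, n_features)} into long rows."""
    rows = []
    for param, values in parameters.items():
        for class_no, class_values in enumerate(values):
            for variable, value in zip(feature_names, class_values):
                rows.append({
                    "model_name": model_name,
                    "param": param,
                    "class_no": class_no,
                    "variable": variable,
                    "value": float(value),
                })
    return pd.DataFrame(rows, columns=COLUMNS)


# Hypothetical parameters: 2 classes, 2 features.
params = {"means": np.array([[0.0, 1.0], [2.0, 3.0]])}
df = to_long_df(params, "gaussian_spherical", ["x0", "x1"])

# Pivot with the default index/columns documented for print_parameters.
wide = df.pivot_table(index=["param", "variable"],
                      columns=["model_name", "class_no"],
                      values="value")
```

One row per (parameter, class, variable) triple keeps the frame easy to pivot into whichever display layout is wanted.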

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)

M-step.

Adapted from the Gaussian mixture class to accept responsibilities instead of log responsibilities.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.
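As an illustration of what an M-step with spherical covariance computes, here is a minimal NumPy sketch. It uses the textbook weighted maximum-likelihood estimates and omits any covariance regularization; the function name spherical_m_step is hypothetical and this is not the library's code.

```python
import numpy as np


def spherical_m_step(X, resp):
    """Weighted ML estimates: one mean vector and one variance per class."""
    n_features = X.shape[1]
    class_weight = resp.sum(axis=0)                      # (n_components,)
    means = (resp.T @ X) / class_weight[:, None]         # (n_components, n_features)
    # Squared distance of every sample to every class mean: (n_samples, n_components).
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # A single variance per class, averaged over samples and features.
    variances = (resp * sq_dist).sum(axis=0) / (n_features * class_weight)
    return means, variances


X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
means, variances = spherical_m_step(X, resp)
```

Note that resp holds plain probabilities here, matching the documented signature, rather than log responsibilities.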

property n_parameters

Number of free parameters in the model.

permute_classes(perm, axis=0)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.
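A small NumPy sketch of permuting a class-indexed parameter array along a given axis follows. The array means is hypothetical, and the convention that new class i receives old class perm[i] is an assumption of this example.

```python
import numpy as np

# Hypothetical parameters: one row of means per latent class (3 classes, 2 features).
means = np.array([[0.0, 0.1],
                  [1.0, 1.1],
                  [2.0, 2.1]])

# Target permutation: must be a permutation of np.arange(n_classes).
perm = np.array([2, 0, 1])

# Reorder the class axis (axis=0 here): new class i holds old class perm[i].
permuted_means = np.take(means, perm, axis=0)
```

The axis argument matters when the class dimension is not the first axis of a parameter array.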

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(params)

Set current parameters.

Parameters

params (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianSphericalNan(**kwargs)[source]

Bases: GaussianNan

Gaussian emission model with spherical covariance supporting missing values (Full Information Maximum Likelihood).

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters as a long dataframe.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianTied(**kwargs)[source]

Bases: Gaussian

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)[source]

Return self.parameters as a long dataframe.

Call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)

M-step.

Adapted from the Gaussian mixture class to accept responsibilities instead of log responsibilities.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.

property n_parameters

Number of free parameters in the model.
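For orientation, the standard free-parameter counts for Gaussian emission parameters can be written out as below. This is a sketch under the usual textbook conventions, with class weights excluded; the function and the covariance labels are hypothetical, and the value reported by the property may be computed differently.

```python
def gaussian_n_parameters(n_components, n_features, covariance):
    """Count free mean and covariance parameters for common covariance structures."""
    mean_params = n_components * n_features
    cov_params = {
        "unit": 0,                                          # variance fixed to 1
        "spherical": n_components,                          # one variance per class
        "diag": n_components * n_features,                  # one variance per class and feature
        "tied": n_features * (n_features + 1) // 2,         # one shared symmetric matrix
        "full": n_components * n_features * (n_features + 1) // 2,
    }[covariance]
    return mean_params + cov_params


n_tied = gaussian_n_parameters(n_components=3, n_features=2, covariance="tied")
```

A tied covariance is shared by all classes, so its contribution does not grow with the number of components.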

permute_classes(perm, axis=0)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['class_no', 'param'], columns=['model_name', 'variable'])[source]

Flipping class_no and variable is nicer for full covariances.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(params)

Set current parameters.

Parameters

params (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianUnit(**kwargs)[source]

Bases: Emission

Gaussian emission model with fixed unit variance.

sklearn.mixture.GaussianMixture does not have an implementation for fixed unit variance, so we provide one.

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters as a long dataframe.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)
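With unit variance, the per-class log-likelihood has a simple closed form. The following NumPy sketch shows it under the assumption that parameters are stored as a means array of shape (n_components, n_features); the function name is hypothetical.

```python
import numpy as np


def unit_gaussian_log_likelihood(X, means):
    """log N(x_i | mu_k, I) for every sample i and class k."""
    n_features = X.shape[1]
    # Squared distance between every sample and every class mean.
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * sq_dist - 0.5 * n_features * np.log(2 * np.pi)


X = np.array([[0.0, 0.0], [1.0, 1.0]])
means = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
ll = unit_gaussian_log_likelihood(X, means)  # shape (n_samples, n_components)
```

Because the variance is fixed, classes differ only through the squared distance term.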

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
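For a unit-variance Gaussian, the maximum-likelihood update reduces to responsibility-weighted means. A minimal sketch, assuming the only free parameters are the class means (the function name is hypothetical):

```python
import numpy as np


def unit_gaussian_m_step(X, resp):
    """Responsibility-weighted mean of X for each latent class."""
    class_weight = resp.sum(axis=0)               # (n_components,)
    return (resp.T @ X) / class_weight[:, None]   # (n_components, n_features)


X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
means = unit_gaussian_m_step(X, resp)
```

With hard (0/1) responsibilities, as in this toy input, the update is just the per-class average.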

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)[source]

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)
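Class-conditional sampling from a unit-variance Gaussian can be sketched as follows. The means array, the helper name and the random_state argument are assumptions of this example, not the library's interface.

```python
import numpy as np


def sample_unit_gaussian(means, class_no, n_samples, random_state=None):
    """Draw n_samples from N(means[class_no], I)."""
    rng = np.random.default_rng(random_state)
    n_features = means.shape[1]
    return rng.normal(loc=means[class_no], scale=1.0, size=(n_samples, n_features))


means = np.array([[0.0, 0.0], [5.0, -5.0]])
samples = sample_unit_gaussian(means, class_no=1, n_samples=1000, random_state=0)
```

Passing an integer seed makes the draw reproducible across calls.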

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

class stepmix.emission.gaussian.GaussianUnitNan(**kwargs)[source]

Bases: GaussianNan

Gaussian emission model with unit covariance supporting missing values (Full Information Maximum Likelihood).
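When features are independent given the class, as with unit covariance, marginalizing over missing entries amounts to dropping their terms from the log-likelihood. The sketch below illustrates that idea with NaN-coded missing values; it is an assumption-laden example with a hypothetical function name, not the library's code.

```python
import numpy as np


def unit_gaussian_log_likelihood_nan(X, means):
    """Per-class log-likelihood using only the observed (non-NaN) features."""
    observed = ~np.isnan(X)                        # (n_samples, n_features)
    X_filled = np.where(observed, X, 0.0)          # placeholder for missing entries
    sq_diff = (X_filled[:, None, :] - means[None, :, :]) ** 2
    sq_diff = sq_diff * observed[:, None, :]       # zero out missing features
    n_observed = observed.sum(axis=1)              # observed features per sample
    return -0.5 * sq_diff.sum(axis=2) - 0.5 * n_observed[:, None] * np.log(2 * np.pi)


X = np.array([[0.0, np.nan], [1.0, 1.0]])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
ll = unit_gaussian_log_likelihood_nan(X, means)
```

Each sample contributes a normalization constant proportional to its own number of observed features, so rows with missing values remain comparable across classes.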

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)

Return self.parameters as a long dataframe.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

Covariate

Covariate emission model.

class stepmix.emission.covariate.Covariate(tol=0.0001, max_iter=1, lr=0.001, intercept=True, method='newton-raphson', **kwargs)[source]

Bases: Emission

Covariate model with descent update.

Parameters
  • tol (float, default=1e-4) – Absolute tolerance applied to each component of the gradient.

  • max_iter (int, default=1) – The maximum number of steps to take per M-step.

  • lr (float, default=1e-3) – Learning rate.

  • intercept (bool, default=True) – If an intercept parameter should be fitted.

  • method ({"gradient", "newton-raphson"}, default="newton-raphson") – Optimization method.
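To make the roles of tol, max_iter, lr and intercept concrete, here is a sketch of a gradient-based M-step for a softmax regression of class membership on covariates. The coefficient layout beta of shape (n_features + 1, n_components), the helper names and the update rule are assumptions of this example; only the "gradient" method is illustrated, and the library's actual update may differ.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def covariate_m_step(X, resp, beta, tol=1e-4, max_iter=1, lr=1e-3, intercept=True):
    """Take up to max_iter gradient-ascent steps on sum_i sum_k resp_ik * log p_ik."""
    X_full = np.hstack([np.ones((X.shape[0], 1)), X]) if intercept else X
    for _ in range(max_iter):
        probs = softmax(X_full @ beta)            # (n_samples, n_components)
        gradient = X_full.T @ (resp - probs)      # same shape as beta
        if np.all(np.abs(gradient) < tol):        # tol applies to each component
            break
        beta = beta + lr * gradient
    return beta


X = np.array([[-1.0], [1.0]])
resp = np.array([[1.0, 0.0], [0.0, 1.0]])
beta = covariate_m_step(X, resp, beta=np.zeros((2, 2)))
```

A small max_iter keeps each M-step cheap, since the surrounding EM loop revisits the update at every iteration.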

check_parameters()[source]

Validate class attributes.

get_default_feature_names(n_features)
get_full_matrix(X)[source]
get_parameters()

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)[source]

Return self.parameters as a long dataframe.

Call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)[source]

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm)

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(X)[source]

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)[source]

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)

set_parameters(parameters)

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

Nested

class stepmix.emission.nested.Nested(descriptor, emission_dict, n_components, random_state, **kwargs)[source]

Bases: Emission

Nested emission model.

The descriptor must be a dict of dicts, where each nested dict holds the arguments for one nested model. Each nested dict is expected to have a model key referring to a valid emission model, as well as an n_columns key giving the number of columns associated with that model (i.e., features for univariate variables, or features*n_outcomes for one-hot encoded variables). For example, a model where the first 3 features are Gaussian with unit variance, the next 3 are multinoulli with 5 possible outcomes (for a total of 3*5=15 columns), and the last 4 are covariates would be described like so:

descriptor = {
    'model_1': {
        'model': 'gaussian_unit',
        'n_columns': 3
    },
    'model_2': {
        'model': 'multinoulli',
        'n_columns': 15,
        'n_outcomes': 5
    },
    'model_3': {
        'model': 'covariate',
        'n_columns': 4,
        'method': 'newton-raphson',
        'lr': 1e-3
    }
}

The above model would then expect an n_samples x 22 matrix as input (3 + 15 + 4 = 22) where columns follow the same order of declaration (i.e., the columns of model_1 are first, columns of model_2 come after etc.).

As the covariate entry shows, additional arguments can be specified and are passed to the associated Emission class. This is particularly useful for specifying optimization parameters for stepmix.emission.covariate.Covariate.
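The column bookkeeping implied by a descriptor can be checked with a few lines of plain Python. The helper split_columns is hypothetical and only illustrates how n_columns entries map to consecutive column ranges in declaration order.

```python
descriptor = {
    "model_1": {"model": "gaussian_unit", "n_columns": 3},
    "model_2": {"model": "multinoulli", "n_columns": 15, "n_outcomes": 5},
    "model_3": {"model": "covariate", "n_columns": 4,
                "method": "newton-raphson", "lr": 1e-3},
}


def split_columns(descriptor):
    """Map each nested model name to its (start, stop) column range."""
    ranges, start = {}, 0
    for name, args in descriptor.items():
        stop = start + args["n_columns"]
        ranges[name] = (start, stop)
        start = stop
    return ranges, start


ranges, n_total_columns = split_columns(descriptor)
```

Here n_total_columns is the width the input matrix is expected to have, and ranges[name] gives the slice of columns belonging to each nested model.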

check_parameters()

Validate class attributes.

get_default_feature_names(n_features)
get_parameters()[source]

Get a copy of model parameters.

Returns

parameters – Copy of model parameters.

Return type

dict

get_parameters_df(feature_names=None)[source]

Return self.parameters as a long dataframe.

Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].

Call self._to_df or implement a custom method.

initialize(X, resp, random_state=None)[source]

Initialize parameters.

Simply performs the m-step on the current responsibilities to initialize parameters.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.

  • random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.

log_likelihood(X)[source]

Return the log-likelihood of the input data.

Parameters

X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.

Returns

ll – Log-likelihood of the input data conditioned on each component.

Return type

ndarray of shape (n_samples, n_components)

m_step(X, resp)[source]

Update model parameters via maximum likelihood using the current responsibilities.

Parameters
  • X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.

  • resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.

property n_parameters

Number of free parameters in the model.

permute_classes(perm, axis=0)[source]

Permute the latent class and associated parameters of this estimator.

Effectively remaps latent classes.

Parameters
  • perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).

  • axis (int) – Axis to use for permuting the parameters.

predict(log_resp)

Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Argmax P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

predict_proba(log_resp)

Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).

This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.

Parameters

log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.

Returns

resp – Conditional probabilities P(Y|X) of each sample.

Return type

ndarray of shape (n_samples, n_columns)

print_parameters(indent=1, feature_names=None)[source]

Print parameters with nice formatting.

This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.

Parameters
  • indent (int) – Indentation to add to the printed output.

  • feature_names (List of str) – Variable names.

  • index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.

  • columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.

  • model_name (str) – str to display as model name.

sample(class_no, n_samples)[source]

Sample n_samples conditioned on the given class_no.

Parameters
  • class_no (int) – Integer index of the latent class to sample from.

  • n_samples (int) – Number of samples.

Returns

samples – Samples drawn from the given class.

Return type

ndarray of shape (n_samples, n_columns)
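Class-conditional sampling can be sketched for a Bernoulli emission model as follows. The parameter array `pis`, the helper `sample_bernoulli`, and the seeded generator are hypothetical names introduced for the example; the library's own sampling code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli parameters, shape (n_components, n_columns).
pis = np.array([[0.9, 0.9, 0.1],
                [0.1, 0.5, 0.9]])

def sample_bernoulli(class_no, n_samples):
    """Draw n_samples binary rows from the distribution of one class."""
    # Each column is an independent Bernoulli draw with that class's probability.
    draws = rng.uniform(size=(n_samples, pis.shape[1]))
    return (draws < pis[class_no]).astype(int)

samples = sample_bernoulli(class_no=0, n_samples=5)
```

The result has shape (n_samples, n_columns), with one row per draw from the chosen class.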

set_parameters(parameters)[source]

Set current parameters.

Parameters

parameters (dict) – Model parameters. Should be the same format as the dict returned by self.get_parameters.

Datasets

Various synthetic datasets.

stepmix.datasets.bakk_measurements(n_classes, n_mm, sep_level)[source]

Binary measurement parameters in Bakk 2018.

Parameters
  • n_classes (int) – Number of latent classes. Use 3 for the paper simulation.

  • n_mm (int) – Number of features in the measurement model. Use 6 for the paper simulation.

  • sep_level (float) – Separation level in the measurement data. Use .7, .8 or .9 for the paper simulation.

Returns

pis – Conditional Bernoulli probabilities.

Return type

ndarray of shape (n_mm, n_classes)

stepmix.datasets.data_bakk_complete(n_samples, sep_level, n_mm=6, random_state=None, nan_ratio=0.0)[source]

Stitch together data_bakk_covariate and data_bakk_response to get a complete model.

stepmix.datasets.data_bakk_complex(n_samples, sep_level, random_state=None, nan_ratio=0.0)[source]

Build a simulated example with mixed data and missing values.

Measurements: 3 binary variables + 1 continuous variable.

Structural: 3 binary response variables + 1 continuous response variable + 1 covariate.

Missing values everywhere except in the covariate.

Return data as a dataframe.

stepmix.datasets.data_bakk_covariate(n_samples, sep_level, n_mm=6, random_state=None)[source]

Simulated data for the covariate simulations in Bakk 2018.

Parameters
  • n_samples (int) – Number of samples.

  • sep_level (float) – Separation level in the measurement data. Use .7, .8 or .9 for the paper simulation.

  • n_mm (int) – Number of features in the measurement model. Use 6 for the paper simulation.

  • random_state (int) – Random state.

Returns

  • X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.

  • Y (ndarray of shape (n_samples, 1)) – Covariate structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

References

Bakk, Z. and Kuha, J. Two-step estimation of models between latent classes and external variables. Psychometrika, 83(4):871–892, 2018

stepmix.datasets.data_bakk_response(n_samples, sep_level, n_classes=3, n_mm=6, random_state=None)[source]

Simulated data for the response simulations in Bakk 2018.

Parameters
  • n_samples (int) – Number of samples.

  • sep_level (float) – Separation level in the measurement data. Use .7, .8 or .9 for the paper simulation.

  • n_classes (int) – Number of latent classes. Use 3 for the paper simulation.

  • n_mm (int) – Number of features in the measurement model. Use 6 for the paper simulation.

  • random_state (int) – Random state.

Returns

  • X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.

  • Y (ndarray of shape (n_samples, 1)) – Response structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

References

Bakk, Z. and Kuha, J. Two-step estimation of models between latent classes and external variables. Psychometrika, 83(4):871–892, 2018

stepmix.datasets.data_gaussian_binary(n_samples, random_state=None)[source]

Full Gaussian measurement model with 2 binary responses.

The data has 4 latent classes.

Parameters
  • n_samples (int) – Number of samples.

  • random_state (int) – Random state.

Returns

  • X (ndarray of shape (n_samples, 2)) – Gaussian measurement samples.

  • Y (ndarray of shape (n_samples, 2)) – Binary structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

stepmix.datasets.data_gaussian_categorical(n_samples, random_state=None)[source]

Full Gaussian measurement model with 2 categorical responses.

The data has 4 latent classes.

Parameters
  • n_samples (int) – Number of samples.

  • random_state (int) – Random state.

Returns

  • X (ndarray of shape (n_samples, 2)) – Gaussian measurement samples.

  • Y (ndarray of shape (n_samples, 2)) – Categorical structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

stepmix.datasets.data_gaussian_diag(n_samples, sep_level, n_mm=6, random_state=None, nan_ratio=0.0)[source]

Bakk binary measurement model with a 2D diagonal Gaussian structural model.

Optionally, a random proportion of values can be replaced with missing values to test FIML models.

Parameters
  • n_samples (int) – Number of samples.

  • sep_level (float) – Separation level in the measurement data. Use .7, .8 or .9 for the paper simulation.

  • n_mm (int) – Number of features in the measurement model. Use 6 for the paper simulation.

  • random_state (int) – Random state.

  • nan_ratio (float) – Ratio of values to replace with missing values.

Returns

  • X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.

  • Y (ndarray of shape (n_samples, 2)) – Gaussian structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

stepmix.datasets.data_generation_gaussian(n_samples, sep_level, n_mm=6, random_state=None)[source]

Bakk binary measurement model with a more complex Gaussian structural model.

Parameters
  • n_samples (int) – Number of samples.

  • sep_level (float) – Separation level in the measurement data. Use .7, .8 or .9 for the paper simulation.

  • n_mm (int) – Number of features in the measurement model. Use 6 for the paper simulation.

  • random_state (int) – Random state.

Returns

  • X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.

  • Y (ndarray of shape (n_samples, 2)) – Gaussian structural samples.

  • labels (ndarray of shape (n_samples,)) – Ground truth class membership.

stepmix.datasets.random_nan(X, Y, nan_ratio, random_state=None)[source]

Randomly replace values in X and Y with NaNs with probability nan_ratio.
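The masking idea can be sketched with NumPy as below. The helper `mask_with_nan` is a hypothetical stand-in written for this example, not the library function. It assumes each entry is masked independently with probability nan_ratio, and it handles a single array, whereas random_nan takes both X and Y.

```python
import numpy as np

def mask_with_nan(A, nan_ratio, rng):
    """Return a float copy of A where each entry is NaN with probability nan_ratio."""
    A = np.array(A, dtype=float)  # copy, since NaN requires a float dtype
    mask = rng.uniform(size=A.shape) < nan_ratio
    A[mask] = np.nan
    return A

rng = np.random.default_rng(0)
X = np.ones((1000, 6))
X_nan = mask_with_nan(X, nan_ratio=0.2, rng=rng)

# The realized share of missing values should be close to nan_ratio.
observed_ratio = np.isnan(X_nan).mean()
```

Because the mask is drawn per entry, the realized proportion of NaNs only approximates nan_ratio and varies with the random state.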