API
StepMix
- class stepmix.stepmix.StepMix(n_components=2, *, n_steps=1, measurement='bernoulli', structural='gaussian_unit', assignment='modal', correction=None, abs_tol=1e-10, rel_tol=0.0, max_iter=1000, n_init=1, save_param_init=False, init_params='random', random_state=None, verbose=0, progress_bar=1, measurement_params=None, structural_params=None)[source]
Bases:
BaseEstimator
StepMix estimator for Latent Class Analysis.
Multi-step EM estimation of latent class models with measurement and structural models. The measurement and structural models can be fit together (1-step) or sequentially (2-step and 3-step). This estimator implements the BCH and ML bias correction methods for 3-step estimation.
The measurement and structural models can be any of those defined in stepmix.emission. The measurement model can be used alone to effectively fit a latent mixture model.
This class was adapted from the scikit-learn BaseMixture and GaussianMixture classes.
New in version 0.00.
- Parameters
n_components (int, default=2) – The number of latent classes.
n_steps ({1, 2, 3}, default=1) –
Number of steps in the estimation. Must be one of :
1: run EM on both the measurement and structural models.
2: first run EM on the measurement model, then on the complete model, but keep the measurement parameters fixed for the second step. See Bakk, 2018.
3: first run EM on the measurement model, assign class probabilities, then fit the structural model via maximum likelihood. See the correction parameter for bias correction.
measurement ({'bernoulli', 'bernoulli_nan', 'binary', 'binary_nan', 'categorical', 'categorical_nan', 'continuous', 'continuous_nan', 'covariate', 'gaussian', 'gaussian_nan', 'gaussian_unit', 'gaussian_unit_nan', 'gaussian_spherical', 'gaussian_spherical_nan', 'gaussian_tied', 'gaussian_diag', 'gaussian_diag_nan', 'gaussian_full', 'multinoulli', 'multinoulli_nan', dict}, default='bernoulli') –
String describing the measurement model. Must be one of:
’bernoulli’: the observed data consists of n_features bernoulli (binary) random variables.
’bernoulli_nan’: the observed data consists of n_features bernoulli (binary) random variables. Supports missing values.
’binary’: alias for bernoulli.
’binary_nan’: alias for bernoulli_nan.
’categorical’: alias for multinoulli.
’categorical_nan’: alias for multinoulli_nan.
’continuous’: alias for gaussian_diag.
’continuous_nan’: alias for gaussian_diag_nan. Supports missing values.
’covariate’: covariate model where class probabilities are a multinomial logistic model of the features.
’gaussian’: alias for gaussian_unit.
’gaussian_nan’: alias for gaussian_unit_nan. Supports missing values.
’gaussian_unit’: each gaussian component has unit variance. Only fit the mean.
’gaussian_unit_nan’: each gaussian component has unit variance. Only fit the mean. Supports missing values.
’gaussian_spherical’: each gaussian component has its own single variance.
’gaussian_spherical_nan’: each gaussian component has its own single variance. Supports missing values.
’gaussian_tied’: all gaussian components share the same general covariance matrix.
’gaussian_diag’: each gaussian component has its own diagonal covariance matrix.
’gaussian_diag_nan’: each gaussian component has its own diagonal covariance matrix. Supports missing values.
’gaussian_full’: each gaussian component has its own general covariance matrix.
’multinoulli’: the observed data consists of n_features multinoulli (categorical) random variables.
’multinoulli_nan’: the observed data consists of n_features multinoulli (categorical) random variables. Supports missing values.
Models suffixed with _nan support missing values, but may be slower than their fully observed counterpart.
Alternatively accepts a dict to define a nested model, e.g., 3 gaussian features and 2 binary features. Please refer to stepmix.emission.nested.Nested for details.
structural ({'bernoulli', 'bernoulli_nan', 'binary', 'binary_nan', 'categorical', 'categorical_nan', 'continuous', 'continuous_nan', 'covariate', 'gaussian', 'gaussian_nan', 'gaussian_unit', 'gaussian_unit_nan', 'gaussian_spherical', 'gaussian_spherical_nan', 'gaussian_tied', 'gaussian_diag', 'gaussian_diag_nan', 'gaussian_full', 'multinoulli', 'multinoulli_nan', dict}, default='gaussian_unit') – String describing the structural model. Same options as those for the measurement model.
assignment ({'soft', 'modal'}, default='modal') –
Class assignments for 3-step estimation. Must be one of:
’soft’: keep class responsibilities (posterior probabilities) as is.
’modal’: assign 1 to the class with max probability, 0 otherwise (one-hot encoding).
correction ({None, 'BCH', 'ML'}, default=None) –
Bias correction for 3-step estimation. Must be one of:
None : No correction. Run Naive 3-step.
’BCH’ : Apply the empirical BCH correction from Bolck et al., 2004; Vermunt, 2010.
’ML’ : Apply the ML correction from Vermunt, 2010; Bakk et al., 2013.
abs_tol (float, default=1e-10) – The absolute convergence threshold. EM iterations will stop when the average gain in the lower bound is below this threshold.
rel_tol (float, default=0.0) – The relative convergence threshold. EM iterations will stop when the relative average gain in the lower bound is below this threshold.
max_iter (int, default=1000) – The maximum number of EM iterations to perform.
n_init (int, default=1) – The number of initializations to perform. The best results are kept.
save_param_init (bool, default=False) – Save the estimated parameters of all initializations to self.param_buffer_.
init_params ({'kmeans', 'random'}, default='random') –
The method used to initialize the weights, the means and the precisions. Must be one of:
’kmeans’ : responsibilities are initialized using kmeans.
’random’ : responsibilities are initialized randomly.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
verbose (int, default=0) – Enable verbose output. If 1, will print detailed report of the model and the performance metrics after fitting.
progress_bar (int, default=1) –
Display a tqdm progress bar during fitting.
0 : No progress bar.
1 : Progress bar for initializations.
2 : Progress bars for initializations and iterations. This requires a nested tqdm bar and may not work properly in some terminals.
measurement_params ({dict, None}, default=None) – Additional params passed to the measurement model class. Particularly useful to specify optimization parameters for stepmix.emission.covariate.Covariate. Ignored if the measurement descriptor is a nested object (see stepmix.emission.nested.Nested).
structural_params ({dict, None}, default=None) – Additional params passed to the structural model class. Particularly useful to specify optimization parameters for stepmix.emission.covariate.Covariate. Ignored if the structural descriptor is a nested object (see stepmix.emission.nested.Nested).
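The two values of the assignment parameter differ only in how class responsibilities are handed to the third step. The following pure-Python sketch illustrates the idea on a small table of posterior probabilities. It is not the library's implementation, and the helper name modal_assignment is hypothetical.

```python
def modal_assignment(resp):
    """One-hot encode the most probable class of each row (ties go to the first class)."""
    one_hot = []
    for row in resp:
        best = max(range(len(row)), key=row.__getitem__)
        one_hot.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return one_hot


# Posterior probabilities for two samples and three latent classes.
resp = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]

soft = resp                      # 'soft': keep the responsibilities as is
modal = modal_assignment(resp)   # 'modal': 1 for the most probable class, 0 otherwise
print(modal)
```

With soft assignments every sample contributes to every class in proportion to its posterior, while modal assignments commit each sample to a single class.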
- weights_
The weights of each mixture component.
- Type
ndarray of shape (n_components,)
- _mm
Measurement model, including parameters and estimation methods.
- Type
stepmix.emission.Emission
- _sm
Structural model, including parameters and estimation methods.
- Type
stepmix.emission.Emission
- log_resp_
Initial log responsibilities.
- Type
ndarray of shape (n_samples, n_components)
- lower_bound_
Lower bound value on the log-likelihood (of the training data with respect to the model) of the best fit of EM.
- Type
- lower_bound_buffer_
Lower bound values on the log-likelihood (of the training data with respect to the model) of all EM initializations.
- Type
- param_buffer_
Final parameters of all initializations. Only updated if save_param_init=True.
- Type
Notes
References
Bolck, A., Croon, M., and Hagenaars, J. Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political Analysis, 12(1):3–27, 2004.
Vermunt, J. K. Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18(4):450–469, 2010.
Bakk, Z., Tekle, F. B., and Vermunt, J. K. Estimating the association between latent class membership and external variables using bias-adjusted three-step approaches. Sociological Methodology, 43(1):272–311, 2013.
Bakk, Z. and Kuha, J. Two-step estimation of models between latent classes and external variables. Psychometrika, 83(4):871–892, 2018.
Examples
from stepmix.datasets import data_bakk_response
from stepmix.stepmix import StepMix

# Soft 3-step
X, Y, _ = data_bakk_response(n_samples=1000, sep_level=.7, random_state=42)
model = StepMix(n_components=3, n_steps=3, measurement='bernoulli', structural='gaussian_unit', random_state=42, assignment='soft')
model.fit(X, Y)
model.score(X, Y)  # Average log-likelihood

# Equivalently, each step can be performed individually. See the code of the fit method for details.
model = StepMix(n_components=3, measurement='bernoulli', structural='gaussian_unit', random_state=42)
model.em(X)  # Step 1
probs = model.predict_proba(X)  # Step 2
model.m_step_structural(probs, Y)  # Step 3
model.score(X, Y)  # Average log-likelihood
- aic(X, Y=None)[source]
Akaike information criterion for the current model on the measurement data X and optionally the structural data Y.
Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
aic – The lower the better.
- Return type
- bic(X, Y=None)[source]
Bayesian information criterion for the current model on the measurement data X and optionally the structural data Y.
Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
bic – The lower the better.
- Return type
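For orientation, the textbook definitions of the two criteria can be written in terms of the average log-likelihood returned by score. This is a minimal sketch of the standard formulas, assuming StepMix follows them as the scikit-learn code it was adapted from does. The function names and the example numbers are illustrative only.

```python
import math


def aic_from_score(avg_ll, n_samples, n_parameters):
    """AIC = -2 * total log-likelihood + 2 * number of free parameters."""
    return -2.0 * avg_ll * n_samples + 2.0 * n_parameters


def bic_from_score(avg_ll, n_samples, n_parameters):
    """BIC = -2 * total log-likelihood + number of free parameters * log(n_samples)."""
    return -2.0 * avg_ll * n_samples + n_parameters * math.log(n_samples)


# Hypothetical values: average log-likelihood, sample size, free parameters.
aic_value = aic_from_score(-3.2, 1000, 20)
bic_value = bic_from_score(-3.2, 1000, 20)
print(aic_value, bic_value)
```

Because log(n_samples) exceeds 2 for any sample size above 7, BIC penalizes additional parameters more heavily than AIC on realistic datasets.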
- bootstrap(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, sampler=None, identify_classes=True, progress_bar=True, random_state=None)[source]
Parametric or non-parametric bootstrap of this estimator.
Fit n_repetitions clones of the estimator on resampled datasets.
If identify_classes=True, repeated parameter estimates are aligned with the class order of the main estimator using a permutation search.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored if parametric=True.
n_repetitions (int) – Number of repetitions to fit.
parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.
sampler (bool, default=None) – Another fitted estimator to use for sampling instead of the main estimator. Only used for parametric bootstrapping.
identify_classes (bool, default=True) – Run a permutation test to align the classes of the repetitions to the classes of the main estimator. This is required if inference on the model parameters is needed, but can be turned off if only the likelihood needs to be bootstrapped to save computations.
progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.
random_state (int, default=None) – If None, use self.random_state.
- Returns
samples (DataFrame) – Parameter DataFrame for all repetitions. Follows the convention of StepMix.get_parameters_df() with an additional ‘rep’ column.
rep_stats (DataFrame) – Likelihood statistics of each repetition.
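The non-parametric variant can be pictured as resampling rows with replacement and refitting on each resampled dataset. The sketch below shows that loop with a toy estimator (the sample mean) standing in for a StepMix clone. The function name nonparametric_bootstrap and the data are hypothetical, and class alignment is not covered.

```python
import random
import statistics


def nonparametric_bootstrap(data, estimator, n_repetitions=200, seed=0):
    """Refit `estimator` on n_repetitions datasets resampled with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_repetitions):
        resampled = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resampled))
    return estimates


# Toy binary data; the "model" is just the proportion of ones.
data = [0, 1, 1, 0, 1, 1, 1, 0, 1, 0]
estimates = nonparametric_bootstrap(data, statistics.fmean)

# Summary statistics over repetitions, in the spirit of bootstrap_stats.
boot_mean = statistics.fmean(estimates)
boot_std = statistics.stdev(estimates)
print(boot_mean, boot_std)
```

In a parametric bootstrap the resampling line would instead draw a fresh dataset from the fitted model.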
- bootstrap_stats(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, progress_bar=True)[source]
Bootstrapping of a StepMix estimator. Obtain bootstrapped parameters and some statistics (mean and standard deviation).
If a covariate model is used in the structural model, the output keys “cw_mean” and “cw_std” are omitted.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
n_repetitions (int) – Number of repetitions to fit.
parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.
progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.
- Returns
bootstrap_and_stats – Dictionary of DataFrames with the following keys:
‘samples’: Parameters estimated by self.bootstrap in a long-form DataFrame.
‘rep_stats’: Likelihood statistics of each repetition provided by self.bootstrap.
‘cw_mean’: Bootstrapped means of the class weights.
‘cw_std’: Bootstrapped standard deviations of the class weights.
‘mm_mean’: Bootstrapped means of the measurement model parameters.
‘mm_std’: Bootstrapped standard deviations of the measurement model parameters.
‘sm_mean’: Bootstrapped means of the structural model parameters.
‘sm_std’: Bootstrapped standard deviations of the structural model parameters.
- Return type
dict
- caic(X, Y=None)[source]
Consistent AIC.
References
Bozdogan, H. 1987. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 52: 345–370.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
caic – The lower the better.
- Return type
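The consistent AIC of Bozdogan (1987) adds one extra unit of penalty per parameter on top of the BIC penalty. A minimal sketch of the standard formula follows, assuming the log-likelihood is obtained from the average log-likelihood and the sample size. The function name and numbers are illustrative.

```python
import math


def caic_from_score(avg_ll, n_samples, n_parameters):
    """CAIC = -2 * total log-likelihood + n_parameters * (log(n_samples) + 1)."""
    return -2.0 * avg_ll * n_samples + n_parameters * (math.log(n_samples) + 1.0)


# Hypothetical values: average log-likelihood, sample size, free parameters.
caic_value = caic_from_score(-3.2, 1000, 20)
print(caic_value)
```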
- em(X, Y=None, sample_weight=None, freeze_measurement=False, log_emission_pm=None)[source]
EM algorithm to fit the weights, measurement parameters and structural parameters.
Adapted from the fit_predict method of the sklearn BaseMixture class to include (optional) structural model computations.
Setting Y=None will run EM on the measurement model only. Providing both X and Y will run EM on the complete model, unless otherwise specified by freeze_measurement.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
freeze_measurement (bool, default=False) – Run EM on the complete model, but do not update measurement model parameters. Useful for 2-step estimation and 3-step with ML correction.
log_emission_pm (ndarray of shape (n, n_components), default=None) – Log probabilities of the predicted class given the true latent class for ML correction.
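The EM loop ends either at max_iter or when the gain in the lower bound falls below the abs_tol or rel_tol thresholds of the constructor. The sketch below shows one plausible stopping rule. How StepMix actually combines the two tolerances is an assumption here (either one is treated as sufficient), and has_converged is a hypothetical helper.

```python
def has_converged(prev_lower_bound, lower_bound, abs_tol=1e-10, rel_tol=0.0):
    """Return True when the gain in the lower bound is below either tolerance."""
    change = lower_bound - prev_lower_bound
    if prev_lower_bound != 0.0:
        rel_change = change / abs(prev_lower_bound)
    else:
        rel_change = float("inf")
    # Assumption: meeting either threshold stops the iterations.
    return abs(change) < abs_tol or abs(rel_change) < rel_tol


print(has_converged(-3.2, -3.1))                   # large gain: keep iterating
print(has_converged(-3.2, -3.2 + 1e-12))           # gain below abs_tol
print(has_converged(-3.2, -3.1999, rel_tol=1e-3))  # gain below rel_tol
```

With the default rel_tol=0.0 the relative test can never trigger, so only abs_tol and max_iter matter unless rel_tol is raised.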
- entropy(X, Y=None)[source]
Entropy of the posterior over latent classes.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
entropy
- Return type
- fit(X, Y=None, sample_weight=None, y=None)[source]
Fit StepMix measurement model and optionally the structural model.
Setting Y=None will fit the measurement model only. Providing both X and Y will fit the full model following the self.n_steps argument.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
y (array-like of shape (n_samples, n_features), default=None) – Alias for Y. Ignored if Y is provided.
sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
- get_cw_df(x_names=None, y_names=None)[source]
Get class weights as DataFrame with classes as columns.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
routing – A MetadataRequest encapsulating routing information.
- Return type
MetadataRequest
- get_mm_df(x_names=None, y_names=None)[source]
Get measurement model parameters as DataFrame with classes as columns.
- get_parameters()[source]
Get model parameters as a Python dictionary.
- Returns
params – Nested dict {‘weights’: Current class weights, ‘measurement’: dict of measurement params, ‘structural’: dict of structural params, ‘measurement_in’: number of measurements, ‘structural_in’: number of structural features, }.
- Return type
dict
- get_parameters_df(x_names=None, y_names=None)[source]
Get model parameters as a long-form DataFrame.
- get_params(deep=True)
Get parameters for this estimator.
- get_sm_df(x_names=None, y_names=None)[source]
Get structural model parameters as DataFrame with classes as columns.
- m_step_structural(resp, Y, sample_weight=None)[source]
M-step for the structural model only.
Handy for 3-step estimation.
- Parameters
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in Y.
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
- property n_parameters
Get number of free parameters.
- permute_classes(perm)[source]
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
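Relabeling classes amounts to reordering every per-class quantity with the same index array. The sketch below applies a permutation to class weights and per-class parameters stored as plain lists. The convention that new class i takes the values of old class perm[i] is an assumption modeled on NumPy indexing, and permute is a hypothetical helper rather than the library's code.

```python
def permute(weights, class_params, perm):
    """Reorder class weights and per-class parameters with the same permutation."""
    if sorted(perm) != list(range(len(weights))):
        raise ValueError("perm must be a permutation of range(n_classes)")
    new_weights = [weights[k] for k in perm]
    new_params = [class_params[k] for k in perm]
    return new_weights, new_params


weights = [0.5, 0.3, 0.2]
means = [[0.1], [0.5], [0.9]]   # one parameter row per class

new_weights, new_means = permute(weights, means, [2, 0, 1])
print(new_weights, new_means)
```

The model itself is unchanged by such a relabeling, which is why bootstrap repetitions need to be aligned before their parameters are compared.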
- predict(X, Y=None)[source]
Predict the cluster/latent class/component labels for the data samples in X.
Optionally, an array-like Y can be provided to predict the labels based on both the measurement and structural models.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
labels – Component labels.
- Return type
array, shape (n_samples,)
- predict_Y(X)[source]
Call the predict method of the structural model to predict argmax P(Y|X) (Supervised prediction).
Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
predictions – Y predictions.
- Return type
array, shape (n_samples, n_columns)
- predict_class(X, Y=None)[source]
Predict the cluster/latent class/component labels for the data samples in X.
Optionally, an array-like Y can be provided to predict the labels based on both the measurement and structural models.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
labels – Component labels.
- Return type
array, shape (n_samples,)
- predict_proba(X, Y=None)[source]
Predict the latent class probabilities for the data samples in X using the measurement model.
Optionally, an array-like Y can be provided to predict the posterior based on both the measurement and structural models.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
resp – P(class|X, Y) for each sample in X.
- Return type
array, shape (n_samples, n_components)
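To make the returned quantity concrete, the following sketch computes P(class|X) by Bayes' rule for a Bernoulli measurement model, working in log space for numerical stability. It is a pure-Python illustration under assumed parameters, not the library's implementation, and the names class_posteriors, weights and probs are hypothetical.

```python
import math


def log_bernoulli(x, p):
    """Log-likelihood of a binary vector x under independent Bernoulli parameters p."""
    return sum(math.log(pj) if xj == 1 else math.log(1.0 - pj) for xj, pj in zip(x, p))


def class_posteriors(X, weights, probs):
    """P(class | x) for each row of X, normalized with the log-sum-exp trick."""
    resp = []
    for x in X:
        log_joint = [math.log(w) + log_bernoulli(x, p) for w, p in zip(weights, probs)]
        m = max(log_joint)
        log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint))
        resp.append([math.exp(v - log_norm) for v in log_joint])
    return resp


weights = [0.5, 0.5]                 # class weights
probs = [[0.9, 0.8], [0.2, 0.1]]     # P(feature = 1 | class), one row per class
resp = class_posteriors([[1, 1], [0, 0]], weights, probs)
print(resp)
```

Passing Y as well would add the structural log-likelihood of each class to log_joint before normalizing.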
- predict_proba_Y(X)[source]
Call the predict method of the structural model to predict the full conditional P(Y|X).
Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
conditional – P(Y|X).
- Return type
array, shape (n_samples, n_columns)
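One way to read P(Y|X) in a latent class model is as a mixture of the class-conditional outcome distributions, weighted by the class posteriors given X. The sketch below implements that marginalization for a single categorical outcome. Treating this as what the structural model's predict method computes is an assumption, and all names and numbers are illustrative.

```python
def conditional_y(resp, y_given_class):
    """P(Y = c | x) = sum over classes k of P(k | x) * P(Y = c | k)."""
    n_outcomes = len(y_given_class[0])
    return [
        [sum(r_k * y_given_class[k][c] for k, r_k in enumerate(row)) for c in range(n_outcomes)]
        for row in resp
    ]


resp = [[0.9, 0.1], [0.2, 0.8]]            # P(class | x) for two samples
y_given_class = [[0.7, 0.3], [0.1, 0.9]]   # P(Y | class), one row per class

cond = conditional_y(resp, y_given_class)
# Supervised prediction: the most probable outcome for each sample.
pred = [max(range(len(row)), key=row.__getitem__) for row in cond]
print(cond, pred)
```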
- predict_proba_class(X, Y=None)[source]
Predict the latent class probabilities for the data samples in X using the measurement model.
Optionally, an array-like Y can be provided to predict the posterior based on both the measurement and structural models.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
resp – P(class|X, Y) for each sample in X.
- Return type
array, shape (n_samples, n_components)
- relative_entropy(X, Y=None)[source]
Scaled Relative Entropy of the posterior over latent classes.
Ramaswamy et al., 1993.
1 - entropy / (n_samples * log(n_components))
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
relative_entropy
- Return type
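The formula above can be checked by hand on small posterior tables. This sketch assumes the entropy is summed over samples and classes, which is consistent with the n_samples * log(n_components) normalization. The function names mirror the methods but take the posterior table directly, which is a simplification.

```python
import math


def entropy(resp):
    """Total entropy of the class posteriors, summed over samples (0 * log 0 taken as 0)."""
    return -sum(p * math.log(p) for row in resp for p in row if p > 0.0)


def relative_entropy(resp):
    """1 - entropy / (n_samples * log(n_components))."""
    n_samples = len(resp)
    n_components = len(resp[0])
    return 1.0 - entropy(resp) / (n_samples * math.log(n_components))


certain = [[1.0, 0.0], [0.0, 1.0]]   # every sample is assigned with certainty
uniform = [[0.5, 0.5], [0.5, 0.5]]   # posteriors carry no information

print(relative_entropy(certain), relative_entropy(uniform))
```

Values near 1 indicate well-separated classes, and values near 0 indicate posteriors close to uniform.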
- report(X, Y=None, sample_weight=None)[source]
Print detailed report of the model and performance metrics.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
- sabic(X, Y=None)[source]
Sample-Size Adjusted BIC.
References
Sclove SL. Application of model-selection criteria to some problems in multivariate analysis. Psychometrika. 1987;52(3):333–343.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
ssa_bic
- Return type
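The sample-size adjusted BIC is commonly defined by replacing n_samples in the BIC penalty with (n_samples + 2) / 24. The sketch below uses that standard definition, assuming the log-likelihood is recovered from the average log-likelihood. The function name and numbers are illustrative.

```python
import math


def sabic_from_score(avg_ll, n_samples, n_parameters):
    """SABIC = -2 * total log-likelihood + n_parameters * log((n_samples + 2) / 24)."""
    return -2.0 * avg_ll * n_samples + n_parameters * math.log((n_samples + 2) / 24.0)


# Hypothetical values: average log-likelihood, sample size, free parameters.
sabic_value = sabic_from_score(-3.2, 1000, 20)
print(sabic_value)
```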
- sample(n_samples, labels=None)[source]
Sample method for fitted StepMix model.
Adapted from the sklearn BaseMixture sample method.
- Parameters
n_samples (int) – Number of samples.
labels (ndarray of shape (n_samples,)) – Predetermined class labels. Will ignore class weights if provided.
- Returns
X (array-like of shape (n_samples, n_columns)) – Measurement samples.
Y (array-like of shape (n_samples, n_columns_structural)) – Structural samples.
labels (ndarray of shape (n_samples,)) – Ground truth class membership.
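Sampling from a mixture is a two-stage draw: first a latent class, then the observed features given that class. The sketch below does this for a Bernoulli measurement model only. It is a pure-Python illustration with hypothetical names and parameters, and it omits the structural samples Y.

```python
import random


def sample_mixture(n_samples, weights, probs, labels=None, seed=0):
    """Draw class labels from `weights` (unless given), then binary features per class."""
    rng = random.Random(seed)
    if labels is None:
        labels = rng.choices(range(len(weights)), weights=weights, k=n_samples)
    # Predetermined labels bypass the class weights entirely.
    X = [[1 if rng.random() < p else 0 for p in probs[k]] for k in labels]
    return X, list(labels)


weights = [0.6, 0.4]
probs = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]   # P(feature = 1 | class)

X, labels = sample_mixture(5, weights, probs)
X_fixed, labels_fixed = sample_mixture(3, weights, probs, labels=[1, 1, 0])
print(X, labels)
```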
- score(X, Y=None, sample_weight=None)[source]
Compute the average log-likelihood over samples.
Setting Y=None will ignore the structural likelihood.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
- Returns
avg_ll – Average log likelihood over samples.
- Return type
- set_fit_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMix
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_parameters(params)[source]
Set parameters.
- Parameters
params (dict,) – Same format as self.get_parameters().
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMix
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- class stepmix.stepmix.StepMixClassifier(n_components=2, *, n_steps=1, measurement='bernoulli', structural='gaussian_unit', assignment='modal', correction=None, abs_tol=1e-10, rel_tol=0.0, max_iter=1000, n_init=1, save_param_init=False, init_params='random', random_state=None, verbose=0, progress_bar=1, measurement_params=None, structural_params=None)[source]
Bases:
StepMix
StepMix Supervised Classifier
Identical to a StepMix estimator, but we remap predict and predict_proba to perform inference over Y instead of the latent class. This follows the supervised learning convention of sklearn.
We call this a classifier since inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.
Also works with the aliases ‘bernoulli’, ‘bernoulli_nan’, ‘multinoulli’ and ‘multinoulli_nan’.
- aic(X, Y=None)
Akaike information criterion for the current model on the measurement data X and optionally the structural data Y.
Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
aic – The lower the better.
- Return type
- bic(X, Y=None)
Bayesian information criterion for the current model on the measurement data X and optionally the structural data Y.
Adapted from https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/mixture/_base.py
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
bic – The lower the better.
- Return type
- bootstrap(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, sampler=None, identify_classes=True, progress_bar=True, random_state=None)
Parametric or non-parametric bootstrap of this estimator.
Fit n_repetitions clones of the estimator on resampled datasets.
If identify_classes=True, repeated parameter estimates are aligned with the class order of the main estimator using a permutation search.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored if parametric=True.
n_repetitions (int) – Number of repetitions to fit.
parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.
sampler (bool, default=None) – Another fitted estimator to use for sampling instead of the main estimator. Only used for parametric bootstrapping.
identify_classes (bool, default=True) – Run a permutation test to align the classes of the repetitions to the classes of the main estimator. This is required if inference on the model parameters is needed, but can be turned off if only the likelihood needs to be bootstrapped to save computations.
progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.
random_state (int, default=None) – If None, use self.random_state.
- Returns
samples (DataFrame) – Parameter DataFrame for all repetitions. Follows the convention of StepMix.get_parameters_df() with an additional ‘rep’ column.
rep_stats (DataFrame) – Likelihood statistics of each repetition.
- bootstrap_stats(X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, progress_bar=True)
Bootstrapping of a StepMix estimator. Obtain bootstrapped parameters and some statistics (mean and standard deviation).
If a covariate model is used in the structural model, the output keys “cw_mean” and “cw_std” are omitted.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
n_repetitions (int) – Number of repetitions to fit.
parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.
progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.
- Returns
bootstrap_and_stats – Dictionary of DataFrames {‘samples’: Parameters estimated by self.bootstrap in a long-form DataFrame, ‘rep_stats’: Likelihood statistics of each repetition provided by self.bootstrap, ‘cw_mean’: Bootstrapped means of the class weights, ‘cw_std’: Bootstrapped standard deviations of the class weights, ‘mm_mean’: Bootstrapped means of the measurement model parameters, ‘mm_std’: Bootstrapped standard deviations of the measurement model parameters, ‘sm_mean’: Bootstrapped means of the structural model parameters, ‘sm_std’: Bootstrapped standard deviations of the structural model parameters}.
- Return type
dict
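To make the mean and standard deviation summaries concrete, here is a minimal sketch that aggregates long-form bootstrap records across repetitions. The records are made-up illustrative values, and the use of the sample standard deviation (ddof=1) is an assumption rather than a statement about what StepMix computes.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical long-form bootstrap output: one record per parameter and repetition.
rows = [
    {"rep": 0, "param": "pis", "class_no": 0, "variable": "x0", "value": 0.2},
    {"rep": 1, "param": "pis", "class_no": 0, "variable": "x0", "value": 0.4},
    {"rep": 2, "param": "pis", "class_no": 0, "variable": "x0", "value": 0.3},
    {"rep": 0, "param": "pis", "class_no": 1, "variable": "x0", "value": 0.7},
    {"rep": 1, "param": "pis", "class_no": 1, "variable": "x0", "value": 0.9},
    {"rep": 2, "param": "pis", "class_no": 1, "variable": "x0", "value": 0.8},
]

# Group values by everything except the repetition index.
groups = defaultdict(list)
for r in rows:
    groups[(r["param"], r["class_no"], r["variable"])].append(r["value"])

# Mean and sample standard deviation across repetitions.
summary = {key: (mean(vals), stdev(vals)) for key, vals in groups.items()}
```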
- caic(X, Y=None)
Consistent AIC.
References
Bozdogan, H. 1987. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 52: 345–370.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
caic – The lower the better.
- Return type
float
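For reference, the textbook form of Bozdogan's criterion can be written in a few lines. This is a sketch of the standard definition, CAIC = -2 LL + k (ln n + 1); the function and argument names are hypothetical, and it takes the total (not average) log-likelihood as input.

```python
import math

def caic(total_log_likelihood, n_parameters, n_samples):
    """Consistent AIC: -2 * LL + k * (ln(n) + 1). Lower is better."""
    return -2.0 * total_log_likelihood + n_parameters * (math.log(n_samples) + 1.0)

# Same fit quality, more free parameters: the penalty grows.
small_model = caic(total_log_likelihood=-100.0, n_parameters=5, n_samples=100)
large_model = caic(total_log_likelihood=-100.0, n_parameters=9, n_samples=100)
```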
- em(X, Y=None, sample_weight=None, freeze_measurement=False, log_emission_pm=None)
EM algorithm to fit the weights, measurement parameters and structural parameters.
Adapted from the fit_predict method of the sklearn BaseMixture class to include (optional) structural model computations.
Setting Y=None will run EM on the measurement model only. Providing both X and Y will run EM on the complete model, unless otherwise specified by freeze_measurement.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
freeze_measurement (bool, default=False) – Run EM on the complete model, but do not update measurement model parameters. Useful for 2-step estimation and 3-step with ML correction.
log_emission_pm (ndarray of shape (n, n_components), default=None) – Log probabilities of the predicted class given the true latent class for ML correction.
- entropy(X, Y=None)
Entropy of the posterior over latent classes.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
entropy
- Return type
float
- fit(X, Y=None, sample_weight=None, y=None)
Fit StepMix measurement model and optionally the structural model.
Setting Y=None will fit the measurement model only. Providing both X and Y will fit the full model following the self.n_steps argument.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
y (array-like of shape (n_samples, n_features), default=None) – Alias for Y. Ignored if Y is provided.
sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
- get_cw_df(x_names=None, y_names=None)
Get class weights as DataFrame with classes as columns.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
routing – A MetadataRequest encapsulating routing information.
- Return type
MetadataRequest
- get_mm_df(x_names=None, y_names=None)
Get measurement model parameters as DataFrame with classes as columns.
- get_parameters()
Get model parameters as a Python dictionary.
- Returns
params – Nested dict {‘weights’: Current class weights, ‘measurement’: dict of measurement params, ‘structural’: dict of structural params, ‘measurement_in’: number of measurements, ‘structural_in’: number of structural features}.
- Return type
dict
- get_parameters_df(x_names=None, y_names=None)
Get model parameters as a long-form DataFrame.
- get_params(deep=True)
Get parameters for this estimator.
- get_sm_df(x_names=None, y_names=None)
Get structural model parameters as DataFrame with classes as columns.
- m_step_structural(resp, Y, sample_weight=None)
M-step for the structural model only.
Handy for 3-step estimation.
- Parameters
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in Y.
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
- property n_parameters
Get number of free parameters.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
- predict(X)[source]
Call the predict method of the structural model to predict argmax P(Y|X) (Supervised prediction).
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
predictions – Y predictions.
- Return type
array, shape (n_samples, n_columns)
- predict_Y(X)
Call the predict method of the structural model to predict argmax P(Y|X) (Supervised prediction).
Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
predictions – Y predictions.
- Return type
array, shape (n_samples, n_columns)
- predict_class(X, Y=None)
Predict the cluster/latent class/component labels for the data samples in X.
Optionally, an array-like Y can be provided to predict the labels based on both the measurement and structural models.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
labels – Component labels.
- Return type
array, shape (n_samples,)
- predict_proba(X)[source]
Call the predict method of the structural model to predict the full conditional P(Y|X).
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
conditional – P(Y|X).
- Return type
array, shape (n_samples, n_columns)
- predict_proba_Y(X)
Call the predict method of the structural model to predict the full conditional P(Y|X).
Inference over Y is only supported if the structural model (sm) is set to ‘binary’, ‘binary_nan’, ‘categorical’, or ‘categorical_nan’.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
conditional – P(Y|X).
- Return type
array, shape (n_samples, n_columns)
- predict_proba_class(X, Y=None)
Predict the latent class probabilities for the data samples in X using the measurement model.
Optionally, an array-like Y can be provided to predict the posterior based on both the measurement and structural models.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
- Returns
resp – P(class|X, Y) for each sample in X.
- Return type
array, shape (n_samples, n_components)
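The posterior P(class|X) follows from Bayes' rule applied to the class weights and the per-class log-likelihoods. The sketch below shows that computation with a log-sum-exp normalisation; the function name and the toy numbers are illustrative assumptions, not StepMix internals.

```python
import math

def posterior(log_weights, log_lik):
    """resp[n][c] is proportional to weight[c] * P(x_n | class c)."""
    resp = []
    for row in log_lik:
        joint = [lw + ll for lw, ll in zip(log_weights, row)]
        m = max(joint)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(j - m) for j in joint))
        resp.append([math.exp(j - log_norm) for j in joint])
    return resp

log_weights = [math.log(0.5), math.log(0.5)]
log_lik = [
    [math.log(0.9), math.log(0.1)],  # sample strongly favouring class 0
    [math.log(0.2), math.log(0.2)],  # sample that does not discriminate
]
resp = posterior(log_weights, log_lik)

# Modal assignment: pick the most probable class for each sample.
labels = [row.index(max(row)) for row in resp]
```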
- relative_entropy(X, Y=None)
Scaled Relative Entropy of the posterior over latent classes.
Ramaswamy et al., 1993.
1 - entropy / (n_samples * log(n_components))
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
relative_entropy
- Return type
float
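The formula above can be checked on two extreme cases. This sketch assumes the entropy term is summed over samples (as the n_samples factor in the denominator suggests) and treats 0 * log(0) as 0; the function names are hypothetical.

```python
import math

def entropy(resp):
    """Sum over samples and classes of -p * log(p), skipping p == 0."""
    return -sum(p * math.log(p) for row in resp for p in row if p > 0.0)

def relative_entropy(resp):
    """1 - entropy / (n_samples * log(n_components))."""
    n_samples = len(resp)
    n_components = len(resp[0])
    return 1.0 - entropy(resp) / (n_samples * math.log(n_components))

certain = [[1.0, 0.0], [0.0, 1.0]]   # every sample assigned with certainty
uniform = [[0.5, 0.5], [0.5, 0.5]]   # posterior carries no class information
```

Values close to 1 indicate well-separated classes, while values close to 0 indicate a posterior that is nearly uniform over the classes.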
- report(X, Y=None, sample_weight=None)
Print detailed report of the model and performance metrics.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
sample_weight (array-like of shape(n_samples,), default=None) –
- sabic(X, Y=None)
Sample-Size Adjusted BIC.
References
Sclove SL. Application of model-selection criteria to some problems in multivariate analysis. Psychometrika. 1987;52(3):333–343.
- Parameters
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
- Returns
ssa_bic
- Return type
float
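As a worked form of Sclove's adjustment, the usual definition replaces the BIC penalty ln(n) with ln((n + 2) / 24). The sketch below uses that standard formula; the function and argument names are hypothetical and the input is the total log-likelihood.

```python
import math

def sabic(total_log_likelihood, n_parameters, n_samples):
    """Sample-size adjusted BIC: -2 * LL + k * ln((n + 2) / 24)."""
    return -2.0 * total_log_likelihood + n_parameters * math.log((n_samples + 2) / 24.0)

# With n = 22 the penalty term vanishes because ln(24 / 24) = 0.
no_penalty = sabic(total_log_likelihood=-100.0, n_parameters=5, n_samples=22)
with_penalty = sabic(total_log_likelihood=-100.0, n_parameters=5, n_samples=1000)
```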
- sample(n_samples, labels=None)
Sample method for fitted StepMix model.
Adapted from the sklearn BaseMixture sample method.
- Parameters
n_samples (int) – Number of samples.
labels (ndarray of shape (n_samples,)) – Predetermined class labels. Will ignore class weights if provided.
- Returns
X (array-like of shape (n_samples, n_columns)) – Measurement samples.
Y (array-like of shape (n_samples, n_columns_structural)) – Structural samples.
labels (ndarray of shape (n_samples,)) – Ground truth class membership.
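Sampling from a mixture is a two-stage draw: first a class label from the class weights, then the features from that class's emission distribution. The sketch below does this for a Bernoulli measurement model only; the function name and parameter values are made up for illustration.

```python
import random

def sample_bernoulli_mixture(weights, pis, n_samples, seed=0):
    """Draw labels from the class weights, then binary features per class.

    pis[c][j] is P(feature j = 1 | class c).
    """
    rng = random.Random(seed)
    classes = list(range(len(weights)))
    labels = rng.choices(classes, weights=weights, k=n_samples)
    X = [[1 if rng.random() < p else 0 for p in pis[c]] for c in labels]
    return X, labels

weights = [0.3, 0.7]
pis = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.7]]
X, labels = sample_bernoulli_mixture(weights, pis, n_samples=10)
```

Passing predetermined labels instead of drawing them would simply skip the first stage, which is why the class weights are ignored in that case.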
- score(X, Y=None, sample_weight=None)
Compute the average log-likelihood over samples.
Setting Y=None will ignore the structural likelihood.
- Parameters
X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points to fit the measurement model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
Y (array-like of shape (n_samples, n_features_structural), default=None) – List of n_features-dimensional data points to fit the structural model. Each row corresponds to a single data point. If the data is categorical, by default it should be 0-indexed and integer encoded (not one-hot encoded).
sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
- Returns
avg_ll – Average log likelihood over samples.
- Return type
float
- set_fit_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMixClassifier
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_parameters(params)
Set parameters.
- Parameters
params (dict) – Same format as self.get_parameters().
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → StepMixClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Bootstrap
Utility functions for model bootstrapping and confidence intervals.
- stepmix.bootstrap.blrt(null_model, alternative_model, X, Y=None, n_repetitions=30, random_state=42)[source]
BLRT Test
References
Dziak, John J., Stephanie T. Lanza, and Xianming Tan. “Effect size, statistical power, and sample size requirements for the bootstrap likelihood ratio test in latent class analysis.” Structural Equation Modeling: A Multidisciplinary Journal 21.4 (2014): 534-552.
Nylund, Karen L., Tihomir Asparouhov, and Bengt O. Muthén. “Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study.” Structural Equation Modeling: A Multidisciplinary Journal 14.4 (2007): 535-569.
- Parameters
null_model (StepMix instance) – A StepMix model with k classes.
alternative_model (StepMix instance) – A StepMix model with k + 1 classes.
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
n_repetitions (int, default=30) – Number of repetitions to fit.
random_state (int, default=42) –
- Returns
p-value – Bootstrap p-value of the BLRT test. A significant test indicates the alternative k + 1 model provides a significantly better fit of the data.
- Return type
float
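In the usual BLRT procedure, datasets are simulated from the fitted null model, both models are refit on each simulated dataset, and the observed likelihood ratio statistic is compared with the simulated ones. The sketch below shows only that final comparison. The function names and log-likelihood values are hypothetical, and the exact convention (>= versus >, or an add-one correction) is an assumption rather than a description of the StepMix code.

```python
def lr_statistic(ll_null, ll_alternative):
    """Likelihood ratio statistic from total log-likelihoods."""
    return 2.0 * (ll_alternative - ll_null)

def bootstrap_p_value(observed_stat, bootstrap_stats):
    """Share of bootstrap statistics at least as large as the observed one."""
    exceed = sum(1 for s in bootstrap_stats if s >= observed_stat)
    return exceed / len(bootstrap_stats)

observed = lr_statistic(ll_null=-520.0, ll_alternative=-500.0)
simulated = [1.0, 5.0, 45.0, 3.0]  # made-up statistics from 4 repetitions
p_value = bootstrap_p_value(observed, simulated)
```

A small p-value means the observed improvement in fit is rarely matched when the data truly come from the k-class model.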
- stepmix.bootstrap.blrt_sweep(model, X, Y=None, low=1, high=5, n_repetitions=30, random_state=42, verbose=True)[source]
Sweep BLRT Test
Run BLRT test for a range of number of classes. For example, if you set low=1 and high=4, the function will return the result of 3 tests [1 vs 2, 2 vs 3, 3 vs 4].
- Parameters
model (StepMix instance) – A StepMix model.
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
low (int, default=1) – Minimum number of classes to test.
high (int, default=5) – Maximum number of classes to test.
n_repetitions (int, default=30) – Number of repetitions to fit.
random_state (int, default=42) –
verbose (bool, default=True) –
- Returns
p-values – Bootstrap p-values of the BLRT test for each comparison.
- Return type
DataFrame
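The comparisons performed by a sweep follow directly from low and high. This small sketch (with a hypothetical helper name) reproduces the example given above for low=1 and high=4.

```python
def sweep_pairs(low, high):
    """List the (k, k + 1) class-count comparisons between low and high."""
    return [(k, k + 1) for k in range(low, high)]

pairs = sweep_pairs(low=1, high=4)
```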
- stepmix.bootstrap.bootstrap(estimator, X, Y=None, n_repetitions=1000, sample_weight=None, parametric=False, sampler=None, identify_classes=True, progress_bar=True, random_state=None)[source]
Parametric or non-parametric bootstrap of a StepMix estimator.
Fit n_repetitions clones of the estimator on resampled datasets.
If identify_classes=True, repeated parameter estimates are aligned with the class order of the main estimator using a permutation search.
- Parameters
estimator (StepMix instance) – A fitted StepMix estimator. Used as a template to clone bootstrap estimators.
X (array-like of shape (n_samples, n_features)) –
Y (array-like of shape (n_samples, n_features_structural), default=None) –
n_repetitions (int) – Number of repetitions to fit.
sample_weight (array-like of shape(n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored if parametric=True.
parametric (bool, default=False) – Use parametric bootstrap instead of non-parametric. Data will be generated by sampling the estimator.
sampler (StepMix instance, default=None) – Another fitted estimator to use for sampling instead of the main estimator. Only used for parametric bootstrapping.
identify_classes (bool, default=True) – Run a permutation test to align the classes of the repetitions to the classes of the main estimator. This is required if inference on the model parameters is needed, but can be turned off if only the likelihood needs to be bootstrapped to save computations.
progress_bar (bool, default=True) – Display a tqdm progress bar for repetitions.
random_state (int, default=None) –
- Returns
samples (DataFrame) – DataFrame of all repetitions. Follows the convention of StepMix.get_parameters_df() with an additional ‘rep’ column.
rep_stats (DataFrame) – Likelihood statistics of each repetition. ‘rep’ column. None if identify_classes=False.
stats (DataFrame) – Various statistics of bootstrapped estimators.
- stepmix.bootstrap.find_best_permutation(reference, target, criterion=<function mse>)[source]
Find the best permutation of the columns in target to minimize some criterion comparing to reference.
- Parameters
reference (ndarray of shape (n_samples, n_columns)) – Reference array.
target (ndarray of shape (n_samples, n_columns)) – Target array.
criterion (Callable, default=mse) – Callable returning a scalar used to find the permutation.
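One simple way to carry out such a search is brute force over all column orderings. The sketch below assumes a mean squared error criterion and uses hypothetical helper names; it illustrates the idea and is not the library's implementation.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equally shaped nested lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def permute_columns(array, perm):
    """Reorder the columns of array so that new column i is old column perm[i]."""
    return [[row[j] for j in perm] for row in array]

def best_column_permutation(reference, target, criterion=mse):
    """Try every column ordering of target and keep the one closest to reference."""
    n_columns = len(reference[0])
    best = min(
        permutations(range(n_columns)),
        key=lambda perm: criterion(reference, permute_columns(target, perm)),
    )
    return list(best)

reference = [[0.1, 0.9], [0.2, 0.8]]
target = [[0.9, 0.1], [0.8, 0.2]]  # same values with the two columns swapped
perm = best_column_permutation(reference, target)
```

An exhaustive search costs n_columns! criterion evaluations, so it is only practical for a small number of classes.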
Emission Models
Categorical
Categorical emission models.
- class stepmix.emission.categorical.Bernoulli(**kwargs)[source]
Bases:
Emission
Bernoulli (binary) emission model.
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)
Return self.parameters into a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
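For a Bernoulli emission model with conditionally independent features, the per-class log-likelihood is a sum of log(p) or log(1 - p) terms. The sketch below assumes parameters stored as pis[c][j] = P(X[n][j] = 1 | class c); the function name is hypothetical.

```python
import math

def bernoulli_log_likelihood(X, pis):
    """Return ll[n][c] = log P(X[n] | class c) for binary features."""
    ll = []
    for x in X:
        row = []
        for p_c in pis:
            row.append(sum(
                math.log(p) if v == 1 else math.log(1.0 - p)
                for v, p in zip(x, p_c)
            ))
        ll.append(row)
    return ll

X = [[1, 0]]
pis = [[0.9, 0.2], [0.5, 0.5]]  # two classes, two features
ll = bernoulli_log_likelihood(X, pis)
```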
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
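For binary features, the maximum likelihood update is a responsibility-weighted mean of each feature within each class. The following sketch shows that closed form with a hypothetical function name; a real implementation may additionally clip the probabilities away from 0 and 1.

```python
def bernoulli_m_step(X, resp):
    """Return pis[c][j], the weighted mean of feature j under class c."""
    n_components = len(resp[0])
    n_features = len(X[0])
    pis = []
    for c in range(n_components):
        total = sum(r[c] for r in resp)  # effective number of samples in class c
        pis.append([
            sum(r[c] * x[j] for r, x in zip(resp, X)) / total
            for j in range(n_features)
        ])
    return pis

X = [[1, 0], [1, 1], [0, 0]]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hard assignments for clarity
pis = bernoulli_m_step(X, resp)
```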
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)[source]
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)[source]
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- class stepmix.emission.categorical.BernoulliNan(**kwargs)[source]
Bases:
Bernoulli
Bernoulli (binary) emission model supporting missing values (Full Information Maximum Likelihood).
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)
Return self.parameters into a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
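With conditionally independent features, marginalising over a missing entry amounts to leaving its term out of the sum. The sketch below illustrates this for binary data with NaN marking missing values; the function name and parameter layout (pis[c][j]) are assumptions for illustration.

```python
import math

def bernoulli_log_likelihood_nan(X, pis):
    """Return ll[n][c], summing only over the observed (non-NaN) features."""
    ll = []
    for x in X:
        row = []
        for p_c in pis:
            total = 0.0
            for v, p in zip(x, p_c):
                if math.isnan(v):
                    continue  # missing entry contributes nothing
                total += math.log(p) if v == 1 else math.log(1.0 - p)
            row.append(total)
        ll.append(row)
    return ll

X = [[1.0, float("nan")], [0.0, 1.0]]
pis = [[0.9, 0.2]]  # a single class with two features
ll = bernoulli_log_likelihood_nan(X, pis)
```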
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
- class stepmix.emission.categorical.Multinoulli(n_components=2, random_state=None, integer_codes=True, max_n_outcomes=None, total_outcomes=None)[source]
Bases:
Emission
Multinoulli (categorical) emission model.
Uses one-hot encoded features. Expected data formatting: X[n,k*L+l]=1 if l is the observed outcome for the kth attribute of data point n, where n indexes the observations, k indexes the K=n_features attributes, and L=max_n_outcomes is the maximum number of outcomes of any single feature.
If integer_codes is set to True, the model will expect integer-encoded categories and will one-hot encode the data itself. In this case, max_n_outcomes and total_outcomes are inferred by the model.
- Parameters
n_components (int, default=2) – The number of latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
integer_codes (bool, default=True) – Input X should be integer-encoded zero-indexed categories.
max_n_outcomes (int, default=None) – Maximum number of outcomes for a single categorical feature. Each column in the input will have max_n_outcomes associated columns in the one-hot encoding. If None and integer_codes=True, will be inferred from the data.
total_outcomes (int, default=None) – Total outcomes over all features. E.g., if we provide a categorical variable with two outcomes and another with 4 outcomes, total_outcomes = 6. If None and integer_codes=True, will be inferred from the data.
- pis[k*L+l,c]=P[ X[n,k*L+l]=1 | n belongs to class c]
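The k*L+l column layout can be seen by converting integer codes to one-hot form by hand. This is a minimal sketch with a hypothetical helper name; it assumes zero-indexed integer codes and the same L for every feature.

```python
def one_hot_encode(X_int, max_n_outcomes):
    """Map integer codes to one-hot rows where column k * L + l marks outcome l of feature k."""
    L = max_n_outcomes
    encoded = []
    for row in X_int:
        out = [0] * (len(row) * L)
        for k, l in enumerate(row):
            out[k * L + l] = 1
        encoded.append(out)
    return encoded

X_int = [[0, 2], [1, 0]]  # two samples, two features, up to 3 outcomes each
X_one_hot = one_hot_encode(X_int, max_n_outcomes=3)
```

Features with fewer than L outcomes simply leave their unused columns at zero in this layout.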
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)[source]
Expand the feature names since each feature may have up to max_n_outcomes outcomes.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)[source]
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)[source]
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- class stepmix.emission.categorical.MultinoulliNan(**kwargs)[source]
Bases:
Multinoulli
Multinoulli (categorical) emission model supporting missing values (Full Information Maximum Likelihood).
- check_parameters()
Validate class attributes.
- encode_features(X)
- get_default_feature_names(n_features)
- get_n_features()
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)
Expand the feature names since each feature may have up to max_n_outcomes outcomes.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
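A rough sketch of how missing values can be handled in a categorical log-likelihood follows: NaN entries contribute nothing, so each sample is scored on its observed features only. The integer coding of X, the layout of pis as (n_components, n_features, n_outcomes), and the function name are assumptions made for this example, not details confirmed by the StepMix source.

```python
import numpy as np

def categorical_log_likelihood_nan(X, pis):
    """Per-class log-likelihood that skips NaN entries (illustrative sketch).

    X   : (n_samples, n_features), integer-coded outcomes with NaN for missing.
    pis : (n_components, n_features, n_outcomes), assumed parameter layout.
    """
    n_samples, n_features = X.shape
    n_components = pis.shape[0]
    ll = np.zeros((n_samples, n_components))
    for j in range(n_features):
        observed = ~np.isnan(X[:, j])
        codes = X[observed, j].astype(int)
        # pis[:, j, codes] has shape (n_components, n_observed).
        ll[observed] += np.log(pis[:, j, codes]).T
    return ll

X = np.array([[0.0, np.nan], [1.0, 1.0]])
pis = np.array([[[0.8, 0.2], [0.5, 0.5]],
                [[0.3, 0.7], [0.1, 0.9]]])
ll = categorical_log_likelihood_nan(X, pis)
print(ll)
```

The first sample is scored on its first feature alone, since its second feature is missing.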
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
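For intuition, sampling conditioned on a class amounts to drawing each feature from that class's outcome distribution. The sketch below assumes a parameter array pis of shape (n_components, n_features, n_outcomes) and a hypothetical helper name; the actual sample method may return data in a different encoding.

```python
import numpy as np

def sample_categorical(pis, class_no, n_samples, rng):
    """Draw integer-coded samples from one class (hypothetical helper)."""
    n_features, n_outcomes = pis.shape[1:]
    # Features are drawn independently given the class.
    columns = [
        rng.choice(n_outcomes, size=n_samples, p=pis[class_no, j])
        for j in range(n_features)
    ]
    return np.column_stack(columns)      # (n_samples, n_features)

rng = np.random.default_rng(0)
pis = np.array([[[1.0, 0.0], [0.5, 0.5]],
                [[0.2, 0.8], [0.3, 0.7]]])
samples = sample_categorical(pis, class_no=0, n_samples=5, rng=rng)
print(samples)
```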
Gaussian
Gaussian emission models.
- class stepmix.emission.gaussian.Gaussian(n_components=2, covariance_type='spherical', init_params='random', reg_covar=1e-06, random_state=None)[source]
Bases:
Emission
Gaussian emission model with various covariance options.
This class spoofs the scikit-learn GaussianMixture class by reusing the same attributes and calling its methods.
- get_default_feature_names(n_features)
- get_parameters()[source]
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)[source]
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)[source]
M step.
Adapted from the gaussian mixture class to accept responsibilities instead of log responsibilities.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input data for this emission model.
resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.
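To make the role of the responsibilities concrete, here is a minimal sketch of a Gaussian M-step with diagonal covariances that takes responsibilities directly rather than their logarithms. The function name is hypothetical and the small reg_covar term follows the default shown in the class signature; the real method delegates to scikit-learn routines whose details are not reproduced here.

```python
import numpy as np

def gaussian_diag_m_step(X, resp, reg_covar=1e-6):
    """Weighted means and per-feature variances (illustrative sketch only)."""
    # Effective sample count per class; the tiny offset avoids division by zero.
    nk = resp.sum(axis=0) + 10 * np.finfo(resp.dtype).eps
    means = resp.T @ X / nk[:, None]                 # (n_components, n_features)
    # E[x^2] - E[x]^2 per class and feature, plus a small regularizer.
    avg_sq = resp.T @ (X * X) / nk[:, None]
    variances = avg_sq - means ** 2 + reg_covar
    return means, variances

X = np.array([[0.0], [2.0], [10.0], [12.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
means, variances = gaussian_diag_m_step(X, resp)
print(means.ravel(), variances.ravel())
```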
- property n_parameters
Number of free parameters in the model.
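As a worked example of how such a count can be derived, the sketch below tallies mean and covariance parameters for the four usual covariance structures. The covariance_type strings follow scikit-learn's naming, and the function itself is hypothetical; whether StepMix counts parameters in exactly this way is an assumption.

```python
def gaussian_n_parameters(n_components, n_features, covariance_type):
    """Count free mean and covariance parameters (illustrative sketch)."""
    mean_params = n_components * n_features
    if covariance_type == "spherical":
        # One variance per class.
        cov_params = n_components
    elif covariance_type == "diag":
        # One variance per class and feature.
        cov_params = n_components * n_features
    elif covariance_type == "tied":
        # One symmetric matrix shared by all classes.
        cov_params = n_features * (n_features + 1) // 2
    elif covariance_type == "full":
        # One symmetric matrix per class.
        cov_params = n_components * n_features * (n_features + 1) // 2
    else:
        raise ValueError(f"unknown covariance_type: {covariance_type!r}")
    return mean_params + cov_params

counts = {
    kind: gaussian_n_parameters(3, 2, kind)
    for kind in ("spherical", "diag", "tied", "full")
}
print(counts)
```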
- permute_classes(perm, axis=0)[source]
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
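The index and columns arguments suggest a long-to-wide reshaping of the parameter dataframe. The sketch below builds a small long dataframe by hand, with made-up values and the column names listed for get_parameters_df, and pivots it with pandas; using pivot_table here is an assumption about the mechanism, not a statement about the library's internals.

```python
import pandas as pd

# Hand-written long dataframe with the documented column names (values are made up).
long_df = pd.DataFrame({
    "model_name": ["gaussian"] * 4,
    "param": ["means"] * 4,
    "class_no": [0, 0, 1, 1],
    "variable": ["x1", "x2", "x1", "x2"],
    "value": [0.1, 0.2, 1.1, 1.2],
})

# Rows indexed by (param, variable); one column per (model_name, class_no).
wide = long_df.pivot_table(
    index=["param", "variable"],
    columns=["model_name", "class_no"],
    values="value",
)
print(wide)
```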
- class stepmix.emission.gaussian.GaussianDiag(**kwargs)[source]
Bases:
Gaussian
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)[source]
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)
M step.
Adapted from the gaussian mixture class to accept responsibilities instead of log responsibilities.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input data for this emission model.
resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm, axis=0)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
- class stepmix.emission.gaussian.GaussianDiagNan(**kwargs)[source]
Bases:
GaussianNan
Gaussian emission model with diagonal covariance supporting missing values (Full Information Maximum Likelihood).
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
- class stepmix.emission.gaussian.GaussianFull(**kwargs)[source]
Bases:
Gaussian
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)[source]
Return self.parameters as a long dataframe.
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)
M step.
Adapted from the gaussian mixture class to accept responsibilities instead of log responsibilities.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input data for this emission model.
resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm, axis=0)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['class_no', 'param'], columns=['model_name', 'variable'])[source]
Flipping class_no and variable is nicer for full covariances.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
- class stepmix.emission.gaussian.GaussianNan(debug_likelihood=False, **kwargs)[source]
Bases:
Emission
Gaussian emission model supporting missing values (Full Information Maximum Likelihood).
This class assumes a diagonal covariance structure. The covariances are therefore represented as a (n_components, n_features) array.
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
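Because the covariance is diagonal, the log-density factorizes over features, so a missing entry can simply be dropped from the sum. The sketch below illustrates that idea with means and variances stored as (n_components, n_features) arrays; the function name is hypothetical and the code is not taken from the StepMix source.

```python
import numpy as np

def gaussian_diag_log_likelihood_nan(X, means, variances):
    """Per-class diagonal-Gaussian log-likelihood ignoring NaNs (illustrative sketch)."""
    observed = ~np.isnan(X)                          # (n_samples, n_features)
    X_filled = np.where(observed, X, 0.0)            # placeholder values, masked below
    diff = X_filled[:, None, :] - means[None, :, :]  # (n_samples, n_components, n_features)
    log_pdf = -0.5 * (
        np.log(2.0 * np.pi * variances)[None, :, :]
        + diff ** 2 / variances[None, :, :]
    )
    # Zero out the terms of missing features, then sum over features.
    return np.where(observed[:, None, :], log_pdf, 0.0).sum(axis=2)

X = np.array([[0.0, np.nan], [1.0, 5.0]])
means = np.array([[0.0, 0.0], [1.0, 5.0]])
variances = np.ones((2, 2))
ll = gaussian_diag_log_likelihood_nan(X, means, variances)
print(ll)
```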
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- abstract property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- class stepmix.emission.gaussian.GaussianSpherical(**kwargs)[source]
Bases:
Gaussian
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)[source]
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)
M step.
Adapted from the gaussian mixture class to accept responsibilities instead of log responsibilities.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input data for this emission model.
resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm, axis=0)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
- class stepmix.emission.gaussian.GaussianSphericalNan(**kwargs)[source]
Bases:
GaussianNan
Gaussian emission model with spherical covariance supporting missing values (Full Information Maximum Likelihood).
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
dict
- get_parameters_df(feature_names=None)
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is a ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
- class stepmix.emission.gaussian.GaussianTied(**kwargs)[source]
Bases:
Gaussian
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
- get_parameters_df(feature_names=None)[source]
Return self.parameters as a long dataframe.
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)
M step.
Adapted from the gaussian mixture class to accept responsibilities instead of log responsibilities.
- Parameters
X (array-like of shape (n_samples, n_features)) –
resp (array-like of shape (n_samples, n_components)) – Posterior probabilities (or responsibilities) of each sample in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm, axis=0)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['class_no', 'param'], columns=['model_name', 'variable'])[source]
Flipping class_no and variable is nicer for full covariances.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
- class stepmix.emission.gaussian.GaussianUnit(**kwargs)[source]
Bases:
Emission
Gaussian emission model with fixed unit variance.
sklearn.mixture.GaussianMixture does not have an implementation for fixed unit variance, so we provide one.
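For intuition, a unit-variance Gaussian emission has only the class means as free parameters. The sketch below shows one way the log-likelihood table and the M-step could look under that assumption; it uses plain Python lists, the function names are hypothetical, and it is not the library's implementation.

```python
import math


def log_likelihood_unit(X, means):
    """Return an n_samples x n_components table of log N(x | mean_k, I)."""
    const = -0.5 * len(X[0]) * math.log(2 * math.pi)
    return [
        [const - 0.5 * sum((x - m) ** 2 for x, m in zip(row, mu)) for mu in means]
        for row in X
    ]


def m_step_unit(X, resp):
    """Responsibility-weighted mean per class; the variance stays fixed at 1."""
    n_components, n_features = len(resp[0]), len(X[0])
    means = []
    for k in range(n_components):
        weight = sum(r[k] for r in resp)
        means.append([
            sum(r[k] * row[j] for r, row in zip(resp, X)) / weight
            for j in range(n_features)
        ])
    return means


X = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
resp = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
means = m_step_unit(X, resp)
ll = log_likelihood_unit(X, means)
print(means)
```

Because initialize simply performs the M-step on the current responsibilities, the same weighted mean would also serve as the initial parameter estimate in this sketch.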
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
- get_parameters_df(feature_names=None)
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
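The long format described above holds one row per (class, variable) value. The following sketch flattens a hypothetical (n_components, n_features) table of means into rows with those five columns; to_long_rows and the parameter name "means" are illustrative and are not the library's _to_df.

```python
def to_long_rows(model_name, param, values, feature_names):
    """Flatten an (n_components, n_features) table into one row per value."""
    return [
        {
            "model_name": model_name,
            "param": param,
            "class_no": class_no,
            "variable": name,
            "value": value,
        }
        for class_no, row in enumerate(values)
        for name, value in zip(feature_names, row)
    ]


rows = to_long_rows("gaussian_unit", "means", [[0.1, 0.2], [1.1, 1.2]], ["f0", "f1"])
for row in rows:
    print(row)
```

A list of dicts like this can be passed to a dataframe constructor and then pivoted, which is what the index and columns arguments of print_parameters refer to.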
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- class stepmix.emission.gaussian.GaussianUnitNan(**kwargs)[source]
Bases:
GaussianNan
Gaussian emission model with unit covariance supporting missing values (Full Information Maximum Likelihood)
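With a unit (diagonal) covariance, marginalizing out a missing feature is the same as dropping its term from the log-density, so a full-information likelihood can be sketched by summing over observed features only. The example below is an illustrative sketch in plain Python with a hypothetical function name; it is not the library's code.

```python
import math


def log_likelihood_unit_nan(X, means):
    """log N(x_obs | mean_k, I) using only the observed (non-NaN) features."""
    table = []
    for row in X:
        observed = [j for j, x in enumerate(row) if not math.isnan(x)]
        const = -0.5 * len(observed) * math.log(2 * math.pi)
        table.append([
            const - 0.5 * sum((row[j] - mu[j]) ** 2 for j in observed)
            for mu in means
        ])
    return table


means = [[0.0, 0.0], [3.0, 3.0]]
X = [[0.0, float("nan")], [float("nan"), float("nan")]]
ll = log_likelihood_unit_nan(X, means)
print(ll)
```

A fully missing row contributes a log-likelihood of zero for every class, so it carries no information about class membership.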
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
- get_parameters_df(feature_names=None)
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
- sample(class_no, n_samples)
Sample n_samples conditioned on the given class_no.
Covariate
Covariate emission model.
- class stepmix.emission.covariate.Covariate(tol=0.0001, max_iter=1, lr=0.001, intercept=True, method='newton-raphson', **kwargs)[source]
Bases:
Emission
Covariate model with descent update.
- Parameters
tol (float, default=1e-4) – Absolute tolerance applied to each component of the gradient.
max_iter (int, default=1) – The maximum number of steps to take per M-step.
lr (float, default=1e-3) – Learning rate.
intercept (bool, default=True) – If an intercept parameter should be fitted.
method ({"gradient", "newton-raphson"}, default="newton-raphson") – Optimization method.
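To make the roles of tol, max_iter, lr and intercept concrete, here is a minimal sketch of a "gradient"-style M-step for a softmax model of class membership given covariates. The parameterization (one coefficient column per class, intercept in row 0), the summed rather than averaged gradient, and the name covariate_m_step are all assumptions for illustration; the Newton-Raphson method is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def covariate_m_step(X, resp, beta, lr=1e-3, tol=1e-4, max_iter=1):
    """Gradient ascent on sum_i sum_k resp[i][k] * log softmax(x_i . beta)[k].

    beta has shape (n_features + 1, n_components); row 0 is the intercept.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    n_components = len(beta[0])
    for _ in range(max_iter):
        grad = [[0.0] * n_components for _ in beta]
        for x, r in zip(rows, resp):
            scores = [
                sum(xj * beta[j][k] for j, xj in enumerate(x))
                for k in range(n_components)
            ]
            p = softmax(scores)
            for j, xj in enumerate(x):
                for k in range(n_components):
                    grad[j][k] += xj * (r[k] - p[k])
        if all(abs(g) < tol for g_row in grad for g in g_row):
            break  # every gradient component is within the absolute tolerance
        beta = [
            [b + lr * g for b, g in zip(b_row, g_row)]
            for b_row, g_row in zip(beta, grad)
        ]
    return beta


X = [[-1.0], [1.0]]
resp = [[1.0, 0.0], [0.0, 1.0]]
beta0 = [[0.0, 0.0], [0.0, 0.0]]
beta1 = covariate_m_step(X, resp, beta0, lr=0.1, max_iter=1)
print(beta1)
```

In this toy example a single step pushes the slope for class 1 up and the slope for class 0 down, matching the fact that the sample with the positive covariate belongs to class 1.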
- get_default_feature_names(n_features)
- get_parameters()
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
- get_parameters_df(feature_names=None)[source]
Return self.parameters as a long dataframe.
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)[source]
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm)
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(X)[source]
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None, index=['param', 'variable'], columns=['model_name', 'class_no'], model_name=None)
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
Nested
- class stepmix.emission.nested.Nested(descriptor, emission_dict, n_components, random_state, **kwargs)[source]
Bases:
Emission
Nested emission model.
The descriptor must be a dict of dicts, where the nested dicts hold arguments for nested models. Each nested dict is expected to have a model key referring to a valid emission model as well as an n_columns key describing the number of columns (i.e. features for univariate variables or features*n_outcomes for one-hot encoded variables) associated with that model. For example, a model where the first 3 features are gaussian with unit variance, the next 3 are multinoulli with 5 possible outcomes (for a total of 3*5=15 columns) and the last 4 are covariates would be described like so:
descriptor = {
    'model_1': {
        'model': 'gaussian_unit',
        'n_columns': 3
    },
    'model_2': {
        'model': 'multinoulli',
        'n_columns': 15,
        'n_outcomes': 5
    },
    'model_3': {
        'model': 'covariate',
        'n_columns': 4,
        'method': "newton-raphson",
        'lr': 1e-3,
    }
}
The above model would then expect an n_samples x 22 matrix as input (3 + 15 + 4 = 22) where columns follow the same order of declaration (i.e., the columns of model_1 are first, columns of model_2 come after etc.).
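The column arithmetic can be checked by walking the descriptor in declaration order. The helper below is a hypothetical sketch, not part of the library; it only relies on the n_columns key described above.

```python
descriptor = {
    "model_1": {"model": "gaussian_unit", "n_columns": 3},
    "model_2": {"model": "multinoulli", "n_columns": 15, "n_outcomes": 5},
    "model_3": {
        "model": "covariate",
        "n_columns": 4,
        "method": "newton-raphson",
        "lr": 1e-3,
    },
}


def column_ranges(descriptor):
    """Map each nested model name to its (start, stop) column range."""
    ranges, start = {}, 0
    for name, spec in descriptor.items():  # dicts keep declaration order
        stop = start + spec["n_columns"]
        ranges[name] = (start, stop)
        start = stop
    return ranges


ranges = column_ranges(descriptor)
total_columns = max(stop for _, stop in ranges.values())
print(ranges, total_columns)
```

Each range can then be used to slice the input matrix, for example columns 3 to 17 for model_2.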
As demonstrated by the covariate argument, additional arguments can be specified and are passed to the associated Emission class. This is particularly useful for specifying optimization parameters for stepmix.emission.covariate.Covariate.
- check_parameters()
Validate class attributes.
- get_default_feature_names(n_features)
- get_parameters()[source]
Get a copy of model parameters.
- Returns
parameters – Copy of model parameters.
- Return type
- get_parameters_df(feature_names=None)[source]
Return self.parameters as a long dataframe.
Columns should be [“model_name”, “param”, “class_no”, “variable”, “value”].
Call self._to_df or implement custom method.
- initialize(X, resp, random_state=None)[source]
Initialize parameters.
Simply performs the m-step on the current responsibilities to initialize parameters.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes.
random_state (int, RandomState instance or None, default=None) – Controls the random seed given to the method chosen to initialize the parameters. Pass an int for reproducible output across multiple function calls.
- log_likelihood(X)[source]
Return the log-likelihood of the input data.
- Parameters
X (ndarray of shape (n_samples, n_columns)) – Input data for this emission model.
- Returns
ll – Log-likelihood of the input data conditioned on each component.
- Return type
ndarray of shape (n_samples, n_components)
- m_step(X, resp)[source]
Update model parameters via maximum likelihood using the current responsibilities.
- Parameters
X (ndarray of shape (n_samples, n_features)) – Input data for this emission model.
resp (ndarray of shape (n_samples, n_components)) – Responsibilities, i.e., posterior probabilities over the latent classes of each point in X.
- property n_parameters
Number of free parameters in the model.
- permute_classes(perm, axis=0)[source]
Permute the latent class and associated parameters of this estimator.
Effectively remaps latent classes.
- Parameters
perm (ndarray of shape (n_classes,)) – Integer array representing the target permutation. Should be a permutation of np.arange(n_classes).
axis (int) – Axis to use for permuting the parameters.
- predict(log_resp)
Compute argmax P(Y|X) given the log responsibilities P(Z|X) for supervised predictions.
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Argmax P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- predict_proba(log_resp)
Compute the conditional probabilities P(Y|X) given the log responsibilities P(Z|X).
This will only be used if the emission model is used as a structural model. X therefore represents the input and Y the output for supervised predictions.
- Parameters
log_resp (ndarray of shape (n_samples, n_components)) – Logarithm of the posterior probabilities P(Z|X) (or responsibilities) of each sample.
- Returns
resp – Conditional probabilities P(Y|X) of each sample.
- Return type
ndarray of shape (n_samples, n_columns)
- print_parameters(indent=1, feature_names=None)[source]
Print parameters with nice formatting.
This method works well for emission models where self.parameters[key_0] is an ndarray of shape (n_components, n_features) and key_0 is the only key.
- Parameters
indent (int) – Add indent to print.
feature_names (List of str) – Variable names.
index (List of str) – Column names in self.get_parameters_df to use as index in the displayed dataframe.
columns (List of str) – Column names in self.get_parameters_df to use as columns in the displayed dataframe.
model_name (str) – str to display as model name.
Datasets
Various synthetic datasets.
- stepmix.datasets.bakk_measurements(n_classes, n_mm, sep_level)[source]
Binary measurement parameters in Bakk 2018.
- Parameters
- Returns
pis – Conditional bernoulli probabilities.
- Return type
ndarray of shape (n_mm, n_classes)
- stepmix.datasets.data_bakk_complete(n_samples, sep_level, n_mm=6, random_state=None, nan_ratio=0.0)[source]
Stitch together data_bakk_covariate and data_bakk_response to get a complete model.
- stepmix.datasets.data_bakk_complex(n_samples, sep_level, random_state=None, nan_ratio=0.0)[source]
Build a simulated example with mixed data and missing values.
Measurements: 3 binary variables + 1 continuous variable.
Structural: 3 binary response variables + 1 continuous response variable + 1 covariate.
Missing values everywhere except in the covariate.
Return data as a dataframe.
- stepmix.datasets.data_bakk_covariate(n_samples, sep_level, n_mm=6, random_state=None)[source]
Simulated data for the covariate simulations in Bakk 2018.
- Parameters
- Returns
X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.
Y (ndarray of shape (n_samples, 1)) – Covariate structural samples.
labels (ndarray of shape (n_samples,)) – Ground truth class membership.
References
Bakk, Z. and Kuha, J. Two-step estimation of models between latent classes and external variables. Psychometrika, 83(4):871–892, 2018
- stepmix.datasets.data_bakk_response(n_samples, sep_level, n_classes=3, n_mm=6, random_state=None)[source]
Simulated data for the response simulations in Bakk 2018.
- Parameters
n_samples (int) – Number of samples.
sep_level (float) – Separation level in the measurement data. Use .7, .8 or .9 for the paper simulation.
n_classes (int) – Number of latent classes. Use 3 for the paper simulation.
n_mm (int) – Number of features in the measurement model. Use 6 for the paper simulation.
random_state (int) – Random state.
- Returns
X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.
Y (ndarray of shape (n_samples, 1)) – Response structural samples.
labels (ndarray of shape (n_samples,)) – Ground truth class membership.
References
Bakk, Z. and Kuha, J. Two-step estimation of models between latent classes and external variables. Psychometrika, 83(4):871–892, 2018
- stepmix.datasets.data_gaussian_binary(n_samples, random_state=None)[source]
Full Gaussian measurement model with 2 binary responses.
The data has 4 latent classes.
- stepmix.datasets.data_gaussian_categorical(n_samples, random_state=None)[source]
Full Gaussian measurement model with 2 categorical responses.
The data has 4 latent classes.
- Parameters
- Returns
X (ndarray of shape (n_samples, 2)) – Gaussian measurement samples.
Y (ndarray of shape (n_samples, 2)) – Categorical structural samples.
labels (ndarray of shape (n_samples,)) – Ground truth class membership.
- stepmix.datasets.data_gaussian_diag(n_samples, sep_level, n_mm=6, random_state=None, nan_ratio=0.0)[source]
Bakk binary measurement model with 2D diagonal gaussian structural model.
Optionally, a random proportion of values can be replaced with missing values to test FIML models.
- Parameters
n_samples (int) – Number of samples.
sep_level (float) – Separation level in the measurement data. Use .7, .8 or .9 for the paper simulation.
n_mm (int) – Number of features in the measurement model. Use 6 for the paper simulation.
random_state (int) – Random state.
nan_ratio (float) – Ratio of values to replace with missing values.
- Returns
X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.
Y (ndarray of shape (n_samples, 2)) – Gaussian structural samples.
labels (ndarray of shape (n_samples,)) – Ground truth class membership.
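The effect of nan_ratio can be pictured as masking a fraction of the entries of a data matrix. The sketch below replaces exactly round(nan_ratio * n_values) randomly chosen entries with NaN; whether the library masks an exact count or masks each entry independently is not stated here, so this is an assumption, and add_missing_values is a hypothetical helper.

```python
import math
import random


def add_missing_values(X, nan_ratio, seed=None):
    """Return a copy of X with a random fraction nan_ratio of entries set to NaN."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    n_missing = round(nan_ratio * n_rows * n_cols)
    masked = [list(row) for row in X]  # copy so the input is left untouched
    for flat in rng.sample(range(n_rows * n_cols), n_missing):
        masked[flat // n_cols][flat % n_cols] = math.nan
    return masked


X = [[1.0, 0.0, 1.0, 1.0] for _ in range(5)]
X_nan = add_missing_values(X, nan_ratio=0.25, seed=0)
n_nan = sum(math.isnan(v) for row in X_nan for v in row)
print(n_nan)
```

Data masked this way is the kind of input the _nan emission models are meant to handle.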
- stepmix.datasets.data_generation_gaussian(n_samples, sep_level, n_mm=6, random_state=None)[source]
Bakk binary measurement model with more complex gaussian structural model.
- Parameters
- Returns
X (ndarray of shape (n_samples, n_mm)) – Binary measurement samples.
Y (ndarray of shape (n_samples, 2)) – Gaussian structural samples.
labels (ndarray of shape (n_samples,)) – Ground truth class membership.