Deborah.DeborahCore.MLInputPreparer

Deborah.DeborahCore.MLInputPreparer.MLInputBundle (Type)
struct MLInputBundle{T<:Real}

Container for machine learning input and target data used in the Deborah.DeborahCore pipeline.

This struct stores the full feature set X_data, the raw target matrix Y_df, the flattened target vectors Y_*_vec for each partition (training, bias-correction, unlabeled, and labeled sets), and the configuration index arrays used to assemble them.

Type Parameters

  • T<:Real : Element type of all target vectors and matrices (typically Float64).

Fields

  • X_data::Dict{String, NamedTuple} : Preprocessed feature dictionary, keyed by filename.
  • Y_df::Matrix{T} : Original raw $Y$ matrix ($N_{\text{cfg}} \times N_{\text{src}}$).
  • Y_tr_vec::Vector{T} : Flattened $Y$ vector for training set.
  • Y_bc_vec::Vector{T} : Flattened $Y$ vector for bias-correction set.
  • Y_ul_vec::Vector{T} : Flattened $Y$ vector for unlabeled set.
  • Y_lb_vec::Vector{T} : Flattened $Y$ vector for labeled set.
  • conf_arr::Vector{Int} : Mapping from global row index to configuration index.
  • tr_conf_arr::Vector{Int} : Row indices used for training set $Y$.
  • bc_conf_arr::Vector{Int} : Row indices used for bias-correction set $Y$.
  • ul_conf_arr::Vector{Int} : Row indices used for unlabeled set $Y$.
  • lb_conf_arr::Vector{Int} : Row indices used for labeled set $Y$.
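
As a reading aid, the field list above corresponds to a parametric struct along the following lines. This is a sketch reconstructed from the documented fields only, not the package source: the name MLInputBundleSketch and the toy values are hypothetical, and the shape of each X_data entry (a NamedTuple with :lb, :tr, :bc, :ul keys) is an assumption.

```julia
# Sketch only: mirrors the documented fields, not the package source.
struct MLInputBundleSketch{T<:Real}
    X_data::Dict{String, NamedTuple}
    Y_df::Matrix{T}
    Y_tr_vec::Vector{T}
    Y_bc_vec::Vector{T}
    Y_ul_vec::Vector{T}
    Y_lb_vec::Vector{T}
    conf_arr::Vector{Int}
    tr_conf_arr::Vector{Int}
    bc_conf_arr::Vector{Int}
    ul_conf_arr::Vector{Int}
    lb_conf_arr::Vector{Int}
end

# Toy data (hypothetical): 4 configurations, 1 source;
# rows 1:2 are labeled, rows 3:4 are unlabeled.
Y = reshape([0.1, 0.2, 0.3, 0.4], 4, 1)
x = [1.0, 2.0, 3.0, 4.0]
X = Dict{String, NamedTuple}(
    "plaq.dat" => (lb = x[1:2], tr = x[1:1], bc = x[2:2], ul = x[3:4]),
)

bundle = MLInputBundleSketch(
    X, Y,
    vec(Y[1:1, :]), vec(Y[2:2, :]), vec(Y[3:4, :]), vec(Y[1:2, :]),
    collect(1:4), [1], [2], [3, 4], [1, 2],
)

# The type parameter T is inferred from the target arrays.
println(typeof(bundle))
println(length(bundle.Y_lb_vec) == length(bundle.lb_conf_arr))
```

Because every target field shares the parameter T, mixing element types (for example a Float32 matrix with Float64 vectors) would fail at construction in this sketch rather than later in the pipeline.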
Deborah.DeborahCore.MLInputPreparer.prepare_ML_inputs (Method)
prepare_ML_inputs(
    partition::DatasetPartitioner.DatasetPartitionInfo, 
    X_file_list::Vector{String}, 
    Y_file::String, 
    paths::PathConfigBuilderDeborah.DeborahPathConfig; 
    jobid::Union{Nothing, String}=nothing,
    dump::Bool=false, 
    read_column_X::Vector{Int},
    read_column_Y::Int,
    index_column::Int
) -> MLInputBundle

Load and organize all machine learning input data from raw .dat files.

This function loads the target data (Y_file) and a list of input feature files (X_file_list), extracts specific columns from each using read_column_X and read_column_Y, applies dataset partitioning according to the partition object, and returns the labeled and unlabeled splits of features and targets in a structured format suitable for training and evaluation.
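
The partitioning step described above can be pictured with a small in-memory example. This is a minimal sketch, not the package's implementation: the helper split_targets, the toy index arrays, and the use of vec to flatten each row subset are all assumptions made for illustration.

```julia
# Hypothetical helper: slice rows of a target matrix by partition and flatten.
function split_targets(Y::Matrix{T}, idx::NamedTuple) where {T<:Real}
    # Y[rows, :] selects the configurations of one partition;
    # vec flattens the result column by column (assumed layout).
    return map(rows -> vec(Y[rows, :]), idx)
end

# Toy target matrix: 6 configurations x 1 source.
Y = reshape(collect(1.0:6.0), 6, 1)

# Assumed split: labeled set = training + bias-correction; the rest is unlabeled.
idx = (lb = [1, 2, 3], tr = [1, 2], bc = [3], ul = [4, 5, 6])

parts = split_targets(Y, idx)
println(parts.tr)
println(parts.ul)
```

Mapping over a NamedTuple keeps its keys, so parts exposes the same :lb, :tr, :bc, :ul names as the index arrays that produced it.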

Arguments

  • partition::DatasetPartitioner.DatasetPartitionInfo Struct defining how configuration indices are split into lb, tr, bc, and ul sets.

  • X_file_list::Vector{String} List of feature file names, e.g., ["plaq.dat", "rect.dat"].

  • Y_file::String Name of the target file to be used as Y.

  • paths::PathConfigBuilderDeborah.DeborahPathConfig Struct with directory and filename conventions used for reading/writing data.

  • jobid::Union{Nothing, String} = nothing Optional identifier used for structured logging or job tracking.

  • dump::Bool = false Whether to save the preprocessed X feature vectors to disk.

  • read_column_X::Vector{Int} A vector of $1$-based column indices, one for each feature file in X_file_list. read_column_X[i] is used to select a column from file X_file_list[i].

  • read_column_Y::Int $1$-based column index to extract from the Y_file.

  • index_column::Int $1$-based column index from which to read the configuration indices in the Y_file. If set to 0, configuration indices will be auto-generated as 1:N_cnf.
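
To make the column arguments concrete, the sketch below applies the same conventions to an in-memory matrix instead of a .dat file. The helper select_columns and the toy values are hypothetical; the code only illustrates the documented semantics ($1$-based column selection, auto-generated indices when index_column is 0, and one read_column_X entry per feature file).

```julia
# Hypothetical helper illustrating read_column_Y and index_column.
function select_columns(raw::Matrix{Float64}; read_column_Y::Int, index_column::Int)
    n_rows = size(raw, 1)
    # index_column == 0 means "no index column": fall back to 1:n_rows.
    conf = index_column == 0 ? collect(1:n_rows) : Int.(raw[:, index_column])
    y = raw[:, read_column_Y]
    return conf, y
end

# Toy file contents: column 1 holds configuration indices, column 2 the observable.
raw = [10.0 0.51; 20.0 0.52; 30.0 0.53]

conf_a, y_a = select_columns(raw; read_column_Y = 2, index_column = 1)
conf_b, y_b = select_columns(raw; read_column_Y = 2, index_column = 0)

println(conf_a)  # indices read from column 1
println(conf_b)  # indices auto-generated from the row count

# read_column_X pairs one column index with each feature file, in order.
X_file_list = ["plaq.dat", "rect.dat"]
read_column_X = [2, 3]
for (file, col) in zip(X_file_list, read_column_X)
    println(file, " -> column ", col)
end
```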

Returns

  • Deborah.DeborahCore.MLInputPreparer.MLInputBundle Composite struct containing:
    • X_data::Dict{String, NamedTuple} → partitioned input feature vectors (:lb, :tr, :bc, :ul), keyed by filename
    • Y_lb_vec, Y_tr_vec, Y_bc_vec, Y_ul_vec → target label vectors
    • Y_df → raw target matrix
    • conf_arr, lb_conf_arr, tr_conf_arr, bc_conf_arr, ul_conf_arr → configuration index arrays for each group