Deborah.DeborahCore.FeaturePipeline

Deborah.DeborahCore.FeaturePipeline.build_namedtuple_splitset (Function)
build_namedtuple_splitset(
    X_data::Dict{String, NamedTuple},
    split::String,
    key_order::Vector{String},
    jobid::Union{Nothing, String} = nothing
) -> NamedTuple

Construct a NamedTuple of feature vectors corresponding to a given data split.

This function extracts the specified data split (one of lb, tr, bc, ul) from each entry in the input feature dictionary and combines them into a column-indexed NamedTuple, suitable as input for JuliaAI/MLJ.jl models.

Arguments

  • X_data::Dict{String, NamedTuple}: A dictionary mapping input keys to 4-way split feature tuples (lb, tr, bc, ul).
  • split::String: Which dataset split to extract (tr, bc, ul, lb).
  • key_order::Vector{String}: The order of keys to use when building columns.
  • jobid::Union{Nothing, String}: Optional identifier string for logging or debugging purposes.

Returns

  • NamedTuple: A tuple with keys :Column1, :Column2, ... containing the feature vectors for the specified split.
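The column-building step described above can be sketched directly in Julia. The data below is made up for illustration; only the argument shapes follow the documented signature:

```julia
# Hypothetical feature dictionary: each entry holds a 4-way split
# (lb, tr, bc, ul) of one scalar feature.
X_data = Dict(
    "plaq.dat" => (lb = [0.1, 0.2], tr = [0.3, 0.4], bc = [0.5], ul = [0.6, 0.7]),
    "rect.dat" => (lb = [1.1, 1.2], tr = [1.3, 1.4], bc = [1.5], ul = [1.6, 1.7]),
)
key_order = ["plaq.dat", "rect.dat"]
split = "tr"

# Equivalent of build_namedtuple_splitset(X_data, split, key_order):
# pull the requested split from each entry, in key_order, and label
# the resulting columns :Column1, :Column2, ...
cols  = Tuple(getproperty(X_data[k], Symbol(split)) for k in key_order)
names = Tuple(Symbol("Column$i") for i in eachindex(key_order))
X_tr  = NamedTuple{names}(cols)

X_tr.Column1  # == [0.3, 0.4], the training split of "plaq.dat"
```

A NamedTuple of equal-length vectors like `X_tr` is a valid Tables.jl column table, which is why it can be passed straight to MLJ.jl models.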
Deborah.DeborahCore.FeaturePipeline.run_feature_pipeline (Method)
run_feature_pipeline(
    read_column_X::Vector{Int},
    keys::Vector{String},
    path::String,
    conf_arr::Vector{Int},
    partition::DatasetPartitioner.DatasetPartitionInfo,
    paths::PathConfigBuilderDeborah.DeborahPathConfig;
    dump::Bool=true,
    jobid::Union{Nothing, String}=nothing
) -> Dict{String, NamedTuple{(:lb, :tr, :bc, :ul), NTuple{4, Vector{Float64}}}}

Run the feature preprocessing pipeline across multiple input feature files.

For each feature key (e.g., "plaq.dat", "rect.dat"), this function:

  1. Loads the corresponding raw .dat file as a matrix,
  2. Extracts the column specified in read_column_X[i],
  3. Partitions the data into four groups (lb, tr, bc, ul),
  4. Optionally dumps each partition to disk.

This version allows specifying a separate column index for each input feature file.
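The per-feature loop (load, extract column, partition) can be sketched as follows. This is a minimal illustration under assumptions: the partition argument is stood in for by any struct or NamedTuple with `lb`, `tr`, `bc`, `ul` index fields, and the dump step is omitted:

```julia
using DelimitedFiles  # readdlm for plain whitespace-delimited .dat matrices

# Sketch of steps 1-3 above; `feature_splits` is a hypothetical name,
# not the library function itself.
function feature_splits(read_column_X, keys, path, partition)
    result = Dict{String, NamedTuple{(:lb, :tr, :bc, :ul), NTuple{4, Vector{Float64}}}}()
    for (i, key) in enumerate(keys)
        M   = readdlm(joinpath(path, key), Float64)  # 1. load raw .dat as a matrix
        col = M[:, read_column_X[i]]                 # 2. extract the i-th feature's column
        result[key] = (                              # 3. slice into the four index sets
            lb = col[partition.lb],
            tr = col[partition.tr],
            bc = col[partition.bc],
            ul = col[partition.ul],
        )
    end
    return result
end
```

Note that the same partition index sets are applied to every feature file, so all features stay row-aligned across the four splits.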

Arguments

  • read_column_X::Vector{Int} A vector of 1-based column indices, one for each feature in keys. read_column_X[i] is used to extract a column from the feature file keys[i]. Each column is treated as an independent scalar input feature.

  • keys::Vector{String} List of feature file base names, such as ["plaq.dat", "rect.dat"].

  • path::String Directory path containing the raw .dat feature files.

  • conf_arr::Vector{Int} Configuration indices associated with rows in the feature files.

  • partition::DatasetPartitioner.DatasetPartitionInfo Struct containing index vectors that define the four partitions:

    • lb: labeled set
    • tr: training set
    • bc: bias correction set
    • ul: unlabeled set
  • paths::PathConfigBuilderDeborah.DeborahPathConfig Struct containing global path settings, such as .analysis_dir and .overall_name.

  • dump::Bool = true If true, each split feature vector is saved to disk as .dat files.

  • jobid::Union{Nothing, String} Optional identifier string for logging or debugging purposes.

Returns

  • Dict{String, NamedTuple{(:lb, :tr, :bc, :ul), NTuple{4, Vector{Float64}}}} Dictionary mapping each feature name to a NamedTuple of split vectors.
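For reference, the documented return shape can be reproduced with hand-made data (the numbers are illustrative only):

```julia
# The documented return type of run_feature_pipeline:
SplitTuple = NamedTuple{(:lb, :tr, :bc, :ul), NTuple{4, Vector{Float64}}}

# A hand-made value with that shape:
X_data = Dict{String, SplitTuple}(
    "plaq.dat" => (lb = [0.1], tr = [0.2, 0.3], bc = [0.4], ul = [0.5]),
)

# Each split of each feature is a plain Vector{Float64}:
X_data["plaq.dat"].tr  # == [0.2, 0.3]
```

A dictionary of this shape is what build_namedtuple_splitset consumes when assembling the per-split column tables.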