Deborah.Rahab.HistogramOrigML

Deborah.Rahab.HistogramOrigML.plot_histogram_orig_vs_ml — Method

plot_histogram_orig_vs_ml(
    trace_data::Dict{String, Vector{Vector{Float64}}},
    trace_idx::Int,
    nbins::Int,
    overall_name::String;
    subset::AbstractString="all",
    x_min::Union{Nothing, Real}=nothing,
    x_max::Union{Nothing, Real}=nothing,
    print_clipping::Bool=true,
    clipping_report_prefix::AbstractString="",
    outfile::AbstractString="histogram_bins.dat",
    include_out_of_range_counts_in_file::Bool=true,
    save_file::Bool=false,
    plot_dir::AbstractString=""
) -> Nothing

Plot histograms comparing original vs. ML-predicted trace data for a given observable index, optionally clip the plotting range, optionally save the histogram plot as a cropped PDF (with a heatmap-style filename convention), and always save binned counts to a .dat file.

This function compares two datasets (OG vs. ML) selected by subset, renders overlaid histograms on a common bin grid, and writes the bin counts to outfile. If x_min and/or x_max are provided, values outside the chosen range are discarded (clipped) before histogramming; discarded counts are reported (optionally) and can be written to the .dat output. If save_file=true, the histogram figure is saved to PDF (and cropped via pdfcrop when available) using a rule-based basename that includes subset and overall_name.

Arguments

trace_data: Dictionary containing vectors of traces, keyed by:
- "Y_tr" : Training set values,
- "Y_bc" : Bias-correction set values,
- "Y_ul" : Unlabeled set values (original),
- "YP_ul": Unlabeled set values (ML-predicted),
- "YP_bc": Bias-correction set values (ML-predicted) (required when subset="bc").
trace_idx: Index of the observable in each trace_data[key] entry. For example, if trace_idx corresponds to $\mathrm{Tr}\,M^{-n}$, then trace_idx=1,2,3,4 typically represent different inverse-trace powers or related observables in your pipeline.
nbins: Number of histogram bins.
overall_name: Suffix used to construct the output histogram plot filename when save_file=true, analogous to the heatmap naming convention used elsewhere. (Does not affect outfile.)

Keyword Arguments

subset: Select which subset to compare. Allowed values (case-insensitive):
- "all": (default) Use the original combined behavior:
  - OG = vcat(Y_tr, Y_bc, Y_ul)
  - ML = vcat(Y_tr, Y_bc, YP_ul)
- "tr": Training-only comparison:
  - OG = Y_tr
  - ML = Y_tr
  (This is expected to match identically by construction.)
- "bc": Bias-correction-only comparison:
  - OG = Y_bc
  - ML = YP_bc
  (Requires trace_data["YP_bc"] to exist.)
- "ul": Unlabeled-only comparison:
  - OG = Y_ul
  - ML = YP_ul
x_min: If provided, forces the histogram minimum $x$-range (values below are discarded).
x_max: If provided, forces the histogram maximum $x$-range (values above are discarded).
print_clipping: If true, prints a short report of discarded points due to clipping.
clipping_report_prefix: Optional prefix prepended to clipping report lines (useful in batch logs).
outfile: Output filename for the tab-delimited bin-count table. This .dat file is always written.
include_out_of_range_counts_in_file: If true, appends two extra columns (OG_oob, ML_oob) holding the number of discarded points for each dataset (same value repeated per row for convenience).
save_file: If true, saves the histogram plot as a PDF using a rule-based filename and (if available) runs pdfcrop to produce a cropped PDF.
plot_dir: Output directory for the histogram plot PDF when save_file=true. If empty, the current directory (".") is used.

Behavior

First, selects the datasets to compare based on subset:
- subset="all":
  - OG = vcat(Y_tr, Y_bc, Y_ul)
  - ML = vcat(Y_tr, Y_bc, YP_ul)
- subset="tr":
  - OG = Y_tr
  - ML = Y_tr
- subset="bc":
  - OG = Y_bc
  - ML = YP_bc
- subset="ul":
  - OG = Y_ul
  - ML = YP_ul
Determines a common plotting/binning range:
- If x_min/x_max are both nothing, uses data-driven min/max from both datasets.
- Otherwise uses the user-specified bound(s) and fills any missing side from the data-driven bound.
If x_min and/or x_max are provided, values outside [final_min, final_max] are discarded before histogramming. The number of discarded points is computed separately for OG and ML, printed when print_clipping=true, and optionally written to the .dat file (see include_out_of_range_counts_in_file).
Renders overlaid histograms using a shared bin_edges grid. Legend labels are adjusted by subset:
- subset="all": legend shows OG and ML.
- Otherwise: legend shows OG-<SUBSET> and ML-<SUBSET> (e.g., OG-UL, ML-UL).
If save_file=true, saves the histogram plot as a cropped PDF (when pdfcrop is available) using a heatmap-style basename convention:
- If trace_idx == 1: histogram_pbp_<subset>_<overall_name>.pdf
- Else: histogram_trdinv<trace_idx>_<subset>_<overall_name>.pdf
The file is written under plot_dir (or "." if plot_dir is empty).
Always writes the binned histogram counts to outfile as a tab-delimited text file.

Output

Displays the histogram inline.
If save_file=true, writes a histogram plot PDF into plot_dir with a rule-based filename.
Writes outfile with columns:
Bin Min Max OG ML [OGoob MLoob]

Returns

Nothing (side effects: plot display, optional PDF save, and .dat output written).

source