AI-Supported Data Categorization

This module performs an iterative, multi-step AI-based transaction categorization workflow for financial data. It automates the process of categorizing transactions using a series of external Python scripts and manages the merging of results across multiple iterations.

Overview

The workflow consists of the following main steps:

  1. Setup and Configuration

    • Defines paths for scripts, inputs, and outputs.

    • Sets constants such as the maximum number of iterations.

  2. Iterative Categorization Loop

    • Iterates up to MAX_ITERATIONS:

      • Runs the AI categorization script (GenCat.py) on the current set of transactions (starting with all, then only uncategorized).

      • Cleans the AI output using clean_ai_categorisation.py.

      • Evaluates the cleaned output with gen_cat_eval.py to identify which transactions remain uncategorized or are inconsistently categorized.

    • Checks for completion or progress:

      • If all transactions are categorized, or if no further progress is made, the loop stops.

      • Otherwise, the next iteration uses only the remaining uncategorized transactions as input.

  3. Result Merging

    • After the loop, the script merges all intermediate JSON results (from each iteration) into consolidated files for:

      • All categorized transactions

      • All uncategorized transactions

      • All inconsistently categorized transactions

      • Merged category/keyword mappings

  4. Helper Functions

    • Includes utilities for running scripts, counting uncategorized transactions, merging JSON files, and saving merged results.
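
The stopping logic of the loop in step 2 can be sketched as follows. This is a minimal illustration rather than the module's code: run_iteration is a hypothetical stand-in for running GenCat.py, clean_ai_categorisation.py, and gen_cat_eval.py in sequence, and the value of MAX_ITERATIONS is assumed.

```python
from typing import Callable, List

MAX_ITERATIONS = 5  # assumed value; the real constant is set in the module's configuration


def categorize_iteratively(
    transactions: List[dict],
    run_iteration: Callable[[List[dict]], List[dict]],
    max_iterations: int = MAX_ITERATIONS,
) -> List[dict]:
    """Repeat categorization on the leftovers until done, stalled, or out of rounds.

    run_iteration is a hypothetical callable: it receives the current input and
    returns the transactions that are still uncategorized afterwards.
    """
    remaining = transactions
    for _ in range(max_iterations):
        still_uncategorized = run_iteration(remaining)
        if not still_uncategorized:
            return []  # everything is categorized
        if len(still_uncategorized) >= len(remaining):
            return still_uncategorized  # no progress was made, so stop early
        remaining = still_uncategorized  # the next round sees only the leftovers
    return remaining
```

The function returns whatever is still uncategorized when the loop ends, so an empty list means full coverage.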

Figure: Iterative AI-based categorization workflow

Submodules

scripts/GenCat.py module

GenCat.py

This script uses AI or rule-based logic to categorize financial transactions based on their textual descriptions. It reads input transactions, applies categorization logic, and outputs categorized results for further processing.

Typical usage:

python GenCat.py input.json output.json

class 07_AI_categorisation.scripts.GenCat.GenAICat(*, Category: str, Keywords: list[KeywordEntry])

Bases: BaseModel

GenAICat represents a generic AI-generated category with associated keywords.

Category

The name of the category.

Type:

str

Keywords

A list of keyword entries (each with a keyword, reasoning, and confidence score) relevant to the category.

Type:

list[KeywordEntry]

Category: str
Keywords: list[KeywordEntry]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class 07_AI_categorisation.scripts.GenCat.KeywordEntry(*, keyword: str, reasoning: str, confidence: float)

Bases: BaseModel

Represents a keyword entry with associated reasoning and confidence score.

keyword

The keyword for categorization.

Type:

str

reasoning

Explanation for why the keyword is relevant.

Type:

str

confidence

Confidence score for the keyword.

Type:

float

confidence: float
keyword: str
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

reasoning: str

07_AI_categorisation.scripts.GenCat.is_valid_json(text)

Checks if the provided text is a valid JSON string.

Parameters:

text (str) – The string to be checked for valid JSON format.

Returns:

True if the text is valid JSON, False otherwise.

Return type:

bool
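
A check of this kind is commonly written with the standard json module. The following is a minimal sketch under that assumption, not necessarily how GenCat.py implements it:

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError;
        # TypeError covers non-string input such as None.
        return False
    return True
```

Such a check is useful for rejecting model responses that wrap the JSON payload in extra prose before they are written to disk.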

07_AI_categorisation.scripts.GenCat.main()

Main function to categorize transactions using AI. Reads the prompt and transaction data, sends it to the AI model, and saves the response.

scripts/gen_cat_eval.py module

gen_cat_eval.py

This script evaluates the results of AI-based transaction categorization. It loads transactions and AI-generated category mappings, matches transactions to categories using keywords, identifies uncategorized and inconsistently categorized transactions, and generates summary statistics and visualizations.

Why this script is needed:

The output from the AI categorization process may contain duplicate, missing, or inconsistent keyword assignments, making it difficult to assess the quality and completeness of the categorization. This script provides essential evaluation and diagnostics by:

  • Matching transactions to categories using the AI-generated keywords.

  • Identifying transactions that remain uncategorized or are assigned to multiple conflicting categories.

  • Detecting duplicate keywords and missing categories.

  • Generating summary statistics and confidence score visualizations.

  • Exporting uncategorized and inconsistent transactions for further review or iterative improvement.

By using this script, you can systematically assess and improve the quality of your AI-driven categorization pipeline, ensuring that downstream analysis is based on reliable and well-structured data.

Typical usage:

python gen_cat_eval.py transactions.json AI_Categorisation_cleaned.json uncategorized_transactions.json inconsistent_categorizations.json

07_AI_categorisation.scripts.gen_cat_eval.build_keyword_mappings(ai_categories)

Builds mappings between keywords and categories from a list of AI category entries.

Parameters:

ai_categories (list) – A list of dictionaries, where each dictionary represents a category entry with a “Category” name and a “Keywords” list; each keyword entry holds “keyword”, “reasoning”, and “confidence”.

Returns:

A tuple containing four elements:
  • keyword_to_category (dict): Maps each keyword (str) to a set of categories (set of str) it belongs to.

  • category_to_keywords (defaultdict): Maps each category (str) to a set of keywords (set of str) associated with it.

  • all_keywords (list): A list of all keywords (str) found in the input.

  • keyword_confidence (dict): Maps each keyword (str) to its confidence value (float or int), if provided.

Return type:

tuple
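
The four return values can be built in a single pass over the category entries. The sketch below assumes the entry layout of the cleaned categorization file (“Category” plus a “Keywords” list of keyword entries); how the real function handles a keyword that appears twice with different confidence values is not documented, so here the last value wins.

```python
from collections import defaultdict


def build_keyword_mappings(ai_categories):
    """Build keyword/category lookups from a list of category entries."""
    keyword_to_category = {}
    category_to_keywords = defaultdict(set)
    all_keywords = []
    keyword_confidence = {}
    for entry in ai_categories:
        category = entry["Category"]
        for kw in entry.get("Keywords", []):
            keyword = kw["keyword"]
            keyword_to_category.setdefault(keyword, set()).add(category)
            category_to_keywords[category].add(keyword)
            all_keywords.append(keyword)  # duplicates are kept on purpose
            if "confidence" in kw:
                keyword_confidence[keyword] = kw["confidence"]  # assumption: last value wins
    return keyword_to_category, category_to_keywords, all_keywords, keyword_confidence
```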

07_AI_categorisation.scripts.gen_cat_eval.check_category_consistency(transactions, keyword_to_category)

Checks whether any transaction is assigned to more than one category by the keyword mappings.

Parameters:
  • transactions (list of dict) – The transactions to check.

  • keyword_to_category (dict) – A dictionary mapping keywords (str) to a list or set of categories.

Returns:

The inconsistently categorized transactions, in the form consumed by save_inconsistent_categorizations: a mapping from normalized transaction text to the conflicting categories.

Return type:

dict

07_AI_categorisation.scripts.gen_cat_eval.find_duplicate_keywords(keyword_to_category)

Finds keywords that are associated with more than one category.

Parameters:

keyword_to_category (dict) – A dictionary where keys are keywords and values are lists or sets of categories associated with each keyword.

Returns:

A dictionary containing only the keywords that are associated with more than one category, mapping each such keyword to its list or set of categories.

Return type:

dict
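
Given the documented input and output, the filter reduces to a dictionary comprehension. A minimal sketch:

```python
def find_duplicate_keywords(keyword_to_category):
    """Keep only the keywords that map to more than one category."""
    return {
        keyword: categories
        for keyword, categories in keyword_to_category.items()
        if len(categories) > 1
    }
```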

07_AI_categorisation.scripts.gen_cat_eval.find_missing_categories(category_to_keywords)

Finds and returns the set of desired categories that are missing from the provided category-to-keywords mapping.

Parameters:

category_to_keywords (dict) – A dictionary mapping category names (str) to their associated keywords.

Returns:

A set of category names (str) that are present in the predefined list of desired categories but missing from the input dictionary.

Return type:

set
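
This amounts to a set difference against the predefined list. In the sketch below the category names in DESIRED_CATEGORIES are hypothetical placeholders, and the extra desired parameter is added only to make the example easy to test; the documented function takes a single argument.

```python
# Hypothetical placeholder list; the real script defines its own desired categories.
DESIRED_CATEGORIES = ["Groceries", "Rent", "Insurance", "Other"]


def find_missing_categories(category_to_keywords, desired=DESIRED_CATEGORIES):
    """Return the desired categories that have no entry in the mapping."""
    return set(desired) - set(category_to_keywords)
```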

07_AI_categorisation.scripts.gen_cat_eval.load_data(transactions_path, categories_path)

Loads transaction data and AI-generated categories from specified JSON files.

Parameters:
  • transactions_path (str) – Path to the JSON file containing transaction data.

  • categories_path (str) – Path to the JSON file containing AI-generated categories.

Returns:

A tuple containing:
  • transactions (dict or list): The loaded transaction data.

  • ai_categories (dict or list): The loaded AI-generated categories.

Return type:

tuple

07_AI_categorisation.scripts.gen_cat_eval.main()

Main entry point for evaluating AI-based transaction categorization. This function determines file paths for input and output data, either from command-line arguments or default locations. It loads transaction and category data, builds keyword mappings, matches transactions to categories, and identifies unmatched, duplicate, inconsistent, and missing categorizations. Results are saved to output files, and summary statistics are printed to the console.

Command-line arguments (optional):
  1. transactions_path: Path to the transactions JSON file.

  2. categories_path: Path to the JSON file of AI-generated categories.

  3. uncategorized_path: Path to save unmatched transactions.

  4. inconsistent_path: Path to save inconsistent categorizations.

If arguments are not provided, default paths in the outputs directory are used.

07_AI_categorisation.scripts.gen_cat_eval.match_transactions(transactions, keyword_to_category)

Matches transactions to categories based on provided keywords.

Parameters:
  • transactions (list of dict) – A list of transaction dictionaries. Each transaction should contain the keys 'sender_receiver', 'booking_text', and 'purpose'.

  • keyword_to_category (dict) – A dictionary mapping keywords (str) to a list or set of categories.

Returns:

A tuple containing:
  • matched_categories (list): A list of categories matched from the transactions.

  • unmatched (list): A list of transaction dictionaries that did not match any keyword.

Return type:

tuple
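
The exact matching rule is not documented here. The sketch below assumes a case-insensitive substring match of each keyword against the three text fields joined together, and returns the matched categories as a flat list of names; both choices are assumptions.

```python
TEXT_FIELDS = ("sender_receiver", "booking_text", "purpose")


def match_transactions(transactions, keyword_to_category):
    """Split transactions into matched categories and unmatched transactions."""
    matched_categories = []
    unmatched = []
    for tx in transactions:
        # Assumption: match against all three fields, ignoring case.
        text = " ".join(str(tx.get(field, "")) for field in TEXT_FIELDS).lower()
        categories = set()
        for keyword, cats in keyword_to_category.items():
            if keyword.lower() in text:
                categories.update(cats)
        if categories:
            matched_categories.extend(sorted(categories))
        else:
            unmatched.append(tx)
    return matched_categories, unmatched
```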

07_AI_categorisation.scripts.gen_cat_eval.print_confidence_statistics(keyword_confidence)

Prints statistical information and generates a histogram for keyword confidence scores.

Parameters:

keyword_confidence (dict) – A dictionary where values are confidence scores (int or float) associated with keywords.

Behavior:
  • Calculates and prints the minimum, maximum, and average confidence scores.

  • Plots and saves a histogram of the confidence scores as ‘confidence_histogram.png’ in the outputs directory.

  • If no valid confidence values are found, prints a corresponding message.

Notes

  • The function expects matplotlib.pyplot (as plt) and os to be imported.

  • The outputs directory is assumed to be located one level above the script’s directory.

07_AI_categorisation.scripts.gen_cat_eval.print_summary(transactions, unmatched, inconsistent)

Prints a summary of transaction categorization results.

Parameters:
  • transactions (list) – The list of all transactions processed.

  • unmatched (list) – The list of transactions that could not be matched to a category.

  • inconsistent (list) – The list of transactions with inconsistent categorizations.

Prints:
  • Total number of transactions.

  • Number of matched transactions.

  • Number of unmatched transactions.

  • Number of inconsistent categorizations.

  • Coverage percentage of matched transactions.
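
The printed figures follow from the three list lengths. A minimal sketch, assuming that "matched" means every transaction that is not in the unmatched list and that coverage is matched divided by total; summarize is a hypothetical helper name that returns the values instead of printing them:

```python
def summarize(transactions, unmatched, inconsistent):
    """Compute the summary figures from the three result lists."""
    total = len(transactions)
    matched = total - len(unmatched)
    # Guard against an empty input to avoid division by zero.
    coverage = matched / total * 100 if total else 0.0
    return {
        "total": total,
        "matched": matched,
        "unmatched": len(unmatched),
        "inconsistent": len(inconsistent),
        "coverage_percent": coverage,
    }
```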

07_AI_categorisation.scripts.gen_cat_eval.save_inconsistent_categorizations(inconsistent, transactions, output_path='inconsistent_categorizations.json')

Saves inconsistent categorizations of transactions to a JSON file. This function takes a dictionary of inconsistent categorizations, a list of transaction dictionaries, and writes the inconsistent entries to a JSON file. Each entry in the output contains the full transaction information (if available) and the list of inconsistent categories. If the full transaction cannot be found, the normalized transaction text is included instead.

Parameters:
  • inconsistent (dict) – A mapping from normalized transaction text to a set or list of inconsistent categories.

  • transactions (list) – A list of transaction dictionaries, each containing transaction details.

  • output_path (str, optional) – The file path to save the output JSON. Defaults to “inconsistent_categorizations.json”.

Returns:

None

07_AI_categorisation.scripts.gen_cat_eval.save_unmatched_transactions(unmatched, output_path=None)

Saves a list of unmatched transactions to a JSON file.

Parameters:
  • unmatched (list) – A list of unmatched transaction records to be saved.

  • output_path (str, optional) – The file path where the JSON file will be saved. If None, saves to 'uncategorized_transactions.json' in the current script's directory.

Returns:

None

Raises:

OSError – If there is an error writing to the specified file path.

scripts/clean_ai_categorisation.py module

clean_ai_categorisation.py

This script cleans and consolidates AI-generated transaction categorization data. It merges duplicate keywords (preferring non-‘Other’ categories), sorts categories and keywords, and outputs a cleaned JSON file for downstream analysis and reporting.

Note

The output from the AI categorization process often contains duplicate or inconsistent keyword assignments, making it difficult to use directly for analysis or reporting. This script is essential for cleaning, consolidating, and standardizing the AI-generated categorization data, ensuring that the resulting JSON file is well-structured and reliable for further processing (such as in analysis.py).

Typical usage:

python clean_ai_categorisation.py input.json output.json

07_AI_categorisation.scripts.clean_ai_categorisation.clean_ai_categorisation(input_path, output_path)

Cleans and consolidates AI categorisation data from a JSON file. This function reads a JSON file containing categories and associated keywords (with reasoning and confidence), merges duplicate keywords (preferring non-‘Other’ categories when conflicts arise), and outputs a cleaned, sorted mapping of categories to their keywords.

Parameters:
  • input_path (str) – Path to the input JSON file containing the raw categorisation data.

  • output_path (str) – Path where the cleaned JSON file will be saved.

The output JSON will be a list of dictionaries, each with:
  • “Category”: The category name.

  • “Keywords”: A sorted list of keyword entries (each with “keyword”, “reasoning”, and “confidence”).

Side Effects:
  • Writes the cleaned data to the specified output path.

  • Prints a confirmation message upon successful save.
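
The merge rule described above (one category per keyword, with a specific category preferred over “Other”) can be sketched in memory as follows. The function name merge_duplicate_keywords is hypothetical, and keeping the first assignment when two non-“Other” categories conflict is an assumption; the source does not say how that case is resolved.

```python
def merge_duplicate_keywords(raw_categories):
    """Assign each keyword to one category and return a sorted category list."""
    chosen = {}  # keyword -> (category, keyword entry)
    for entry in raw_categories:
        category = entry["Category"]
        for kw in entry.get("Keywords", []):
            current = chosen.get(kw["keyword"])
            # Keep the first assignment, but let a specific category replace "Other".
            if current is None or (current[0] == "Other" and category != "Other"):
                chosen[kw["keyword"]] = (category, kw)
    grouped = {}
    for category, kw in chosen.values():
        grouped.setdefault(category, []).append(kw)
    return [
        {"Category": category, "Keywords": sorted(kws, key=lambda k: k["keyword"])}
        for category, kws in sorted(grouped.items())
    ]
```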

07_AI_categorisation.scripts.clean_ai_categorisation.load_json(path)

Loads and returns the contents of a JSON file from the specified path.

Parameters:

path (str) – The file path to the JSON file.

Returns:

The parsed JSON data from the file.

Return type:

dict or list

Raises:
  • FileNotFoundError – If the specified file does not exist.

  • json.JSONDecodeError – If the file is not valid JSON.

07_AI_categorisation.scripts.clean_ai_categorisation.save_json(data, path)

Save the given data as a JSON file at the specified path.

Parameters:
  • data (Any) – The data to be serialized and saved as JSON.

  • path (str or Path) – The file path where the JSON data will be written.

Raises:
  • TypeError – If the data provided is not serializable to JSON.

  • OSError – If the file cannot be written due to an OS-related error.

scripts/refine_other_loop.py module

07_AI_categorisation.scripts.refine_other_loop.build_batch_prompt(transactions)
07_AI_categorisation.scripts.refine_other_loop.load_transactions(path)
07_AI_categorisation.scripts.refine_other_loop.main()
07_AI_categorisation.scripts.refine_other_loop.process_batch(batch)
07_AI_categorisation.scripts.refine_other_loop.refine(transactions, batch_size=10)
07_AI_categorisation.scripts.refine_other_loop.save_transactions(data, path)

scripts/prompt_for_other_loop.py module

07_AI_categorisation.scripts.prompt_for_other_loop.build_batch_prompt(transactions)

main_categorization.py module

main_categorization.py

This script orchestrates the iterative AI-based categorization workflow for financial transactions. It automates the process of running external scripts for categorization, cleaning, and evaluation in multiple rounds, each time focusing on transactions that remain uncategorized. The workflow continues until all transactions are categorized, no further progress is made, or a maximum number of iterations is reached.

Key features:
  • Runs the AI categorization, cleaning, and evaluation scripts in sequence for each iteration.

  • Tracks and manages uncategorized and inconsistently categorized transactions.

  • Merges and consolidates results from all iterations into final output files.

  • Stops automatically once all transactions are categorized, an iteration makes no progress, or the iteration limit is reached.

This script is intended to be the main entry point for the categorization pipeline and should be run after preparing the input transaction data and AI models.

Dependencies:
  • Python 3.x

  • Python standard-library modules used: subprocess, json, os, time, glob (no separate installation required)

Usage:
  1. Set up the environment: Ensure Python 3.x is installed and the required packages are available.

  2. Prepare input data: Place the input transaction data file (transactions.json) in the ‘inputs’ directory.

  3. Configure AI models: Ensure the AI models and scripts are available in the ‘scripts’ directory.

  4. Run the script: Execute this script (main_categorization.py) to start the categorization process.

  5. Review results: Check the ‘outputs’ directory for the categorized transaction files and reports.

07_AI_categorisation.main_categorization.load_uncategorized_count(uncategorized_path)

Loads and returns the count of uncategorized items from a JSON file.

Parameters:

uncategorized_path (str) – The file path to the JSON file containing uncategorized items.

Returns:

The number of uncategorized items. Returns 0 if the file does not exist.

Return type:

int
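
A minimal sketch of this behavior, assuming the file holds a JSON list with one element per uncategorized transaction:

```python
import json
import os


def load_uncategorized_count(uncategorized_path):
    """Return the number of entries in the file, or 0 if it does not exist."""
    if not os.path.exists(uncategorized_path):
        return 0
    with open(uncategorized_path, encoding="utf-8") as fh:
        return len(json.load(fh))  # assumption: the file contains a JSON list
```

Treating a missing file as zero lets the main loop stop cleanly when the evaluation step produced no output.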

07_AI_categorisation.main_categorization.main()

Run the iterative categorization workflow for financial transactions.

Executes categorization, cleaning, and evaluation scripts in multiple rounds, manages uncategorized transactions, and merges results into consolidated output files.

07_AI_categorisation.main_categorization.merge_category_json_files(pattern)

Merge multiple JSON files containing category-keyword mappings into a single list.

Parameters:

pattern (str) – Glob pattern to match JSON files.

Returns:

List of dictionaries, each with “Category” and “Keywords” keys.

Return type:

list

07_AI_categorisation.main_categorization.merge_json_files(pattern, key=None)

Merge multiple JSON files matching a glob pattern into a single list, removing duplicates.

Parameters:
  • pattern (str) – Glob pattern to match JSON files.

  • key (str, optional) – Key for identifying unique entries; if None, the entire entry is used.

Returns:

Merged and deduplicated JSON objects from all matched files.

Return type:

list
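
One way to implement this merge is sketched below. It assumes each matched file holds a JSON list, and that entries without a key are compared by their canonical JSON serialization; the real deduplication rule may differ.

```python
import glob
import json


def merge_json_files(pattern, key=None):
    """Merge the lists from all matching files, dropping duplicate entries."""
    merged = []
    seen = set()
    for path in sorted(glob.glob(pattern)):  # sorted for a stable merge order
        with open(path, encoding="utf-8") as fh:
            items = json.load(fh)
        for item in items:
            # Identity is the chosen field, or the whole entry serialized canonically.
            identity = item[key] if key is not None else json.dumps(item, sort_keys=True)
            if identity not in seen:
                seen.add(identity)
                merged.append(item)
    return merged
```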

07_AI_categorisation.main_categorization.run_script(script_and_args)

Executes a Python script with the specified arguments as a subprocess.

Parameters:

script_and_args (list) – A list containing the script name followed by its arguments.

Prints:

The command being run, the standard output of the subprocess, and any errors encountered.
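
A helper like this is typically built on subprocess.run. The sketch below is an assumption about the approach, not the module's code; in particular, using sys.executable as the interpreter and returning the exit code are choices made for this example.

```python
import subprocess
import sys


def run_script(script_and_args):
    """Run a Python script as a subprocess and echo its output."""
    command = [sys.executable, *script_and_args]
    print("Running:", " ".join(command))
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print("Error:", result.stderr)
    return result.returncode  # assumption: the documented function may return nothing
```

Passing the command as a list avoids shell quoting problems with file paths that contain spaces.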

07_AI_categorisation.main_categorization.save_merged(pattern, outname, key=None)

Merge JSON files matching a pattern and save the result to an output file.

Parameters:
  • pattern (str) – Glob pattern to match JSON files.

  • outname (str) – Output filename for the merged JSON data.

  • key (str, optional) – Key for deduplication.

Returns:

None

Side Effects:

Writes the merged JSON data to the specified output file and prints a confirmation message.

07_AI_categorisation.main_categorization.save_merged_categories(pattern, outname)

Merges multiple category JSON files matching a given pattern and saves the merged result to a specified output file.

Parameters:
  • pattern (str) – The glob pattern to match input JSON files for merging.

  • outname (str) – The filename for the merged output JSON file.

Returns:

None

Side Effects:

Writes the merged JSON data to the specified output file and prints a confirmation message.