Data cleaning¶

This section describes the data cleaning process for FinTrack. It covers how raw bank data is standardized, cleaned, and imported into the database.

Description of image — workflow data cleaning¶

Submodules¶

db_update.py module¶

db_update.py¶

This module provides tools for importing, cleaning, and unifying bank transaction data from various CSV formats into a single SQLite database. It supports multiple banks, automatic column mapping, duplicate detection, and robust logging.

Main features: - Automatically detects and maps columns from different banks to a unified schema. - Handles various encodings and CSV header formats. - Cleans and normalizes data for consistency. - Generates a unique hash for each transaction to prevent duplicates. - Saves all transactions into an SQLite database. - Logs all processing steps and errors for traceability.

Usage:: Run this script to process all CSV files in the 02_raw_data directory and update the bank_statements.db database.

03_data_cleaning.db_update.create_transactions_table(db_path, table_name='transactions')¶

Create the transactions table with all possible columns and a unique constraint on transaction_hash.

Parameters:

db_path (str) – Path to the SQLite database file.
table_name (str, optional) – Name of the table to create (default is “transactions”).

Returns:

None

03_data_cleaning.db_update.detect_bank_name(filename)¶

Extract only the bank name from filename (without extension and year/suffix).

Examples

‘Bank_A-2024.csv’ -> ‘Bank_A’ ‘Bank_B_2023.csv’ -> ‘Bank_B’

Parameters:: filename (str) – The filename to extract the bank name from.
Returns:: The extracted bank name.
Return type:: str

03_data_cleaning.db_update.find_header_row(file_path, sep=';', encoding='utf-8')¶

Find the header row in a CSV file by matching known column names.

Parameters:

file_path (str) – Path to the CSV file.
sep (str, optional) – CSV separator (default is ‘;’).
encoding (str, optional) – File encoding (default is ‘utf-8’).

Returns:

The row index of the header.

Return type:

int

Raises:

ValueError – If the header row cannot be found.

03_data_cleaning.db_update.load_csv_with_mapping(file_path, bank_name, sep=';')¶

Load a CSV file, map its columns to unified names, clean and convert data types, and return a DataFrame ready for database insertion.

Tries UTF-8 encoding first, then falls back to ISO-8859-1.

Parameters:

file_path (str) – Path to the CSV file.
bank_name (str) – Name of the bank (used for the ‘bank_name’ column).
sep (str, optional) – CSV separator (default is ‘;’).

Returns:

DataFrame with unified columns and a transaction_hash column.

Return type:

pandas.DataFrame

Raises:

ValueError – If the file cannot be read with the supported encodings.

03_data_cleaning.db_update.row_hash(row)¶

Generate a SHA-256 hash for a row based on all available data.

Parameters:: row (pandas.Series) – A row of transaction data.
Returns:: The SHA-256 hash of the row as a hexadecimal string.
Return type:: str

03_data_cleaning.db_update.save_to_sqlite(df, db_path='C:\\public_projects\\portfolio\\FinTrack\\03_data_cleaning\\bank_statements.db', table_name='transactions')¶

Save the DataFrame into an SQLite database, ignoring duplicates.

Logs the number of records saved and duplicates skipped.

Parameters:

df (pandas.DataFrame) – DataFrame to save.
db_path (str, optional) – Path to the SQLite database file.
table_name (str, optional) – Name of the table to save to.

Returns:

None

Raises:

Exception – If there is an error saving to the database.

Data cleaning¶

Submodules¶

db_update.py module¶

db_update.py¶

FinTrack

Navigation

Related Topics