From 0589b957a41303a2ab8a241d957f1799eb8c74fe Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:09:05 +0000
Subject: update table of contents in README.md

---
 README.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 430fcdc..ebf0c4a 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,9 @@
-- [gdpr-obfuscator](#gdpr-obfuscator)
-  * [Minimum Viable Product (MVP)](#minimum-viable-product--mvp-)
-  * [Setup](#setup)
-  * [Usage](#usage)
+# GDPR Obfuscator - Launchpad Project
+
+1. [Overview](#overview)
+2. [Minimum Viable Product (MVP)](#minimum-viable-product-mvp)
+3. [Setup](#setup)
+4. [Usage](#usage)
 
 ## Overview
 
-- 
cgit v1.2.3

From 308f0e7befddae8b2306c9e57bdd5903c55ec171 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:13:09 +0000
Subject: update overview section to include JSON and parquet compatibility

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ebf0c4a..b82ccbb 100644
--- a/README.md
+++ b/README.md
@@ -7,10 +7,12 @@
 
 ## Overview
 
-A Python library designed to detect and remove Personally Identifiable Information (PII) from CSV files stored in an AWS S3 bucket.
+A Python library designed to detect and remove Personally Identifiable Information (PII) from data formats such as CSV, JSON and Parquet formats.
 
 ## Minimum Viable Product (MVP)
 
+
+
 ## Setup
 
 ## Usage
-- 
cgit v1.2.3

From 7823850692c12bb8a7155c5c26e66bd8129c9b4a Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:21:21 +0000
Subject: update MVP section to include the minimum requirements of the project

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index b82ccbb..222808e 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,13 @@ A Python library designed to detect and remove Personally Identifiable Informati
 
 ## Minimum Viable Product (MVP)
 
+The MVP covers:
+1. 
Reading a JSON string containing the S3 location of the CSV file and the names of the fields that are required to be obfuscated
+2. Ingesting the CSV file containing data records (with a primary key) from an AWS S3 bucket
+3. Obfuscating chosen PII fields (e.g. `name`, `email_address`) by replacing their values with an obfuscated string (`***`)
+4. Producing an output CSV file (or a byte-stream) that maintains the original structure but with sensitive fields changed
 
+This meets the requirements under the General Data Protection Regulation [(GDPR)](https://ico.org.uk/media/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr-1-1.pdf) to ensure that all data containing information that can be used to identify an individual should be anonymised.
 
 ## Setup
 
-- 
cgit v1.2.3

From 98fc0c2b71ae1c900ecacc19eb185a2542d4e8c4 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:25:27 +0000
Subject: add additional features section to README.md

---
 README.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 222808e..0a3857b 100644
--- a/README.md
+++ b/README.md
@@ -2,8 +2,9 @@
 
 1. [Overview](#overview)
 2. [Minimum Viable Product (MVP)](#minimum-viable-product-mvp)
-3. [Setup](#setup)
-4. [Usage](#usage)
+3. [Additional Features](#additional-features)
+4. [Setup](#setup)
+5. [Usage](#usage)
 
 ## Overview
 
@@ -19,6 +20,14 @@ The MVP covers:
 
 This meets the requirements under the General Data Protection Regulation [(GDPR)](https://ico.org.uk/media/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr-1-1.pdf) to ensure that all data containing information that can be used to identify an individual should be anonymised.
 
+### Additional Features
+
+*(Ranked in order of priority from high to low)*
+
+- [ ] **Support for JSON and Parquet formats**: Extend the library to support reading and writing data in JSON and Parquet formats
+- [ ] **Command-line interface**: Create a command-line interface to allow users to run the obfuscation process from the terminal
+- [ ] **Support for multiple sources**: Extend the library to support reading data from multiple sources (e.g. local file system)
+
 ## Setup
 
 ## Usage
-- 
cgit v1.2.3

From 83569d8facffeedb325d63364dab91abe71ba3c6 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:27:52 +0000
Subject: add setup/prerequisites section

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index 0a3857b..113d841 100644
--- a/README.md
+++ b/README.md
@@ -30,4 +30,9 @@ This meets the requirements under the General Data Protection Regulation [(GDPR)
 
 ## Setup
 
+### Prerequisites
+
+- Python >= 3.13
+- Poetry >= 2.0.1
+
 ## Usage
-- 
cgit v1.2.3

From 00d940f72c4633855075cfb797732ef02588bba9 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:32:05 +0000
Subject: add setup/installation section

---
 README.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/README.md b/README.md
index 113d841..225a999 100644
--- a/README.md
+++ b/README.md
@@ -35,4 +35,22 @@ This meets the requirements under the General Data Protection Regulation [(GDPR)
 - Python >= 3.13
 - Poetry >= 2.0.1
 
+### Installation
+
+1. Clone the repository:
+
+```
+git clone --recurse-submodules https://github.com/ajschofield/gdpr-obfuscator.git
+cd gdpr-obfuscator
+```
+
+2. 
Install dependencies using poetry
+
+```
+# Production
+poetry install
+# Developer (optional)
+poetry install --dev
+```
+
 ## Usage
-- 
cgit v1.2.3

From 0e89a74646d1dcb23c313ca13052c73f9b6c7989 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:32:24 +0000
Subject: update table of contents in README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 225a999..f575e59 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,8 @@
 2. [Minimum Viable Product (MVP)](#minimum-viable-product-mvp)
 3. [Additional Features](#additional-features)
 4. [Setup](#setup)
+    1. [Prerequisites](#prerequisites)
+    2. [Installation](#installation)
 5. [Usage](#usage)
 
 ## Overview
-- 
cgit v1.2.3

From 4740873482831a77c253bee3c0b521e09a3059a9 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:33:49 +0000
Subject: move additional features to subsection under MVP in README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f575e59..33c3987 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 
 1. [Overview](#overview)
 2. [Minimum Viable Product (MVP)](#minimum-viable-product-mvp)
-3. [Additional Features](#additional-features)
+    1. [Additional Features](#additional-features)
 4. [Setup](#setup)
     1. [Prerequisites](#prerequisites)
     2. [Installation](#installation)
-- 
cgit v1.2.3

From 6f58a724cfbf88fe12c96fdb4a038e65012a3b88 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:35:37 +0000
Subject: update code block formatting in README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 33c3987..5cd4bfb 100644
--- a/README.md
+++ b/README.md
@@ -41,14 +41,14 @@ This meets the requirements under the General Data Protection Regulation [(GDPR)
 
 1. 
Clone the repository:
 
-```
+```bash
 git clone --recurse-submodules https://github.com/ajschofield/gdpr-obfuscator.git
 cd gdpr-obfuscator
 ```
 
 2. Install dependencies using poetry
 
-```
+```bash
 # Production
 poetry install
 # Developer (optional)
-- 
cgit v1.2.3

From 8a4a382c9c29de92c6c1cae3094b09ad0c892731 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:44:12 +0000
Subject: add comments to cli.py to explain code

---
 cli.py | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/cli.py b/cli.py
index c6442c7..bd4f79d 100644
--- a/cli.py
+++ b/cli.py
@@ -4,31 +4,43 @@ from obfuscator.csv_reader import CSVReader
 from obfuscator.obfuscate import obfuscate
 from obfuscator.logger import get_logger
 
+# Create the logger
 logger = get_logger("CLI")
 
-
 def main():
+    # Create an argument parser
     parser = argparse.ArgumentParser(description="gdpr-obfuscator")
 
     # Require user to either choose a local file or an S3 object
+    # The user can only choose one of these options or the program will exit
+    # If not provided, the program will exit
     loc = parser.add_mutually_exclusive_group(required=True)
     loc.add_argument("--local")
     loc.add_argument("--s3")
 
+    # Require user to provide a list of PII fields to obfuscate
+    # e.g. 
--pii name email_address
+    # If not provided, the program will exit
     parser.add_argument("--pii", nargs="+", required=True)
+    # Parse the arguments
     args = parser.parse_args()
 
+    # Read the CSV data based on the user's choice of local or S3
     if args.local and not args.s3:
         logger.debug("User chose to read CSV from local path")
+        # Create a CSVReader object and read the local CSV file
         reader = CSVReader()
         data = reader.read_local(args.local)
+        # For debug purposes, log the data read from the CSV
         logger.debug(data)
     else:
         logger.debug("User chose to read CSV from S3")
 
+    # Obfuscate the data based on the user's choice of PII fields
     obfuscated_data = obfuscate(data, args.pii)
+    # For debug purposes, log the obfuscated data as JSON for readability
     logger.debug(json.dumps(obfuscated_data, indent=4))
 
-
+# If the script is run directly (as it should be), call the main function
 if __name__ == "__main__":
     main()
-- 
cgit v1.2.3

From de33c0c98201a275244a71826d11bb8ee3a12245 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:52:48 +0000
Subject: add comments to csv_reader.py to explain code

---
 obfuscator/csv_reader.py | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/obfuscator/csv_reader.py b/obfuscator/csv_reader.py
index b9dccdb..23a34fc 100644
--- a/obfuscator/csv_reader.py
+++ b/obfuscator/csv_reader.py
@@ -3,32 +3,61 @@ import io
 from typing import List, Dict
 from obfuscator.logger import get_logger
 
+# Create the logger
 logger = get_logger("CSVReader")
 
-
+# Putting the CSV reading components into a class may seem like overkill
+# for a simple script, but it allows for better organization and scalability.
+# @staticmethod is used to define the method without an instance of the class
+# being required. The methods could be defined just as functions, and this
+# may still be changed.
 class CSVReader:
+    """
+    A class to read CSV data from a local file, S3 object, or string. 
Near
+    the project completion, support for JSON/Parquet files will be added.
+    """
     @staticmethod
     def read_local(path) -> List[Dict[str, str]]:
+        """
+        A method to read a local CSV file and return the data as a list of
+        dictionaries.
+        """
+        # Log the path of the file being read for debugging
         logger.debug(f"Reading local CSV from: {path}")
-
+        
+        # Attempt to read the file and return the data as a list of dictionaries
+        # However, if the file isn't found or there is a generic exception, log
+        # the error and raise an exception
         try:
             with open(path, mode="r", encoding="utf-8") as f:
                 reader = csv.DictReader(f)
                 return [dict(row) for row in reader]
         except FileNotFoundError:
             logger.error(f"File not found: {path}")
+            raise
         except Exception as e:
             logger.error(f"Error reading file: {e}")
 
     @staticmethod
     def read_s3(path) -> List[Dict[str, str]]:
+        """
+        A method to read an S3 object containing CSV data
+        and return the data as a list of dictionaries.
+        """
+        # Yet to be implemented.
         return []
 
     @staticmethod
     def read_string(content: str) -> List[Dict[str, str]]:
+        """
+        A method to read CSV data from a string and return the data as a list
+        of dictionaries.
+        """
+        # If the content is empty, return an empty list
         if not content.strip():
             return []
 
+        # Treat the string as a file-like object and return as list of dictionaries
         f = io.StringIO(content)
         reader = csv.DictReader(f)
         return [dict(row) for row in reader]
-- 
cgit v1.2.3

From 3837fb40c1f70fa8bfd65872cc1c85963903fe3a Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 14:55:53 +0000
Subject: add comments to obfuscate.py to explain code

---
 obfuscator/obfuscate.py | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/obfuscator/obfuscate.py b/obfuscator/obfuscate.py
index ac0bd21..3da9155 100644
--- a/obfuscator/obfuscate.py
+++ b/obfuscator/obfuscate.py
@@ -1,16 +1,24 @@
 from typing import List, Dict
 from obfuscator.logger import get_logger
 
+# Create the logger
 logger = get_logger("Obfuscator")
 
-
 def obfuscate(
     data: List[Dict[str, str]], pii_fields: List[str]
 ) -> List[Dict[str, str]]:
+    """
+    A function to obfuscate PII fields in a list of dictionaries, replacing
+    sensitive values with a string of asterisks.
+    """
+    # If no data is provided, log a message and return an empty list
     if not data:
         logger.info("No valid data was provided to obfuscate")
         return []
 
+    # Obfuscate the PII fields in each record using a list/dict comprehension
+    # This code is good but makes debugging a bit tricky. I may consider
+    # breaking it down into a for loop.
     return [
         {k: ("***" if k in pii_fields else v) for k, v in record.items()}
         for record in data
-- 
cgit v1.2.3

From 4069a46dfb70ca98d8b2dfb671673c41b0f7c2e5 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 15:58:14 +0000
Subject: add comments for description of obfuscator.py tests

---
 test/test_obfuscator.py | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/test/test_obfuscator.py b/test/test_obfuscator.py
index c77b6b4..0245a26 100644
--- a/test/test_obfuscator.py
+++ b/test/test_obfuscator.py
@@ -1,6 +1,7 @@
 from obfuscator.obfuscate import obfuscate
 
-
+# Check if the function can obfuscate valid PII fields in a list
+# of dictionaries
 def test_obfuscate_data_with_valid_pii_fields():
     data = [
         {
@@ -35,7 +36,8 @@ def test_obfuscate_data_with_valid_pii_fields():
     result = obfuscate(data, pii_fields)
     assert result == expected
 
-
+# Check if the function can obfuscate data even when some PII
+# fields are missing from some of the data
 def test_obfuscate_data_with_missing_pii_field():
     data = [
         {"student_id": "1234", "name": "John Smith", "course": "Software"},
@@ -60,7 +62,7 @@ def test_obfuscate_data_with_missing_pii_field():
     result = obfuscate(data, pii_fields)
     assert result == expected
 
-
+# Check if the function can handle an empty list of data
 def test_obfuscate_data_with_no_data():
     data = []
     pii_fields = ["name", "email_address"]
@@ -69,7 +71,7 @@ def test_obfuscate_data_with_no_data():
     result = obfuscate(data, pii_fields)
     assert result == expected
 
-
+# Check if the function can handle an empty list of PII fields
 def test_obfuscate_data_with_empty_pii_fields():
     data = [
         {
-- 
cgit v1.2.3

From 7a4057196cb9282355b1eff6f06ee5a3c33e4e67 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 15:58:58 +0000
Subject: remove unused import in test_csv_reader.py

---
 test/test_csv_reader.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/test/test_csv_reader.py b/test/test_csv_reader.py
index e62c093..7f54c25 100644
--- a/test/test_csv_reader.py
+++ b/test/test_csv_reader.py
@@ -2,7 +2,6 @@
 # Author: Alex Schofield
 
 from obfuscator.csv_reader import CSVReader
-import pytest
 
 reader = CSVReader()
 
-- 
cgit v1.2.3

From 227b6a86d3658845441d13779d147d8892216618 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 16:01:46 +0000
Subject: add expected outputs and more detail to test descriptions in test_obfuscator

---
 test/test_obfuscator.py | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/test/test_obfuscator.py b/test/test_obfuscator.py
index 0245a26..cc7d2c1 100644
--- a/test/test_obfuscator.py
+++ b/test/test_obfuscator.py
@@ -1,7 +1,7 @@
 from obfuscator.obfuscate import obfuscate
 
-# Check if the function can obfuscate valid PII fields in a list
-# of dictionaries
+# Check if the function does what its supposed to and can obfuscate
+# valid PII fields in a list of dictionaries
 def test_obfuscate_data_with_valid_pii_fields():
     data = [
         {
@@ -37,7 +37,8 @@ def test_obfuscate_data_with_valid_pii_fields():
     assert result == expected
 
 # Check if the function can obfuscate data even when some PII
-# fields are missing from some of the data
+# fields are missing from some of the data, returning a list of dictionaries
+# but with the missing PII fields obfuscated and the rest of the data intact
 def test_obfuscate_data_with_missing_pii_field():
     data = [
         {"student_id": "1234", "name": "John Smith", "course": "Software"},
@@ -62,7 +63,7 @@ def test_obfuscate_data_with_missing_pii_field():
     result = obfuscate(data, pii_fields)
     assert result == expected
 
-# Check if the function can handle an empty list of data
+# Check if the function can handle an empty list of data, returning an empty list
 def test_obfuscate_data_with_no_data():
     data = []
     pii_fields = ["name", "email_address"]
@@ -71,7 +72,8 @@ def test_obfuscate_data_with_no_data():
     result = obfuscate(data, pii_fields)
     assert result == expected
 
-# Check if the function can handle an empty list of PII 
fields
+# Check if the function can handle an empty list of PII fields, returning the data as is
+# without mutating it
 def test_obfuscate_data_with_empty_pii_fields():
     data = [
         {
-- 
cgit v1.2.3

From 7d9725dcc3270c13fccdb35cbe499ac7a99b87ec Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 16:04:09 +0000
Subject: add comments for description of csv_reader.py tests

---
 test/test_csv_reader.py | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/test/test_csv_reader.py b/test/test_csv_reader.py
index 7f54c25..0ce3401 100644
--- a/test/test_csv_reader.py
+++ b/test/test_csv_reader.py
@@ -5,21 +5,24 @@ from obfuscator.csv_reader import CSVReader
 
 reader = CSVReader()
 
-
+# Check if the function can read a CSV string with no content and return
+# an empty list
 def test_empty_csv_should_return_no_content():
     content = ""
     result = reader.read_string(content)
     expected = []
     assert result == expected
 
-
+# Check if the function can read a CSV string with only a header and return
+# an empty list
 def test_csv_with_header_only_should_return_no_content():
     content = "student_id,name,course\n"
     result = reader.read_string(content)
     expected = []
     assert result == expected
 
-
+# Check if the function can read a CSV string with valid data and return
+# a list of dictionaries
 def test_csv_with_valid_data():
     content = (
         "student_id,name,course\n"
@@ -33,7 +36,8 @@ def test_csv_with_valid_data():
     ]
     assert result == expected
 
-
+# Check if the function can read a CSV string with quoted fields and return
+# a list of dictionaries with the quoted fields intact
 def test_csv_with_quoted_fields_should_run_as_expected():
     content = (
         "student_id,name,course\n"
-- 
cgit v1.2.3

From e2b0f2553b8dfcbe39f6e6fdc86ca68cc63f5705 Mon Sep 17 00:00:00 2001
From: Alex Schofield
Date: Mon, 17 Feb 2025 16:04:27 +0000
Subject: remove repeated test in test_csv_reader.py

---
 test/test_csv_reader.py | 7 -------
 1 file changed, 7 deletions(-)

diff --git 
a/test/test_csv_reader.py b/test/test_csv_reader.py
index 0ce3401..1b3d071 100644
--- a/test/test_csv_reader.py
+++ b/test/test_csv_reader.py
@@ -50,10 +50,3 @@ def test_csv_with_quoted_fields_should_run_as_expected():
         {"student_id": "5678", "name": "Student 2", "course": "Course 2"},
     ]
     assert result == expected
-
-
-def test_non_csv_file_should_return_no_content():
-    content = ""
-    result = reader.read_string(content)
-    expected = []
-    assert result == expected
-- 
cgit v1.2.3
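
Note on the series: the comment added to obfuscator/obfuscate.py says the comprehension may later be broken down into a for loop, and the README's MVP list ends with producing an output CSV. The following is only a minimal sketch of what that could look like, assuming the same `List[Dict[str, str]]` record shape used by `obfuscate()`. The names `obfuscate_with_loop` and `records_to_csv` are hypothetical and do not exist in the repository.

```python
import csv
import io
from typing import Dict, List


def obfuscate_with_loop(
    data: List[Dict[str, str]], pii_fields: List[str]
) -> List[Dict[str, str]]:
    """Hypothetical loop-based equivalent of the comprehension in obfuscate()."""
    pii = set(pii_fields)  # set membership checks are O(1)
    result = []
    for record in data:
        masked = {}
        for key, value in record.items():
            # Replace the value only when the column is listed as PII
            masked[key] = "***" if key in pii else value
        result.append(masked)
    return result


def records_to_csv(records: List[Dict[str, str]]) -> str:
    """Hypothetical helper: serialise records back to CSV text (MVP step 4)."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Round trip: CSV text -> records -> masked records -> CSV text
source = "student_id,name,course\n1234,John Smith,Software\n"
rows = list(csv.DictReader(io.StringIO(source)))
masked_rows = obfuscate_with_loop(rows, ["name"])
print(records_to_csv(masked_rows))
```

The loop builds new dictionaries rather than editing the input, so the original records are left unmodified, which matches what the test comments in this series expect of `obfuscate()`.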