| author | Alex <git@ajschof.me> | 2025-02-17 16:47:47 +0000 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2025-02-17 16:47:47 +0000 |
| commit | 00917b8ecf67de9e955479be555d74fcc8257020 (patch) | |
| tree | 17dd9b2e85866f85bdbb3702185463b13c911a28 /README.md | |
| parent | bf323b8c2ebd47bb446ba773027f389a0887e325 (diff) | |
| parent | e2b0f2553b8dfcbe39f6e6fdc86ca68cc63f5705 (diff) | |
Merge pull request #3 from ajschofield/add-docs
update README & add comments in src code
Diffstat (limited to 'README.md')
| -rw-r--r-- | README.md | 54 |
1 files changed, 49 insertions, 5 deletions
@@ -1,14 +1,58 @@
-- [gdpr-obfuscator](#gdpr-obfuscator)
-  * [Minimum Viable Product (MVP)](#minimum-viable-product--mvp-)
-  * [Setup](#setup)
-  * [Usage](#usage)
+# GDPR Obfuscator - Launchpad Project
+
+1. [Overview](#overview)
+2. [Minimum Viable Product (MVP)](#minimum-viable-product-mvp)
+    1. [Additional Features](#additional-features)
+4. [Setup](#setup)
+    1. [Prerequisites](#prerequisites)
+    2. [Installation](#installation)
+5. [Usage](#usage)
 
 ## Overview
 
-A Python library designed to detect and remove Personally Identifiable Information (PII) from CSV files stored in an AWS S3 bucket.
+A Python library designed to detect and remove Personally Identifiable Information (PII) from data formats such as CSV, JSON and Parquet formats.
 
 ## Minimum Viable Product (MVP)
 
+The MVP covers:
+1. Reading a JSON string containing the S3 location of the CSV file and the names of the fields that are required to be obfuscated
+2. Ingesting the CSV file containing data records (with a primary key) from an AWS S3 bucket
+3. Obfuscating chosen PII fields (e.g. `name`, `email_address`) by replacing their values with an obfuscated string (`***`)
+4. Producing an output CSV file (or a byte-stream) that maintains the original structure but with sensitive fields changed
+
+This meets the requirements under the General Data Protection Regulation [(GDPR)](https://ico.org.uk/media/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr-1-1.pdf) to ensure that all data containing information that can be used to identify an individual should be anonymised.
+
+### Additional Features
+
+*(Ranked in order of priority from high to low)*
+
+- [ ] **Support for JSON and Parquet formats**: Extend the library to support reading and writing data in JSON and Parquet formats
+- [ ] **Command-line interface**: Create a command-line interface to allow users to run the obfuscation process from the terminal
+- [ ] **Support for multiple sources**: Extend the library to support reading data from multiple sources (e.g. local file system)
+
 ## Setup
 
+### Prerequisites
+
+- Python >= 3.13
+- Poetry >= 2.0.1
+
+### Installation
+
+1. Clone the repository:
+
+```bash
+git clone --recurse-submodules https://github.com/ajschofield/gdpr-obfuscator.git
+cd gdpr-obfuscator
+```
+
+2. Install dependencies using poetry
+
+```bash
+# Production
+poetry install
+# Developer (optional)
+poetry install --dev
+```
+
 ## Usage
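The README in this diff leaves its Usage section empty, so the MVP steps it lists (parse a JSON request, read the CSV, replace the chosen PII fields with `***`, emit a CSV of the same shape) can be sketched roughly as follows. This is a minimal illustration using only the standard library and is not the project's actual API: the function name `obfuscate_csv`, the JSON keys `file_to_obfuscate` and `pii_fields`, and the example bucket path are all assumptions made for the example, and fetching the object from S3 is left out so the snippet stays self-contained.

```python
import csv
import io
import json


def obfuscate_csv(csv_text: str, pii_fields: list[str], mask: str = "***") -> str:
    """Return csv_text with every value in the pii_fields columns replaced by mask."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []

    # Fail early if a requested field is not a column in the header.
    missing = [field for field in pii_fields if field not in fieldnames]
    if missing:
        raise ValueError(f"PII fields not present in CSV header: {missing}")

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for field in pii_fields:
            row[field] = mask  # the original value is discarded, not transformed
        writer.writerow(row)
    return out.getvalue()


# Hypothetical request shape; these key names are assumptions, not the project's schema.
request = json.loads(
    '{"file_to_obfuscate": "s3://example-bucket/students.csv",'
    ' "pii_fields": ["name", "email_address"]}'
)

# Stand-in for the file body that would normally be downloaded from S3.
sample = "student_id,name,email_address\n1,Jo Bloggs,jo@example.com\n"
result = obfuscate_csv(sample, request["pii_fields"])
print(result)
```

The primary key column (`student_id` here) passes through untouched, so obfuscated rows can still be matched to the source records. Calling `result.encode("utf-8")` would give the byte-stream form mentioned in the MVP list.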
