The Microsoft Presidio SDK is not offered, maintained, or supported by Unstructured. For questions or issues
related to Presidio, see the following resources:
- For general discussions, use the discussion board in the Presidio repository on GitHub.
- For questions or issues, file an issue in the Presidio repository on GitHub.
- For other matters, email presidio@microsoft.com.
image_base64 and orig_elements within Unstructured metadata, and it does not look for PII in images.
Requirements
To use this example, you will need:
- An Unstructured account, as follows:
-
If you do not already have an Unstructured account, sign up for free.
After you sign up, you are automatically signed in to your new Unstructured Starter account, at https://platform.unstructured.io.
To sign up for a Team or Enterprise account instead, contact Unstructured Sales, or learn more.
-
If you have an Unstructured Starter or Team account and are not already signed in, sign in to your account at https://platform.unstructured.io.
For an Enterprise account, see your Unstructured account administrator for instructions, or email Unstructured Support at support@unstructured.io.
- A set of one or more Unstructured JSON output files that have been generated by Unstructured and stored in a folder within an Amazon S3 bucket that you have access to. One way to generate these files is to use an Unstructured workflow that relies on an S3 destination connector to store these Unstructured JSON output files. Learn how to create an S3 destination connector and create a custom workflow that uses your S3 destination connector.
- Python installed on your local development machine.
Create and run the Python code
-
In your local Python virtual environment, install the following libraries:
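- boto3
- presidio_analyzer
- presidio_anonymizer
For example, if you are using uv, you can install these libraries into your uv virtual environment with a command such as uv pip install boto3 presidio_analyzer presidio_anonymizer.
-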
In your local Python virtual environment, install the appropriate natural language processing (NLP) models for
spaCy, which Presidio relies on for various internal tasks related to named entity recognition (NER) and
PII identification.
To find the appropriate model for your use case, do the following:
a. Go to spaCy Trained Models & Pipelines.
b. On the sidebar, click your target language, for example English.
c. Click the model you want to use, for example en_core_web_lg.
d. Click Release details.
e. At the bottom of the release details page, in the Assets section, right-click the filename ending in .whl, for example en_core_web_lg-3.8.0-py3-none-any.whl, and select Copy Link Address from the context menu.
f. From the release details page, copy the URL from your web browser’s address bar.
g. Install the model into your local Python virtual environment by using the model’s name and the URL that you just copied. For example, if you are using uv, you can install the preceding model with a command such as uv pip install "en_core_web_lg @ <copied-url>", where <copied-url> stands for the URL that you just copied.
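Before moving on, it can help to confirm that the libraries and the spaCy model are importable from the active virtual environment. The following is a minimal sketch that uses only the Python standard library; the `missing_modules` helper and the `REQUIRED` list are illustrative names, and `en_core_web_lg` is assumed to be the model you installed.

```python
from importlib.util import find_spec

# Import names to check. en_core_web_lg is the example model from the
# steps above; substitute the model that you actually installed.
REQUIRED = ["boto3", "presidio_analyzer", "presidio_anonymizer", "en_core_web_lg"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

An empty result means every module in the list can be found; it does not prove that the installed versions are compatible with each other.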
- Set up Boto3 credentials for your AWS account. The following steps assume that you have set up your Boto3 credentials outside of the following code, for example by setting environment variables or configuring a shared credentials file. One approach to getting and setting up Boto3 credentials is to create an AWS access key and secret access key, and then use the AWS Command Line Interface (AWS CLI) to set up your credentials on your local development machine.
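As a quick sanity check, the sketch below looks for the two credential sources mentioned above: the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, and a shared credentials file (by default `~/.aws/credentials`, or the path in `AWS_SHARED_CREDENTIALS_FILE`). The `credential_sources` helper is a hypothetical name, and Boto3 can also resolve credentials from other places that this sketch does not inspect.

```python
import os
from pathlib import Path


def credential_sources(env=None, home=None):
    """List which of two common Boto3 credential sources appear to be set up.

    Only checks environment variables and the shared credentials file;
    Boto3 supports additional sources that are not covered here.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    shared = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials"))
    if shared.is_file():
        sources.append(f"shared credentials file ({shared})")
    return sources


if __name__ == "__main__":
    found = credential_sources()
    print("Found:", found if found else "no credentials in the checked locations")
```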
-
Add the following code to a Python script file in your virtual environment, replacing the following placeholders:
- Replace <input-bucket-name> with the name of the Amazon S3 bucket that contains your original Unstructured JSON files. This is the same bucket that you used for your S3 destination connector.
- Replace <input-folder-prefix> with the path to the folder within the input bucket that contains your original Unstructured JSON files.
- Replace <output-bucket-name> with the name of the S3 bucket that will contain copies of the contents of your Unstructured JSON files, with the redacted content within those files’ copies. This can be the same bucket as the input bucket, or a different bucket.
- Replace <output-folder-prefix> with the path to the folder within the output bucket that will contain copies of the contents of your Unstructured JSON files, with the redacted content within those files’ copies. This must not be the same folder as the input folder.
- Replace <bucket-region-short-id> with the short ID of the region where your buckets are located, for example us-east-1.
In the operators variable, a list of operators for built-in Presidio entities is specified. These operators look for common entities such as credit card numbers, email addresses, phone numbers, and more. You can remove any entities from this list that you do not want your code to look for. You can also add operators to this list for additional built-in entities that you want your code to also look for. And you can also add your own custom entities to this list.
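A minimal sketch of such a script follows. It is an illustration under stated assumptions, not a definitive implementation: the helper names (`build_redactor`, `redact_elements`, `output_key`, `main`) and the short `ENTITIES` list are hypothetical, each input file is assumed to be a JSON array of Unstructured elements, only each element's top-level `text` field is redacted, and `metadata` is copied unchanged. The `operators` mapping replaces each detected entity with a `<REDACTED_...>` placeholder.

```python
import json

INPUT_BUCKET = "<input-bucket-name>"
INPUT_PREFIX = "<input-folder-prefix>"
OUTPUT_BUCKET = "<output-bucket-name>"
OUTPUT_PREFIX = "<output-folder-prefix>"
REGION = "<bucket-region-short-id>"

# Illustrative subset of built-in Presidio entity types; add or remove as needed.
ENTITIES = ["CREDIT_CARD", "EMAIL_ADDRESS", "PHONE_NUMBER"]


def build_redactor(entities):
    """Return a function that maps text to redacted text by using Presidio."""
    # Imported here so that the helpers below work without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    # One "replace" operator per entity type, keyed by the entity name.
    operators = {
        entity: OperatorConfig("replace", {"new_value": f"<REDACTED_{entity}>"})
        for entity in entities
    }

    def redact(text):
        findings = analyzer.analyze(text=text, entities=entities, language="en")
        result = anonymizer.anonymize(
            text=text, analyzer_results=findings, operators=operators
        )
        return result.text

    return redact


def redact_elements(elements, redact):
    """Return copies of the elements with each string 'text' field redacted."""
    redacted = []
    for element in elements:
        copy = dict(element)  # Shallow copy; the input list is left unmodified.
        if isinstance(copy.get("text"), str):
            copy["text"] = redact(copy["text"])
        redacted.append(copy)
    return redacted


def output_key(input_key, input_prefix, output_prefix):
    """Map an input object key to the matching key under the output prefix."""
    relative = input_key[len(input_prefix):].lstrip("/")
    return f"{output_prefix.rstrip('/')}/{relative}"


def main():
    import boto3  # Credentials are resolved from the environment, as set up earlier.

    s3 = boto3.client("s3", region_name=REGION)
    redact = build_redactor(ENTITIES)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=INPUT_BUCKET, Prefix=INPUT_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".json"):
                continue  # Skip folder markers and non-JSON objects.
            body = s3.get_object(Bucket=INPUT_BUCKET, Key=key)["Body"].read()
            elements = redact_elements(json.loads(body), redact)
            s3.put_object(
                Bucket=OUTPUT_BUCKET,
                Key=output_key(key, INPUT_PREFIX, OUTPUT_PREFIX),
                Body=json.dumps(elements, indent=2).encode("utf-8"),
            )

# Call main() after replacing the placeholders above.
```

Keeping `redact_elements` independent of Presidio and S3 makes it possible to try the element-handling logic locally with any text-to-text function before running against real buckets.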
- Run the Python script.
-
Go to the output folder in S3 and explore the generated files, searching for the <REDACTED_ placeholders in the generated files’ contents.
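If you download one of the generated files, a short script can tally the placeholders instead of searching by eye. This sketch assumes that placeholders take the form `<REDACTED_` followed by an uppercase entity name and a closing `>`, and that the file is a JSON array of elements with an optional `text` field; `count_placeholders` is a hypothetical helper name.

```python
import json
import re
from collections import Counter

# Assumed placeholder shape: <REDACTED_ENTITY_NAME>
PLACEHOLDER = re.compile(r"<REDACTED_[A-Z_]+>")


def count_placeholders(json_text):
    """Count each redaction placeholder found in the elements' text fields."""
    counts = Counter()
    for element in json.loads(json_text):
        counts.update(PLACEHOLDER.findall(element.get("text") or ""))
    return counts
```

For example, passing the contents of a downloaded output file to `count_placeholders` returns a `Counter` keyed by placeholder, which shows at a glance which entity types were redacted and how often.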