AI & ML interests

Privacy and artificial intelligence. NER. Token Classification.

Recent Activity

MikeDoes 
posted an update 4 days ago
Stop sending sensitive data across the network. Sanitize it directly in the browser. 💡

A recent blog post by A. Christmas provides a practical guide on how to achieve exactly that, demonstrating a powerful form of anonymization: PII masking at the edge. The vision is simple but profound: keep sensitive data off the network entirely by sanitizing it in the browser.

The Ai4Privacy pii-masking-200k dataset served as the foundation for their work, providing the high-quality, diverse examples of PII needed to fine-tune a specialized DistilBERT model: one that is accurate, fast, and light enough to run client-side.
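The masking step itself can be sketched in a few lines. This is a hypothetical illustration, not the blog post's implementation: the spans below stand in for predictions a fine-tuned token-classification model would emit, and the bracketed placeholder style follows the Ai4Privacy convention.

```python
def mask_pii(text: str, spans: list[dict]) -> str:
    """Replace each predicted PII span with a [LABEL] placeholder.

    spans: dicts with 'start', 'end', 'label', as a token classifier
    would emit after aggregating word pieces into entities.
    """
    # Apply replacements right-to-left so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text

example = "Contact Jane Doe at jane@example.com for details."
predicted = [  # stand-in model output, not real predictions
    {"start": 8, "end": 16, "label": "NAME"},
    {"start": 20, "end": 36, "label": "EMAIL"},
]
print(mask_pii(example, predicted))  # -> Contact [NAME] at [EMAIL] for details.
```

Because the replacement happens on the model's character offsets, the sanitized string never needs to leave the user's device.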

This is the future we are working towards: a world where developers are empowered with the tools and data to build powerful AI systems that respect user privacy by design. This is exactly why we build our datasets, and we're thrilled to showcase this project that turns the principles of data privacy into a practical, deployable solution.

🔗 See their innovative approach in action: https://ronathan.esr-inc.com/automatically-sanitize-data-in-the-users-browser-with-ai/

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #WorldsLargestOpenPrivacyMaskingDataset
MikeDoes 
posted an update 6 days ago
At Ai4Privacy, our goal is to empower researchers to build a safer AI ecosystem. Today, we're highlighting crucial research that does just that by exposing a new vulnerability.

The paper "Forget to Flourish" details a new model poisoning technique. It's a reminder that as we fine-tune LLMs, our anonymization and privacy strategies must evolve to counter increasingly sophisticated threats.

We're proud that the Ai4Privacy dataset was instrumental in this study. It served two key purposes:

Provided a Realistic Testbed: It gave the researchers access to a diverse set of synthetic and realistic PII samples in a safe, controlled environment.

Enabled Impactful Benchmarking: It allowed them to measure the actual effectiveness of their data extraction attack, proving it could compromise specific, high-value information.

This work reinforces our belief that progress in AI security is a community effort. By providing robust tools for benchmarking, we can collectively identify weaknesses and build stronger, more resilient systems. A huge congratulations to the authors on this important contribution.

🔗 Read the full paper: https://arxiv.org/html/2408.17354v1

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #WorldsLargestOpenSourcePrivacyMaskingDataset
MikeDoes 
posted an update 11 days ago
How do you prove a new AI privacy tool actually works? You test it against a world-class benchmark.

That's why we're proud our data played a key role in the research for "Rescriber," a new browser extension for user-led anonymization. To objectively measure their tool's performance against other methods, the researchers needed a diverse and challenging evaluation set.

They built their benchmark using 240 samples from the Ai4Privacy open dataset.
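Drawing a fixed-size evaluation slice from a larger corpus is straightforward to sketch. The dummy corpus and seed below are assumptions for illustration; the Rescriber authors' actual selection procedure is described in their paper.

```python
import random

def sample_benchmark(corpus, k=240, seed=0):
    """Draw a reproducible k-sample benchmark from a larger corpus."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return rng.sample(corpus, k)

# Stand-in records; in practice these would be Ai4Privacy dataset rows.
corpus = [f"record-{i}" for i in range(10_000)]
benchmark = sample_benchmark(corpus, k=240)
print(len(benchmark))  # -> 240
```

Fixing the seed means anyone can regenerate the exact same 240-sample split, which is what makes benchmark comparisons across tools meaningful.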

This is a win-win for the ecosystem: our open-source data helps researchers validate their innovative solutions, and in turn, their work pushes the entire field of privacy-preserving AI forward. The "Rescriber" tool is a fantastic step towards on-device, user-controlled privacy.

🔗 Learn more about their data-driven findings in the full paper: https://arxiv.org/pdf/2410.11876

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#DataPrivacy #AI #OpenSource #Anonymization #MachineLearning #HealthcareAI #Ai4Privacy
MikeDoes 
posted an update 13 days ago
State-of-the-art AI doesn't start with a model. It starts with the data.

Achieving near-perfect accuracy for PII & PHI anonymization is one of the toughest challenges in NLP. A model is only as good as the data it learns from, and providing this foundational layer is central to our mission. The ai4privacy/pii-masking-400k dataset was built for this exact purpose: to serve as a robust, large-scale, open-source training ground for building high-precision privacy tools.


To see the direct impact of this data-first approach, look at the ner_deid_aipii model for Healthcare NLP by John Snow Labs. By training on our 400,000 labeled examples, the model achieved incredible performance:

100% F1-score on EMAIL detection.

99% F1-score on PHONE detection.

97% F1-score on NAME detection.
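For readers curious how per-entity F1 scores like these are computed, here is a minimal pure-Python sketch using exact span matching. It is a stand-in for evaluation libraries such as seqeval, not the scoring code John Snow Labs used.

```python
def entity_f1(gold: set, pred: set) -> float:
    """Entity-level F1 from exact-match (start, end, label) spans."""
    tp = len(gold & pred)                       # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy spans: the PHONE prediction is off by one character, so it
# does not count as a match under exact-span scoring.
gold = {(0, 8, "NAME"), (12, 28, "EMAIL"), (30, 40, "PHONE")}
pred = {(0, 8, "NAME"), (12, 28, "EMAIL"), (31, 40, "PHONE")}
print(round(entity_f1(gold, pred), 2))  # -> 0.67
```

Exact-match scoring is strict by design: a boundary error costs both a false positive and a false negative, which is why near-perfect F1 on entities like EMAIL is such a strong result.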

This is the result of combining a cutting-edge architecture with a comprehensive, high-quality dataset. We provide the open-source foundation so developers can build better, safer solutions.


Explore the dataset that helps power these next-generation privacy tools: ai4privacy/pii-masking-400k

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#DataPrivacy #AI #OpenSource #Anonymization #MachineLearning #HealthcareAI #Ai4Privacy
MikeDoes 
posted an update 18 days ago
Can you teach a giant like Google's Gemini to protect user privacy? A new step-by-step guide shows that the answer is a resounding "yes."

While powerful, large language models aren't specialized for privacy tasks. This tutorial by Analytics Vidhya walks through how to fine-tune Gemini into a dedicated tool for PII anonymization.

To teach the model this critical skill, the author needed a robust dataset with thousands of clear 'before' and 'after' examples.

We're thrilled they chose the Ai4Privacy pii-masking-200k dataset for this task. Our data provided the high-quality, paired examples of masked and unmasked text necessary to effectively train Gemini to identify and hide sensitive information accurately.
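The paired before/after structure can be sketched as a small JSONL-building step. The field names below are assumptions for illustration; check the pii-masking-200k dataset card and your fine-tuning API's docs for the exact schema they expect.

```python
import json

# Hypothetical paired records: unmasked input, masked target.
pairs = [
    {
        "source_text": "My name is Maria Silva and I live in Lisbon.",
        "target_text": "My name is [FIRSTNAME] [LASTNAME] and I live in [CITY].",
    },
]

def to_jsonl(records):
    """Serialize records as one JSON object per line (JSONL)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

jsonl = to_jsonl(pairs)
print(len(jsonl.splitlines()))  # -> 1
```

Each line teaches the model one complete mapping from raw text to its masked form, which is exactly the supervision signal a fine-tuning job needs.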

This is a perfect example of how the community can use open-source data to add a crucial layer of safety to the world's most powerful models. Great work!

🔗 Check out the full tutorial here: https://www.analyticsvidhya.com/blog/2024/03/guide-to-fine-tuning-gemini-for-masking-pii-data/

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#DataPrivacy #AI #LLM #FineTuning #Anonymization #GoogleGemini #Ai4Privacy #WorldsLargestOpenPrivacyMaskingDataset
MikeDoes 
posted an update 20 days ago
Thanks to open datasets, you don't need a massive research lab to build a privacy-preserving AI tool. With the right ingredients, anyone can.

A fantastic new guide shows how the democratization of AI is helping to advance safety. It walks through how to use Google's new fine-tuning API to turn Gemini into a powerful tool for PII anonymization.

This project was powered by two key components:

An accessible platform from Google.

High-quality, open-source training data.

We are honored that the author chose the Ai4Privacy pii-masking-200k dataset to provide the crucial data foundation. Our dataset delivered the volume and structure needed to successfully teach a state-of-the-art model how to perform a critical privacy function.

This is the future we're working towards: powerful platforms combined with open, safety-focused data to create tools that benefit everyone. Kudos to the author for showcasing what's possible!

🔗 Read the full step-by-step guide: https://www.analyticsvidhya.com/blog/2024/03/guide-to-fine-tuning-gemini-for-masking-pii-data/

🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#AIforGood #DemocratizeAI #DataPrivacy #Anonymization #OpenSource #LLM #Ai4Privacy