🚨New Data🚨 Privacy won't be solved behind closed doors by big tech. To build privacy systems/agents, we need actual sensitive data to train/eval/red-team them. Meet Privasis-USA🇺🇸, the 1st census-grounded set of 1M personal docs full of all sorts of sensitive info that you can imagine🧵
Researchers Launch Privasis-USA Dataset With 1M Census-Grounded Sensitive Documents
Users praise the Privasis-USA dataset and its open synthetic data models for privacy, highlighting the major cross-institutional collaboration and the talented researchers involved.
Working on profile-based #synthetic data for mimicking #users with history?
Looking for on-device models for data abstraction and minimization?
Looking for privacy label annotations?
Check out our #ICML paper, with data, models, and code all released!!
H/t @hyunw_kim 🫡🤩

Privasis-USA is grounded in Nemotron-Personas-USA (https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA), a set of 1M realistic synthetic persona profiles strictly based on the US census. Privasis-USA expands this with sensitive personal info, including SSNs, medical docs, tax forms, emails, pay stubs, bank info, etc.

Check out our dataset: https://huggingface.co/datasets/nvidia/Privasis-USA We release the 1M main corpus and a 107K training set.
Our work has also been accepted to ICML 2026 🇰🇷 Come talk to us if you're interested in expanding this effort or building something cool on top of it! Paper: https://openreview.net/forum?id=k0VozzvIqL

With this, you can train your own strong privacy systems and agents! We provide more than 40M annotations: profile, background context, and sensitive info span locations. We built our own model, Privasis-Cleaner 4B, that outperforms OpenAI models! https://huggingface.co/nvidia/Privasis-Cleaner-4B
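
For anyone wondering what span-location annotations make possible, here is a minimal sketch of span-based redaction. It is only an illustration, not the Privasis pipeline or the Privasis-Cleaner model: the `Span` class, its fields, the `redact` helper, and the sample text are all hypothetical, and the real annotation schema in the dataset may look different. It also assumes the spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation: character offsets [start, end) plus a label."""
    start: int
    end: int
    label: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each annotated span with a [LABEL] placeholder.

    Assumes spans do not overlap.
    """
    out = text
    # Apply right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if not 0 <= span.start < span.end <= len(text):
            raise ValueError(f"span out of range: {span}")
        out = out[:span.start] + f"[{span.label}]" + out[span.end:]
    return out


# Made-up example text with a placeholder SSN (not taken from the dataset).
doc = "Jane Roe, SSN 000-00-0000, was paid $4,200 in March."
spans = [Span(0, 8, "NAME"), Span(14, 25, "SSN")]
redacted = redact(doc, spans)
print(redacted)
```

Replacing from the last span backwards keeps the remaining offsets valid, which is why the spans are sorted in reverse before editing the text.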

@niloofar_mire @hyunw_kim This is really cool! Is the goal here to create evals around privacy in the near future?

Our work has been a major cross-institutional collaboration, involving NVIDIA, CMU, USC, UW, and Stanford! Huge thanks to @niloofar_mire, Michael Duan, @rui_xin31, @StellaLisy, @jaehunjung_com, @davidjesusacu, Qi Pang, Hanshen Xiao, Ed Suh, @sewoong79, @tsvetshop, @PangWeiKoh, @YejinChoinka

@hyunw_kim Thanks to the very talented @hyunw_kim, we are releasing our synthetic Privasis datasets to study data privacy, safely.

@henrytdowling @niloofar_mire We also released an evaluation benchmark for sanitization: https://github.com/skywalker023/privasis
And we plan to release more benchmarks related to privacy in the near future!