
Researcher Launches Privasis-USA, a Synthetic Dataset of 1M Census-Grounded Sensitive Documents


Original post

🚨New Data🚨 Privacy won't be solved behind closed doors by big tech. To build privacy systems/agents, we need actual sensitive data to train/eval/red-team them. Meet Privasis-USA🇺🇸, the 1st census-grounded set of 1M personal docs full of all sorts of sensitive info that you can imagine🧵

2:17 PM · Jul 2, 2026 · 4K Views

Sentiment

Commenters praise the Privasis-USA dataset and the openly released synthetic data and models for privacy research, highlighting the cross-institutional collaboration and the researchers involved.

Positive: 100.0%

Negative: 0.0%

3 comments with sentiment.


Posts from X


Niloofar ✈️ icml @niloofar_mire

Working on profile-based #synthetic data for mimicking #users with history?

Looking for on-device models for data abstraction and minimization?

Looking for privacy label annotations?

Check out our #ICML paper, with data, models and code all released!!

H/t @hyunw_kim 🫡🤩


Hyunwoo Kim @hyunw_kim

Privasis-USA is grounded on Nemotron-Personas-USA (https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA), a set of 1M realistic synthetic persona profiles strictly based on the US census. Privasis-USA expands this with sensitive personal info, including SSNs, medical documents, tax forms, emails, pay stubs, bank info, etc.


Hyunwoo Kim @hyunw_kim

Check out our dataset: https://huggingface.co/datasets/nvidia/Privasis-USA. We release the 1M main corpus and a 107K training set.

Our work has also been accepted to ICML 2026 🇰🇷. Come talk to us if you're interested in expanding this effort or building something cool on top of it! Paper: https://openreview.net/forum?id=k0VozzvIqL


Hyunwoo Kim @hyunw_kim

With this, you can train your own strong privacy systems and agents! We provide more than 40M annotations: profile, background context, and sensitive-info span locations. We built our own model, Privasis-Cleaner 4B, that outperforms OpenAI models! https://huggingface.co/nvidia/Privasis-Cleaner-4B

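To illustrate how span-location annotations like those mentioned in the post above might be used, here is a minimal Python sketch that replaces annotated spans with category placeholders. The `Span` fields, the `span_for` helper, the label names, and the sample text are all hypothetical; the actual Privasis-USA annotation schema is not described in the thread and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span annotation (not the dataset's real schema)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # category name, e.g. "SSN" or "EMAIL"


def span_for(text: str, substring: str, label: str) -> Span:
    """Build a Span covering the first occurrence of substring in text."""
    start = text.index(substring)  # raises ValueError if not found
    return Span(start, start + len(substring), label)


def redact(text: str, spans: list[Span]) -> str:
    """Replace every annotated span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if not 0 <= span.start <= span.end <= len(text):
            raise ValueError(f"span out of bounds: {span}")
        if span.start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        pieces.append(text[cursor:span.start])  # keep the non-sensitive text
        pieces.append(f"[{span.label}]")        # mask the sensitive text
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Made-up example text; the SSN below is an obviously fake placeholder.
doc = "Contact Jane at jane@example.com, SSN 000-00-0000."
spans = [
    span_for(doc, "000-00-0000", "SSN"),
    span_for(doc, "jane@example.com", "EMAIL"),
]
redacted = redact(doc, spans)
print(redacted)
```

This is plain rule-based masking driven by gold annotations, not the behavior of Privasis-Cleaner 4B; a trained sanitizer would have to find the spans itself.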

Henry Dowling @henrytdowling

@niloofar_mire @hyunw_kim This is really cool! Is the goal here to create evals around privacy in the near future?


Hyunwoo Kim @hyunw_kim

Our work has been a major cross-institutional collaboration, involving NVIDIA, CMU, USC, UW, and Stanford! Huge thanks to @niloofar_mire, Michael Duan, @rui_xin31, @StellaLisy, @jaehunjung_com, @davidjesusacu, Qi Pang, Hanshen Xiao, Ed Suh, @sewoong79, @tsvetshop, @PangWeiKoh, @YejinChoinka


Sewoong Oh @sewoong79

@hyunw_kim Thanks to the very talented @hyunw_kim, we are releasing our synthetic Privasis datasets to study data privacy, safely.


Hyunwoo Kim @hyunw_kim

@henrytdowling @niloofar_mire We also released an evaluation benchmark for sanitization: https://github.com/skywalker023/privasis

And we plan to release more benchmarks related to privacy in the near future!
