Researchers Yearn to Use AOL Logs, but They Hesitate
nytimes.com — When AOL researchers released three months' worth of users' query logs to a publicly accessible Web site late last month, Jon Kleinberg, a professor of computer science at Cornell, downloaded the data right away. But when a firestorm over privacy breaches erupted, he decided against using it.
- 393 diggs
- RichPowers, on 10/12/2007, -1/+54Why? There are tons of sites out there that now display the search logs. I'd rather have Cornell profs using the logs for research purposes than YTMND.com using them for stupid humor...
The data shouldn't have been released, but now that it has, it's too late to just pretend the logs don't exist.
- dmurray14, on 10/12/2007, -3/+9Well put
- Shinta, on 10/12/2007, -2/+21I prefer stupid humor, but I see your point.
- ROFISH, on 10/12/2007, -3/+20True, but there is something called academic honesty. Just because there's valid data doesn't mean you have to use it. If it's against your morals, don't research it. However someone eventually will. (We learned a lot from the human body due to the Nazi "medical program", and still use that knowledge today, but they definitely weren't humane in any sense of the word.)
- pbaehr, on 10/12/2007, -22/+5@ROFISH:
Are you seriously comparing AOL releasing search logs to the Holocaust?
And you defeat your own point by going on to say that medical science today uses the information gained from inhumane experiments. It can't be undone. It may as well be put to use. I don't think it is immoral to use information which was gained through someone else's immorality.
- aaronm769, on 10/12/2007, -3/+1What I find gets lost in this issue is that the search logs have a lot of research potential. Data this detailed has never been available to average researchers, and we have no idea what new and innovative uses they can come up with. I agree that AOL poorly implemented this, but I still give them credit for thinking about the research possibilities.
- eysen, on 10/12/2007, -0/+0One thing people often forget is that research on search data has been done before, using things like Metaspy, which allows real-time peeks into what people are currently entering into search engines. See http://yi.com/home/EysenbachGunther/publications/2003/eysenbach2003e-proc_amia_fall_conf-prevalnce.pdf#search=%22eysenbach%20kohler%20What%20is%20the%20prevalence%20of%20health-related%20searches%20on%20the%20World%20Wide%20Web%22 or http://yi.com/home/EysenbachGunther/publications/2004/Eysenbach2004c-jama-searches.pdf Sure, these search terms were not linked by an ID, but if somebody was stupid enough to enter SSNs or credit card information, he/she is out of luck too.
BTW Note that a wiki has been set up at http://www.jmir.org/wiki/index.php/AOL500k by researchers to discuss the AOL dataset.
- nfph, on 10/12/2007, -0/+2Regardless of the simple +/- arguments, most universities have research boards that enforce rules regulating information gathering where human subjects are involved. The data was released without the knowledge, intent or consent of the searchers (regardless of whether it serves them right for using AOL), and the researcher and university would be complicit in that violation of privacy by incorporating the data into their sets. I suppose the counter-argument is that the data could be kept sterile enough to overlook the fact that even the company that made the data available has pulled it and deemed it a privacy violation.
- szembek, on 10/12/2007, -0/+1Because he's got too much sand in his vagina.
- ActiveMatx, on 10/12/2007, -10/+2It's like some rich crazy person went flying around in a helicopter throwing $100 bills out the window with a total of billions of dollars. Although the government is upset, and the fate of the economy is at risk...
...it's not the fault of the people who find the $100 bills and spend them..... it's the crazy idiot who went flying around in the helicopter. The money is on the streets already, and if you want to spend it go right ahead, because if you don't someone else will.
- HoboMaster, on 10/12/2007, -1/+4Why is the government upset? I've never heard a government protest people giving away money.
- KJay, on 10/12/2007, -5/+2Decent analogy, but the economy would not be at risk. In fact, it would have quite the boom with $1 billion being spent rather quickly in a relatively small area.
- berty38, on 10/12/2007, -0/+6that's a bad analogy. Your rich crazy person doesn't really reflect the innocent AOL users whose privacy was violated. In fact, I have no idea what the rich crazy person represents...
Also, whether or not academic researchers use the data will not affect whether others use it. So we can't say to ourselves "it's better that I use it than some other person with nefarious intentions."
- Dhalgren, on 10/12/2007, -1/+6I think it would be more like someone steals billions of dollars and then spreads it out via a helicopter. AOL stole the "money" (data) from innocent people (searchers) and has spread it out over the world. Now say you have $1000 as a result of this. Wouldn't you feel a little guilty spending that money?
- Chakz, on 10/12/2007, -3/+3No I wouldn't feel guilty, because if I don't spend it someone with nefarious intentions may use it to buy drugs ;)
- doubledangerbat, on 10/12/2007, -0/+3That's a horrible analogy. Good lord.
- mandarin, on 10/12/2007, -1/+6Anything marked by AOL is....well.. marked as AOL. Unreliable.
- unquist, on 10/12/2007, -1/+2Ethics are a strange beast. Clearly it's a judgement call on the part of the scientist whether or not the benefits of the research outweigh moral qualms he or she might have about using data gleaned from questionable sources.
- Neuro99, on 10/12/2007, -0/+4Agreed it was wrong on AOL's part to publish the data. But if you're a researcher, this data is pure gold. If you're an academic, you're not interested in Mr. X or Mrs. Y's searches. You're interested in spotting general trends from lots of individuals. So go ahead, use it. We can learn from that data.
- Yashu, on 10/12/2007, -0/+3He only *says* he isn't using it... It is a text article so you are not able to see him wink.
- Seumas, on 10/12/2007, -1/+7How is AOL search data valuable to anyone?!
I guarantee AOL users do not search for things the same way you or I or most people on the internet do. These people are dumb enough to search for their SS# in their browser. They probably have never even heard of boolean expressions or even the fact that you can put quotes around two words to search for a phrase.
Google search records, on the other hand, would be very useful.
- 2L84ME, on 10/12/2007, -0/+4It's just plain frustrating to see how many people were searching for actual URLs.
- jbritton, on 10/12/2007, -0/+5While AOL users are without a doubt dumb, they do represent a large group of consumers. Their search data is very valuable to anyone who wants to target their ads to these people. These are the same dumb people who actually click on the ads and buy stuff.
- berty38, on 10/12/2007, -2/+3This issue is similar to the stem cell research issue. There, some people who are pro-life understand that researching stem-cells can help save lots of lives, but to condone such research is too close to condoning the way we get stem-cells.
Sure, we in the academic community can learn a lot from this data, most of which will be to the benefit of society. But using it is like condoning AOL's invasion of privacy.
- Otto, on 10/12/2007, -1/+1I'm slightly confused here... I can see this sort of data being valuable to search engines, to improve their systems. But I'm at a loss to come up with reasonable uses for the data other than that.
How would a bunch of search engine queries help anybody with meaningful research?
- Shinta, on 10/12/2007, -0/+1None of the AOL users consented to giving this data, while in a survey, people consent to giving the info.
- wurzelgummage, on 10/12/2007, -0/+2It hadn't ever occurred to me that anyone was keeping records of searches.
Nobody warned me the Wayback Machine was storing everything either.
Information on the internet is a really unflushable turd.
- digitallysick, on 10/12/2007, -0/+1It's a good way to target data. I mean, most AOL users are "dumb" and therefore easier to market to; the chances of them falling for "anything" are higher than for the rest of us. They are the type that would "trade the cow for the magic beans".
- Snoopsor, on 10/12/2007, -1/+2I've recently grabbed this data as well, and while I do believe it's a privacy issue for this information to be publicly available, we must remember that ISPs and search engines like Google collect far more data, in much more detail, every day. So the moral question shifts from letting only the 'powers' (large corporations) see this information to letting everyone see it. Power to the people? I'm sure Google has people researching their data all the time to find trends in almost every aspect of online culture, then thinking of ways to improve on that, and thus creating revenue while improving the user experience.
I've already imported the data I got into a MySQL database, which was fairly simple, and over the last couple of days I've been experimenting with finding the best ways to index it.
Importing this data is fairly easy. I'm sure there are better ways, but this was fast and simple:
CREATE DATABASE AOL;
CREATE TABLE `AOL`.`WebQueries` (
`ID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`AnonID` INT UNSIGNED NOT NULL,
`Query` VARCHAR(255) NOT NULL,
`QueryTime` DATETIME,
`ItemRank` INT,
`ClickURL` VARCHAR(255),
PRIMARY KEY(`ID`),
INDEX `AnonIDIndex`(`AnonID`)
)
ENGINE = MYISAM
COMMENT = '~36M queries collected from ~650K AOL users during a period of 3 months.';
Now to load the data:
LOAD DATA INFILE "user-ct-test-collection-[number].txt" INTO TABLE AOL.WebQueries IGNORE 1 LINES (AnonID, Query, QueryTime, ItemRank, ClickURL);
Repeat 10 times, incrementing the number each time from 01 to 10.
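If typing the statement ten times gets tedious, the repetition could be scripted. The following is only a sketch: it assumes the ten files are named exactly as in the statement above with a two-digit number, and the helper name `load_statements` is made up for illustration. It only builds the SQL text; running it against the server is left to whatever client you use.

```python
def load_statements(first=1, last=10, table="AOL.WebQueries"):
    """Build one LOAD DATA statement per collection file (01 through 10).

    Hypothetical helper: it only returns SQL strings and does not
    connect to MySQL or check that the files exist.
    """
    template = (
        'LOAD DATA INFILE "user-ct-test-collection-{num:02d}.txt" '
        "INTO TABLE {table} IGNORE 1 LINES "
        "(AnonID, Query, QueryTime, ItemRank, ClickURL);"
    )
    return [template.format(num=n, table=table) for n in range(first, last + 1)]


if __name__ == "__main__":
    # Print the statements so they can be pasted or piped into a MySQL client.
    for statement in load_statements():
        print(statement)
```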
Here are some initial findings while trying to calculate the best indexes to use for the Query column:
select count(*) as 'total rows', count(distinct Query) as 'distinct values', count(*) - count(distinct Query) as 'duplicate values' from AOL.WebQueries;
+------------+-----------------+------------------+
| total rows | distinct values | duplicate values |
+------------+-----------------+------------------+
| 36389567 | 10154411 | 26235156 |
+------------+-----------------+------------------+
1 row in set (26 min 38.46 sec)
The following query just uses the LEFT function to take only the first 5 characters from the Query column and then compares:
select count(distinct left(Query,5)) as 'distinct prefix values', count(*) - count(distinct left(Query,5)) as 'duplicate prefix values' from AOL.WebQueries;
+------------------------+-------------------------+
| distinct prefix values | duplicate prefix values |
+------------------------+-------------------------+
| 647633 | 35741934 |
+------------------------+-------------------------+
1 row in set (2 min 9.30 sec)
Now trying with the first 10 characters in the query column:
+------------------------+-------------------------+
| distinct prefix values | duplicate prefix values |
+------------------------+-------------------------+
| 4301630 | 32087937 |
+------------------------+-------------------------+
1 row in set (2 min 50.72 sec)
Now with 15:
+------------------------+-------------------------+
| distinct prefix values | duplicate prefix values |
+------------------------+-------------------------+
| 7597975 | 28791592 |
+------------------------+-------------------------+
1 row in set (3 min 10.45 sec)
With 20:
+------------------------+-------------------------+
| distinct prefix values | duplicate prefix values |
+------------------------+-------------------------+
| 9175345 | 27214222 |
+------------------------+-------------------------+
1 row in set (3 min 20.00 sec)
With 25:
+------------------------+-------------------------+
| distinct prefix values | duplicate prefix values |
+------------------------+-------------------------+
| 9784466 | 26605101 |
+------------------------+-------------------------+
1 row in set (3 min 28.15 sec)
With 30:
+------------------------+-------------------------+
| distinct prefix values | duplicate prefix values |
+------------------------+-------------------------+
| 10006317 | 26383250 |
+------------------------+-------------------------+
1 row in set (3 min 36.11 sec)
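For anyone who wants to try the same measurement on a small sample without MySQL, here is a rough Python sketch of what the prefix queries compute. The function name `prefix_counts` and the sample queries are made up for illustration. One caveat: Python compares strings exactly, while MySQL's comparison depends on the column collation (often case-insensitive), so counts on real data may not match exactly.

```python
def prefix_counts(queries, lengths=(5, 10, 15, 20, 25, 30)):
    """Return {length: (distinct prefixes, duplicate prefixes)}.

    Mirrors count(distinct left(Query, n)) and
    count(*) - count(distinct left(Query, n)) for each n.
    """
    total = len(queries)
    result = {}
    for n in lengths:
        distinct = len({q[:n] for q in queries})  # q[:n] plays the role of LEFT(Query, n)
        result[n] = (distinct, total - distinct)
    return result


if __name__ == "__main__":
    # Invented sample, not taken from the AOL files.
    sample = [
        "google",
        "google maps",
        "google mail",
        "yahoo",
        "yahoo mail",
        "google",
    ]
    for length, (distinct, duplicate) in prefix_counts(sample).items():
        print(length, distinct, duplicate)
```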
The interesting part of this data is that from 5 to 20 characters we're seeing massive jumps in unique data; after 20, the data levels out and only increases by (relatively) small numbers each time the prefix grows by 5. For instance, the difference between the distinct prefix values for the first 40 characters and for the first 30 characters is 10123736 (for 40 characters) - 10006317 (for 30 characters) = 117419. While this is a large number, compared to the 10154411 total unique queries it's very small. You could say that only 117419/10154411*100 ≈ 1.16% of the distinct queries over the 3 month period need characters 31 to 40 to be told apart from the rest.
That's why I put an index on only the first 30 characters of the Query column, which took far less time to build. I've also added a fulltext index to the Query column to test the speed difference between fulltext searches and regular tuned queries.
These kinds of findings are by no means unheard of, but from them we can hope software will be far more finely tuned in the future to queries of, say, 1 to 30 characters (not that it isn't already)...
If Google (and other search engines) and ISPs can collect this sort of information, why shouldn't the general public (while taking into account that a percentage of people will use it for malicious purposes)?
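The arithmetic can be checked with a few lines of Python. This is just a sketch using the counts from the result tables above plus the 40-character figure quoted in the text (no result table was shown for that one); the names `coverage` and `gain` are made up for illustration.

```python
TOTAL_DISTINCT = 10154411  # distinct values of the full Query column

# Distinct prefix counts copied from the result tables (prefix length -> count).
DISTINCT_BY_PREFIX = {
    5: 647633,
    10: 4301630,
    15: 7597975,
    20: 9175345,
    25: 9784466,
    30: 10006317,
    40: 10123736,  # quoted in the text only, no table shown
}


def coverage(length):
    """Percentage of distinct queries already told apart by this prefix length."""
    return 100.0 * DISTINCT_BY_PREFIX[length] / TOTAL_DISTINCT


def gain(shorter, longer):
    """Extra distinct values gained by growing the prefix from shorter to longer."""
    return DISTINCT_BY_PREFIX[longer] - DISTINCT_BY_PREFIX[shorter]


if __name__ == "__main__":
    for length in sorted(DISTINCT_BY_PREFIX):
        print(f"first {length:2d} chars: {coverage(length):6.2f}% of distinct queries")
    extra = gain(30, 40)
    print(f"30 -> 40 chars adds {extra} values, "
          f"{100.0 * extra / TOTAL_DISTINCT:.2f}% of distinct queries")
```

Looking at coverage per prefix length, rather than raw counts, makes the levelling-off after 20 characters easier to see when deciding how long an index prefix is worth paying for.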
Cheers,
Ryan
- javierror, on 10/12/2007, -1/+1too long no thanks