Arxiv Sanity Preserver (arxiv-sanity.com)
129 points by stared on March 19, 2016 | hide | past | favorite | 39 comments


Being part of the deep learning community, I've watched the number of papers appearing get out of control. Most of them appear on arXiv and are of low quality. This is particularly problematic right before conference deadlines.

Karpathy's Arxiv-sanity helps a lot to keep in touch with the latest and greatest deep learning without having to spend all my time reading papers.


As someone who uses arxiv only very rarely, can someone please explain how this preserves sanity?


Not having anything to do with publishing papers myself, I thought the 100,000+ papers submitted in 2015 sounded like a lot, until I looked into it [1] and saw there are likely more than a million academic papers published every year.

What are all those papers like? I wonder if a large proportion are the equivalent of peer-reviewed blog posts.

[1] https://www.quora.com/How-many-academic-papers-are-published...


I wonder how many research professors there are in the world. Basic googling suggests there are on the order of 1,000,000 university faculty in the world, so a million papers is at most 1 paper per faculty member per year, which doesn't seem unreasonable. Of course, how many of those faculty are serious researchers, and how many are just instructors, is a harder question.


Part of it comes from the fact that a lot of institutions require continuous publication by their faculty and research staff. If you don't publish because you're focusing on a long-term paper, it can reflect badly on you during your annual review.

To combat this a fair number of people will publish many times on the same developing experiment (not that this is necessarily a bad thing).

Don't forget grad students and dedicated undergrads. That adds a fair bit to your existing numbers.


It's not only research professors who publish. Corporate research scientists also publish, including those at Microsoft, Google, and Intel.


Grad students publish too!


True! And undergrads. And private scholars. And ..

The easy way to answer this would be to use something like Web of Knowledge to get a rough sense of the number of distinct authors.



To get some sense of the arXiv submission growth: http://arxiv.org/help/stats/2015_by_area/index


> because things were seriously getting out of hand.

Whatever that means.


Imagine waking up every morning with 50 new arxiv papers uploaded that night. You panic and quickly scan through the papers - any of them could be very related to your research, or scoop your latest idea, or have good ideas you can use in your own work. Arxiv makes no attempt to filter these for you, so it's up to you to carefully scan through this unlabeled list of paper titles. You eventually find 3 papers that you have to read and put them on your list. You manage to read 1 that day. The next day you wake up and 50 new papers are up. You iterate for a few weeks, and suddenly you have a to-read list of 20 papers and 100 new arxiv papers just came in that evening. That's what's currently happening in research, at least in deep learning (but I imagine more widely too), especially around big conference deadlines, and that's what I label "things seriously getting out of hand".

That's the first use case. The second way things are out of hand is that you remember a paper from 3 years ago that was very related to this one, but can't remember its name anymore. Here you can sort by similarity to any paper, and usually those papers come up at the top of the sorted list. This is also useful for finding related work. Another use case is peace of mind that you somehow did not miss some papers that you definitely should know about.

Google Scholar is supposed to have similar features: it emails you papers it thinks you would be interested in, and can in principle show similar papers. I don't know what they do internally, but in my own experience these features are quite terrible and low quality compared to what I get here. More generally, the amount of innovation in Google Scholar over the last few years is sadly either zero or negative (though I still get nightmares about what would happen to academia if Google pulled a Google Reader with Scholar). Arxiv-sanity uses tf-idf vectors of bigrams over the full text of each paper; I do L2 lookups for similarity ranking and train personalized SVMs per person for recommendations. The results are, at least for me, significantly better.
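For concreteness, here's a minimal sketch of the similarity half of that pipeline with scikit-learn. The toy corpus and every parameter are illustrative assumptions on my part, not the actual arxiv-sanity code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Toy stand-ins for the full text of each paper.
papers = [
    "deep convolutional networks for image classification",
    "recurrent neural networks for language modeling",
    "deep convolutional networks for image segmentation",
    "support vector machines for text categorization",
]

# tf-idf over bigrams, as the comment above describes.
vec = TfidfVectorizer(ngram_range=(2, 2))
X = vec.fit_transform(papers)

# "L2 lookups": rank the other papers by Euclidean distance to a query paper.
query = 0
dists = euclidean_distances(X[query], X).ravel()
ranking = dists.argsort()[1:]  # skip the query paper itself
print([papers[i] for i in ranking])
```

The nearest neighbor of the query ends up being the paper that shares the most bigrams with it, which is the behavior the similarity tab relies on.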


I have no real point here, only historical commentary.

I've been reading papers from the 1960s, which is when the term "information explosion" was coined. People then were struggling to stay current with the literature, and thought 'things were seriously getting out of hand.'

This was the start of abstracting services, like ISI, where you could even arrange the results of a keyword search of all the new papers to be sent to you each week - a clear predecessor to personalized RSS feeds.

Going back even further to the immediate post-war era, the library systems of the time, which were structured around books and journals and organized by topic, couldn't keep up with the deluge of research reports which cut across multiple topics. The field of information retrieval, using first punched cards and then computers, started because the publication flow was 'seriously getting out of hand'.

Or for a specific example, after high T_c superconductors were discovered in 1986, there was a mad rush of interest as solid state physicists from around the world explored the new territory. A Google Scholar search for "high temperature superconductor" finds:

  1986 -   846 publications
  1987 - 2,600
  1988 - 3,900
  1989 - 4,780
  1990 - 4,870
  1991 - 5,250
That's 14 papers per day, any one of which might be "very related to your research, or scoop your latest idea, or have good ideas you can use in your own work."

Granted, 14 << 50, but that doesn't include papers about "high Tc" which don't use the whole phrase. Also, those are 14 peer-reviewed papers per day, so some filtering has already happened, and experimental work in high Tc superconductivity requires more equipment than deep learning does.

Think of my comment as a reminder that things have been out of hand for most of a century, and dealing with that deluge emotionally connects you to the headache that generations of researchers before you have had to suffer with. :)


I'm not sure how bad the problem is in other fields though. I subscribe to a daily arXiv search alert covering physics.comp-ph and physics.flu-dyn, as well as any cross-posts to these. It averages ~20 titles and abstracts per day. I skim the titles, and if the title looks interesting I read the abstract, and if the abstract is interesting I open the full link in a background tab. This takes three minutes each morning, plus the time it takes to read any full papers where I found the abstract interesting, which is typically 2-3 papers a week. By now I've learned to read papers quickly and save them in a filename system for future reference.


Beware research paralysis. There's a point at which you have to ignore the work others are doing so that you can make progress on your own.


As somebody from outside academia: what's the worst that can happen by not reading every paper related to your research - that you accidentally end up replicating (part of) somebody else's results? That sounds useful in itself.


I'm not in the field, but I have friends who are. From what I understand, replicated experiments are not as highly regarded and do not get published in the same nice journals. Getting published in high-ranking journals is very important if you want a career in research. Another thing you might miss is if somebody does a similar or the same experiment and comes to a negative result, reading that might save you a few days or months of unnecessary work.


> Another thing you might miss is if somebody does a similar or the same experiment and comes to a negative result, reading that might save you a few days or months of unnecessary work.

Given how low the reproduction rate in science is, I'm not sure that time would be wasted.


Yeah that's basically the worst case. But it isn't useful in itself. Most research isn't of the type that having someone do it again would be any use.

For example, suppose you discover the structure of DNA, you try to publish but find someone already published it last year. You've just wasted a lot of time.

I don't know what the solution is.


In computer science, that's going to be less helpful. In fields like psychology, sociology, and even biology, the initial idea may not be too hard. The onerous part is designing an experiment, running it, and performing the data analysis. The sorts of questions the experiment is trying to answer tend to be multivariate, which means it's easy to do all of the prior things wrong. Replication is key to sussing out whether any of that went wrong. For psychology and sociology, replicating research can take just as long as the original.

In the fields of computer science where you actually implement something, it's the design and implementation of that new artifact that takes up most of your time. The experiments are not nothing, but they tend to be the sort of thing you can script: run a bunch of programs, accumulate results, do data analysis. Even the data analysis tends to be scripted. If, at the last moment you discover a small tweak that could improve your implementation, it can be trivial (in effort, not necessarily time) to re-run your experiments.
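A toy version of that scripted run-accumulate-analyze loop might look like the following (the two "programs" are placeholder Python one-liners standing in for real benchmark binaries):

```python
import statistics
import subprocess
import sys
import time

# Hypothetical benchmark harness: run each configuration several times,
# accumulate wall-clock timings, then do the data analysis in the same
# script - the whole experiment re-runs with one command.
experiments = {
    "baseline": [sys.executable, "-c", "sum(range(10**5))"],
    "tweaked":  [sys.executable, "-c", "sum(range(10**4))"],
}

results = {}
for name, cmd in experiments.items():
    timings = []
    for _ in range(3):  # repeat runs to smooth out noise
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        timings.append(time.perf_counter() - start)
    results[name] = statistics.mean(timings)

for name, mean_s in results.items():
    print(f"{name}: {mean_s:.4f}s mean over 3 runs")
```

This is exactly why a last-minute tweak is cheap to evaluate: you edit the implementation and re-run the script.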

Now, the experimental evaluation is still important, and it is also easy to do wrong. But I also claim it's more deterministic. If an author is honest in describing their experiment, it's easier for reviewers to cry foul in computer science systems research than in, say, psychology. There are ways in computer science to design poor experiments that show bogus results, but it tends to be more obvious.

If, upon trying to publish, you discover that someone else had a similar idea and implemented something similar, you have replicated the hardest part. You spent a lot of time and effort designing this new thing that overcomes all of these challenges. If someone else already did that, you could have just skipped all of that and started on improving it right away. In computer science systems research, replicating someone's research may actually be much faster. Sometimes you can view their code directly, or you can implement their idea in another system. Re-implementing an idea in a new context can take a lot of engineering effort, but it can still be a lot less work than doing it the first time.

Now, what happens if that was a bogus technique, and you can't replicate the results? That's a publishable result, but it tends not to be the whole paper. You figure out a better way, and explicitly compare your new way to that old published way. Again, that's because in computer science systems research, you're not discovering fundamental properties of things. Instead, you're discovering better ways of doing things.

I do sometimes read computer science systems papers and think "Eh, I don't buy this result". That's usually not because they did anything wrong (although sometimes it is), but because I just think that what they are investigating does not matter. "Sure, I believe you figured out a reliable way to optimize a three wheeled car, but four wheels is still better."

Theoretical computer science is not impacted by this at all, as there usually are no experiments in such papers. Their "result" tends to be a proof.


What about using deep learning to properly classify arxiv papers about deep learning (and other things, perhaps)? ;)


the right tool for the job :) In this case I'm perfectly happy with SVMs over tf-idf bigrams and where that places you in the tradeoff space.


Do you want Skynet? This sounds like the start of Skynet...


> Imagine waking up every morning with 50 new arxiv papers uploaded that night.

Others in this thread propose RSS and aggregation of abstracts as a solution. My proposal is to have reviews of literature, and then just read the reviews instead. This should save a bunch of time.


I thought it meant a filter for nonsense articles, but apparently arXiv is not very searchable anymore.


I have been checking this site out lately and using it to download PDFs to read on my phone later. I really like it! The options to see the most popular papers and to search by field are really nice.

Thank you Andrej for putting this together and maintaining it.


Who curates the top recent papers shown on first visit, and is personal preference accounted for only in "recommended"?


For talking about arXiv papers and recommending them to others, there is also: https://scirate.com/


I wrote the first version of Scirate exactly because the number of papers I had to eyeball each day was so high. Now that I've left academia I find it even more useful (so huge thanks to those who rewrote it from scratch)! If you are in quantum computing it definitely helps your sanity.


That's an awesome website.

I hope it will support TLS for registration or login mechanisms.


Hey, if you need any help expanding the site, I'd be game to help out!


Looks like the site is down.


Yeah, sorry about that - I see cryptic errors popping up at random in the server logs. I think it must have something to do with the scale of requests coming in and breaking the site in some way I don't currently understand. I have near-zero experience with scaling websites; if anyone who does is passionate about meta-research, you're very welcome to look through serve.py and help me out.

One of the problems I caught: "error: [Errno 24] Too many open files" from Tornado. Trying to fix. (Edit: OK, I made ulimit -n larger and I don't see this error anymore, at least.)


It is a funny error.

Does this happen always with the same traceback (that is, with the same area of your code in the stack trace)?


I'd offer but I have other commitments this afternoon; I hope you find someone who can help!


Would be cool if you could upvote/downvote papers


I thought about this quite a bit. I don't think I want downvoting, and I don't want effectless upvoting. The way it works right now is that you can add a paper to your library, and it then feeds into your personal SVM as a positive example of papers you'd like to see more of. Adding a paper to your library effectively counts as an "upvote", and that is what the "top" tab sorts by.
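A rough sketch of what that per-user training step could look like, assuming scikit-learn. The random feature matrix, the library indices, and the C value are all placeholders for illustration (the real features are the tf-idf bigram vectors, and this is not the actual serve.py code):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder feature matrix: in arxiv-sanity these rows would be the
# tf-idf bigram vectors of each paper; here they're random for the demo.
rng = np.random.default_rng(0)
X = rng.random((200, 50))          # 200 papers, 50 features

# Papers the user saved to their library act as positive examples;
# every other paper is implicitly treated as a negative.
library = [3, 17, 42, 99]
y = np.zeros(len(X))
y[library] = 1

# Train the per-user linear SVM and rank all papers by decision score.
clf = LinearSVC(C=0.1).fit(X, y)
scores = clf.decision_function(X)
recommended = np.argsort(-scores)[:10]  # top-10 "more like your library"
```

One consequence of this one-class-vs-rest setup is that unsaved papers do serve as (weak) negative examples, even though the user never explicitly downvoted them.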


Does that mean papers are "downvoted" by default, i.e. added to the negative example list?


Ironically it says "The connection was reset".



