Mohamed Elashri

How to read arXiv daily

A good researcher is a good reader and needs to stay up to date on the latest research directions in the field. I'm an experimental particle physicist, so arXiv is my main source of research papers, because it is the norm in the field to post on arXiv before the peer review process. So if you can keep up with arXiv, you can keep up with the field (most of the time).

But my interest is not limited to particle physics, which would be covered by the hep-ex and hep-ph categories. I also read papers in machine learning, statistics, and other fields that I find interesting. And I don't read every paper; I only read the ones that I find interesting or relevant to my work. But how would I actually do that without spending too much time on it?

The process of vetting and choosing which papers to read is time-consuming and takes a lot of effort, to the point that it might not even be possible to do by hand. I occasionally take recommendations from friends, colleagues, and other researchers. But I'm not a very social person (I don't hang out much at R1, the center of the universe, or other places).

But I'm also a developer/programmer, so I like to automate things. And I like to abuse the tools I have at my disposal. One of these tools is the very generous arXiv feeds; the other is the equally generous GitHub Actions minutes. I abuse GitHub Actions a LOT.

I have a monitoring service and several automated tasks, such as building Docker images of things I use and publishing them to Docker Hub, among many other things. So I thought, why not automate the process of reading arXiv papers?

Or, to be specific, why not automate the process of deciding which papers to read? I can automate at least part of this process, can't I? So I wrote a Python script that does the following:

  1. Fetches the latest arXiv papers from the categories I care about (hep-ex, hep-ph, cs.LG, stat.ML, etc.).
  2. Filters the papers based on my interests. I have a list of keywords that I care about, so I filter the papers based on these keywords.
  3. Publishes a GitHub Page with the filtered papers. It does this by reading the JSON files of the papers (one per day) and generating a simple HTML page with the titles and links to the papers. Everything is generated into a single page from an HTML/CSS/JS template I wrote.
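
The fetching and per-day JSON storage described above could be sketched roughly like this. This is not the actual script behind the page. It is a minimal illustration that assumes the public arXiv API endpoint (export.arxiv.org/api/query) and its Atom output, which may differ from the feed the real script reads. The function names (build_query_url, parse_feed, fetch_category, save_daily) and the JSON layout are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from datetime import date
from pathlib import Path

# Assumption: the public arXiv API, which returns an Atom feed.
API_URL = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace prefix


def build_query_url(category, max_results=50):
    """Build a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def _clean(text):
    """Collapse line breaks and repeated spaces that appear in feed text."""
    return " ".join((text or "").split())


def parse_feed(xml_data):
    """Turn an Atom feed (str or bytes) into a list of plain dicts."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        papers.append({
            "title": _clean(entry.findtext(f"{ATOM}title")),
            "link": _clean(entry.findtext(f"{ATOM}id")),
            "summary": _clean(entry.findtext(f"{ATOM}summary")),
        })
    return papers


def fetch_category(category, max_results=50):
    """Download and parse the latest papers for one category (needs network)."""
    url = build_query_url(category, max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_feed(response.read())


def save_daily(papers, out_dir, day=None):
    """Write one JSON file per day, named like 2024-09-30.json."""
    day = day or date.today()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{day.isoformat()}.json"
    path.write_text(json.dumps(papers, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

With network access, calling fetch_category once per category (hep-ex, hep-ph, cs.LG, stat.ML), concatenating the results, and passing them to save_daily would produce the daily file. The arXiv API documentation asks clients to pause a few seconds between requests, so a time.sleep between categories is a sensible addition.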

Then there is a GitHub Action that runs this script daily and publishes the results to a GitHub Page. The action is triggered by a cron schedule that fires every day at 00:00 UTC, and it is defined in the .github/workflows/update-papers.yml file in my GitHub repository. The result is published to melashri.net/arxiv.

This is what it looks like:

arXiv daily page, https://melashri.net/arxiv

And this is what a paper view looks like:

arXiv daily paper view, https://melashri.net/arxiv/2024-09-30

For now, I only highlight the papers that contain the interesting keywords in the title. I don't remove the others because I have not composed a good list of keywords yet. I also want to keep the option open to read other papers that might be interesting but don't contain the keywords. I'm still working on this part, but I think it is a good start.
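
Highlighting by keyword without dropping anything could look roughly like the sketch below. It is an illustration under stated assumptions, not the code behind the page: the keyword list, the function names (compile_keywords, is_highlighted, render_day), the CSS class names, and the shape of the paper dicts (a "title" and a "link" key) are all hypothetical.

```python
import html
import re

# Hypothetical keyword list; the real one is not given in the post.
KEYWORDS = ["tracking", "trigger", "graph neural network"]


def compile_keywords(keywords):
    """Build one case-insensitive, whole-word pattern from all keywords."""
    if not keywords:
        return re.compile(r"(?!)")  # a pattern that never matches
    # Longest first, so multi-word keywords win over shorter overlaps.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def is_highlighted(title, pattern):
    """True when the title contains at least one keyword."""
    return pattern.search(title) is not None


def render_day(papers, pattern):
    """Render every paper as a list item; matching titles get an extra class."""
    items = []
    for paper in papers:
        css = "paper highlight" if is_highlighted(paper["title"], pattern) else "paper"
        link = html.escape(paper["link"], quote=True)
        title = html.escape(paper["title"])
        items.append(f'<li class="{css}"><a href="{link}">{title}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Whole-word matching keeps short keywords from firing inside unrelated words, at the cost of missing variants such as plurals; dropping the trailing `\b` is one way to loosen it. Because every paper is rendered and only the class differs, the non-matching papers stay on the page and the styling decides how much they stand out.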

I'm happy with the result so far. I skim through the papers daily and read the ones that catch my interest. I also share some of the interesting papers with my colleagues. I think this is a good way to keep up with the latest research in the field without spending too much time on it. And I hope that my abuse of GitHub Actions won't be too much of a problem for GitHub. I don't think it is, and at the very least they are helping a young researcher keep up with the field.