1 minute read

TL;DR

Replace www.arxiv.org/abs/... with www.arxiv-txt.org/abs/... to get LLM-friendly ArXiv papers.

Try it out at arxiv-txt.org or view it on GitHub

Introduction

LLMs are super useful assistants, but they can often make mistakes. One easy way to get them to be slightly more reliable is to provide them with the right context, like a research paper.

However, providing LLMs with santized text can sometimes be painful. So I decided to make the process easy and with as little friction as possible.

Approach

I wanted the process to be simple, portable and easy to remember. I also wanted this to be as easy to integrate into future apps with as little extra dependencies as possible.

Since ArXiv already hosts a vast majority of research papers, and because it already has its own API, I decided to simply wrap it to make it easier for everyone else.

arxiv-txt.org/abs/[id] will automatically fetch the abtract, title, published date and authors and parse it into markdown. arxiv-txt.org/pdf/[id] will scrape the html endpoint of a paper and return it in plaintext.

Both those endpoints are pages with an easy-to-use “copy to clipboard” functionality.

API

If you want to query it instead via api, simply add raw to the url, which will be the endpoint for a GET request:

arxiv-txt.org/raw/abs/[id] will return the plaintext metadata of the paper. arxiv-txt.org/raw/pdf/[id] will return the plaintext paper content.

This makes introducing it into a python app super easy with no additional dependencies:

import requests

arxiv_url = "https://arxiv.org/abs/1706.03762"
arxiv_txt_url = arxiv_url.replace("arxiv.org", "arxiv-txt.org/raw/")
summary: str = requests.get(arxiv_txt_url).text
print(summary)

# Pass this to your favorite LLM

Comments