ArXiv-txt, LLM-friendly ArXiv papers


TL;DR

Replace www.arxiv.org/abs/... with www.arxiv-txt.org/abs/... to get LLM-friendly ArXiv papers.

Try it out at arxiv-txt.org or view it on GitHub

Introduction

LLMs are super useful assistants, but they can often make mistakes. One easy way to get them to be slightly more reliable is to provide them with the right context, like a research paper.

However, providing LLMs with sanitized text can sometimes be painful. So I decided to make the process as easy and frictionless as possible.

Approach

I wanted the process to be simple, portable and easy to remember. I also wanted it to be easy to integrate into future apps, with as few extra dependencies as possible.

Since ArXiv already hosts a vast majority of research papers, and because it already has its own API, I decided to simply wrap it to make it easier for everyone else.

arxiv-txt.org/abs/[id] automatically fetches the abstract, title, publication date and authors of a paper and formats them as markdown. arxiv-txt.org/pdf/[id] scrapes the HTML endpoint of a paper and returns its content as plaintext.

Both of those endpoints are pages with easy-to-use “copy to clipboard” functionality.

API

If you want to query it via the API instead, simply add raw to the URL; the result is the endpoint for a GET request:

arxiv-txt.org/raw/abs/[id] will return the plaintext metadata of the paper. arxiv-txt.org/raw/pdf/[id] will return the plaintext paper content.

This makes it super easy to add to a Python app, with no dependencies beyond requests:

import requests

arxiv_url = "https://arxiv.org/abs/1706.03762"
# Insert "/raw" after the host; no trailing slash, since the path already starts with "/abs/"
arxiv_txt_url = arxiv_url.replace("arxiv.org", "arxiv-txt.org/raw")
summary: str = requests.get(arxiv_txt_url).text
print(summary)

# Pass this to your favorite LLM
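
If you would rather avoid the requests dependency, the same idea can be sketched with only the standard library. This is a minimal sketch, assuming the endpoint shapes described above (/raw/abs/[id] and /raw/pdf/[id]); the helper names to_arxiv_txt_url and fetch_text are hypothetical and not part of arxiv-txt itself.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

# Hosts accepted as arXiv URLs (assumption: with or without "www.")
ARXIV_HOSTS = {"arxiv.org", "www.arxiv.org"}


def to_arxiv_txt_url(arxiv_url: str, raw: bool = True) -> str:
    """Map an arxiv.org /abs/ or /pdf/ URL to its arxiv-txt.org counterpart.

    Hypothetical helper: with raw=True it targets the plaintext API endpoint,
    otherwise the regular page with the copy-to-clipboard functionality.
    """
    parts = urlsplit(arxiv_url)
    if parts.netloc.lower() not in ARXIV_HOSTS:
        raise ValueError(f"not an arxiv.org URL: {arxiv_url}")
    if not parts.path.startswith(("/abs/", "/pdf/")):
        raise ValueError(f"expected an /abs/ or /pdf/ path: {arxiv_url}")
    path = "/raw" + parts.path if raw else parts.path
    # Drop any query string or fragment and always use https
    return urlunsplit(("https", "arxiv-txt.org", path, "", ""))


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """GET a URL and decode the body, defaulting to UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


if __name__ == "__main__":
    # Fetch the full paper text instead of the summary (requires network access)
    url = to_arxiv_txt_url("https://arxiv.org/pdf/1706.03762")
    paper = fetch_text(url)
    print(paper[:500])
```

Parsing the URL instead of using a plain string replace means the host is checked explicitly and the path never ends up with a doubled slash.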


jerpint
