Rare Book Monthly

Articles - September - 2023 Issue

Where Do AI Programs Get Their Data? It Turns Out Some Comes from Copyrighted Books, Without Permission

CatGPT?

Where does the information you get from artificial intelligence (AI) sources like ChatGPT come from? It comes from a lot of places, including the reams of data on the internet, but a significant source is books. Many, if not most, are of recent vintage, as up-to-date information is needed for the best answers. As such, most of these books are under copyright. However, the authors and publishers of these books have been neither asked for permission nor compensated. Is this legal, an acceptable use of copyrighted works, or a violation of copyright law? Good question. No one knows the answer, since it has not been adjudicated in court.

AI programs get much of their data from training databases, which is also where they learn how language is used so they can give understandable answers. These are databases filled with an enormous amount of information. How about the best-known AI program, ChatGPT? Did it learn from a training database? To answer this, we went to the ultimate authority, ChatGPT itself. It responded, “Yes, ChatGPT, like other GPT-3 models, is trained on a large and diverse dataset containing a wide range of text from the internet. This dataset includes books, articles, websites, and other sources of human-generated text. The model learns patterns, language structures, and information from this training data, which it then uses to generate responses to user inputs.”

One such online training database is called “The Pile,” and a subset of The Pile is Books3. The Pile contains data from numerous sources, with Books3 providing the book element. Books3 contains 196,000 books, converted to searchable text. The text is not necessarily in a format that would allow you to read it as a book, but it is there. Most of the books are likely copyrighted but used without permission. Books3 was freely available on the internet to anyone seeking to build an AI model. Its creator made it so because he wanted even small developers to have a shot at creating a model.

Books3 was recently removed from the internet. It was taken down at the request of Rights Alliance, a group representing Danish publishers, which determined that 150 of the titles used were published by its members. The Eye, the website hosting Books3, complied.

This issue is already starting to appear in court, and we can expect to see more cases until some sort of decision is reached on where AI training databases and copyright law intersect. Some argue that this use is “Fair Use,” a doctrine that allows you to quote brief parts of a book without running afoul of copyright law. Training an AI model can be argued to be similar, and it does not even involve direct quoting. It is sort of like conducting research in a library. However, it is also true that these databases have copied entire books to do their searching. It is also notable that the authors are not being compensated, while they are at risk of losing sales to people who would rather do their research through services like ChatGPT. Of course, the database compiler can license the material from the publisher, but that would require many deals with many people, and it might be prohibitively expensive for all but the largest corporations. That is what the Books3 founder sought to avoid. Maybe ChatGPT can come up with an answer to this dilemma.

Note on illustration. What the...? I asked ChatGPT's image generator for a picture of ChatGPT. This is what it gave me. Why? Who knows. Perhaps it has to do with the French word for “cat” being “chat,” but who knows what its artificial mind was thinking. Hopefully, its textual answers are a little better.


Posted On: 2023-09-01 12:11
User Name: PeterReynolds

Textual answers better? Not in my experience. I asked it for the chapter titles of a book which it knew how to find online, formatted as a numbered list. It would only give me a list of chapters that it felt ought to be in books of this type, not the ones in the particular book, despite being able to point me to where I could find and read the book online.

