Submissions/The Internet Archive and Wikimedia - Common Knowledge Goals/notes

From Wikimania
  • Andrew Lih: First ever session about the Internet Archive at Wikimania!
  • We collaborate on dead links, but there is more potential!
  • Andrew recommends visiting IA on any Friday at lunch to meet the staff; it's open to everyone, and you might meet founder Brewster Kahle (pronounced Kayle), have lunch, and get a tour of the place, which is amazing
  • Wendy Hanamura, director of partnerships
  • Mark Graham, director of the Wayback Machine
  • Plan for today: 30min about existing collaboration
  • 20min questions
  • discussion about ideas

Internet Archive = infrastructure for the free knowledge world

  • Wayback Machine has been running for 21 years
  • InternetArchiveBot, running for more than 1 year on enwiki. Fixed more than 1 million links (~2% error rate)
  • GreenC bot (WaybackMedic) checks for errors in InternetArchiveBot's edits (minuscule error rate)

Every new link added to a Wikipedia article (in any of 288 languages) is crawled by the Wayback Machine

Many books published since 1923 are missing from the Internet Archive because of US copyright law

Hoping to add 4 million more books with grant

1000 books/day archived

Tools and demonstrations

  • InternetArchiveBot -- fixes broken links on en.wp and is being put on other sites/projects/languages
    • IAbot aka InternetArchiveBot on-wiki - it fixes broken links, with an error rate of 1% or 2%
    • WaybackMedic 2.1 is a second bot that reviews the edits of IABot and compares the broken link not just to IA but to other archives
    • These bots will be on other Wikimedia projects soon: Norwegian, Wikispecies, Dutch, English. The first bot fixed 2.7M links by linking to IA, and another 500K by linking to other archives
    • they are deploying to other languages
    • they archive every change to en.wp or maybe other WMF sites based on the IRC flow — 10m changes a week
  • "Analyze a page" - archive all citations on a Wikipedia page
  • transfer a book from IA to Wikisource (with a button?)
  • Face-O-Matic can find faces in archived video footage
  • Has captioned video from TV in the "TV News Archive": Wendy searches for "dark side of Wikipedia" and shows a clip of Sharyl Attkisson of the program Full Measure discussing "the dark side of Wikipedia"
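
The link-fixing bots above need some way to ask whether an archived copy of a dead URL exists. The notes do not say how InternetArchiveBot or WaybackMedic do this internally, so the following is only a minimal sketch, assuming the public Wayback Machine availability endpoint (https://archive.org/wayback/available) and its JSON response with an "archived_snapshots" / "closest" entry. The function names are hypothetical, and no network request is made here.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL asking whether `url` has an archived snapshot.

    `timestamp` (YYYYMMDD...) optionally asks for the snapshot closest
    to that date, e.g. the date the citation was added.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the archived URL from a parsed availability response.

    Returns None when no usable snapshot is reported.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available") and closest.get("status") == "200":
        return closest.get("url")
    return None
```

A bot along these lines would fetch the query URL, parse the JSON, and, if `closest_snapshot` returns a URL, swap it in for the dead link; falling back to other archives (as WaybackMedic is said to do) would be a separate step.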

They have 90K software titles; 2M moving images; 2.m audio recordings, notably Grateful Dead; 3M hours of television; 3M eBooks; 302B web pages in the Wayback Machine

How to get IA to archive a web page: go to the wayback machine front page and enter its URL there. There are several ways to get it to archive pages but this is a quick front end
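
One of the other ways to request a capture is by URL rather than through the front-page form. The sketch below assumes the "Save Page Now" address pattern https://web.archive.org/save/ followed by the target URL; the helper name is hypothetical, and rate limits or login requirements are not covered.

```python
from urllib.parse import urlparse

# Assumption: the Save Page Now URL prefix on web.archive.org.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(url: str) -> str:
    """Return the address that asks the Wayback Machine to capture `url`.

    Only absolute http(s) URLs are accepted; anything else raises ValueError.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return SAVE_PREFIX + url


if __name__ == "__main__":
    # Opening the printed address in a browser requests the capture.
    print(save_page_now_url("https://example.org/some/article"))
```

Opening the resulting address in a browser has the same effect as typing the URL into the form on the Wayback Machine front page.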

100&Change competition for $100M; 8 are semi-finalists, including Catholic Relief Services, Internet Archive …

Executive Order 9066: title of a book Wendy found in the 6th grade in Glendale, Calif. Her relatives had been in the prison camps in the hinterlands of Utah. The book was published in 1972. It changed her life but is now out of print.

  • Books are not always available.
  • Youngest digital natives like her son don't think of books as even existing if they are not online.
  • IA is proposing to digitize 4 million important 20th century books published after 1923 for browsing and will create lending systems for local libraries to hand off ebooks.
  • The Marrakesh treaty makes such books available to the blind (?)
  • Wendy asks Wikimedians to suggest which books should be in the 4M collection
  • they are working with the Digital Library Federation
  • Tweet to #100andChange
  • Internet Archive digitizes 1000 books a day! Or was it 100?

Metadata of the digitized books for Wikidata / Wikicite

  • IA is working with MusicBrainz to archive music of, say, the 1970s, and the music recorded on "78s" records
  • Canadian law defines "fair dealing" which is like U.S. "fair use" but, Wendy says, a bit "stronger"

GITenberg project -- combo of Git and Gutenberg


  • Help us make video available in an open format, or maybe provide easy links to IA videos from Commons -- this is underway; IA will accept proprietary formats and emit a copy in a public-domain format. Two-way file metadata editing is planned for the long term (after Commons Structured Data arrives). Contact Brion if interested in plans/details!

For books, journals, etc., one often wants to survey a particular topic. The Internet Archive keyword search provides one dimension for doing this, but arrangement by the traditional Library of Congress Classification number or Dewey Decimal System number is also really, really helpful. Last I knew, the LOC arrangement was hidden behind a paywall. A "browse on this shelf" by LOC number or Dewey Decimal System number would be most helpful, although I acknowledge only a few "power users" might benefit from it. I acknowledge that this suggestion is very US-EN-centric, so to anyone outside the US or interested in non-English, feel free to propose in your country/language/library catalog system of interest.