Libraries and Librarianship

1. Personal Background
2. Overview
3. Life Cycle
4. Finding Information
5. Computing
- 5.1. Automating the Library
- 5.2. Algorithms

1. Personal Background

My grandmother was a Latin scholar and a librarian. My mother was a mechanical engineer and a librarian. My father was an electrical engineer and a librarian. My mother-in-law was a teacher and a librarian. It was more or less inevitable that my first professional career was as a librarian. After getting a B.S. in biology and an M.L.S., I ran the Biomedical Information Center at Hanford, Washington, from 1976 to 1980. I later became an electrical engineer, completing the family engineering/librarian trifecta.

I left the profession for the same reason many do: low pay and frustration with shrinking acquisition budgets. I was spending more time worrying about staffing, building expansions, and equipment repairs than on reading the books.

Nevertheless, I remain passionate about libraries as a treasure house of cultural wealth. I mostly used up the local public library system years ago, but find the occasional new (to me) topic of interest. While at Yale, I browsed most of the libraries (yes, it took months). While at the University of Washington, I browsed all 20 libraries, and continue to use them regularly. I became a lifetime alumnus so I would always have library privileges.

2. Overview

Humans are by nature obligate model-builders and model-sharers. They are by culture model-preservers in the form of books, films, maps, etc. (Creation_myth).

While we will get to esoteric knowledge-bases and semantic nets, the practical world of libraries is based on these notions:

Work

The unit of intellectual effort. Some human or group of humans working together collected the data, prepared the text, edited for readability, etc. Typically a book, but might be a journal article, report, map, pamphlet.

Edition

A variant of the work, usually with a specific date. Typically done to correct errors and to add new material. Every work has an implied first edition.

Publication

The design of a physical implementation of a work. This might distinguish the paperback from the hardback rendition.

Copy

A specific physical implementation of the publication. Often thousands or millions of copies are produced. Sometimes a single library may have several copies of the same publication, and must find ways to distinguish them.
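
The four notions nest inside one another, and a small data model can make the nesting concrete. The following is only a sketch: the class names follow the terms above, but the field names (binding, copy_number, and so on) and the sample data are my own assumptions, not a standard cataloging schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Copy:
    """One specific physical item on a shelf."""
    copy_number: int


@dataclass
class Publication:
    """A physical design of an edition, e.g. hardback or paperback."""
    binding: str
    copies: List[Copy] = field(default_factory=list)

    def add_copy(self) -> Copy:
        # Number copies 1, 2, 3... so the library can tell them apart.
        copy = Copy(copy_number=len(self.copies) + 1)
        self.copies.append(copy)
        return copy


@dataclass
class Edition:
    """A dated variant of a work."""
    number: int
    year: int
    publications: List[Publication] = field(default_factory=list)


@dataclass
class Work:
    """The unit of intellectual effort."""
    title: str
    author: str
    editions: List[Edition] = field(default_factory=list)


# Hypothetical example data, for illustration only.
hardback = Publication(binding="hardback")
hardback.add_copy()
hardback.add_copy()
work = Work(
    title="An Example Field Guide",
    author="A. Author",
    editions=[Edition(number=1, year=1976, publications=[hardback])],
)
```

Here the library holds two copies of the hardback publication of the first edition, distinguished only by their copy numbers.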

3. Life Cycle

3.1. Make the model

The model-builder must know something of value to the culture at large (or at least to you), and be capable of structuring this knowledge in a learnable manner.

Problem: The reader needs to pick the best model-builders and model-sharers from among the second-rate or outright charlatans.

3.2. Make it persistent

Preserve the model in persistent form. Oral tradition and one-on-one tutoring are important, but can't keep up with the volume of information needed to run modern civilization. Written forms with manual reproduction (hand-copied or typewriter-copied) can generate a few copies, which tend to be precious. Printing presses made mass production possible, though as volumes increase so do the incentives to limit the number of works. The internet provides free duplication, and unlimited variety of works, but it can be hard to distinguish the wheat from the chaff.

3.3. Store it safely

Physical books (and maps, films, and audio recordings) may deteriorate from acid in the paper, or insects, or water damage, or mildew, or rough handling. A lot of work goes into solving these problems.

In the case of "rare books", the copy itself is of interest, and museum-level climate control is used. In most cases though, it is adequate to preserve the intellectual content. For that, microfilming, digital scanning, and photocopying are used to refresh the work.

Due to the physical constraints, we typically have specialized buildings or rooms for the task (libraries). We invoke a cultural rule-set for use of libraries, e.g.:

Check out books in a way we can get them back in reasonable time
Don't even check out high-use or high-value books (reference books)
Use books in the library (provide reading rooms, be quiet, no food or drink)

If the storage medium is not human-readable, we must also assure we have reading devices for the indefinite future. A collection stored on 5 1/4" floppies may be hard to access 20 years from now. A collection on DVDs, stored with a proprietary encryption scheme, may be hard to access 5 years from now.

3.4. Store it retrievably

Once a copy is safely stored, we need to get it back out.

Discover existence of the work and the edition
Identify the publication
Locate a copy

If we trusted our indexing schemes, we could just stuff the books in a warehouse and grab them back out when needed. However, no indexing scheme is perfect or even generally adequate. So we index as best we can yet store in open shelving so the user can browse. Browsing is technically "full-text searching, with wetware neural net matching."

Open shelving requires some mechanism for systematically linearizing all of knowledge. This mechanism must be able to adjust to changing memes over time, and changing linear feet of shelving. Further, the mechanism must usefully group ideas so that if a user gets in the right area, he/she will find other useful texts nearby.

Classification schemes (e.g., Dewey Decimal, Library of Congress (LC)) are the answer. These in turn must adjust to shifting cultural memes. After all, LC was originally designed for military texts about tents and horse-drawn cannon. Fitting in ballet or laser-driven DVDs is a trick. Further, the code must be attached to the spines of books, and thus must be terse. Finally, it must be unique, so that if a user somehow obtains the code, the library system can find the copy.

These problems led to notions of:

Controlled constructive vocabularies
Hospitable schemes
Colon classification
Phoenixing
Cutter coding
Copy numbers

3.5. Index it

We now have copies of published works stored on open shelving in a classification order. A user with infinite time could browse the full collection, and could redo this periodically to find items which were checked out when he/she first browsed the collection.

To make a more efficient search, we index the collection. For starters, we can prepare small records of each work, and store these in the same classification order as the physical collection. The records traditionally include the classification code, the author, publisher, copyright date, a few subject terms, and perhaps a brief description.

Next, we can make copies of the same records and sort them by the subject terms. Or author. Or title.

In a world of physically prepared index cards and card catalogs, that is about it. In fact it was so tedious that libraries banded together and did the cataloging centrally. They also worked with publishers to put cataloging information in the books themselves (on the verso or back of the title page), and created International Standard Book Numbers (ISBN) to uniquely identify the publication.
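
An ISBN can only identify a publication reliably if transcription errors are caught, which is what its final check digit is for. The sketch below applies the standard ISBN-10 rule (weights 10 down to 1, sum divisible by 11, with X standing for 10) and the ISBN-13 rule (alternating weights 1 and 3, sum divisible by 10). The function names are my own, and the sample numbers are just a commonly quoted matched pair.

```python
DIGITS = "0123456789"


def is_valid_isbn10(isbn):
    """Check the ISBN-10 check digit; hyphens and spaces are ignored."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10  # X is allowed only as the final check character
        elif ch in DIGITS:
            value = int(ch)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit; hyphens and spaces are ignored."""
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 13 or not all(c in DIGITS for c in chars):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
    return total % 10 == 0


ok10 = is_valid_isbn10("0-306-40615-2")
ok13 = is_valid_isbn13("978-0-306-40615-7")
```

A single mistyped digit changes the weighted sum, so the check fails and the error is caught before anyone goes hunting for the wrong book.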

In a world of automated systems, we can pre-build a few standard indexes (for speed) and also allow text searching of the whole index record.
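
As a rough illustration of that last point, here is a minimal sketch of a few pre-built indexes plus a text search over the whole record. The records, the field names, and the call numbers are all hypothetical; a real catalog system would use a proper record format and database.

```python
from collections import defaultdict

# Hypothetical catalog records; field names and call numbers are made up.
records = [
    {"call_number": "QH 541 .S6", "author": "Smith",
     "title": "River Ecology", "subjects": ["ecology", "rivers"]},
    {"call_number": "RA 569 .J6", "author": "Jones",
     "title": "Radiation and Health", "subjects": ["radiation", "health"]},
    {"call_number": "QH 543 .A3", "author": "Adams",
     "title": "Radiation Ecology", "subjects": ["radiation", "ecology"]},
]


def build_index(records, key):
    """Map each value of one field to the call numbers that carry it."""
    index = defaultdict(list)
    for rec in records:
        values = rec[key]
        if isinstance(values, str):
            values = [values]
        for value in values:
            index[value.lower()].append(rec["call_number"])
    return index


def text_search(records, term):
    """Slow path: scan every field of every record for the term."""
    term = term.lower()
    hits = []
    for rec in records:
        haystack = " ".join([rec["author"], rec["title"], *rec["subjects"]])
        if term in haystack.lower():
            hits.append(rec["call_number"])
    return hits


# The electronic equivalent of the author and subject card drawers.
by_author = build_index(records, "author")
by_subject = build_index(records, "subjects")
```

Looking up by_subject["ecology"] is a single dictionary access, while text_search has to read every record. That is the speed trade-off behind pre-building the standard indexes.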

All of these indexes (manual or automated) give just enough information to decide the work is probably relevant and enough coding to find a copy in the open shelving.

4. Finding Information

4.1. Instant

There is no substitute for memory. Of course this works only for really important stuff, like your spouse's birthday and (depending on your profession) the isotopic distribution of C14, the phone number of your travel agent, the turnout in precinct X during the 2002 elections, etc.

4.2. Really quick

You should have some texts and reference books at hand, and key URLs on your browser bookmarks.

4.3. 5 Minutes in the Library

You barely have time to call or get to the local library. Walk over to the Reference Librarian and ask for help.

Reference Librarians are the master indexes of human knowledge. They know some stuff off the tops of their heads. They know the reference books inside and out. They probably have browsed the open shelves many times and generally know which books to grab for what kinds of problems. They live and breathe the local catalogs and know the online search services and (these days) web engines. They know who to call for help (other libraries around the world, subject specialists, etc.).

I played that role for 5 years. It took 3 years to ramp up, leaving 2 years at peak efficiency. I was world-class in biomedical and environmental topics. Yet if I needed to get answers in that field today, I'd go find a subject-specific Reference Librarian.

Of course, when you first walk in, you and the Librarian are strangers. There is a social courtship ritual where you give a little and the Librarian tries to guess what the heck you really need. If you want to save time later, go in and talk with the Librarian when you are not under pressure. Explain your general field of interest, and ask to see what the local collection has to offer. You can see the Librarian in action and he/she can grasp your context. Later, when you come in with an urgent problem, you can both jump right to the task.

Even then, there is a tendency to ask for what you think is the answer instead of explaining the root problem. Reference Librarians learn to detect these miscues. I found that I could start by getting an immediate answer for the asked question, and then ask "What is the context?" Often this was when the real search interview began. (Asking "What is the context?" at the outset is off-putting for people who are still at the courtship phase.)

4.4. Several hours available

The classic library search: Search the catalog for subjects vaguely like what you want. Get the shelving code for what looks like a good book. Go to the shelves and find the area. Look for your book. It isn't there. Browse a couple of nearby books. Pick one to check out. Go home. Scan it. It wasn't what you wanted.

A better approach: Ask the Reference Librarian if there are any annotated bibliographies on the topic. That is a shortcut. If you are on your own, once you find a good work in the catalog, notice its subject terms and search those. Keep branching until the leads peter out. Write down the shelving codes of good candidates -- you may find they are in widely separate sections of the library.

Go to the areas and browse within 3 feet each direction from your target book. Classification schemes are not perfect, but they do usually get you in the ballpark.

You can also use other readers as a guide to your search:

Notice books that have multiple editions. That is an indication of a knowledgeable audience's (and thus a publisher's) trust in the author and the work. Search the catalog for recent editions.
Notice books that are smudged or worn. That is an indication of a well-used book, probably worth examining. By contrast, a new-looking book which is several years old may be a loser. Then again, it may have been so popular that the library had several copies, or had to buy a new one when the old one wore out.

Ok, you know what you want. If it is in your hands right now, check it out. The only exception is if you do not want the FBI knowing about it (see PATRIOT_ACT). In that case, read the book in the library. Personally, I make a point of checking out "those" books, to keep the 1st amendment alive.

If you found it in the catalogs but it wasn't on the shelves, ask to put it on hold or get via interlibrary loan.

This will generate maybe a dozen books to check out. You can be pretty sure not one of them is the perfect book for your needs. What you are doing is surveying a whole field of human effort. You want to know the players. You want to see how they refer to each other. You want to see what they all think is important and where they disagree.

Take notes on the ones that seemed exceptionally relevant or exceptionally well written. I just take enough to find the book again (author, title, publisher, date, and ISBN), plus comments.

4.5. Several days available

By now you know the terms of art in your topic area. You recognize some of the key authors. You are in the game but about 2-10 years out of date.

The problem is that it takes a long time to get a book written, edited, published, purchased, cataloged, and onto the shelf. During the interim, a fast moving topic will have changed ideas. Maybe the best books are about to come out with new editions. It is time to shift to more current literature.

4.5.1. Text Books

About this point, I head to a university bookstore and browse the graduate text sections for the field. I'm looking for a good treatment with solved problems (so I can do it without a professor). These tend to cost maybe $100, but are worth it if you want to ramp up in a field. Of course we may be moving beyond "a few days" into "a few months".

Libraries seldom try to keep up with text books. They go in and out of favor too quickly. Also, if it is a good one, someone is bound to check it out for the whole quarter, or steal it outright.

4.5.2. Reviewed Literature

This is the world of reports, journals, conference proceedings, etc. Either the publisher or the author has a reputation to uphold and makes a conscious effort to assure quality. That doesn't mean there can't be hidden agendas or outright mistakes. But it does mean that getting caught has consequences.

In the olden days we used printed indexes. Some were massive -- maybe a shelf per year. This was one of the first areas to be automated in libraries, back in the 1960's and 1970's. These days you do it all with on-line pre- and post-coordinate indexing.

Indexing is expensive. Some services are free or covered by library overhead. Others you pay for.

Indexing on this scale brings massive response lists. You need to tune the search so that you get adequate recall (percent found of all available relevant items) and precision (percent of those found which are relevant).
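
The two measures are easy to compute once you know (or estimate) which items are truly relevant. A minimal sketch, with made-up article identifiers; in practice the full relevant set is rarely known and recall has to be estimated.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Recall: share of all relevant items that the search found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Precision: share of the found items that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 5 articles came back, 4 relevant ones exist.
retrieved = ["a1", "a2", "a3", "a4", "a5"]
relevant = ["a2", "a4", "a6", "a7"]
recall, precision = recall_and_precision(retrieved, relevant)
```

Here two of the four relevant articles were found (recall 0.5) and two of the five retrieved were relevant (precision 0.4). Broadening the search terms usually raises recall at the cost of precision, and narrowing them does the reverse.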

Given a list, you tick-mark the relevant articles and then try to find them.

In a big university library, you may find the actual articles and can make copies (while paying copyright fees of course). Otherwise, you will need to order them through the library system.

Once you have the article, scan it for relevance. Check the bibliography or list of citations. Find even more relevant books, journal articles, etc ("citation chasing"). You can also run searches in the reverse order: Find which other articles cite the one you found interesting ("citation indexing").
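
Both directions can be pictured as walks over the same citation graph. The sketch below uses a tiny hypothetical set of articles (A, B, C, D); the function names are my own, and real citation indexes are of course far larger and messier.

```python
from collections import defaultdict

# Hypothetical data: each article maps to the works in its bibliography.
cites = {
    "A": ["B", "C"],
    "B": ["C"],
    "D": ["A", "C"],
}


def build_citation_index(cites):
    """Reverse the graph: for each work, list the articles that cite it."""
    cited_by = defaultdict(list)
    for article, refs in cites.items():
        for ref in refs:
            cited_by[ref].append(article)
    return cited_by


def chase(start, cites, depth):
    """Citation chasing: follow bibliographies outward, up to depth hops."""
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for article in frontier:
            for ref in cites.get(article, []):
                if ref not in seen:
                    seen.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    seen.discard(start)
    return sorted(seen)


cited_by = build_citation_index(cites)
older_work = chase("D", cites, depth=2)   # what D builds on
newer_work = cited_by["C"]                # who built on C
```

Chasing leads backward in time to the sources, while the citation index leads forward to later work that used the article.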

Given a stack of relevant articles, read for premise and method. Often you can toss out grant-fodder articles which from their method are seen to be too simplistic or flawed to be useful. If you accept the method, then look at the conclusions. If that confirms what you've learned elsewhere, you can assume the detailed discussion probably holds no surprises. If the conclusions elicit "What the heck...?", read the full article. At this point you have a learning opportunity -- something genuinely new is about to enter your mental space. You probably want to do citation index searches to find who else read the article, and see if anyone reproduced the results.

4.5.3. Other

You have done your homework. You are willing to let anything at all flow over your neural nets, confident you know enough to survive the experience. You are ready for trade rags, websites, and other venues. For all you know, they are specifically constructed by your enemies to confuse and mislead you. So what? You can handle it.

You are still learning the players. Find the relevant magazines. Sit down and scan 2 years worth of each. You are looking for trends and hot topics in the cover stories, the ads, and the choice of articles.

Ads are particularly useful. It is easy to spoof articles, with impressive but fake credentials for the authors and pompous text. It is a lot harder to spoof full companies and product lines. Some serious thinking goes into a business plan, and ads are the best way those thinkers have to strut their stuff. Look for features where competitors compete (and on what metrics), and look for features where a single vendor claims unique advantage.

Do broad-ranging web searches. You are looking for tutorials and unifying central sites. These in turn link to more specialized sites.

4.6. On-going

Once you are up-to-speed in a field, you can choose to walk away, or can try to keep current. Libraries can provide monthly searches tuned to your criteria. They can notify you of new books in a field. Web sites and email lists can let you know of day-to-day changes. Newsgroups can put you in touch with key players in the field.

Mostly what you want from libraries at this point are on-going subscriptions to the key journals and magazines, and an on-going acquisition plan for buying new books in your field. In tight budget times, libraries sacrifice book purchase to keep the journal collections intact. Hmmm. Maybe it is time to get involved in politics.

5. Computing

I can only give illustrative links here. In a sense the whole web is an extension of the library paradigm.

5.1. Automating the Library

5.1.1. Systems

http://www.oss4lib.org/

5.1.2. Indexes

The first part to go was indexing for the journal and report literature. From KWIC (Key Word In Context) in the 1950's to the bibliographic databases like DIALOG and MEDLINE in the 1960's and 1970's, this is a major resource.
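
To give a feel for how simple the KWIC idea is, here is a minimal sketch: every significant word of every title becomes an index entry, shown with its surrounding context and sorted by keyword. The stop-word list, function names, and sample titles are my own assumptions, not those of any historical system.

```python
# Words too common to be useful index entries (an assumed, tiny list).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def kwic_entries(titles):
    """Return (left context, keyword, right context), sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            entries.append((left, word, right))
    entries.sort(key=lambda entry: entry[1].lower())
    return entries


def format_kwic(entries, width=24):
    """Line the keywords up in a column, as in a printed KWIC index."""
    return [
        f"{left[-width:]:>{width}}  {word}  {right[:width]}"
        for left, word, right in entries
    ]


titles = ["Automating the Library", "Indexing of the Journal Literature"]
entries = kwic_entries(titles)
lines = format_kwic(entries)
```

Printing the lines gives one row per keyword, with the keywords aligned so the eye can run down the column just as it would in a printed index.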

5.1.3. Cataloging

First libraries automated generation of catalog cards. Then automated the catalogs themselves. Then applied the same indexing technologies from journal/report collections to book collections.

http://www.oclc.org/

5.1.4. Persistent storage

Once secondary storage was cheap enough, libraries used it for full-text storage. This depends mainly on mass consumer culture leading the way, but some efforts are consciously formed by library paradigm.

http://www.gutenberg.org/

5.1.5. Distributed authoring

The web is a distributed authoring mechanism. Some areas in the web are consciously formed by library paradigms.

5.1.6. Distributed access

Given machine-sensible data, and given the internet, the next step is web-access to the publications themselves. The big stumbling point here is the purchase of the US Congress by powerful "owners of intellectual property".

Never mind that model-sharing is the essence of humanity. Never mind that the "property" was strip-mined from a common shared culture, built by countless generations of thinkers. Never mind that the US Founding Fathers thought copyright was ok for a few years but then works should be provided to the community at large. We are now in a Disney-purchased world of ever-longer copyright periods.

Some communities are explicitly pushing back by accepting copyright as-is and then making the licensing agreements open ended. E.g.:

Software: http://www.gnu.org/
Information: http://en.wikipedia.org/wiki/Main_Page
Art: http://creativecommons.org/
Science: http://www.openscience.org/
Engineering: http://opencollector.org/

It would be fair to say that Richard Stallman (GNU) was the driving force for this push back. He sparked the Open Source Software movement, which led to a set of licenses (not all approved by Stallman), codified at: http://www.opensource.org/

GNU and Linux demonstrated that this was a powerful and practical meme. This in turn inspired the others.

Librarians took to the idea like ducks to water. Open access was the basic premise of their profession, and they had centuries of cultural memes which applied directly to the computerized world. This cultural heritage is also the basis for the reaction of the library world to corporate/fascist efforts to control/limit access to information, such as ever-extended copyright, DMCA, and PATRIOT ACT.

5.2. Algorithms

Once you have a corpus of data in machine readable form, you can search it. Thus, as libraries (and commercial databanks) grew more computerized, algorithms which had seemed intellectual exercises at one time took on major importance.

These probably started in the intelligence services with analysis of electronic traffic, and thus language translation and message-understanding. Then turned to databases, especially to serve financial institutions (credit analysis) and retail (customer buying habits analysis).

As these grew, we learned more about the complexity of "knowledge", and tried to do sophisticated expert systems. Then learned that common sense was crucial to understanding real data in the real world. Then shifted to statistical approaches.

With the explosion of the web, search engines became the driver. Yahoo consciously follows the library cataloging paradigm. Google applies statistical analyses. They all use indexes and data structures designed for full-text searching.
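
The workhorse data structure for full-text searching is the inverted index, which maps each word to the documents containing it. What follows is only a toy sketch with made-up documents and my own function names; it does not describe how any particular search engine is actually built.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical three-document corpus.
docs = {
    1: "Libraries store books on open shelving",
    2: "Search engines index the web",
    3: "Libraries index books by subject",
}
index = build_inverted_index(docs)
both = search(index, "libraries books")
```

The index is built once, up front. Each query then costs a few set intersections rather than a scan of every document, which is the same bargain the card catalog made.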

Creator: Harry George
Updated/Created: 2005-06-07