INN Cookbook

 $Id: cookbook.pod,v 1.3 1999/11/10 14:47:02 esamsono Exp esamsono $
 By Elena Samsonova, <elena.inn@inter.nl.net>.

Position among the INN documentation

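From most general to most detailed, the INN documents are structured in the following way:

        Readme
        INN Cookbook (this document)
        INN Architecture Guide
        INN Implementation Guide
        Install document and manual pages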

Relation to other INN documentation

This is the INN Cookbook. It is meant to help you choose one of the various configurations that INN can assume, depending on your needs and possibilities. Once you have chosen a configuration, look it up in the INN Architecture Guide for an overview and explanation of its major parts. This knowledge will help you read the INN Implementation Guide, which describes various configuration aspects in detail. For still greater detail, please refer to the Install document and the manual pages. For something more general than what's in this document, please refer to Readme.

How to use it?

Simply answer the questions below to find the configuration most likely to suit your situation. Then continue with the INN Architecture Guide.

Disclaimer

The Cookbook does not relieve you from using your brain at all times. :-)


What the Heck Is News in the First Place?

Those of you who already know what news is and how Usenet works may safely skip this section. If, however, you run into difficulties with the rest of the document, it may help to come back here.

When your boss calls you into his office and tells you, "We want a news server and you are going to set it up", you may find yourself wondering what the heck news is, let alone a news server. Well, read on!


What It Is

News is a way to allow every single individual to tell the rest of the world what he thinks about it without having to kidnap CNN's CEO. News is a network of servers that exchange users' postings so that each server ends up with the same copy of the whole mass of them. Well, more or less. It is pretty much like buying your favorite newspaper in the East of the country and in the West of the country and discovering that the same issue contains exactly the same articles, so you don't have to rush to your local newsstand but can get it anywhere.

Now, in order to organize this mass of articles somehow and make it easier for people to find postings that interest them, the articles are broken down into news groups. The whole collection of news groups together with the rules of their creation and of their use is called Usenet.

When you look at each particular news server in the news network, you can see its article spool as a file system with directories, one separate directory for each news group. You can see separate articles as files in those directories.

As with every filing system, the news group structure is not flat. A flat structure would produce a list of tens of thousands of news groups, which would still be too hard to browse. Therefore news groups are organized in branched hierarchies. On the top level, the division is as follows:

The Big Eight international hierarchies:

        comp.*        - computers
        humanities.*  - humanities
        misc.*        - miscellaneous stuff
        news.*        - everything about news
        rec.*         - recreation
        sci.*         - science
        soc.*         - society
        talk.*        - serious talk

The Alternative news:

        alt.*         - alternative, everything is allowed here

Country specific hierarchies:

These are organized by country codes, for example:

        at.*          - Austria
        de.*          - Germany
        nl.*          - Netherlands
        tw.*          - Taiwan
        uk.*          - UK

Additional and/or commercial hierarchies:

Examples here include:

        bionet.*      - biology
        compuserve.*  - old CompuServe network news

Professional hierarchies:

These are set up by specific companies, for example:

        borland.*
        microsoft.*

In addition to these hierarchies, every news server can have its own local hierarchy usually called local.* which does not leave the news server.

As you have most probably guessed already, the dot-asterisks (.*) in the news group names indicate that there are branches in these hierarchies. For example, the news group that discusses news servers and offers help on INN is called news.software.nntp (and you can access it at http://www.dejanews.com if you don't have your own news server set up yet).

So how do you decide which hierarchies to carry? Typically, you either carry a full feed, that is, everything that exists in the world, or some subset of it. It is usually a good idea to carry the Big Eight hierarchies as well as your own country's hierarchy and perhaps those of some of your neighbors. The alt.* hierarchy is a rather controversial one. It is by far the largest of all, both in the number of groups and in the volume of postings, and easily constitutes 1/3 to 1/2 of Usenet's full feed. There is a lot of junk in it but there are a lot of good things too. It is certainly one of the most popular hierarchies and you are likely to get complaints if your users don't find it on your server. Don't worry, there are ways to exclude parts of hierarchies that you do not want to carry.
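
To make the idea of carrying a subset concrete, here is a minimal sketch in Python. It is not INN's own configuration syntax or matching code; it only uses the shell-style patterns of Python's fnmatch module to mimic the dot-asterisk notation. The pattern lists and the helper name is_carried are hypothetical and chosen purely for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical selection: the Big Eight, one country hierarchy and alt.*,
# minus the binaries part of alt.* that we choose not to carry.
CARRY = ["comp.*", "humanities.*", "misc.*", "news.*",
         "rec.*", "sci.*", "soc.*", "talk.*", "nl.*", "alt.*"]
EXCLUDE = ["alt.binaries.*"]


def is_carried(group, carry=CARRY, exclude=EXCLUDE):
    """Return True if the group matches a carried pattern and no excluded one."""
    # Exclusions win over inclusions, so check them first.
    if any(fnmatchcase(group, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(group, pattern) for pattern in carry)
```

With these lists, is_carried("news.software.nntp") is true, while a group under alt.binaries.* is rejected even though alt.* as a whole is carried.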


How It Works

It is really quite simple. You either have a stand-alone news server or one connected to Usenet. If you are stand-alone, you can only carry your own local hierarchies and cannot exchange any information with the outside world. This may be appropriate for corporate news servers.

If you are a part of Usenet, you are connected to one or more other news servers with which you exchange articles. They send you stuff and you send them stuff as well. You can be connected to Usenet in one of two modes: you can be either an internal node or a leaf. Consider the figure below:

Here the blue rectangles are internal nodes and the green ellipses are leaf nodes. Internal nodes exchange high-volume news feeds, that is, they pass the news around the world and thus make sure that everyone gets his copy. That is why the lines between them have no arrows. Leaf nodes receive a high-volume news feed but do not pass it on. They only send out their users' postings, which is a very insignificant amount compared to all the news. That is why their lines have arrows pointing to them.

So what you need to decide is whether you want to become an internal node and pass the whole mass of news around, or a leaf node and only send out your own users' postings.

News group creation and management issues are discussed in the INN Implementation Guide, but if you already have burning questions, you can read a document about the Big Eight creation process at http://www.eyrie.org/~eagle/faqs/big-eight.html . Other hierarchies have more or less similar rules and procedures.


What You Need Besides the Software

If you intend to run a stand-alone news server or an isolated network of news servers without connecting to Usenet, you don't need anything else.

If you want to exchange news with the rest of the world, you need to get connected to Usenet. The other news servers with which you exchange news are called news peers. As an absolute minimum, you need to find a peer who will accept your users' postings and forward them on to the network. No, I did not forget the feed: you can get that out of the sky these days from a news satellite. There are companies that provide these services.

If you don't want to deal with satellites, you will also need one or more (better more than one!) peers who will feed you news. The only requirement for setting up peers is that you have network connectivity to them and enough bandwidth to send and receive the feed. For a full news feed, count on some 3-4 Mb/s of sustained traffic these days, and this number is growing rapidly.
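
If you want to relate a sustained bandwidth figure to a daily feed volume, the arithmetic can be sketched as follows. This is only a rough conversion: it assumes that Mb/s means megabits per second, that units are decimal (1 GB = 8000 megabits), and that the traffic is spread evenly over the day, which real feeds are not. The function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def gb_per_day_to_mbit_per_s(gb_per_day):
    """Convert a daily feed volume in GB to an average rate in Mb/s.

    Assumes decimal units: 1 GB = 8000 megabits.
    """
    return gb_per_day * 8000 / SECONDS_PER_DAY


def mbit_per_s_to_gb_per_day(mbit_per_s):
    """Convert an average rate in Mb/s to a daily volume in GB."""
    return mbit_per_s * SECONDS_PER_DAY / 8000
```

Under these assumptions, 3 to 4 Mb/s sustained corresponds to roughly 32 to 43 GB a day. Leave yourself headroom above the average, since the feed does not arrive at a constant rate.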

So, if you still want to get into this, start defining your system in the next section.


INN Architecture Choice

For each of the sections that follow, calculate your score by adding up the points you get on each item. Find your section verdict at the end of the section. After you've gone through all sections, you can derive the final answer by combining the results of all sections.


News Feed

Expected feed size:

Please note that news feed size is growing rapidly so that it is impossible for me to give precise numbers here. This should still be a fairly good indication though.

        1 - non-binaries feed:              less than 10 GB a day 
        2 - major hierarchies feed:         around    30 GB a day
        3 - full feed from the whole world: more than 50 GB a day

Number of peers you get your feed from:

        1 - less than 5
        2 - 5 to 15
        3 - more than 15

Number of peers you send full feed to:

        1 - less than 5
        2 - 5 to 15
        3 - more than 15

Anti-spam filters:

        1 - filtering only your users' postings
        2 - filtering all feed, incoming and outgoing

Results:


News Server Usage

This section analyzes the type and intensity of usage that you expect.

Number of simultaneous readers:

        0 - no readers at all
        1 - less than 500
        2 - 500 to 2000
        3 - above 2000

Average usage type:

        0 - no usage
        1 - reading text articles via a modem connection
        2 - reading (err, watching) binary articles one by one via a fast 
            modem connection or ISDN
        3 - sucking binaries (downloading many articles in one batch) via 
            ISDN or a fat fixed line

Authentication:

See also section Authentication Methods.

        0 - no authentication (everyone who connects gets access)
        0 - IP-based authentication (access granted according to rules
            based on the users' origin IP addresses)
        1 - less than 1000 users for local authentication (using an INN
            configuration file or the system password file)
        2 - more than 1000 users for local authentication
        2 - authentication using an external database

Please note that this estimate is very coarse; your own situation may be different.

Results:


Scalability and Quality of Service

In this section you need to look at your scalability and quality of service requirements over some period of time. It is not always true that, when in doubt, you should take the larger system: larger systems require more maintenance and are more complex. On the other hand, not planning ahead may force you to start from scratch when it turns out that your existing configuration cannot handle growing requirements. Finding a good trade-off is the hardest part.

Expected growth of the number of news hierarchies you carry and/or of the number of peers with 100% of your feed selection:

        1 - the level will remain constant

Note that in absolute numbers this means that your news feed will still increase steadily. The tendency today is for the feed to nearly double every 6 months, but by the time you read this, it may be growing even faster.

        2 - number of peers will double
        3 - number of peers will increase 10 fold or more

Please take the closest value here.

        3 - number of hierarchies will double (or more)
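
To get a feeling for what "nearly double every 6 months" means for a constant selection of hierarchies, you can project the feed forward. The sketch below assumes steady exponential growth with a fixed doubling period, which is a simplification; the function name and the default of 6 months are taken from the remark above, not from any measurement.

```python
def projected_feed(gb_per_day_now, months_ahead, doubling_months=6):
    """Project the daily feed size, assuming it doubles every doubling_months."""
    return gb_per_day_now * 2 ** (months_ahead / doubling_months)
```

For example, projected_feed(30, 12) gives 120.0: a 30 GB a day feed would quadruple within a year if the trend holds.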

Required quality of service:

        0 - at busy hours, the server may refuse incoming connections

        1 - at busy hours, the users may experience a slow news server
            with a delay of up to 30 seconds before the article starts
            appearing on the PC

        2 - at busy hours, the users may experience a slight performance
            degradation with a delay of at most 10 seconds before the
            article starts appearing on the PC

        3 - at busy hours, the users must not experience any performance
            degradation; there must be no noticeable delay in article
            retrieval

Expected growth of number of simultaneous users assuming the same user profile or expected migration of users to heavier profiles:

        0 - will not grow
        1 - will double
        2 - hard to predict (but you assume it will grow real fast)

Results:


Final Verdict

If your scalability requirements are heavy, you should consider building a larger system right away because the one you'd choose according to your current requirements is guaranteed not to be able to cope with your new requirements. Alternatively, you can opt for a medium size system that is too large for your current requirements and somewhat too small for your new requirements but which can scale better.

If you have light scalability requirements, you can stick to the smallest option. If you have regular requirements, you should probably read about the smallest option and the next larger one and decide which way to go.

The table below gives an indication of the type of architecture most suitable for your requirements. There exist two basic types of architecture: single server and distributed (consisting of multiple servers). Please refer to the INN Architecture Guide for details on these architectures.

Server types are rated as light, medium and heavy, and regardless of the actual computer manufacturer and OS (as long as it is a Unix) are graded as follows:

Disk sizes are not included in the server rating because they depend on the amount of articles that you want to keep and are discussed in section Article Spool Considerations.

                    f e e d   r e q u i r e m e n t s :
                  light            medium             heavy
             .....................................................
     none    :  single light     single medium      single heavy
             :  server           server             server
             :
 u   light   :  single medium    single heavy       single heavy 
             :  server           server             server
 s           :
     medium  :  single heavy        d i s t r i b u t e d :
 a           :  server           medium feeder      heavy feeder
             :                      single medium reader or
 g           :                      multiple light readers
             :
 e   heavy   :              d i s t r i b u t e d :
             :  light feeder     medium feeder      heavy feeder
             :        m u l t i p l e   r e a d e r s
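
If you prefer to read the table as a lookup, it can be transcribed as shown below. This is just the table above restated in Python; the names ARCHITECTURE and recommend are invented for this sketch, and the ratings light, medium and heavy are whatever you derived in the previous sections.

```python
READERS_MEDIUM = "single medium reader or multiple light readers"

# Keys are (usage requirements, feed requirements), as in the table above.
ARCHITECTURE = {
    ("none", "light"): "single light server",
    ("none", "medium"): "single medium server",
    ("none", "heavy"): "single heavy server",
    ("light", "light"): "single medium server",
    ("light", "medium"): "single heavy server",
    ("light", "heavy"): "single heavy server",
    ("medium", "light"): "single heavy server",
    ("medium", "medium"): "distributed: medium feeder, " + READERS_MEDIUM,
    ("medium", "heavy"): "distributed: heavy feeder, " + READERS_MEDIUM,
    ("heavy", "light"): "distributed: light feeder, multiple readers",
    ("heavy", "medium"): "distributed: medium feeder, multiple readers",
    ("heavy", "heavy"): "distributed: heavy feeder, multiple readers",
}


def recommend(usage, feed):
    """Look up the suggested architecture for the given usage and feed ratings."""
    try:
        return ARCHITECTURE[(usage, feed)]
    except KeyError:
        raise ValueError("unknown rating: usage=%r, feed=%r" % (usage, feed))
```

For instance, recommend("light", "medium") returns "single heavy server". Treat the answer as a starting point for reading the INN Architecture Guide, not as a final decision.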


Additional Information


Article Spool Considerations

This section helps you determine the amount of disk space you will need for your system depending on the feed you expect to receive. The INN Implementation Guide and the Install document contain detailed information on the choice of a spooling method and its configuration.

First, you need to decide on the quality level of the news service you are going to provide. Among other concerns, the length of time that you keep articles on your server is an important parameter. Consider that many people only really have the time to read news during the weekend, so if you keep the articles for anything shorter than 7 days, your users will miss stuff. 10 days would give them a nice overlap while 15 days would ensure that they can miss a weekend and still get all the news.

On the other hand, the longer you want to keep articles on the server, the more disk space you need. For example, keeping 10 days' worth of binary pictures would require anywhere between 200 GB and 500 GB of disk space, and this is probably not what you want. So what to do?

The good news is that you don't have to keep all the articles for the same period of time. You can set up a fairly fine-grained configuration that specifies, down to the individual news group, how long articles should be kept. This will allow you to keep text groups longer than binaries, for example. See the INN Implementation Guide for sample configurations that can give you ideas.

So, at this point in order to determine the disk space you will need, use one of the following simple formulas:

full feed (text and binaries)

  total disk space = ( GB per day * 1/4 ) * days you keep text articles +
                     ( GB per day * 3/4 ) * days you keep binaries      +
                     10 GB for supporting files

text feed only

  total disk space = GB per day * days you keep text articles +
                     6 GB for supporting files

Note that the values you will get here are only approximate and are not significantly better than an educated guess, but they do give you an indication. Make sure that you can add disk space as needed in case your estimate turns out to be wrong.
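
The two formulas above are easy to put into a small script so that you can play with the retention periods. The sketch below is a direct transcription; the function and parameter names are invented for this example, and the 1/4 text to 3/4 binaries split and the 10 GB and 6 GB allowances for supporting files are simply the values from the formulas.

```python
def full_feed_disk_gb(gb_per_day, text_days, binary_days, support_gb=10):
    """Disk space in GB for a full feed (text and binaries)."""
    text = gb_per_day * 0.25 * text_days        # a quarter of the feed is text
    binaries = gb_per_day * 0.75 * binary_days  # the rest is binaries
    return text + binaries + support_gb


def text_feed_disk_gb(gb_per_day, text_days, support_gb=6):
    """Disk space in GB for a text-only feed."""
    return gb_per_day * text_days + support_gb
```

For example, a 50 GB a day full feed with text kept for 10 days and binaries for 3 days gives full_feed_disk_gb(50, 10, 3) = 247.5 GB, and a 10 GB a day text feed kept for 15 days gives text_feed_disk_gb(10, 15) = 156 GB.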


Authentication Methods

Authentication can demand some serious resources. Therefore it is important to determine whether you will need authentication or not, and on what scale.

No authentication is the lightest type you can have. It permits anyone to use your service. If you are setting up a news server within a well defined network segment, you can disable authentication on the server itself and enable routing filters and/or firewalls on your network instead which will ensure that only your own users can access the server.

If you cannot deploy routers and firewalls to achieve this but your users' IP addresses fall within a known range, you can use IP address authentication on the news server. This is a lightweight method and will not impair performance. This method uses users' IP addresses to determine who is allowed in.

A variant of this is to resolve the users' IP addresses and use patterns of their domains to determine access. This is handy if your IP addresses do not fall into one cluster and naming them all would create a messy configuration. This method is slightly heavier than the previous one because of the DNS lookups.
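
The logic behind these two methods can be illustrated in a few lines of Python. This is not how INN implements or configures access control (see the INN Implementation Guide for that); it only shows the idea of matching an address against ranges and a resolved host name against domain patterns. The address ranges, the domain pattern and the function names are placeholders, and the host name is assumed to have been obtained by a reverse DNS lookup elsewhere.

```python
import ipaddress
from fnmatch import fnmatchcase

# Placeholder ranges and domain patterns; substitute your own.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24"),
                    ipaddress.ip_network("198.51.100.0/25")]
ALLOWED_DOMAINS = ["*.example.com"]


def ip_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if the client address lies in one of the allowed ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


def host_allowed(hostname, patterns=ALLOWED_DOMAINS):
    """Return True if an already resolved host name matches a domain pattern."""
    # Host names are case-insensitive and may carry a trailing dot.
    name = hostname.lower().rstrip(".")
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

The first check needs nothing but the address of the incoming connection, which is why it is so cheap. The second one depends on a DNS lookup for every new connection, and that lookup is where the extra cost comes from.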

If, however, you require that users be able to connect from any IP address, you cannot use either of the two methods listed above and need to look at some user name and password based authentication. Here you can consider maintaining a configuration file on your server, adding all the users to its local password file or having it connect to an external database instead. The latter option does not come out of the box with INN but can be installed with a reasonably small effort, so don't discard it right away.

Any of these authentication methods requires some extra performance from your machine. It is extremely difficult to predict just how much CPU time and memory it will need, so be prepared to scale your system if necessary.

The INN Implementation Guide provides some samples of different configurations.


What's Next?

Now that you have determined the scale of your future system and have given some thought to the service that you are going to provide, go on to the INN Architecture Guide and select the proper architecture for your system.