Teaching:TUW - UE InfoVis WS 2005/06 - Gruppe G7 - Aufgabe 3


Topic

Webserver Logfile Visualization

Area of Application

Analysis of Application Area

General Description

Web servers typically generate logfiles containing large amounts of information on page accesses, client software, type of access, and more. Analysis tools such as AWStats use this information to present simple statistics, mostly as tables or simple bar graphs. Unfortunately, these tools are limited and "low-level" in how they represent the information. More interesting questions cannot be answered with such tools: user behavior in combination with site structure, dead ends, changes in behavior over time (evolution), typical behavioral patterns, groups of users that share similar behavioral patterns, site entry points, or intrusion detection. Dealing with these topics requires more advanced visual tools that reveal this information. [Behlendorf et al., 2005]

Special Issues

There are many known solutions and methods for visualizing web server logfiles. Most of them rely on simple charts or bar diagrams. The question is whether this huge set of information can be presented in a single diagram that illustrates all the data in a simple and understandable way.

Analysis of the Dataset

Here is an example of such a logfile entry [Cooper, 2004]:


205.218.110.166 - - [08/Dec/1996:15:02:10 -0800] "GET /info/index.html HTTP/1.0" 200 14912 "http://www.yourcompany.com/index.html " "Mozilla/3.0Gold (Win95; I)" "35bebd61b31211cfbdcd00c04fd611cf"


The content of this entry explained, from left to right:

"205.218.110.166" - This is the IP address of the machine making a request of your web server. Its domain name can be determined in HitList by enabling reverse DNS lookups, assuming your server hasn't put this information in already (many do, some don't). If the domain name were in there, you'd see it instead of the raw IP.

"-" - this first dash stands for the identity of the client as reported by identd (RFC 1413), which most NCSA format servers don't insert by default.

"-" - this second dash stands for the authenticated username, which again many NCSA format servers don't insert by default.

"[08/Dec/1996:15:02:10 -0800]" - This is the date and time of the access, including the offset from Greenwich Mean Time. The latter is the "-0800", meaning the web server's local time is 8 hours behind GMT.

"GET /info/index.html HTTP/1.0" - This is the actual request the visitor's browser made: the method (GET), followed by the requested path (/info/index.html) and the protocol.

"HTTP/1.0" refers to the protocol and its version, here version 1.0 of the HTTP protocol.

"200" - this is the server response code - a "successful" request (meaning the visitor's browser loaded the entire HTML/GIF/JPEG, etc.) generates a response code of 200. Others include:

  206 - Partial request successful (not complete)
  302 - URL has been redirected to another document
  400 - Bad request was made by the client
  401 - Authorization is required for this document
  403 - Access to this document is forbidden
  404 - Document not found
  500 - Server internal error
  501 - Application method (either GET or POST) is not implemented
  503 - Server is out of resources

"14912" - This is the number of bytes transferred to the client in response to this request. Since every request has some response, even erroneous requests will usually have a non-zero value for this field.

"http://www.yourcompany.com/index.html" - This is the referrer field, or the page the visitor was on immediately prior to making this entry's request. In this case, the person was looking at the index.html page (probably the home page) before going to the /info/index.html page in this entry.

"Mozilla/3.0Gold (Win95; I)" - This is the user-agent field, meaning the actual browser and OS used by the visitor. In this case, Mozilla is Netscape, the next value is the version (here, 3.0Gold), and the final value is the OS it was running on (Windows 95).

Finally, the "35bebd61b31211cfbdcd00c04fd611cf" is the cookie information, which may or may not be there, depending on whether the webserver used has cookies enabled and whether one was passed from webserver to the visitor's computer.
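
The field-by-field description above can be turned into a small parser. The following is a minimal sketch, assuming the entry follows the NCSA combined layout with an optional trailing cookie field as in the example; the pattern, the group names and the function `parse_entry` are illustrative and not taken from any of the cited tools.

```python
import re
from datetime import datetime

# Pattern for the example entry: NCSA combined format plus an optional cookie.
# The group names are illustrative choices, not part of any standard.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
    r'(?: "(?P<cookie>[^"]*)")?'
)


def parse_entry(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Some servers write "-" when no body was sent; treat that as 0 bytes.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # %z parses offsets such as -0800; %b assumes English month abbreviations.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    if entry["referer"] is not None:
        entry["referer"] = entry["referer"].strip()
    return entry


example = (
    '205.218.110.166 - - [08/Dec/1996:15:02:10 -0800] '
    '"GET /info/index.html HTTP/1.0" 200 14912 '
    '"http://www.yourcompany.com/index.html " '
    '"Mozilla/3.0Gold (Win95; I)" "35bebd61b31211cfbdcd00c04fd611cf"'
)
entry = parse_entry(example)
print(entry["host"], entry["status"], entry["bytes"], entry["path"])
```

Other logfile formats would need their own pattern; the sketch only covers the layout of the example entry.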


There are several different logfile formats, for example: Microsoft IIS 3.0 and 2.0, Microsoft IIS 4.0 (W3SVC format), Netscape (NCSA format with/without unique format header), Lotus Domino format, O'Reilly WebSite format...


Target Group

Identifying the Target

We identified the following Target groups:

  1. Software Companies: especially those who develop browsers and web based applications
  2. Web Designers
  3. Administrators
  4. Advertising Companies
  5. Web Users

Special Issues of the Target Group

Software Companies: To optimize their software, they have to explore the users' needs. Logfiles can offer them useful information.

For Web Designers it can be helpful to know some facts about their visitors. Logfiles tell them which browsers visitors use, when they access the site, how much traffic there is, and so on. This information allows them to organize the web page in a convenient and reliable way. The same applies to Web Administrators.

Advertising Companies strive to study user behaviour in order to target their ads more effectively.

Known Solutions / Methods (related to the target group)

  1. Radial Tree Viewer
  2. Anemone
  3. Internet Cartographer
  4. WebTracer
  5. WebHopper
  6. WebPath
  7. The Chicago Tribune Website
  8. Visualizing the online debate on the European Constitution
  9. Mercator

...

Intended Purpose

Goals and Objectives

  1. Which client software is used, in particular which browsers
  2. How users behave in combination with site structure, dead ends, and changes in behavior over time
  3. Identify groups of users who act in a similar way.
  4. Identify "hot" topics
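
One possible way to approach goal 3 is to compare the sets of pages that visitors requested. The sketch below is an assumption rather than part of the proposed design: it treats each IP address as one visitor and uses the Jaccard index as the similarity measure; the function names and the threshold of 0.5 are chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def pages_by_visitor(entries):
    """Collect the set of requested paths per visitor (here: per IP address)."""
    visited = defaultdict(set)
    for host, path in entries:
        visited[host].add(path)
    return visited


def jaccard(a, b):
    """Share of pages two visitors have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)


def similar_pairs(visited, threshold=0.5):
    """Return all visitor pairs whose page sets overlap at least `threshold`."""
    return [
        (u, v)
        for u, v in combinations(sorted(visited), 2)
        if jaccard(visited[u], visited[v]) >= threshold
    ]


# Illustrative (host, path) pairs as they might come out of a log parser.
entries = [
    ("10.0.0.1", "/"), ("10.0.0.1", "/info/"),
    ("10.0.0.2", "/"), ("10.0.0.2", "/info/"), ("10.0.0.2", "/news/"),
    ("10.0.0.3", "/shop/"),
]
pairs = similar_pairs(pages_by_visitor(entries))
print(pairs)
```

Pairs found this way could serve as input for a grouping step; several users behind one proxy share an IP address, so the per-IP assumption is only a rough approximation.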

Problems and Tasks to Solve

- Improve the Servers: Administrators, Web Designers and Software Companies need this information to improve the servers. They need to know what people are interested in, whether there are groups of users with similar behavior who navigate to the same topics in a similar way, and which topics are searched for often, and how. With this information, they can help users get what they want faster and more easily, and adapt the server to the users' needs.
To manage a web server effectively, it is necessary to get feedback about the activity and performance of the server as well as any problems that may be occurring.

- Adapt Advertising to Users' Interests: Advertising Companies can use the information to identify topics and products of interest to a group of users. They get an overview of the groups of things people are interested in, so they can adapt the topics of their advertisements accordingly.


Proposed Design

Types of Visualization Applied

Visual Mapping

(Data dimension => Attribute)

The log contains the following information:

IP, date, request, protocol, response code, bytes, referer, user agent, cookie


From a given IP it is possible to determine the country of origin, so IPs are mapped to the world map. IPs whose country cannot be determined are listed as <Unknown>.
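
The mapping from IP to country could be sketched as follows. The lookup table `NETWORK_TO_COUNTRY` is hypothetical: a real tool would load the ranges from a GeoIP database, and the networks below are documentation addresses (RFC 5737) assigned to countries purely for illustration.

```python
import ipaddress
from collections import Counter

# Hypothetical lookup table: network -> country. The ranges are documentation
# addresses (RFC 5737) and the country assignment is made up for this sketch.
NETWORK_TO_COUNTRY = {
    ipaddress.ip_network("192.0.2.0/24"): "Austria",
    ipaddress.ip_network("198.51.100.0/24"): "Germany",
}


def country_of(ip_string):
    """Map an IP address to a country name, or "<Unknown>" if no range matches."""
    try:
        address = ipaddress.ip_address(ip_string)
    except ValueError:
        # The host field may hold a domain name instead of a raw IP.
        return "<Unknown>"
    for network, country in NETWORK_TO_COUNTRY.items():
        if address in network:
            return country
    return "<Unknown>"


hosts = ["192.0.2.17", "198.51.100.4", "192.0.2.99", "203.0.113.5", "example.org"]
requests_per_country = Counter(country_of(h) for h in hosts)
print(requests_per_country)
```

The resulting counts per country are what the world map would encode, for example as color intensity.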

The date is mapped to the date lens slider. After a date range is chosen, the corresponding numbers and entries change. Above the date slider, the user will find bars showing the bytes transferred in this date range.
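
The data behind the date slider and its bars could be computed roughly like this. It is a minimal sketch, assuming one bar per calendar day; the sample entries and the function `bytes_per_day` are illustrative.

```python
from collections import defaultdict
from datetime import date, datetime

# Illustrative (timestamp, bytes) pairs as they might come out of a log parser.
entries = [
    (datetime(2005, 10, 17, 9, 30), 14912),
    (datetime(2005, 10, 17, 18, 5), 2048),
    (datetime(2005, 10, 18, 11, 0), 512),
    (datetime(2005, 10, 20, 23, 59), 4096),
]


def bytes_per_day(entries, start, end):
    """Sum transferred bytes per calendar day for start <= day <= end."""
    totals = defaultdict(int)
    for timestamp, size in entries:
        day = timestamp.date()
        if start <= day <= end:
            totals[day] += size
    return dict(sorted(totals.items()))


# The chosen slider range; entries outside of it are ignored.
selected = bytes_per_day(entries, date(2005, 10, 17), date(2005, 10, 18))
for day, total in selected.items():
    print(day.isoformat(), total)
```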

The request is mapped onto the "web site" net in the middle. Further calculation is necessary to match /file/ to /file/index.php and /file/index.php?somesessionid=2139123123. Only successful requests are listed here (e.g. response code 200).
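
The matching step could look like the sketch below. It assumes that a fixed list of index file names identifies directory pages and that "successful" means exactly status 200; `INDEX_FILES`, `normalize_path` and `successful_pages` are names introduced here for illustration.

```python
from urllib.parse import urlsplit

# Assumption: these file names are served when a directory is requested.
INDEX_FILES = ("index.php", "index.html", "index.htm")


def normalize_path(request_path):
    """Reduce a requested path to one canonical key per page."""
    path = urlsplit(request_path).path  # drops the query string and fragment
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # "/file/index.php" -> "/file/"
            break
    return path or "/"


def successful_pages(requests):
    """Keep only requests answered with status 200 and normalize their paths."""
    return [normalize_path(path) for path, status in requests if status == 200]


requests = [
    ("/file/", 200),
    ("/file/index.php", 200),
    ("/file/index.php?somesessionid=2139123123", 200),
    ("/missing.html", 404),
]
pages = successful_pages(requests)
print(pages)
```

All three variants of the page collapse to the same node of the net, and the failed request is dropped.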

If the referer is a URL from within the server, the "web site" net draws a connection between the two pages. If the page is referred from the outside, the referring page is listed in the left-hand side panel "external referers".
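
A minimal sketch of this distinction, assuming the server's own host names are known in advance; the set `OWN_HOSTS` and the function `classify_referers` are hypothetical names used only here.

```python
from urllib.parse import urlsplit

# Assumption: host name(s) under which the analysed server is reachable.
OWN_HOSTS = {"www.yourcompany.com"}


def classify_referers(entries):
    """Split (referer, path) pairs into internal edges and external referers."""
    edges = set()
    external = set()
    for referer, path in entries:
        referer = referer.strip()
        if not referer or referer == "-":
            continue  # no referer was sent with this request
        parts = urlsplit(referer)
        if parts.hostname in OWN_HOSTS:
            # Internal link: connect the referring page with the requested page.
            edges.add((parts.path or "/", path))
        else:
            external.add(referer)
    return edges, external


entries = [
    ("http://www.yourcompany.com/index.html ", "/info/index.html"),
    ("http://search.example.org/?q=info", "/info/index.html"),
    ("-", "/index.html"),
]
edges, external = classify_referers(entries)
print(edges)
print(external)
```

The edges feed the connections of the "web site" net, while the external set fills the "external referers" panel.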


The user agent is parsed for operating system and browser information. Each is listed on the right with checkboxes, so the user can exclude certain browsers or operating systems from the data set or include them again.
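
User-agent parsing and the checkbox filter could be sketched with simple keyword matching. The keyword tables below are heuristic, illustrative and far from complete, and the function names are assumptions made for this example.

```python
# Heuristic keyword tables; order matters because many agents contain "Mozilla".
BROWSER_KEYWORDS = [
    ("Opera", "Opera"),
    ("MSIE", "Internet Explorer"),
    ("Firefox", "Firefox"),
    ("Safari", "Safari"),
    ("Mozilla", "Netscape/Mozilla"),
]
OS_KEYWORDS = [
    ("Win", "Windows"),
    ("Mac", "Mac OS"),
    ("Linux", "Linux"),
    ("X11", "Unix"),
]


def first_match(agent, table):
    """Return the label of the first keyword found in the agent string."""
    for keyword, label in table:
        if keyword in agent:
            return label
    return "<Unknown>"


def parse_agent(agent):
    """Return a (browser, operating system) pair for a user-agent string."""
    return first_match(agent, BROWSER_KEYWORDS), first_match(agent, OS_KEYWORDS)


def filter_entries(entries, browsers, systems):
    """Keep entries whose browser and OS are both ticked in the checkboxes."""
    kept = []
    for entry in entries:
        browser, system = parse_agent(entry["agent"])
        if browser in browsers and system in systems:
            kept.append(entry)
    return kept


entries = [
    {"agent": "Mozilla/3.0Gold (Win95; I)"},
    {"agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"},
]
kept = filter_entries(entries, {"Netscape/Mozilla"}, {"Windows", "Linux"})
print(parse_agent(entries[0]["agent"]), len(kept))
```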

Description of Used Techniques

Possibilities of Interaction

Mockups / Fake Screenshots

Resources

[1] [Behlendorf et al., 2005] Brian Behlendorf, Apache HTTP Server Logfiles, Apache HTTP Server Project, Access Date: 17 October 2005, http://httpd.apache.org/docs/1.3/logs.html
[2] [Cooper, 2004] Colin Cooper, Logfile Definitions and Examples, Intranet Software Solutions (Europe) Limited [ISSEL], Access Date: 17 October 2005, http://www.issel.co.uk/FAQ/logfile_definitions_examples.htm