Page Scanner

ZebraTester's Page Scanner function automatically and recursively browses and explores the web pages of a web server, similar to a web spider or web crawler.

Page Scanner's Purpose

Primary: To turn a "normal" web surfing session into a load test program. This provides a simplified way to create a web surfing session instead of recording single web pages manually.

However, Page Scanner can only be used to acquire web surfing sessions that do not require HTML form-based authentication. This tool is not a replacement for recording web surfing sessions of real web applications.

Other: Page Scanner detects broken links inside a website and provides statistical data about the largest and slowest web pages. It also supports searching for text fragments across all scanned web pages.

Note 1: Page Scanner does not interpret JavaScript code and does not submit forms. Only hyperlinks are considered. Cookies are automatically supported.

Note 2: Page Scanner keeps the entire scanned website in its transient memory (RAM) in compressed form. This means that large websites can be scanned, but it also means that transient memory is not unlimited.

Note that Page Scanner may return no result or an incomplete result because some websites or web pages contain malformed HTML code or use old, unusual HTML options. Although this tool has been intensively tested, we cannot provide any warranty for error-free behavior. Website- or webpage-related errors may be impossible to fix because of divergent requirements or complexity. The functionality and behavior are similar to those of search engines, which have similar restrictions.

GUI Display

The window is divided into two parts.

Scan Result: The upper part of the window shows the scan's progress or the scan results when it has been completed.

Page Scanner Input Parameter: The lower part of the window lets you enter the scan input parameters and start a scan.


Page Scanner Parameter Inputs

Starting Web Page

The scan starts from this URL. Optionally, scan only parts of a website by entering a deep-linked URL path; for example, http://www.example.com/sales/customers.html. In this case, only web pages below or at the same level of the URL path are scanned.
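The scoping rule above can be sketched as a small check. This is only an illustration under stated assumptions: the function name `in_scan_scope` is hypothetical, and the interpretation that "below or at the same level" means "inside the directory of the starting URL on the same server" is an assumption, not Page Scanner's documented implementation.

```python
from urllib.parse import urlsplit


def in_scan_scope(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url lies at or below the directory of start_url.

    Hypothetical helper; the exact rule used by Page Scanner is assumed here.
    """
    start = urlsplit(start_url)
    cand = urlsplit(candidate_url)
    # Assumption: scheme and host:port must match the starting web page.
    if (start.scheme, start.netloc.lower()) != (cand.scheme, cand.netloc.lower()):
        return False
    # Directory of the starting path, e.g. "/sales/" for "/sales/customers.html".
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    return (cand.path or "/").startswith(base_dir)


start = "http://www.example.com/sales/customers.html"
print(in_scan_scope(start, "http://www.example.com/sales/eu/list.html"))
print(in_scan_scope(start, "http://www.example.com/support/index.html"))
```

With the starting page from the example, pages under /sales/ are in scope, while pages under /support/ are not.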

Char Encoding

The default value, Auto Detect, can be overridden if some or all web pages are wrongly encoded, that is, if the character set specified in the HTML header does not match the character set actually used within the HTML body (malformed HTML on the server side). If Page Scanner cannot extract hyperlinks (succeeding web pages) from the starting web page, try ISO-8859-1 or UTF-8 as a workaround.

Exclude Path Patterns

Excludes one or more URL path patterns from scanning. Commas separate the path patterns.

Follow Web Servers

Include content and web pages from other web servers within the scan; for example, images embedded in the web pages but located on another web server. Enter several additional web servers, separated by commas. Example: http://www.example.com, https://imgsrv.example.com:444. The protocol (HTTP or HTTPS), the hostname (usually www), the domain, and the TCP/IP port are considered, but URL paths are NOT considered.
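The matching rule described above (protocol, host, and port count, the URL path does not) can be illustrated as follows. This is a minimal sketch: `server_key`, `parse_follow_servers`, and `is_followed` are hypothetical names, and treating a missing port as 80 for HTTP and 443 for HTTPS is an assumption.

```python
from typing import Set, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed defaults when no port is given


def server_key(url: str) -> Tuple[str, str, int]:
    """Reduce a URL to (protocol, host, port); the path is deliberately ignored."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme, -1)
    return scheme, (parts.hostname or "").lower(), port


def parse_follow_servers(field: str) -> Set[Tuple[str, str, int]]:
    """Parse the comma-separated Follow Web Servers input."""
    return {server_key(entry) for entry in field.split(",") if entry.strip()}


def is_followed(url: str, followed: Set[Tuple[str, str, int]]) -> bool:
    return server_key(url) in followed


followed = parse_follow_servers("http://www.example.com, https://imgsrv.example.com:444")
print(is_followed("https://imgsrv.example.com:444/img/logo.png", followed))
print(is_followed("https://imgsrv.example.com/img/logo.png", followed))
```

In this sketch the second URL is not followed because it uses the default HTTPS port rather than port 444.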

Verify External Links

Verify all external links to all other web servers. This is commonly used to detect broken hyperlinks to other web servers.

Include

Selects which sets of embedded content types should also be included in the scan. Page Scanner uses the file extension of the URL path (if available) to determine the content type, because this can be done before the hyperlink of the embedded content itself is processed. This saves execution time, but a few URLs of excluded content types may still end up in the scan result, because the MIME type from the received HTTP response headers is used only to detect web pages. Remove these unwanted URLs after the scan has been completed by using the "remove URL" form in the Display Result window.

Content-Type Sets and Corresponding File Extensions

Images, Flash, CSS, JS: .img .bmp .gif .pct .pict .png .jpg .jpeg .tif .tiff .tga .ico .swf .stream .css .stylesheet .js .javascript

PDF Documents: .pdf

Office Documents: .doc .ppt .pps .xls .mdb .wmf .rtf .wri .vsd .rtf .rtx

ASCII Text Files: .txt .text .log .asc .ascii .cvs

Music and Movies: .mp2 .mp3 .mpg .avi .wav .avi .mov .wm .rm .mpeg

Binary Files: .exe .msi .dll .bat .com .pif .dat .bin .vcd .sav
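The extension-based classification described above might look like the following sketch. The mapping reproduces only part of the sets listed above for brevity, and `content_type_set` is a hypothetical helper name, not a ZebraTester function.

```python
import posixpath
from typing import Optional
from urllib.parse import urlsplit

# Excerpt of the content-type sets listed above (not the complete lists).
CONTENT_TYPE_SETS = {
    "Images, Flash, CSS, JS": {".gif", ".png", ".jpg", ".jpeg", ".ico", ".swf", ".css", ".js"},
    "PDF Documents": {".pdf"},
    "ASCII Text Files": {".txt", ".text", ".log"},
}


def content_type_set(url: str) -> Optional[str]:
    """Guess the content-type set from the file extension of the URL path.

    Returns None when the path has no (known) extension; such URLs can only
    be classified later, from the response headers.
    """
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    for name, extensions in CONTENT_TYPE_SETS.items():
        if extension in extensions:
            return name
    return None


print(content_type_set("http://www.example.com/img/logo.GIF?v=2"))
print(content_type_set("http://www.example.com/docs/"))
```

Because the decision is made from the URL alone, no request is needed to skip an excluded type, which is the time saving the text mentions.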

Include Options

Allows you to select or de-select specific file extensions using the keywords -add or -remove.

Example: 

-remove .gif -add .mp2
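How such an option string could modify a set of file extensions is sketched below. The function `apply_include_options` is hypothetical, and the assumption that each keyword may be followed by one or more extensions is ours; the source shows only one extension per keyword.

```python
from typing import Set


def apply_include_options(extensions: Set[str], options: str) -> Set[str]:
    """Apply an option string such as "-remove .gif -add .mp2" to a set of extensions."""
    result = set(extensions)
    mode = None
    for token in options.split():
        if token in ("-add", "-remove"):
            mode = token  # following tokens are extensions for this keyword
        elif mode == "-add":
            result.add(token.lower())
        elif mode == "-remove":
            result.discard(token.lower())
        else:
            raise ValueError(f"extension {token!r} given without -add or -remove")
    return result


selected = apply_include_options({".gif", ".png"}, "-remove .gif -add .mp2")
print(sorted(selected))
```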

Max Scan Time

Limits the maximum scan time in minutes. The scan will be stopped if this time is exceeded.

Max Web Pages

Limits the maximum number of scanned web pages. The scan will be stopped if the maximum number of web pages is exceeded.

Max Received Bytes

Limits the maximum size of the received data (in megabytes), measured over the entire scan. The scan will be stopped if the maximum size of the received data is exceeded.

Max URL Calls

Limits the maximum number of executed URL calls, measured over the entire scan. The scan will be stopped if the maximum number of executed URL calls is exceeded.

URL Timeout

Defines the response timeout, in seconds, per single URL call. If this timeout expires, the URL call will be reported as failed (no response from a web server).

Max Path Depth

Limits the maximum URL path depth of scanned web pages.

Example: http://www.example.com/docs/content/about.html has a path depth of 3.
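The path depth in this example can be reproduced by counting the non-empty path segments. This is a sketch consistent with the example above; `path_depth` is a hypothetical name, and counting the file name as one level is inferred from the stated depth of 3.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty segments of the URL path (the file name counts as one)."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


print(path_depth("http://www.example.com/docs/content/about.html"))  # 3
```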

Follow Redirections

Limits the total number of followed HTTP redirects during the scan.

Follow Path Repetitions

Limits the number of path repetitions that can occur within a single URL path. This parameter acts as protection against endless loops in scanning and should usually be set to 1 (default) or 2.

Example: http://www.example.com/docs/content/about.html has a path repetition value of 3.

Follow CGI Parameters

This option, which is disabled by default, protects against receiving almost identical URLs many times if they differ only in their CGI parameters. If disabled, only the first of several similar URLs will be processed.

For example, the first URL http://www.example.com/showDoc?context=12 will be processed, but the subsequent similar URLs http://www.example.com/showDoc?context=10 and http://www.example.com/showDoc?context=13 will not be processed.
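The effect of this option can be sketched as a de-duplication step that ignores the query string. This is an illustration only: `filter_cgi_duplicates` is a hypothetical name, and identifying "similar" URLs by protocol, server, and path is an assumption about how Page Scanner compares them.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def filter_cgi_duplicates(urls: Iterable[str], follow_cgi_parameters: bool = False) -> List[str]:
    """Keep only the first URL of each group that differs only in CGI parameters."""
    if follow_cgi_parameters:
        return list(urls)  # option enabled: every URL is processed
    seen = set()
    kept = []
    for url in urls:
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc, parts.path)  # query string is ignored
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


found = [
    "http://www.example.com/showDoc?context=12",
    "http://www.example.com/showDoc?context=10",
    "http://www.example.com/showDoc?context=13",
]
print(filter_cgi_duplicates(found))
```

With the option disabled, only the first showDoc URL remains in the list.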

Browser Language

Sets which default language should be preferred when scanning multilingual websites.

Use Proxy

Apply the Personal Settings menu's Next Proxy Configuration when scanning through an (outgoing) proxy server.

SSL Version

Select the SSL protocol version to communicate with HTTPS servers (encrypted connections).

Annotation

Enter a short comment about the scan.

Authentication

Allows scanning protected websites (or web pages).

Supported Authentication Methods

Basic: Applies HTTP Basic Authentication (the Base64-encoded username:password pair is sent within all HTTP request headers). You must also enter a username and password into the corresponding input fields.

NTLM: Applies NTLM authentication for all URL calls (if requested by the web server). The NTLM configuration of the Personal Settings menu is used.

PKCS#12 Client Certificate: Applies an HTTPS/SSL client certificate for authentication. The active PKCS#12 client certificate of the Personal Settings menu is used.
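For the Basic method, the header value sent with each request is built from the Base64-encoded username:password pair, as defined by the HTTP Basic Authentication scheme. The following sketch shows the construction; `basic_auth_header` is a hypothetical helper, and encoding the credentials as UTF-8 is an assumption.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of the HTTP "Authorization" header for Basic Authentication."""
    credentials = f"{username}:{password}".encode("utf-8")  # assumed encoding
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("Aladdin", "open sesame"))
# Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Base64 is an encoding, not encryption, so Basic Authentication protects the credentials only when combined with HTTPS.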

Scan Options

ABORT: Abort a running scan by clicking the "Abort Scan" (X) icon.

DISPLAY: Displays the scan result.

CONVERT: Converts the Page Scanner result into a "normal" web surfing session (.prxdat file), which can then be used to create a load test program and for other ZebraTester actions.

  • A filename, without path or file extension, is required.

  • An annotation is recommended to provide a hint in Project Navigator.

  • Click Convert and Save when ready.

  • Optionally display the newly converted session in the Main Menu.

Filename

The filename of the web surfing session. You must enter a "simple" filename with no path and no file extension. The file extension is always .prxdat. The file will be saved in the selected Project Navigator directory.

Web Pages

Selects the scanned web pages which should flow into the web surfing session. “All Pages” means that all scanned web pages are set. Alternatively, the option “Page Ranges” allows you to select one or several ranges of page numbers. If you use several ranges, they must be separated by commas.

Example: "1, 3-5, 7, 38-81"
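A page range specification of this form expands into individual page numbers as sketched below. The function `parse_page_ranges` is hypothetical, and the choices to sort the result and to reject reversed ranges are assumptions.

```python
from typing import List


def parse_page_ranges(spec: str) -> List[int]:
    """Expand a specification such as "1, 3-5, 7" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(value) for value in part.split("-", 1))
            if first > last:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(first, last + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_ranges("1, 3-5, 7"))  # [1, 3, 4, 5, 7]
```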

Max. URL Calls

Limits the number of URL calls that should flow into the web surfing session. 
Tip: Apica recommends not converting more than 1,000 URL calls into a web surfing session.

Annotation

Enter a short comment about the web surfing session. This will become a hint in Project Navigator.

Load Session into

Optionally loads the web surfing session into the transient memory area of the Main Menu or one of two memory Scratch Areas of the Session Cutter.

SAVE: When a scan has been completed, save the scan result to a file. The file will be saved in the selected Project Navigator directory and will always have the file extension .prxscn. Scan results can be restored and loaded back into the Page Scanner by clicking on the corresponding "Load Page Scan" icon inside Project Navigator.

DISCARD: Discards the scan result.


Analyzing the Scan Result

The most important statistical data about the scan are shown in the summary/overview, near the window's top. Below the overview, select the various scan result details you want to retain/find/filter.

On the right side, near the scan result detail selection, the search form allows you to search for an ASCII text fragment across all web pages of the scan result.

By default, the text fragment is searched for within all HTTP request headers, all HTTP response headers, and all HTTP response content data.

The Remove URLs form, shown below the scan result detail selection, allows you to remove specific URLs from the scan result. The set of URLs to remove is selected by the received MIME type (examples: IMAGE/GIF, APPLICATION/PDF), combined by a logical AND condition with either the received HTTP status code (for example, 200 or 302) or a Page Scanner error code, such as "network connection failed".

with content MIME-type: Selects a specific MIME type. The input field is case-insensitive (upper- and lowercase characters are treated as identical).

  • any means that all MIME types are selected, independent of their value.

  • none means that only URL calls whose HTTP response headers do NOT contain MIME type information (HTTP response header field "Content-Type" not set) are selected.

HTTP status code: Selects an HTTP status code or a Page Scanner error code.
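The selection logic of the Remove URLs form (MIME type AND status) can be sketched as follows. The `UrlResult` record and both function names are hypothetical, and representing status codes and Page Scanner error codes as plain text is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UrlResult:
    url: str
    mime_type: Optional[str]  # None when no Content-Type header was received
    status: str               # HTTP status code or error label, as text (assumed)


def matches_removal(entry: UrlResult, mime_filter: str, status_filter: str) -> bool:
    """Return True if the entry matches both the MIME type and the status filter."""
    wanted = mime_filter.strip().lower()
    if wanted == "any":
        mime_ok = True
    elif wanted == "none":
        mime_ok = entry.mime_type is None
    else:
        # Case-insensitive comparison, as described for the input field.
        mime_ok = entry.mime_type is not None and entry.mime_type.lower() == wanted
    return mime_ok and entry.status == status_filter.strip()


def remove_urls(results: List[UrlResult], mime_filter: str, status_filter: str) -> List[UrlResult]:
    """Return the scan result without the entries selected for removal."""
    return [r for r in results if not matches_removal(r, mime_filter, status_filter)]


results = [
    UrlResult("/a.gif", "image/gif", "200"),
    UrlResult("/b.gif", "IMAGE/GIF", "404"),
    UrlResult("/c", None, "200"),
]
print([r.url for r in remove_urls(results, "IMAGE/GIF", "200")])
```

Here only /a.gif is removed, because /b.gif has a different status code and /c has no MIME type.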

Analytics Filters

Scan Input Parameter displays all input parameters for the scan (without authentication data).

Scan Statistic displays some additional statistical data about the scan.

Similar Web Pages is the number of web pages with duplicate content (same content but different URL path). Failed URL Calls is the number of URL calls that failed, either because no HTTP status code was available (no response received from a web server) or because the received HTTP status was an error code (400 to 599).
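Both statistics can be illustrated with a short sketch. The failure rule follows the definition above; grouping pages by a SHA-256 digest of their content is merely one plausible way to find identical content and is an assumption, since the source does not say how Page Scanner compares pages. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Optional


def is_failed_call(status: Optional[int]) -> bool:
    """A call failed if no status was received or the status is in 400 to 599."""
    return status is None or 400 <= status <= 599


def duplicate_groups(pages: Dict[str, bytes]) -> List[List[str]]:
    """Group URL paths whose response content is byte-for-byte identical."""
    by_digest = defaultdict(list)
    for url, body in pages.items():
        by_digest[hashlib.sha256(body).hexdigest()].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


print(is_failed_call(None), is_failed_call(404), is_failed_call(302))
print(duplicate_groups({"/a": b"same", "/b": b"same", "/c": b"other"}))
```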

Non-Processed Web Servers displays a summary of all web servers found in hyperlinks but whose web pages or page elements have not been scanned.

The number before the server name shows the number of times Page Scanner ignored the hyperlink.

Scan Result per Web Page displays all scanned web pages. A web page's embedded content, such as images, is always displayed in a Web Browser Cached View. This means, for example, that a particular (unique) image is shown only once, inside the web page in which it is first referenced. Subsequent web pages do not show the same embedded content again. This behavior roughly matches what a web browser does: it caches duplicate references across all web pages within a web surfing session.
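The cached view described above can be sketched as a pass over the pages in scan order that lists each embedded URL only under the first page referencing it. The function `cached_view` and its input shape are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

Page = Tuple[str, List[str]]  # (page URL, embedded content URLs), in scan order


def cached_view(pages: List[Page]) -> List[Page]:
    """List each embedded URL only under the first page that references it."""
    seen = set()
    view = []
    for page_url, embedded in pages:
        fresh = []
        for item in embedded:
            if item not in seen:  # not yet "cached" by an earlier page
                seen.add(item)
                fresh.append(item)
        view.append((page_url, fresh))
    return view


pages = [
    ("/index.html", ["/logo.png", "/style.css"]),
    ("/about.html", ["/logo.png", "/team.jpg"]),
]
print(cached_view(pages))
```

In this example /logo.png appears only under /index.html, while /about.html lists just /team.jpg.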

URL Detail

More details about a specific URL call can be shown by clicking on the corresponding URL hyperlink.

In this example, clicking the URL https://www.apicasystems.com/feed/ shows the server's 200 OK response and the MIME type, the HTTP request and response headers, and the response content.

Broken Links displays a list of all broken hyperlinks.

Duplicated Content displays a list of URLs with duplicate content (same content but different URL path).

Largest Web Pages displays a list of the largest web pages.

Tip: Click on any of the bars to open the Scan Result per Web Page details.

Slowest Web Pages displays a list of the slowest web pages.
