HTTP - Browsing page source
Webpages are served using a protocol called Hypertext Transfer Protocol, or HTTP for short. By convention the HTTP service listens on port 80, although it is trivial for the administrator to use another port. For the purposes of our little exercise we are going to look at very simple ways to get the webserver to serve up the front page of its site. There are a myriad of things you can do, but we are going to keep it pretty trivial.
Start by connecting your chosen tool to port 80 of a website (www.example.com for the purposes of this demonstration). Then type the following and press return twice when you are done. If you are using telnet you need to type it correctly the first time, otherwise the request will not work.
GET / HTTP/1.0
What we have just asked is for the server to send us the document (GET) that is the front page of the website (/, the root document), and we have specified that we want a simple request without all the extra fuss, which I will explain later (HTTP/1.0).
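If you would rather script this than type it into telnet, the same request can be sketched in Python. This is only a minimal sketch: the function names `build_http10_get` and `fetch_raw` are made up for this example, and it assumes a plain (non-TLS) server that closes the connection once it has finished replying.

```python
import socket


def build_http10_get(path="/"):
    """Return the bytes of a bare HTTP/1.0 GET request for `path`."""
    # The trailing blank line (second CRLF) is the "press return twice" step.
    return f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")


def fetch_raw(host, request, port=80, timeout=10.0):
    """Send `request` to host:port and return everything the server sends back."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server has closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch_raw("www.example.com", build_http10_get())` should return the headers and the page in one lump of bytes, much like the telnet session.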
In response the server replied with the following (this is just an example so yours will be different, plus I have tweaked this to make it simpler so some of the numbers will not be accurate):
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Cache-Control: no-cache
Expires: Mon, 24 Dec 2001 00:49:17 GMT
Content-Location: http://10.0.0.100/index.html
Date: Mon, 24 Dec 2001 00:49:17 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Mon, 24 Dec 2001 00:27:03 GMT
ETag: "60fa8cb5118cc11:adc"
Content-Length: 103

<HTML>
<HEAD>
<TITLE>Demo page</TITLE>
</HEAD>
<BODY>
This is just a test.
</BODY>
</HTML>
When the webserver is done telling us about this page it closes the connection, since we have the data and it has other requests to process. When this happens we get this message:
Connection to host lost.
So what have we learnt from this? Well, everything before the blank line is called the header, and everything below it is called the content. Normally you only see the content, since your browser filters out the headers for you. Headers, however, do contain rather a lot of information...
* The request succeeded and the site returned its front page (denoted by the 200 status). A status of 3XX implies you have to go to another location to find what you are looking for, a status of 4XX implies that there was a problem getting this page (either it was missing, or you aren't allowed access to it at this time, etc.), and lastly a status of 5XX would mean that the server had problems processing the request.
* The site appears to be running Microsoft's IIS version 5 (this is only found on Windows 2000), so I know what webserver and operating system they appear to be running. If you haven't guessed, you get this from the string that says Microsoft-IIS/5.0.
* We have a Content-Location header, which can often give away information such as the internal addresses of machines, full paths to documents on the website and a multitude of other things.
* We have a Last-Modified header, which does what you expect it to: it details the last date and time that this page was modified. This is sometimes very useful to see how frequently a website is really updated.
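To make the header/content split and the status ranges concrete, here is a rough Python sketch that pulls a raw response apart. The names `parse_response`, `status_class` and `SAMPLE` are invented for illustration, and the sample is a cut-down version of the response shown earlier. It assumes a well-formed response and keeps only the last value if a header is repeated.

```python
def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    # Everything before the first blank line is the header block.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    # The status line looks like: HTTP/1.1 200 OK
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lower-cased.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def status_class(code):
    """Map a status code onto the broad ranges described in the list above."""
    classes = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "other")


SAMPLE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Microsoft-IIS/5.0\r\n"
    b"Last-Modified: Mon, 24 Dec 2001 00:27:03 GMT\r\n"
    b"\r\n"
    b"<HTML><BODY>This is just a test.</BODY></HTML>"
)

code, headers, body = parse_response(SAMPLE)
print(code, status_class(code), headers["server"])
```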
The actual content will generally only give away the errors of the designers, but looking out for and identifying those is an entire lesson in itself.
HTTP/1.1 vs HTTP/1.0
HTTP/1.1 is a widely used extension to HTTP/1.0, as it allows the client more control over the content it is being delivered. Like most protocols it is over-engineered, so much so that there are features built into it that are rarely used. They were nice ideas, but very few people implement features such as only giving out certain types of content if the client can accept them. The minimum request you can expect to send and get a valid response back is the following:
telnet www.example.com 80
Connected to www.example.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.example.com
Connection: Close
However, in practice you are more likely to send a replica of a browser's request data, since it fools the website into thinking a browser is visiting, and also makes sure that if for some reason the website needs the usual set of headers, it has them.
telnet www.example.com 80
Connected to www.example.com.
Escape character is '^]'.
GET / HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Host: www.example.com
Connection: Close
Example of Show Header Only:
telnet www.example.com 80
Connected to www.example.com.
Escape character is '^]'.
HEAD / HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Host: www.example.com
Connection: Close
Then you should see the response message:
HTTP/1.1 200 OK
Date: Tue, 20 Oct 2009 16:53:41 GMT
Server: Apache/2.2.11 (FreeBSD) DAV/2 PHP/5.2.10 with Suhosin-Patch
X-Powered-By: PHP/5.2.10
Set-Cookie: SESS93e15ce578546d0a845aa3efdd4d6bde=00d162992b81068ab56f0ce61d164494; expires=Thu, 12-Nov-2009 20:27:01 GMT; path=/; domain=.www.example.com
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Last-Modified: Tue, 20 Oct 2009 16:53:41 GMT
Cache-Control: store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 20
Connection: close
Content-Type: text/html; charset=utf-8
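To see what HEAD does without poking at somebody else's server, the sketch below starts a throwaway webserver on the local machine and asks it for the same page with both methods. Everything here (`DemoHandler`, `PAGE`, `request_local`) is a made-up name for this illustration, and it assumes the loopback address 127.0.0.1 is usable on your machine.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<HTML>demo</HTML>\n"


class DemoHandler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)

    def do_HEAD(self):
        # Same headers as GET, but no content is written.
        self._send_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def request_local(method):
    """Serve one request on a temporary local server; return (status, length, body)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request(method, "/")
        resp = conn.getresponse()
        body = resp.read()
        length = resp.getheader("Content-Length")
        conn.close()
        return resp.status, length, body
    finally:
        server.shutdown()
        server.server_close()
```

Comparing `request_local("GET")` with `request_local("HEAD")` should show the same status and Content-Length in both cases, with the content itself only arriving for GET.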
The main difference is the Host: request header, which allows website hosting companies to put more than one website on an address and have them all accessible (referred to as a virtual server). If you try to access a virtual server without the Host: line you will not get the site you expect!
Just in case you were curious, the request headers used in that example are:
* Accept - gives a list of the types of data that you are in theory willing to accept.
* Accept-Language - gives a list of the languages that you are in theory willing to accept.
* Accept-Encoding - gives a list of the encoding methods that you are in theory willing to accept.
* User-Agent - a string that describes the type of browser you are using.
* Host - the hostname this request is destined for.
* Connection - specifies what should happen to the connection after this request (Close asks the server to close it once it has replied).
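The requests above can also be assembled in code rather than typed by hand. This is only a sketch: `build_request` and `BROWSER_HEADERS` are names invented for this example, and the header values are simply copied from the telnet sessions above.

```python
def build_request(method, path, host, extra_headers=None):
    """Return an HTTP/1.1 request as bytes, ready to send over a socket."""
    headers = dict(extra_headers or {})
    headers["Host"] = host           # needed to reach the right virtual server
    headers["Connection"] = "close"  # ask the server to hang up after replying
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Lines end in CRLF, and a blank line marks the end of the request.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Browser-like headers, copied from the example sessions above.
BROWSER_HEADERS = {
    "Accept": "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*",
    "Accept-Language": "en-us",
    "Accept-Encoding": "gzip, deflate",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)",
}

minimal = build_request("GET", "/", "www.example.com")
header_only = build_request("HEAD", "/", "www.example.com", BROWSER_HEADERS)
print(header_only.decode("ascii"))
```

One thing to watch for: once you send Accept-Encoding: gzip, deflate the server is allowed to compress the content, so the body of a GET may come back as binary rather than readable HTML.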
The choice of which version to use comes down to how much effort you want to put into the task. Whereas 1.0 is simple to the point that it is noticeably unrealistic, 1.1 is more complex but more believable, since this is what a modern browser would use, so it will not look out of place. It is also useful to remember that you cannot rely on reaching virtual servers using 1.0, since the Host header only became part of the specification with 1.1 (though many servers will honour a Host line sent with a 1.0 request).