Troubleshooters.Com and Code Corner Present
PHP Power Pointers:
PHP Data Persistence
Copyright (C) 2002 by Steve Litt
Contents
Introduction
Security Warning!
Challenges of Statelessness
Generating Session IDs
Summary
Introduction
This document assumes you have computer programming knowledge and that
you know your way around PHP, Linux and UNIX. Rather than spending huge amounts
of time discussing cookies or PHP4's sessions, this document discusses the
ramifications of HTTP's statelessness, alternatives to get around it, database
variable lookup by passed session key, and session key generation tips and
techniques.
This document assumes PHP is installed and functioning on your system, and that you're reasonably familiar with PHP.
Security Warning!
Throughout this document it is assumed that PostgreSQL data web access is through user apache, or whatever user your httpd daemon runs as. Although that's the easiest way to do it, it's by no means the most secure. Anyone co-hosted on the same box as your website can access your data by writing their own PHP scripts, because their access to user apache is the same as yours. The same is true of ident methods of authentication.
Data-enabled web apps have many gotchas, especially if there are multiple website owners on a single host computer sharing one Apache and one DBMS. For a further discussion of this, and some solutions, see PHP Data Security.
Challenges of Statelessness
Most web app challenges revolve around the fact that HTTP is a stateless
protocol. Statelessness means that when you transition from one web page
to another, all variable values are forgotten. This is true whether the page
is static or rendered by a script. Any variables -- any values, are forgotten.
As a programmer, a good way of envisioning this challenge is to imagine a
programming language without global variables. Any variable set in one web
page is out of scope in another.
There are three basic ways to retain variables and values between pages:
1. Cookies
2. Passing the values
3. Database lookup
Cookies
I lied about no global variables between web pages. Cookies can hold global
variables accessible from multiple web pages. Cookies are little files kept
on the user's computer that store the desired values. While this might seem
ideal, cookies have several potential problems:
1. Older browsers don't recognize cookies
2. Many people set their browsers to reject cookies for security reasons
Reason #2 is critical. Cookies can be made to track your browsing activities
and other very personal information. Cookie-compliant browsers all provide
methods of rejecting cookies entirely, or on a per-domain basis.
The bottom line is that if you need your website to work with all users,
cookies might help with some but you need a backup plan for those rejecting
or unable to process cookies.
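As a sketch of the mechanism (the cookie name favcolor and the one-hour lifetime are invented for illustration), setcookie() stores a value in the browser and $_COOKIE exposes whatever the browser sent back:

```php
<?php
// setcookie() must run before any page output, because cookies
// travel in the HTTP response headers.
setcookie("favcolor", "blue", time() + 3600);

// $_COOKIE is only populated on the *next* request, after the
// browser has sent the cookie back.
if (isset($_COOKIE["favcolor"])) {
    echo "Your color is " . $_COOKIE["favcolor"] . "\n";
} else {
    echo "No cookie yet -- reload the page.\n";
}
?>
```

A browser configured to reject cookies never sends the value back, so the else branch runs on every visit -- exactly the case the backup plan must cover.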
Value passing
All computer programmers know global variables can spell disaster through
side-effect problems. So instead of using global variables, we pass necessary
information as subroutine arguments. It's inconvenient but much safer.
You can pass values between web pages. As in subroutines, it's inconvenient
(actually much more inconvenient than passing between subroutines). But it
can be done, and it doesn't rely on cookies.
The two methods of passing data between web pages are the GET method and
the POST method. GET is much easier, POST is much more secure. Professional
websites are usually better off with the POST method.
The GET Method
The GET method simply passes variable name/value pairs on the URL after a question mark. This method can be sent by all web pages, whether the transfer mechanism is a link or a form's submit button.
Unfortunately, the variables are plain text and are subject to tampering
by the user. On visiting an ecommerce site using the GET method, I was able
to modify the URL such that I could purchase a $10.00 book for $1.00. Naturally
I didn't complete the transaction, but an unscrupulous person could have
easily ordered books at 1/10 the price and then resold them. Only a separate
validation script or a human audit could have uncovered the problem.
The GET method is best used for only the most harmless data, or data which cannot easily be intelligently forged.
It's possible to generate a session id variable from microsecond-seeded random numbers, sometimes combined with other data. If constructed right, such session id's are for all practical purposes impossible to reverse engineer. Sometimes such a session id can be passed via the URL without undue risk.
One more problem with passing data in the URL is that crackers and script
kiddies can tack malicious commands on the end of an otherwise good URL.
Be sure every page receiving GET requests immediately truncates excess characters from the URL; after splitting the URL into variables, immediately check each for the right length and other validity, and immediately look up the database session info with the session id. If any of these checks fails, immediately terminate the script.
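A minimal sketch of that immediate truncate-and-validate step, assuming the 12 character upper-case session ids generated later in this article (the helper name and the GET parameter name sid are assumptions):

```php
<?php
// Truncate excess characters, then check length and character
// validity. Returns the clean session id, or false on any failure,
// at which point the caller should terminate the script.
function cleanSessionId($raw)
{
    $sid = substr($raw, 0, 12);              // immediately truncate
    if (strlen($sid) != 12)                  // wrong length: reject
        return false;
    if (!preg_match('/^[A-Z]{12}$/', $sid))  // non-A-Z garbage: reject
        return false;
    return $sid;
}

// Example: a well-formed id passes; a forged short one is rejected.
var_dump(cleanSessionId("ABCDEFGHIJKL")); // string(12) "ABCDEFGHIJKL"
var_dump(cleanSessionId("abc123"));       // bool(false)
?>
```

A real page would call die() whenever cleanSessionId() returns false, before touching the database.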
The POST Method
The POST method of value passing is available ONLY from HTML forms. It cannot
be applied by links, even if those links are contained within a form. Therefore,
to retain state without resorting to the GET method, every exit point on
every page must be a button on a form. While inconvenient, this is the most
secure method of passing data between pages.
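A sketch of such an exit point: a one-button form carrying the session id in a hidden field ($sid, nextpage.php, and the field name are placeholders, not from the original article):

```php
<?php
// Build a form whose only job is to carry the session id forward.
// htmlspecialchars() keeps the value from breaking the markup.
$sid = "ABCDEFGHIJKL"; // placeholder; a real page would already have this

$form  = '<form method="post" action="nextpage.php">' . "\n";
$form .= '<input type="hidden" name="sid" value="'
       . htmlspecialchars($sid) . '">' . "\n";
$form .= '<input type="submit" value="Continue">' . "\n";
$form .= '</form>' . "\n";

echo $form;

// The receiving page reads the value back as $_POST["sid"] and must
// validate it exactly as it would a GET value.
?>
```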
Database lookup
Oops, I lied again. You can have the equivalent of global variables without
cookies. You can simply keep the information in a database. But there's one
problem -- how do you retrieve the information once you've switched pages?
The new page must receive at least one piece of data -- the lookup key for
the remaining data, by means other than the database. That means either cookies
or value passing.
The way this is typically done is that the first page visited generates a session id. The session id is a unique string defining a specific visit by a specific visitor. The string typically involves the concatenation or combination of a random number seeded by a timestamp and an autoincrement. For additional resistance to cracking and reduced likelihood of duplicates, other information might be involved, such as the user's IP address or the Apache process id. Whatever is chosen, it's vital that it not be guessable, because if it can be guessed, a bad guy can guess at a session id and then masquerade as a different user. PHP's md5() function can change an otherwise guessable session id into a completely unguessable 32 character string which can be used as the session id.
As soon as the session id is created, a row for that session id is created
in the database. This row has all the necessary information for the various
web pages on the site. For instance, it might contain the invoice info for
a shopping cart, and a relation to the user's shopping cart.
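As an illustrative sketch (the table and column names here are invented, not from the original article), such a per-session row in PostgreSQL might look like:

```sql
-- One row per active session, keyed by the unguessable session id.
create table sessions (
    sid      char(32) primary key,    -- the md5'd session id
    customer integer,                 -- who is shopping
    cart_id  integer,                 -- relation to the shopping cart rows
    created  timestamp default now()  -- lets old sessions be purged
);
```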
The combination of storing most variables in a database and passing a unique key through POST, cookies, or even GET (provided the session id has been made unguessable) is an excellent solution. It's convenient in that only a single variable need be passed around, with all other variables regenerated in each page. If a website enhancement necessitates a new piece of information, that new piece can be added to the database, rather than simultaneously adding a variable to 20 web scripts. Another advantage of the key/database hybrid is confusion reduction.
Without the database repository, every variable used by every script would
need to be passed by every other script, because even if neither the passing
nor receiving script needs the variable, once it's dropped there's no way
that the one script needing it can regenerate it.
The session id's MUST be unguessable. When you generate unique session ids, depending on how you create them they might be guessable. As a simple case, session ids could simply be incremented. If so, a bad guy could use your site, and then decrement the session id to hack into somebody else's session. If you're passing the session id in the URL (GET method), even a 12 year old could perform this crack. However, using PHP's md5() function, you can md5sum your unique key into a new 32 character string that bears no resemblance to the old string and is impossible to convert back to the old string. Simply use the 32 character string as the session id, and the likelihood of a cracker being able to change it to an existing good session ID (good meaning a session ID that's been updated in, let's say, the past hour) approaches 0.
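A minimal sketch of that md5() step (the input key here is a made-up example value):

```php
<?php
// A trivially guessable key: a bare autoincrement value.
$rawkey = "100042";

// md5() maps it to a 32 character hex string that bears no
// resemblance to neighboring increments and can't be decremented.
$sid = md5($rawkey);

echo strlen($sid) . "\n"; // 32
echo $sid . "\n";         // 32 hex digits
?>
```

As the Summary cautions, md5() of a bare increment can still be brute forced by iterating candidate increments, so in practice the input should include a random component as well.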
If your app doesn't use a database, it's possible to use a flat file or some
other mechanism to store all the non-session-id variables. However, finding
a way for the http server to write and read those variables without allowing
crackers to read them is difficult, and out of the scope of this article.
In summary, a hybrid with a single session id passed everywhere, and the rest looked up by session id by each script, is probably the best methodology. The session id can be passed by cookies where possible, and by POST or GET where it must. If the session ID is sufficiently unforgeable (perhaps by using some sort of random number or checksum), it could even be passed in the URL as a GET request.
Generating Session IDs
The stateless nature of HTTP is always a hassle, but many developers find
partial solace by keeping most variables in the database, passing only the
session id, and looking up the rest of the variables based on the session
id. Use of session id's presents two different risks which must be handled:
1. Risk of a user or cracker forging a legitimate session id
2. Risk of accidentally handing out duplicate session id's
These are two very different risks, and both must be handled. The more important,
valuable or private the information being handled, the more precautions must
be taken against these two risks.
Absolute duplication prevention can be handled by a well designed autoincrement mechanism. Unfortunately, autoincrements are extremely easy to forge. Risk of forgery can be limited to one in trillions with a well designed random number generator. Unfortunately, random number uniqueness is only as good as the random seed, and the best that can be hoped for is about one in a million on a system with a true microsecond clock. For clocks with less precision, the odds are worse. So it's a good idea to seed the random number generator with numbers other than the microsecond clock.
Therefore, to minimize both the risk of forgery and the risk of accidental duplication, you can use a combination (concatenation) of a random number and an autoincrement; or, if you completely trust the autoincrementer, you can use the md5() function on the increment in order to turn it into a 32 digit hex number, thereby reducing the chance of successful forgery to one in trillions even on the most heavily trafficked sites. However, unless you completely trust your autoincrement, you're better off using it in combination with a random number, so even if the autoincrementer occasionally returns a dup, the random number will likely provide uniqueness.
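A sketch of that combination (both inputs are placeholders; the increments table and randomString() machinery shown later in this article would supply the real values):

```php
<?php
// Concatenate an autoincrement (guaranteed unique) with a random
// component (hard to forge), then md5() the pair into a uniform
// 32 digit hex session id.
$increment = 100042;    // placeholder for the database increment
$random    = mt_rand(); // hard-to-predict component

$sid = md5($increment . "-" . $random);

echo $sid . "\n"; // 32 hex digits; unique even if mt_rand() repeats
?>
```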
Minimizing forging risks
Passed variables can be changed, and a passed session id is no exception.
Changing the session id is trivial if the session id is passed in the URL
(GET method), but a skillful cracker can also change the session id even
if the POST method is used.
You cannot prevent the user from changing the session id, but you can take
steps to minimize the chance of such forgeries mimicking a legitimate session
id. What you need is a random number, and the random number must be large
enough that its range overwhelms the number of legitimate numbers. For example,
if your session id's are 12 character base 26 random numbers (each "digit" is an upper case letter), that offers 26^12, or about 9.54 x 10^16, possible session id's. If your site receives a million visitors a day, and if you regularly purge sessions over 24 hours old, then at a given time you'll have a million legitimate session id's. The chance of someone forging a legitimate one is (1 x 10^6)/(9.54 x 10^16), or about 1 in 95 billion. The risk is minuscule compared to more pressing risks facing your business, and if you want to further decrease the risk of a successful forgery, you can add more digits. Adding 5 more digits would decrease the risk by another factor of almost 12 million (26^5), to approximately 1 in a million trillion.
Minimizing accidental duplicate risks
There's no such thing as a truly random algorithm. PHP's mt_rand(lowerlimit,upperlimit) routine is excellent at producing seemingly random numbers evenly distributed across the range defined by its arguments, but in fact its starting place will always be determined by its seed. If you don't seed it, you have no idea how random it will really be. And if you seed it with the microsecond clock (explained later), the odds of duplication are less than a million to one. For a well trafficked site that's not good enough for the session id.
Including the seconds since epoch in the session ID seed drops the duplication
risk much further -- for practical purposes limiting the possibility to cases
where users arrive within the same microsecond, or within a number of microseconds
corresponding to the time granularity of the system. Depending on your traffic
and the degree of loss created by an inadvertent duplicate, the appending
of seconds since epoch might be sufficient. But for more safety, it's all
too easy to add in a third number -- an unseeded random number. Unseeded
random numbers aren't necessarily random, but when combined with a time element
the two make it very difficult to reverse engineer the produced random number.
Other values that could be either concatenated into the session id or used for the seed include:
1. The incoming IP address
2. The difference between a second microtime() call and the first one, multiplied by a suitable multiplier
The second microtime() might seem like an opportunity to exploit system variation for more variability, but my tests on my box indicate it cuts the risk of duplication by maybe a factor of 30. That's better than nothing, but not much. The incoming IP address might help, but given the fact that large ISP's funnel many users through the same IP address, its added safety is unpredictable.
Probably the easiest way to autoincrement in a secure way is with a single row of a table containing a key, and a data column containing an integer to be incremented. When someone needs a new number, they lock the table, grab the value, increment it, write it back, and unlock it. If two people try to perform this process at the same time, Postgres acts as a traffic cop.
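Spelled out as an explicit PostgreSQL transaction (a sketch, assuming an increments table like the one created later in this article), the lock-grab-increment-write cycle looks like:

```sql
begin;
lock table increments;  -- traffic cop: a second client blocks here
select number from increments where type = 'sid';  -- grab the value
update increments set number = number + 1
    where type = 'sid';                            -- write it back
commit;  -- unlock; the waiting client now proceeds
```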
Locking can be a mess for the application programmer to implement, so later we'll discuss how to have Postgres do the whole thing with a stored procedure (a function in Postgresese).
Random number generation
PHP gives you two outstanding tools to use in generating random number session ids: microtime() and mt_rand(). The microtime() command returns a string consisting of a float, a space, and an integer. The float is between 0 and 1, representing the number of microseconds on the system clock. The integer is the number of seconds elapsed since 1/1/1970, or whatever epoch the computer uses. To test it, create the following microtime.php file:
<?php
echo microtime();
?>
Pull up microtime.php in a browser, and you'll see something like this:
0.52037700 1039818902
Click your browser's refresh button several times and note that the first number changes almost randomly, while the second number corresponds to elapsed seconds and changes only in the rightmost couple digits.
To actually acquire the numeric value of the microseconds and seconds parts of the string, do the following:
$timestring = microtime();
$microseconds = (double) $timestring;
$seconds = (integer) substr($timestring, strrpos($timestring, " "), 100);
First you capture the time in a snapshot, and from that point forward you work only on the snapshot. PHP's type conversion considers the space as the end of the floating point number, so the typecast retrieves the correct amount. The second part of the string is more difficult. You need to find the space's position, and retrieve only the portion of the string after that space. Prove this to yourself by changing microtime.php to the following:
<?php
$timestring = microtime();
$microseconds = (double) $timestring;
$seconds = (integer) substr($timestring, strrpos($timestring, " "), 100);
echo "<pre>";
echo $timestring . "\n";
echo $microseconds . " " . $seconds . "\n";
echo "</pre>";
?>
The preceding should print the string, and then a concatenation of the numeric values, and they should obviously be the same. Note that depending on how precise your system's microsecond reading is, you'll need to vary the number of spaces between $microseconds and $seconds in order to have them fall below their string equivalents. And of course, if $microseconds ends in one or more zeros, it will be shorter and the $seconds will move left. But this should be enough to demonstrate the workings of the microtime() command.
As discussed earlier in this section, using only microseconds as a random number seed leaves you open to inadvertent duplicates.
Much better is to include the seconds since Epoch, and even better is to
also throw in an unseeded random number. Between those three, it's almost
impossible for a cracker to set up a machine to duplicate your seed and thus
guess your random numbers.
The easiest way to generate a random number is with base 26, using A-Z, with the random number generator seeded with microseconds. Here's the code to generate a 12 character base 26 number:
<?php
function randomString($randStringLength)
{
$timestring = microtime();
$secondsSinceEpoch=(integer) substr($timestring, strrpos($timestring, " "), 100);
$microseconds=(double) $timestring;
$seed = mt_rand(0,1000000000) + 10000000 * $microseconds + $secondsSinceEpoch;
mt_srand($seed);
$randstring = "";
for($i=0; $i < $randStringLength; $i++)
{
$randstring .= chr(ord('A') + mt_rand(0, 25));
}
return($randstring);
}
echo "<pre><big><big>\n";
echo randomString(12);
echo "</big></big></pre>\n";
?>
The preceding code generates a 12 character base 26 number. There are 9.54 x 10^16 such numbers, meaning that if your site gets a billion visits per year, the chance of duplicate numbers being handed out in a year is about one in a million. Assuming you purge old session records daily, that becomes less than one in 365 million. If these aren't good enough odds for you, tack on an additional 5 characters to reduce the likelihood of duplicates another 11 million times. At that point you're more at risk of being killed by an alligator than you are of dealing out a duplicate. This is true especially because the seed is based on seconds since epoch, microseconds, and an unseeded random number.
What is the performance effect of the randomString() function? Let's run
30000 iterations on my unloaded dual Celeron 450 with 512MB of RAM. Here's
the loop:
<?php
echo "<pre><big><big>\n";
$iterations = 30000;
echo "Starting $iterations iterations of random number generation...\n";
$randstring="";
$startTimeString = microtime();
for($i=0; $i < $iterations - 1; $i++)
{
$randstring = randomString(12);
}
$endTimeString = microtime();
$startTime = (integer) substr($startTimeString, strrpos($startTimeString, " "), 100);
$endTime = (integer) substr($endTimeString, strrpos($endTimeString, " "), 100);
$elapsed = $endTime - $startTime;
$elapsedPerCall = $elapsed/$iterations;
echo "Final random number is $randstring\n";
echo "Elapsed time is $elapsed seconds\n";
echo "That's $elapsedPerCall seconds per call.\n";
echo "</big></big></pre>\n";
?>
The preceding code produced the following output on my browser:
Starting 30000 iterations of random number generation...
Final random number is XKLYEQWAWBVC
Elapsed time is 11 seconds
That's 0.00036666666666667 seconds per call.
366 microseconds isn't bad unless you're getting thousands of visits per hour, and if you are, you're probably running more than a dual Celeron 450. I believe that this base 26 representation of a random number, when concatenated with an autoincrement, is the best compromise between performance and security.
Autoincrementing
Autoincrementing on a busy site is anything but trivial. It's perfectly possible
for two users to appear at the same nanosecond. Will the autoincrement work
properly, will it grant the two users duplicate autoincrements, or will it
malfunction in some other way? If you're already working with a database,
perhaps the simplest way is to use the database. Let's use PostgreSQL as an example.
Using psql, create a table called increments with columns type and number:
create table increments (type char(8), number integer);
Pre-load it with this row:
insert into increments (type, number) values ('sid', 100001);
For the purposes of this exercise, be sure to grant all privileges for this table to the user under which httpd runs (user apache on my box).
Now create a test-jig program called autoincrement.php to test it:
<?php
echo "<pre>";
$starttime = microtime();
echo $starttime . "\n";
$connection = pg_Connect ("dbname=mydb port=5432 user=apache");
if($connection == 0)
{
die("Connection failed\n");
}
else
{
echo "<p>Connection succeeded</p>\n";
}
$result = pg_Exec($connection,
"select number from increments where type='sid';");
$row = pg_fetch_row($result, 0);
$number = $row[0] + 1;
$result = pg_Exec($connection,
"update increments set number=$number where type='sid';");
echo "\nNew increment is: " . $number . "\n";
pg_close($connection);
$endtime = microtime();
echo $endtime . "\n";
echo "</pre>";
?>
Hit it with a browser, and you'll see that each refresh increments the number.
Autoincrement traffic copping
The preceding code is cute, but what happens if autoincrement requests occur
within nanoseconds of each other? Watch the following simulated disaster:
<?php
echo "<pre>";
$starttime = microtime();
echo $starttime . "\n";
$connection = pg_Connect ("dbname=mydb port=5432 user=apache");
if($connection == 0)
{
die("Connection failed\n");
}
else
{
echo "<p>Connection succeeded</p>\n";
}
$result = pg_Exec($connection,
"select number from increments where type='sid';");
$row = pg_fetch_row($result, 0);
$number1 = $row[0] + 1;
$result = pg_Exec($connection,
"select number from increments where type='sid';");
$row = pg_fetch_row($result, 0);
$number2 = $row[0] + 1;
$result = pg_Exec($connection,
"update increments set number=$number1 where type='sid';");
$result = pg_Exec($connection,
"update increments set number=$number2 where type='sid';");
echo "\nNew first increment is: " . $number1 . "\n";
echo "\nNew second increment is: " . $number2 . "\n";
pg_close($connection);
$endtime = microtime();
echo $endtime . "\n";
echo "</pre>";
?>
The preceding code shows what happens if a second autoincrement request comes in between the select and the update for the first one. Both requests get the same number -- a disaster when driving a website with session id's. Concatenating the autoincrement with a random number helps, but with the two requests arriving so close together it's better not to depend on the random number, because two of its three seed factors are time based, and the other one is unknown.
You could fix this with locks, timeout code and anti-deadlock code. Ughhh!
It has the advantage of database portability (more or less), but it can get
nasty.
My preference is to work directly at the database level, using a stored procedure to accomplish both the increment and the return of the number as a single transaction. Thus the database queues all the requests. Everybody gets incremented, and nobody gets a duplicate or any other bogus problem.
Create the following text file, called incr.sql:
drop function incr(text);
create function incr(text) returns int8 as '
declare
mytype char(8);
rtrn record;
begin
mytype := $1;
SELECT number into rtrn FROM increments WHERE type=mytype;
rtrn.number := rtrn.number + 1;
update increments set number=rtrn.number where type=mytype;
return rtrn.number;
end;
' language 'plpgsql';
Now, within the psql environment, run the following command:
\i incr.sql
Depending on where you started psql from, you might need to put the complete path on the filename in the preceding command. If all goes well, psql should issue a message saying "DROP" followed by another saying "CREATE". What has happened is that it dropped function incr(text) and then created it again. If psql gripes about "permission denied", the user from which you ran psql doesn't have permission to create and drop functions (stored procedures). Those rights must be granted by the postgres user. Also, only the owner of a function can drop it, so if the incr(text) function was previously created by a different user, that user must drop it.
Once the function is in place, you can test it from within psql, because the function is implemented in PostgreSQL, not in PHP code. Within psql, issue the following command:
select incr('sid');
If you run the preceding command twice, you'll see the number increment. Within the psql environment it should look something like this:
mydb=> select incr('sid');
incr
--------
100029
(1 row)
mydb=> select incr('sid');
incr
--------
100030
(1 row)
mydb=>
If the command doesn't work, perhaps you need to grant the user proper permissions. Try this from user postgres:
grant select,update on increments to apache;
And if that doesn't work, temporarily try brute force:
grant all on increments to apache;
Later you can take away privileges with the revoke command.
Once you can autoincrement within the psql environment, you can do it in the PHP environment. Create the following inctest.php:
<?php
echo "<pre><big><big>\n";
$number=0;
$iterations = 1000;
echo "Starting $iterations iterations of random number generation...\n";
$startTimeString = microtime();
for($i=0; $i < $iterations - 1; $i++)
{
$number = getNextIncrement('sid');
}
$endTimeString = microtime();
$startTime = (integer) substr($startTimeString, strrpos($startTimeString, " "), 100);
$endTime = (integer) substr($endTimeString, strrpos($endTimeString, " "), 100);
$elapsed = $endTime - $startTime;
$elapsedPerCall = $elapsed/$iterations;
echo "Final increment number is $number\n";
echo "Elapsed time is $elapsed seconds\n";
echo "That's $elapsedPerCall seconds per call.\n";
echo "</big></big></pre>\n";
?>
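The test-jig calls getNextIncrement(), whose definition isn't shown above. Here's a minimal sketch, assuming it simply wraps the incr() stored procedure with the same connection parameters as the earlier examples (the function body is an assumption, not the original author's code):

```php
<?php
// Hypothetical wrapper around the incr() stored procedure. Note that
// it opens (and never closes) a connection per call, which is part
// of why the timing discussed below is so poor.
function getNextIncrement($type)
{
    $connection = pg_Connect("dbname=mydb port=5432 user=apache");
    if ($connection == 0)
    {
        die("Connection failed\n");
    }
    $sql = "select incr('" . $type . "');";
    $result = pg_Exec($connection, $sql);
    $row = pg_fetch_row($result, 0);
    return $row[0];
}
?>
```

The $sql variable here is the one the sequence substitution at the end of this section swaps out.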
The result is ugly. It takes 6 seconds for 1000 iterations, and every time you run it, it goes up from there. Can you guess what went wrong?
You might think it's because I never closed the connection, or that repeatedly opening connections is very expensive. Although these might be true, the problem is more subtle. Internally PostgreSQL processes the update as a delete followed by an insert, and those "deleted" records lie around bloating the database. As user postgres you can actually see the bloat by first running a du . command, then accessing the preceding PHP program from a browser, then once more running a du . command. Certain directories will show an increase. Trace it down to a specific file, and you can see that file grow every time you refresh your browser.
Performing the following psql command as user postgres will cure the problem:
vacuum full;
But of course the database will bloat again as more increments are done.
If you're using PostgreSQL, you can triple the best case incrementation speed by substituting a sequence for the stored procedure. Do the following, as user postgres, within the psql environment:
create sequence sidseq;
grant update on sidseq to apache;
Finally, change the $sql variable in the web app from this:
select incr('sid');
To this:
select nextval('sidseq');
Summary
Because of the stateless nature of the HTTP protocol, achieving data persistence between called web pages is a pain. The more variables, the more pain. A great method is to pass a single session id variable, and within each page use that session id to look up other variables in the database.
Such a session id must possess two properties:
1. Low likelihood of accidental distribution of duplicates
2. Extremely difficult to forge through reverse engineering, brute force or other methods.
#1 is best achieved by an autoincrement, but autoincrements are incredibly easy to forge or reverse engineer, and even if they're obfuscated with something like md5(), it's all too easy to brute force them through a program that iterates numbers and runs those numbers through md5(), comparing the results to a legitimately obtained autoincrement.
#2 is best achieved through a random number. But random numbers have a possibility of duplication that may be too high.
By concatenating an autoincrement with a properly generated random number, the session id is incredibly secure. Sure -- it could be cracked -- but the bad guy would probably enter through easier doors than that one.
A properly generated random number is seeded with something dependent on more than just time, because given enough persistence a time-seeded random number could be brute force reverse engineered. I recommend seeding the random number generator with a function with these three inputs:
1. The return from an unseeded call to mt_rand()
2. The seconds since Epoch
3. The microseconds on the system clock
Such a session id -- even if passed in the URL -- is secure enough that your time would be better spent hunting down other security risks. And there are plenty more. See PHP Data Security.
Ā [
Troubleshooters.com
|
Code Corner
|
Email
Steve Litt
] |
| Markdown | ## [Troubleshooters.Com](https://www.troubleshooters.com/troubleshooters.htm) and [Code Corner](https://www.troubleshooters.com/codecorn/index.htm) Present
# [PHP Power Pointers](https://www.troubleshooters.com/codecorn/php/index.htm):
# PHP Data Persistance
[Copyright (C) 2002 by Steve Litt](https://www.troubleshooters.com/cpyright.htm)
***
**Contents**
- **[Introduction](https://www.troubleshooters.com/codecorn/php/persist.htm#Introduction)**
- **[Security Warning\!](https://www.troubleshooters.com/codecorn/php/persist.htm#Security_Warning)**
- **[Challenges of Statelessness](https://www.troubleshooters.com/codecorn/php/persist.htm#Challenges_in_a_Full_Featured_Data)**
- **[Generating Session IDs](https://www.troubleshooters.com/codecorn/php/persist.htm#Generating_Session_IDs)**
- **[Summary](https://www.troubleshooters.com/codecorn/php/persist.htm#Summary)**
# Introduction
This document assumes you have computer programming knowledge and that you know your way around PHP, Linux and UNIX. Rather than spending huge amounts of time discussing cookies or PHP4's sessions, this document discusses the ramifications of HTTP's statelessness, alternatives to get around it, database variable lookup by passed session key, and session key generation tips and techniques.
This document assumes PHP is installed and functioning on your system, and that you're reasonably familiar with PHP.
# Security Warning\!
Throughout this document it is assumed that web access to PostgreSQL data is through user `apache`, or whatever user your `httpd` daemon runs as. Although that's the easiest way to do it, it's by no means the most secure. Anyone co-hosted on the same box as your website can access your data by writing their own PHP scripts, because their access to user `apache` is the same as yours. *Ident* methods of authentication raise similar concerns.
Data enabled web apps have many gotchas, especially if there are multiple website owners on a single host computer sharing one Apache and one DBMS. For a further discussion of this, and some solutions, see [PHP Data Security](https://www.troubleshooters.com/codecorn/php/security.htm).
# Challenges of Statelessness
Most web app challenges revolve around the fact that HTTP is a stateless protocol. Statelessness means that when you transition from one web page to another, all variable values are forgotten. This is true whether the page is static or rendered by a script. Any variables, and any values, are forgotten.
As a programmer, a good way of envisioning this challenge is to imagine a programming language without global variables. Any variable set in one web page is out of scope in another.
There are three basic ways to retain variables and values between pages:
1. Cookies
2. Passing the values
3. Database lookup
## Cookies
I lied about no global variables between web pages. Cookies can hold global variables accessible from multiple web pages. Cookies are little files kept on the user's computer that store the desired values. While this might seem ideal, cookies have several potential problems:
1. Older browsers don't recognize cookies
2. Many people set their browsers to reject cookies for security reasons
Reason \#2 is critical. Cookies can be made to track your browsing activities and other very personal information. Cookie-compliant browsers all provide methods of rejecting cookies entirely, or on a per-domain basis.
The bottom line is that if you need your website to work with all users, cookies might help with some but you need a backup plan for those rejecting or unable to process cookies.
## Value passing
All computer programmers know global variables can spell disaster through side-effect problems. So instead of using global variables, we pass necessary information as subroutine arguments. It's inconvenient but much safer.
You can pass values between web pages. As in subroutines, it's inconvenient (actually much more inconvenient than passing between subroutines). But it can be done, and it doesn't rely on cookies.
The two methods of passing data between web pages are the GET method and the POST method. GET is much easier, POST is much more secure. Professional websites are usually better off with the POST method.
### The GET Method
The GET method simply passes variable name/value pairs on the URL after a question mark. This method can be used by all web pages, whether the transfer mechanism is a link or a form's submit button.
Unfortunately, the variables are plain text and are subject to tampering by the user. On visiting an ecommerce site using the GET method, I was able to modify the URL such that I could purchase a \$10.00 book for \$1.00. Naturally I didn't complete the transaction, but an unscrupulous person could have easily ordered books at 1/10 the price and then resold them. Only a separate validation script or a human audit could have uncovered the problem.
The GET method is best used for only the most harmless data, or data which cannot easily be intelligently forged.
It's possible to generate a *session id* variable from microsecond-seeded random numbers, sometimes combined with other data. If constructed right, such session id's are for all practical purposes impossible to reverse engineer. Sometimes such a session id can be passed via the URL without undue risk.
One more problem with passing data in the URL is that crackers and script kiddies can tack malicious commands onto the end of an otherwise good URL. Be sure every page receiving GET requests immediately truncates excess characters from the URL; after splitting the URL into variables, immediately check each for the right length and other validity, then immediately look up the database session info with the session id. If any of these steps fails, terminate the script immediately.
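As a sketch of those validation steps, consider something like the following. The function name `validate_sid()` and the 32 character lowercase-hex format are assumptions; adjust them to match however your session ids are actually generated.

```php
<?php
// Hypothetical sketch: truncate, length-check, and pattern-check a
// session id arriving via GET before it ever touches the database.
function validate_sid(string $raw): ?string {
    $sid = substr($raw, 0, 32);             // truncate excess characters
    if (strlen($sid) !== 32) {
        return null;                        // wrong length: reject
    }
    if (!preg_match('/^[0-9a-f]{32}$/', $sid)) {
        return null;                        // md5-style sids are lowercase hex
    }
    return $sid;                            // safe to use in a database lookup
}

if (isset($_GET['sid'])) {
    $sid = validate_sid($_GET['sid']);
    if ($sid === null) {
        exit('Invalid session.');           // terminate immediately on failure
    }
    // ...proceed to look up session data keyed by $sid...
}
```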
### The POST Method
The POST method of value passing is available ONLY from HTML forms. It cannot be applied by links, even if those links are contained within a form. Therefore, to retain state without resorting to the GET method, every exit point on every page must be a button on a form. While inconvenient, this is the most secure method of passing data between pages.
## Database lookup
Oops, I lied again. You can have the equivalent of global variables without cookies. You can simply keep the information in a database. But there's one problem -- how do you retrieve the information once you've switched pages?
The new page must receive at least one piece of data -- the lookup key for the remaining data, by means other than the database. That means either cookies or value passing.
The way this is typically done is that the first page visited generates a *session id*. The session id is a unique string identifying a specific visit by a specific visitor. The string typically involves the concatenation or combination of a random number seeded by a timestamp and an autoincrement. For additional resistance to cracking, and a reduced likelihood of duplicates, other information might be involved, such as the user's IP address or the Apache process id. Whatever is chosen, it's vital that it not be guessable, because if it can be guessed, a bad guy can guess at a session id and then masquerade as a different user. PHP's `md5()` function can change an otherwise guessable session id into an effectively unguessable 32 character string which can be used as the session id.
As soon as the session id is created, a row for that session id is created in the database. This row has all the necessary information for the various web pages on the site. For instance, it might contain the invoice info for a shopping cart, and a relation to the user's shopping cart.
The combination of storing most variables in a database and passing a unique key through POST, cookies, or even GET (provided the session id has been made unguessable) is an excellent solution. It's convenient in that only a single variable need be passed around, with all other variables regenerated in each page. If a website enhancement necessitates a new piece of information, that new piece can be added to the database, rather than simultaneously adding a variable to 20 web scripts. Another advantage of the key/database hybrid is confusion reduction.
Without the database repository, every variable used by every script would need to be passed by every other script, because even if neither the passing nor receiving script needs the variable, once it's dropped there's no way that the one script needing it can regenerate it.
The session ids MUST be unguessable. Depending on how you create them, unique session ids might still be guessable. As a simple case, session ids could simply be incremented. If so, a bad guy could use your site, then decrement the session id to hack into somebody else's session. If you're passing the session id in the URL (GET method), even a 12 year old could perform this crack. However, using PHP's `md5()` function, you can hash your unique key into a new 32 character string that bears no resemblance to the old string and cannot practically be converted back to it. Simply use the 32 character string as the session id, and the likelihood of a cracker changing it to an existing good session id (good meaning a session id that's been updated in, say, the past hour) approaches zero.
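A minimal sketch of that `md5()` idea follows. The specific ingredients mixed into the hash here are assumptions; the point is only that the output bears no resemblance to the sequential input.

```php
<?php
// A guessable, sequential key...
$increment = 100002;
// ...combined with less predictable material, then hashed. A cracker
// who sees the 32 character result cannot decrement it to reach a
// neighboring session.
$sid = md5($increment . microtime() . mt_rand());
echo $sid . "\n";   // 32 lowercase hex characters
```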
If your app doesn't use a database, it's possible to use a flat file or some other mechanism to store all the non-session-id variables. However, finding a way for the http server to write and read those variables without allowing crackers to read them is difficult, and out of the scope of this article.
In summary, a hybrid with a single session id passed everywhere, and the rest looked up by session id in each script, is probably the best methodology. The session id can be passed by cookies where possible, and by POST or GET where it must be. If the session id is sufficiently unforgeable (perhaps through random numbers or a checksum of some sort), it can even be passed in the URL as a GET request.
# Generating Session IDs
The stateless nature of HTTP is always a hassle, but many developers find partial solace by keeping most variables in the database, passing only the session id, and looking up the rest of the variables by session id. Use of session ids presents two different risks which must be handled:
1. Risk of a user or cracker forging a legitimate session id
2. Risk of accidentally handing out duplicate session id's
These are two very different risks, and both must be handled. The more important, valuable or private the information being handled, the more precautions must be taken against these two risks.
Absolute duplication prevention can be handled by a well designed autoincrement mechanism. Unfortunately, autoincrements are extremely easy to forge. The risk of forgery can be limited to one in trillions with a well designed random number generator. Unfortunately, a random number's uniqueness is only as good as its seed, and the best that can be hoped for is about one in a million on a system with a true microsecond clock. For clocks with less precision, the likelihood is worse. So it's a good idea to seed the random number generator with numbers other than just the microsecond clock.
Therefore, to minimize both the risk of forgery and the risk of accidental duplication, you can use a combination (concatenation) of a random number and an autoincrement. If you completely trust the autoincrementer, you can run the increment through the `md5()` function to turn it into a 32 digit hex number, thereby reducing the chance of successful forgery to one in trillions even on the most heavily trafficked sites. However, unless you completely trust your autoincrement, you're better off using it in combination with a random number, so that even if the autoincrementer occasionally returns a duplicate, the random number will likely provide uniqueness.
## Minimizing forging risks
Passed variables can be changed, and a passed session id is no exception. Changing the session id is trivial if the session id is passed in the URL (GET method), but a skillful cracker can also change the session id even if the POST method is used.
You cannot prevent the user from changing the session id, but you can take steps to minimize the chance of such forgeries mimicking a legitimate session id. What you need is a random number, and the random number must be large enough that its range overwhelms the number of legitimate numbers. For example, if your session id's are 12 character base 26 random numbers (each "digit" is an upper case letter), that offers 9.54 x 1016 possible session id's. If your site receives a million visitors a day, and if you regularly purge sessions over 24 hours old, then at a given time you'll have a million legitimate session id's. The chance of someone forging a legitimate one is (1 x 106)/9.54 x 1016), or 1 in 94.5 billion. The risk is miniscule compared to more pressing risks facing your business, and if you want to further decrease the risk of a successful forgery, you can add more digits. Adding 5 more digits would decrease the risk by another 11 million, or approximately 1 in a million trillion.
## Minimizing accidental duplicate risks
There's no such thing as a truly random algorithm. PHP's `mt_rand(lowerlimit, upperlimit)` routine is excellent at producing seemingly random numbers evenly distributed across the range defined by its arguments, but in fact its starting place is always determined by its seed. If you don't seed it, you have no idea how random it will really be. And if you seed it with the microsecond clock alone (explained later), the odds of duplication are no better than a million to one. For a well trafficked site that's not good enough for the session id.
Including the seconds since epoch in the session ID seed drops the duplication risk much further -- for practical purposes limiting the possibility to cases where users arrive within the same microsecond, or within a number of microseconds corresponding to the time granularity of the system. Depending on your traffic and the degree of loss created by an inadvertent duplicate, the appending of seconds since epoch might be sufficient. But for more safety, it's all too easy to add in a third number -- an unseeded random number. Unseeded random numbers aren't necessarily random, but when combined with a time element the two make it very difficult to reverse engineer the produced random number.
Other values that could be either concatenated into the session id or used for the seed include:
1. The incoming IP address
2. The difference between a second `microtime()` call and the first one, multiplied by a suitable multiplier

The second `microtime()` call might seem like an opportunity to exploit system variation for more variability, but my tests on my box indicate it cuts the risk of duplication by maybe a factor of 30. That's better than nothing, but not much. The incoming IP address might help, but given the fact that large ISPs funnel many users through the same IP address, its added safety is unpredictable.
Probably the easiest way to autoincrement in a secure way is with a single row of a table containing a key, and a data column containing an integer to be incremented. When someone needs a new number, they lock the table, grab the value, increment it, write it back, and unlock it. If two people try to perform this process at the same time, Postgres acts as a traffic cop.
Locking can be a mess for the application programmer to implement, so later we'll discuss how to have Postgres do the whole thing with a stored procedure (a *function* in Postgresese).
## Random number generation
PHP gives you two outstanding tools to use in generating random number session ids: `microtime()` and `mt_rand()`. The `microtime()` command returns a string consisting of a float, a space, and an integer. The float is between 0 and 1, representing the microseconds on the system clock. The integer is the number of seconds elapsed since 1/1/1970, or whatever epoch the computer uses. To test it, create the following `microtime.php` file:
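The original file listing was lost in conversion to this format; a minimal reconstruction consistent with the description would be:

```php
<?php
// microtime.php -- reconstructed sketch. Displays the raw string
// returned by microtime(): a float, a space, and an integer.
echo microtime();
```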
Pull up `microtime.php` in a browser, and you'll see the two numbers. Click your browser's refresh button several times and note that the first number changes almost randomly, while the second number corresponds to elapsed seconds and changes only in the rightmost couple of digits.
To actually acquire the numeric value of the microseconds and seconds parts of the string, do the following:
```
$timestring = microtime();                    // snapshot: "microseconds seconds"
$microseconds = (double) $timestring;         // the cast stops at the space
$seconds = (integer) substr($timestring, strrpos($timestring, " "));
```
First you capture the time in a snapshot, and from that point forward you work only on the snapshot. PHP's type conversion considers the space as the end of the floating point number, so the typecast retrieves the correct amount. The second part of the string is more difficult: you need to find the space's position and retrieve only the portion of the string after that space (the leading space is ignored by the integer cast).
Prove this to yourself by changing `microtime.php` to the following:
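The modified listing was also lost in conversion; a reconstruction along the lines the text describes (printing the raw string, then the parsed values beneath it for comparison) might be:

```php
<?php
// Reconstructed sketch: print the raw microtime() string, then the
// parsed numeric values under it for visual comparison.
$timestring = microtime();
$microseconds = (double) $timestring;
$seconds = (integer) substr($timestring, strrpos($timestring, " "));
echo $timestring . "<br>\n";
echo $microseconds . "        " . $seconds . "\n";
```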
The preceding should print the string, and then a concatenation of the numeric values, and they should obviously be the same. Note that depending on how precise your system's microsecond reading is, you'll need to vary the number of spaces between `$microseconds` and `$seconds` in order to have them line up below their string equivalents. And of course, if `$microseconds` ends in one or more zeros, it will be shorter and the `$seconds` will move left. But this should be enough to demonstrate the workings of the `microtime()` command.
As discussed earlier in this section, using only microseconds as a random number seed leaves you open to inadvertent duplicates. Much better is to include the seconds since Epoch, and even better is to also throw in an unseeded random number. Between those three, it's almost impossible for a cracker to set up a machine to duplicate your seed and thus guess your random numbers.
The easiest way to generate a random number is in base 26, using A-Z, with the random number generator seeded as described above. Here's the code to generate a 12 character base 26 number:
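The original listing was lost in conversion. A reconstruction consistent with the surrounding text follows; the name `randomString()` comes from the benchmark discussion below, and the exact way the three seed ingredients are combined is an assumption.

```php
<?php
// Reconstructed sketch of a 12 character base 26 session id generator.
// Seed ingredients per the text: microseconds, seconds since epoch,
// and an unseeded random number.
function randomString(int $length = 12): string {
    $timestring = microtime();
    $microseconds = (double) $timestring;
    $seconds = (integer) substr($timestring, strrpos($timestring, " "));
    mt_srand((int)($microseconds * 1000000) ^ $seconds ^ rand());
    $str = "";
    for ($i = 0; $i < $length; $i++) {
        $str .= chr(mt_rand(0, 25) + ord('A'));   // one base 26 "digit"
    }
    return $str;
}
echo randomString() . "\n";
```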
The preceding code generates a 12 character base 26 number. There are 9.54 Ɨ 10^16 such numbers, meaning that if your site gets a *billion* visits per year, the chance of duplicate numbers being handed out in a year is over one in a million. Assuming you purge old session records daily, that becomes less than one in 365 million. If these aren't good enough odds for you, tack on an additional 5 characters to reduce the likelihood of duplicates another 11.9 million times. At that point you're more at risk of being killed by an alligator than you are of dealing out a duplicate. This is especially true because the seed is based on seconds since epoch, microseconds, and an unseeded random number.
What is the performance effect of the randomString() function? Let's run 30000 iterations on my unloaded dual Celeron 450 with 512MB of RAM. Here's the loop:
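The benchmark loop was lost in conversion; a reconstruction might look like the following. The generator is repeated here so the sketch is self-contained, and `microtime(true)` is used for brevity even though the original 2002 article would have parsed the string form.

```php
<?php
// Reconstructed timing loop: 30000 calls to randomString(), reporting
// the average cost per call in microseconds.
function randomString(int $length = 12): string {
    $timestring = microtime();
    mt_srand((int)((double)$timestring * 1000000)
        ^ (int) substr($timestring, strrpos($timestring, " "))
        ^ rand());
    $str = "";
    for ($i = 0; $i < $length; $i++) {
        $str .= chr(mt_rand(0, 25) + ord('A'));
    }
    return $str;
}

$iterations = 30000;
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    randomString();
}
$elapsed = microtime(true) - $start;
printf("%.2f microseconds per call\n", $elapsed / $iterations * 1000000);
```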
On my browser, the preceding code reported roughly 366 microseconds per call. That isn't bad unless you're getting thousands of visits per hour, and if you are, you're probably running more than a dual Celeron 450.
I believe that this base 26 representation of a random number, when concatenated with an autoincrement, is the best compromise between performance and security.
## Autoincrementing
Autoincrementing on a busy site is anything but trivial. It's perfectly possible for two users to appear at the same nanosecond. Will the autoincrement work properly, will it grant the two users duplicate autoincrements, or will it malfunction in some other way? If you're already working with a database, perhaps the simplest way is to use the database. Let's use PostgreSQL as an example.
Using `psql`, create a table called `increments` with columns `type` and `number`:
```
create table increments (type char(8), number integer);
```
Pre-load it with this row:
```
insert into increments (type, number) values ('sid', 100001);
```
For the purposes of this exercise, be sure to grant all privileges on this table to the user under which `httpd` runs (user `apache` on my box).
Now create a test-jig program called `autoincrement.php` to test it:
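The test-jig listing was lost in conversion; a reconstruction using PHP's PostgreSQL extension might look like this (the connection string is an assumption for your environment):

```php
<?php
// autoincrement.php -- reconstructed sketch. Grabs the current value
// from the increments table, bumps it, and writes it back. Note that
// without locking, this select-then-update sequence is racy; the
// "simulated disaster" below demonstrates exactly that problem.
$conn = pg_connect("dbname=test user=apache");   // assumed connection string
$result = pg_query($conn, "select number from increments where type='sid'");
$row = pg_fetch_row($result);
$number = (int) $row[0] + 1;
pg_query($conn, "update increments set number=$number where type='sid'");
echo "New increment: $number\n";
pg_close($conn);
```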
Hit it with a browser, and you'll see that each refresh increments the number.
### Autoincrement traffic copping
The preceding code is cute, but what happens if autoincrement requests occur within nanoseconds of each other? Watch the following simulated disaster:
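The disaster-simulation listing was lost in conversion; a reconstruction of the idea follows (connection details assumed). Two "requests" both select before either updates, so both receive the same number.

```php
<?php
// Reconstructed sketch: interleave two select/update pairs to show
// how two near-simultaneous requests get duplicate increments.
$conn = pg_connect("dbname=test user=apache");   // assumed connection string

// Request A selects...
$a = (int) pg_fetch_result(pg_query($conn,
    "select number from increments where type='sid'"), 0, 0);
// ...and before A can update, request B selects the same value.
$b = (int) pg_fetch_result(pg_query($conn,
    "select number from increments where type='sid'"), 0, 0);

pg_query($conn, "update increments set number=" . ($a + 1) . " where type='sid'");
pg_query($conn, "update increments set number=" . ($b + 1) . " where type='sid'");

// Both requests report the same "new" number -- duplicate session ids.
echo "Request A got " . ($a + 1) . ", request B got " . ($b + 1) . "\n";
```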
The preceding code shows what happens if a second autoincrement request comes in between the select and the update of the first one. Both requests get the same number -- a disaster when driving a website with session ids. Concatenating the autoincrement with a random number helps, but because of the closeness of the two requests' arrival times it's better not to depend on the random number: two thirds of its seed factors are time based, and the other is unknown.
You could fix this with locks, timeout code and anti-deadlock code. Ughhh! It has the advantage of database portability (more or less), but it can get nasty.
My preference is to work directly at the database level, using a stored procedure to accomplish both the increment and the return of the number as a single transaction. Thus the database queues all the requests. Everybody gets incremented, and nobody gets a duplicate or any other bogus problem.
Create the following text file, called `incr.sql`:
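The listing was lost in conversion; a reconstruction consistent with the drop/create messages described below would be a PL/pgSQL function that increments the counter and returns the new value in one transaction (the function body here is an assumption, written in the quoted style of PostgreSQL of that era):

```sql
-- incr.sql -- reconstructed sketch. Drops and recreates incr(text),
-- which bumps the counter for the given type and returns the new
-- value. The update's row lock serializes concurrent callers.
drop function incr(text);
create function incr(text) returns integer as '
declare
    newnumber integer;
begin
    update increments set number = number + 1 where type = $1;
    select number into newnumber from increments where type = $1;
    return newnumber;
end;
' language 'plpgsql';
```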
Now, within the `psql` environment, run the following command:
```
\i incr.sql
```
Depending on where you started `psql` from, you might need to put the complete path on the filename in the preceding command. If all goes well, `psql` should issue a message saying "DROP" followed by another saying "CREATE". What has happened is that it dropped function `incr(text)` and then created it again. If `psql` gripes about "permission denied", the user from which you ran `psql` doesn't have permission to create and drop functions (stored procedures). Those rights must be granted by the `postgres` user. Also, only the owner of a function can drop it, so if the `incr(text)` function was previously created by a different user, that user must drop it.
Once the function is in place, you can test it from within `psql`, because the function is implemented in PostgreSQL, not in PHP code. Within `psql`, issue the following command:
```
select incr('sid');
```
If you run the preceding command twice, you'll see the number increment each time within the `psql` environment.
If the command doesn't work, perhaps you need to grant the user proper permissions. Try this as user `postgres`:
```
grant select,update on increments to apache;
```
And if that doesn't work, temporarily try brute force:
```
grant all on increments to apache;
```
Later you can take away privileges with the `revoke` command.
Once you can autoincrement within the `psql` environment, you can do it in the PHP environment. Create the following `inctest.php`:
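The listing was lost in conversion; a reconstruction that times 1000 calls to the stored procedure might look like this (connection details assumed; the per-iteration connection mirrors the naive style the text goes on to criticize):

```php
<?php
// inctest.php -- reconstructed sketch. Calls incr('sid') 1000 times,
// opening a fresh connection each iteration, and reports the elapsed
// time.
$iterations = 1000;
$start = time();
for ($i = 0; $i < $iterations; $i++) {
    $conn = pg_connect("dbname=test user=apache");   // assumed
    $result = pg_query($conn, "select incr('sid')");
    $number = pg_fetch_result($result, 0, 0);
}
echo "Last number: $number, elapsed: " . (time() - $start) . " seconds\n";
```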
The result is ugly. It takes 6 seconds for 1000 iterations, and every time you run it it goes up from there. Can you guess what went wrong?
You might think it's because I never closed the connection, or that repeatedly opening connections is very expensive. Although these might be true, the problem is more subtle. Internally, PostgreSQL processes each update as a delete followed by an insert, and those "deleted" records lie around bloating the database. As user `postgres` you can actually see the bloat by first running a `du .` command, then accessing the preceding PHP program from a browser, then once more running a `du .` command. Certain directories will show an increase. Trace it down to a specific file, and you can see that file grow every time you refresh your browser.
Performing the following `psql` command as user `postgres` will cure the problem:
```
vacuum full;
```
But of course the database will bloat again as more increments are done.
If you're using PostgreSQL, you can triple the best case incrementation speed by substituting a sequence for the stored procedure. Do the following, as user `postgres`, within the `psql` environment:
```
create sequence sidseq;
grant update on sidseq to apache;
```
Finally, change the `$sql` variable in the web app from this:
```
select incr('sid');
```
To this:
```
select nextval('sidseq');
```
# Summary
Because of the stateless nature of the HTTP protocol, achieving data persistence between web pages is a pain. The more variables, the more pain. A great method is to pass a single session id variable, and within each page use that session id to look up other variables in the database.
Such a session id must possess two properties:
1. Low likelihood of accidental distribution of duplicates
2. Extremely difficult to forge through reverse engineering, brute force or other methods.
\#1 is best achieved by an autoincrement, but autoincrements are incredibly easy to forge or reverse engineer, and even if they're obfuscated with something like `md5()`, it's all too easy to brute force them with a program that iterates numbers, runs them through `md5()`, and compares the results to a legitimately obtained autoincrement.
\#2 is best achieved through a random number. But random numbers have a possibility of duplication that may be too high.
By concatenating an autoincrement with a properly generated random number, the session id becomes incredibly secure. Sure -- it could be cracked -- but the bad guy would probably enter through easier doors than that one.
A properly generated random number is seeded with something dependent on more than just time, because given enough persistence a time-seeded random number could be brute force reverse engineered. I recommend seeding the random number generator with a function with these three inputs:
1. The return from an unseeded call to `mt_rand()`
2. The seconds since Epoch
3. The microseconds on the system clock
Such a session id -- even if passed in the URL, is secure enough that your time would be better spent hunting down other security risks. And there are plenty more. See [PHP Data Security](https://www.troubleshooters.com/codecorn/php/security.htm).
# \[ [Troubleshooters.com](https://www.troubleshooters.com/troubleshooters.htm) \| [Code Corner](https://www.troubleshooters.com/codecorn/index.htm) \| [Email Steve Litt](https://www.troubleshooters.com/email_steve_litt.htm) \] [Copyright (C) 2002 by Steve Litt](https://www.troubleshooters.com/cpyright.htm#top) -- [Legal](https://www.troubleshooters.com/cpyright.htm#top)