Version | License | PHP version | Categories |
---|---|---|---|
searchable-file 1.1.4 | BSD License | 5.5 | PHP 5, Files and Folders, Searching, T... |
The SearchableFile class is designed to allow you to handle text files which __do not fit into memory__, and will never fit into memory, whatever your php.ini file settings are. With this class, you can operate on file contents and analyze them the way a lexer or parser would.
It has been designed to minimize the overhead of performing file IO to get the real data, compared to working directly in memory.
The initial motivation for this class was to be able to handle big RTF contents, while preserving performance. However, as you will see, it is completely independent of the underlying file format.
The SearchableFile class can be seen as some kind of wrapper around a text file opened in read-only mode. It won't allow you to perform in-place modifications, since it's aimed at reading text streams, analyzing contents and optionally performing some modifications on-the-fly, then finally writing them to some output stream.
Creating a SearchableFile object is pretty simple :
$sf = new SearchableFile ( ) ;
You can also specify a block size for IO operations (the default is 16 KB), as well as a cache size, expressed as a number of records :
$sf = new SearchableFile ( 64 * 1024, 128 ) ;
Then you will have to open a file :
$sf -> Open ( 'myfile.txt' ) ;
Once the file is opened, you can use any of the search functions that the class has to offer ; the following example looks for the first character in the set [\\{}] :
$index = $sf -> strchr ( "\\{}" ) ;
Or you can just extract a substring from your file :
$text = $sf -> substr ( 10000, 20 ) ; // Extract 20 characters at position 10000
You can also use the object as an array, to access individual characters :
$ch = $sf [1000000] ;
or cycle through file contents using an iterator :
foreach ( $sf as $ch )
{
// do something with the character $ch
}
However, please note that such constructs should be used with care : accessing a large file one character at a time is very slow in PHP.
You can also use the equivalent of the preg\_match() and preg\_match\_all() builtin PHP functions ; the following example tries to find the first occurrence of the "\pict" or "\bin" strings :
$status = $sf -> pcre_match ( '/\\\\((bin)|(pict))/', $match, PREG_OFFSET_CAPTURE ) ;
while the following will try to find all the occurrences of those strings :
$status = $sf -> pcre_match_all ( '/(?P<pattern>\\\\((bin)|(pict)))/', $match, PREG_OFFSET_CAPTURE ) ;
Please have a look at the Making the pcre functions work section later in this document, because there are some restrictions on using them.
All the examples provided with this package use a file named "verybigfile.rtf" and assume that this is an RTF file which contains embedded pictures and drawing objects. They should be used as command-line scripts.
I won't pollute this repository by providing a useless data file of almost 1 GB, but you can recreate it very easily by concatenating a smaller file with itself as many times as needed :
Under Unix systems, you can do it that way :
cat myfile.txt myfile.txt ... myfile.txt >verybigfile.rtf
The same, using MS-DOS commands on Windows systems :
copy myfile.txt+myfile.txt...+myfile.txt verybigfile.rtf
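If you prefer a portable alternative to the shell commands above, the same concatenation can be sketched in PHP. The repeat\_file() helper below is hypothetical (it is not part of the package) and simply appends a number of copies of a source file to a target file :

```php
<?php
// Hypothetical helper, not part of the SearchableFile package :
// writes $copies consecutive copies of $source into $target.
function repeat_file($source, $target, $copies)
{
    $out = fopen($target, 'wb');
    if ($out === false) {
        throw new RuntimeException("Cannot open $target for writing");
    }

    $written = 0;
    for ($i = 0; $i < $copies; $i++) {
        $in = fopen($source, 'rb');
        if ($in === false) {
            fclose($out);
            throw new RuntimeException("Cannot open $source for reading");
        }
        // stream_copy_to_stream() copies in chunks, so memory use stays small
        $written += stream_copy_to_stream($in, $out);
        fclose($in);
    }

    fclose($out);
    return $written;   // total number of bytes written
}
```

For example, repeat\_file ( 'myfile.txt', 'verybigfile.rtf', 1000 ) would produce a file 1000 times the size of myfile.txt.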
Most of the examples test the SearchableFile functions and try to compare their timing with the same method using in-memory data (load the file contents using file\_get\_contents(), then use PHP builtin functions to achieve the same goal).
For this reason, you should ensure that :
Your php.ini memory\_limit setting has been set to a sufficient value. For example :
memory_limit = 1024M
The PCRE functions provided by the SearchableFile class are pcre\_match() and pcre\_match\_all() ; there is no magic in them : they simply rely on an external command, pcregrep, which is not installed by default on standard Linux distributions.
To install it :
On Debian distributions, run the following command :
apt-get install pcregrep
On CentOS distributions, it seems to be :
yum install pcre # not tested !
If the pcregrep command is not available on your system, then you will not be able to use the pcre functions.
The following sections describe the SearchableFile methods and properties.
$sf = new SearchableFile ( $block_size = 16384, $cache_size = 8 ) ;
Creates a searchable file object.
Since the class uses direct IO to access chunks of data, an optional block size can be specified (the default is 16 KB). Note that, at least on Windows systems, ideal block sizes range between 16 and 64 KB ; performance seems to degrade below or above that range.
The $cache_size parameter indicates how many file records should be kept in cache for later retrieval. This is a naive LRU cache.
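To illustrate what a naive LRU cache of file records can look like, here is a minimal sketch. The BlockCache class and its get()/put() methods are hypothetical names chosen for this example ; this is not the code used inside SearchableFile, only one simple way to implement the idea, relying on the fact that PHP arrays preserve insertion order :

```php
<?php
// Hypothetical illustration of a naive LRU cache keyed by record number.
// This is NOT the SearchableFile implementation.
class BlockCache
{
    private $capacity;
    private $blocks = [];   // record number => data, least recently used first

    public function __construct($capacity = 8)
    {
        $this->capacity = $capacity;
    }

    // Returns the cached record, or null on a cache miss
    public function get($recordNumber)
    {
        if (!array_key_exists($recordNumber, $this->blocks)) {
            return null;
        }
        $data = $this->blocks[$recordNumber];
        // Remove and re-add the entry so that it becomes the most recent one
        unset($this->blocks[$recordNumber]);
        $this->blocks[$recordNumber] = $data;
        return $data;
    }

    // Stores a record, evicting the least recently used one if needed
    public function put($recordNumber, $data)
    {
        unset($this->blocks[$recordNumber]);
        $this->blocks[$recordNumber] = $data;
        if (count($this->blocks) > $this->capacity) {
            reset($this->blocks);
            unset($this->blocks[key($this->blocks)]);
        }
    }
}
```

With such a structure, a larger cache size trades memory for fewer disk reads when the same records are accessed repeatedly.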
The destructor of the SearchableFile class closes the file, if it is still open.
$sf -> Close ( ) ;
Closes the searchable file, if it was opened.
No exception is thrown if the file was not open.
$sf -> Open ( $filename ) ;
Opens the specified file. Throws an exception if the file was already opened or could not be opened.
$pos = $sf -> multistrpos ( $searched_strings, $offset = 0, &$found_index = null, &$found_string = null ) ;
$pos = $sf -> multistripos ( $searched_strings, $offset = 0, &$found_index = null, &$found_string = null ) ;
These functions behave like the PHP standard strpos() and stripos() functions, but search for several strings at once and find the first occurrence of any string from the set of searched strings.
The parameters are the following :
Returns either the byte offset of the first occurrence of one of the strings in the $searched\_strings array, or false if none of them was found in the file.
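For comparison, the following sketch shows an in-memory equivalent of this kind of multi-string search. The in\_memory\_multistrpos() function is a hypothetical name used for illustration only ; it mirrors the parameter list shown above, and the way ties are resolved (the first string in array order wins) is an assumption, not documented behavior of the class :

```php
<?php
// Hypothetical in-memory counterpart of multistrpos(), for illustration only.
// Returns the lowest offset at which any of the $needles occurs, or false.
function in_memory_multistrpos($haystack, array $needles, $offset = 0,
                               &$foundIndex = null, &$foundString = null)
{
    $best = false;

    foreach ($needles as $index => $needle) {
        $pos = strpos($haystack, $needle, $offset);

        // Keep the occurrence that is closest to the start of the data
        if ($pos !== false && ($best === false || $pos < $best)) {
            $best        = $pos;
            $foundIndex  = $index;
            $foundString = $needle;
        }
    }

    return $best;
}
```

This only works when the whole contents fit into memory, which is precisely the limitation that the SearchableFile methods are meant to lift.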
$status = $sf -> pcre_match ( $pattern, &$matches = null, $flags = 0, $start_offset = 0 ) ;
pcre\_match() tries to behave like the builtin preg\_match() function, but operates on a file rather than in memory.
To achieve that, it runs the pcregrep Linux command with the --file-offsets option to extract match offsets.
The meaning of the parameters is the following :
The function returns false if some error occurred (the starting offset is beyond the end of the file, or the search pattern is incorrect) ; otherwise the number of matches is returned (0 or 1).
Notes :
$status = $sf -> pcre_match_all ( $pattern, &$matches = null, $flags = 0, $start_offset = 0 ) ;
pcre\_match\_all() tries to behave like preg\_match\_all(), but operates on a file rather than in memory. To achieve that, it runs the pcregrep Linux command with the --file-offsets option to extract match offsets.
It returns false if some error occurred (the starting offset is beyond the end of the file, or the search pattern is incorrect, or an individual preg\_match() on one of the sub-results failed for some reason) ; otherwise the number of matches is returned.
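When run with --file-offsets on a single file, pcregrep prints one line per match, made of the offset from the start of the file and the match length, separated by a comma. The sketch below shows how such output could be parsed ; parse\_file\_offsets() is a hypothetical helper written for this example and is not taken from the class source :

```php
<?php
// Hypothetical helper : parses "offset,length" lines such as the ones
// printed by "pcregrep --file-offsets" for a single input file.
function parse_file_offsets($output)
{
    $matches = [];

    foreach (preg_split('/\R/', trim($output)) as $line) {
        // Ignore anything that does not look like "digits,digits"
        if (preg_match('/^(\d+),(\d+)$/', $line, $m)) {
            $matches[] = ['offset' => (int) $m[1], 'length' => (int) $m[2]];
        }
    }

    return $matches;
}
```

Each offset/length pair can then be used to read the matched text from the file and build a result array similar to the one returned by the builtin functions.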
$pos = $sf -> strchr ( $cset, $offset = 0 ) ;
Finds the offset of the first character belonging to $cset.
Returns either the byte offset of the first character found belonging to $cset, or false if no more characters from $cset are present in the file.
The $offset parameter indicates where the search is to be started.
Unlike the PHP strchr() function, which returns the substring starting at the searched character or string, this method behaves much more like the C strchr() function, which returns a pointer to the found character : it returns the offset in the file of the first character found.
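For data that fits into memory, the same "offset of the first character belonging to a set" result can be obtained with the builtin strcspn() function. The in\_memory\_strchr() function below is a hypothetical name used only to illustrate the expected return values :

```php
<?php
// Hypothetical in-memory counterpart of the strchr() method.
// Returns the offset of the first character of $haystack (starting at
// $offset) that belongs to $cset, or false if there is none.
function in_memory_strchr($haystack, $cset, $offset = 0)
{
    $length = strlen($haystack);
    if ($offset >= $length) {
        return false;
    }

    // strcspn() counts the leading characters that are NOT in $cset
    $pos = $offset + strcspn($haystack, $cset, $offset);

    return ($pos < $length) ? $pos : false;
}
```

For instance, in\_memory\_strchr ( 'abc{def}', "\\{}" ) returns 3, the offset of the opening brace.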
$pos = $sf -> strpos ( $searched_string, $offset = 0 ) ;
$pos = $sf -> stripos ( $searched_string, $offset = 0 ) ;
Behave like the PHP standard strpos() and stripos() functions. The parameters are the following :
Returns either the byte offset of a found occurrence of $searched_string, or false if the string was not found in the file.
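The main difficulty when searching a file block by block is that the searched string may straddle two blocks. The sketch below shows one common way to handle this, by carrying the last bytes of each block over to the next one. The chunked\_strpos() function is hypothetical and only illustrates the general technique ; it is not the algorithm used by the class, which also relies on its record cache :

```php
<?php
// Hypothetical block-wise strpos() on a file, for illustration only.
function chunked_strpos($filename, $needle, $offset = 0, $blockSize = 16384)
{
    $needleLength = strlen($needle);
    if ($needleLength === 0) {
        return false;
    }

    $fp = fopen($filename, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $filename");
    }
    fseek($fp, $offset);

    $base   = $offset;   // file offset of the first byte of $buffer
    $buffer = '';
    $keep   = $needleLength - 1;   // bytes carried over between blocks

    while (!feof($fp)) {
        $chunk = fread($fp, $blockSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer .= $chunk;

        $pos = strpos($buffer, $needle);
        if ($pos !== false) {
            fclose($fp);
            return $base + $pos;
        }

        // Keep only the tail that could still start a match
        if (strlen($buffer) > $keep) {
            $base  += strlen($buffer) - $keep;
            $buffer = ($keep > 0) ? substr($buffer, -$keep) : '';
        }
    }

    fclose($fp);
    return false;
}
```

Memory usage stays bounded by the block size plus the length of the searched string, whatever the file size is.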
$text = $sf -> substr ( $start, $length ) ;
Extracts a substring from the searchable file. The $start and $length parameters have the same meaning as for the PHP builtin substr() function.
Returns the specified substring or false if one of the following conditions occurs :
An empty string is returned if $length has been specified and is zero (i.e., 0, false or null).
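As a point of comparison, a slice of a file can also be read without loading the whole file by passing an offset and a length to the builtin file\_get\_contents() function. The file\_slice() wrapper below is a hypothetical helper for this example ; unlike the SearchableFile substr() method, it reopens the file on every call and does not use any cache :

```php
<?php
// Hypothetical helper : reads $length bytes starting at byte $start,
// without loading the rest of the file into memory.
function file_slice($filename, $start, $length)
{
    if ($start < 0 || $length < 0) {
        return false;
    }

    // The 4th and 5th parameters of file_get_contents() are offset and length
    return file_get_contents($filename, false, null, $start, $length);
}
```

Calling file\_slice ( 'myfile.txt', 10000, 20 ) is comparable to the substr ( 10000, 20 ) call shown in the introduction.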
$sf -> Write ( $output, $start, $end = false ) ;
When processing large files, you sometimes need to write (copy) unmodified data from the input file to some output file. This is the purpose of the Write method, which takes the following parameters :
output : Either a file resource, or a callback function which has the following signature :
function callback ( $data_to_write ) ;
The Write method ensures that $data\_to\_write never exceeds the record size specified when calling the SearchableFile class constructor. It can be smaller than the record size when the offset specified by the $start parameter does not fall on a record boundary, but somewhere inside a record.
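The chunking behavior described above can be sketched as follows, assuming PHP 7 or later. The copy\_range() function is a hypothetical stand-in for the Write method, and it assumes that $end is the offset of the first byte that is NOT copied ; the real method may interpret $end differently :

```php
<?php
// Hypothetical sketch of record-aligned copying, for illustration only.
// Passes the bytes in [$start, $end) to $callback, never crossing a
// record boundary within a single call.
function copy_range($filename, callable $callback, $start, $end, $recordSize = 16384)
{
    $fp = fopen($filename, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $filename");
    }
    fseek($fp, $start);

    $pos = $start;
    while ($pos < $end) {
        // Stop at the next record boundary, or at $end if it comes first
        $boundary = (intdiv($pos, $recordSize) + 1) * $recordSize;
        $length   = min($boundary, $end) - $pos;

        $data = fread($fp, $length);
        if ($data === false || $data === '') {
            break;   // end of file reached before $end
        }

        $callback($data);
        $pos += strlen($data);
    }

    fclose($fp);
}
```

With a record size of 4 and a range starting at offset 2, the first chunk passed to the callback only contains 2 bytes, and the following ones are aligned on record boundaries.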
Gets the underlying filename.
Gets/sets the read buffer size.
Note that if the record size is modified, it will only take effect on the next call to the Open() method.
Gets the underlying file size.
Files (8) |
File | Role | Description | ||
---|---|---|---|---|
examples (4 files) | ||||
LICENSE | Lic. | License text | ||
NOTICE | Data | Auxiliary data | ||
README.md | Doc. | Documentation | ||
SearchableFile.phpclass | Class | Class source |
Files (8) | / | examples |
File | Role | Description |
---|---|---|
multistrpos.php | Example | Example script |
preg_match.php | Example | Example script |
preg_match_all.php | Example | Example script |
strpos.php | Example | Example script |