a_bertrand Posted January 4, 2011 Share Posted January 4, 2011 Problem: There is some cases when MySQL or any pre-made database are simply too slow for what you need to do and at the same time you don't need all the features they offer like you don't need complex queries or deal with different data types. In those cases you can still try to code yourself your own storage engine which is normally not as hard as you may think at first. First of all you must think what you really need. What will you do with your data? Example 1: Create the database once and never edit it again. During the creation of the database you need to be as fast as possible. Example 2: Speed of storage is less important than allowing to add data later on. If you are with the first case, then you should store the data in a flat file and create an index to retrieve it once the storage is finished. In the other case it is much better to store directly with a way to retrieve the data quickly. In my case I choose the option 2. Now how to create an index for the example 1 (as well as it could be used for example 2) ? A good solution would be to create a binary tree with the "key" as node data and position in the flat file. A binary tree is basically a structure where you have some data for the node, and 2 pointers (left, right). If the new value is bigger than the existing one you go to the node right otherwise on the node left. If the node is empty then you add you data there and create the 2 empty pointers. Advantage of this solution over a sorted index is that you NEVER need to have all the keys in memory and therefore can deal with extremely huge data sets. Here you are with a C# code (which should be readable by anyone): static void AddNode(BinaryWriter writer, BinaryReader reader, string key, string data) { // First node? if (reader.BaseStream.Length == 0) { writer.BaseStream.Seek(0, SeekOrigin.Begin); writer.Write(key); writer.Write(0L); // A writer.Write(0L); // B writer.Write(data); return; } // Search a free node in the tree long nodePos = 0; while (true) { reader.BaseStream.Seek(nodePos, SeekOrigin.Begin); string ns = reader.GetString(); long pos = reader.BaseStream.Position; long a = reader.ReadInt64(); long b = reader.ReadInt64(); if (key > ns) // Go to A { if (a == 0) { writer.BaseStream.Seek(pos, SeekOrigin.Begin); break; } else nodePos = a; } else // Go to B { if (b == 0) { writer.BaseStream.Seek(pos + 8, SeekOrigin.Begin); break; } else nodePos = b; } } // Add now the node pointer... writer.Write(reader.BaseStream.Length); // Add the new node at the end writer.BaseStream.Seek(0, SeekOrigin.End); writer.Write(key); writer.Write(0L); // A writer.Write(0L); // B writer.Write(data); } Binary trees offer quick read / write access normally... beside if you go in the worse case scenario where you create your tree with sorted data in which case the tree sadly de-generate in a list and therefore you have slow access. To solve this you have multiple solutions which could be more or less complex. One of the solution is called red-black trees which basically tries to balance the tree all the time. Honestly I find it a bit complex and not really adequate for a file based solution. My solution is to make a quad tree (4 branches) instead of 2. I use the key and the CRC of the key to find where it goes (left left, left, right, right right): static void AddNode(BinaryWriter writer, BinaryReader reader, string key, string data) { byte crc = 0; for (int i = 0; i < key.Length; i++) crc = (byte)(crc ^ key[i]); // First node? if (reader.BaseStream.Length == 0) { writer.BaseStream.Seek(0, SeekOrigin.Begin); writer.Write(key); writer.Write(crc); writer.Write(0L); // A writer.Write(0L); // B writer.Write(0L); // C writer.Write(0L); // D writer.Write(data); return; } // Search a free node in the tree long nodePos = 0; while (true) { reader.BaseStream.Seek(nodePos, SeekOrigin.Begin); string ns = reader.GetString(); byte cs = reader.GetByte(); long pos = reader.BaseStream.Position; long a = reader.ReadInt64(); long b = reader.ReadInt64(); long c = reader.ReadInt64(); long d = reader.ReadInt64(); if (key > ns && crc >= cs) // Go to A { if (a == 0) { writer.BaseStream.Seek(pos, SeekOrigin.Begin); break; } else nodePos = a; } else if(key > ns && crc < cs) // Go to B { if (b == 0) { writer.BaseStream.Seek(pos + 8, SeekOrigin.Begin); break; } else nodePos = b; } else if(key < ns && crc >= cs) // Go to C { if (c == 0) { writer.BaseStream.Seek(pos + 8 * 2, SeekOrigin.Begin); break; } else nodePos = c; } else // Go to D { if (d == 0) { writer.BaseStream.Seek(pos + 8 * 3, SeekOrigin.Begin); break; } else nodePos = d; } } // Add now the node pointer... writer.Write(reader.BaseStream.Length); // Add the new node at the end writer.BaseStream.Seek(0, SeekOrigin.End); writer.Write(key); writer.Write(crc); writer.Write(0L); // A writer.Write(0L); // B writer.Write(0L); // C writer.Write(0L); // D writer.Write(data); } Of course a quad tree takes more space (in memory or disk) but should avoid the degeneration of a binary tree and should offer yet more speed to it. Retrieving data from it is not much more complex: static string GetData(BinaryReader reader, string key) { byte crc = 0; for (int i = 0; i < key.Length; i++) crc = (byte)(crc ^ key[i]); long nodePos = 0; while (true) { reader.BaseStream.Seek(nodePos, SeekOrigin.Begin); string ns = reader.GetString(); byte cs = reader.GetByte(); long a = reader.ReadInt64(); long b = reader.ReadInt64(); long c = reader.ReadInt64(); long d = reader.ReadInt64(); if(ns == key) return reader.GetString(); if (key > ns && crc >= cs) // Go to A nodePos = a; else if(key > ns && crc < cs) // Go to B nodePos = b; else if(key < ns && crc >= cs) // Go to C nodePos = c; else // Go to D nodePos = d; if(nodePos == 0) // Found nothing... quit... return null; } } Now about performances: Width a flat file (binary or not) I get about 200'000 entries inserted per sec (without counting the index creation at the end). With a quad tree I get about 20'000 entries inserted per sec. With MySQL MyISAM (which is faster than InnoDB in my case) I get about 300 entries inserted per sec. Now to speed things up a little bit more, you could split the files based on the start of the key, if we use a string (lowercase) as key as example we could use the first character to create 26 different files data_a up to data_z, that will speed up to 26x the storage as well as the retrieval. If you use the first 2 characters you will get 676 files with a speed up to 676x faster. In my case, retrieval of items within a data set of 3.3 mil items: 00:00:00.0002379 So it is certainly not slower than MySQL ;) Conclusion: A general purpose database like MySQL or Oracle or any other do have a lot of advantages like flexibility and powerful tools, yet is not optimized for every single cases. In some cases you must create your own storage engine to reach the performances needed by your application and this is certainly possible. Quote Link to comment Share on other sites More sharing options...
Danny696 Posted January 4, 2011 Share Posted January 4, 2011 Thanks for this alain, many people here wont know what to do with it now how to use it, me included, but thanks for the post, intresting to see the time difference... Quote Link to comment Share on other sites More sharing options...
a_bertrand Posted January 4, 2011 Author Share Posted January 4, 2011 Well the same thing could be without any troubles ported on PHP. If you want I can make the reader in PHP just to show how to do it. Quote Link to comment Share on other sites More sharing options...
Danny696 Posted January 4, 2011 Share Posted January 4, 2011 If you wouldnt mind, i would like to see that.... Quote Link to comment Share on other sites More sharing options...
a_bertrand Posted January 5, 2011 Author Share Posted January 5, 2011 Here you are, should work as reader. With this you can certainly port the writer as well, isn't it? function readInt($file) { $val=0; for($i=0;$i < 4;$i++) $val=($val<<8)|ord(fgetc($file)); return $val; } function GetData($key) { $crc = 0; for ($i = 0; i < strlen($key); $i++) $crc = $crc ^ ord($key[i]); $file=fopen("data.db","rb"); $nodePos = 0; while (true) { fseek($file,$nodePos,SEEK_SET); $ns=fgets($file); $cs=ord(fgetc($file)); $a=readInt($file); $b=readInt($file); $c=readInt($file); $d=readInt($file); if($ns == $key) return fgets($file); // return the line of data if ($key > $ns && $crc >= $cs) // Go to A $nodePos = $a; else if($key > $ns && $crc < $cs) // Go to B $nodePos = $b; else if($key < $ns && $crc >= $cs) // Go to C $nodePos = $c; else // Go to D $nodePos = $d; if($nodePos == 0) // Found nothing... quit... return ""; } } btw you could use the pack / unpack functions to store / retreive binary numbers (http://ch2.php.net/pack) Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.