There are a big range of databases. From the Metallic tape systems the FBI used to the fancy Oracle or IBM towers. The goal is simple: Store your information. But the subtleties are complicated.
Do you want writing to be fast? Do you want reading to be fast? What do you want to read? How do you figure out what you want to read? -- These questions have a variety of answers depending on who they're built for.
There are some interesting examples of databases that come to mind. Google and Facebook are the unspoken kings of the field. But unfortunately for us outsiders, we don't know all of their insights into the matter. But we can piece together a thing or two.
Cassandra.
What's up with Cassandra? On the surface, it's a really cool, scalable, database solution. It supports a lot of fascinating stuff.
But! Facebook, after developing Cassandra (http://cassandra.apache.org/), dropped it entirely. Strange... it looks like Facebook stumbled upon something much better. Better enough to through out what they'd invested on for so long and start over completely.
And Google? They built their own thing too. A really pretty thing (from the likes of it that I manage to overhear). Assuredly, MapReduce was a beautiful thing and Google was a huge proponent of it. Just taking into account that they can take a single query and search the whole internet in sub-second timespans, we can appreciate how effective they are. Anyway, the fact that they threw everything out and built their own is a testament to the horrid condition of existing database solutions.
Combine that with Facebook's jumping around from database to database and, eventually, building their own and then building another, has got to be a wake up call to everyone that our databases are just shit.
I have felt this way for a long time. Every coworker, every employer along the way of my winding career has universally been against me on the concept. The more I push the issue, the more I've gotten large groups of religious-esk programmers ganging up on me using social force instead of logic and persuasion.
To begin my attack on this age-old issue, I've started with the index concept. One of the goals of my database-to-be is quick data lookup. Indeed that is one of the more challenging of the concepts involved. Anyone can write something where data writing is quick, it's data lookup where all the drama is.
Okay, so you got an index. The first question, which seems obvious from a certain perspective, is why don't indices have filters? Filters are just a ridiculously obvious optimization. Let's look at MySql's index syntax and see if they have a filter parameter:
CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_nameNo filter parameter. What happens when we want to build an index based on some condition of the row?
[index_type]
ON tbl_name (index_col_name,...)
[index_type]
index_col_name:
col_name [(length)] [ASC | DESC]
index_type:
USING {BTREE | HASH}
Let's presume you have a table of this sort:
"gender" => string
"income" => int
Now let's say you want to build an index for all males sorted by their income. You can't! That may not seem so bad when gender has just two values, but what about something more complex:
"unix_creation_timestamp" => int
"user_id" => int
Let's say we want to build an index for just a single given day sorted by user_id. Again, we can't. But why can't we?
Turns out, this is a very common question in many applications. But, this optimization isn't enough. In the gender case, we're only going from log(n) time to log(n/2) time, which isn't much to celebrate. In the user creation table, if w = number of weeks in the range, than we've just gone from log(n) to log(n/w). While that's better for a large number of weeks, it's not really enough. We're onto something much bigger here.
What if, what we really care about is not a list of new user ids during the given week but just the count. Suddenly, our needs go way down. All we need to store is an int for each given week. When we add a given user row, we adjust just a single integer (increment it by one) for the given week it's being created inside of.
That's an O(1) burden on insertion but optimizes our query from O(log(n)) to O(1).
Now we're talking seriously. This is how queries should be operating. And they can, with the right kind of database. It just requires a little thinking of your queries in advance of needing them and approaching the database problem from a sideways perspective. Don't tackle it head on because your presumptions will get in the way, you've got to look for different angles to think about this stuff.




