Distributed join


To join data together you need the matching data to be colocated - this idea is to colocate a subset of the joined data on each node so that the overall dataset can be split up.


If you've got lots of data - more than can fit in memory - you need a way to split the data up between machines. This idea is to prejoin data at insert time and distribute the records that join together to the same machine.

This way the overall dataset can be split between machines, but the join can still be executed on every node and the results aggregated to produce the total joined data.

In more detail, we decide which machine stores a record by consistent hashing on its join keys, so the matching records always get stored on the same machine.
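A minimal sketch of this placement rule (Python; the node list, key format, and simple hash-mod placement are illustrative assumptions standing in for a real consistent-hash ring):

    import hashlib

    NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster members

    def placement_node(join_key_value: str, nodes=NODES) -> str:
        """Map a join-key value to the node that stores it.

        Records from different tables that share the same join-key value
        hash to the same node, so the join can run locally on that node.
        A real system would use a consistent-hash ring so that adding or
        removing nodes only moves a small fraction of the keys.
        """
        digest = hashlib.sha256(join_key_value.encode("utf-8")).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    # A people row with id=42 and an items row referencing people=42 are
    # routed by the same key value, so they land on the same node.
    assert placement_node("42") == placement_node("42")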

chronological,



Is this not what a distributed NFT does? We are aggregating a project from many accounts. I like to call it composition. I do think composition is the problem to solve, not just for storage concerns, but to manage complexity.




skihappy,

I don't know how NFTs store ownership in the blockchain, but essentially this idea is about a mid-level implementation detail of a distributed SQL database storage engine.

I find the idea of a distributed P2P database very useful. I have a project where I implement a very simple SQL database, and it supports distributed joins by distributing join keys to every node in the cluster. In this idea I take a different approach: I hash the join key and use it to decide node placement.

So everybody in the cluster gets a subset of the data, and you need everybody to be online to do a query.


// prejoin data at insert time //

I think it's a reasonable idea, but how would this be done? There are many possible joins. In fact, suppose that the number of tables in the database is %%n%%; then the number of all table pairs that may need to be joined is the number of %%k=2%% subsets:

$${\binom {n}{2}} = {\frac {n!}{2!(n-2)!}} = {\frac {n^2-n}{2}} $$

For example, if the database has 15 tables, this number is 105, and if there are 42 tables, it is 861. Add the possibility that you need to do joins on different fields -- and the number of pre-computed joins may be even higher. Still, it seems reasonable to do it at insert time, as the joins would change and need to be recomputed or modified on every insert.
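As a quick check of the arithmetic (Python, just evaluating the formula above):

    from math import comb, factorial

    n = 15
    # Number of distinct table pairs: C(n, 2) = n! / (2! (n-2)!) = (n^2 - n) / 2
    assert comb(n, 2) == factorial(n) // (2 * factorial(n - 2)) == (n * n - n) // 2
    print(comb(15, 2))   # 105
    print(comb(42, 2))   # 861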


In my SQL database I introduced a statement called create join.

You tell the database ahead of time which fields are joinable.

create join inner join people on people.id = items.people inner join products on items.search = products.name

The database then checks on every insert and does the associated consistent hashing and node placement.
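A rough sketch of what that insert-time routing could look like (Python; the table-to-key mapping and node list are made-up names mirroring the create join above, not the actual implementation):

    import hashlib

    # Mirrors: create join inner join people on people.id = items.people
    #          inner join products on items.search = products.name
    # For each table, the column whose value drives node placement.
    JOIN_KEY_COLUMN = {
        "people":   "id",
        "items":    "people",  # items.people references people.id
        "products": "name",    # products.name is matched by items.search
    }

    NODES = ["node-0", "node-1", "node-2"]

    def route_insert(table: str, row: dict) -> str:
        """Pick the storage node for a new row from its join-key column."""
        key = str(row[JOIN_KEY_COLUMN[table]])
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    # A person and the items that reference that person share a key value,
    # so they are placed on the same node.  (items also joins products on a
    # second column - the multi-join case discussed further down.)
    print(route_insert("people", {"id": 7, "name": "Ada"}))
    print(route_insert("items",  {"people": 7, "search": "lamp"}))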



    : Mindey
    :  -- 
    :  -- 
    

chronological,

I should point out that some data may have to be moved once the create join statement has run.

If a join existed on products.id = search.product_id, then at insert time, when a product was inserted, a query for matching searches would run: select id from search where product_id = X.

search.product_id depends on products.id; they have the same value. The consistent hash key for products can be its id. The consistent hash key for search can ignore its own id and use product_id instead. This would distribute the data to the same machine because the hashes are identical.

If there are multiple joins, this scheme might need to be more complicated. I think the fields can be concatenated and hashed, as in the sketch below.
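One possible shape for that (Python; the separator and the rule that every table in the join group builds the key the same way are assumptions, not something the original database necessarily does):

    import hashlib

    NODES = ["node-0", "node-1", "node-2"]

    def placement(*join_key_values) -> str:
        """Hash one or more join-key values to a node.

        With a single value this is the products.id / search.product_id
        case: both sides hash the same value and land together.  With
        several values (a row taking part in more than one join) the
        values are concatenated with a separator before hashing; every
        table in the join group must build the key in the same order.
        """
        material = "\x1f".join(str(v) for v in join_key_values)
        digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    # A products row with id=X and a search row with product_id=X colocate.
    assert placement("X") == placement("X")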




chronological,

Providing a search engine for the internet is outrageously expensive because the data is so large and doesn't fit in memory. We could use this approach to split up the storage requirements of doing search.

If everybody hosts a fraction of the search index, then every query goes to everybody before the results are returned.

If there are 1000 storage nodes, then every search query produces 1000 subqueries, one to every storage node. Each node returns what it knows about.
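A scatter-gather sketch of that query path (Python; the node endpoints and per-node query function are placeholders for whatever RPC the storage nodes would expose):

    from concurrent.futures import ThreadPoolExecutor

    STORAGE_NODES = [f"node-{i}" for i in range(1000)]  # hypothetical endpoints

    def query_node(node: str, query: str) -> list:
        """Ask one node for the slice of the index it holds (stub)."""
        return []  # stand-in for an RPC/HTTP call returning partial hits

    def distributed_search(query: str) -> list:
        """Fan the query out to every storage node, then merge the results."""
        with ThreadPoolExecutor(max_workers=64) as pool:
            partials = pool.map(lambda node: query_node(node, query), STORAGE_NODES)
        return [hit for partial in partials for hit in partial]

    results = distributed_search("distributed joins")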




chronological,