metastrings - a CSV-driven NoSQL Database by Michael Sydney Balloni

emailx45

Local
Registered: 5 May 2008
Messages: 3,571
Reactions: 2,438
Credits: 573
metastrings - a CSV-driven NoSQL Database
Michael Sydney Balloni - 26/May/2020
SQL schema and C# code for implementing a NoSQL database on top of MySQL
Database implementation and source code snippets, including a CSV appendix, tell the story of building a functional NoSQL database.

sources: https://dumpz.ws/resources/metastrings-a-csv-driven-nosql-database.113/

Introduction
I set out to create a CSV-driven NoSQL database powered by a SQL database with a completely normalized schema.

Values
All unique values in the database are represented in one table:
Code:
CREATE TABLE `bvalues` (
  `id` int NOT NULL AUTO_INCREMENT,
  `isNumeric` bit(1) NOT NULL,
  `numberValue` double DEFAULT NULL,
  `prefix` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `stringValue` text CHARACTER SET utf8 COLLATE utf8_bin,
  PRIMARY KEY (`id`),
  KEY `idx_bvalues_prefix` (`prefix`,`isNumeric`,`id`),
  KEY `idx_bvalues_number` (`numberValue`,`isNumeric`,`id`),
  FULLTEXT KEY `idx_bvalues_prefix_stringValue` (`prefix`,`stringValue`)
)

Each value is either a floating-point number or a string. If it's a string, it can be up to 64K characters long, UTF-8 permitting.

The value "knows" whether it's a number or a string via the isNumeric field.

An improvement over a previous incarnation of this project is that small strings (<= 255 characters) are not stored in the large text field at all; they live entirely in the indexed prefix column. The full-text index covers both the varchar and the text columns, so you get MATCH querying against something that can be small or large, and you only pay for the storage you use. Where the full string value is wanted, you'll see IFNULL(stringValue, prefix) in the queries.
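To make the storage rule concrete, here is a small Python sketch (function and field names are mine, not part of the project) of how a string value would be split between prefix and stringValue, and read back with the IFNULL(stringValue, prefix) rule:

```python
# Sketch of the bvalues storage rule: strings of <= 255 characters live
# entirely in the indexed prefix column; longer strings keep their first
# 255 characters in prefix (for indexing) and the full text in stringValue.
def store_string(s):
    if len(s) <= 255:
        return {"prefix": s, "stringValue": None}
    return {"prefix": s[:255], "stringValue": s}

def load_string(row):
    # Equivalent of IFNULL(stringValue, prefix)
    return row["stringValue"] if row["stringValue"] is not None else row["prefix"]
```

Short strings cost only the indexed varchar; long strings additionally pay for the text column.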

There are stored procedures to add / get IDs of values in the system:
Code:
-- For working with numbers
CREATE DEFINER=`root`@`localhost` PROCEDURE `GetNumberId`(IN numberValueParam DOUBLE)
BEGIN
DECLARE valueId BIGINT DEFAULT NULL;
SELECT id INTO valueId 
FROM bvalues 
FORCE INDEX (idx_bvalues_number)
WHERE isNumeric = 1 AND numberValue = numberValueParam;
IF valueId IS NULL THEN
    INSERT bvalues (isNumeric, numberValue) VALUES (1, numberValueParam);
    SELECT LAST_INSERT_ID() INTO valueId LIMIT 1;
END IF;
SELECT valueId;
END

-- For working with short strings
CREATE DEFINER=`root`@`localhost`
PROCEDURE `GetShortStringId`(IN stringValueParam varchar(255))
BEGIN
DECLARE valueId BIGINT DEFAULT NULL;
SELECT id INTO valueId 
FROM bvalues 
FORCE INDEX (idx_bvalues_prefix)
WHERE isNumeric = 0 AND prefix = stringValueParam AND stringValue IS NULL;
IF valueId IS NULL THEN
    INSERT bvalues (isNumeric, prefix, stringValue) VALUES (0, stringValueParam, NULL);
    SELECT LAST_INSERT_ID() INTO valueId LIMIT 1;
END IF;
SELECT valueId;
END

-- For working with long strings
CREATE DEFINER=`root`@`localhost` PROCEDURE `GetLongStringId`(IN stringValueParam TEXT)
BEGIN
DECLARE valueId BIGINT DEFAULT NULL;
DECLARE strPrefix VARCHAR(255) DEFAULT SUBSTRING(stringValueParam, 1, 255);
SELECT id INTO valueId FROM bvalues
WHERE isNumeric = 0 AND prefix = strPrefix AND stringValue = stringValueParam;
IF valueId IS NULL THEN
    INSERT bvalues (isNumeric, prefix, stringValue) VALUES (0, strPrefix, stringValueParam);
    SELECT LAST_INSERT_ID() INTO valueId LIMIT 1;
END IF;
SELECT valueId;
END

Notice that I had to add "LIMIT 1" to the "SELECT LAST_INSERT_ID()" calls; I'm not sure why, but others have run into the same need, and it was needed here.
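All three procedures share one get-or-create shape: look the value up, insert it if absent, and return the ID either way. A minimal in-memory Python sketch of that pattern (my naming, purely illustrative):

```python
class ValueStore:
    """Mimics the shape of GetNumberId/GetShortStringId/GetLongStringId:
    look the value up, insert it if missing, always return its id."""
    def __init__(self):
        self._ids = {}      # value -> id
        self._next_id = 1   # stands in for AUTO_INCREMENT

    def get_id(self, value):
        vid = self._ids.get(value)
        if vid is None:                 # the IF valueId IS NULL branch
            vid = self._next_id         # LAST_INSERT_ID() equivalent
            self._ids[value] = vid
            self._next_id += 1
        return vid
```

The same value always maps to the same ID, which is what makes the normalization below pay off.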

Types (Think Table Names)
All type names are stored in a types table:
Code:
CREATE TABLE `types` (
  `id` smallint NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `isNumeric` bit(1) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name_UNIQUE` (`name`)
)

Names
All names of things (think column names) are stored in a names table:
Code:
CREATE TABLE `names` (
  `id` smallint NOT NULL AUTO_INCREMENT,
  `typeid` smallint NOT NULL,
  `name` varchar(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `isNumeric` bit(1) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name_types` (`name`,`typeid`),
  KEY `fk_name_types_idx` (`typeid`),
  CONSTRAINT `fk_name_types` FOREIGN KEY (`typeid`)
  REFERENCES `types` (`id`) ON DELETE RESTRICT ON UPDATE RESTRICT
)

Each name is associated with one type, and is inherently numeric or not.

Between types and names, the "table" schema is defined.
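To see the indirection in action, here is a hedged Python sketch (structures and names are mine) of how a logical table such as files(ext, length) would be registered across the types and names tables:

```python
types = []  # rows of the types table: (id, name, isNumeric)
names = []  # rows of the names table: (id, typeid, name, isNumeric)

def define_type(type_name, is_numeric, columns):
    """Register a logical 'table': one types row plus one names row per column.
    columns is a dict of column name -> isNumeric flag."""
    type_id = len(types) + 1
    types.append((type_id, type_name, is_numeric))
    for col, col_numeric in columns.items():
        names.append((len(names) + 1, type_id, col, col_numeric))
    return type_id

file_type = define_type("files", is_numeric=False,
                        columns={"ext": False, "length": True})
```

No ALTER TABLE needed: adding a "column" is just another names row.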

That rounds out the name-value universe for describing things. But what are we describing?

Items (Think Rows)
Code:
CREATE TABLE `items` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `typeid` smallint NOT NULL,
  `valueid` int NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_items_valueid_typenameid` (`valueid`,`typeid`),
  KEY `fk_item_typeid_idx` (`typeid`),
  CONSTRAINT `fk_item_typeid` FOREIGN KEY (`typeid`) REFERENCES `types` (`id`),
  CONSTRAINT `fk_item_valueid` FOREIGN KEY (`valueid`)
  REFERENCES `bvalues` (`id`) ON DELETE RESTRICT ON UPDATE RESTRICT
)

Each item has a type and an inherent value: think of a single-column primary key, like the path in a file system.

There is a stored procedure to add / get ID of items:
Code:
CREATE DEFINER=`root`@`localhost`
PROCEDURE `GetItemId`(IN typeIdParam INT, IN valueIdParam INT, IN noCreate BOOLEAN)
BEGIN
DECLARE itemId BIGINT DEFAULT NULL;
SELECT id INTO itemId 
FROM items 
WHERE typeId = typeIdParam AND valueId = valueIdParam;
IF itemId IS NULL THEN
    IF noCreate THEN
        SELECT -1 INTO itemId;
    ELSE
        INSERT items (typeId, valueId) 
        VALUES (typeIdParam, valueIdParam);
        SELECT LAST_INSERT_ID() INTO itemId LIMIT 1;
    END IF;
END IF;
SELECT itemId;
END

Item Name-Values
We need to associate the rows with their fields:
Code:
CREATE TABLE `itemnamevalues` (
  `itemid` bigint NOT NULL,
  `nameid` smallint NOT NULL,
  `valueid` bigint NOT NULL,
  PRIMARY KEY (`itemid`,`nameid`),
  KEY `fk_inv_name_ids_idx` (`nameid`),
  KEY `fk_inv_value_ids_idx` (`valueid`),
  CONSTRAINT `fk_inv_items_ids` FOREIGN KEY (`itemid`)
  REFERENCES `items` (`id`) ON DELETE CASCADE ON UPDATE RESTRICT,
  CONSTRAINT `fk_inv_name_ids` FOREIGN KEY (`nameid`) REFERENCES `names` (`id`),
  CONSTRAINT `fk_inv_value_ids` FOREIGN KEY (`valueid`) REFERENCES `bvalues` (`id`)
)

No need for an auto-number on this one. Gotta save space somewhere, yeah?

The ItemValues View
It all comes together in a view that makes writing SQL to query this schema practical:
Code:
CREATE 
    ALGORITHM = UNDEFINED 
    DEFINER = `root`@`localhost` 
    SQL SECURITY DEFINER
VIEW `metastrings`.`itemvalues` AS
    SELECT 
        `inv`.`itemid` AS `itemid`,
        `inv`.`nameid` AS `nameid`,
        `v`.`isNumeric` AS `isNumeric`,
        `v`.`numberValue` AS `numberValue`,
        `v`.`prefix` AS `prefix`,
        `v`.`stringValue` AS `stringValue`
    FROM
        (`metastrings`.`itemnamevalues` `inv`
        JOIN `metastrings`.`bvalues` `v` ON ((`v`.`id` = `inv`.`valueid`)))

The beauty of this schema is that you can query top-down (give me the data for these items) or bottom-up (get me the items that have these name+values). And it's normalized all to hell!
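Both query directions can be illustrated with plain dictionaries standing in for the itemnamevalues rows (an illustration of the idea, not the project's code):

```python
# item_nvs: itemid -> {name: value}, mirroring the itemvalues view,
# with name strings standing in for nameids.
item_nvs = {
    1: {"ext": "txt", "length": 10},
    2: {"ext": "txt", "length": 99},
    3: {"ext": "png", "length": 10},
}

def top_down(item_ids):
    """Top-down: give me the data for these items."""
    return {i: item_nvs[i] for i in item_ids}

def bottom_up(name, value):
    """Bottom-up: get me the items that have this name+value."""
    return sorted(i for i, nvs in item_nvs.items() if nvs.get(name) == value)
```

The real view does this with a join against bvalues, so both directions run off indexes.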


That's it for the SQL schema.

Software
There's a lot of code! Let's take a quick pass and hit the high points.

  1. client: the project for client applications to program against, plus CSV and utils...
  2. csv: mini-project for working with CSV, see appendix below
  3. lib: implementation of the database
  4. metafiles: proof-of-concept program for working with file metadata, comparing performance between a MySQL table and this metastrings NoSQL creation
  5. server: database server, thin wrapper around lib's Server class
  6. tests: unit test library
  7. utils: useful tools to keep handy
client

api: These Are the Things You Can Do...
  1. Define(CSV): Think UPSERT
  2. Get: given a type name and a list of "values", get back a list of name=>value dictionaries: type + List<object> => List<Dictionary<string, object>>
  3. QueryGet: given a SQL-like query, get results like Get
  4. GetSql: given a SQL-like query object, get the MySQL query needed to satisfy that query
  5. Delete: given a type name and a list of "primary keys", delete the matching items (rows)
  6. Drop: get rid of a "table"
  7. Reset: empty out the Items and ItemNameValues tables, and optionally the name-value stuff, too
  8. GetTiming: retrieve stats on how long things are taking
define - CSV-driven ASP.NET UPSERT
Begin starts the web request and opens the input stream

Call WriteRecordAsync a bunch, then call End

query - CSV-driven ASP.NET SELECT
Execute takes a SQL-like query and POSTs it to the server.

Call Read, and access FieldCount and the indexer to access the query results, until it says no more.

CSV Document Format
As part of this, I developed the sketch of a CSV document format for general-purpose data transfer. One line has "header" names for document metadata: a schema version number, a table name, and come what may. The next line has the values for that document metadata, currently the schema version and the "table" name being UPSERT'd. The next line has the column headers. The rest of the rows have the data.

For example...
Code:
version,table
1.1,TheTableName
Column1,Column2
foo,bar
blet,monkey

It is self-descriptive, you say the field names only once, and there's no punctuation or structure beyond commas and the rare double quotes. If what you're sending is flat and can be sent as text, I see no reason to pay the space and processing price of JSON.
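The size claim is easy to check with the standard library; the same flat rows cost noticeably more as JSON because the field names repeat on every row:

```python
import json

# The same two data rows from the example above, as flat dictionaries.
rows = [{"Column1": "foo", "Column2": "bar"},
        {"Column1": "blet", "Column2": "monkey"}]

# CSV-style payload: field names stated once, up front.
headers = list(rows[0])
csv_payload = "\n".join([",".join(headers)] +
                        [",".join(r[h] for h in headers) for r in rows])

# JSON payload: field names repeated in every row.
json_payload = json.dumps(rows)
```

The gap widens with the row count, since the CSV header line is paid for once.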



lib
  • Command: server implementation of the API
  • Connection+Context: MySQL DB management; "post-ops" are SQL statements that can be run independently at the end of the command for great batch-processing performance
  • Server: HttpListener-based DB/API server
  • Sql: SQL-like => real SQL: (sorry this is long, but this is the magic)
Code:
using System;
using System.Threading.Tasks;
using System.Collections.Generic;
using System.Text;
using System.Linq;

using MySql.Data.MySqlClient;

namespace metastrings
{
    public static class Sql
    {
        public static async Task<string> GenerateSqlAsync(Context ctxt, Select query)
        {
            //
            // "COMPILE"
            //
            if (string.IsNullOrWhiteSpace(query.type))
                throw new MetaStringsException("Invalid query, TypeName is missing");

            if (query.select == null || query.select.Count == 0)
                throw new MetaStringsException("Invalid query, SelectColumns is empty");

            if (query.orderBy != null)
            {
                foreach (var order in query.orderBy)
                {
                    string orderField = order.field.Trim();
                    if (!query.select.Contains(orderField))
                        throw new MetaStringsException(
                            $"Invalid query, OrderByColumns must be in SelectColumns: {orderField}");
                }
            }

            if (query.where != null)
            {
                foreach (var criteriaSet in query.where)
                {
                    foreach (var criteria in criteriaSet.criteria)
                    {
                        ValidateOperator(criteria.op);
                        ValidateParameterName(criteria.paramName);
                    }
                }
            }

            //
            // SETUP
            //
            short typeId = await Types.GetIdAsync
             (ctxt, query.type, noCreate: true).ConfigureAwait(false);
            TypeObj typeObj = await Types.GetTypeAsync(ctxt, typeId).ConfigureAwait(false);

            // Gather columns
            var names = new List<string>();

            names.AddRange(query.select);

            if (query.orderBy != null)
                names.AddRange(query.orderBy.Select(o => o.field));

            if (query.where != null)
            {
                foreach (var criteriaSet in query.where)
                    names.AddRange(criteriaSet.criteria.Select(w => w.name));
            }

            // Cut them down
            names = names.Select(n => n.Trim()).Where
            (n => !string.IsNullOrWhiteSpace(n)).Distinct().ToList();

            // Get name objects
            var nameObjs = new Dictionary<string, NameObj>(names.Count);
            foreach (var name in names)
            {
                if (name == "value" || name == "id")
                {
                    nameObjs.Add(name, null);
                }
                else
                {
                    var nameId = await Names.GetIdAsync
                     (ctxt, typeId, name, noCreate: true).ConfigureAwait(false);
                    var nameObj = await Names.GetNameAsync(ctxt, nameId);
                    nameObjs.Add(name, nameObj);
                }
            }

            //
            // SELECT
            //
            string selectPart = "";
            foreach (var name in query.select.Select
            (n => n.Trim()).Where(n => !string.IsNullOrWhiteSpace(n)))
            {
                var cleanName = CleanName(name);

                if (selectPart.Length > 0)
                    selectPart += ",\r\n";

                if (name == "value")
                {
                    if (typeObj.isNumeric)
                        selectPart += "bv.numberValue";
                    else
                        selectPart += query.prefixesOnly ? 
                             "bv.prefix" : "IFNULL(bv.stringValue, bv.prefix)";
                }
                else if (name == "id")
                    selectPart += "i.id";
                else if (name == "count")
                    selectPart += "COUNT(*)";
                else if (nameObjs[name] == null)
                    selectPart += "NULL";
                else
                    selectPart += $"(CASE WHEN iv{cleanName}.isNumeric THEN iv{cleanName}.numberValue" +
                                  $" ELSE IFNULL(iv{cleanName}.stringValue, iv{cleanName}.prefix) END)";

                selectPart += $" AS {cleanName}";
            }
            selectPart = "SELECT\r\n" + selectPart;

            //
            // FROM
            //
            string fromPart = "FROM\r\nitems AS i";
            if (nameObjs.ContainsKey("value"))
                fromPart += "\r\nJOIN bvalues bv ON bv.id = i.valueid";

            foreach (var name in names.Select(n => n.Trim()).Where
                    (n => !string.IsNullOrWhiteSpace(n)))
            {
                if 
                (
                    name != "value" 
                    && 
                    name != "id"
                    && 
                    nameObjs.ContainsKey(name) 
                    && 
                    nameObjs[name] != null
                )
                {
                    var cleanName = CleanName(name);
                    fromPart +=
                        $"\r\nLEFT OUTER JOIN itemvalues as iv{cleanName} ON iv{cleanName}.itemid = i.id" +
                        $" AND iv{cleanName}.nameid = {nameObjs[name].id}";
                }
            }

            //
            // WHERE
            //
            string wherePart = $"i.typeid = {typeId}";
            if (query.where != null)
            {
                foreach (var criteriaSet in query.where)
                {
                    if (criteriaSet.criteria.Count == 0)
                        continue;

                    wherePart += "\r\nAND\r\n";

                    wherePart += "(";
                    bool addedOneYet = false;
                    foreach (var where in criteriaSet.criteria)
                    {
                        string name = where.name.Trim();
                        if (string.IsNullOrWhiteSpace(name))
                            continue;

                        if (!addedOneYet)
                            addedOneYet = true;
                        else
                            wherePart += $" {Enum.GetName(criteriaSet.combine.GetType(), criteriaSet.combine)} ";

                        var nameObj = nameObjs[name];
                        var cleanName = CleanName(name);

                        if (nameObj == null && cleanName != "value") // something on 
                                                                     // the item row
                        {
                            wherePart += $"i.{cleanName} {where.op} {where.paramName}";
                        }
                        else if (where.op.Equals
                                ("matches", StringComparison.OrdinalIgnoreCase))
                        {
                            string columnPart = 
                                name == "value" 
                                ? "bv.prefix, bv.stringValue" 
                                : $"iv{cleanName}.prefix, iv{cleanName}.stringValue";
                            wherePart += $"MATCH({columnPart})" +
                                         $" AGAINST ({where.paramName} IN BOOLEAN MODE) > 0";
                            if (query.orderBy == null || query.orderBy.Count == 0)
                            {
                                selectPart += $", MATCH({columnPart})" +
                                              $" AGAINST ({where.paramName} IN BOOLEAN MODE)" +
                                              $" AS {cleanName}Score";
                                query.orderBy = new List<Order>
                                    { new Order() { field = $"{cleanName}Score", descending = true } };
                            }
                        }
                        else if (cleanName == "value")
                        {
                            if (typeObj.isNumeric)
                                wherePart += $"bv.numberValue {where.op} {where.paramName}";
                            else
                                wherePart += $"bv.prefix {where.op} {where.paramName}";
                        }
                        else if (nameObj.isNumeric)
                        {
                            wherePart += $"iv{cleanName}.numberValue {where.op} {where.paramName}";
                        }
                        else
                        {
                            wherePart += $"iv{cleanName}.prefix {where.op} {where.paramName}";
                        }
                    }
                    wherePart += ")";
                }
            }
            wherePart = "WHERE " + wherePart;

            //
            // ORDER BY
            //
            string orderBy = "";
            if (query.orderBy != null)
            {
                foreach (var order in query.orderBy)
                {
                    if (string.IsNullOrWhiteSpace(order.field))
                        continue;

                    if (orderBy.Length > 0)
                        orderBy += ",\r\n";

                    string orderColumn = CleanName(order.field);

                    orderBy += MySqlHelper.EscapeString(orderColumn) + 
                               (order.descending ? " DESC" : " ASC");
                }

                if (orderBy.Length > 0)
                    orderBy = "ORDER BY\r\n" + orderBy;
            }

            //
            // LIMIT
            //
            string limitPart = "";
            if (query.limit > 0)
                limitPart = $"LIMIT\r\n{query.limit}";

            //
            // SQL
            //
            StringBuilder sb = new StringBuilder();

            sb.Append($"{selectPart.Trim()}");
            
            sb.Append($"\r\n\r\n{fromPart}");

            if (!string.IsNullOrWhiteSpace(wherePart))
            {
                sb.Append($"\r\n\r\n{wherePart}");
            }

            if (!string.IsNullOrWhiteSpace(orderBy))
            {
                sb.Append($"\r\n\r\n{orderBy}");
            }

            if (!string.IsNullOrWhiteSpace(limitPart))
            {
                sb.Append($"\r\n\r\n{limitPart}");
            }

            string sql = sb.ToString();
            return sql;
        }

        private static void ValidateOperator(string op)
        {
            if (!sm_validOps.Contains(op.ToLower()))
                throw new MetaStringsException($"Invalid query operator: {op}");
        }

        private static void ValidateParameterName(string parm)
        {
            if (string.IsNullOrWhiteSpace(parm))
                throw new MetaStringsException($"Invalid parameter name: (blank)");

            parm = parm.Trim();

            if (parm.Length < 2 || parm[0] != '@')
                throw new MetaStringsException($"Invalid parameter name: {parm}");

            for (int p = 1; p < parm.Length; ++p)
            {
                if (!char.IsLetterOrDigit(parm[p]))
                    throw new MetaStringsException($"Invalid parameter name: {parm}");
            }
        }

        public static string CleanName(string name) // used for table and column aliases
        {
            string clean = "";
            foreach (char c in name)
            {
                if (char.IsLetterOrDigit(c))
                {
                    clean += c;
                }
            }

            if (string.IsNullOrWhiteSpace(clean) || !char.IsLetter(clean[0]))
            {
                clean = "a" + clean;
            }
            return clean;
        }

        private static HashSet<string> sm_validOps =
            new HashSet<string>
            (
                new[]
                {
                    "=", "<>", ">", ">=", "<", "<=",
                    "matches",
                    "like"
                }
            );
    }
}
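CleanName and ValidateParameterName are small enough to sanity-check in isolation; here is a Python transliteration of their logic (mine, for illustration; Python's isalnum stands in for char.IsLetterOrDigit):

```python
def clean_name(name):
    # Keep only letters/digits, and ensure the alias starts with a letter,
    # since it's used as a SQL table/column alias.
    clean = "".join(c for c in name if c.isalnum())
    if not clean or not clean[0].isalpha():
        clean = "a" + clean
    return clean

def valid_param(parm):
    # A parameter must be '@' followed by at least one alphanumeric character.
    parm = parm.strip()
    return len(parm) >= 2 and parm[0] == "@" and parm[1:].isalnum()
```

Whitelisting alias characters and parameter names this way is what keeps the generated SQL injection-free, together with the fixed operator list.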



metafiles - The Competition
How about one table instead of all the above malarkey?
Code:
CREATE TABLE `fi` (
  `id` int NOT NULL AUTO_INCREMENT,
  `pathPrefix` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `path` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `length` bigint NOT NULL,
  `created` datetime NOT NULL,
  `lastmodified` datetime NOT NULL,
  `ext` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `readOnly` bit(1) NOT NULL,
  `namePrefix` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `company` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `product` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `descriptionPrefix` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `description` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `version` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `language` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `copyright` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_fi_pathPrefix` (`pathPrefix`),
  KEY `idx_fi_length` (`length`),
  KEY `idx_fi_created` (`created`),
  KEY `idx_fi_lastmodified` (`lastmodified`),
  KEY `idx_fi_ext` (`ext`),
  KEY `idx_fi_readOnly` (`readOnly`),
  KEY `idx_fi_namePrefix` (`namePrefix`),
  KEY `idx_fi_company` (`company`),
  KEY `idx_fi_product` (`product`),
  KEY `idx_fi_descriptionPrefix` (`descriptionPrefix`),
  KEY `idx_fi_version` (`version`),
  KEY `idx_fi_language` (`language`),
  KEY `idx_fi_copyright` (`copyright`),
  FULLTEXT KEY `idx_fi_path` (`path`),
  FULLTEXT KEY `idx_fi_name` (`name`),
  FULLTEXT KEY `idx_fi_description` (`description`)
)

utils
Lots of fun little stuff, but here are two you may get good mileage out of...
Code:
// Given a bunch of concatenate-able strings,
// and an ideal size for a batch of these strings, returns a list of string batches
// Great for batching up post-ops SQL for improved ingestion
public static List<string> BatchUp(IEnumerable<string> pieces, int batchSize)
{
    if (batchSize <= 0)
        throw new ArgumentException("batchSize must be > 0");

    var retVal = new List<string>();

    StringBuilder sb = new StringBuilder(batchSize);
    foreach (string str in pieces)
    {
        if (str.Length + sb.Length > batchSize)
        {
            retVal.Add(sb.ToString());
            sb.Clear();
        }

        sb.Append(str);
    }

    if (sb.Length > 0)
        retVal.Add(sb.ToString());

    return retVal;
}

// Given a bunch of stuff to work on, like file paths,
// split the items into a number of batches, to be processed independently
// Great for splitting up work to farm out to Tasks
// Kind of Parallel.ForEach
public static List<List<T>> SplitUp<T>(IEnumerable<T> items, int batches)
{
    if (batches <= 0)
        throw new ArgumentException("batches must be > 0");

    List<List<T>> retVal = new List<List<T>>(batches);

    if (batches == 1)
    {
        retVal.Add(new List<T>(items));
        return retVal;
    }

    int itemCount = items.Count();
    int batchSize = (int)Math.Round((double)itemCount / batches); // last batch can be fat

    for (int b = 0; b < batches; ++b)
    {
        var newList = new List<T>(batchSize);
        retVal.Add(newList);
    }

    int listIndex = 0;
    foreach (T t in items)
    {
        var curList = retVal[listIndex];
        curList.Add(t);

        if (curList.Count >= batchSize && listIndex != retVal.Count - 1)
            ++listIndex;
    }

    return retVal;
}
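Both helpers are easy to exercise. A Python sketch of the same ideas (illustrative, not a line-for-line port; the splitter below deals items round-robin rather than filling batches sequentially as the C# does):

```python
def batch_up(pieces, batch_size):
    """Concatenate strings into batches of roughly batch_size characters."""
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    batches, current = [], ""
    for s in pieces:
        if current and len(current) + len(s) > batch_size:
            batches.append(current)  # close the batch before it overflows
            current = ""
        current += s
    if current:
        batches.append(current)
    return batches

def split_up(items, n):
    """Deal items round-robin into n lists of near-equal size."""
    out = [[] for _ in range(n)]
    for i, item in enumerate(items):
        out[i % n].append(item)
    return out
```

batch_up keeps each piece whole, so a single oversized piece becomes its own batch; split_up never produces a "fat" last batch the way sequential filling can.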

Conclusion
It was a blast to build this system, and rebuild it, and to concentrate on CSV, which I think has its place as an alternative to JSON.
I had three goals for this project:
  1. Reuse of cached names and values leads to acceptable ingestion rates
    The ingest rate is 4X slower when cold (nothing in name-value tables or in-memory caches), and about 30% slower when the name-values world is in place in MySQL...and cached in memory in the lib classes! Not too shabby, ingestion is acceptable, I'd call this a win.
  2. Single string full-text index makes for faster querying
    Querying is about 2X slower...all those joins really add up! This is an area that needs more attention and optimization. Just happy to get working queries at this point. You saw the query building code, right? Yikes!
  3. Ultra-normalization results in less space used on disk
    Storage is about 2X smaller. All that normalization pays off!
And the dynamic schema is a big selling point. All's well that ends well.

Appendix: CSV is Fun and Easy
The world may be JSON-in, JSON-out, but I think there's a place for CSV in this world.

I was working on the above NoSQL database server, and I had leafy JSON documents flying back and forth. If there were 1,000 items to insert, there would be 1,000 self-documenting little...documents! We can do better!

There are three areas: CSV-generation, CSV-parsing, and CSV-using applications.

To generate CSV, you really only need to worry about commas; when a value contains commas, you wrap it in double quotes, and then you have to worry about embedded double quotes too.

Like so:
Code:
public static string NullCellValue = "== null ==";  // set this to what works for your domain;
                                                    // maybe a single character
                                                    // will work for you, how about ~ ?

public static string CleanseCsv(string str)
{
    if (str == null)
        return NullCellValue;

    if (str.IndexOf(',') < 0 && str.IndexOf('\"') < 0) // clean, no-op
        return str;

    return $"\"{str.Replace("\"", "\"\"")}\"";
}

public static string CleanseCsvObj(object obj)
{
    if (obj == null)
        return NullCellValue;

    return CleanseCsv(obj.ToString());
}

public static string UnNullCsv(string str)
{
    return str == NullCellValue ? null : str;
}
Generate a line of CSV like so:
Code:
public static string CreateCsv(IEnumerable<string> strs)
{
    return string.Join(",", strs.Select(s => CleanseCsv(s)));
}

public static string CreateCsv(IEnumerable<object> objs)
{
    return string.Join(",", objs.Select(o => CleanseCsvObj(o)));
}

Doesn't get much simpler than that... Thanks, LINQ!

Next up is parsing CSV.

Big Disclaimer: This code works with the One Line, One Row rule, no multi-line data allowed.

Here's the parser. The idea is that you pass in a CSV string or a TextReader to read from, then call Read and use Count and the indexer to access the data from the line just processed. It's simple, clean, and efficient. May it serve you well.
Code:
using System;
using System.Threading.Tasks;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Linq;

namespace metastrings
{
    public class CsvParser : IDisposable
    {
        public CsvParser(Stream stream, bool keepOpen = false)
            : this(new StreamReader(stream, Encoding.UTF8), keepOpen)
        {
        }

        public CsvParser(TextReader reader, bool keepOpen = false)
        {
            m_tr = reader;
            m_keepOpen = keepOpen;
        }

        public void Dispose()
        {
            if (m_tr != null && !m_keepOpen)
            {
                m_tr.Dispose();
                m_tr = null;
            }
        }

        public bool Read()
        {
            string line;
            while (true)
            {
                line = m_tr.ReadLine();
                if (line == null)
                    return false;

                if (string.IsNullOrWhiteSpace(line))
                    continue;

                break;
            }

            ParseCsvLine(line, m_sb, m_lineBuffer);
            return true;
        }

        public async Task<bool> ReadAsync()
        {
            string line;
            while (true)
            {
                line = await m_tr.ReadLineAsync().ConfigureAwait(false);
                if (line == null)
                    return false;

                if (string.IsNullOrWhiteSpace(line))
                    continue;

                break;
            }

            ParseCsvLine(line, m_sb, m_lineBuffer);
            return true;
        }

        public int Count => m_lineBuffer.Count;
        public string this[int index] => m_lineBuffer[index];

        public static void ParseCsvLine(string line, StringBuilder sb, List<string> lineBuffer)
        {
            sb.Clear();
            lineBuffer.Clear();

            if (line.Length == 0)
                return;

            bool hasQuotes = line.IndexOf('\"') >= 0;
            if (!hasQuotes) // fast path: plain split
            {
                lineBuffer.AddRange(line.Split(',').Select(s => CsvPrep.UnNullCsv(s)));
                return;
            }

            bool inQuote = false;
            for (int c = 0; c < line.Length; ++c)
            {
                char cur = line[c];

                if (cur == '\"')
                {
                    if (sb.Length == 0) // start quote
                    {
                        inQuote = true;
                        continue;
                    }

                    // doubled quote inside a quoted cell
                    if (c + 1 < line.Length)
                    {
                        char next = line[c + 1];
                        if (next == '\"')
                        {
                            sb.Append('\"');
                            ++c;
                            continue;
                        }
                    }

                    // end quote
                    lineBuffer.Add(CsvPrep.UnNullCsv(sb.ToString()));
                    sb.Clear();
                    ++c; // skip the following comma
                    inQuote = false;
                    continue;
                }

                if (cur == ',' && !inQuote)
                {
                    lineBuffer.Add(CsvPrep.UnNullCsv(sb.ToString()));
                    sb.Clear();
                    continue;
                }

                sb.Append(cur);
            }

            if (sb.Length > 0 || line[line.Length - 1] == ',')
                lineBuffer.Add(CsvPrep.UnNullCsv(sb.ToString()));
        }

        private TextReader m_tr;
        private bool m_keepOpen;
        private StringBuilder m_sb = new StringBuilder();
        private List<string> m_lineBuffer = new List<string>();
    }
}
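A round-trip sanity check of the cleanse-then-parse rules, sketched in Python (function names are mine; the quote handling mirrors the parser above under the one-line-one-row rule):

```python
def cleanse(cell):
    # Quote cells containing commas or quotes; double any embedded quotes.
    if "," in cell or '"' in cell:
        return '"' + cell.replace('"', '""') + '"'
    return cell

def parse_line(line):
    # One line, one row: split on commas outside quotes, un-double quotes.
    cells, buf, in_quote = [], "", False
    i = 0
    while i < len(line):
        c = line[i]
        if c == '"':
            if in_quote and i + 1 < len(line) and line[i + 1] == '"':
                buf += '"'              # escaped quote inside a quoted cell
                i += 1
            else:
                in_quote = not in_quote  # start/end quote
        elif c == "," and not in_quote:
            cells.append(buf)
            buf = ""
        else:
            buf += c
        i += 1
    cells.append(buf)
    return cells
```

Anything cleanse emits, parse_line gets back unchanged, which is the whole contract the client and server rely on.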

Enjoy!

History
  • 26th May, 2020: Initial post

License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
