Sunday, July 7, 2024

DataStoryTelling

Claus Grand Bang

Tuesday, May 21, 2024

Polymorphic Pattern

Let's walk through the implementation of the polymorphic pattern in MongoDB, using the example of a content management system where different types of content (e.g., articles, videos, and images) are stored in a single collection.

Step 1: Identify Different Document Types

  • Determine the types of documents you want to store in the collection. In our example, we have articles, videos, and images.

Step 2: Design Schema

  • Define a schema that accommodates different document types using fields to indicate the type or structure. Include common fields shared by all document types, as well as type-specific fields.
  • Example schema (JSON):

      {
        "type": "article" | "video" | "image",
        "title": <string>,
        "content": <string>,
        "url": <string>          // Only for video and image types
        // Additional fields specific to each type
      }

Step 3: Insert Documents of Different Types

  • Insert documents of different types into the MongoDB collection, ensuring they adhere to the specified schema.
  • Example documents (JSON):

      {
        "type": "article",
        "title": "Introduction to MongoDB Polymorphic Pattern",
        "content": "This article provides an overview of implementing a polymorphic pattern in MongoDB."
        // Additional fields specific to articles
      }

      {
        "type": "video",
        "title": "MongoDB Tutorial",
        "content": "A tutorial on using MongoDB.",
        "url": "https://example.com/mongodb-tutorial"
        // Additional fields specific to videos
      }

      {
        "type": "image",
        "title": "MongoDB Logo",
        "content": "The official MongoDB logo.",
        "url": "https://example.com/mongodb-logo"
        // Additional fields specific to images
      }

Step 4: Query Data by Type

  • Use MongoDB queries to retrieve documents based on their type field value.
  • Example query to retrieve all articles (JavaScript):

      db.content.find({ "type": "article" })

Step 5: Handle Different Document Types

  • Implement conditional logic in queries and application code to handle different document types appropriately. This might involve different processing or rendering logic based on the document type (see the sketch below).
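
A minimal sketch of such conditional handling (JavaScript / mongosh), assuming the content collection described above; the console.log calls stand in for whatever type-specific rendering your application does:

      db.content.find().forEach(doc => {
        // Branch on the discriminator field shared by all document types.
        switch (doc.type) {
          case "article":
            console.log(`Article: ${doc.title}`);
            break;
          case "video":
          case "image":
            console.log(`${doc.type}: ${doc.title} -> ${doc.url}`);  // url exists only for these types
            break;
          default:
            console.log(`Unknown content type: ${doc.type}`);
        }
      });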

By following these steps and adjusting them to fit your specific use case, you can effectively implement a polymorphic pattern in MongoDB to store and query documents of different types within a single collection.

MongoDB Patterns

Here's a concise cheat sheet covering various MongoDB data modeling patterns, with a schema design and a retail domain example for each:


Embedded Data Pattern

  • Description: Store related data within a single document using nested structures.
  • Schema Design (JSON):

      {
        "_id": ObjectId("..."),
        "order_id": "ORD123",
        "customer": {
          "name": "John Doe",
          "email": "john@example.com",
          "address": { "street": "123 Main St", "city": "Anytown", "country": "USA" }
        },
        "products": [
          { "name": "Product 1", "quantity": 2, "price": 50 },
          { "name": "Product 2", "quantity": 1, "price": 75 }
        ]
      }
  • Retail Domain Example: Order document containing customer details and ordered products.

Normalized Data Pattern

  • Description: Organize related data across multiple collections and establish relationships using references.
  • Schema Design (JSON):

      // Customers collection
      {
        "_id": ObjectId("..."),
        "name": "John Doe",
        "email": "john@example.com"
      }

      // Orders collection
      {
        "_id": ObjectId("..."),
        "customer_id": ObjectId("..."),
        "order_id": "ORD123"
        // Other order fields...
      }

      // Products collection
      {
        "_id": ObjectId("..."),
        "name": "Product 1",
        "price": 50
        // Other product fields...
      }
  • Retail Domain Example: Separate collections for customers, orders, and products with references between them.
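
When the data is normalized like this, references are resolved at query time. A hedged sketch (JavaScript / mongosh), assuming the customers and orders collections shown above:

      db.orders.aggregate([
        {
          $lookup: {
            from: "customers",           // collection holding the referenced documents
            localField: "customer_id",   // reference stored on each order
            foreignField: "_id",
            as: "customer"
          }
        },
        { $unwind: "$customer" }         // one matching customer per order
      ])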

Array of Objects Pattern

  • Description: Store related data as an array of objects within a document.
  • Schema Design (JSON):

      {
        "_id": ObjectId("..."),
        "customer": "John Doe",
        "orders": [
          {
            "order_id": "ORD123",
            "products": [
              { "name": "Product 1", "quantity": 2, "price": 50 },
              { "name": "Product 2", "quantity": 1, "price": 75 }
            ]
          }
        ]
      }
  • Retail Domain Example: Customer document with an array of orders, each containing ordered products.
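
A minimal query sketch (JavaScript / mongosh) against a customers collection with this shape; dot notation reaches into the nested arrays:

      // Customers who have any order containing "Product 1"
      db.customers.find({ "orders.products.name": "Product 1" })

      // Return only the customer name and order ids of matching customers
      db.customers.find(
        { "orders.products.name": "Product 1" },
        { customer: 1, "orders.order_id": 1 }
      )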

Bucketing Pattern

  • Description: Group related data into buckets or categories within a single collection.
  • Schema Design (JSON):

      {
        "_id": ObjectId("..."),
        "timestamp": ISODate("..."),
        "category": "sales",
        "order_id": "ORD123"
        // Other sales-related fields...
      }
  • Retail Domain Example: Sales data bucketed by categories like orders, returns, discounts, etc.

Polymorphic Pattern

  • Description: Accommodate different types of data within a single collection.
  • Schema Design (JSON):

      {
        "_id": ObjectId("..."),
        "entity_type": "customer"
        // Customer fields...
      }

      {
        "_id": ObjectId("..."),
        "entity_type": "product"
        // Product fields...
      }

      {
        "_id": ObjectId("..."),
        "entity_type": "order"
        // Order fields...
      }
  • Retail Domain Example: Documents representing customers, products, and orders stored in a single collection.

Shredding Pattern

  • Description: Decompose complex, nested structures into simpler, flatter documents.
  • Schema Design:
    • Decompose nested structures into separate collections and establish relationships using references.
  • Retail Domain Example: Decompose order documents into separate collections for customers, orders, and products.
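
A hedged sketch (JavaScript / mongosh) of shredding one embedded order document into flatter collections; the customers, orders, and order_items collection names are illustrative:

      // 1. Store the customer on its own and keep its _id.
      const customerId = db.customers.insertOne({
        name: "John Doe",
        email: "john@example.com"
      }).insertedId;

      // 2. Store the order with a reference to the customer.
      const orderId = db.orders.insertOne({
        order_id: "ORD123",
        customer_id: customerId
      }).insertedId;

      // 3. Store each ordered product as a flat line-item document.
      db.order_items.insertMany([
        { order_ref: orderId, name: "Product 1", quantity: 2, price: 50 },
        { order_ref: orderId, name: "Product 2", quantity: 1, price: 75 }
      ]);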

Document Versioning Pattern

  • Description: Track changes to documents over time.
  • Schema Design (JSON):

      {
        "_id": ObjectId("..."),
        "order_id": "ORD123",
        "status": "shipped",
        "__v": 1                 // Version number
      }
  • Retail Domain Example: Order documents with a versioning field to track status changes.
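
A minimal sketch (JavaScript / mongosh) of bumping the version on a status change; the optimistic check against the previously read __v value is an added assumption, not part of the cheat sheet above:

      // Apply the change only if the document is still at the version we last read,
      // and increment the version number in the same atomic update.
      db.orders.updateOne(
        { order_id: "ORD123", __v: 1 },
        { $set: { status: "delivered" }, $inc: { __v: 1 } }
      )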

By utilizing these patterns with appropriate schema designs in a retail domain context, you can effectively model your data in MongoDB to handle various aspects of a retail business, such as orders, customers, products, and sales data.


MongoDB Bucketing Pattern

 

Let's walk through the implementation of a bucketing pattern in MongoDB with an example of time-series data. In this scenario, we'll create buckets representing different time intervals (e.g., days) for storing sensor data.

Step 1: Identify Data to Bucket

  • We have sensor data that records temperature readings every minute.

Step 2: Define Bucketing Criteria

  • We'll bucket the sensor data by day, meaning each bucket will represent a single day's worth of temperature readings.

Step 3: Design Schema

  • Our schema will include fields for the temperature reading, the timestamp, and a bucketing field to represent the day.
  • Example schema (JSON):

      {
        "temperature": <value>,
        "timestamp": <timestamp>,
        "day_bucket": <date>
      }

Step 4: Insert Documents with Bucketing Field

  • Insert documents into the MongoDB collection, ensuring each document includes the day_bucket field representing the day it belongs to.
  • Example document (JSON):

      {
        "temperature": 25.5,
        "timestamp": ISODate("2024-05-20T12:30:00Z"),
        "day_bucket": ISODate("2024-05-20")
      }

Step 5: Query Data by Bucket

  • Use MongoDB's query capabilities to retrieve data based on the bucketing criteria.
  • Example query to retrieve temperature readings for May 20, 2024 (JavaScript):

      db.sensor_data.find({ "day_bucket": ISODate("2024-05-20") })

Step 6: Aggregate Data Across Buckets

  • Utilize MongoDB's aggregation framework to perform calculations across multiple buckets.
  • Example aggregation pipeline to calculate the average temperature for each day (JavaScript):

      db.sensor_data.aggregate([
        {
          $group: {
            _id: "$day_bucket",
            average_temperature: { $avg: "$temperature" }
          }
        }
      ])

Step 7: Optimize Performance

  • Monitor data distribution across buckets and create indexes on the day_bucket field to optimize query performance.
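
A minimal sketch (JavaScript / mongosh) of the indexes this step suggests; the compound index is an optional extra for range scans within a day:

      // Index the bucketing field so per-day queries avoid a collection scan.
      db.sensor_data.createIndex({ day_bucket: 1 })

      // Optional: a compound index also serves time-range queries inside a bucket.
      db.sensor_data.createIndex({ day_bucket: 1, timestamp: 1 })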

Step 8: Handle Bucket Growth

  • Implement strategies to manage bucket growth, such as archiving or partitioning buckets further, as needed.
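
One possible strategy, sketched in JavaScript / mongosh and assuming an archive collection named sensor_data_archive and a 90-day retention window (both illustrative):

      const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);

      // Copy old buckets into the archive collection...
      db.sensor_data.aggregate([
        { $match: { day_bucket: { $lt: cutoff } } },
        { $merge: { into: "sensor_data_archive", whenMatched: "keepExisting" } }
      ])

      // ...then delete them from the hot collection.
      db.sensor_data.deleteMany({ day_bucket: { $lt: cutoff } })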

By following these steps and adjusting them to fit your specific use case, you can effectively implement a bucketing pattern in MongoDB to organize and query time-series data.

MongoDB modeling techniques

 

The following top 10 modeling techniques in MongoDB provide a broad perspective on the various strategies available for data modeling:

  1. Embedded Data Models: Store related data within a single document using nested or embedded structures. Suitable for one-to-one and one-to-many relationships where the embedded data logically belongs to the parent document.

  2. Normalized Data Models: Organize related data across multiple collections and establish relationships using references or foreign keys. Ideal for many-to-many relationships or scenarios requiring data integrity and consistency.

  3. Array of Objects: Utilize arrays within documents to store related data as a collection of objects. Suitable for scenarios with one-to-many relationships and small, relatively static arrays.

  4. Bucketing Patterns: Group related data into "buckets" or categories within a single collection, often used for partitioning data such as time-series or event-based data.

  5. Polymorphic Patterns: Accommodate diverse data types within a single collection by using a field to indicate document types or by storing documents with varying structures but similar attributes. Offers flexibility for evolving schemas or heterogeneous data.

  6. Tree Structures: Model hierarchical relationships such as organizational charts or category hierarchies using tree structures like parent references or materialized path patterns (see the sketch after this list).

  7. Schema Versioning: Implement techniques to manage schema evolution over time, such as versioning documents or using flexible schema designs like the "attribute pattern" or "schemaless" modeling.

  8. Sharding and Data Partitioning: Scale out MongoDB deployments by distributing data across multiple shards based on a shard key, partitioning data to improve performance and scalability.

  9. Materialized Views: Precompute and store aggregated or derived data in separate collections to improve query performance for frequently accessed data or complex aggregations.

  10. Document Versioning: Implement versioning within documents to track changes over time, allowing for historical analysis or data rollback capabilities.
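
To make technique 6 concrete, here is a hedged sketch (JavaScript / mongosh) of a category hierarchy that combines parent references with a materialized path; the collection and field names are illustrative:

      db.categories.insertMany([
        { _id: "electronics", parent: null,          path: ",electronics," },
        { _id: "laptops",     parent: "electronics", path: ",electronics,laptops," },
        { _id: "gaming",      parent: "laptops",     path: ",electronics,laptops,gaming," }
      ])

      // Direct children of a node (parent-reference query).
      db.categories.find({ parent: "electronics" })

      // All descendants of a node (materialized-path query).
      db.categories.find({ path: /,electronics,/ })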

Each modeling technique offers specific advantages and trade-offs, and the selection depends on factors such as data access patterns, query requirements, scalability needs, and data consistency requirements. It's essential to evaluate the characteristics of your data and application to choose the most appropriate modeling approach.

Thursday, February 22, 2024

Prompt Engineering

Prompt engineering involves designing and crafting prompts that effectively communicate the desired task or question to a language model like ChatGPT. 

The key components of prompt engineering include:

  1. Task Definition: Clearly defining the task or problem you want the language model to solve. This involves specifying the input format, expected output format, and any constraints or requirements.

  2. Context and Examples: Providing relevant context and examples to guide the language model's understanding of the task. This can include giving it sample inputs and corresponding outputs, demonstrating different cases or scenarios, and providing additional information or constraints.

  3. Prompt Structure: Designing the structure and format of the prompt to ensure clarity and consistency. This includes using appropriate language, specifying placeholders or variables, and organizing the prompt in a logical and coherent manner.

  4. Few-Shot Learning: Leveraging few-shot learning techniques to train the language model on a small number of examples. This helps the model generalize and adapt to new tasks or variations of existing tasks.

  5. Prompt Patterns: Utilizing prompt patterns or templates that capture common patterns or structures in prompt writing. These patterns provide a framework for constructing prompts and can help improve efficiency and effectiveness in generating desired outputs.

By focusing on these key components, prompt engineering contributes to improving prompt writing skills in several ways:

  • Precision: Prompt engineering helps in generating precise prompts that clearly communicate the desired task or question to the language model. This improves the accuracy and relevance of the model's responses.

  • Consistency: By designing consistent prompt structures and formats, prompt engineering ensures that the language model receives consistent inputs, making it easier to interpret and generate desired outputs.

  • Adaptability: Through few-shot learning, prompt engineering enables the language model to learn and generalize from a small number of examples. This enhances its ability to handle new tasks or variations of existing tasks.

  • Efficiency: Prompt patterns provide a systematic approach to prompt writing, saving time and effort by reusing proven structures and formats. This allows prompt engineers to focus on customizing prompts for specific tasks rather than starting from scratch.

  • Effectiveness: Well-engineered prompts improve the overall performance and reliability of the language model, leading to more accurate and useful responses. This enhances the user experience and the value derived from using the model.

By honing their prompt engineering skills, individuals can effectively harness the capabilities of language models and achieve better outcomes in various applications, such as natural language understanding, problem-solving, and content generation.

There are various types of prompt patterns that can be used to enhance prompt engineering with large language models like ChatGPT. Here are some examples:

  1. Input Prompt Patterns:

    • Asking for user input: Prompting the user to provide specific information or answer a question.
    • Providing alternatives: Offering multiple options for the user to choose from.
  2. Persona Prompt Patterns:

    • Adopting a persona: Writing prompts from the perspective of a specific character or persona.
    • Role-playing: Engaging in a conversation or interaction with the model as a specific persona.
  3. Instruction Prompt Patterns:

    • Asking for clarification: Requesting the model to provide more details or clarify a certain topic.
    • Asking for examples: Prompting the model to provide examples or demonstrate a concept.
  4. Formatting Prompt Patterns:

    • Specifying output format: Instructing the model to generate output in a specific format or structure.
    • Controlling verbosity: Guiding the model to be more concise or elaborate in its responses.
  5. Contextual Prompt Patterns:

    • Providing context: Including relevant background information or previous conversation history in the prompt.
    • Referring to previous responses: Referring to the model's previous answers or statements in the prompt.
  6. Goal-oriented Prompt Patterns:

    • Setting goals: Explicitly stating the desired outcome or objective in the prompt.
    • Requesting step-by-step instructions: Asking the model to provide a sequence of actions or steps to achieve a specific goal.

These are just a few examples of prompt patterns that can be used to structure prompts and guide the behavior of large language models. By leveraging these patterns effectively, users can achieve more accurate and desired responses from the models.


Chain-of-Thought Prompting Elicits Reasoning in Large Language Models


We explore how generating a chain of thought—a series of intermediate reasoning steps—significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain-of-thought demonstrations are provided as exemplars in prompting.



Refinement pattern 

This text outlines the concept of using the question refinement pattern to enhance interactions with large language models like ChatGPT. It suggests that by refining initial questions with the model's assistance, users can obtain more precise and contextually relevant queries. The process involves prompting the model to suggest improvements to questions and then deciding whether to use the refined version. The text emphasizes the importance of continuously striving for better questions to optimize interactions with the language model. Through an example involving a decision about attending Vanderbilt University, it illustrates how refining questions can lead to more informative and tailored inquiries. Additionally, it highlights how this pattern fosters reflection on the clarity and completeness of questions, helping users identify missing information and refine their queries accordingly. Overall, the text underscores the value of leveraging question refinement to generate better questions, enhance learning from model refinements, and address missing contextual elements for improved outputs.


Tuesday, June 27, 2023

Prompt examples with ChatGPT / ChatGPT with RDBMS / Prompt Tuning

 Here are some examples of prompts you can use with ChatGPT:


General Information:

"Can you provide an overview of the history of artificial intelligence?"

"What are the benefits of exercise for overall health?"

"Explain the concept of blockchain technology."


Comparative Analysis:

"Compare the advantages and disadvantages of iOS and Android mobile operating systems."

"What are the differences between machine learning and deep learning?"

"Compare and contrast classical physics and quantum physics."


Problem-solving and Guidance:

"I'm trying to troubleshoot a computer network issue. Can you suggest some steps I can take to diagnose the problem?"

"How can I improve my time management skills and increase productivity?"

"What are some effective strategies for dealing with stress in the workplace?"


Exploratory Questions:

"What are some emerging trends in the field of renewable energy?"

"What are the potential impacts of artificial intelligence on job market in the next decade?"

"How is the COVID-19 pandemic affecting global supply chains?"


Scenario-based Questions:

"You are planning a trip to Paris. Can you recommend some must-visit attractions and local restaurants?"

"I want to start learning a new programming language. Which one would you suggest for beginners and why?"

"I'm considering pursuing a career in marketing. Can you provide insights into the key skills and qualifications needed in the industry?"

Remember to be as specific as possible in your prompts to get more targeted and accurate responses from ChatGPT. You can experiment with different prompt styles and variations to achieve the desired results.


---================================================================

An LLM (large language model) can be used effectively with relational databases to perform a wide range of tasks, such as querying, data analysis, and generating SQL statements. Here's how you can utilize an LLM with a relational database:

  1. Query Generation: An LLM can assist in generating SQL queries for retrieving data from a relational database. You can provide the LLM with the desired criteria, filters, and conditions, and it can generate the corresponding SQL statement. This can be helpful when you're not familiar with the specific syntax or need assistance in constructing complex queries.


  2. Data Analysis and Exploration: Use an LLM to explore and analyze your database by asking questions or providing prompts. The LLM can provide insights and answers based on the data present in the database. For example, you can ask questions like "What are the top-selling products in the past month?" or "What is the average revenue per customer?" The LLM can generate SQL queries or analyze the data directly to provide the requested information.


  3. Schema Exploration and Documentation: An LLM can assist in understanding and documenting the database schema. You can ask the LLM questions like "What are the tables and columns in this database?" or "Describe the relationships between the tables." The LLM can provide information about the structure of the database, including table names, column names, data types, and relationships.


  4. Data Validation and Cleaning: Use an LLM to validate and clean your data by providing rules or conditions for data integrity checks. The LLM can generate SQL statements to identify and correct inconsistencies or errors in the data. For instance, you can ask the LLM to identify duplicate entries or missing values in a specific column.


  5. Natural Language Interface: An LLM can act as a natural language interface to your relational database. Instead of writing complex SQL queries, you can communicate with the database using plain English or natural language queries. The LLM can interpret the intent of your questions and generate the corresponding SQL statements for data retrieval or manipulation.


  6. Database Administration: An LLM can assist in managing and administering the database by providing information about database statistics, performance optimization techniques, or suggesting indexing strategies. You can ask questions about database administration tasks and receive recommendations based on best practices.

Remember that while an LLM can be a helpful tool for interacting with a relational database, it's essential to have a good understanding of database concepts, query optimization, and data integrity to ensure the accuracy and efficiency of your interactions.


---================================================================

Prompt tuning involves refining and optimizing prompts to improve the quality and relevance of the model's responses. Here are some techniques you can use for prompt tuning:

Be specific and explicit: Clearly specify the desired format, context, or type of response you want from the model. Provide detailed instructions and examples if necessary. The more specific and explicit your prompt, the better the chances of getting the desired output.

Control response length: If you want shorter or more concise answers, set a maximum token limit to restrict the response length. This can help avoid verbose or irrelevant outputs. Experiment with different token limits to find the right balance between length and completeness.

Experiment with temperature and top-k/top-p: Adjust the temperature parameter to control the randomness of the model's responses. Lower values like 0.2 make the model more focused and deterministic, while higher values like 0.8 introduce more randomness. Similarly, try different values for top-k (top-k sampling) or top-p (nucleus sampling) to influence the diversity and creativity of the responses.
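
As a hedged illustration of these knobs, here is a minimal JavaScript sketch using the OpenAI Node.js SDK; the parameter names (temperature, top_p, max_tokens) follow that SDK and may differ in other libraries, and the model name is only a placeholder:

      import OpenAI from "openai";

      const client = new OpenAI();  // reads OPENAI_API_KEY from the environment

      const completion = await client.chat.completions.create({
        model: "gpt-4o-mini",             // placeholder model name
        messages: [
          { role: "system", content: "Answer in at most three sentences." },
          { role: "user", content: "Explain the MongoDB bucketing pattern." }
        ],
        temperature: 0.2,                 // low value -> more focused, deterministic output
        top_p: 0.9,                       // nucleus sampling cutoff
        max_tokens: 150                   // hard cap on response length
      });

      console.log(completion.choices[0].message.content);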

Use system messages: System-level instructions can guide the model's behavior throughout the conversation. Use system messages to set context, remind the model of certain rules or constraints, or provide high-level instructions. These messages can help guide the conversation and shape the model's responses.

Iterate and refine: Prompt tuning is an iterative process. Experiment with different prompt variations, instructions, or techniques. Assess the model's responses and make adjustments based on the observed results. Continuously iterate and refine your prompts to improve the quality of the outputs.

Provide context and history: In a multi-turn conversation, include relevant context and history by prefixing previous messages. This helps the model maintain coherence and continuity in its responses. By referencing earlier parts of the conversation, you can guide the model to provide more consistent and contextually relevant answers.

Use human feedback: Solicit feedback from humans to evaluate the model's responses to different prompts. Collect feedback on the quality, relevance, and accuracy of the outputs. This feedback can guide you in further refining and optimizing your prompts to align with human expectations.

Fine-tuning: Consider fine-tuning the base model on a specific dataset or domain if you require more control over the outputs. Fine-tuning allows you to customize the model's behavior based on your specific needs and can lead to improved performance for specific tasks.

Remember that prompt tuning is a dynamic process, and the effectiveness of different techniques can vary depending on the specific task or domain. It requires experimentation, evaluation, and adaptation to optimize the prompts and achieve the desired outcomes.


Wednesday, June 21, 2023

Data Science - Feature Engineering

 Feature engineering employs various techniques to shape and enhance the features within a dataset. Some commonly used techniques include scaling, encoding, imputation, binning, and aggregation.

Scaling ensures variables are on a similar scale, while encoding transforms categorical variables into numerical form. Imputation techniques fill in missing values, and binning captures non-linear relationships. Aggregation involves deriving meaningful insights by aggregating data at a higher level.
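
As a rough illustration of two of these techniques, here is a minimal JavaScript sketch of min-max scaling and one-hot encoding on toy data; in practice a data science library would normally be used, and the function names here are placeholders:

      // Min-max scaling: rescale numeric values into the [0, 1] range.
      function minMaxScale(values) {
        const min = Math.min(...values);
        const max = Math.max(...values);
        return values.map(v => (v - min) / (max - min));
      }

      // One-hot encoding: turn each categorical value into a binary vector.
      function oneHotEncode(values) {
        const categories = [...new Set(values)];
        return values.map(v => categories.map(c => (c === v ? 1 : 0)));
      }

      console.log(minMaxScale([10, 20, 40]));            // [0, 0.333..., 1]
      console.log(oneHotEncode(["red", "blue", "red"]));  // [[1,0],[0,1],[1,0]]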

Feature engineering typically proceeds through four steps: data understanding, feature selection, feature creation, and feature transformation. Each step contributes to refining the dataset and improving model performance.

Feature engineering is a dynamic and iterative process that requires continuous experimentation and evaluation. It allows data scientists to transform raw data into a feature-rich dataset that captures underlying patterns and relationships. However, it's important to strike a balance and avoid over-engineering, which can lead to overfitting or unnecessary complexity in models.
In conclusion, feature engineering is the cornerstone of successful data science and machine learning projects. 

Thursday, June 1, 2023

Data Lakehouse Architecture & Data Observability & Reverse ETL

https://drive.google.com/file/d/1ze7wNp91bhi4Pt40vsMPj9d2_27Q3VCr/view?usp=drive_link


https://docs.google.com/presentation/d/1v2mFcgCW15GH78ug_jHLPA7E120423EX/edit?usp=drive_link&ouid=106871833537371668656&rtpof=true&sd=true



https://drive.google.com/file/d/1cBG87MlaFacG6Fui0XyQf-F-tEPQpale/view?usp=drive_link




Snowflake Reverse ETL

https://medium.com/snowflake/3-ways-data-engineers-at-snowflake-leverage-reverse-etl-1eb106bf6079


--------------------------


Reverse ETL on Snowflake | How to set up Hightouch on Snowflake for beginners


https://www.youtube.com/watch?v=KWNtUu9K_mk  


Adam Morton 


--------------------------


Hightouch Helps Companies Make Maximum Use Of Their Snowflake Data


https://www.youtube.com/watch?v=zZzO3WYLxOg



--------------------------


Sunday, January 9, 2022

Data Mesh Links

 https://medium.com/intuit-engineering/intuits-data-mesh-strategy-778e3edaa017

https://venturebeat.com/2021/05/27/databricks-unifies-data-science-and-engineering-with-a-federated-data-mesh/

https://aws.amazon.com/blogs/industries/how-data-mesh-technology-can-help-transform-risk-finance-and-treasury-functions-in-banks/

https://aws.amazon.com/blogs/big-data/how-jpmorgan-chase-built-a-data-mesh-architecture-to-drive-significant-value-to-enhance-their-enterprise-data-platform/

Saturday, December 1, 2018

Decentralized storage options

Courtesy of Wulf Kaal, Entrepreneur, Technologist, Professor at University of St. Thomas School of Law (2011-present)

Currently, decentralized blockchain applications have few options for storing data. The decentralized storage options are:
  • Storing everything in blockchain itself
  • Peer to peer file system, such as IPFS
  • Decentralized cloud file storages, such as Storj, Sia, Ethereum Swarm, etc.
  • Distributed Databases, such as Apache Cassandra, Rethink DB, etc.
  • BigChainDB
  • Ties DB
Let’s consider them all in detail:
  1. Storing everything in the blockchain itself: Storing everything in the blockchain is the simplest solution, and currently most simple decentralized applications work exactly this way. However, this approach has significant drawbacks. First of all, blockchain transactions are slow to confirm. That may seem fast for a money transfer (anyone can wait a minute), but it is extremely slow for a rich application's data flow; a rich application may require many thousands of transactions per second. Secondly, the blockchain is immutable. Immutability is the strength that gives blockchain its high robustness, but it is a weakness for a data store: a user may change their profile or replace their photo, yet all the previous data will sit in the blockchain forever and can be seen by anyone. Immutability also leads to one more drawback: capacity. If all applications kept their data in the blockchain, its size would grow rapidly, exceeding publicly available hard drive capacity, and full nodes would require special hardware, which could result in dangerous centralization of the blockchain. That is why storing data only in the blockchain is not a good option for a rich decentralized application.
  2. Peer-to-peer file systems, such as the InterPlanetary File System (IPFS): IPFS lets you share files from client computers and unites them into a global file system. The technology is based on the BitTorrent protocol and a distributed hash table. It has several strong points. It is truly peer to peer: to share anything, you first put it on your own computer, and it is downloaded only if someone needs it. It is content-addressable, so it is impossible to forge content for a given address. Popular files can be downloaded very quickly thanks to the BitTorrent protocol. However, it also has drawbacks. You have to stay online if you want to share your files, at least until someone becomes interested and downloads them from you. It serves only static files, which cannot be modified or removed once uploaded. And of course, you cannot search these files by their meaningful content.
  3. Decentralized cloud file storages: There are also decentralized cloud file storages that lift some of IPFS's limitations. From the user's point of view these are just cloud storages like Dropbox, for example. The difference is that the content is hosted on the computers of users who offer their hard drive space for rent, rather than in data centers. There are plenty of such projects nowadays, for example Sia, Storj, and Ethereum Swarm. You no longer need to stay online to share your files: just upload a file and it is available in the cloud. These storages are highly reliable, fast enough, and have enormous capacity. Still, they serve static files only, offer no content search, and, since they are built on rented hardware, they are not free.
  4. Distributed Databases: Since we need to store structured data and want advanced query capabilities, we can look at distributed NoSQL databases. Why NoSQL? Because strictly transactional SQL databases cannot be truly distributed due to the restrictions of the CAP theorem. To make a database distributed, we must sacrifice either consistency or availability. NoSQL databases choose availability over consistency, replacing it with so-called "eventual consistency", where all the database nodes in the network become consistent some time later. There are many mature implementations of such databases, for example MongoDB, Apache Cassandra, and RethinkDB. They are very good: fast, scalable, fault tolerant, and they support rich query languages. But they still have a fatal drawback for our application: they are not Byzantine-proof. All the nodes of the cluster fully trust each other, so any malicious node can destroy the whole database.
  5. BigChainDB: There is another project, called BigChainDB, that claims to solve the data storage and transaction speed problems. It is also a blockchain, but with enormous data capacity and really fast transactions. Let us see how that is possible. BigChainDB is built upon a RethinkDB cluster, the NoSQL database mentioned in the previous point, and uses it to store all the blocks and transactions. That is why it shows such high throughput: the throughput is that of the underlying NoSQL database. All the BigChainDB (BDB) nodes are connected to the cluster and have full write access to the database. Here comes the problem: the whole of BigChainDB is not Byzantine-proof. Any malicious BDB node can destroy the RethinkDB cluster. The BigChainDB team is aware of this problem and promises to solve it sometime in the future; however, it is the cornerstone of the architecture, and changing it may not be possible. Anyway, BigChainDB may be good for a private blockchain, but in my opinion, to avoid confusion, it should have been named BigPrivateBlockchain. It is not an option for public storage.
  6. Ties DB: None of the currently available options is quite right for a public database. The closest to the ideal are the NoSQL databases; the only thing they lack is Byzantine fault tolerance. The Ties.Network database (TiesDB) is a deep modification of the Cassandra database and offers a preferable solution: TiesDB inherits the majority of features from the underlying NoSQL database and adds Byzantine fault tolerance and incentives. With these features it can become a public database and enable feature-rich applications on Ethereum and other blockchains with smart contracts. The database is writable by any user, but users are identified by their public keys and all requests are signed. Once created, a record remembers its creator, who becomes the owner of the record; after that, the record can be modified only by the owner. Everyone can read all records, because the database is public. All permissions are checked on request and on replication, and additional permissions can be managed via a smart contract.