DP-900 Objective 1.2: Identify Options for Data Storage
DP-900 Exam Focus: This objective covers data storage fundamentals including common file formats (CSV, JSON, XML, Parquet, Avro, ORC) and database types (relational and NoSQL including key-value, document, column-family, and graph). Understanding format characteristics, use cases, and when to choose each database type is essential. Focus on distinguishing features, strengths, weaknesses, and appropriate Azure services for each storage option.
Data Storage Options Overview
Data storage technologies have evolved significantly to address diverse workload requirements, data characteristics, and scale needs. Organizations must choose appropriate storage options based on data structure, access patterns, consistency requirements, and query complexity. The two primary categories, file-based storage and database systems, each encompass multiple specialized options optimized for specific scenarios. File formats range from simple delimited text to sophisticated columnar formats optimized for analytics. Database systems span traditional relational databases designed for transactional consistency to specialized NoSQL databases prioritizing scale, flexibility, or specific data models like graphs or time series.
Understanding storage options enables effective data architecture decisions balancing performance, cost, scalability, and maintainability. Legacy systems often relied exclusively on relational databases and flat files, but modern cloud platforms like Microsoft Azure provide specialized services for diverse storage needs. The polyglot persistence approach selects optimal technologies for each workload rather than forcing all data into one storage type. This objective explores common data file formats used for storage and exchange, and database categories including relational and various NoSQL types, providing foundation for selecting appropriate Azure data services.
Common Data File Formats
Delimited Text Files (CSV and TSV)
Comma-Separated Values (CSV) files represent the simplest and most universal data file format, storing tabular data as plain text with commas separating fields and line breaks separating records. CSV's ubiquity stems from simplicity: any text editor can read and modify CSV files, and virtually all database and spreadsheet applications can import and export CSV. The human-readable format enables quick inspection and debugging without specialized tools. However, CSV limitations include lack of data type information (everything is text), no standardized handling of special characters or commas within values (requiring escaping or quoting), no support for hierarchical data or relationships, and inefficiency for large datasets due to text storage overhead.
Tab-Separated Values (TSV) uses tabs instead of commas as delimiters, reducing issues with commas within data fields but otherwise sharing CSV characteristics. Organizations commonly use CSV for data exports from applications, spreadsheet data sharing, simple data integration between systems, and archival storage of structured data. Despite limitations, CSV remains relevant due to universal compatibility and simplicity. Azure Data Factory can ingest CSV files from various sources, and Azure Synapse Analytics can query CSV files directly in data lakes through serverless SQL pools. The lack of schema and data types means applications must validate and parse CSV data carefully, handling missing values, type conversions, and encoding issues.
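The parsing concerns above can be sketched with Python's standard `csv` module; the sample data and field names here are invented for illustration. Note that quoting protects embedded commas, and that type conversion is entirely the application's job:

```python
import csv
import io

# CSV carries no type information, so every field arrives as a string;
# quoting protects the comma embedded in "Doe, Jane".
raw = 'name,city,balance\n"Doe, Jane",Seattle,1250.50\n"Smith, Bob",Austin,99.99\n'

rows = list(csv.DictReader(io.StringIO(raw)))

# Applications must validate and convert types themselves.
for row in rows:
    row["balance"] = float(row["balance"])

print(rows[0]["name"])     # Doe, Jane
print(rows[0]["balance"])  # 1250.5
```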
JSON (JavaScript Object Notation)
JSON has emerged as the dominant format for semi-structured data exchange, particularly in web APIs and modern applications. JSON represents data through key-value pairs, nested objects, and arrays, supporting hierarchical data structures unavailable in flat CSV files. The format includes native data types (strings, numbers, booleans, null) though less sophisticated than programming languages or databases. JSON's lightweight syntax balances human readability with compact representation. Native JavaScript parsing makes JSON ideal for web applications, though libraries exist for virtually every programming language. The self-describing nature through keys makes JSON more resilient to schema changes than position-dependent CSV.
JSON advantages include supporting nested data structures eliminating need for multiple flat files, schema flexibility where documents can have different field sets, human readability aiding debugging and documentation, and ubiquitous tooling support. Disadvantages include verbosity from repeating keys in every record, lack of standardized schema validation (though JSON Schema addresses this), and text-based storage inefficiency for large datasets. JSON dominates REST APIs for request and response payloads, configuration files for applications and infrastructure, document databases storing flexible records, and data exchange between distributed systems. Azure Cosmos DB natively stores and indexes JSON documents, while Azure Blob Storage can hold JSON files processed by analytics services.
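A short sketch shows the hierarchical structure CSV cannot express; the order record below is invented for illustration. A single JSON document embeds the customer and the line items that would otherwise require multiple flat files:

```python
import json

# A nested order record that would need several related CSV files to represent.
order = {
    "orderId": 1001,
    "customer": {"name": "Jane Doe", "email": "jane@example.com"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

text = json.dumps(order)   # serialize for an API payload
parsed = json.loads(text)  # round-trip back to native types

# Nested arrays and objects are directly addressable after parsing.
total = sum(i["qty"] * i["price"] for i in parsed["items"])
print(round(total, 2))  # 44.48
```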
XML (eXtensible Markup Language)
XML predates JSON and remains prevalent in enterprise integration, configuration management, and industries with established XML standards. XML uses opening and closing tags enclosing data, supporting attributes on elements, namespaces preventing naming conflicts, and rich schema validation through XSD (XML Schema Definition). The robust validation capabilities ensure documents conform to expected structures before processing. XSLT provides powerful transformation capabilities converting XML between formats. However, XML's verbosity exceeds JSON due to closing tags, making files larger and parsing slower. The tag-based syntax is less immediately readable than JSON's key-value pairs for humans unfamiliar with XML structures.
XML excels in scenarios requiring schema validation ensuring data conformity, enterprise integration using SOAP web services, configuration files needing comments and structure, data interchange in healthcare (HL7), finance (FpML), or government (GovTalk), and document markup (XHTML, DocBook). XML provides features JSON lacks including comments within documents, mixed content combining text and child elements, and namespace-based interoperability. Azure Logic Apps can parse and transform XML documents, and Azure API Management can mediate between XML SOAP services and JSON REST APIs. While JSON dominates new development, XML's establishment in enterprise systems ensures continued relevance particularly in regulated industries with standardized XML formats.
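The XML features mentioned above (attributes, comments, nested elements) can be seen in a minimal parsing sketch using Python's standard `xml.etree.ElementTree`; the document content is invented for illustration:

```python
import xml.etree.ElementTree as ET

# XML supports attributes on elements and comments inside documents,
# two features JSON lacks.
doc = """<?xml version="1.0"?>
<orders>
  <!-- comments are legal in XML -->
  <order id="1001">
    <customer>Jane Doe</customer>
    <total currency="USD">44.48</total>
  </order>
</orders>"""

root = ET.fromstring(doc)
order = root.find("order")
print(order.get("id"))                      # attribute: 1001
print(order.find("total").get("currency"))  # attribute: USD
print(order.find("customer").text)          # element text: Jane Doe
```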
Parquet - Columnar Storage Format
Apache Parquet is a columnar storage format optimized for analytical workloads in big data ecosystems. Unlike row-oriented formats storing complete records sequentially, Parquet organizes data by columns, storing all values for each column together. This columnar layout enables excellent compression, since column values share types and often patterns, and efficient analytical queries reading only required columns without scanning entire rows. Parquet files include embedded schemas describing data types and support complex nested data structures. The binary format is not human-readable but provides dramatic performance and storage benefits for analytical workloads.
Parquet advantages include superior compression ratios reducing storage costs (often 75% smaller than CSV), columnar storage enabling column-specific queries and predicate pushdown, embedded schema eliminating separate metadata management, and native support in analytics engines like Apache Spark and Azure Databricks. Parquet suits data lake storage where analytical queries aggregate or filter specific columns across large datasets, ETL pipelines transforming data between systems, long-term archival of structured analytical data, and machine learning feature stores. Azure Synapse Analytics and Azure Databricks natively read and write Parquet files in Azure Data Lake Storage, providing efficient analytics on big data stored as Parquet. Organizations often convert CSV or JSON data to Parquet for long-term storage and analytics, accepting the upfront conversion cost for ongoing query performance and storage benefits.
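The column-pruning benefit can be sketched in pure Python. This is only an illustration of the layout difference with invented sensor data; real Parquet adds encodings, compression, row groups, and an embedded schema on top of this idea:

```python
import json

# Invented sensor rows for illustration only.
rows = [{"sensor": f"temp-{i % 5:02d}", "reading": 20 + (i % 50), "unit": "C"}
        for i in range(1000)]

# Row-oriented layout: complete records stored in sequence.
row_layout = json.dumps(rows).encode()

# Column-oriented layout: all values of one column stored together.
col_layout = {k: json.dumps([r[k] for r in rows]).encode() for k in rows[0]}

# An analytical query such as AVG(reading) needs only one column; a
# columnar layout scans just that column's bytes instead of every record.
scanned = len(col_layout["reading"])
print(scanned, "of", len(row_layout), "bytes scanned")
```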
Avro - Row-Based Binary Format
Apache Avro is a compact binary serialization format supporting schema evolution, making it popular in data streaming and big data scenarios. Unlike Parquet's columnar layout, Avro uses row-based storage where complete records are serialized together, optimizing for write performance and scenarios where entire records are typically read. Avro embeds schemas in files but also supports a schema registry, enabling centralized schema management and evolution. The format handles schema changes gracefully, allowing readers with newer schemas to process data written with older schemas through default values and optional fields.
Avro advantages include compact binary representation smaller than JSON or XML, schema evolution supporting backward and forward compatibility, fast serialization and deserialization, and integration with streaming platforms like Apache Kafka and Azure Event Hubs. Avro suits data streaming ingesting high-volume events, message queuing requiring serialization efficiency, data pipelines with evolving schemas, and RPC frameworks requiring efficient serialization. Azure Event Hubs supports Avro serialization with schema registry integration, while Azure Data Factory can process Avro files. Organizations often use Avro for streaming data ingestion, then convert to Parquet for analytical storage, leveraging Avro's write performance and schema evolution for ingestion and Parquet's compression and query performance for analytics.
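The schema-evolution behavior can be modeled in a few lines. This sketch only illustrates the resolution rule (reader defaults fill fields the writer lacked); real Avro applies it during binary decoding, often with a schema registry, and the schemas below are invented for illustration:

```python
# Writer schema v1 lacked the "unit" field; reader schema v2 adds it
# with a default, so old data remains readable.
writer_v1 = {"fields": ["device", "value"]}
reader_v2 = {"fields": ["device", "value", "unit"],
             "defaults": {"unit": "C"}}

def resolve(record, reader_schema):
    # Keep fields the reader knows, supplying defaults for any the
    # writer's older schema did not include.
    out = {}
    for field in reader_schema["fields"]:
        if field in record:
            out[field] = record[field]
        else:
            out[field] = reader_schema["defaults"][field]
    return out

old_record = {"device": "d-42", "value": 21.5}  # written with schema v1
print(resolve(old_record, reader_v2))
```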
ORC (Optimized Row Columnar)
ORC is another columnar storage format, similar to Parquet but originating in the Apache Hive ecosystem. ORC offers lightweight indexes enabling skipping irrelevant data during scans, integrated compression with multiple codecs, and statistics collection for query optimization. The format supports ACID transactions through integration with Hive, though this capability is specific to certain ecosystems. ORC and Parquet share many characteristics (both are columnar, binary, compressed, and optimized for analytics) with minor differences in compression algorithms, indexing strategies, and ecosystem integration points.
ORC advantages include excellent compression often exceeding Parquet, built-in indexes accelerating queries, bloom filters for efficient filtering, and strong Hive ecosystem integration. ORC suits similar scenarios as Parquet: data lakes with analytical workloads, ETL output for downstream analytics, and big data storage requiring compression and performance. Azure Synapse Analytics supports querying ORC files in data lakes. The choice between Parquet and ORC often depends on specific ecosystem requirements, existing tooling, and benchmark results for specific workloads rather than fundamental capability differences. Many organizations standardize on one format (often Parquet due to broader ecosystem support) for consistency rather than mixing formats.
Relational Databases
Relational Database Fundamentals
Relational databases organize data into tables with predefined schemas specifying columns, data types, constraints, and relationships. Each table represents an entity type (customers, products, orders), with rows containing individual entity instances and columns representing attributes. Primary keys uniquely identify each row, while foreign keys reference primary keys in related tables establishing relationships. This relational model, introduced by Edgar Codd in 1970, provides mathematical foundation for data management through relational algebra and calculus. The structured approach eliminates redundancy through normalization, enforces data integrity through constraints, and enables powerful query capabilities through SQL.
Key relational database features include ACID transactions guaranteeing Atomicity (all-or-nothing operations), Consistency (valid states), Isolation (concurrent transaction independence), and Durability (committed data persists); referential integrity preventing orphaned records; normalization reducing duplication by separating data into related tables; declarative SQL queries describing desired results rather than access paths; and mature optimization enabling efficient query execution through indexes, statistics, and execution plans. These features make relational databases ideal for transactional systems requiring data consistency and complex queries across related entities.
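Atomicity, the "all-or-nothing" guarantee above, can be demonstrated with Python's built-in `sqlite3` module as a local stand-in for a full relational engine; the accounts and amounts are invented for illustration:

```python
import sqlite3

# A CHECK constraint models the "valid states" (consistency) rule.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, "
            "balance REAL CHECK (balance >= 0))")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100.0), ("bob", 50.0)])
con.commit()

def transfer(src, dst, amount):
    try:
        with con:  # one transaction: both updates commit, or neither does
            con.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                        (amount, src))
            con.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                        (amount, dst))
    except sqlite3.IntegrityError:
        return False  # CHECK constraint fired; everything rolled back
    return True

print(transfer("alice", "bob", 30))   # True
print(transfer("alice", "bob", 500))  # False: would overdraw, rolled back
print(con.execute("SELECT balance FROM accounts WHERE id='alice'").fetchone())
```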
Relational Database Use Cases
Relational databases excel at transactional workloads requiring strong consistency and ACID guarantees. Financial systems rely on transactions ensuring account balances remain accurate during transfers, preventing double-spending or lost updates. E-commerce platforms require transactions spanning inventory deduction, order creation, and payment processing, ensuring either all steps succeed or none do. Enterprise resource planning (ERP) systems manage complex business operations with intricate relationships between entities like customers, products, suppliers, employees, and financial accounts. Customer relationship management (CRM) systems track interactions across multiple touchpoints with relationships between contacts, companies, opportunities, and activities.
Additional use cases include human resources management tracking employees, departments, positions, payroll, and benefits; supply chain management coordinating inventory, orders, shipments, and warehouses; booking systems managing reservations, availability, customers, and resources; and healthcare systems maintaining patient records, appointments, treatments, and billing. These scenarios share characteristics of structured data with well-defined schemas, complex relationships between entities, requirements for strong consistency preventing data anomalies, and need for complex analytical queries joining multiple tables. When data exhibits these characteristics and consistency outweighs scale requirements, relational databases provide proven solutions with decades of optimization and mature tooling.
Azure Relational Database Services
Microsoft Azure offers multiple managed relational database services. Azure SQL Database provides fully managed SQL Server-compatible database with built-in high availability, automated backups, intelligent performance optimization, security features including encryption and threat detection, and elastic scaling. It suits applications requiring SQL Server compatibility, T-SQL procedures, and enterprise features. Azure Database for PostgreSQL offers managed PostgreSQL with extensions, replication, and Azure integration, ideal for applications using PostgreSQL features, open-source preference, or specific PostgreSQL extensions. Azure Database for MySQL provides managed MySQL popular for web applications and applications requiring MySQL compatibility.
Azure SQL Managed Instance delivers near-100% SQL Server compatibility for lift-and-shift migrations requiring features like SQL Agent, linked servers, or CLR integration. Azure Database for MariaDB supports MariaDB community edition with compatibility for MySQL migrations. Each service handles operational concerns like patching, backups, high availability, monitoring, and security, enabling focus on application development rather than database administration. The services provide flexible pricing tiers matching workload requirements from low-cost development databases to enterprise-grade production systems. Choosing between services depends on application requirements, existing skills, compatibility needs, and specific feature requirements rather than fundamental capability differences since all provide core relational database capabilities.
NoSQL Database Types
Key-Value Databases
Key-value databases represent the simplest NoSQL model, storing data as key-value pairs where unique keys map to values. Values can contain strings, numbers, JSON documents, binary blobs, or any serializable data structure without enforced schema. This simplicity enables extreme performance and scale: lookups by key are constant-time operations, and horizontal partitioning distributes keys across multiple servers. Key-value stores sacrifice query flexibility for speed and scale; they cannot query by value contents, find ranges of keys, or join across items. The model suits scenarios where data access primarily happens by known keys.
Key-value database use cases include session management storing user session state retrievable by session ID, caching frequently accessed data reducing database load, shopping carts tracking temporary user selections before purchase, user preferences storing settings retrievable by user ID, and rate limiting tracking API usage by client identifier. Additional uses include real-time leaderboards updating scores, distributed locks coordinating access, and feature flags controlling application behavior. Azure Table Storage provides simple key-value storage for Azure-native applications. Azure Cosmos DB Table API offers globally distributed key-value database with enhanced consistency, SLA-backed performance, and global distribution. Redis Cache, available as Azure Cache for Redis, provides in-memory key-value store for ultra-low latency.
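The session-management pattern can be sketched with a minimal in-process store; class and method names here are invented. A service such as Azure Cache for Redis provides the same get/set-by-key model at scale, but the access pattern is identical:

```python
import time
import uuid

class SessionStore:
    """Minimal key-value session store: every operation goes through the key."""

    def __init__(self, ttl_seconds=1800):
        self._data = {}
        self._ttl = ttl_seconds

    def create(self, user_id):
        session_id = str(uuid.uuid4())  # the key the client holds
        self._data[session_id] = {"user_id": user_id,
                                  "expires": time.time() + self._ttl}
        return session_id

    def get(self, session_id):
        entry = self._data.get(session_id)  # O(1) lookup by key
        if entry is None or entry["expires"] < time.time():
            return None  # missing or expired; no query-by-value is possible
        return entry

store = SessionStore()
sid = store.create(user_id=42)
print(store.get(sid)["user_id"])  # 42
print(store.get("missing-key"))   # None
```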
Document Databases
Document databases store semi-structured documents, typically JSON, containing hierarchical data with nested objects and arrays. Unlike key-value stores returning opaque values, document databases understand document structure enabling queries against nested fields, filtering, sorting, and projections. Unlike relational databases requiring normalized tables, document databases encourage denormalization embedding related data within documents for read performance. This approach trades increased storage and update complexity for reduced joins and improved read performance. Documents in collections need not share schemas, enabling flexible data models evolving without migrations.
Document database use cases include content management systems storing articles, posts, or pages with varying structures; product catalogs where different product types have different specifications; user profiles with varying attributes across user types; mobile applications requiring offline-first capabilities synchronizing documents; and event-driven systems capturing diverse event types. Additional uses include catalogs, inventory systems, player profiles in gaming, and IoT device metadata. Azure Cosmos DB provides globally distributed document database with SQL API for document queries, automatic indexing of all fields, tunable consistency levels, and global distribution. The service scales horizontally across regions, provides single-digit millisecond latency, and supports multiple data models beyond documents including key-value, column-family, and graph.
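The flexible-catalog idea above can be sketched as a filter over heterogeneous documents; the products are invented for illustration. Azure Cosmos DB evaluates comparable predicates server-side with a SQL-like syntax, but this list comprehension shows the shape of a query against nested fields:

```python
# Documents in one collection need not share a schema.
products = [
    {"id": 1, "type": "laptop", "name": "ProBook",
     "specs": {"ram_gb": 16, "cpu": "i7"}},
    {"id": 2, "type": "shirt", "name": "Oxford",
     "sizes": ["S", "M", "L"]},              # entirely different fields
    {"id": 3, "type": "laptop", "name": "AirLite",
     "specs": {"ram_gb": 8, "cpu": "i5"}},
]

# Roughly: SELECT * FROM products p
#          WHERE p.type = 'laptop' AND p.specs.ram_gb >= 16
hits = [p for p in products
        if p["type"] == "laptop" and p["specs"]["ram_gb"] >= 16]

print([p["name"] for p in hits])  # ['ProBook']
```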
Column-Family Databases
Column-family databases organize data by columns rather than rows, storing column families together on disk. This columnar storage enables efficient queries reading specific columns across many rows without reading entire rows, dramatic compression since column values share types and patterns, and excellent performance for analytical workloads. Unlike row-oriented databases fetching complete records, column-family databases can retrieve only needed columns. The model originated in Google's Bigtable paper and found implementation in Apache Cassandra and HBase. Column-family databases excel at write-heavy workloads, massive scale, and time-series data.
Column-family database use cases include IoT sensor data ingestion where millions of devices write measurements, event logging capturing high-volume application events, time-series metrics storing performance data over time, clickstream analytics tracking user interactions, and recommendation engines processing interaction history. Additional uses include financial market data, social media activity streams, and network monitoring. Azure Cosmos DB Cassandra API provides globally distributed column-family database compatible with Apache Cassandra, enabling migrations with minimal code changes. Azure Synapse Analytics dedicated SQL pools use columnar storage for data warehousing scenarios. Column-family databases suit scenarios prioritizing write throughput, horizontal scale across many nodes, and column-specific queries over complex joins and transactions.
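The write-heavy time-series pattern can be sketched with the wide-row idea common in Cassandra-style stores: a partition key groups one device's data, and readings form a time-ordered family of columns within the partition. This is a conceptual sketch with invented device names, not a client for any real service:

```python
from collections import defaultdict

# partition key (device) -> {timestamp: value}
partitions = defaultdict(dict)

def write(device_id, timestamp, value):
    # Appending a column to a partition is cheap, which is why the
    # model suits high-volume telemetry ingestion.
    partitions[device_id][timestamp] = value

def read_range(device_id, start, end):
    # A time-range read touches one partition, not the whole dataset.
    cols = partitions[device_id]
    return {t: v for t, v in sorted(cols.items()) if start <= t <= end}

for t in range(0, 100, 10):
    write("sensor-1", t, 20 + t)
    write("sensor-2", t, 55)

print(read_range("sensor-1", 20, 40))  # {20: 40, 30: 50, 40: 60}
```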
Graph Databases
Graph databases model data as nodes (entities) and edges (relationships) with properties on both, optimized for relationship-heavy scenarios. Traditional relational databases require multiple joins to traverse relationships, with performance degrading as join depth increases. Graph databases store relationships natively, enabling efficient traversal across many relationship hops through pointers. Queries use specialized graph query languages like Gremlin or Cypher expressing traversals, pattern matching, and path finding. Graph databases excel when relationships between entities are as important as the entities themselves.
Graph database use cases include social networks modeling users, friendships, follows, and interactions; recommendation engines finding patterns in user behavior and purchase history; fraud detection identifying suspicious relationship patterns in transactions; knowledge graphs connecting concepts, entities, and relationships; organizational hierarchies representing reporting structures; and network topology mapping infrastructure connections. Additional uses include access control modeling permissions and roles, dependency analysis in software systems, and route optimization in logistics. Azure Cosmos DB Gremlin API provides globally distributed graph database supporting Apache TinkerPop Gremlin queries. Graph databases solve problems difficult in relational databases like finding shortest paths, discovering clusters, or recommending items based on friend-of-friend relationships. When applications require complex relationship queries or patterns, graph databases provide more natural and performant solutions than relational alternatives.
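A friend-of-friend query can be sketched as breadth-first traversal over an adjacency list; the network below is invented. A graph database stores these adjacencies natively, so a Gremlin traversal of the same shape avoids the repeated self-joins a relational database would need:

```python
from collections import deque

# Invented "follows" edges between users (nodes).
follows = {
    "ana": ["bob", "cara"],
    "bob": ["dan"],
    "cara": ["dan", "eve"],
    "dan": ["ana"],
    "eve": [],
}

def within_hops(graph, start, max_hops):
    # Breadth-first search: every node reachable within max_hops edges.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen

print(sorted(within_hops(follows, "ana", 2)))  # ['bob', 'cara', 'dan', 'eve']
```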
Comparing Storage Options
File Formats Comparison
File format selection depends on use case requirements. CSV provides maximum compatibility and simplicity for tabular data exchange, spreadsheet integration, and human inspection, though inefficient for large datasets and lacking data types. JSON dominates web APIs and configuration files through hierarchical structure support, human readability, and ubiquitous tooling, though verbose for large datasets. XML suits enterprise integration requiring schema validation, SOAP services, and specific industry standards, despite verbosity and complexity. Parquet excels at analytical workloads in data lakes through columnar storage, compression, and query performance, accepting binary format trade-off. Avro optimizes streaming ingestion through compact serialization and schema evolution, though requiring processing tools for human access.
Modern data architectures often use multiple formats optimally: CSV for simple exports and data transfer to spreadsheets, JSON for APIs and application configuration, Avro for streaming data ingestion with schema evolution, and Parquet for analytical storage in data lakes. Organizations might ingest data as JSON or Avro, transform during processing, and store as Parquet for analytics. This multi-format approach leverages each format's strengths while mitigating weaknesses. Azure Data Factory orchestrates conversions between formats, Synapse Analytics queries multiple formats in data lakes, and Databricks processes diverse formats through Spark. Understanding format characteristics enables appropriate selection for each workload stage rather than forcing one format across incompatible use cases.
Database Types Comparison
Database selection balances consistency, availability, partition tolerance (CAP theorem), query complexity, and scale. Relational databases prioritize consistency and complex queries, supporting ACID transactions, normalized data, and powerful SQL. They suit transactional workloads where data integrity and complex relationships matter more than ultimate scale. Key-value databases prioritize speed and scale for simple lookups, sacrificing query flexibility. Document databases balance flexibility and query power, supporting rich queries against semi-structured data without enforcing rigid schemas. Column-family databases optimize write-heavy workloads and column-specific analytical queries at massive scale. Graph databases solve relationship-heavy scenarios efficiently.
Modern polyglot persistence architectures use multiple database types appropriately: relational databases for core transactional systems, document databases for flexible domains like product catalogs, key-value stores for caching and session management, column-family databases for time-series data ingestion, and graph databases for social features or recommendations. This specialization enables optimal technology selection for each workload rather than compromising across conflicting requirements. Azure enables polyglot persistence through diverse services: SQL Database for relational, Cosmos DB for NoSQL (document, key-value, column-family, graph), Cache for Redis for in-memory caching, and Synapse Analytics for data warehousing. Understanding trade-offs between database types enables matching technologies to requirements rather than forcing inappropriate solutions.
Real-World Data Storage Scenarios
Scenario 1: Global E-Commerce Platform
Business Requirement: E-commerce company needs storage for customer accounts, orders, product catalog, recommendations, shopping carts, and analytics.
Storage Selection:
- Relational database (Azure SQL Database): Customer accounts, order transactions, and payment records requiring ACID transactions and referential integrity
- Document database (Cosmos DB SQL API): Product catalog where electronics have technical specs, clothing has sizes, and furniture has dimensions; a flexible schema accommodates this diversity
- Key-value store (Cosmos DB Table API): Shopping carts and user session data requiring fast reads/writes by user ID
- Graph database (Cosmos DB Gremlin API): Product recommendations based on purchase patterns and user similarity
- Blob Storage with Parquet: Historical order data for analytics, customer behavior analysis, and business intelligence
Outcome: Polyglot architecture uses optimal storage for each workload: consistency for transactions, flexibility for products, speed for carts, relationship-awareness for recommendations, and efficient analytics storage.
Scenario 2: IoT Telemetry Platform
Business Requirement: Manufacturing company ingests sensor data from equipment, stores device metadata, analyzes time-series metrics, and maintains operational databases.
Storage Selection:
- Event Hubs with Avro: High-volume sensor data ingestion with schema evolution as device types evolve
- Column-family database (Cosmos DB Cassandra API): Raw time-series sensor data optimized for write-heavy workload and time-based queries
- Document database (Cosmos DB): Device metadata, configuration, and status information with varying attributes across device types
- Data Lake with Parquet: Historical sensor data transformed to columnar format for analytical queries aggregating by time ranges or specific sensors
- Relational database (Azure SQL Database): Equipment inventory, maintenance schedules, and operational data requiring transactions
Outcome: Streaming ingestion handles high-volume writes, column-family storage optimizes time-series queries, document database accommodates device diversity, and data lake enables historical analytics.
Scenario 3: Social Media Platform
Business Requirement: Social platform manages user profiles, posts, relationships, newsfeed generation, media files, and activity analytics.
Storage Selection:
- Graph database (Cosmos DB Gremlin API): User relationships (followers, friends), network traversal for friend suggestions, and relationship-based newsfeed generation
- Document database (Cosmos DB): User posts with varying content types (text, images, videos, polls) and flexible structure
- Blob Storage: Photos, videos, and media files with CDN distribution for global access
- Cache (Redis): Frequently accessed user profiles, trending content, and newsfeed cache reducing database load
- Data Lake with JSON: Raw activity logs for behavioral analytics and ML model training
Outcome: Graph database efficiently handles relationship queries, document database accommodates diverse post types, blob storage scales for media, caching improves performance, and data lake enables analytics.
Exam Preparation Tips
Key Concepts to Master
- CSV: Simple text tabular format, universal compatibility, lacks data types, inefficient for large data
- JSON: Hierarchical semi-structured format, human-readable, dominant in APIs, supports nesting
- Parquet/ORC: Columnar binary formats, optimized for analytics, excellent compression, not human-readable
- Avro: Row-based binary format, schema evolution support, optimized for streaming
- Relational databases: Tables with schemas, ACID transactions, SQL queries, relationships through keys
- Key-value databases: Simple lookups by key, extreme performance, limited query capabilities
- Document databases: Flexible JSON documents, rich queries, no enforced schema, denormalized
- Column-family databases: Columnar storage, write-optimized, time-series data, scalable
- Graph databases: Nodes and edges, relationship traversal, pattern matching, social networks
Practice Questions
Sample DP-900 Exam Questions:
- Question: Which file format is optimized for analytical queries through columnar storage?
- A) CSV
- B) JSON
- C) Parquet
- D) XML
Answer: C) Parquet - Parquet uses columnar storage optimizing analytical workloads.
- Question: Which database type is optimized for storing and querying relationship data through nodes and edges?
- A) Relational database
- B) Key-value database
- C) Document database
- D) Graph database
Answer: D) Graph database - Graph databases model relationships as edges between nodes.
- Question: What is the primary advantage of CSV files?
- A) Efficient storage
- B) Universal compatibility
- C) Strong data types
- D) Hierarchical structure
Answer: B) Universal compatibility - CSV is readable by virtually all systems and applications.
- Question: Which Azure service provides globally distributed document database?
- A) Azure SQL Database
- B) Azure Table Storage
- C) Azure Cosmos DB
- D) Azure Blob Storage
Answer: C) Azure Cosmos DB - Cosmos DB provides globally distributed document, key-value, graph, and column-family capabilities.
- Question: Which database type would be most appropriate for a session management system requiring fast lookups by session ID?
- A) Relational database
- B) Graph database
- C) Document database
- D) Key-value database
Answer: D) Key-value database - Key-value stores excel at fast lookups by known keys like session IDs.
- Question: What file format supports schema evolution making it popular for streaming data?
- A) CSV
- B) Avro
- C) XML
- D) Parquet
Answer: B) Avro - Avro provides schema evolution support ideal for streaming scenarios.
- Question: Which database type is optimized for write-heavy workloads and time-series data?
- A) Relational database
- B) Key-value database
- C) Document database
- D) Column-family database
Answer: D) Column-family database - Column-family databases optimize write performance and time-series queries.
- Question: JSON files are most commonly used for which scenario?
- A) Binary data storage
- B) Web API data exchange
- C) Columnar analytics
- D) Image storage
Answer: B) Web API data exchange - JSON dominates modern web APIs for request/response payloads.
DP-900 Success Tip: Remember file format characteristics: CSV for simplicity and compatibility, JSON for APIs and hierarchy, Parquet/ORC for analytics, Avro for streaming. Know database types: relational for transactions and consistency, key-value for simple fast lookups, document for flexible semi-structured data, column-family for time-series and writes, graph for relationships. Match storage options to workload requirements rather than forcing one solution for all scenarios. Understand Azure services for each type.
Hands-On Practice Lab
Lab Objective
Explore different file formats and database types by working with examples, understanding their characteristics, and identifying appropriate use cases.
Lab Activities
Activity 1: Work with File Formats
- Create CSV file: Export sample tabular data (like customer list) to CSV, open in text editor and spreadsheet
- Create JSON document: Structure same data as JSON with nested objects for addresses or orders
- Compare formats: Note CSV simplicity vs JSON hierarchy, file sizes, readability differences
- Parse formats: If possible, write simple code or use tools to parse CSV and JSON
- Explore Parquet: Research Parquet characteristics, when it's used, and compression benefits
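The CSV-versus-JSON comparison in Activity 1 can be made concrete with a short Python sketch (file names and sample fields are illustrative, not part of any exam requirement). It writes the same customers first as flat CSV rows, then as JSON with a nested address object:

```python
import csv
import json

# Illustrative sample data; field names are arbitrary.
customers = [
    {"id": 1, "name": "Ana", "city": "Lisbon"},
    {"id": 2, "name": "Ben", "city": "Oslo"},
]

# CSV: flat rows under one header line; no nesting, no types.
with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "city"])
    writer.writeheader()
    writer.writerows(customers)

# JSON: the same people, but the format can carry hierarchy,
# such as an embedded address object per customer.
nested = [
    {"id": 1, "name": "Ana", "address": {"city": "Lisbon", "zip": "1000-001"}},
    {"id": 2, "name": "Ben", "address": {"city": "Oslo", "zip": "0150"}},
]
with open("customers.json", "w") as f:
    json.dump(nested, f, indent=2)

print(open("customers.csv").read())
print(open("customers.json").read())
```

Opening both files side by side shows the trade-off: the CSV is compact and spreadsheet-friendly but flat and untyped, while the JSON carries hierarchy at the cost of repeating key names in every record.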
Activity 2: Explore Relational Database
- Access SQL database: Use Azure SQL Database, local SQL Server, or online SQL sandbox
- Create related tables: Create Customers and Orders tables with foreign key relationship
- Insert sample data: Add sample customers and their orders
- Query relationships: Write JOIN query combining customer and order information
- Test transactions: Try inserting order with non-existent customer ID, observe referential integrity enforcement
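The steps in Activity 2 can be sketched with Python's built-in sqlite3 module standing in for a full SQL Server or Azure SQL Database instance (table and column names are illustrative). Note that SQLite only enforces foreign keys after an explicit PRAGMA:

```python
import sqlite3

# In-memory SQLite database as a stand-in for a relational server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this enabled explicitly

conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
    Amount REAL)""")

conn.execute("INSERT INTO Customers VALUES (1, 'Ana')")
conn.execute("INSERT INTO Orders VALUES (100, 1, 25.50)")

# JOIN combining customer and order information.
row = conn.execute("""SELECT c.Name, o.Amount
                      FROM Customers c JOIN Orders o
                        ON c.CustomerID = o.CustomerID""").fetchone()
print(row)  # ('Ana', 25.5)

# Referential integrity: an order for a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (101, 99, 10.0)")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

The rejected insert is the lab's referential-integrity observation in miniature: the database, not the application, guarantees that every order points at a real customer.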
Activity 3: Understand Document Database
- Create JSON documents: Create multiple product documents with varying fields (laptop with specs, shirt with sizes)
- Observe flexibility: Note how documents can have different structures unlike relational table rows
- Research Cosmos DB: Review Azure Cosmos DB documentation on document databases
- Compare to relational: Think about how same data would require multiple tables in relational model
- Consider queries: Think about how you would query by nested field vs table joins
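Document-store flexibility can be illustrated with plain Python dictionaries (the product fields are invented for this sketch): two documents in the same "collection" carry different structures, and a query filters on a nested field without any join:

```python
# Documents in one "collection" need not share a schema.
products = [
    {"id": "p1", "type": "laptop", "name": "UltraBook",
     "specs": {"cpu": "8-core", "ram_gb": 16}},
    {"id": "p2", "type": "shirt", "name": "Tee",
     "sizes": ["S", "M", "L"], "color": "navy"},
]

# Query by a nested field: no joins, just inspect each document.
# Documents without a "specs" object simply fall out of the filter.
big_ram = [p["name"] for p in products
           if p.get("specs", {}).get("ram_gb", 0) >= 16]
print(big_ram)  # ['UltraBook']
```

In a document database such as Azure Cosmos DB, an equivalent filter would be expressed against the nested path directly, whereas a normalized relational model would need separate specification tables and JOINs to answer the same question.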
Activity 4: Explore Key-Value Concepts
- Model key-value data: Design session data as key-value pairs (session-id → session-data)
- Create shopping cart: Model shopping cart as user-id → list of items
- Understand limitations: Consider how you cannot query "find all users with item X in cart"
- Identify use cases: List scenarios where simple key lookup is sufficient
- Research Azure services: Review Azure Table Storage and Cosmos DB Table API
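The key-value pattern and its main limitation from Activity 4 can be sketched in a few lines, with an in-memory dict standing in for Azure Table Storage or the Cosmos DB Table API (key names and cart contents are illustrative):

```python
# A key-value store reduced to its essence: values retrieved only by key.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store.get(key)

put("session:abc123", {"user_id": 42, "cart": ["itemX", "itemY"]})
put("session:def456", {"user_id": 7, "cart": ["itemZ"]})

# Fast lookup by a known key: the operation key-value stores optimize.
print(get("session:abc123")["cart"])

# But "find all sessions containing itemX" forces a scan of every value,
# exactly the query shape key-value stores do not support well.
matches = [k for k, v in store.items() if "itemX" in v["cart"]]
print(matches)  # ['session:abc123']
```

The contrast between the two reads is the point of the activity: lookup by key is constant-time, while any query over value contents degenerates into a full scan (or requires a different database type).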
Activity 5: Understand Graph Database Concepts
- Sketch social network: Draw nodes representing users with edges for "follows" or "friends"
- Model recommendation: Add edges for "purchased" connecting users to products
- Consider queries: Think about "friends of friends" or "users who bought X also bought Y"
- Compare to relational: Imagine JOIN queries required for same traversals in relational model
- Research Gremlin API: Review Azure Cosmos DB Gremlin API documentation
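The "friends of friends" traversal from Activity 5 can be sketched over an in-memory adjacency list (user names and edges are invented). This is roughly what a Gremlin traversal chaining two out('follows') steps expresses natively, without the recursive JOINs a relational model would need:

```python
# A toy social graph as adjacency sets: node -> users it follows.
follows = {
    "ana": {"ben", "cai"},
    "ben": {"cai", "dee"},
    "cai": {"dee"},
    "dee": set(),
}

def friends_of_friends(user):
    """Users reachable in two hops, excluding direct follows and self."""
    direct = follows.get(user, set())
    two_hop = set()
    for friend in direct:
        two_hop |= follows.get(friend, set())
    return two_hop - direct - {user}

print(friends_of_friends("ana"))  # {'dee'}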
Activity 6: Design Polyglot Solution
- Choose scenario: Select business scenario (e-commerce, social media, IoT)
- Identify data types: List all data the system manages
- Select storage: Choose appropriate file format or database type for each data type
- Justify choices: Explain why each selection suits the data characteristics
- Consider integration: Think about how different storage systems would work together
Lab Outcomes
After completing this lab, you'll understand the characteristics of common file formats (CSV, JSON, Parquet, Avro) and database types (relational, key-value, document, column-family, graph). You'll be able to identify appropriate storage options for different workload requirements and design polyglot persistence architectures that apply the best-fit storage technology to each workload. This knowledge demonstrates core data storage concepts tested on the DP-900 exam.
Frequently Asked Questions
What are common data file formats and their characteristics?
Common data file formats include CSV (Comma-Separated Values) providing simple text-based tabular data with delimiters separating fields, widely compatible but lacking data type information; JSON (JavaScript Object Notation) offering human-readable hierarchical data with nested objects and arrays, dominant in web APIs; XML (eXtensible Markup Language) providing tagged hierarchical data with robust validation capabilities; Parquet delivering columnar storage optimized for analytical queries with excellent compression; Avro providing compact row-based binary format with schema evolution support; and ORC (Optimized Row Columnar) offering efficient columnar storage for big data. CSV excels at simple data exchange and spreadsheet compatibility. JSON dominates modern APIs through simplicity and JavaScript compatibility. Parquet and ORC optimize analytical workloads through columnar storage enabling efficient compression and column-specific queries. Format selection depends on use case: CSV for simple exports, JSON for APIs, Parquet/ORC for analytics.
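The "lacking data type information" point about CSV is easy to demonstrate: parsed from CSV, every field comes back as a string, while JSON preserves numbers and booleans (the sample values below are illustrative):

```python
import csv
import io
import json

# The same record in both formats.
csv_text = "id,price,in_stock\n1,19.99,true\n"
json_text = '{"id": 1, "price": 19.99, "in_stock": true}'

# CSV carries no types: every parsed field is a string.
row = next(csv.DictReader(io.StringIO(csv_text)))
print(row)  # {'id': '1', 'price': '19.99', 'in_stock': 'true'}

# JSON distinguishes numbers, booleans, and strings natively.
doc = json.loads(json_text)
print(type(doc["id"]), type(doc["price"]), type(doc["in_stock"]))
```

This is why CSV pipelines need a separate schema or conversion step to recover types, while self-describing formats (JSON) or schema-carrying binary formats (Parquet, Avro, ORC) do not.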
What is a relational database and what are its key features?
Relational databases organize data in tables with rows and columns, using relationships between tables through primary and foreign keys to maintain data consistency and eliminate redundancy. Key features include structured schemas defining tables, columns, and data types before data insertion; ACID transactions ensuring atomicity, consistency, isolation, and durability; referential integrity maintaining relationships between tables; normalization reducing data duplication; SQL query language for data retrieval and manipulation; and indexes accelerating query performance. Relational databases excel at transactional workloads requiring consistency like financial systems, e-commerce order processing, and enterprise resource planning. Azure provides multiple relational database services including Azure SQL Database for SQL Server compatibility, Azure Database for PostgreSQL for open-source PostgreSQL, and Azure Database for MySQL for MySQL workloads. Relational databases suit scenarios requiring complex joins across multiple tables, strong consistency guarantees, and structured data with predictable schemas.
What are NoSQL databases and when should they be used?
NoSQL databases provide alternatives to relational databases, optimized for specific data models and access patterns rather than general-purpose SQL capabilities. NoSQL encompasses multiple database types including key-value stores providing simple lookups by unique keys, document databases storing semi-structured documents like JSON, column-family databases organizing data by columns for analytics, and graph databases modeling relationships through nodes and edges. NoSQL databases typically prioritize scalability, availability, and partition tolerance over strong consistency, following eventual consistency models. Use NoSQL for flexible schemas accommodating evolving data models, massive scale requiring horizontal partitioning across many servers, high-volume read/write operations, hierarchical or nested data unsuited to tables, or scenarios where eventual consistency is acceptable. Azure Cosmos DB provides globally distributed NoSQL database supporting multiple APIs including document, key-value, column-family, and graph models. NoSQL suits scenarios like product catalogs with varying attributes, IoT sensor data at massive scale, social networks with complex relationships, and real-time analytics requiring fast writes.
What are key-value databases and their use cases?
Key-value databases store data as simple key-value pairs where each unique key maps to a value containing arbitrary data like strings, numbers, JSON documents, or binary objects. This simplest NoSQL model provides extremely fast lookups by key, horizontal scalability through key-based partitioning, and flexible value formats without enforced schemas. Key-value stores excel at scenarios requiring fast reads and writes by known keys like session management storing user session data, caching frequently accessed data to reduce database load, shopping carts tracking temporary user selections, user preferences storing configuration settings, and real-time analytics requiring fast updates. Limitations include inability to query by value contents, no relationships between items, and no complex query support beyond key lookups. Azure Table Storage provides simple key-value storage, while Azure Cosmos DB Table API offers globally distributed key-value database with enhanced capabilities. Key-value databases suit applications prioritizing speed and scale for simple data retrieval patterns rather than complex queries.
What are document databases and how do they differ from relational databases?
Document databases store semi-structured documents typically in JSON, BSON, or XML formats, with each document containing self-describing data including nested objects and arrays. Unlike relational databases requiring predefined schemas and normalized tables, document databases allow flexible schemas where documents in the same collection can have different structures, enabling agile development and schema evolution. Documents embed related data within single records rather than splitting across multiple normalized tables, reducing joins and improving read performance. Document databases support rich queries against document contents including filtering by nested fields, though complex multi-document transactions may be limited. Use cases include content management systems storing articles with varying metadata, product catalogs where different products have different specifications, user profiles with varying attributes, and mobile applications requiring offline-first capabilities with document synchronization. Azure Cosmos DB provides globally distributed document database with automatic indexing, multiple consistency levels, and turnkey global replication. Document databases excel when data naturally exists as self-contained documents, schemas evolve frequently, or read performance outweighs transaction complexity.
What are column-family databases and when are they optimal?
Column-family databases organize data by columns rather than rows, storing each column family's data together on disk enabling efficient retrieval of specific columns without reading entire rows. This columnar storage provides excellent compression since column values share types and patterns, and enables efficient analytical queries reading few columns across many rows. Column-family databases like Apache Cassandra and Azure Cosmos DB Cassandra API excel at write-heavy workloads, massive scale requiring distribution across many nodes, time-series data where queries typically filter by time ranges, and analytical workloads aggregating specific columns across large datasets. The architecture sacrifices complex joins and transactions for horizontal scalability and write throughput. Use cases include IoT sensor data ingestion where millions of devices continuously write measurements, event logging capturing high-volume application events, time-series metrics storing performance data, and clickstream analytics tracking user behavior. Azure Synapse Analytics uses columnar storage in dedicated SQL pools for data warehousing. Column-family databases suit scenarios prioritizing write performance, scale, and column-specific analytical queries over transactional consistency and complex joins.
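As a loose analogy only (not a real Cassandra or Cosmos DB API), a wide row can be pictured as a row key mapping to sorted (timestamp, value) cells, so a time-range read touches just the relevant slice rather than scanning whole rows:

```python
from bisect import bisect_left, bisect_right

# Toy column-family layout: each row key (a sensor) holds a wide row of
# (timestamp -> value) cells, kept sorted so time-range reads are cheap.
table = {
    "sensor-1": [(1000, 21.5), (1060, 21.7), (1120, 22.0), (1180, 22.4)],
}

def range_query(row_key, t_start, t_end):
    """Read only the cells with timestamps in [t_start, t_end)."""
    cells = table.get(row_key, [])
    keys = [t for t, _ in cells]
    lo = bisect_left(keys, t_start)
    hi = bisect_right(keys, t_end - 1)
    return cells[lo:hi]

print(range_query("sensor-1", 1060, 1180))  # [(1060, 21.7), (1120, 22.0)]
```

Appending a new reading is a cheap sorted insert at the end of one row, which is why this layout favors the write-heavy, time-filtered workloads described above.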
What are graph databases and what problems do they solve?
Graph databases model data as nodes (entities) and edges (relationships) with properties, optimized for scenarios where relationships between entities are as important as the entities themselves. Unlike relational databases requiring multiple joins to traverse relationships, graph databases store relationship information natively enabling efficient traversal across many relationship hops. Graph queries use specialized languages like Gremlin or Cypher expressing traversals like "find friends of friends" or "recommend products purchased by similar users." Use cases include social networks modeling users and connections, recommendation engines suggesting items based on user behavior patterns, fraud detection identifying suspicious relationship patterns, knowledge graphs connecting related concepts, organizational hierarchies, and network topology management. Azure Cosmos DB Gremlin API provides globally distributed graph database supporting complex traversals. Graph databases excel when applications require complex relationship queries, path finding between entities, pattern matching across relationships, or scenarios where relationship properties are as important as node properties. Traditional relational databases struggle with deep joins and recursive queries that graph databases handle efficiently.
How do you choose between relational and NoSQL databases?
Choosing between relational and NoSQL databases depends on data characteristics, consistency requirements, scale needs, and query patterns. Choose relational databases for structured data with predefined schemas, complex transactions requiring ACID guarantees, scenarios needing complex joins across multiple tables, applications requiring strong consistency, and when SQL expertise exists. Relational databases suit financial systems, ERP, CRM, and traditional transactional workloads. Choose NoSQL databases for flexible schemas evolving with applications, massive scale requiring horizontal partitioning, scenarios where eventual consistency is acceptable, hierarchical or nested data awkward in tables, and high-volume read/write operations. NoSQL suits product catalogs, IoT telemetry, social networks, and real-time analytics. Modern polyglot persistence architectures often combine both: relational databases for core transactional data requiring consistency, document databases for flexible product attributes, key-value stores for caching, and graph databases for recommendation engines. Azure provides both relational services (SQL Database, PostgreSQL, MySQL) and NoSQL services (Cosmos DB) enabling appropriate technology selection for each workload rather than forcing everything into one database type.
Written by Joe De Coppi - Last Updated November 14, 2025