What is the difference between a data lake and a data warehouse?

A data warehouse stores structured, processed data optimised for querying and reporting, often using a schema-on-write approach. In contrast, a data lake stores raw data in its native format (structured, semi-structured, or unstructured) and uses schema-on-read, making it more flexible for big data analytics but requiring more management to avoid becoming a 'data swamp'.
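The schema-on-write versus schema-on-read distinction can be illustrated with a minimal Python sketch. The names `WAREHOUSE_SCHEMA`, `warehouse_insert`, `lake_store` and `lake_read_amounts` are hypothetical and only stand in for what a real warehouse or lake engine would do.

```python
import json

# Hypothetical schema for this illustration: column name -> required type.
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}

def warehouse_insert(table, record):
    """Schema-on-write: validate BEFORE storing and reject anything that does not fit."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record does not match the schema")
    for column, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    table.append(record)

def lake_store(lake, raw_text):
    """Schema-on-read: store the raw data untouched, whatever its shape."""
    lake.append(raw_text)

def lake_read_amounts(lake):
    """Apply a schema only at query time, skipping items that do not fit."""
    amounts = []
    for raw_text in lake:
        try:
            item = json.loads(raw_text)
            amounts.append(float(item["amount"]))
        except (ValueError, KeyError, TypeError):
            continue  # unparseable or differently shaped item: ignored by this query
    return amounts
```

The warehouse pays the validation cost once, at load time, so every later query can trust the structure. The lake accepts everything, so each query has to cope with data that does not fit, which is how a poorly managed lake turns into a 'data swamp'.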

How does MapReduce work in simple terms?

MapReduce splits a large dataset into smaller chunks, which are processed in parallel by 'map' tasks that transform each chunk into key-value pairs. These pairs are then shuffled and sorted by key, and 'reduce' tasks aggregate the values for each key to produce the final output. For example, in word count, the map phase outputs (word, 1) pairs, and the reduce phase sums the counts for each word.
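The word count example can be sketched in plain Python. This is a single-machine illustration of the three phases, not the Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are chosen for this example only.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key."""
    return key, sum(values)

def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different node in parallel;
    # here the chunks are simply processed one after another.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))

print(word_count(["the cat sat", "the dog sat", "the end"]))
# {'cat': 1, 'dog': 1, 'end': 1, 'sat': 2, 'the': 3}
```

Each call to `map_phase` depends only on its own chunk, and each call to `reduce_phase` depends only on one key, which is what allows both phases to run in parallel.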

What is the CAP theorem and why is it important?

The CAP theorem states that a distributed data store cannot guarantee all three of the following at the same time: Consistency (every read sees the most recent write, so all nodes agree on the data), Availability (every request receives a non-error response), and Partition tolerance (the system continues to operate when network failures stop nodes communicating). Because partitions cannot be ruled out in a real network, the practical choice is between consistency and availability while a partition lasts. It's important because it forces designers to make trade-offs based on their application's needs, e.g., banking systems prioritise consistency, while social media may prioritise availability.
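The trade-off can be made concrete with a toy model, sketched here under simplifying assumptions: two replicas, writes arriving at replica A, and a `partitioned` flag that simulates a network failure. The class names and the 'CP'/'AP' mode labels are hypothetical and do not correspond to any real database API.

```python
class Replica:
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Toy two-replica store; `mode` is 'CP' or 'AP' (labels for this sketch only)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when the replicas cannot reach each other

    def write(self, key, value):
        """A client writes through replica A."""
        if not self.partitioned:
            self.a.data[key] = value
            self.b.data[key] = value      # replicate straight away
            return "ok"
        if self.mode == "CP":
            return "error: unavailable"   # refuse rather than let replicas disagree
        self.a.data[key] = value          # AP: accept the write, B is now stale
        return "ok"

    def read_from_b(self, key):
        return self.b.data.get(key)

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write("balance", 100)
    store.partitioned = True

print(cp.write("balance", 50), cp.read_from_b("balance"))  # error: unavailable 100
print(ap.write("balance", 50), ap.read_from_b("balance"))  # ok 100
```

In CP mode the write is rejected, so both replicas still agree on 100. In AP mode the write succeeds, but a client reading from replica B still sees the old value until the partition heals.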

What are the main types of NoSQL databases?

The four main types are: document stores (e.g., MongoDB) which store data as JSON-like documents; key-value stores (e.g., Redis) which store simple key-value pairs; column-family stores (e.g., Cassandra) which group related columns into families under a row key, so rows can be sparse and need not share the same columns; and graph databases (e.g., Neo4j) which store nodes and the relationships between them. Each type is optimised for different use cases, such as high-speed lookups or complex relationship queries.
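The four data models can be compared side by side using ordinary Python structures. This is only a sketch of the shape of the data in each model; the sample records and the `followed_by` helper are invented for illustration and none of this is the query API of MongoDB, Redis, Cassandra or Neo4j.

```python
# Key-value store: an opaque value looked up by a unique key.
key_value = {"session:42": "user=alice;expires=1700000000"}

# Document store: each document is a nested, JSON-like structure.
documents = [
    {"_id": 1, "name": "Alice", "orders": [{"item": "book", "qty": 2}]},
    {"_id": 2, "name": "Bob", "email": "bob@example.com"},  # different fields are allowed
]

# Column-family store: row key -> column family -> sparse columns.
column_family = {
    "alice": {"profile": {"city": "Leeds"}, "activity": {"last_login": "2024-01-05"}},
    "bob": {"profile": {"city": "York", "age": 31}},  # no 'activity' family stored
}

# Graph database: nodes plus labelled edges (relationships).
nodes = {"alice", "bob", "carol"}
edges = [("alice", "FOLLOWS", "bob"), ("bob", "FOLLOWS", "carol")]

def followed_by(person):
    """Traverse the FOLLOWS edges that leave `person`."""
    return [dst for src, rel, dst in edges if src == person and rel == "FOLLOWS"]

print(key_value["session:42"])             # fast lookup by key
print(documents[0]["orders"][0]["item"])   # nested attribute: book
print(followed_by("alice"))                # relationship query: ['bob']
```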

How does Big Data relate to privacy and ethics?

Big Data raises significant privacy concerns because large datasets can be used to infer personal information, even if anonymised. Ethical issues include consent (users may not know how their data is used), bias in algorithms (e.g., discriminatory outcomes), and the digital divide (unequal access to data benefits). Regulations like GDPR aim to protect individuals by requiring transparency, data minimisation, and the right to be forgotten.

What is the difference between batch processing and stream processing in Big Data?

Batch processing handles large volumes of data at once, with high latency, and is suitable for tasks like generating monthly reports. Stream processing processes data in real-time as it arrives, with low latency, and is used for applications like fraud detection or live monitoring. Tools like Hadoop MapReduce are batch-oriented, while Apache Kafka and Spark Streaming support stream processing.
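The contrast can be sketched in a few lines of Python. The functions `batch_total` and `stream_alerts` and the sample transaction amounts are hypothetical; a real system would use a framework such as those named above rather than plain functions.

```python
def batch_total(transactions):
    """Batch: the whole dataset is available before processing starts."""
    return sum(transactions)

def stream_alerts(transactions, threshold):
    """Stream: handle each item as it arrives and emit results immediately."""
    running_total = 0
    for amount in transactions:          # the source could be unbounded
        running_total += amount
        if amount > threshold:
            yield ("ALERT", amount, running_total)

day_of_payments = [10, 500, 20]
print(batch_total(day_of_payments))                      # 530
print(list(stream_alerts(iter(day_of_payments), 100)))   # [('ALERT', 500, 510)]
```

The batch function cannot answer until it has seen every item, whereas the generator raises its alert as soon as the suspicious payment arrives, which is why fraud detection is usually treated as a streaming problem.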

Big Data

AQA

A-Level

Processing big data relies on distributed processing: the MapReduce programming model and frameworks such as Hadoop and Spark split work across a cluster of machines so that large datasets can be processed in parallel.

Subtopics in this area

Processing Big Data

NoSQL databases

Characteristics of Big Data

Data analytics

Topic Overview

Big Data refers to extremely large and complex datasets that cannot be easily managed, processed, or analysed using traditional data processing tools. In the context of AQA A-Level Computer Science, Big Data is characterised by the 'three Vs': Volume (vast amounts of data), Velocity (high speed of data generation and processing), and Variety (different forms of data, such as structured, semi-structured, and unstructured). Many sources extend this list to five Vs by adding Veracity (how trustworthy the data is) and Value (how useful the insights drawn from it are). Understanding Big Data is crucial because it underpins modern technologies like artificial intelligence, real-time analytics, and cloud computing, and it raises important considerations around storage, processing, and ethical implications.

The study of Big Data in this specification focuses on how data is captured, stored, and analysed using distributed systems and parallel processing. Key technologies include distributed file systems (e.g., Hadoop HDFS), NoSQL databases (e.g., MongoDB), and MapReduce for processing. Students will explore the trade-offs between consistency, availability, and partition tolerance (the CAP theorem) and learn about data lakes versus data warehouses. This topic also covers the social and ethical issues, such as privacy, security, and the digital divide, making it relevant to both technical and societal aspects of computing.

Big Data is not just a standalone topic; it connects to other areas like databases, networks, and algorithms. For example, understanding SQL and relational databases provides a foundation for contrasting with NoSQL systems. Similarly, knowledge of network topologies and data transmission helps explain how data is moved in distributed systems. By mastering Big Data, students gain insight into how modern tech companies handle petabytes of data and the challenges of scalability, which is essential for careers in data science, software engineering, and IT infrastructure.

Key Concepts

Core ideas you must understand for this topic

→The three Vs of Big Data: Volume (scale of data), Velocity (speed of generation/processing), and Variety (different data types).
→Distributed file systems (e.g., HDFS) that store data across multiple machines for fault tolerance and scalability.
→MapReduce: a programming model for processing large datasets in parallel across a cluster.
→NoSQL databases: non-relational databases (e.g., document, key-value, column-family) designed for horizontal scaling and handling unstructured data.
→The CAP theorem: in a distributed system, you can only guarantee two of Consistency, Availability, and Partition tolerance simultaneously.
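The idea behind a distributed file system, listed above, can be sketched as follows: a file is cut into blocks and each block is copied to several machines. The function names are hypothetical, and the round-robin placement is an assumption made to keep the example short; it is not the placement policy HDFS actually uses.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

def readable_after_failure(placement, failed_node):
    """The file stays readable if every block still has a surviving replica."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )

blocks = split_into_blocks("abcdefghij", 4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 2)
print(blocks)                                   # ['abcd', 'efgh', 'ij']
print(placement)                                # {0: ['n1', 'n2'], 1: ['n2', 'n3'], 2: ['n3', 'n4']}
print(readable_after_failure(placement, "n2"))  # True
```

Storing blocks on different machines gives scalability, because capacity grows by adding nodes, and keeping more than one copy of each block gives fault tolerance, because a single node failure loses no data.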

Learning Objectives

What you need to know and understand

Describe distributed processing (MapReduce, Hadoop, Spark).
Describe the structural characteristics of key-value, document, column-family, and graph NoSQL databases.
Compare the use cases for each NoSQL database type in the context of big data applications.
Evaluate the trade-offs between consistency, availability, and partition tolerance in distributed NoSQL systems.
Explain how NoSQL databases achieve horizontal scalability through techniques such as sharding and replication.
Distinguish between ACID and BASE transaction models and their relevance to NoSQL databases.
Define volume, velocity, variety, veracity, and value as the 5 Vs of big data.
Explain how each V imposes distinct technical requirements on data storage and processing.
Distinguish between structured, unstructured, and semi-structured data in the context of variety.
Analyse the relationship between veracity and value when drawing business conclusions.
Evaluate the trade-offs between volume and velocity in designing a big data solution.
Apply the 5 Vs framework to categorise challenges in a given case study.
Describe the characteristics and goals of descriptive, diagnostic, predictive, and prescriptive analytics.
Differentiate between the four analytics types by comparing their inputs, processes, and outputs.
Identify appropriate analytics approaches for given big data problems in various industries.
Explain the role of each analytics type within the overall data analytics lifecycle.
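For the objective on horizontal scalability, hash-based sharding can be sketched in a few lines. This is a minimal illustration under stated assumptions: `shard_for` and `ShardedStore` are hypothetical names, each shard is an in-memory dictionary standing in for a separate server, and replication is left out for brevity.

```python
import hashlib

def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

class ShardedStore:
    """Each dictionary stands in for one server in the cluster."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

store = ShardedStore(4)
for i in range(1000):
    store.put(f"user:{i}", i)

print(store.get("user:123"))                  # 123
print([len(shard) for shard in store.shards]) # sizes of the four shards
```

Every key maps to exactly one shard, so a read or write touches one server only, and the data is spread across all of them. One limitation of this simple modulo scheme is that changing the number of shards moves most keys, which is why production systems tend to use more elaborate schemes.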

Marking Points

Key points examiners look for in your answers

Describe the MapReduce programming model.
Explain the role of Hadoop in big data processing.
Compare Hadoop with Spark for performance.
Identify use cases for distributed processing.
Award credit for accurately stating that a key-value store operates like a distributed hash map with unique keys mapping to opaque values.
Credit for explaining that document databases store semi-structured data (e.g., JSON, BSON) and support nested attributes and arrays.
Credit for describing column-family databases as storing data by columns grouped into families, with sparse storage and optimised for read/write performance.
Credit for outlining graph databases using nodes (entities), edges (relationships), and properties to represent highly interconnected data.
Credit for linking each database type to a suitable big data scenario, such as using key-value for session stores or graph for social networks.
Award credit for correctly defining each of the 5 Vs with clear, distinct descriptions.
Credit for providing relevant examples that illustrate each characteristic (e.g., social media streams for velocity, sensor data for volume).
Recognise the ability to link veracity to data cleaning and quality assurance processes.
Expect candidates to explain how value is derived from the other four Vs, not just defined in isolation.
Marks awarded for evaluating the interdependencies between the Vs in a holistic analysis.
Credit for applying the characteristics to justify the need for specific tools like Hadoop or Spark.
Award credit for accurately defining each analytics type with clear distinctions and real-world examples.
Expect students to match each analytics type to the specific question it answers (e.g., 'what happened?' for descriptive).
Look for the ability to explain how the four types build upon each other, such as predictive analytics relying on descriptive summaries.
Credit demonstration of understanding of the techniques associated with each type (e.g., data mining for descriptive, machine learning for predictive).
Mark for correct use of technical terminology and the ability to discuss the output of each analytics type (reports, dashboards, forecasts, recommendations).

Examiner Tips

Expert advice for maximising your marks

💡Use diagrams to explain MapReduce flow.
💡Highlight Spark's in-memory processing advantage.
💡Give real-world examples like log analysis.
💡When describing database types, always support your answer with a practical example of when each would be used.
💡Use clear, labelled diagrams to illustrate the data models if the question or mark scheme allows it.
💡In comparative questions, explicitly state both advantages and limitations with reference to big data characteristics (volume, velocity, variety).
💡Familiarise yourself with the CAP theorem outcomes: which NoSQL types sacrifice consistency for availability and partition tolerance, and why.
💡Use a mnemonic like 'V5' or create a mind map linking each V to a real-world scenario before writing long answers.
💡When asked to define the 5 Vs, always accompany each definition with a brief, precise example (e.g., 'Velocity: a large social network receives hundreds of thousands of new posts every minute').
💡For higher-mark questions, explicitly discuss the interplay between Vs – for instance, how low veracity can undermine the value of high-volume data.
💡In essay-style responses, structure your answer around each V sequentially, then synthesise in a final paragraph.
💡Label each V clearly when writing, and avoid generic statements; be data-driven where possible.
💡Structure responses by clearly addressing each analytics type with a definition, key question, techniques, and a concrete example.
💡Use a mnemonic or visual aid, such as the analytics continuum (descriptive → diagnostic → predictive → prescriptive), to organise your answer logically.
💡Always refer to the given scenario or case study in exam questions to show applied understanding, not just theory.
💡Be precise with language: avoid vague terms; instead of saying 'prescriptive tells you what to do,' specify that it recommends optimal decisions based on constraints and objectives.
💡When discussing the three Vs, always provide concrete examples (e.g., social media data for volume, stock market data for velocity, a mix of text, images, and sensor readings for variety). This shows deeper understanding.
💡For MapReduce, be able to explain the map and reduce phases with a simple example, such as word count. Examiners look for clarity in how data is split, processed, and combined.
💡In exam questions about ethical issues, link your points to specific data protection laws (e.g., GDPR) and discuss both benefits (e.g., personalised medicine) and drawbacks (e.g., loss of privacy).

Common Mistakes

Pitfalls to avoid in your exam answers

Confusing Hadoop with Spark.
Not understanding the map and reduce phases.
Ignoring fault tolerance mechanisms.
Misunderstanding NoSQL as meaning 'no SQL' rather than 'not only SQL'.
Assuming all NoSQL databases are entirely schema-less, ignoring the column-family structure or graph schema constraints.
Believing NoSQL databases are universally faster than relational databases without considering query complexity and indexing.
Misapplying ACID properties to all NoSQL types, when many favour BASE (Basically Available, Soft state, Eventual consistency).
Confusing velocity with variety, treating velocity as the range of data types rather than speed.
Failing to distinguish veracity from simple accuracy, missing the aspects of bias, noise, and inconsistency.
Overlooking value as a primary characteristic, treating big data projects as technical exercises without business rationale.
Providing definitions without concrete examples, leading to vague or inaccurate applications.
Mixing volume with storage size alone, ignoring the processing and streaming demands of large datasets.
Confusing predictive and prescriptive analytics: predictive forecasts future events, while prescriptive recommends specific actions to influence those events.
Believing that diagnostic analytics alone establishes causation; students often fail to mention the need for controlled experiments or statistical tests.
Treating the analytics types as isolated rather than recognising their progressive and interdependent nature.
Providing generic definitions without linking to specific big data tools or real-world scenarios, such as using retail sales data or healthcare records.
Misconception: Big Data always means using Hadoop or MapReduce. Correction: While Hadoop is a common framework, Big Data can be processed using other tools like Apache Spark, and the choice depends on the specific use case (e.g., real-time vs batch processing).
Misconception: NoSQL databases are always faster than SQL databases. Correction: NoSQL databases excel at handling large volumes of unstructured data and horizontal scaling, but SQL databases can be faster for complex queries on structured data with ACID compliance.
Misconception: The CAP theorem lets designers freely pick any two of the three properties. Correction: Network partitions cannot be avoided in a real distributed system, so partition tolerance is effectively required; the genuine trade-off is between consistency and availability while a partition lasts, and many systems settle on a middle ground such as eventual consistency.

Frequently Asked Questions

Common questions students ask about this topic

Before You Start

Prior knowledge that will help with this topic

•Basic understanding of databases, including relational databases and SQL.
•Knowledge of data structures (e.g., arrays, lists) and algorithms (e.g., sorting, searching).
•Familiarity with computer networks and client-server architecture.

Key Terminology

Essential terms to know

Parallel processing
Fault tolerance
Schema flexibility and denormalisation
Horizontal scaling and sharding
CAP theorem trade-offs
Use-case driven database selection
Querying via APIs and map-reduce
Aggregation and semi-structured data models
Volume: Scale and storage demands
Velocity: Real-time data streams
Variety: Structured to unstructured data
Veracity: Data integrity and reliability
Value: Extracting actionable insights
Interplay of the 5 Vs
Descriptive Analytics and Summarisation
Diagnostic Analytics and Root Cause Analysis
Predictive Modelling and Forecasting
Prescriptive Analytics and Optimisation

Likely Command Words

How questions on this topic are typically asked

Describe

Explain

Compare

Outline

Identify

Related Topics in AQA A-Level Computer Science

Theory of computation

Fundamentals of computer organisation and architecture

Systematic approach to problem solving
