Topic Synopsis
Big data processing relies on distributed frameworks such as Hadoop and Spark, and on programming models such as MapReduce. These technologies split large datasets across a cluster of machines and process the pieces in parallel.
Key Concepts & Core Principles
- The three Vs of Big Data: Volume (scale of data), Velocity (speed of generation/processing), and Variety (different data types).
- Distributed file systems (e.g., HDFS) that store data across multiple machines for fault tolerance and scalability.
- MapReduce: a programming model for processing large datasets in parallel across a cluster.
- NoSQL databases: non-relational databases (e.g., document, key-value, column-family) designed for horizontal scaling and handling unstructured data.
- The CAP theorem: a distributed system cannot guarantee Consistency, Availability, and Partition tolerance all at once. When a network partition occurs, the system must give up either consistency or availability.
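
The MapReduce model listed above can be illustrated with a word count. This is a minimal single-machine sketch, not how Hadoop itself is implemented: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are illustrative, and a real framework would run the map and reduce calls on different machines and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for one input split handled by one mapper.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["big data big clusters", "data across clusters"]
counts = word_count(splits)
print(counts)
```

In an exam diagram, the same three stages appear as boxes: input splits feed mappers, the shuffle groups pairs by key, and reducers produce the final totals.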
Exam Tips & Revision Strategies
- Use diagrams to explain MapReduce flow.
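- Highlight Spark's in-memory processing advantage: it can keep intermediate results in memory, whereas Hadoop MapReduce writes them to disk between stages.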
- Give real-world examples like log analysis.
- When describing database types, always support your answer with a practical example of when each would be used.
- Use clear, labelled diagrams to illustrate the data models if the question or mark scheme allows it.
- In comparative questions, explicitly state both advantages and limitations with reference to big data characteristics (volume, velocity, variety).
- Familiarise yourself with the CAP theorem outcomes: which NoSQL types sacrifice consistency for availability and partition tolerance, and why.
- Use a mnemonic for the Vs (the three listed above, or the extended 'five Vs', which add Veracity and Value), or create a mind map linking each V to a real-world scenario before writing long answers.
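
The log analysis example suggested above could be sketched as follows. This is an illustrative single-process version: the log format (status code as the last field) and the names `count_statuses` and `merge_counts` are assumptions made for the example. Each inner list stands in for a partition that a separate worker would process.

```python
from collections import Counter
from functools import reduce


def count_statuses(partition):
    """Map plus local combine: count status codes within one partition."""
    # Assumption: the status code is the last whitespace-separated field.
    return Counter(line.split()[-1] for line in partition if line.strip())


def merge_counts(left, right):
    """Reduce: merge two partial counts into one."""
    return left + right


# Hypothetical log lines, split into two partitions.
partitions = [
    ["GET /index.html 200", "GET /missing 404"],
    ["POST /login 200", "GET /index.html 200", "GET /admin 500"],
]

# Each partition is counted independently, so this step could run in parallel.
partial_counts = [count_statuses(partition) for partition in partitions]
status_totals = reduce(merge_counts, partial_counts, Counter())
print(dict(status_totals))
```

Counting within each partition before merging mirrors the combiner idea: less data has to move between machines than if every single `(status, 1)` pair were shuffled.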
Common Misconceptions & Mistakes to Avoid
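- Confusing Hadoop with Spark: Hadoop is an ecosystem that includes storage (HDFS) and disk-based MapReduce, whereas Spark is a processing engine that can run on top of Hadoop storage.
- Not understanding the map and reduce phases: map transforms each input record into key-value pairs, and reduce aggregates all values that share a key.
- Ignoring fault tolerance mechanisms such as data replication and the re-execution of failed tasks.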
- Misunderstanding NoSQL as meaning 'no SQL' rather than 'not only SQL'.
- Assuming all NoSQL databases are entirely schema-less, ignoring the column-family structure or graph schema constraints.
- Believing NoSQL databases are universally faster than relational databases without considering query complexity and indexing.
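
To make the fault tolerance point concrete, the sketch below shows how replicating blocks across nodes keeps data readable when machines fail. It is a simplified illustration under stated assumptions: the round-robin placement and the names `place_replicas` and `readable_blocks` are hypothetical, and real HDFS placement is rack-aware rather than round-robin. The replication factor of 3 mirrors the usual HDFS default.

```python
def place_replicas(block_ids, nodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for index, block in enumerate(block_ids):
        placement[block] = [
            nodes[(index + offset) % len(nodes)]
            for offset in range(replication_factor)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }


# Hypothetical four-node cluster storing three blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(["blk-1", "blk-2", "blk-3"], nodes)

# Two nodes fail, yet every block still has a surviving replica.
survivors = readable_blocks(placement, failed_nodes={"node-a", "node-b"})
print(sorted(survivors))
```

With three replicas, a block becomes unreadable only if all three nodes holding it fail at once, which is why answers on distributed storage should mention replication explicitly.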
Examiner Marking Points
- Describe the MapReduce programming model.
- Explain the role of Hadoop in big data processing.
- Compare Hadoop with Spark for performance.
- Identify use cases for distributed processing.
- Credit for stating that a key-value store operates like a distributed hash map, with unique keys mapping to opaque values.
- Credit for explaining that document databases store semi-structured data (e.g., JSON, BSON) and support nested attributes and arrays.
- Credit for describing column-family databases as storing data in columns grouped into families, using sparse storage (rows need not populate every column) and being optimised for read/write performance.
- Credit for outlining graph databases using nodes (entities), edges (relationships), and properties to represent highly interconnected data.
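
The four NoSQL data models described above can be compared side by side using plain Python structures. This is only an in-memory analogy with made-up sample data; real systems add distribution, persistence, and query languages, and the helper `neighbours` is hypothetical.

```python
# Key-value: unique keys map to opaque values the store does not interpret.
key_value_store = {"session:42": b'{"user": "ana", "cart": [101, 205]}'}

# Document: a semi-structured record with nested attributes and arrays.
document = {
    "_id": "user-1",
    "name": "Ana",
    "address": {"city": "Leeds"},
    "orders": [{"sku": 101, "qty": 2}, {"sku": 205, "qty": 1}],
}

# Column-family: row key -> column family -> columns.
# Storage is sparse: "user-2" has no "activity" family and no "city" column.
column_family = {
    "user-1": {
        "profile": {"name": "Ana", "city": "Leeds"},
        "activity": {"last_login": "2024-01-05"},
    },
    "user-2": {"profile": {"name": "Ben"}},
}

# Graph: nodes (entities) with properties, and edges (relationships).
nodes = {
    "ana": {"label": "Person"},
    "ben": {"label": "Person"},
    "spark": {"label": "Topic"},
}
edges = [("ana", "FOLLOWS", "ben"), ("ben", "LIKES", "spark")]


def neighbours(node_id, relationship):
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst in edges if src == node_id and rel == relationship]


city = document["address"]["city"]  # nested attribute lookup
missing_city = column_family["user-2"]["profile"].get("city")  # absent column
followed = neighbours("ana", "FOLLOWS")  # relationship traversal
print(city, missing_city, followed)
```

Each lookup at the end shows the access pattern the model is built for: a key fetch, a nested field read, a sparse column read, and an edge traversal.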