Topic Synopsis
Big Data refers to datasets that are too large or complex to be processed by traditional relational database systems. It is characterized by volume, velocity, and variety, requiring distributed processing and functional programming techniques to extract meaningful patterns.
Key Concepts & Core Principles
- The three Vs of Big Data: Volume (scale of data), Velocity (speed of generation/processing), and Variety (different data types). Students should be able to give examples of each, e.g., sensor data (high velocity), social media posts (high variety), and transaction logs (high volume).
- Distributed storage and processing: Frameworks such as Hadoop store data across multiple servers, and the MapReduce programming model processes it in parallel. Understand the concept of 'data locality': moving computation to where the data resides to reduce network traffic.
- Structured vs. unstructured data: Structured data fits neatly into tables (e.g., SQL databases), while unstructured data (e.g., text, images, video) requires different approaches like NoSQL databases or data lakes. Semi-structured data (e.g., JSON, XML) has some organisational properties but not a rigid schema.
- Data mining and machine learning: Big Data often involves finding patterns or making predictions using algorithms. Students should know that correlation does not imply causation, and that bias in data can lead to biased outcomes.
- Privacy and ethics: Key issues include anonymisation (anonymised data can sometimes be re-identified by combining it with other datasets), informed consent, and the 'right to be forgotten'. The General Data Protection Regulation (GDPR) is a relevant legal framework.
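The distributed processing idea above can be illustrated with a small word count written in the map/reduce style. This is a minimal sketch in Python, not how Hadoop itself is programmed: the `partitions` list is a hypothetical stand-in for data held on separate servers, and the function names `map_partition` and `merge_counts` are chosen for illustration only.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each inner list stands in for the data held on one server.
partitions = [
    ["big data is big", "data is varied"],
    ["data arrives fast"],
]

def map_partition(lines):
    """Map step: count words locally, where the data resides (data locality)."""
    return Counter(word for line in lines for word in line.split())

def merge_counts(left, right):
    """Reduce step: combine two partial results into a new Counter."""
    return left + right

# Each server produces a small summary; only the summaries cross the network.
partial_counts = list(map(map_partition, partitions))
total = reduce(merge_counts, partial_counts, Counter())

print(total["data"])  # number of times "data" appears across all partitions
```

Only the small per-partition counts are combined centrally, which is the point of moving the computation to the data rather than the data to the computation.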
Exam Tips & Revision Strategies
- Ensure you can clearly define and distinguish between volume, velocity, and variety.
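- Be prepared to explain why functional programming is particularly suited to distributed processing tasks: immutable data and functions without side effects can be run in parallel on different servers without interfering with one another.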
- Focus on the challenges posed by unstructured data rather than just the size of the data.
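To make the functional programming point concrete, the sketch below shows why pure functions and immutable data are convenient for parallel work. It is an illustration under stated assumptions: threads stand in for separate servers, and the names `sum_of_squares`, `add`, and `chunks` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def sum_of_squares(chunk):
    """Pure function: the result depends only on the input and nothing is mutated."""
    return sum(x * x for x in chunk)

def add(a, b):
    """Associative and commutative combiner, so partial results can merge in any order."""
    return a + b

data = tuple(range(1, 11))  # immutable input
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]

# Sequential evaluation.
sequential = reduce(add, map(sum_of_squares, chunks), 0)

# Parallel evaluation: threads stand in for separate servers in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = reduce(add, pool.map(sum_of_squares, chunks), 0)

# Combining the chunks in a different order gives the same answer.
reversed_order = reduce(add, map(sum_of_squares, reversed(chunks)), 0)

print(sequential == parallel == reversed_order)
```

Because no chunk's calculation depends on shared, changing state, the work can be split up, run anywhere, and recombined without changing the result.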
Common Misconceptions & Mistakes to Avoid
- Confusing Big Data with simply having a large database.
- Failing to explain the significance of the lack of structure in Big Data.
- Assuming relational databases can scale indefinitely across multiple machines.
Examiner Marking Points
- Definition of Big Data using the three Vs: volume, velocity, and variety.
- Explanation of why relational databases are inappropriate for Big Data.
- Understanding that processing must be distributed across multiple servers.
- Role of functional programming in writing distributed, correct, and efficient code.
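- Knowledge of the fact-based model for data representation, in which each fact captures a single piece of information, is timestamped, and is never overwritten or deleted.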
- Understanding of graph schema, including nodes, edges, and properties.
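The last two marking points can be sketched in a few lines of Python. This is an illustrative model only: the `Fact` class, the sample users, and the helper functions `current_value` and `neighbours` are hypothetical names, and real systems would use a dedicated data store rather than in-memory lists.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One immutable, timestamped piece of information."""
    entity: str
    attribute: str
    value: str
    timestamp: int

# Fact-based model: facts are only ever appended, never updated or deleted.
facts = [
    Fact("user:1", "city", "Leeds", timestamp=1),
    Fact("user:1", "city", "York", timestamp=2),
    Fact("user:2", "city", "Leeds", timestamp=3),
]

def current_value(facts, entity, attribute):
    """Derive the current state by taking the most recent matching fact."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    return max(matching, key=lambda f: f.timestamp).value if matching else None

# Graph schema: nodes and edges, each carrying a dictionary of properties.
nodes = {
    "user:1": {"label": "Person", "name": "Asha"},
    "user:2": {"label": "Person", "name": "Ben"},
}
edges = [
    {"from": "user:1", "to": "user:2", "label": "FOLLOWS", "since": 2021},
]

def neighbours(node_id):
    """Return the ids of nodes reachable along one outgoing edge."""
    return [e["to"] for e in edges if e["from"] == node_id]

print(current_value(facts, "user:1", "city"))
print(neighbours("user:1"))
```

Note that the older fact for `user:1` is kept rather than overwritten, so the full history remains available, and that both nodes and edges in the graph can hold properties.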