2. Design a Distributed File System in Java: FAANG Interviews

A distributed file system (DFS) is an essential component of modern large-scale computing systems. It allows multiple machines to store and retrieve files efficiently, ensuring scalability, fault tolerance, and high performance. This article provides a step-by-step design and implementation of a DFS in Java, covering key concepts such as sharding, data consistency, fault tolerance, and metadata management.

Requirements for a Distributed File System

Functional Requirements:

  • Store files across multiple machines.

  • Retrieve files efficiently.

  • Provide redundancy to prevent data loss.

  • Support scalability to handle growing data volumes.

Non-Functional Requirements:

  • High availability.

  • Fault tolerance.

  • Efficient metadata management.


Key Concepts

1. Sharding and Partitioning of Data

Data is split into chunks and distributed across multiple machines to balance the load.

Implementation:

  • Assign a unique identifier (ID) to each file.

  • Use consistent hashing to determine which node stores a specific file chunk.

public class ConsistentHashing {
    private final List<String> nodes;
    private final int numberOfReplicas;

    public ConsistentHashing(List<String> nodes, int numberOfReplicas) {
        this.nodes = nodes;
        this.numberOfReplicas = numberOfReplicas;
    }

    public String getNode(String key) {
        int hash = key.hashCode() % nodes.size();
        return nodes.get(hash);
    }
}

2. Data Consistency and Availability (CAP Theorem)

The CAP theorem states that a distributed system can only achieve two of the three guarantees:

  • Consistency: All nodes have the same data at any given time.

  • Availability: Every request gets a response (success/failure).

  • Partition Tolerance: The system works despite network failures.

In our design, we aim to balance these constraints based on the use case.

3. Distributed File Storage Mechanisms

Distributed file systems like HDFS or Google File System split files into fixed-size blocks and distribute them across nodes.

File Storage Example:

  • Split a file into chunks of 64 MB each.

  • Store chunks across multiple machines.

public class FileChunk {
    private String chunkId;
    private byte[] data;

    public FileChunk(String chunkId, byte[] data) {
        this.chunkId = chunkId;
        this.data = data;
    }

    public String getChunkId() {
        return chunkId;
    }

    public byte[] getData() {
        return data;
    }
}

4. Fault Tolerance and Replication

Replication ensures that copies of each chunk are stored on multiple machines to prevent data loss in case of node failure.

Implementation:

  • Use a replication factor to determine how many copies of a chunk to store.

  • Distribute replicas across different nodes.

public class ReplicationManager {
    private final int replicationFactor;

    public ReplicationManager(int replicationFactor) {
        this.replicationFactor = replicationFactor;
    }

    public List<String> replicate(String chunkId, List<String> availableNodes) {
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add(availableNodes.get(i % availableNodes.size()));
        }
        return replicas;
    }
}

5. Metadata Management and Index Structures

Metadata contains information about file locations, chunk mappings, and replicas. A central metadata server efficiently manages this information.

public class ReplicationManager {
    private final int replicationFactor;

    public ReplicationManager(int replicationFactor) {
        this.replicationFactor = replicationFactor;
    }

    public List<String> replicate(String chunkId, List<String> availableNodes) {
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add(availableNodes.get(i % availableNodes.size()));
        }
        return replicas;
    }
}

Implementation in Java

Main Distributed File System Class

import java.util.*;

public class DistributedFileSystem {
    private MetadataServer metadataServer;
    private ReplicationManager replicationManager;
    private ConsistentHashing consistentHashing;

    public DistributedFileSystem(List<String> nodes, int replicationFactor) {
        this.metadataServer = new MetadataServer();
        this.replicationManager = new ReplicationManager(replicationFactor);
        this.consistentHashing = new ConsistentHashing(nodes, replicationFactor);
    }

    public void storeFile(String fileName, byte[] data) {
        List<FileChunk> chunks = splitFile(fileName, data);
        List<String> chunkIds = new ArrayList<>();

        for (FileChunk chunk : chunks) {
            String node = consistentHashing.getNode(chunk.getChunkId());
            List<String> replicas = replicationManager.replicate(chunk.getChunkId(), Collections.singletonList(node));
            chunkIds.add(chunk.getChunkId());
            // Store chunk on node and replicas (simulate with a print statement)
            System.out.println("Storing chunk " + chunk.getChunkId() + " on nodes: " + replicas);
        }

        metadataServer.addFile(fileName, chunkIds);
    }

    public void retrieveFile(String fileName) {
        List<String> chunkIds = metadataServer.getChunks(fileName);
        for (String chunkId : chunkIds) {
            String node = consistentHashing.getNode(chunkId);
            // Simulate retrieval
            System.out.println("Retrieving chunk " + chunkId + " from node: " + node);
        }
    }

    private List<FileChunk> splitFile(String fileName, byte[] data) {
        List<FileChunk> chunks = new ArrayList<>();
        int chunkSize = 64 * 1024 * 1024; // 64 MB

        for (int i = 0; i < data.length; i += chunkSize) {
            int end = Math.min(data.length, i + chunkSize);
            byte[] chunkData = Arrays.copyOfRange(data, i, end);
            String chunkId = fileName + "_chunk_" + (i / chunkSize);
            chunks.add(new FileChunk(chunkId, chunkData));
        }

        return chunks;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("Node1", "Node2", "Node3");
        DistributedFileSystem dfs = new DistributedFileSystem(nodes, 3);

        byte[] fileData = new byte[128 * 1024 * 1024]; // 128 MB dummy file
        dfs.storeFile("myFile", fileData);
        dfs.retrieveFile("myFile");
    }
}

Advanced Features

  • Compression and Encryption: Compress and encrypt file chunks before storage.

  • Versioning: Maintain versions of files for recovery and auditing.

  • Monitoring: Use monitoring tools to track system health and usage.


Summary

Designing a distributed file system involves understanding distributed systems' core principles, including data partitioning, consistency, and fault tolerance. We can create a robust and efficient DFS suitable for large-scale applications by implementing the above concepts in Java.

Please stay tune, I will update Point 3 of FANNG Interview series, Please check top 10 interview questions here.

0 comments:

Post a Comment