
User:Rahilsonusrhn/sandbox


Knowledge


Spark

- Data processing engine to store and process data in (near) real time across a cluster using Resilient Distributed Datasets (RDDs).

- RDDs support transformations (map, filter, join) and actions (reduce, count, first) - see the sketch below.

- Has a rich set of machine learning algorithms and complex analytics (MLlib - predictive analysis, recommendation systems, etc.)

- Can do real-time stream processing

- Has a component, GraphX, that helps in graph-based processing (e.g. social graphs like LinkedIn's)

- Has a component, Spark Core, that handles fault tolerance, memory management, scheduling and distribution across the cluster, and interaction with storage systems like HDFS, RDBMS, etc.

- Can be integrated to query, analyze and transform data.

- Supports Java, Python, Scala, etc.

- Much faster than Hadoop MapReduce, especially for iterative, in-memory workloads
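
A minimal Java sketch of the transformation/action flow above, assuming the spark-core dependency is on the classpath (class name and values are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            // transformations (lazy): map and filter
            JavaRDD<Integer> evenSquares = numbers.map(n -> n * n).filter(n -> n % 2 == 0);
            // action (triggers execution): reduce
            int sum = evenSquares.reduce(Integer::sum);
            System.out.println("sum of even squares = " + sum); // 4 + 16 = 20
        }
    }
}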


Hadoop

- Uses MapReduce to process data

- Does batch processing


Splunk

- Indexes machine data (logs) and generates graphs, reports, alerts, dashboards and visualizations

Apache Beam

- Data processing framework.

- It is execution-platform, data and language agnostic.

- Write code once in Beam and run it on any supported data processing engine, e.g. Spark, MapReduce, Google Cloud Dataflow

- Uses pipelines to read data (PCollection), transform it (PTransform) and output data - see the sketch below

- Can add the SDK as a dependency in the pom and use its libraries to process data

- Has functions like triggers and windows
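
A minimal Beam pipeline sketch in Java showing a PCollection flowing through PTransforms, assuming the beam-sdks-java-core dependency and a runner such as the direct runner (file names are illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamExample {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // PCollection: the data read into the pipeline
        PCollection<String> lines = pipeline.apply("ReadLines", TextIO.read().from("input.txt"));

        // PTransform: upper-case every line
        PCollection<String> upper = lines.apply("UpperCase",
                MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

        // write the output; the runner (direct, Spark, Dataflow, ...) is chosen at execution time
        upper.apply("WriteLines", TextIO.write().to("output"));

        pipeline.run().waitUntilFinish();
    }
}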

Apache Flume

Apache Flume is an open-source, powerful, reliable and flexible system used to collect, aggregate and move large amounts of unstructured data from multiple data sources into HDFS/HBase (for example) in a distributed fashion via its strong coupling with the Hadoop cluster.

Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data such as log files and events from various sources to a centralized data store.

Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.

  • Flume provides the feature of contextual routing.
  • The transactions in Flume are channel-based where two transactions (one sender and one receiver) are maintained for each message. It guarantees reliable message delivery.

Apache Mahout

Analyzes large data sets effectively and quickly. Uses mathematical models for:

  • Recommendation
  • Classification
  • Clustering

Algorithms

User-Based Collaborative Filtering is a technique used to predict the items that a user might like on the basis of ratings given to those items by other users who have similar taste to the target user.
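
A short sketch using Mahout's (older) Taste recommender API, assuming a ratings.csv of userID,itemID,rating rows; the file name and neighborhood size are illustrative:

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

import java.io.File;
import java.util.List;

public class UserBasedCfExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // top 3 recommendations for user 1, based on users with similar taste
        List<RecommendedItem> items = recommender.recommend(1, 3);
        items.forEach(item -> System.out.println(item.getItemID() + " -> " + item.getValue()));
    }
}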

Item-Based Collaborative Filtering - recommends items similar to those the user has already rated highly, based on item-to-item similarity.

Spectral clustering -derived from graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them

Random forest - an ensemble of decision trees, each trained on a random subset of the data; their predictions are combined to make the final prediction.

Matrix factorization for recommender systems


Scala

- Statically typed language with type inference

ELK

  • Logstash: collects logs and event data; it parses and transforms the data. Logstash is the data collection pipeline tool: it collects data inputs and feeds them into Elasticsearch.
  • ElasticSearch: the transformed data from Logstash is stored, indexed, and made searchable.
    • Has REST API web-interface with JSON output
    • Full-Text Search
    • Near Real Time (NRT) search
    • Sharded, replicated searchable, JSON document store
    • Schema-free, REST & JSON based distributed document store
  • Kibana: Kibana uses Elasticsearch DB to Explore, Visualize, and Search logs.


Kafka

- Fault-tolerant, scalable messaging system

Project example: receive Playtech events and publish them to a Kafka topic. Different consumers can process these messages; this ensures events are not missed and processing can be scaled.

Bet placement services send a huge number of requests during a match. These requests are pushed to Kafka and our processors process them against the fixtures to drive specific journeys, like calling payment systems.

- Log compaction - removes older duplicate events from the log, keeping the latest record for each key; ordering within a partition is preserved. - Kafka Topic - messages belonging to one category

- A topic can have many partitions to which producers publish data. Data goes to a partition based on the partitioning key if specified, else round-robin.

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1   --partitions 1 --topic Hello-Kafka

- Kafka Cluster - has one or more brokers, managed by ZooKeeper.

- Only one broker acts as the Controller

- Kafka Broker - contains topic partitions.

- Receives messages from producers and stores them in topic partitions; each message is identified by an offset

- Kafka ZooKeeper - manages the brokers in the cluster

- Kafka Producer

Define a KafkaTemplate, giving it the properties file that contains the Kafka configuration.

   private boolean sendMessageToKafka(MessageDto messageDto) throws RecoverableException {
       try {
           byte[] bytes = jacksonObjectMapper.writeValueAsString(messageDto).getBytes(StandardCharsets.UTF_8);
           String key = messageDto.getEventID();          // key used for partitioning
           kafkaTemplate.send(topic, key, bytes).get();   // blocks until the send is acknowledged
           return true;
       } catch (Exception e) {
           throw new RecoverableException(e);
       }
   }

- Kafka consumer

KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(properties);
// subscribe to topics
consumer.subscribe(topics);
// the consumer reads data from Kafka through the polling method
while (true) {
    ConsumerRecords<String, byte[]> records = consumer.poll(pollTimeout);
    for (ConsumerRecord<String, byte[]> record : records) {
        byte[] payload = record.value();
        Object convertedMessage = unmarshall(payload);
        // use the message from here; record.offset() and record.topic() are also available
        myMessageProcessor.handleMessage(convertedMessage);
    }
}

- Kafka Consumer Group

Consumers join a consumer group using a group id; the consumers in a group divide the topic's partitions among themselves, and each partition is consumed by only one consumer from the group (see the sketch below).
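
For illustration, with the plain Java client the group membership is driven entirely by the group.id property (broker address, group and topic names below are placeholders):

import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Collections;
import java.util.Properties;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "fixture-processors");        // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("events"));   // placeholder topic
        // poll as in the consumer loop above; rebalancing assigns partitions across the group
    }
}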


- Kafka streams (API)

Reads records from one Kafka topic, processes them, and writes to another topic


// read the topics to consume from and produce to

String topic = configReader.getKStreamTopic();

String producerTopic = configReader.getKafkaTopic();

// define serialization and deserialization types

final Serde<String> stringSerde = Serdes.String();

final Serde<Long> longSerde = Serdes.Long();

// get the stream data and apply functions (older KStreamBuilder API)

KStreamBuilder builder = new KStreamBuilder();

KStream<String, String> inputStreamData = builder.stream(stringSerde, stringSerde, producerTopic);

KStream<String, Long> processedStream = inputStreamData.mapValues(record -> (long) record.length());

// write the processed stream to the output topic

processedStream.to(stringSerde, longSerde, topic);


- Kafka Connect (API)

Define a connector class (like the FileStream connector) and give it source and sink properties


properties file

efi.kafka.replication.factor=3

efi.kafka.bootstrap-servers=at1p1xdkfk101.dbz.unix:9093,at1p1xdkfk102.dbz.unix:9093

efi.kafka.user=spspinhr01

efi.kafka.password=UzU

# kafka producer properties

efi.kafka.producer.retries=0

efi.kafka.producer.batch.size=16384

efi.kafka.producer.max.block.ms=3000

efi.kafka.producer.linger.ms=1

efi.kafka.producer.request.timeout.ms=5000

efi.kafka.producer.buffer.memory=33554432

efi.kafka.producer.acks=0

efi.kafka.producer.max.request.size=1048576

efi.kafka.producer.compression.type=none

efi.kafka.producer.max.in.flight.requests.per.connection=5

efi.kafka.producer.connections.max.idle.ms=540000

efi.kafka.producer.receive.buffer.bytes=32768

efi.kafka.producer.send.buffer.bytes=131072

efi.kafka.producer.metadata.max.age.ms=300000

efi.kafka.producer.reconnect.backoff.ms=50

efi.kafka.producer.retry.backoff.ms=100

# kafka consumer properties

efi.kafka.consumer.enable.auto.commit=false

efi.kafka.consumer.auto.commit.interval.ms=5000

efi.kafka.consumer.auto.offset.reset=latest

efi.kafka.consumer.reconnect.backoff.ms=5000

efi.kafka.consumer.retry.backoff.ms=5000

efi.kafka.consumer.max.poll.records=500

efi.kafka.consumer.max.poll.interval.ms=300000

efi.kafka.consumer.session.timeout.ms=300000

efi.kafka.consumer.heartbeat.interval.ms=3000

efi.kafka.consumer.partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor

efi.kafka.consumer.fetch.min.bytes=1

efi.kafka.consumer.fetch.max.bytes=52428800

efi.kafka.consumer.fetch.max.wait.ms=500

efi.kafka.consumer.max.partition.fetch.bytes=1048576

efi.kafka.consumer.connections.max.idle.ms=540000

efi.kafka.consumer.check.crcs=true

efi.kafka.consumer.request.timeout.ms=305000

efi.kafka.consumer.receive.buffer.bytes=65536

efi.kafka.consumer.send.buffer.bytes=131072

efi.kafka.consumer.metadata.max.age.ms=300000

efi.kafka.reconnect.period.ms=5000

efi.kafka.pollTimeoutMs=10000


Cassandra


Distributed, high-performing, scalable database. Cassandra nodes are arranged in a ring-based topology, with different replication strategies such as SimpleStrategy (rack-unaware), OldNetworkTopologyStrategy (rack-aware), and NetworkTopologyStrategy (datacenter-aware).


Keyspace - it's like a schema/container which can contain multiple tables.

CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

column family - the Cassandra term for a table: a container of rows sharing a schema.

primary key - uniquely identifies a row; made up of the partition key plus optional clustering columns.

partitioning key - determines which node a row is stored on (the partitioner hashes it into a token).

clustering key - determines the sort order of rows within a partition.

compression - compression algorithm class: LZ4Compressor (Cassandra 1.2.2 and later), SnappyCompressor, or DeflateCompressor. Typically a 25-30% size decrease.

compaction - an I/O operation that happens in the background to clean up data and store it more efficiently. The compaction process merges keys, combines columns, evicts tombstones, consolidates SSTables, and creates a new index in the merged SSTable. SizeTieredCompactionStrategy (STCS): the default compaction strategy; it triggers a minor compaction when there are a number of similar-sized SSTables on disk. TimeWindowCompactionStrategy (TWCS): an alternative for time-series data; TWCS compacts SSTables using a series of time windows. Within a time window, TWCS compacts all SSTables flushed from memory into larger SSTables using STCS.

gossip protocol - nodes periodically communicate with each other to share state/data

project - defined persistence modules like the fixture store and provider store.

- used Spring Data with Cassandra

- the AbstractCassandraConfiguration class is extended and provided with properties like ${cassandra.datastax.hosts} and ${cassandra.cluster.loadbalancing.localDc}

- get the CQL session from the session factory bean

- extend the Spring Data CRUD repository (see the sketch below)

- @Table("fixture_sport_mapping_v2")

public class FixtureIDAndSportCodeMappingV2 {

    @PrimaryKeyColumn(name = "provider_code", type = PrimaryKeyType.PARTITIONED)

    private String providerCode;
}
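
A hypothetical repository interface for the entity above, assuming spring-data-cassandra is on the classpath; Spring Data generates the implementation:

import org.springframework.data.cassandra.repository.CassandraRepository;

// CRUD methods (save, findById, findAll, ...) come for free from CassandraRepository.
public interface FixtureSportMappingRepository
        extends CassandraRepository<FixtureIDAndSportCodeMappingV2, String> {
}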



KUBERNETES


- Orchestrator for container deployment (another example is Docker Swarm, but it lacks autoscaling features)

https://kubernetes.io/docs/concepts/services-networking/service/#type-clusterip

- The architecture contains a control plane, which has the kube-apiserver (front end of the control plane), controller manager (for noticing and responding when e.g. a node goes down), etcd, cloud controller manager (connections to external cloud providers) and scheduler (to schedule pods).

Nodes contain the kubelet (provides the environment and runs pods), kube-proxy (manages the network layer) and pods.


- A Pod can contain many containers.

- Zuul and Eureka are not needed, as there is out-of-the-box support for service discovery, and the gateway role is handled by Ingress.

- Desired state management.

- Uses the cluster API services to achieve the desired state using config in YAML.

- The deployment YAML contains container images, replica config, CPU usage config

- kubelet processes run on each node to coordinate with the cluster API services

- Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the control plane and contains the services necessary to run Pods

- Can use the command-line interface kubectl or the Kubernetes dashboard to manage deployments and pods.

---> kubectl create -f deployment-account.yaml

kubernetes.yml or deployment-account.yaml


apiVersion: extensions/v1beta1

kind: Deployment

metadata:

  name: account-service

  labels:

    run: account-service

spec:

  replicas: 1

  template:

    metadata:

      labels:

        run: account-service

    spec:

      containers:

      - name: account-service

        image: piomin/account-service

        ports:

        - containerPort: 2222

          protocol: TCP

      - name: mongo

        image: library/mongo

        ports:

        - containerPort: 27017

          protocol: TCP

- Service

kind: Service

apiVersion: v1

metadata:

  name: account-service

spec:

  selector:

    run: account-service

  ports:

    - name: port1

      protocol: TCP

      port: 2222

      targetPort: 2222

    - name: port2

      protocol: TCP

      port: 27017

      targetPort: 27017

  type: NodePort

-Ingress -Ingress may provide load balancing, SSL termination and name-based virtual hosting. Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource.

ingress.yml


apiVersion: extensions/v1beta1

kind: Ingress

metadata:

  name: gateway-ingress

spec:

  backend:

    serviceName: default-http-backend

    servicePort: 80

  rules:

  - host: micro.all

    http:

      paths:

      - path: /account

        backend:

          serviceName: account-service

          servicePort: 2222

      - path: /customer

        backend:

          serviceName: customer-service

          servicePort: 3333



DOCKER


To create image - mvn clean install dockerfile:build

Create dockerfile in project root

FROM openjdk:8-jdk-alpine

VOLUME /tmp

ADD target/hello-docker-0.0.1-SNAPSHOT.jar hello-docker-app.jar

ENV JAVA_OPTS=""

ENTRYPOINT [ "sh", "-c", "java $JAVA_OPTS -Djava.security.egd=file:/dev/./urandom -jar /


Deploy docker container

docker run -p 8080:9080 -t --name hello-docker-image hello-howtodoinjava/hello-docker


LOGGING


Log4j - get a logger via Logger.getLogger(ClassName.class) (or LoggerFactory.getLogger when using the SLF4J facade). Can define log severity, appenders, rolling strategy, location, etc.

Log4j2 - faster successor to Log4j with an improved architecture

SLF4J - an abstraction/facade. Can be used with Log4j. Helps change the underlying implementation without changing the code (see the sketch below).
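
A minimal SLF4J sketch (class name is illustrative); the underlying binding - Log4j, Log4j2 or Logback - is chosen at deployment time without touching this code:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String orderId) {
        // parameterized logging avoids string concatenation when the level is disabled
        log.info("placing order {}", orderId);
        try {
            // ... business logic ...
        } catch (Exception e) {
            log.error("order {} failed", orderId, e);
        }
    }
}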

logstash - define the Logstash properties in logback.xml

Define the encoder (LoggingEventCompositeJsonEncoder) and pattern.

The Logstash agent can be a Docker image that runs on the host to ship data from the log folder to Elasticsearch; config can be done in logstash.yml or via environment variables.

The agent injects the correlation id / trace id.

JMS/RabbitMQ


JMS: Java Message Service is an API that is part of Java EE for sending messages between two or more clients. There are many JMS providers such as OpenMQ (GlassFish's default), HornetQ (JBoss), and ActiveMQ.

JMS supports two models: point-to-point (one to one) and publish/subscribe.

JMS is specific to Java clients only, but RabbitMQ supports many technologies.


RabbitMQ: an open source message broker which implements the AMQP standard and is written in Erlang.

RabbitMQ's AMQP model has four exchange types:

direct (the producer sends to an exchange and the exchange routes the message to the queue whose binding key matches the routing key) - see the sketch below

fanout (the producer sends the message to an exchange and the exchange sends it to all bound queues)

topic (partial/pattern match of keys)

headers (uses message headers instead of the routing key)

The default exchange routes a message to the queue whose name matches the routing key.
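
A short sketch of the direct-exchange flow using the RabbitMQ Java client (amqp-client 5.x assumed; exchange, queue and routing-key names are illustrative):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

public class DirectExchangeExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // declare a durable direct exchange and bind a queue with a binding key
            channel.exchangeDeclare("orders", "direct", true);
            channel.queueDeclare("payment-queue", true, false, false, null);
            channel.queueBind("payment-queue", "orders", "order.paid");
            // messages whose routing key matches the binding key land in payment-queue
            channel.basicPublish("orders", "order.paid", null,
                    "order 42 paid".getBytes(StandardCharsets.UTF_8));
        }
    }
}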


Heap Dump Analysis

Use the jmap tool shipped with the JDK to extract a dump: jmap -dump:format=b,file=<file_name> <pid>

Use Eclipse Memory Analyzer (MAT) to analyze it

Can analyse the objects created, memory used, threads in different states, etc.

Security(JWT,SAML,OAUTH)


JWT - JSON Web Token. Generated by the server after authenticating the username and password. The server encodes the user info into a JWT, signs it with its secret key, and sends it to the client. The user info is already present in the JWT when it is received, so the server does not have to save session info; it just has to validate the token using its secret key. If we decode the token we find a header that contains the signing algorithm, a payload that contains the user info and the expiry timestamp, and finally the signature, which is validated against the secret key.
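
As an illustration, issuing and validating such a token with the jjwt library (0.9.x API assumed; the secret and expiry are placeholders) might look like this:

import io.jsonwebtoken.Claims;
import io.jsonwebtoken.Jwts;
import io.jsonwebtoken.SignatureAlgorithm;

import java.nio.charset.StandardCharsets;
import java.util.Date;

public class JwtExample {
    // placeholder secret; in practice this comes from secure configuration
    private static final byte[] SECRET = "change-me-please-32-bytes-minimum!!".getBytes(StandardCharsets.UTF_8);

    public static String issue(String username) {
        return Jwts.builder()
                .setSubject(username)                                             // payload: user info
                .setIssuedAt(new Date())
                .setExpiration(new Date(System.currentTimeMillis() + 3_600_000L)) // 1 hour expiry
                .signWith(SignatureAlgorithm.HS256, SECRET)                       // header + signature
                .compact();
    }

    public static String validate(String token) {
        // throws a JwtException if the signature is invalid or the token has expired
        Claims claims = Jwts.parser().setSigningKey(SECRET).parseClaimsJws(token).getBody();
        return claims.getSubject();
    }
}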

SSO - SAML (Security Assertion Markup Language). 3 entities -> User, Service Provider (SP), Identity Provider (IdP) (e.g. OpenID Connect / Okta / Keycloak)

Provides both authentication and authorization

The SAML XML contains - issuer, ACS URL, auth request id, timestamp

Step 1: The user tries to access a protected resource on the SP.

Step 2: The SP generates a SAML Request.

Step 3: After generating the SAML Request, the SP redirects the user to the IdP.

Step 4: The IdP asks the user to authenticate with login details.

Step 5: The IdP validates the user and generates a SAML Response that contains the SAML Assertion required by the SP.

Step 6: The IdP redirects the user to the SP's Assertion Consumer Service (ACS).

Step 7: The ACS validates the assertion and allows the user to access the protected resource.

Step 8: The user is now able to access resources from the SP.


OAUTH/OAUTH2.0 - For Authorization

  1. The client application requests authorization by directing the resource owner to the authorization server.
  2. The authorization server authenticates the resource owner and informs the user about the client and the data requested by the client. Clients cannot access user credentials since authentication is performed by the authorization server.
  3. Once the user grants permission to access the protected data, the authorization server redirects the user to the client with the temporary authorization code.
  4. The client requests an access token in exchange for the authorization code.
  5. The authorization server authenticates the client, verifies the code, and will issue an access token to the client.
  6. Now the client can access protected resources by presenting the access token to the resource server.
  7. If the access token is valid, the resource server returns the requested resources to the client.

JAVA8


Class: A class is a blueprint or template for creating objects. It defines the properties (attributes) and behaviors (methods) that the objects of the class will have.

Object: An object is an instance of a class. It is a runtime entity with its own set of data members (attributes) and methods (functions).

Encapsulation: Encapsulation is the bundling of data (attributes) and methods that operate on the data into a single unit, i.e., a class. It helps in hiding the internal details of an object and only exposing what is necessary.

Inheritance: Inheritance allows a class (subclass/derived class) to inherit the properties and behaviors of another class (superclass/base class). It promotes code reusability and establishes a relationship between classes.

public class Animal {
    public void eat() { System.out.println("Animal is eating"); }
}

public class Dog extends Animal {
    public void bark() { System.out.println("Dog is barking"); }
}

// Usage
Dog myDog = new Dog();
myDog.eat();  // inherited method
myDog.bark(); // method specific to Dog class

Polymorphism: Polymorphism allows objects of different classes to be treated as objects of a common superclass. It can be achieved through method overloading and method overriding. Overloading - compile-time polymorphism. Overriding - runtime polymorphism (the compiler doesn't know which method will be called; achieved using upcasting).

Covariant Return Type: the covariant return type specifies that the return type of an overriding method may vary in the same direction as the subclass.

class A {
    A get() { return this; }
}

class B1 extends A {
    @Override
    B1 get() { return this; }

    void message() { System.out.println("welcome to covariant return type"); }
}


super :

  1. super can be used to refer immediate parent class instance variable.
  2. super can be used to invoke immediate parent class method.
  3. super() can be used to invoke immediate parent class constructor.


Instance initializer block: used to initialize instance data members. It runs each time an object of the class is created.

IS-A relationship : Inheritance

HAS-A relationship : Aggregation (User Has address)

Final: can be applied to a method, class or variable.

A final class cannot be extended.

A final method cannot be overridden.

Final variables cannot be reassigned - a blank final can be initialized in the constructor (instance field) or in a static block (static field).

Upcasting:

class A {}
class B extends A {}

A a = new B(); // upcasting


static binding: the type of the object is determined at compile time (by the compiler)

dynamic binding: the type of the object is determined at runtime, just like runtime polymorphism.

Abstract class: can have concrete or abstract methods

  • An abstract class must be declared with an abstract keyword.
  • It can have abstract and non-abstract methods.
  • It cannot be instantiated.
  • It can have constructors and static methods also.
  • It can have final methods which will force the subclass not to change the body of the method.
abstract class Bank {
    abstract int getRateOfInterest();
}

class SBI extends Bank {
    int getRateOfInterest() { return 7; }
}


Interface : Since Java 8, we can have default and static methods in an interface.

Since Java 9, we can have private methods in an interface.

Multiple inheritance can be achieved using interfaces.


Exception handling:

Throwable is the root of the hierarchy, with two subclasses: Exception and Error.

Exceptions are either checked (must be caught or declared) or unchecked (RuntimeException and its subclasses); Errors (e.g. OutOfMemoryError) are generally not caught. See the example below.
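
A small illustration of checked vs unchecked exceptions (file name is illustrative):

import java.io.FileReader;
import java.io.IOException;

public class ExceptionDemo {
    public static void main(String[] args) {
        try {
            new FileReader("missing.txt"); // checked: the compiler forces handling or declaring
        } catch (IOException e) {
            System.out.println("checked exception: " + e.getMessage());
        }
        try {
            Integer.parseInt("not a number"); // unchecked: no compile-time obligation
        } catch (NumberFormatException e) {
            System.out.println("unchecked exception: " + e.getMessage());
        }
    }
}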


System Design

URL Shortener

Questions: size of URL data (depends on DAU), read:write ratio (100:1), how many years to store

Shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z).

Capacity estimation -> writes: 10 million/day = 10^7 / 10^5 sec per day -> 100 writes/sec

reads = 100 x 100 = 10k reads/sec

storage = 10 million per day x 100 years x 365 days x datasize = ~400 x 10^9 x datasize

datasize = short URL + long URL + createdDate (~100 bytes = 10^2 bytes), hence ~40 TB

in UTF-8 charset -> char= 1 byte, date =3 bytes, integer=4 bytes


API -> POST api/v1/generate

• request parameter: {longUrl: longURLString} • returns the shortURL with HTTP 200 OK

GET api/v1/shortUrl

• Returns the longURL for HTTP redirection (301 permanent)

DB design -> url table -> id, shortUrl, longUrl, createdDate


Shortening algo -> base-62/base-64 encoding of an id (62^7 ≈ 3.5 x 10^12 combinations, enough for the ~400 x 10^9 URLs estimated above, so length 7 is chosen)

or an SHA-1 / MD5 hash of the long URL (then take the first 7 chars); on collision, take the next 7

can also use a Bloom filter for collision detection

another way is to use a unique id generator like Snowflake; short URLs can also be pre-generated (see the sketch below)
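
A minimal sketch of encoding a numeric id over the 62-character alphabet mentioned above (0-9, a-z, A-Z); where the id comes from (DB sequence, Snowflake, ...) is assumed:

public class ShortUrlEncoder {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int BASE = ALPHABET.length(); // 62

    // repeatedly take the remainder mod 62 and map it to a character
    public static String encode(long id) {
        if (id == 0) {
            return String.valueOf(ALPHABET.charAt(0));
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % BASE)));
            id /= BASE;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(123456789L)); // prints 8m0Kx with this alphabet order
    }
}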

Rate Limiter

Questions: is rate limiting based on userId, IP, etc.?

token bucket (Redis), leaky bucket (using a queue), sliding window

replenish rate and burst rate

can set an expiry on the Redis keys

can use a Lua script so the check-and-decrement is an atomic transaction (avoids race conditions); can also use optimistic locking with SETNX

give a 429 response (too many requests)

can also have queues to make extra requests wait (see the sketch below)
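
A single-node token bucket sketch (capacity = burst, refill = replenish rate); a distributed version would keep this state in Redis, e.g. behind a Lua script, as noted above:

public class TokenBucket {
    private final long capacity;       // burst size
    private final double refillPerSec; // replenish rate
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, double refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        refill();
        if (tokens >= 1) {
            tokens -= 1;
            return true;   // request allowed
        }
        return false;      // caller should respond with HTTP 429
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefillNanos = now;
    }
}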

Consistent hashing

Only a limited number of keys are remapped when a node is added or removed.

There could be uneven distribution when a node goes down, hence virtual nodes are used.

With virtual nodes, each physical node is mapped to several points on the ring (replicas), so the standard deviation of load decreases.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConsistentHashing {

    private final TreeMap<Long, String> circle = new TreeMap<>();
    private final int numberOfReplicas; // virtual nodes per physical node

    public ConsistentHashing(int numberOfReplicas, List<String> nodes) {
        this.numberOfReplicas = numberOfReplicas;
        for (String node : nodes) {
            addNode(node);
        }
    }

    public void addNode(String node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            String virtualNode = node + "-" + i;
            circle.put(hash(virtualNode), virtualNode);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hash(node + "-" + i));
        }
    }

    public String getNode(String key) {
        if (circle.isEmpty()) { return null; }
        long hashKey = hash(key);
        Map.Entry<Long, String> entry = circle.ceilingEntry(hashKey);
        if (entry == null) {
            // wrap around if the key is greater than the largest hash on the ring
            entry = circle.firstEntry();
        }
        return entry.getValue();
    }

    // hash() was not shown in the original notes; an MD5-based hash is one common choice
    private long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) { h = (h << 8) | (d[i] & 0xFF); }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}

Distributed KeyValue Store

put(key,value)

get(key)

CAP theorem

use a ring of nodes with consistent hashing

every node has its data replicated (replication factor N)

UniqueId generator

UUID -> 32 hex chars - 128 bits

might have duplicates

sorting by time is not possible

collisions can occur (though extremely unlikely)

Timestamp -> 41 bits

multiple servers can generate the same timestamp

Snowflake approach -> timestamp + machine id + sequence number (see the sketch below)
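
A simplified sketch of the Snowflake layout (41-bit timestamp, 10-bit machine id, 12-bit sequence); the epoch and bit widths are common choices, not fixed requirements:

public class SnowflakeIdGenerator {
    private static final long EPOCH = 1700000000000L; // custom epoch in ms, illustrative
    private static final long MACHINE_BITS = 10;
    private static final long SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long machineId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long machineId) {
        this.machineId = machineId & ((1L << MACHINE_BITS) - 1);
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                      // sequence exhausted for this millisecond
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        // ids are time-ordered: timestamp in the high bits, then machine id, then sequence
        return ((now - EPOCH) << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}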

WebCrawler

Questions: purpose? search engine indexing?

Should it store HTML content only, or images and documents too?

What is the refresh rate?

For 1 billion pages per month:

size of each page ~100 KB

1 billion x 100 KB = 10^14 bytes = 100 TB/month

QPS: 10^9 / (30 days x ~10^5 sec per day) ≈ 3.3 x 10^2 ≈ 330 QPS


Can have multiple services like a URL processor service, URL downloader service, and parsing service.

Seed URLs -> collect websites for different categories like news, ecommerce, etc. and put them in a seed-url table in the DB

URL Processor Service -> checks if a URL has already been processed (as multiple websites can contain the same links); if not, forwards the request to a queue for the downloader service

Can have a prioritizer that prioritizes websites based on ranking/category while putting them on the queue

Downloader Service -> downloads the content and puts it in file storage; should respect the robots.txt that websites publish and crawl only allowed pages

Parser Service -> parses the contents of the HTML page

DNS cache -> implement a distributed cache to overcome the DNS lookup bottleneck

Notification System

Question -> which types of notifications (SMS, email, mobile push, IVR, etc.)

iOS push - Apple Push Notification service (APNs)

Android - Firebase (FCM)

SMS - different providers based on the region/country

Email - can have your own mail server or a third-party integration

Take the request from any service and put it on a RabbitMQ queue

RabbitMQ supports priorities, so important notifications can be handled first

Rate limiting can be done using Redis

News Feed - Twitter

Question -> DAU (100Million)

What are the most important features to build? (user can publish a post, friends can see the post, users can follow other users, etc.)

  1. follow others
  2. post tweet
  3. view feed
    POST /v1/feed Params: • content: content is the text of the post.

GET /v1/me/feed Params: • auth_token:

DB. can be relational with sharding or graphDB

Feedservice -> queue / kafka -> feed cache(entry for every userid)

for people with millions of followers we can't update every follower's cache entry (fan-out on write is too expensive), so their posts are fetched dynamically at read time and merged into the feed.

Images / videos from CDN

-- Users table

CREATE TABLE users (

    user_id INT PRIMARY KEY,

    username VARCHAR(255) NOT NULL

);

-- Feeds table

CREATE TABLE feeds (

    feed_id INT PRIMARY KEY,

    user_id INT,

    content TEXT NOT NULL,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (user_id) REFERENCES users(user_id)

);

-- Followers table

CREATE TABLE followers (

    follower_id INT,

    following_id INT,

    PRIMARY KEY (follower_id, following_id),

    FOREIGN KEY (follower_id) REFERENCES users(user_id),

    FOREIGN KEY (following_id) REFERENCES users(user_id)

);

Chat System

Questions -> Do we need to store data in backend

Is group chat allowed

How many msgs/day

- can't rely on plain stateless REST for receiving messages (every request opens a new connection and the server can't push)

- can't use polling

- use WebSockets

- use multiple chat servers

- use a key-value store like Redis for messages

- after authentication the client is given the address of one of the chat servers and connects to it

- ZooKeeper to keep track of chat servers (using service discovery)

-> use Snowflake ids for message ids (for ordering)

-> DB design: (msgId, fromId, toId, content, timestamp)

(groupId, msgId, userId, content, timestamp)

-> user A might connect to server A and user B might connect to another server

-> use a message queue: as soon as a message is received on one server, put it on the queue and the other server will pick it up

-> online/offline status can be tracked using Redis

-> if the user is not online, a notification service can send a push notification

Search Autocomplete

Questions -> are suggestions based on frequency?

is ranking required when giving suggestions?

how many suggestions per prefix (10)?

API -> GET /api/v1/suggestions?prefix=abc

- NAIVE SOLUTION: use an RDBMS with the query string and frequency as columns and update the frequency on every search;

use the LIKE operator for prefix matching (not a good design)

- use a trie data structure (to build it, log all HTTP requests, store them in Hadoop/Kafka and aggregate with Spark)

- can have replicas, or shard based on the starting letter


class TrieNode {

    private final TrieNode[] children;

    private boolean isEndOfWord;

    public TrieNode() {

        this.children = new TrieNode[26]; // Assuming lowercase English letters

        this.isEndOfWord = false;

    }

    public TrieNode getChild(char ch) {

        return children[ch - 'a'];

    }

    public TrieNode createChild(char ch) {

        TrieNode node = new TrieNode();

        children[ch - 'a'] = node;

        return node;

    }

    public boolean isEndOfWord() {

        return isEndOfWord;

    }

    public void setEndOfWord(boolean endOfWord) {

        isEndOfWord = endOfWord;

    }

}

public class Trie {

    private final TrieNode root;

    public Trie() {

        this.root = new TrieNode();

    }

    public void insert(String word) {

        TrieNode current = root;

        for (char ch : word.toCharArray()) {

            if (current.getChild(ch) == null) {

                current = current.createChild(ch);

            } else {

                current = current.getChild(ch);

            }

        }

        current.setEndOfWord(true);

    }

    public boolean search(String word) {

        TrieNode node = searchNode(word);

        return node != null && node.isEndOfWord();

    }

    public boolean startsWith(String prefix) {

        return searchNode(prefix) != null;

    }

    private TrieNode searchNode(String word) {

        TrieNode current = root;

        for (char ch : word.toCharArray()) {

            if (current.getChild(ch) == null) {

                return null;

            } else {

                current = current.getChild(ch);

            }

        }

        return current;

    }

}

YouTube/Netflix

Assume the product has 5 million daily active users (DAU).

• Users watch 5 videos per day. • 10% of users upload 1 video per day.

• Assume the average video size is 300 MB.

• Total daily storage space needed: 5 million * 10% * 300 MB = 150TB

- use a CDN

- use blob storage for raw files

- save metadata to a DB

- save videos to S3 by giving the client a presigned URL

- once the raw video is uploaded, use a transcoder to convert it to different bitrates, resolutions and formats

- can use AWS Lambda for this

Proximity Service

Can use a quadtree - each node has 0 or 4 children

- divide the 2D space into 4 quadrants

- in a quadtree, when a node becomes a leaf can be decided based on the number of locations it holds

Other approach - divide the world into 00, 01, 11, 10

- each quadrant can again be divided into 0000, 0010, ...

- can store these prefixes in a DB and do a SQL LIKE (prefix) query

DB - locationID

name

lat

long

description

Can also use the S2 library from Google (maps lat/long to a cell id; supports range queries too)


Chatgpt

Functional Requirements -> Create conversation by sending a prompt

View Conversation

update conversation

Delete conversation

Feedback( thumbsup, thumbsdown)

Non Functional -> Latency

Security

Scalability

The conversation service receives the message -> Profanity Service (ML model)

ChatGPT Service -> calls an ML model -> saves the request and response in the DB

Use feedback like thumbs-up / thumbs-down to further train the model

Have a risk/moderation model to ensure no corruption or abuse

Talk through the API details for the CRUD operations


Generative pretrained transformer: takes data from books, web crawling, etc.

Ex: "Earth is ..." -> a planet, a place where humans live, part of the solar system. Probability-based scoring of the next token, greedy decoding, temperature, top-k, etc.

Then fine-tune on curated data, then a reward model based on human feedback, then reinforcement learning using the rewards

Distributed msg queue

Functional Req -> publish and consume from queue

Non Func -> Is it topic based or fanout based or direct message

Scalability -> 10k topics x 10 million msgs/day = 10^11 (100 billion) msgs/day

latency -> time to deliver and consume

producer pushes to the queue(can use batching)

consumer pulls from the queue(can pull in batch)

Message -> Key, Value

Write ahead log(append only log)- use segmentation to avoid large file size

partition using consistent hashing and write to different nodes

Have Metadata storage that would have offset, followers for replicas, topic details, retention policy

ZooKeeper to coordinate (an out-of-the-box solution for metadata storage, state storage and heartbeat service)

Configuration of acknowledgement

Digital Wallet

-use rdbms(using sharding)

- the debit can be on one shard and the credit on another

- subtract first and then add

- 2-phase commit protocol -> prepare (lock), then commit

the coordinator service can be a single point of failure

locking is not a great option

-SAGA -> small local transactions

Compensating transactions

Saga execution coordinator

Can use event sourcing for reproduction

Google Docs

Should be able to create docs

Should be able to see other user editing

-> WebSockets for real-time changes

-> users will have a local version of the document

-> positional indexing: the doc has positions and that data is sent over the WebSocket

-> instead of 1, 2, 3, fractional positions like .1, .2, .3 can be generated at runtime (so inserts don't force renumbering)

S3 Object Storage

Microservices


- API Gateway (Zuul) - dynamic routing, monitoring, security

API Gateway - example: Zuul (bundled with Spring Cloud). Features: dynamic routing, monitoring, security.

Steps: annotate a @SpringBootApplication class with @EnableZuulProxy and configure the routes:


spring.application.name = api-gateway

# routing for service 1

zuul.routes.service_1.path = /api/service_1/**

zuul.routes.service_1.url = http://localhost:8081/

# routing for service 2

zuul.routes.service_2.path = /api/service_2/**

zuul.routes.service_2.url = http://localhost:8082/

https://rb.gy/9btjty
Security :

Zuul with OAuth

  • Define Authorization server
  • Define Resource server
  • Configure API server


Application.yml
server:
  port: 8080
zuul:
  sensitiveHeaders: Cookie,Set-Cookie
  routes:
    spring-security-oauth-resource:
      path: /spring-security-oauth-resource/**
      url: http://localhost:8082/spring-security-oauth-resource
    oauth:
      path: /oauth/**
      url: http://localhost:8081/spring-security-oauth-server/oauth
security:
  oauth2:
    resource:
      jwt:
        key-value: 123
Java Config :
@Configuration
@EnableResourceServer
public class GatewayConfiguration extends ResourceServerConfigurerAdapter {
    @Override
    public void configure(final HttpSecurity http) throws Exception {
	http.authorizeRequests()
          .antMatchers("/oauth/**")
          .permitAll()
          .antMatchers("/**")
	  .authenticated();
    }
}
https://rb.gy/0of2px
Service Discovery - Client-side SD: the @EnableEurekaClient annotation enables the Eureka client. The @LoadBalanced annotation configures the RestTemplate to use Ribbon, which has been configured to use the Eureka client to do service discovery.
Server-side SD: An AWS Elastic Load Balancer (ELB) is an example of a server-side discovery router. A client makes HTTP(S) requests (or opens TCP connections) to the ELB, which load balances the traffic amongst a set of EC2 instances. An ELB can load balance either external traffic from the Internet or, when deployed in a VPC, internal traffic. An ELB also functions as a service registry: EC2 instances are registered with the ELB either explicitly via an API call or automatically as part of an auto-scaling group.

CODING

reverse a sentence -> split on "\\s+" and then join the words in reverse order (see the sketch below)

can a string's characters form a palindrome -> every character count must be even, with at most one character having an odd count

BFS of a tree -> use a queue: add the root first, then repeatedly poll a node and add its children

DFS of a tree -> pre-order traversal using recursion
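
Illustrative sketches for the sentence-reversal and BFS items above (method and class names are arbitrary):

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

public class CodingExamples {

    // reverse the word order of a sentence
    static String reverseSentence(String sentence) {
        List<String> words = Arrays.asList(sentence.trim().split("\\s+"));
        Collections.reverse(words);
        return String.join(" ", words);
    }

    // BFS of a binary tree: enqueue the root, then repeatedly poll and enqueue children
    static void bfs(TreeNode root) {
        if (root == null) return;
        Queue<TreeNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            TreeNode node = queue.poll();
            System.out.println(node.value);
            if (node.left != null) queue.add(node.left);
            if (node.right != null) queue.add(node.right);
        }
    }

    static class TreeNode {
        int value;
        TreeNode left, right;
        TreeNode(int value) { this.value = value; }
    }
}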