Knowledge
Spark
- Data processing engine that stores and processes data in memory, in near real time, across a cluster using Resilient Distributed Datasets (RDDs).
- RDDs support transformations (map, filter, join) and actions (reduce, count, first).
- Has a rich set of machine learning algorithms and complex analytics (MLlib - predictive analysis, recommendation systems etc.).
- Can do real-time stream processing.
- Has a GraphX component that helps with graph-based processing (e.g. social graphs like LinkedIn).
- Has a Spark Core component that handles fault tolerance, memory management, scheduling, distribution across the cluster, and interaction with storage systems like HDFS, RDBMS etc.
- Can be used to query, analyze and transform data.
- Supports Java, Python, Scala etc.
- Much faster than Hadoop MapReduce for in-memory workloads.
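A minimal RDD sketch using Spark's Java API (standard SparkConf/JavaSparkContext/JavaRDD classes; the app name, master and input numbers are made up for illustration):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            // transformations are lazy: nothing runs until an action is called
            JavaRDD<Integer> doubledEvens = numbers.filter(n -> n % 2 == 0)
                                                   .map(n -> n * 2);
            // actions trigger execution across the cluster
            int sum = doubledEvens.reduce((a, b) -> a + b);
            long count = doubledEvens.count();
            System.out.println("sum=" + sum + ", count=" + count);
        }
    }
}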
Hadoop
- uses mapreduce to process data
- does batch processing
Splunk
-generates graphs, reports, alerts, dashboards and visualizations
Apache Beam
- Data processing framework.
- It is execution-platform, data and language agnostic.
- Write code with the Beam SDK and run it on any data processing engine, e.g. Spark, MapReduce, Google Cloud Dataflow.
- Uses pipelines to read data (PCollection), transform it (PTransform) and output data.
- Can add the SDK as a dependency in the pom and use its libraries to process data.
- Has functions like triggers and windows (see the pipeline sketch below).
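A minimal Beam pipeline sketch in Java (Create, Filter and Count are standard Beam transforms; the element values and class name are illustrative). The runner is chosen via the pipeline options, which is what makes the same code portable across Spark, Dataflow, etc.:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

public class BeamExample {
    public static void main(String[] args) {
        // the options decide which runner executes the pipeline (Direct, Spark, Dataflow, ...)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        PCollection<String> words = pipeline.apply(Create.of("spark", "beam", "flink", "beam"));
        PCollection<String> bWords = words.apply(Filter.by((String w) -> w.startsWith("b")));
        // in a real pipeline this count would be written somewhere, e.g. with TextIO
        PCollection<Long> count = bWords.apply(Count.<String>globally());

        pipeline.run().waitUntilFinish();
    }
}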
Apache Flume
Apache Flume is an open-source, powerful, reliable and flexible system used to collect, aggregate and move large amounts of unstructured data from multiple data sources into HDFS/HBase (for example) in a distributed fashion, via its strong coupling with the Hadoop cluster.
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
- Flume provides the feature of contextual routing.
- Transactions in Flume are channel-based: two transactions (one sender and one receiver) are maintained for each message, which guarantees reliable message delivery.
Apache Mahout
Used to analyze large data sets effectively and quickly. Uses mathematical models for:
- Recommendation
- Classification
- Clustering
Algorithms
User-Based Collaborative Filtering: predicts the items a user might like based on the ratings given to those items by other users who have similar taste to the target user.
Item-Based Collaborative Filtering
Spectral clustering: derived from graph theory; the approach identifies communities of nodes in a graph based on the edges connecting them.
Random forest: uses an ensemble of decision trees trained on random subsets of the data to make predictions.
Matrix factorization for recommender systems
Scala
- Statically typed language with type inference.
ELK
- Logstash: collects logs and event data; it can also parse and transform data. Logstash is the data collection pipeline tool: it collects data inputs and feeds them into Elasticsearch.
- Elasticsearch: stores, indexes and makes searchable the data transformed by Logstash.
- Has a REST API web interface with JSON output
- Full-text search
- Near Real Time (NRT) search
- Sharded, replicated, searchable JSON document store
- Schema-free, REST & JSON based distributed document store
- Kibana: uses the Elasticsearch DB to explore, visualize, and search logs.
Kafka
- Fault-tolerant, scalable messaging system.
Project example: receive Playtech events and send them to a Kafka topic. Different consumers can process these messages; this ensures events are not missed and processing can be scaled.
Bet placement services send a huge number of requests during a match. These requests are pushed to Kafka and our processors process them against the fixtures to trigger specific journeys, like calling payment systems.
- Log compaction - retains only the latest value for each message key and removes older duplicates from the log; ordering per key is preserved.
- Kafka Topic - messages belonging to one category.
- A topic can have many partitions to which producers can publish data. Data goes to a partition based on the partitioning key if specified, else round robin.
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Hello-Kafka
- Kafka Cluster - has one or more brokers, managed by ZooKeeper.
- Only one broker acts as the Controller.
- Kafka Broker - contains topic partitions.
- Receives messages from producers and stores them in topic partitions using offsets.
- Kafka ZooKeeper - manages the brokers in the cluster.
- Kafka Producer
Define a KafkaTemplate, giving it the properties file that contains the Kafka configuration.
private boolean sendMessageToKafka(Object messageDto) throws RecoverableException {
    // serialize the DTO to JSON bytes
    byte[] bytes = jacksonObjectMapper.writeValueAsString(messageDto).getBytes("UTF-8");
    // the event id is used as the partitioning key (getEventId is a helper on the surrounding class)
    Object key = getEventId(messageDto);
    // send and block until the broker acknowledges
    kafkaTemplate.send(topic, key, bytes).get();
    return true;
}
- Kafka Consumer
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(properties);
// subscribe to topics
consumer.subscribe(topics);
// the consumer reads data from Kafka by polling
while (true) {
    ConsumerRecords<String, byte[]> records = consumer.poll(pollTimeout);
    for (ConsumerRecord<String, byte[]> record : records) {
        byte[] payload = record.value();
        Object convertedMessage = unmarshall(payload);
        // use the message from here; record.offset() and record.topic() are also available
        myMessageProcessor.handleMessage(convertedMessage);
    }
}
- Kafka Consumer Group
Consumers join a consumer group using a group id; the consumers in a group divide the topic partitions among themselves, and each partition is consumed by only a single consumer from the group.
- Kafka Streams (API)
Reads records from one Kafka topic, processes them, and writes to another topic.
// read the topics to consume from and produce to
String topic = configReader.getKStreamTopic();
String producerTopic = configReader.getKafkaTopic();
// define serialization and deserialization types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
// build the stream topology and apply functions
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> inputStreamData = builder.stream(stringSerde, stringSerde, producerTopic);
KStream<String, Long> processedStream = inputStreamData.mapValues(record -> (long) record.length());
// write the processed records to the output topic
processedStream.to(stringSerde, longSerde, topic);
- Kafka Connect (API)
Define a connector class (like the file stream connector) and give it source and sink properties.
Properties file:
efi.kafka.replication.factor=3
efi.kafka.bootstrap-servers=at1p1xdkfk101.dbz.unix:9093,at1p1xdkfk102.dbz.unix:9093
efi.kafka.user=spspinhr01
efi.kafka.password=UzU
# kafka producer properties
efi.kafka.producer.retries=0
efi.kafka.producer.batch.size=16384
efi.kafka.producer.max.block.ms=3000
efi.kafka.producer.linger.ms=1
efi.kafka.producer.request.timeout.ms=5000
efi.kafka.producer.buffer.memory=33554432
efi.kafka.producer.acks=0
efi.kafka.producer.max.request.size=1048576
efi.kafka.producer.compression.type=none
efi.kafka.producer.max.in.flight.requests.per.connection=5
efi.kafka.producer.connections.max.idle.ms=540000
efi.kafka.producer.receive.buffer.bytes=32768
efi.kafka.producer.send.buffer.bytes=131072
efi.kafka.producer.metadata.max.age.ms=300000
efi.kafka.producer.reconnect.backoff.ms=50
efi.kafka.producer.retry.backoff.ms=100
# kafka consumer properties
efi.kafka.consumer.enable.auto.commit=false
efi.kafka.consumer.auto.commit.interval.ms=5000
efi.kafka.consumer.auto.offset.reset=latest
efi.kafka.consumer.reconnect.backoff.ms=5000
efi.kafka.consumer.retry.backoff.ms=5000
efi.kafka.consumer.max.poll.records=500
efi.kafka.consumer.max.poll.interval.ms=300000
efi.kafka.consumer.session.timeout.ms=300000
efi.kafka.consumer.heartbeat.interval.ms=3000
efi.kafka.consumer.partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor
efi.kafka.consumer.fetch.min.bytes=1
efi.kafka.consumer.fetch.max.bytes=52428800
efi.kafka.consumer.fetch.max.wait.ms=500
efi.kafka.consumer.max.partition.fetch.bytes=1048576
efi.kafka.consumer.connections.max.idle.ms=540000
efi.kafka.consumer.check.crcs=true
efi.kafka.consumer.request.timeout.ms=305000
efi.kafka.consumer.receive.buffer.bytes=65536
efi.kafka.consumer.send.buffer.bytes=131072
efi.kafka.consumer.metadata.max.age.ms=300000
efi.kafka.reconnect.period.ms=5000
efi.kafka.pollTimeoutMs=10000
Cassandra
Distributed, high-performing, scalable database. Cassandra nodes form a ring-based topology with different replication strategies: simple strategy (rack-unaware), old network topology strategy (rack-aware), and network topology strategy (datacenter-shared).
Keyspace - like a schema/container which can contain multiple tables.
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
column family - the Cassandra equivalent of a table: a container of rows, each identified by a primary key.
primary key - partition key plus optional clustering columns; uniquely identifies a row.
partitioning key - determines which node (partition) a row is stored on.
clustering key - determines the sort order of rows within a partition.
compression - the compression algorithm class:
LZ4Compressor (Cassandra 1.2.2 and later), SnappyCompressor, or DeflateCompressor; typically a 25-30% size decrease.
compaction - an IO operation that happens in the background to clean up data and store it more efficiently. The compaction process merges keys, combines columns, evicts tombstones, consolidates SSTables, and creates a new index in the merged SSTable.
SizeTieredCompactionStrategy (STCS): the default compaction strategy. It triggers a minor compaction when there are a number of similar-sized SSTables on disk.
TimeWindowCompactionStrategy (TWCS): an alternative for time series data. TWCS compacts SSTables using a series of time windows. Within a time window, TWCS compacts all SSTables flushed from memory into larger SSTables using STCS.
gossip protocol - nodes communicate with each other to share state/data.
project - defined persistence modules like fixture store, provider store.
- Used Spring Data with Cassandra.
- AbstractCassandraConfiguration is extended and provided with properties like ${cassandra.datastax.hosts} and ${cassandra.cluster.loadbalancing.localDc}.
- Get the CQL session from the session factory bean.
- Extend the Spring Data CRUD repository.
- Entity mapping:
@Table("fixture_sport_mapping_v2")
public class FixtureIDAndSportCodeMappingV2 {
    @PrimaryKeyColumn(name = "provider_code", type = PrimaryKeyType.PARTITIONED)
    private String providerCode;
}
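A sketch of the matching Spring Data repository (CassandraRepository is the standard Spring Data Cassandra base interface; the repository name and query method are hypothetical):
import org.springframework.data.cassandra.repository.CassandraRepository;

// Spring Data derives the CQL from the method name; provider_code is the partition key
public interface FixtureSportMappingRepository
        extends CassandraRepository<FixtureIDAndSportCodeMappingV2, String> {

    FixtureIDAndSportCodeMappingV2 findByProviderCode(String providerCode);
}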
KUBERNETES
- Orchestrator for container deployment (another example is Docker Swarm, but it lacks autoscaling features).
https://kubernetes.io/docs/concepts/services-networking/service/#type-clusterip
- The architecture contains a control plane, which has the kube-apiserver (front end of the control plane), the controller manager (notices and responds when a node goes down), etcd, the cloud controller manager (external cloud-provider connections) and the scheduler (schedules pods).
Nodes contain the kubelet (provides the environment and runs pods), kube-proxy (manages the network layer) and pods.
- A Pod can contain many containers.
- Zuul and Eureka are not needed, as there is out-of-the-box support for service discovery, and the gateway role is handled by Ingress.
- Desired state management.
- Uses the cluster API services to achieve the desired state, using config in YAML.
- The deployment YAML contains container images, replica config and CPU usage config.
- kubelet processes run on the nodes to coordinate with the cluster API services.
- Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the control plane and contains the services necessary to run Pods
- Can use the command-line interface kubectl, or the Kubernetes dashboard, to manage deployments and pods.
---> kubectl create -f deployment-account.yaml
kubernetes.yml or deployment-account.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: account-service
  labels:
    run: account-service
spec:
  replicas: 1
  template:
    metadata:
      labels:
        run: account-service
    spec:
      containers:
      - name: account-service
        image: piomin/account-service
        ports:
        - containerPort: 2222
          protocol: TCP
      - name: mongo
        image: library/mongo
        ports:
        - containerPort: 27017
          protocol: TCP
- Service
kind: Service
apiVersion: v1
metadata:
  name: account-service
spec:
  selector:
    run: account-service
  ports:
  - name: port1
    protocol: TCP
    port: 2222
    targetPort: 2222
  - name: port2
    protocol: TCP
    port: 27017
    targetPort: 27017
  type: NodePort
- Ingress - exposes HTTP and HTTPS routes from outside the cluster to services within the cluster; it may provide load balancing, SSL termination and name-based virtual hosting. Traffic routing is controlled by rules defined on the Ingress resource.
ingress.yml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: gateway-ingress
spec:
  backend:
    serviceName: default-http-backend
    servicePort: 80
  rules:
  - host: micro.all
    http:
      paths:
      - path: /account
        backend:
          serviceName: account-service
          servicePort: 2222
      - path: /customer
        backend:
          serviceName: customer-service
          servicePort: 3333
DOCKER
To create the image: mvn clean install dockerfile:build
Create a Dockerfile in the project root:
FROM openjdk:8-jdk-alpine
VOLUME /tmp
ADD target/hello-docker-0.0.1-SNAPSHOT.jar hello-docker-app.jar
ENV JAVA_OPTS=""
ENTRYPOINT [ "sh", "-c", "java $JAVA_OPTS -Djava.security.egd=file:/dev/./urandom -jar /hello-docker-app.jar" ]
Deploy the Docker container (options must come before the image name):
docker run -p 8080:9080 --name hello-docker-image -t hello-howtodoinjava/hello-docker
LOGGING
Log4j - LoggerFactory.getLogger(ClassName.class). Can define log severity, appenders, rolling strategy, location etc.
Log4j2 - faster and better than Log4j.
SLF4J - an abstraction layer. Can be used with Log4j. Helps change the underlying implementation without changing the code (see the sketch below).
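A minimal SLF4J usage sketch (standard Logger/LoggerFactory API; the class name, method and messages are illustrative):
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentService { // class name is illustrative
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void pay(String orderId) {
        // parameterized messages avoid string concatenation when the level is disabled
        log.info("processing payment for order {}", orderId);
        try {
            // ... business logic ...
        } catch (Exception e) {
            log.error("payment failed for order {}", orderId, e);
        }
    }
}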
Logstash - define the Logstash properties in logback.xml.
Define the encoder (LoggingEventCompositeJsonEncoder) and pattern.
The Logstash agent can be a Docker image that runs on the host to send data from the log folder to Elasticsearch. Config can be done in logstash.yml or via environment variables.
The agent injects the correlation id / trace id.
JMS/RabbitMQ
JMS: Java Message Service is an API that is part of Java EE for sending messages between two or more clients. There are many JMS providers, such as OpenMQ (GlassFish's default), HornetQ (JBoss), and ActiveMQ.
JMS supports two models: point-to-point (one-to-one) and publish/subscribe.
JMS is specific to Java clients, whereas RabbitMQ supports many technologies.
RabbitMQ: an open-source message broker that uses the AMQP standard and is written in Erlang.
RabbitMQ's AMQP model has 4 exchange types:
direct (the producer sends to an exchange, and the exchange routes the msg to the queue whose binding key matches the routing key),
fanout (the producer sends the msg to an exchange and the exchange sends it to all bound queues),
topic (partial/pattern match of keys),
headers (uses msg headers instead of the routing key).
The default exchange routes a message to the queue whose name matches the routing key. A publish sketch follows below.
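A sketch of publishing through a direct exchange with the RabbitMQ Java client (exchangeDeclare/queueBind/basicPublish are standard client calls; the exchange, queue and routing-key names are made up):
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

public class DirectExchangeExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // declare a durable direct exchange and bind a queue to it with a binding key
            channel.exchangeDeclare("orders-exchange", "direct", true);
            channel.queueDeclare("payment-queue", true, false, false, null);
            channel.queueBind("payment-queue", "orders-exchange", "payment");

            // messages whose routing key matches the binding key land in payment-queue
            channel.basicPublish("orders-exchange", "payment", null,
                    "order-123 paid".getBytes(StandardCharsets.UTF_8));
        }
    }
}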
Heap Dump Analysis
Use the jmap tool shipped with the JDK to extract a dump: jmap -dump:format=b,file=<file_name> <pid>
Use Eclipse Memory Analyzer (MAT) to analyze it.
Can analyse the objects created, memory used, threads in different states, etc.
Security(JWT,SAML,OAUTH)
JWT - JSON Web Token. Generated by the server after authenticating the username and password. The server encodes the user info into the JWT, signs it, and sends it to the client. The user info is already present in the JWT when it is received, so the server does not have to save session info; it just has to validate the token's signature using its secret key. If we decode the token we find a header that contains the signing algorithm, a payload that contains the user info and the expiry timestamp, and finally the signature that is validated against the secret key.
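A small sketch that splits a token into its three parts and Base64URL-decodes the header and payload (the token here is built inline for illustration; a real token's signature must still be verified with the server's secret key):
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class JwtInspector {
    public static void main(String[] args) {
        // illustrative token; a real signature is an HMAC/RSA signature over "header.payload"
        Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
        String token = enc.encodeToString("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8))
                + "." + enc.encodeToString("{\"sub\":\"user-1\",\"exp\":1700000000}".getBytes(StandardCharsets.UTF_8))
                + ".dummy-signature";

        String[] parts = token.split("\\.");
        Base64.Decoder decoder = Base64.getUrlDecoder();
        // header: signing algorithm and token type
        String header = new String(decoder.decode(parts[0]), StandardCharsets.UTF_8);
        // payload: user info (claims) and expiry timestamp
        String payload = new String(decoder.decode(parts[1]), StandardCharsets.UTF_8);
        // parts[2] is the signature, validated server-side with the secret key
        System.out.println(header);
        System.out.println(payload);
    }
}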
SSO - SAML (Security Assertion Markup Language). 3 entities -> User, Service Provider (SP), Identity Provider (IdP) (e.g. OpenID Connect providers, Okta, Keycloak).
Covers both authentication and authorization.
The SAML XML contains: issuer, ACS URL, auth request id, timestamp.
Step 1: The user tries to access a private resource on the SP.
Step 2: The SP generates a SAML Request.
Step 3: After generating the SAML Request, the SP redirects the user to the IdP.
Step 4: The IdP asks the user to authenticate with login details.
Step 5: The IdP validates the user and generates a SAML Response that contains the SAML Assertion required by the SP.
Step 6: The IdP redirects the user to the SP's Assertion Consumer Service (ACS).
Step 7: The ACS validates the response and allows the user to access the protected resource.
Step 8: The user is now able to access resources on the SP.
OAUTH/OAUTH2.0 - For Authorization
- The client application requests authorization by directing the resource owner to the authorization server.
- The authorization server authenticates the resource owner and informs the user about the client and the data requested by the client. Clients cannot access user credentials, since authentication is performed by the authorization server.
- Once the user grants permission to access the protected data, the authorization server redirects the user to the client with the temporary authorization code.
- The client requests an access token in exchange for the authorization code.
- The authorization server authenticates the client, verifies the code, and will issue an access token to the client.
- Now the client can access protected resources by presenting the access token to the resource server.
- If the access token is valid, the resource server returns the requested resources to the client.
JAVA8
Class: A class is a blueprint or template for creating objects. It defines the properties (attributes) and behaviors (methods) that the objects of the class will have.
Object: An object is an instance of a class. It is a runtime entity with its own set of data members (attributes) and methods (functions).
Encapsulation: Encapsulation is the bundling of data (attributes) and the methods that operate on that data into a single unit, i.e., a class. It helps in hiding the internal details of an object and only exposing what is necessary.
Inheritance: Inheritance allows a class (subclass/derived class) to inherit the properties and behaviors of another class (superclass/base class). It promotes code reusability and establishes a relationship between classes.
public class Animal {
    public void eat() { System.out.println("Animal is eating"); }
}
public class Dog extends Animal {
    public void bark() { System.out.println("Dog is barking"); }
}
// Usage
Dog myDog = new Dog();
myDog.eat();  // Inherited method
myDog.bark(); // Method specific to Dog class
Polymorphism: Polymorphism allows objects of different classes to be treated as objects of a common superclass. It can be achieved through method overloading and method overriding. Overloading is compile-time polymorphism; overriding is runtime polymorphism (the compiler doesn't know which method will be called; it is achieved using upcasting).
Covariant Return Type: The covariant return type specifies that the return type of an overriding method may vary in the same direction as the subclass.
class A {
    A get() { return this; }
}
class B1 extends A {
    @Override
    B1 get() { return this; }
    void message() { System.out.println("welcome to covariant return type"); }
}
super :
- super can be used to refer to the immediate parent class's instance variables.
- super can be used to invoke immediate parent class method.
- super() can be used to invoke immediate parent class constructor.
Instance initializer block: used to initialize instance data members. It runs each time an object of the class is created.
IS-A relationship : Inheritance
HAS-A relationship : Aggregation (User Has address)
Final: can be applied to a method, class or variable.
A final class cannot be extended.
A final method cannot be overridden.
Final variables cannot be changed - they can be initialized in the constructor, or in a static block for static finals.
Upcasting:
class A {}
class B extends A {}
A a = new B(); // upcasting
static binding: the type of the object is determined at compile time (by the compiler).
dynamic binding : just like runtime polymorphism.
Abstract class: can contain concrete or abstract methods.
- An abstract class must be declared with an abstract keyword.
- It can have abstract and non-abstract methods.
- It cannot be instantiated.
- It can have constructors and static methods also.
- It can have final methods which will force the subclass not to change the body of the method.
abstract class Bank {
    abstract int getRateOfInterest();
}
class SBI extends Bank {
    int getRateOfInterest() { return 7; }
}
Interface: since Java 8, we can have default and static methods in an interface (see the sketch below).
Since Java 9, we can also have private methods in an interface.
Multiple inheritance can be achieved using interfaces.
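A small sketch of default and static methods in an interface (interface and class names are made up):
interface PaymentGateway {
    void pay(double amount);

    // default method: implementing classes inherit it but may override it
    default void refund(double amount) {
        System.out.println("refunding " + amount + " via " + name());
    }

    default String name() {
        return "generic gateway";
    }

    // static method: belongs to the interface itself
    static PaymentGateway defaultGateway() {
        return new CardGateway();
    }
}

class CardGateway implements PaymentGateway {
    @Override
    public void pay(double amount) {
        System.out.println("paying " + amount + " by card");
    }
}
Calling PaymentGateway.defaultGateway().refund(100.0) uses the inherited default method unless CardGateway overrides it.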
Exception handling:
Throwable is the root of the hierarchy; its two branches are Exception and Error.
Exceptions are either checked or unchecked (RuntimeException and its subclasses).
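A short sketch of the checked vs unchecked distinction (the file path is a placeholder):
import java.io.FileInputStream;
import java.io.FileNotFoundException;

public class ExceptionDemo {
    public static void main(String[] args) {
        try {
            // checked exception: the compiler forces us to handle or declare it
            FileInputStream in = new FileInputStream("/tmp/missing-file.txt");
            in.close();
        } catch (FileNotFoundException e) {
            System.out.println("checked: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("other: " + e.getMessage());
        }

        try {
            // unchecked exception (RuntimeException): no compile-time enforcement
            int x = Integer.parseInt("not-a-number");
        } catch (NumberFormatException e) {
            System.out.println("unchecked: " + e.getMessage());
        }
    }
}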
System Design
URL Shortener
Questions: size of URL traffic (depends on DAU), read:write ratio (100:1), how many years to store?
The shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z).
Capacity estimation -> writes: 10 million/day = 10^7 / ~10^5 seconds -> ~100 writes/sec
reads = 100 x 100 = 10k reads/sec
storage = 10 million per day x 100 years x 365 days x record size = 4 x 10^11 x record size
record size = short URL + long URL + createdDate (~100 bytes = 10^2 bytes), hence ~40 TB
in the UTF-8 charset -> char = 1 byte, date = 3 bytes, integer = 4 bytes
API -> POST api/v1/generate
• request parameter: {longUrl: longURLString} • returns the shortURL with HTTP 200 OK
GET api/v1/shortUrl
• Returns the longURL for HTTP redirection (301 permanent)
DB design -> url table -> id, shortUrl, longUrl, createDate
Shortening algo -> base-62/64 encoding of a unique id: 7 characters are enough, since 64^7 ≈ 4.4 x 10^12 covers the ~4 x 10^11 URLs while 64^6 does not
or a SHA-1 / MD5 hash of the long URL (then take the first 7 chars); on collision, take the next 7
can use a Bloom filter too for collision detection
another way is to use a unique id generator like Snowflake; ids can also be pre-generated (see the base-62 sketch below)
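A sketch of base-62 encoding a numeric id into a short token, as mentioned above (the alphabet order and sample id are arbitrary choices):
public class Base62 {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // converts a unique numeric id (e.g. from a snowflake generator) to a short token
    static String encode(long id) {
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(123456789L)); // prints 8m0Kx
    }
}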
Rate Limiter
Questions: is rate limiting based on userId, IP, etc.?
token bucket (Redis), leaky bucket (using a queue), sliding window
replenish rate and burst rate
can set an expiry on keys in Redis
can use a Lua script to make the check-and-decrement atomic and avoid race conditions; can also use optimistic locking with SETNX
give a 429 response (too many requests)
can also have queues to make extra requests wait (a token bucket sketch follows below)
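An in-memory token bucket sketch (a production limiter would keep this state in Redis behind a Lua script, as noted above; the capacity and refill rate are made-up numbers):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TokenBucketLimiter {
    private static final int CAPACITY = 10;          // burst size
    private static final double REFILL_PER_SEC = 5;  // replenish rate

    private static class Bucket {
        double tokens = CAPACITY;
        long lastRefillNanos = System.nanoTime();
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    // returns true if the request is allowed, false if it should get a 429
    public synchronized boolean allow(String clientId) {
        Bucket b = buckets.computeIfAbsent(clientId, id -> new Bucket());
        long now = System.nanoTime();
        double refill = (now - b.lastRefillNanos) / 1_000_000_000.0 * REFILL_PER_SEC;
        b.tokens = Math.min(CAPACITY, b.tokens + refill);
        b.lastRefillNanos = now;
        if (b.tokens >= 1) {
            b.tokens -= 1;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucketLimiter limiter = new TokenBucketLimiter();
        for (int i = 0; i < 12; i++) {
            System.out.println("request " + i + " allowed=" + limiter.allow("user-1"));
        }
    }
}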
Consistent hashing
Only a limited number of keys are remapped when a node is added or removed.
There could be uneven distribution when a node goes down, hence virtual nodes are used.
With virtual nodes, each physical node gets several replica points on the ring, so the standard deviation of load decreases.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConsistentHashing {
    private final TreeMap<Long, String> circle = new TreeMap<>();
    private final int numberOfReplicas;

    public ConsistentHashing(int numberOfReplicas, List<String> nodes) {
        this.numberOfReplicas = numberOfReplicas;
        for (String node : nodes) {
            addNode(node);
        }
    }

    public void addNode(String node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            String virtualNode = node + "-" + i;
            circle.put(hash(virtualNode), virtualNode);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            String virtualNode = node + "-" + i;
            circle.remove(hash(virtualNode));
        }
    }

    public String getNode(String key) {
        if (circle.isEmpty()) { return null; }
        long hashKey = hash(key);
        Map.Entry<Long, String> entry = circle.ceilingEntry(hashKey);
        if (entry == null) {
            // wrap around if the key is greater than the largest hash
            entry = circle.firstEntry();
        }
        return entry.getValue();
    }

    // the hash function was missing in the notes; a simple MD5-based hash as a placeholder
    private long hash(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] d = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                    | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
Distributed KeyValue Store
put(key, value)
get(key)
CAP theorem
use a ring of nodes with consistent hashing
every node has its data replicated to other nodes on the ring
UniqueId generator
UUID -> 32 chars - 128 bits
might have duplicates
sorting is not possible
collisions can occur
Timestamp -> 41 bits
multiple servers can have the same timestamp
Snowflake approach -> timestamp + machine id + sequence number (see the sketch below)
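A sketch of composing a snowflake-style id from these parts (the 41/10/12 bit layout and custom epoch are assumptions; sequence overflow within a single millisecond is not handled here):
public class SnowflakeId {
    private static final long CUSTOM_EPOCH = 1_600_000_000_000L; // arbitrary epoch in ms
    private final long machineId;   // 10 bits -> up to 1024 machines
    private long lastTimestamp = -1L;
    private long sequence = 0L;     // 12 bits -> 4096 ids per ms per machine

    public SnowflakeId(long machineId) {
        this.machineId = machineId & 0x3FF;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis() - CUSTOM_EPOCH;
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF; // same millisecond: bump the sequence
        } else {
            sequence = 0;
            lastTimestamp = timestamp;
        }
        // 41 bits timestamp | 10 bits machine id | 12 bits sequence
        return (timestamp << 22) | (machineId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeId generator = new SnowflakeId(1);
        System.out.println(generator.nextId());
    }
}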
WebCrawler
Question: purpose? search engine indexing?
should store html content only or images, documents too ?
what is the refresh rate ?
For 1 billion pages per month
size of each website -100kb
1 billion x 100 KB = 10^14 bytes = 100 TB/month
QPS: 10^9 / (30 days x ~10^5 s) ≈ 3.3 x 10^2 ≈ 330 qps
Can have multiple services like a URL processor service, a URL downloader service and a parsing service.
Seed URLs -> collect websites for different categories like news, ecommerce etc. and put them in a seed-URL table in the DB.
URL Processor Service -> checks whether a URL has already been processed (multiple websites can contain the same links); if not, forwards the request to a queue for the downloader service.
Can have a prioritizer that prioritizes websites based on ranking/category while putting them on the queue.
Downloader Service -> downloads the contents and puts them in file storage. Should respect the robots.txt that websites publish and crawl only allowed pages.
Parser Service -> parses the contents of the HTML page.
DNS cache -> implement a distributed cache to overcome the DNS lookup bottleneck.
Notification System
Question -> which types of notifications (SMS, email, mobile push, IVR etc.)?
Apple - APNs (Apple Push Notification service)
Android - Firebase Cloud Messaging
SMS - different providers based on the region/country
Email - can have your own mail server or a third-party integration
Take the request from any service and put it on a RabbitMQ queue.
RabbitMQ supports priorities, so important notifications can be handled first.
Rate limiting can be done using Redis.
News Feed - Twitter
Question -> DAU (100 million)
What are the most important features to build? (user can publish a post, friends can see the post, users can follow other users, etc.)
- follow others
- post tweet
- view feed
POST /v1/feed Params: • content: content is the text of the post.
GET /v1/me/feed Params: • auth_token:
DB can be relational with sharding, or a graph DB.
Feed service -> queue / Kafka -> feed cache (an entry for every user id)
For people with millions of followers we can't update the cache of every follower, so their posts are fetched dynamically and merged into the feed.
Images / videos are served from a CDN.
-- Users table
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(255) NOT NULL
);
-- Feeds table
CREATE TABLE feeds (
feed_id INT PRIMARY KEY,
user_id INT,
content TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
-- Followers table
CREATE TABLE followers (
follower_id INT,
following_id INT,
PRIMARY KEY (follower_id, following_id),
FOREIGN KEY (follower_id) REFERENCES users(user_id),
FOREIGN KEY (following_id) REFERENCES users(user_id)
);
Chat System
Questions -> Do we need to store data in the backend?
Is group chat allowed?
How many msgs/day?
- can't use stateless REST, as that would open a new connection for every message
- can't use polling
- use WebSockets
- use multiple chat servers
- use a key-value store like Redis to store messages
- after authentication, the client is given the chat server info and connects to one of the chat servers
- ZooKeeper to keep track of chat servers (service discovery)
-> use snowflake for msgid (for ordering)
-> DB design (msgid,fromid,toid,content,timestamp)
(groupid,msgid,userid,content,timestamp)
-> User A might connect to server A and user B might connect to another server
-> use a message queue: as soon as a message is received on one server, put it on the queue and the other server will receive it
-> online/offline presence can be tracked with Redis
-> if the user is not online, a notification service can send a notification
Auto Complete- Google Search
Questions -> suggestions based on frequency
Is ranking required when giving suggestions?
How many suggestions per prefix? (10)
API -> GET /api/v1/suggestions?prefix=abc
- NAIVE SOLUTION: can use an RDBMS with the query string and frequency as columns, and update the frequency on every search;
use the LIKE operator (not a good design)
- use a trie data structure (to build it, log all HTTP requests into Hadoop/Kafka and aggregate with Spark)
- can have replicas, or shard based on the starting letter
class TrieNode {
private final TrieNode[] children;
private boolean isEndOfWord;
public TrieNode() {
this.children = new TrieNode[26]; // Assuming lowercase English letters
this.isEndOfWord = false;
}
public TrieNode getChild(char ch) {
return children[ch - 'a'];
}
public TrieNode createChild(char ch) {
TrieNode node = new TrieNode();
children[ch - 'a'] = node;
return node;
}
public boolean isEndOfWord() {
return isEndOfWord;
}
public void setEndOfWord(boolean endOfWord) {
isEndOfWord = endOfWord;
}
}
public class Trie {
private final TrieNode root;
public Trie() {
this.root = new TrieNode();
}
public void insert(String word) {
TrieNode current = root;
for (char ch : word.toCharArray()) {
if (current.getChild(ch) == null) {
current = current.createChild(ch);
} else {
current = current.getChild(ch);
}
}
current.setEndOfWord(true);
}
public boolean search(String word) {
TrieNode node = searchNode(word);
return node != null && node.isEndOfWord();
}
public boolean startsWith(String prefix) {
return searchNode(prefix) != null;
}
private TrieNode searchNode(String word) {
TrieNode current = root;
for (char ch : word.toCharArray()) {
if (current.getChild(ch) == null) {
return null;
} else {
current = current.getChild(ch);
}
}
return current;
}
}
YouTube/Netflix
Assume the product has 5 million daily active users (DAU).
• Users watch 5 videos per day. • 10% of users upload 1 video per day.
• Assume the average video size is 300 MB.
• Total daily storage space needed: 5 million * 10% * 300 MB = 150TB
-use CDN
-Use blob storage for raw files
-save metadata to DB
- save videos in s3 by giving presigned url
- Once the raw video is uploaded, use a transcoder to convert it to different bitrates, resolutions and formats
- can use aws lambda for this
Proximity Service
Can use a quadtree - each node has 0 to 4 children
- divide the 2D space into 4 quadrants
- in a quadtree, a leaf node is split further based on the number of locations it holds
Other approach - divide the world into cells 00, 01, 11, 10
- each quadrant can again be divided into 0000, 0010, ...
- the cell id can be stored in the DB and queried with a SQL LIKE prefix query
DB - locationID
name
lat
long
description
Can use Google's S2 library too (lat/long to cell id; supports range queries too).
Chatgpt
Functional requirements -> create a conversation by sending a prompt
View Conversation
update conversation
Delete conversation
Feedback( thumbsup, thumbsdown)
Non Functional -> Latency
Security
Scalability
The conversation service receives the msg -> Profanity Service (ML model)
ChatGPT Service -> calls an ML model -> saves the request and response in the DB
Use feedback like thumbs-up/thumbs-down to further train the ML model
Have a risk model to ensure no corruption/abuse
Talk through the API details for the CRUD operations
Generative pre-trained transformer: trained on data from books, web crawling etc.
Ex: "Earth is ..." -> "a planet", "a place where humans live", "part of the solar system"; probability-based scoring, greedy decoding, temperature, top-k etc.
Then fine-tune on curated data, then train a reward model from human preference ratings, then reinforcement learning using the rewards.
Distributed msg queue
Functional req -> publish to and consume from the queue
Non-functional -> Is it topic-based, fanout-based or direct messaging?
Scalability -> 10k topics x 10 million msgs/day = 100 billion msgs/day
Latency -> time to deliver and consume
producer pushes to the queue (can use batching)
consumer pulls from the queue (can pull in batches)
Message -> key, value
Write-ahead log (append-only log) - use segmentation to avoid large file sizes
partition using consistent hashing and write to different nodes
Have metadata storage holding offsets, follower replicas, topic details and the retention policy
ZooKeeper to coordinate (out-of-the-box solution for metadata storage, state storage and heartbeats)
Configurable acknowledgement levels
Digital Wallet
- use an RDBMS (with sharding)
- the debit can be on one shard and the credit on another
- subtract first and then add
- 2-phase commit protocol -> prepare (lock), commit
the coordinator service can be a single point of failure
locking is not a great option
- SAGA -> small local transactions
compensating transactions
saga execution coordinator
Can use event sourcing to reproduce state
Google Docs
Should be able to create docs
Should be able to see other users editing
-> WebSockets for real-time changes
-> users will have a local version
-> positional indexing: the doc has positions, and that data is sent over the WebSocket
-> instead of integer positions 1, 2, 3, fractional positions like .1, .2, .3 can be generated at runtime so inserts don't shift everything
S3 Object Storage
Microservices
- API Gateway (Zuul) - dynamic routing, monitoring, security
Component: API Gateway
Example: Zuul (bundled with Spring Cloud), enabled with @EnableZuulProxy
Feature: Dynamic Routing
Steps: annotate the main class with @SpringBootApplication and @EnableZuulProxy, then configure the routes:
spring.application.name = api-gateway
# routing for service 1
zuul.routes.service_1.path = /api/service_1/**
zuul.routes.service_1.url = http://localhost:8081/
# routing for service 2
zuul.routes.service_2.path = /api/service_2/**
zuul.routes.service_2.url = http://localhost:8082/
Link: https://rb.gy/9btjty
Feature: Security - Zuul with OAuth
Steps:
server:
  port: 8080
zuul:
  sensitiveHeaders: Cookie,Set-Cookie
  routes:
    spring-security-oauth-resource:
      path: /spring-security-oauth-resource/**
      url: http://localhost:8082/spring-security-oauth-resource
    oauth:
      path: /oauth/**
      url: http://localhost:8081/spring-security-oauth-server/oauth
security:
  oauth2:
    resource:
      jwt:
        key-value: 123
@Configuration
@EnableResourceServer
public class GatewayConfiguration extends ResourceServerConfigurerAdapter {
    @Override
    public void configure(final HttpSecurity http) throws Exception {
        http.authorizeRequests()
            .antMatchers("/oauth/**")
            .permitAll()
            .antMatchers("/**")
            .authenticated();
    }
}
Link: https://rb.gy/0of2px
Component: Service Discovery
Example: Client-side service discovery
The @EnableEurekaClient annotation enables the Eureka client. The @LoadBalanced annotation configures the RestTemplate to use Ribbon, which has been configured to use the Eureka client to do service discovery.
Example: Server-side service discovery
An AWS Elastic Load Balancer (ELB) is an example of a server-side discovery router. A client makes HTTP(S) requests (or opens TCP connections) to the ELB, which load balances the traffic amongst a set of EC2 instances. An ELB can load balance either external traffic from the Internet or, when deployed in a VPC, internal traffic. An ELB also functions as a service registry: EC2 instances are registered with the ELB either explicitly via an API call or automatically as part of an auto-scaling group.
CODING
reverse a sentence -> split on "\\s" and then reverse the order of the words
can a string form a palindrome -> every character must have an even count, with at most one character having an odd count
BFS of a tree -> use a queue: add the root first, then poll a node and add its children (see the sketch below)
DFS of a tree -> pre-order traversal using recursion
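A sketch of the BFS described above, using a queue (the TreeNode shape is assumed):
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class TreeBfs {
    static class TreeNode {
        int val;
        TreeNode left, right;
        TreeNode(int val) { this.val = val; }
    }

    // add the root to the queue, then repeatedly poll a node and add its children
    static List<Integer> bfs(TreeNode root) {
        List<Integer> order = new ArrayList<>();
        if (root == null) return order;
        Queue<TreeNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            TreeNode node = queue.poll();
            order.add(node.val);
            if (node.left != null) queue.add(node.left);
            if (node.right != null) queue.add(node.right);
        }
        return order;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(1);
        root.left = new TreeNode(2);
        root.right = new TreeNode(3);
        System.out.println(bfs(root)); // [1, 2, 3]
    }
}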