Apache Pinot

Pinot
Original author(s)	Kishore Gopalakrishna
Developer(s)	Apache Pinot
Stable release	0.7.1 / 18 March 2021
Repository	Pinot repository
Written in	Java
Operating system	Cross-platform
Type	Distributed, real-time, column-oriented data store
License	Apache License 2.0
Website	pinot.apache.org

Pinot is a column-oriented, open-source, distributed data store written in Java. It is designed to execute OLAP queries with low latency and is suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with real-time data ingestion.^[1] The name Pinot comes from the Pinot grape, which is pressed into juice used to produce a variety of wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of file formats and streaming data sources.^[2]

Pinot was first created at LinkedIn after the engineering staff determined that no off-the-shelf solution met the social networking site's requirements, such as predictable low latency, data freshness within seconds, fault tolerance, and scalability.^[2] Pinot is used in production by technology companies such as Uber,^[3] Microsoft,^[1] and Factual.^[4]

History

Pinot was started as an internal project at LinkedIn in 2013 to power a variety of user-facing and business-facing products.^[5] The first analytics product at LinkedIn to use Pinot was a redesign of the site's feature that lets members see, in real time, who has viewed their profile.^[5]

"Who’s Viewed Your Profile" (as the name suggests) is LinkedIn’s flagship analytics product for our members. It allows members to see who has viewed their profile in real-time. In early 2014, we launched a completely redesigned version of this product to give users more power. This product needed to run complex queries on large volumes of profile view data to identify interesting insights dynamically. Pinot is the infrastructure that started powering this new redesigned product.^[5]
— Praveen Neppalli Naga, Real-time Analytics at Massive Scale with Pinot, LinkedIn Engineering Blog

The project was open-sourced in June 2015 under an Apache 2.0 license and was donated to the Apache Software Foundation by LinkedIn in June 2019.^[2]^[1]

Architecture

Pinot uses Apache Helix for cluster management. Helix is a generic cluster management framework for managing partitions and replicas in a distributed system. It is embedded as an agent within the different Pinot components and uses Apache ZooKeeper for coordination and for maintaining the overall cluster state and health. All Pinot servers and brokers are managed by Helix. Helix can be thought of as an event-driven discovery service with push and pull notifications that drive the state of a cluster toward an ideal configuration. A finite-state machine maintains a contract of stateful operations that moves the cluster toward that configuration. Query load is optimized as Helix updates routing configurations between nodes based on where data is stored in the cluster.
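The idea of driving a cluster toward an ideal configuration can be illustrated with a minimal sketch. It is not Helix's API: the two-state model (OFFLINE and ONLINE), the function name plan_transitions, and the segment and server names are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical state model: each replica of a segment is OFFLINE or ONLINE.
# These names are illustrative assumptions, not taken from Helix.
Assignment = Dict[str, Dict[str, str]]  # segment -> {server: state}


def plan_transitions(current: Assignment, ideal: Assignment) -> List[Tuple[str, str, str, str]]:
    """Return (segment, server, from_state, to_state) steps that move
    the current assignment toward the ideal one."""
    steps = []
    for segment in sorted(set(current) | set(ideal)):
        have = current.get(segment, {})
        want = ideal.get(segment, {})
        for server in sorted(set(have) | set(want)):
            # A replica missing from a mapping is treated as OFFLINE.
            src = have.get(server, "OFFLINE")
            dst = want.get(server, "OFFLINE")
            if src != dst:
                steps.append((segment, server, src, dst))
    return steps


current = {"seg_0": {"server_1": "ONLINE"}, "seg_1": {"server_1": "ONLINE"}}
ideal = {
    "seg_0": {"server_1": "ONLINE", "server_2": "ONLINE"},
    "seg_1": {"server_2": "ONLINE"},
}
transitions = plan_transitions(current, ideal)
for step in transitions:
    print(step)
```

In this sketch the controller role is reduced to a pure function that compares two mappings; a real cluster manager would also order the transitions and react to node failures.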

Query management

For every query, a cluster's broker performs the following:

Fetches the routes that are computed for a query based on the routing strategy defined in a table's configuration.
Computes the list of segments to query on each server.
Scatter-Gather: sends the requests to each server and gathers the responses.
Merge: merges the query results returned from each server.
Sends the query result to the client.

Queries are received by brokers, which check each request against the segment-to-server routing table and scatter it across real-time and offline servers.
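The broker steps above can be sketched as a small scatter-gather routine. This is a simplified illustration rather than Pinot's implementation: the routing table, the in-memory segment data, and the function names are all hypothetical, and the "servers" are plain function calls run on a thread pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table: segment name -> server that hosts it.
ROUTING_TABLE = {"seg_0": "server_1", "seg_1": "server_1", "seg_2": "server_2"}

# Hypothetical per-segment data: view counts keyed by country.
SEGMENT_DATA = {
    "seg_0": {"US": 3, "IN": 1},
    "seg_1": {"US": 2},
    "seg_2": {"IN": 4, "DE": 1},
}


def segments_by_server(routing_table):
    """Compute the list of segments to query on each server."""
    plan = defaultdict(list)
    for segment, server in sorted(routing_table.items()):
        plan[server].append(segment)
    return dict(plan)


def execute_on_server(server, segments):
    """Stand-in for a server: aggregate counts over its assigned segments."""
    partial = defaultdict(int)
    for segment in segments:
        for key, count in SEGMENT_DATA[segment].items():
            partial[key] += count
    return dict(partial)


def broker_query(routing_table):
    plan = segments_by_server(routing_table)
    # Scatter: send one request per server. Gather: wait for every response.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(execute_on_server, s, segs) for s, segs in plan.items()]
        partials = [f.result() for f in futures]
    # Merge: combine the partial aggregates into one result.
    merged = defaultdict(int)
    for partial in partials:
        for key, count in partial.items():
            merged[key] += count
    return dict(merged)


result = broker_query(ROUTING_TABLE)
print(result)
```

Summing partial counts works here because the aggregation is a simple total; other aggregations would need a different merge step.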

Cluster management

Pinot leverages Apache Helix for cluster management. Helix is a cluster management framework for managing replicated, partitioned resources in a distributed system. Helix uses ZooKeeper to store cluster state and metadata.

Features

Column-oriented storage with various compression schemes, such as run-length and fixed-bit-length encoding
Pluggable indexing technologies: sorted index, bitmap index, inverted index, star-tree index, range index
Ability to optimize the query execution plan based on query and segment metadata
Near-real-time ingestion from streams such as Kafka and Kinesis, and batch ingestion from sources such as Hadoop, S3, Azure, and GCS
SQL-like query language that supports selection, aggregation, filtering, GROUP BY, ORDER BY, and DISTINCT queries
Support for multi-valued fields
Horizontally scalable and fault-tolerant
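The compression schemes named in the list above can be illustrated with a generic sketch of dictionary encoding, fixed-bit-length packing, and run-length encoding. It shows the general techniques only and makes no claim about Pinot's on-disk format; all function names and the sample column are assumptions.

```python
def dictionary_encode(values):
    """Replace each value with its index in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def bits_needed(cardinality):
    """Smallest fixed bit width that can represent every dictionary id."""
    return max(1, (cardinality - 1).bit_length())


def pack_fixed_bits(ids, width):
    """Pack ids into one integer, using exactly `width` bits per id."""
    packed = 0
    for position, value in enumerate(ids):
        packed |= value << (position * width)
    return packed


def unpack_fixed_bits(packed, width, count):
    """Recover `count` ids from an integer produced by pack_fixed_bits."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


# Hypothetical sorted column of country codes.
column = ["DE", "DE", "IN", "IN", "IN", "US"]
dictionary, ids = dictionary_encode(column)
width = bits_needed(len(dictionary))
packed = pack_fixed_bits(ids, width)
runs = run_length_encode(ids)
print(dictionary, ids, width, runs)
```

Run-length encoding pays off mainly when the column is sorted or has long runs of repeated values, while fixed-bit packing helps whenever the number of distinct values is small.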

References

[1] Pawar, Neha (1 April 2019). "Pinot Joins Apache Incubator". LinkedIn Engineering. Archived from the original on 2 April 2019.
[2] Gopalakrishna, Kishore. "Open Sourcing Pinot: Scaling the Wall of Real-Time Analytics". engineering.linkedin.com. LinkedIn. Archived from the original on 10 September 2015. Retrieved 3 September 2020.
[3] Wang, Haibo (15 January 2020). "Engineering SQL Support on Apache Pinot at Uber". Uber Engineering Blog. Uber. Retrieved 3 September 2020.
[4] Melz, Eric (17 May 2020). "Pinot @ Factual". Medium. Archived from the original on 3 September 2020. Retrieved 3 September 2020.
[5] Naga, Praveen. "Real-time Analytics at Massive Scale with Pinot". engineering.linkedin.com. LinkedIn.

External links

Official website: pinot.apache.org

