Data Architectures
Uber :
o Architecture diagram : https://www.uber.com/en-GB/blog/ubers-lakehouse-architecture/
o Architecture is in the notebook
o Gpt:
1. Data Sources:
Uber uses Apache Spark and Spark SQL for batch processing, data transformation, and advanced analytics.
The insights derived from this data processing and analysis are
then fed back into their applications to enhance the overall
customer experience.
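As a rough sketch of this kind of Spark batch transformation, the following PySpark snippet aggregates a hypothetical trips dataset stored as Parquet in the lakehouse; the paths and column names are illustrative assumptions, not Uber's actual schema.

# Minimal PySpark batch-transformation sketch (illustrative schema and paths)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("trips_daily_agg").getOrCreate()

# Read raw trip events from a hypothetical lakehouse path
trips = spark.read.parquet("/data/lakehouse/raw/trips/")

# Transform: aggregate completed trips per city per day
daily = (
    trips
    .filter(F.col("status") == "completed")
    .groupBy("city_id", F.to_date("ended_at").alias("day"))
    .agg(F.count("*").alias("trips"),
         F.sum("fare_usd").alias("gross_bookings_usd"))
)

# Write the derived table back for downstream analytics
daily.write.mode("overwrite").partitionBy("day").parquet("/data/lakehouse/derived/trips_daily/")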
1. Data Collection:
2. Data Ingestion:
3. Data Storage:
4. Data Processing:
Netflix :
Introduction:
Netflix is a leading global entertainment company that provides streaming
services to millions of customers worldwide. The company has built a
robust data architecture and AI ecosystem to power its personalized
recommendations, content optimization, and operational efficiency. This
report delves into the cloud services and open-source software utilized by
Netflix, as well as an in-depth analysis of their data architecture.
Amazon S3: The company's data lake, where raw and processed
data is stored in a scalable and cost-effective manner.
Data Services and Consumption:
Netflix has developed several data services and tools to expose and
consume the data:
Centralized data lake: The Amazon S3-based data lake serves as the
primary repository for raw and processed data, enabling flexible and
scalable data storage and processing.
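As a minimal illustration of landing processed data in an S3-based lake, the sketch below serializes a small dataset to Parquet and uploads it with boto3; the bucket, prefix, and columns are made-up placeholders, and Netflix's real tooling around the lake is far more elaborate.

# Minimal sketch: writing a processed dataset into an S3-based data lake (names are illustrative)
import io
import boto3
import pandas as pd

df = pd.DataFrame({
    "profile_id": [1, 2, 3],
    "title_id": [101, 102, 103],
    "watch_seconds": [1200, 300, 4500],
})

# Serialize to Parquet in memory, then upload to the lake bucket/prefix
buf = io.BytesIO()
df.to_parquet(buf, index=False)   # requires pyarrow or fastparquet
buf.seek(0)

s3 = boto3.client("s3")
s3.upload_fileobj(buf, "example-data-lake", "processed/viewing/day=2024-01-01/part-000.parquet")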
Spotify :
o Architecture diagram (detailed) : https://engineering.atspotify.com/2016/02/spotifys-event-delivery-the-road-to-the-cloud-part-i/
o Simpler architecture : https://engineering.atspotify.com/2024/05/data-platform-explained-part-ii/
o Architecture is in the notebook
Intro:
Spotify is a leading music streaming service with over 456 million active users as of 2023.
To power its personalized recommendations, real-time analytics, and data-driven features,
Spotify has built a robust and scalable data platform leveraging a variety of cloud
services and open-source software.
Data Collection
For collecting user interaction data, such as music playback events, Spotify utilizes the
following technologies:
Apache Kafka: Spotify uses Kafka as the backbone of its event delivery system. They
employ Kafka Syslog Producer to ingest event data from their services, Kafka Brokers to
store and manage the incoming data, and Kafka Groupers to compress and batch the
events.
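A minimal sketch of publishing a playback event to Kafka with the kafka-python client is shown below; the broker address, topic name, and event fields are illustrative assumptions rather than Spotify's actual event schema.

# Minimal event-publishing sketch with kafka-python (illustrative topic and fields)
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "user_id": "u-123",
    "track_id": "t-456",
    "event_type": "playback_started",
    "timestamp_ms": int(time.time() * 1000),
}

# Send the event to a playback-events topic and block until it is acknowledged
producer.send("playback-events", value=event).get(timeout=10)
producer.flush()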
Data Storage
Spotify stores the raw event data and other structured data using the following
technologies:
Hadoop Distributed File System (HDFS): HDFS is used as the primary storage solution for
Spotify's large-scale data sets, providing scalable and fault-tolerant storage.
Cloud Object Storage: Spotify utilizes cloud-based object storage, primarily Google Cloud Storage, alongside its HDFS clusters.
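For the object-storage side, the following sketch uploads an hourly event file to Google Cloud Storage using the google-cloud-storage client; the bucket name, object path, and local file are placeholders.

# Minimal Google Cloud Storage upload sketch (bucket and object names are illustrative)
from google.cloud import storage

client = storage.Client()                       # uses default application credentials
bucket = client.bucket("example-event-archive")
blob = bucket.blob("events/2024/01/01/hour=00/events.avro")

# Upload a locally written hourly event file into the object store
blob.upload_from_filename("/tmp/events-hour-00.avro")
print(f"Uploaded gs://{bucket.name}/{blob.name}")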
Data Processing
Spotify processes the ingested data using a combination of batch and stream processing
frameworks:
Apache Flink: used for both batch and stream processing of event data, enabling low-latency streaming analysis as well as higher-latency batch workloads.
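A minimal PyFlink DataStream sketch of the streaming side is shown below, counting plays per track over a small in-memory collection; in production the source would be a Kafka topic, and the tuple layout here is an assumption for illustration.

# Minimal PyFlink sketch: running play counts per track (in-memory source for illustration)
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Each element is (track_id, 1); a real job would consume these from Kafka
events = env.from_collection([
    ("t-456", 1), ("t-789", 1), ("t-456", 1),
])

counts = (
    events
    .key_by(lambda e: e[0])                         # partition the stream by track id
    .reduce(lambda a, b: (a[0], a[1] + b[1]))       # running count per track
)

counts.print()
env.execute("track_play_counts")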
Spotify leverages machine learning and AI techniques to power various aspects of its
platform, including:
TensorFlow or PyTorch: Spotify may use these popular open-source machine learning frameworks to develop and deploy their personalization, recommendation, and A/B testing models (a minimal sketch follows this list).
Automated Machine Learning (AutoML): Spotify may utilize AutoML services, such as
Google Cloud AutoML or Amazon SageMaker, to accelerate the development of AI models
and reduce the burden on data science teams.
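Following on the framework note above, here is a minimal PyTorch sketch of an embedding-based recommendation scorer of the kind such frameworks support; it is purely illustrative and not Spotify's actual model.

# Minimal embedding-based recommendation scorer in PyTorch (illustrative only)
import torch
import torch.nn as nn

class DotProductRecommender(nn.Module):
    def __init__(self, n_users, n_tracks, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.track_emb = nn.Embedding(n_tracks, dim)

    def forward(self, user_ids, track_ids):
        # Score = dot product between user and track embeddings
        u = self.user_emb(user_ids)
        t = self.track_emb(track_ids)
        return (u * t).sum(dim=-1)

model = DotProductRecommender(n_users=1000, n_tracks=5000)
users = torch.tensor([0, 1, 2])
tracks = torch.tensor([10, 20, 30])
scores = model(users, tracks)   # higher score = stronger predicted preference
print(scores)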
Conclusion
Spotify has built a comprehensive data platform that leverages a wide range of cloud
services and open-source technologies to power its data-driven features and insights. By
combining tools like Apache Kafka, Hadoop, Spark, Flink, TensorFlow, and various cloud
services, Spotify has created a scalable and flexible data infrastructure to keep pace with
the growing demands of its business and user base.
TikTok :
o Architecture diagram : https://www.lavivienpost.com/how-tiktok-works-architecture-illustrated/#5
o Architecture is in the notebook
o Gpt:
Introduction
TikTok, the wildly popular short-form video sharing app, has built an extremely
sophisticated and scalable technology stack to power its platform. By leveraging a
combination of cloud services and open-source software, TikTok has created a robust and
efficient data architecture to support its data-intensive, AI-driven recommendation
system.
User Activity Data: TikTok collects a wide range of user activity data, including
watch time, swipes, likes, shares, and comments. This data is collected from user
devices via the TikTok mobile app.
Data Ingestion: TikTok uses open-source tools like Apache Flume and Scribe to
collect and aggregate the user activity data, which is then ingested into a Kafka
message queue.
Real-Time Processing: For real-time data processing, TikTok utilizes Apache Storm
and Apache Flink, which are both distributed stream processing frameworks.
These tools enable TikTok to process the incoming data streams in near real-time.
Batch Processing: TikTok also leverages the Apache Hadoop ecosystem for batch
data processing. This includes tools like MapReduce, YARN, and HDFS for
distributed data processing and storage.
In-House Tools: TikTok has also developed several in-house tools to optimize its
cloud-native infrastructure, such as ByteMesh for service mesh, KiteX for API
management, and Sonic for real-time data processing.
Model Deployment: TikTok uses TensorFlow Lite to deploy its AI models on the
client-side (mobile devices) for real-time inference and personalization.
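A minimal sketch of on-device inference with the TensorFlow Lite interpreter, of the kind used for client-side model deployment, is shown below; the model file name and input shape are placeholders.

# Minimal TensorFlow Lite inference sketch (model path and input shape are placeholders)
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="ranker.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a single feature vector shaped to the model's expected input
features = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], features)
interpreter.invoke()

scores = interpreter.get_tensor(output_details[0]["index"])
print(scores)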
1. Data Collection: User activity data is collected from the TikTok mobile app and
ingested into a Kafka message queue.
2. Real-Time Processing: Apache Storm and Apache Flink are used for real-time
processing and analysis of the incoming data streams.
3. Batch Processing: The Apache Hadoop ecosystem, including tools like MapReduce,
YARN, and HDFS, is used for batch processing and storage of the data.
5. AI and Machine Learning: TensorFlow and PyTorch are used to build and deploy
advanced AI models for personalization and recommendation.
This sophisticated and scalable data architecture allows TikTok to process massive
amounts of user data, develop highly personalized AI models, and deliver a seamless and
addictive user experience at a global scale.
X:
o Architecture diagram : https://blog.x.com/engineering/en_us/topics/infrastructure/2021/processing-billions-of-events-in-real-time-at-twitter-
o Architecture is in the notebook
o Gpt:
Executive Summary
Twitter processes approximately 400 billion events in real time and generates petabyte-
scale data every day. To handle this massive scale of data, Twitter has built a robust data
infrastructure and architecture leveraging both on-premise and cloud-based services and
open-source software.
The main data sources include:
Hadoop
Vertica
Kafka
Twitter Eventbus
BigQuery
PubSub
These data sources feed into Twitter's data processing pipelines, which utilize both batch
and real-time processing capabilities.
Batch processing using Scalding pipelines to ingest data from Hadoop logs into
Manhattan distributed storage systems.
Real-time processing using Heron topology to ingest data from Kafka topics and
store it in the Twitter Nighthawk distributed cache.
On-premise preprocessing and relay event processing pipelines that convert Kafka
topic events to PubSub topic events with at-least-once semantics.
Dataflow jobs for deduping, real-time aggregation, and sinking data into
BigTable.
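A minimal Apache Beam sketch (the SDK behind Dataflow jobs) of the dedupe-then-aggregate pattern described above follows; it runs locally on a few in-memory tuples, and the event fields are illustrative rather than Twitter's actual schema.

# Minimal Apache Beam sketch of dedupe + aggregation (illustrative events; runs locally)
import apache_beam as beam

events = [
    ("e1", "u1", "like"),
    ("e1", "u1", "like"),      # duplicate delivery of the same event
    ("e2", "u1", "retweet"),
    ("e3", "u2", "like"),
]

with beam.Pipeline() as p:
    (
        p
        | "CreateEvents" >> beam.Create(events)
        | "Dedupe" >> beam.Distinct()                        # drop duplicate deliveries
        | "KeyByUserAction" >> beam.Map(lambda e: ((e[1], e[2]), 1))
        | "CountPerUserAction" >> beam.CombinePerKey(sum)    # aggregate engagement counts
        | "Print" >> beam.Map(print)
    )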
Cloud Services
Twitter's data infrastructure leverages both on-premise services and Google Cloud Platform services, including BigQuery, PubSub, Dataflow, and BigTable.
Open-Source Software
Key open-source components mentioned above include Hadoop, Kafka, Scalding, and Heron.
AI and Machine Learning
The blog post does not provide specific details on how Twitter uses AI and machine
learning in their data infrastructure. However, it is likely that the aggregated interaction
and engagement data is used to power various AI-powered features and services within
Twitter's platform.
Conclusion
Twitter's data infrastructure is designed to handle the massive scale of real-time events
and data they process daily. By leveraging a combination of on-premise and cloud-based
services, as well as open-source software, Twitter has built a robust and scalable data
architecture to support their business needs.
Meta:
https://engineering.fb.com/2023/01/31/production-engineering/meta-asynchronous-computing/
https://engineering.fb.com/2023/03/09/open-source/velox-open-source-execution-engine/
https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/
Executive Summary
Data Collection
Data Ingestion
The collected data sources feed into Meta's data processing pipelines,
which utilize both batch and real-time processing capabilities.
Open-Source Software
The links above center on Velox, Meta's open-source composable execution engine, and Apache Arrow as the columnar data interchange format.
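As a rough illustration of the Arrow side of this stack, the following pyarrow sketch builds a columnar table and runs a filter and aggregation with Arrow compute kernels; the columns are made up, and Velox itself (a C++ engine) is not shown.

# Minimal Apache Arrow (pyarrow) sketch of columnar data plus a compute kernel
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "post_id": [1, 2, 3, 4],
    "reactions": [10, 0, 25, 7],
})

# Columnar filter + aggregation without converting to row-oriented objects
popular = table.filter(pc.greater(table["reactions"], 5))
total = pc.sum(popular["reactions"]).as_py()
print(popular.to_pydict(), total)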
Cloud Services
Unlike the other companies profiled here, Meta runs the majority of its data infrastructure in its own data centers rather than on a public cloud provider such as AWS.
Meta heavily utilizes AI and machine learning across its various products and services.
Conclusion