Data Architectures

 Uber:

o Architecture diagram: https://www.uber.com/en-GB/blog/ubers-lakehouse-architecture/
o Architecture is in the notebook
o GPT summary:
1. Data Sources:

 Uber collects massive amounts of real-time data from driver and rider interactions, as well as data from IoT sensors in vehicles.

 They also integrate data from external sources like weather, traffic, and events.

2. Data Ingestion and Processing:

 Uber uses Apache Kafka for high-throughput, low-latency ingestion of real-time data streams.

 They then leverage Apache Spark Streaming to process the real-time data, performing tasks like trip optimization, demand prediction, and fraud detection.
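As a rough illustration of this Kafka-to-Spark flow, here is a minimal PySpark Structured Streaming sketch that reads events from Kafka and keeps a running per-city demand count. The broker address, topic name, and event schema are assumptions for illustration, not Uber's actual configuration.

# Minimal sketch, assuming a Kafka topic "trip-events" carrying JSON payloads.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("trip-events-demo").getOrCreate()

schema = (StructType()
          .add("trip_id", StringType())
          .add("city", StringType())
          .add("fare", DoubleType()))

# Read the raw event stream from Kafka and parse each JSON payload.
trips = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
         .option("subscribe", "trip-events")                   # assumed topic
         .load()
         .select(from_json(col("value").cast("string"), schema).alias("e"))
         .select("e.*"))

# A stand-in for demand-prediction inputs: running trip counts per city.
demand = trips.groupBy("city").count()

(demand.writeStream
 .outputMode("complete")
 .format("console")
 .start()
 .awaitTermination())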

3. Data Storage and Analytics:

 Uber stores raw data in a data lake on Amazon S3.

 They use Apache Spark and Spark SQL for batch processing,
data transformation, and advanced analytics.

 Uber also utilizes Presto, an interactive SQL engine, to enable ad-hoc queries on the data lake.
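For instance, an ad-hoc query could be issued from Python with the open-source presto-python-client; the coordinator host, catalog, and table below are hypothetical.

# Hedged sketch using the presto-python-client (pip install presto-python-client).
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",  # assumed coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# Hypothetical ad-hoc question: which cities had the most trips?
cur.execute("SELECT city, count(*) AS trips FROM trips GROUP BY city ORDER BY trips DESC LIMIT 10")
for city, trips in cur.fetchall():
    print(city, trips)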

4. Machine Learning and Insights:

 Uber applies machine learning models built on Spark to optimize their operations, such as improving ETAs, pricing, and demand forecasting.

 The insights derived from this data processing and analysis are
then fed back into their applications to enhance the overall
customer experience.
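As a toy example of "ML models built on Spark", the following Spark MLlib snippet fits a linear regression predicting ETA from trip features; the feature names and data points are invented for illustration.

# Toy sketch: Spark MLlib linear regression over invented trip features.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("eta-demo").getOrCreate()

df = spark.createDataFrame(
    [(2.1, 12.0, 9.5), (5.4, 25.0, 21.0), (1.2, 8.0, 6.0)],
    ["distance_km", "traffic_index", "eta_min"],
)

# Pack the raw columns into a single feature vector, as MLlib expects.
features = VectorAssembler(
    inputCols=["distance_km", "traffic_index"], outputCol="features"
).transform(df)

model = LinearRegression(featuresCol="features", labelCol="eta_min").fit(features)
print(model.coefficients, model.intercept)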

Comparing this to the "New World Order" diagram, we can see strong alignment in Uber's use of Kafka, Spark Streaming, Spark SQL, and the overall data lake/warehouse architecture. This provides confidence that the diagram accurately represents the key components of Uber's data ecosystem, even if it doesn't depict Uber's specific implementation details.

Uber's Data Architecture and Technologies [REPORT]

1. Data Collection:

 Uber utilizes various data sources to capture information about its operations, including:

 Rider and driver mobile applications (event data)

 Sensors and IoT devices in vehicles (operations data)

 Third-party data sources (e.g., weather, traffic)

2. Data Ingestion:

 Uber leverages Apache Kafka, a distributed streaming platform, to ingest and process real-time data from the various sources.

 Kafka enables high-throughput, low-latency data ingestion and processing.

3. Data Storage:

 Uber's primary data storage solution is Amazon S3 (Simple Storage Service), a scalable and cost-effective object storage service.

 S3 serves as the company's data lake, storing raw and processed data.

 For fast, high-performance data storage and querying, Uber utilizes Cassandra, a distributed NoSQL database.

4. Data Processing:

 Apache Spark is Uber's primary data processing engine, used for both batch and real-time data processing.

 Spark's capabilities, including Spark Streaming and Spark SQL, enable Uber to handle a variety of data processing workloads.

 Uber also employs Apache Hive, a data warehouse infrastructure built on top of Hadoop, for large-scale data analysis and reporting.
5. Machine Learning and AI:

 Uber develops and deploys various machine learning models to power its product features and optimize operations.

 For model development and training, the company utilizes open-source frameworks like TensorFlow and PyTorch.

 To serve and deploy these models in production, Uber has built its own internal platform, called Michelangelo, which integrates with the company's data infrastructure.

6. Data Visualization and Exploration:

 Uber leverages Tableau for advanced data visualization and exploration, allowing users to interact with and gain insights from the data.

 The company also uses open-source tools like Jupyter Notebooks and Python's data analysis libraries (e.g., Pandas, Matplotlib) for exploratory data analysis and prototyping.

7. Data Governance and Orchestration:

 Uber has implemented a comprehensive data governance framework to ensure data quality, security, and compliance.

 The company uses tools like Airflow, an open-source workflow management platform, to orchestrate and automate various data processing and model training pipelines.
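To make the orchestration concrete, here is a minimal Airflow DAG sketch of a daily "process, then train" pipeline; the task bodies, names, and schedule are illustrative assumptions, not Uber's actual pipelines.

# Minimal Airflow DAG sketch (Airflow 2.x style); tasks are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def process_data():
    print("transforming yesterday's trip data")  # placeholder work

def train_model():
    print("retraining the demand-forecast model")  # placeholder work

with DAG(
    dag_id="daily_demand_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    process >> train  # train only runs after processing succeeds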

Overall, Uber's data architecture is designed to handle the scale and complexity of its operations, leveraging a combination of cloud services (e.g., Amazon S3) and open-source technologies (e.g., Apache Kafka, Spark, Cassandra, TensorFlow) to collect, process, and derive insights from its vast amounts of data. This robust data infrastructure supports the company's ongoing efforts to innovate and optimize its services through advanced analytics and machine learning.
 Netflix:
o Architecture diagram: https://www.tableau.com/blog/tableau-cloud-netflix-original-64442
o Architecture is in the notebook
o GPT summary:

Report: Netflix's Data Architecture and AI Ecosystem

Introduction:
Netflix is a leading global entertainment company that provides streaming
services to millions of customers worldwide. The company has built a
robust data architecture and AI ecosystem to power its personalized
recommendations, content optimization, and operational efficiency. This
report delves into the cloud services and open-source software utilized by
Netflix, as well as an in-depth analysis of their data architecture.

Data Collection and Ingestion:

Netflix relies on a variety of data sources to power its services, including:

 Kafka: Used for ingesting real-time event data from user interactions, device telemetry, and other sources.

 AGIS HUS: An internal operational data system that provides data on content, subscriber activity, and other business-critical information.

Data Processing and Storage:

Netflix's data processing and storage infrastructure includes:

 Hadoop: Used for large-scale data processing and analytics, utilizing technologies like HDFS, MapReduce, and Hive.

 Spark: Employed for real-time data processing, stream processing, and advanced analytics, leveraging tools like Spark Streaming and Spark SQL.

 Teradata: A data warehouse solution used for fast, high-performance data storage and querying.

 Druid: A real-time, distributed data store used to power interactive analytics and visualization.

 Amazon S3: The company's data lake, where raw and processed data is stored in a scalable and cost-effective manner.
Data Services and Consumption:

Netflix has developed several data services and tools to expose and consume the data:

 Kragle: A data catalog and discovery service that helps Netflix employees find and access relevant data.

 Metacat: A metadata management service that provides a unified view of data assets across the organization.

 Tableau: The primary data visualization and business intelligence tool used by Netflix for exploring and analyzing data.

 R, Python, and Jupyter: Utilized by data scientists and analysts for advanced data exploration, model development, and research.

Artificial Intelligence and Machine Learning:

Netflix leverages various AI and ML technologies to power its personalized recommendations, content optimization, and operational efficiency:

 TensorFlow: An open-source machine learning framework used for developing and deploying ML models.

 PyTorch: Another popular open-source ML framework used by Netflix's AI/ML teams.

 Custom ML models: Netflix has developed various custom ML models, including for content recommendations, fraud detection, and operational optimization.

 Experimentation and A/B testing: The company extensively uses A/B testing and experimentation to validate and optimize the performance of its AI/ML models.
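To illustrate the statistics behind such an A/B readout, here is a minimal two-proportion z-test in Python; the conversion counts are invented, and this is not Netflix's actual experimentation tooling.

# Minimal sketch: compare conversion rates between control and treatment.
from statsmodels.stats.proportion import proportions_ztest

successes = [4120, 4460]  # conversions in control, treatment (hypothetical)
trials = [50000, 50000]   # users exposed to each variant (hypothetical)

stat, p_value = proportions_ztest(successes, trials)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the variants' conversion rates genuinely differ.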

Data Architecture and Insights:

Netflix's data architecture is designed to support its data-driven decision-making and business operations:

 Centralized data lake: The Amazon S3-based data lake serves as the primary repository for raw and processed data, enabling flexible and scalable data storage and processing.

 Polyglot persistence: The use of various data storage solutions (Teradata, Druid) allows Netflix to optimally store and access data based on the specific requirements of different use cases.

 Metadata management: Tools like Kragle and Metacat provide a centralized view of data assets, ensuring effective data discovery and governance.

 Data-driven insights: The combination of data processing, AI/ML, and visualization tools allows Netflix to derive actionable insights that drive business decisions, content optimization, and operational efficiency.
Conclusion:
Netflix's data architecture and AI ecosystem demonstrate the company's commitment to leveraging data and technology to deliver personalized and engaging experiences to its customers. The strategic use of cloud services and open-source software, coupled with robust data management and AI/ML capabilities, has enabled Netflix to stay ahead of the competition and continue its growth trajectory.

 Spotify:
o Detailed architecture diagram: https://engineering.atspotify.com/2016/02/spotifys-event-delivery-the-road-to-the-cloud-part-i/
o Simpler architecture: https://engineering.atspotify.com/2024/05/data-platform-explained-part-ii/
o Architecture is in the notebook

Intro:
Spotify is a leading music streaming service with over 456 million active users as of 2023.
To power its personalized recommendations, real-time analytics, and data-driven features,
Spotify has built a robust and scalable data platform leveraging a variety of cloud
services and open-source software.

Data Collection

For collecting user interaction data, such as music playback events, Spotify utilizes the
following technologies:

Apache Kafka: Spotify uses Kafka as the backbone of its event delivery system. They
employ Kafka Syslog Producer to ingest event data from their services, Kafka Brokers to
store and manage the incoming data, and Kafka Groupers to compress and batch the
events.
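A minimal kafka-python sketch of publishing one playback event in this spirit; the broker, topic, and event fields are assumptions, and Spotify's actual producers (syslog-based in the 2016 post) differ.

# Minimal sketch: publish a JSON playback event to Kafka.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": "u123", "track_id": "t456", "event": "song_played", "ms_played": 215000}
producer.send("playback-events", event)  # assumed topic name
producer.flush()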

Data Storage

Spotify stores the raw event data and other structured data using the following
technologies:

Hadoop Distributed File System (HDFS): HDFS is used as the primary storage solution for
Spotify's large-scale data sets, providing scalable and fault-tolerant storage.

Cloud Object Storage: Spotify also utilizes cloud-based object storage, primarily Google Cloud Storage, alongside its HDFS clusters.
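As a sketch of the object-storage side, the google-cloud-storage client can archive a processed file to a bucket; the bucket and object names below are illustrative assumptions.

# Hedged sketch: upload one processed file to Google Cloud Storage.
from google.cloud import storage

client = storage.Client()  # uses ambient GCP credentials
bucket = client.bucket("example-event-archive")         # assumed bucket
blob = bucket.blob("events/2024-05-01/part-0000.avro")  # assumed object path
blob.upload_from_filename("part-0000.avro")
print(f"uploaded to gs://{bucket.name}/{blob.name}")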
Data Processing

Spotify processes the ingested data using a combination of batch and stream processing
frameworks:

Apache Flink: Used for both batch and stream processing of event data, enabling both low- and high-latency data processing and analysis.
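A tiny PyFlink DataStream sketch of the kind of stream job this enables, counting plays per track; the in-memory source stands in for a real Kafka connector purely for illustration.

# Minimal PyFlink sketch: keyed running counts over an in-memory stream.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

events = env.from_collection([("t456", 1), ("t789", 1), ("t456", 1)])

(events
 .key_by(lambda e: e[0])                    # group by track id
 .reduce(lambda a, b: (a[0], a[1] + b[1]))  # running play count per track
 .print())

env.execute("track-play-counts")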

Machine Learning and Artificial Intelligence

Spotify leverages machine learning and AI techniques to power various aspects of its
platform, including:

TensorFlow or PyTorch: Spotify may use these popular open-source machine learning
frameworks to develop and deploy their personalization, recommendation, and A/B
testing models.

MLflow: Spotify could employ MLflow, an open-source ML model management platform, to streamline the model development, deployment, and monitoring lifecycle (see the sketch after this section).

Automated Machine Learning (AutoML): Spotify may utilize AutoML services, such as
Google Cloud AutoML or Amazon SageMaker, to accelerate the development of AI models
and reduce the burden on data science teams.
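Since MLflow's tracking API is public, here is a minimal sketch of logging a toy run; the experiment name, model, and values are invented for illustration.

# Hedged sketch: track a toy training run with MLflow.
import mlflow
from sklearn.linear_model import LogisticRegression

X = [[0.1, 1.0], [0.9, 0.2], [0.4, 0.8], [0.7, 0.3]]
y = [0, 1, 0, 1]

mlflow.set_experiment("recommendation-prototype")  # assumed experiment name

with mlflow.start_run():
    model = LogisticRegression(C=0.5).fit(X, y)
    mlflow.log_param("C", 0.5)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")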

Conclusion

Spotify has built a comprehensive data platform that leverages a wide range of cloud
services and open-source technologies to power its data-driven features and insights. By
combining tools like Apache Kafka, Hadoop, Spark, Flink, TensorFlow, and various cloud
services, Spotify has created a scalable and flexible data infrastructure to keep pace with
the growing demands of its business and user base.
 TikTok:
o Architecture diagram: https://www.lavivienpost.com/how-tiktok-works-architecture-illustrated/#5
o Architecture is in the notebook
o GPT summary:

Introduction
TikTok, the wildly popular short-form video sharing app, has built an extremely
sophisticated and scalable technology stack to power its platform. By leveraging a
combination of cloud services and open-source software, TikTok has created a robust and
efficient data architecture to support its data-intensive, AI-driven recommendation
system.

Data Collection and Ingestion

 User Activity Data: TikTok collects a wide range of user activity data, including
watch time, swipes, likes, shares, and comments. This data is collected from user
devices via the TikTok mobile app.

 Data Ingestion: TikTok uses open-source tools like Apache Flume and Scribe to
collect and aggregate the user activity data, which is then ingested into a Kafka
message queue.

Data Processing and Storage

 Real-Time Processing: For real-time data processing, TikTok utilizes Apache Storm
and Apache Flink, which are both distributed stream processing frameworks.
These tools enable TikTok to process the incoming data streams in near real-time.

 Batch Processing: TikTok also leverages the Apache Hadoop ecosystem for batch
data processing. This includes tools like MapReduce, YARN, and HDFS for
distributed data processing and storage.

 Database Systems: TikTok uses a variety of database systems to store the processed data, including MySQL for structured data and MongoDB for semi-structured data.
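As an illustration of the "semi-structured data in MongoDB" split, here is a small pymongo sketch storing one video document; the connection string, database, and fields are assumptions.

# Minimal sketch: store and fetch a schema-flexible video document.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
videos = client["appdb"]["videos"]                 # assumed db/collection

videos.insert_one({
    "video_id": "v42",
    "hashtags": ["dance", "viral"],
    "stats": {"likes": 1200, "shares": 85},  # nested, schema-flexible fields
})
print(videos.find_one({"video_id": "v42"}))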
Cloud Infrastructure

 Cloud Platforms: TikTok's infrastructure is built on major public cloud platforms, primarily Amazon Web Services (AWS) and Microsoft Azure. These platforms provide the scalable compute, storage, and networking resources needed to handle TikTok's massive data volumes and traffic.

 Container Orchestration: TikTok uses Kubernetes, an open-source container orchestration platform, to manage and scale its microservices-based architecture.

 In-House Tools: TikTok has also developed several in-house tools to optimize its
cloud-native infrastructure, such as ByteMesh for service mesh, KiteX for API
management, and Sonic for real-time data processing.

AI and Machine Learning

 Deep Learning Frameworks: TikTok leverages popular open-source deep learning frameworks like TensorFlow and PyTorch to build its advanced AI models for computer vision, natural language processing, and recommendation systems.

 Model Deployment: TikTok uses TensorFlow Lite to deploy its AI models on the client-side (mobile devices) for real-time inference and personalization (see the sketch after this list).

 Experimentation Platform: TikTok has developed an extensive experimentation platform to test and iterate on its machine learning models, allowing the company to continuously improve its recommendation algorithms.
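The client-side deployment step mentioned above can be sketched with the TensorFlow Lite Python interpreter; the model file and input shape are assumptions for illustration.

# Minimal sketch: run inference with a TensorFlow Lite model.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="recommender.tflite")  # assumed file
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one feature vector shaped to the model's expected input.
x = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()

scores = interpreter.get_tensor(output_details[0]["index"])
print("ranking scores:", scores)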

Data Architecture Overview

TikTok's data architecture can be summarized as follows:

1. Data Collection: User activity data is collected from the TikTok mobile app and
ingested into a Kafka message queue.

2. Real-Time Processing: Apache Storm and Apache Flink are used for real-time
processing and analysis of the incoming data streams.

3. Batch Processing: The Apache Hadoop ecosystem, including tools like MapReduce,
YARN, and HDFS, is used for batch processing and storage of the data.

4. Database Storage: Structured data is stored in MySQL, while semi-structured data is stored in MongoDB.

5. AI and Machine Learning: TensorFlow and PyTorch are used to build and deploy
advanced AI models for personalization and recommendation.

6. Cloud Infrastructure: The entire architecture is built on top of public cloud platforms like AWS and Azure, with Kubernetes for container orchestration and in-house tools for optimization.

This sophisticated and scalable data architecture allows TikTok to process massive
amounts of user data, develop highly personalized AI models, and deliver a seamless and
addictive user experience at a global scale.
 X:
o Architecture diagram: https://blog.x.com/engineering/en_us/topics/infrastructure/2021/processing-billions-of-events-in-real-time-at-twitter-
o Architecture is in the notebook
o GPT summary:
Executive Summary

Twitter processes approximately 400 billion events in real-time and generates petabyte-
scale data every day. To handle this massive scale of data, Twitter has built a robust data
infrastructure and architecture leveraging both on-premise and cloud-based services and
open-source software.

Data Collection and Ingestion

Twitter collects data from a variety of sources, including:

 Hadoop

 Vertica

 Manhattan distributed databases

 Kafka

 Twitter Eventbus

 Google Cloud Storage (GCS)

 BigQuery

 PubSub

These data sources feed into Twitter's data processing pipelines, which utilize both batch
and real-time processing capabilities.

Data Processing and Storage

Twitter's data infrastructure has evolved from a lambda architecture to a kappa architecture over time. The key components include:
Old (Lambda) Architecture

 Batch processing using Scalding pipelines to ingest data from Hadoop logs into
Manhattan distributed storage systems.

 Real-time processing using Heron topology to ingest data from Kafka topics and
store it in the Twitter Nighthawk distributed cache.

New (Kappa) Architecture

 On-premise preprocessing and relay event processing pipelines that convert Kafka
topic events to PubSub topic events with at-least-once semantics.

 Google Cloud Platform services, including:

 Dataflow jobs for deduping, real-time aggregation, and sinking data into
BigTable.
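A hedged Apache Beam sketch of the kind of Dataflow job described above: read from Pub/Sub, dedupe within a window, and aggregate per tweet. The subscription and event fields are assumptions, and the BigTable sink is omitted for brevity.

# Minimal Beam sketch (runnable on Dataflow); names are hypothetical.
import json
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/example/subscriptions/events")  # assumed
     | "Parse" >> beam.Map(json.loads)
     | "ToPair" >> beam.Map(lambda e: (e["event_id"], e["tweet_id"]))
     | "Window" >> beam.WindowInto(window.FixedWindows(60))
     | "Dedup" >> beam.Distinct()  # at-least-once delivery requires deduping
     | "KeyByTweet" >> beam.Map(lambda kv: (kv[1], 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Log" >> beam.Map(print))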

Cloud Services

Twitter's data infrastructure leverages both on-premise services and Google Cloud
Platform services, including:

 PubSub for real-time event processing

 Dataflow for data processing and aggregation

 BigTable for real-time data storage

Open-Source Software

Twitter utilizes several open-source software components in their data infrastructure, including:

 Scalding for batch data processing

 Heron for real-time data processing

 TimeSeries AggregatoR (TSAR) for integrated batch and real-time processing

AI and Machine Learning

The blog post does not provide specific details on how Twitter uses AI and machine
learning in their data infrastructure. However, it is likely that the aggregated interaction
and engagement data is used to power various AI-powered features and services within
Twitter's platform.

Conclusion

Twitter's data infrastructure is designed to handle the massive scale of real-time events
and data they process daily. By leveraging a combination of on-premise and cloud-based
services, as well as open-source software, Twitter has built a robust and scalable data
architecture to support their business needs.
 Meta:
o https://engineering.fb.com/2023/01/31/production-engineering/meta-asynchronous-computing/
o https://engineering.fb.com/2023/03/09/open-source/velox-open-source-execution-engine/
o https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
o https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/

Data Infrastructure and Architecture at Meta

Executive Summary

Meta (formerly Facebook) processes data at enormous scale every day, including data from various sources such as social interactions, user profiles, content uploads, and more. To handle this massive scale of data, Meta has built a robust data infrastructure and architecture leveraging both on-premise and cloud-based services, as well as open-source software.

Data Collection

Meta collects data from a variety of sources, including:

 Apache Kafka for real-time data streams

 Apache Hive for batch data processing

 Amazon S3 for object storage

 Amazon Redshift for data warehousing

 Amazon DynamoDB for NoSQL data storage

Data Ingestion

The collected data sources feed into Meta's data processing pipelines,
which utilize both batch and real-time processing capabilities.

Open-Source Software

 Apache Spark for batch and streaming data processing

 Apache Flink for real-time data processing

 Apache Hadoop for distributed data storage and processing

 Apache Airflow for workflow orchestration

Cloud Services

 Amazon Web Services (AWS) for hosting the majority of Meta's data
infrastructure:

 Amazon S3 for object storage

 Amazon Redshift for data warehousing

 Amazon DynamoDB for NoSQL data storage

 Amazon EMR for managed Hadoop and Spark clusters

 Amazon Kinesis for real-time data streaming

AI and Machine Learning

Meta heavily utilizes AI and machine learning across their various products
and services, including:

 Facebook's newsfeed ranking and content recommendation algorithms

 Computer vision models for image and video understanding

 Natural language processing models for text understanding and generation

 Reinforcement learning models for content optimization and user engagement

To support these AI and ML models, Meta's data infrastructure includes:

 Custom-built AI hardware (e.g., GPU clusters, TPUs) for model training and inference

 Large-scale data processing and storage solutions to handle the massive amounts of training data

 Specialized ML platforms and frameworks, such as PyTorch and TensorFlow, for model development and deployment
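As a toy PyTorch sketch in the spirit of the ranking models listed above: score a (user, item) pair from learned embeddings. Dimensions and data are invented; this is not Meta's actual model.

# Toy sketch: embedding-based ranking model in PyTorch.
import torch
import torch.nn as nn

class RankingModel(nn.Module):
    def __init__(self, num_users=1000, num_items=5000, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user_ids, item_ids):
        # Concatenate user and item embeddings, then score the pair.
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)  # one engagement score per pair

model = RankingModel()
scores = model(torch.tensor([1, 2]), torch.tensor([10, 20]))
print(scores)  # higher score = ranked higher in the feed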

Conclusion

Meta's data infrastructure is designed to handle the massive scale of data they process daily, which includes both real-time interactions and batch data sources. By leveraging a combination of open-source software and cloud-based services, Meta has built a robust and scalable data architecture to support their various products and services, including their extensive use of AI and machine learning.
