
Data Sheet

2021
Everyday AI, Extraordinary People

Dataiku is the platform for Everyday AI, systemizing the use of data for
exceptional business results. Organizations that use Dataiku elevate their
people (whether technical and working in code, or on the business side
using low- or no-code tools) to extraordinary, arming them with the ability
to make better day-to-day decisions with data.

More than 450 companies worldwide use Dataiku to systemize their
use of data and AI, driving diverse use cases from fraud detection to
customer churn prevention, predictive maintenance to supply chain
optimization, and everything in between.
Connectivity
Dataiku allows you to seamlessly connect to your data no matter
where it’s stored or in what format. That means easy access for
everyone — whether technical or not — to the data they need.

SQL Databases
☑ MySQL
☑ PostgreSQL
☑ Vertica
☑ Amazon Redshift
☑ Pivotal Greenplum
☑ Teradata
☑ IBM Netezza
☑ SAP HANA
☑ Oracle
☑ Azure Synapse
☑ Google BigQuery
☑ Google Cloud SQL
☑ IBM DB2
☑ Exasol
☑ MemSQL
☑ Snowflake
☑ Custom connectivity through JDBC

NoSQL Databases
☑ MongoDB
☑ Cassandra
☑ ElasticSearch

Streaming Data Sources
☑ Kafka
☑ AWS SQS
☑ Spark

Remote Data Sources
☑ FTP
☑ SCP
☑ SFTP
☑ HTTP

Cloud Object Storage
☑ Amazon S3
☑ Google Cloud Storage
☑ Azure Blob Storage
☑ Azure Data Lake Store Gen1 & Gen2

Custom Data Sources - Extended Connectivity Through Dataiku Plugins
☑ Connect to REST APIs
☑ Create custom file formats
☑ Connect to databases

Hadoop & Spark Supported Distributions
☑ Cloudera
☑ Hortonworks
☑ Google DataProc
☑ MapR
☑ Amazon EMR
☑ DataBricks

Optimized Sync Between:
☑ Snowflake and WASB
☑ S3 and Amazon Redshift
☑ Snowflake and S3

Hadoop File Formats
☑ CSV
☑ Parquet
☑ ORC
☑ SequenceFile
☑ RCFile

Native Support for Snowflake in Spark Driver
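Whatever the underlying store, access from code goes through the same dataset abstraction. A minimal sketch using Dataiku's Python API inside a code recipe or notebook (the dataset names are hypothetical; connection details live in Dataiku, not in the code):

```python
import dataiku

# Read a dataset into a pandas DataFrame. Whether "transactions" is backed
# by Snowflake, S3, or PostgreSQL is defined by its connection, not the code.
df = dataiku.Dataset("transactions").get_dataframe()

# Write results back through another managed connection.
out = dataiku.Dataset("transactions_prepared")
out.write_with_schema(df)
```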

Exploratory Analytics
Sometimes you need to do a deep dive on your data, but other times,
it’s important to understand it at a glance. From exploring available
datasets to dashboarding, Dataiku makes this type of analysis easy.

Data Analysis

• Automatically detect dataset schema and data types
• Assign semantic meanings to your datasets' columns
• Build univariate statistics automatically & derive data quality checks
• Dataset audit
☑ Automatically produce data quality and statistical analysis of entire Dataiku datasets
☑ Support of several backends for audit (in-memory, Spark, SQL)

Data Cataloging

• Search for data, comments, features, or models in a centralized catalog
• Explore data from all your existing connections

Data Visualization

• Create standard charts (histogram, bar charts, etc.) and scale charts' computation by leveraging underlying systems (in-database aggregations)
• Create custom charts using
☑ Custom Python-based or R-based Charts
☑ Custom Web Applications (HTML/JS/CSS/Flask)
☑ Shiny Web Applications (R)
☑ Bokeh and Dash Web Applications (Python)

Dashboarding

• User-managed reports and dashboards
☑ RMarkdown reports
☑ Jupyter Notebooks reports
☑ Custom Insights (GGplot, Plotly, Matplotlib)
☑ Custom interactive, web-based visualizations

Advanced Analysis

• Interactive visual statistics
☑ Univariate analysis and statistical tests on single or multiple populations
☑ Statistics and tests on multiple populations
☑ Correlations analysis
☑ Principal Components Analysis
• Leverage predefined Python-based Jupyter Notebooks
☑ All analysis supported in Visual Statistics
☑ High dimensional data visualization with t-SNE (see the sketch below)
☑ Topic modeling
• Time series
☑ Time series data prep with visual recipes for resampling, windowing, extrema extraction, interval extraction
☑ Time series visualization
☑ Time series forecasting
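To make the t-SNE item concrete, a minimal sketch of the kind of analysis the pre-templated notebooks cover, in plain scikit-learn (the input file is a hypothetical numeric feature table):

```python
import pandas as pd
from sklearn.manifold import TSNE

# "features.csv" is a hypothetical table of numeric features.
X = pd.read_csv("features.csv")

# Project the high-dimensional features onto 2D for a scatter chart.
coords = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(coords[:5])  # 2D coordinates, ready for plotting
```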

Data Preparation
Traditionally, data preparation takes up to 80% of the time of a data
project. But Dataiku’s data prep features make that process 10x faster
and easier, which means more time for more impactful (and creative)
work.

Visual Data Transformation

• Design your data transformation jobs using a point-and-click interface
☑ Group
☑ Filter
☑ Sort
☑ Stack
☑ Join
☑ Fuzzy Join
☑ Window
☑ Sync
☑ Distinct
☑ Top-N
☑ Pivot
☑ Split
• Scale your transformations by running them directly in distributed computation systems (SQL, Hive, Spark, Impala)
• See and tune the underlying code generated for the task

Dataset Sampling

• First records, random selection, stratified sampling, etc.

Interactive Data Preparation

• Processors (90 built-in, from simple text processing to custom Python- or formula-based transformations; see the sketch below)
• Scale data preparation scripts using in-database (SQL) or in-cluster (Spark) processing
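A minimal sketch of what a custom Python processor can look like in a Prepare recipe (row mode); the `amount` and `amount_band` columns and the threshold are hypothetical:

```python
# Custom Python processor, row mode: receives each record as a dict of
# column name -> value and returns the (possibly modified) row.
def process(row):
    amount = float(row.get("amount") or 0)
    # Derive a new column; the cutoff is purely illustrative.
    row["amount_band"] = "high" if amount > 1000 else "low"
    return row
```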

Machine Learning
Dataiku offers the latest machine learning technologies all in one place
so that data scientists can focus on what they do best: building and
optimizing the right model for the use case at hand.

Automated Machine Learning (AutoML)

• Automated ML strategies
☑ Quick prototypes
☑ Interpretable models
☑ High performance

• Features handling for machine learning
☑ Support for numerical, categorical, text and vector features
☑ Automatic preprocessing of categorical features (dummy encoding, impact coding, hashing, custom preprocessing, etc.)
☑ Automatic preprocessing of numerical features (standard scaling, quantile-based binning, custom preprocessing, etc.)
☑ Automatic preprocessing of text features (TF/IDF, hashing trick, truncated SVD, custom preprocessing)
☑ Various missing values imputation strategies
+ Features generation
◊ Feature-per-feature derived variables (square, square root…)
◊ Linear and polynomial combinations
+ Features selection
◊ Filter and embedded methods

• Choose between several ML backends to train your models
☑ TensorFlow
☑ Keras
☑ Scikit-learn
☑ XGBoost
☑ MLLib
☑ H2O

• Algorithms
☑ Python-based
+ Ordinary Least Squares
+ Ridge Regression
+ Lasso Regression
+ Logistic Regression
+ Random Forests
+ Gradient Boosted Trees
+ XGBoost
+ Decision Tree
+ Support Vector Machine
+ Stochastic Gradient Descent
+ K Nearest Neighbors
+ Extra Random Trees
+ Artificial Neural Network
+ Lasso Path
+ Custom models offering a scikit-learn compatible API, e.g. LightGBM (see the sketch below)
☑ Spark MLLib-based
+ Logistic Regression
+ Linear Regression
+ Decision Trees
+ Random Forest
+ Gradient Boosted Trees
+ Naive Bayes
+ Custom models
☑ H2O-based
+ Deep Learning
+ GBM
+ GLM
+ Random Forest
+ Naive Bayes
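As a concrete example of the custom-model hook, a minimal sketch assuming LightGBM is installed in the code environment; the hyperparameters are illustrative, not recommendations:

```python
# A custom model only needs to expose the scikit-learn estimator API:
# it is trained and scored via fit(X, y), predict(X), and predict_proba(X).
from lightgbm import LGBMClassifier

clf = LGBMClassifier(
    n_estimators=200,    # illustrative values
    learning_rate=0.05,
)
```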

6
Data Sheet - Dataiku

• Hyperparameters optimization
☑ Freely set and search hyperparameters
☑ Support for grid, random, and Bayesian hyperparameter optimization and search
☑ Cross validation strategies
+ Support for several train/test splitting policies (incl. custom)
+ K-fold cross testing
+ Optimize model tuning on several metrics (Explained Variance Score, MAPE, MAE, MSE, Accuracy, F1 Score, Cost matrix, AUC, etc.)
☑ Interrupt and resume grid search
☑ Visualize grid search results
☑ Auto-recalibration on the predicted probabilities
☑ Distributed hyperparameter search on Kubernetes

• Analyzing model training results
☑ Get insights from your model
+ Scored data
+ Features importance
+ Model parameters
+ Partial dependence plots
+ Regression coefficients
+ Bias and performance analysis on subpopulations
+ Individual prediction explanations
+ Model fairness report
+ Interactive scoring (what-if analysis)
+ ML diagnostics
+ Model assertions
☑ Publish training results to Dataiku Dashboards
☑ Audit model performances
+ Confusion matrix
+ Decision chart
+ Lift chart
+ ROC curve
+ Probabilities distribution chart
+ Detailed metrics (Accuracy, F1 Score, ROC-AUC Score, MAE, RMSE, etc.)

• Automatically create ensembles from several models
☑ Linear stacking (for regression models) or logistic stacking (for classification problems)
☑ Prediction averaging or median (for regression problems)
☑ Majority voting (for classification problems)

• Model export
☑ Export trained models as a set of Java classes for extremely efficient scoring in any JVM application
☑ Export a trained model as a PMML file for scoring with any PMML-compatible scorer

• Automated model documentation
☑ Leverage pre-built templates or create your own for standardized model documentation without the manual work

• Scoring capabilities (see the sketch below)
☑ Real-time serverless scoring API
☑ Distributed batch with Spark
☑ SQL (in-database scoring)
☑ Dataiku built-in engine
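To show how the real-time scoring API is typically queried, a minimal sketch using the dataikuapi client; the host, service id, endpoint id, and feature names are all hypothetical:

```python
from dataikuapi import APINodeClient

# Point the client at a deployed API service on an API node.
client = APINodeClient("https://apinode.example.com:12000", "churn_service")

# Send one record and read back the prediction
# (the exact response shape can vary by version).
result = client.predict_record(
    "churn_model",                           # endpoint id
    {"tenure": 12, "monthly_charges": 70.5}  # hypothetical feature schema
)
print(result["result"]["prediction"])
```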


Model Deployment

• Model versioning
• Batch scoring
• Real-time scoring
☑ Expose your models through REST APIs for real-time scoring by other applications
• Expose arbitrary functions and models through REST APIs
☑ Write custom R, Python or SQL based functions or models
☑ Automatically turn them into API endpoints for operationalization
• Easily manage all your model deployments
☑ One-click deployment of models
• Docker & Kubernetes
☑ Deploy models into Docker containers for operationalization
☑ Automatically push images to Kubernetes clusters for high scalability
☑ Works “out of the box” with Spark on Kubernetes
• Model monitoring mechanism
☑ Control model performances over time
☑ Data drift detection
☑ Automatically retrain models in case of performance drift
☑ Customize your retraining strategies
• Logging
☑ Log and audit all queries sent to your models

Unsupervised Learning

• Automated features engineering (similar to supervised learning)
• Optional dimensionality reduction
• Outliers detection
• Algorithms
☑ K-means
☑ Gaussian Mixture
☑ Agglomerative clustering
☑ Spectral clustering
☑ DBSCAN
☑ Interactive clustering (two-step clustering)
☑ Isolation forest (anomaly detection)
☑ Custom models

Model Training

• Train models over Kubernetes

Deep Learning

• Support for Keras with TensorFlow backend
• User-defined model architecture (see the sketch below)
• Personalize training settings
• Support for multiple inputs for your models
• Support for CPU and GPU
• Support for pre-trained models
• Extract features from images
• TensorBoard integration
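A minimal sketch of the kind of user-defined architecture code this supports, in plain Keras with the TensorFlow backend; the function signature and layer sizes are illustrative, not Dataiku's exact contract:

```python
from tensorflow import keras

def build_model(n_features, n_classes):
    # A small fully-connected classifier; widths are illustrative.
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```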

Output Features
After all the work of finding insights, it’s important to effectively
communicate them to stakeholders around the organization to inspire
action. Dataiku puts the power of AI in the hands of everyone to make
intelligence-driven decisions.

Charts
☑ Bar, line, curve, pie, donut, scatter, boxplot, 2D distribution, lift
☑ Maps: scatter, binned, administrative
☑ Tables

Dashboards
☑ Create interactive insights with charts, tables, notebook exports, webapps and more

Dataiku Applications
☑ Create user-friendly interfaces on top of projects for users to customize and parametrize in a few clicks and without code
☑ Share applications on Dataiku as a recipe or as an API

Dataiku WebApps
☑ Use code to build highly customized applications that can be leveraged as an API for users (see the sketch below)
☑ Supports R-Shiny, Dash (Plotly), Bokeh, HTML, CSS, JS, and Flask
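For a flavor of the code-based webapps, a minimal sketch in plain Dash (Plotly); the layout, titles, and data are hypothetical:

```python
from dash import Dash, dcc, html
import plotly.express as px

app = Dash(__name__)

# One illustrative chart; a real webapp would read Dataiku datasets.
fig = px.bar(x=["A", "B", "C"], y=[3, 1, 2], title="Churn by segment")
app.layout = html.Div([html.H3("Demo webapp"), dcc.Graph(figure=fig)])

if __name__ == "__main__":
    app.run(debug=True)
```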

Automation Features
When it comes to streamlining and automating workflows, Dataiku
allows data teams to put the right processes in place to ensure models
are properly monitored and easily managed in production.

Data Flow
☑ Keep track of the dependencies between your datasets
☑ Manage the complete data lineage
☑ Check consistency of data, schemas or data types
☑ Organize flows into zones

Partitioning
☑ Leverage HDFS or SQL partitioning mechanisms to optimize computation time

Metrics & Checks
☑ Create Metrics assessing data consistency and quality
☑ Adapt the behavior of your data pipelines and jobs based on Checks against these Metrics
☑ Leverage Metrics and Checks to measure potential ML model drift over time

Monitoring
☑ Track the status of your production scenarios
☑ Visualize the success and errors of your Dataiku jobs

Automation Environments
☑ Use dedicated Dataiku Automation nodes for production pipelines
☑ Connect and deploy on production systems (data lakes, databases)
☑ Activate, use or revert multiple Dataiku project bundles

Scenarios
☑ Trigger the execution of your data flows and applications on a scheduled or event-driven basis
☑ Create complete custom execution scenarios by assembling a set of actions (steps)
☑ Leverage built-in steps or define your own steps through a Python API (see the sketch below)
☑ Publish the results of scenarios to various channels through Reporters (send emails with custom templates; attach datasets, logs, files, or reports; send notifications to Slack or Hipchat)
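A minimal sketch of a custom scenario step written against the Python scenario API; the dataset name and saved-model id are hypothetical:

```python
from dataiku.scenario import Scenario

scenario = Scenario()

# Each call runs as one step; a failing step aborts the scenario by default.
scenario.build_dataset("transactions_prepared")
scenario.train_model("churn_model_id")  # retrain a saved model by id
```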

Code
Work with the tools and languages you already know, even inside
Dataiku: everything can be done with code and fully customized.
And for tasks where it’s easier to use a visual interface, Dataiku
provides the freedom to switch seamlessly between the two.

Support of Multiple Languages for Coding “Recipes”
☑ Python
☑ R
☑ SQL
☑ Shell
☑ Hive
☑ Impala
☑ Spark Scala
☑ Spark SQL
☑ PySpark
☑ SparkR
☑ Sparklyr

Create and Use Custom Code Environments
☑ Support for multiple versions of Python (2.7, 3.4, 3.5, 3.6)
☑ Support for Conda
☑ Install R and Python libraries directly from Dataiku’s interface
☑ Open environment to install any R or Python libraries
☑ Manage package dependencies and create reproducible environments

Scale Code Execution
☑ Scale your code by submitting Python or R jobs to a Kubernetes cluster, either on-premises or through cloud services (EKS, AKS, GKE)

• Create reusable custom components
☑ Dataiku Plugins to package and ship complex code-based functions in a visual interface to less-technical users
☑ Extend native Dataiku capabilities through code-based Plugins (custom connectors, custom data preparation processors, custom web applications for interactive analysis and visualization, etc.)
☑ Create Python-based custom steps for your Dataiku recipes and scenarios

• Leverage your favorite IDE to develop and test code
☑ RStudio for R code
☑ Sublime Text
☑ VS Code
☑ PyCharm

• Interactive Notebooks for data scientists
☑ Full integration of Jupyter notebooks with Python, R or PySpark kernels
☑ Use pre-templated notebooks to speed up your work
☑ Interactively query databases or data lakes through SQL notebooks (support for Hive)
☑ Run Jupyter notebooks over Kubernetes

• Python & R Libraries
☑ Create your own R or Python libraries or helpers
☑ Share them across the whole Dataiku instance
☑ Easily use your pre-existing code assets
☑ Benefit from Git integration to streamline development workflows

• APIs (see the sketch below)
☑ Manage the Dataiku platform through the CLI or Python SDK
☑ Train and deploy ML models programmatically
☑ Expose custom Python & R functions through REST APIs
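As an illustration of the Python SDK, a minimal sketch; the host, API key, and project key are hypothetical:

```python
import dataikuapi

# Connect to a Dataiku instance with an API key.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "MY_API_KEY")

# Enumerate projects, then grab one to work with programmatically.
print(client.list_project_keys())
project = client.get_project("CHURN")
```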

Collaboration
Dataiku was designed from the ground up with collaboration in mind.
From knowledge sharing to change management to monitoring, data
teams — including scientists, engineers, analysts, and more — can
work faster and smarter together.

Shared Platform (for Data Scientists, Data Engineers, Analysts, etc.)

Version Control
☑ Git-based version control recording all changes made in Dataiku

Knowledge Management and Sharing
☑ Create and export Wikis to document projects
☑ Engage with other users of the platform through Discussions
☑ Tag, comment and favorite any Dataiku objects

Team Activity Monitoring

• Global search to quickly find all project assets, plugins, wiki, reference docs, etc.

• Share custom, code-based capabilities with less-technical users in a visual interface

• Shared code-based components
☑ Distribute reusable code snippets to all users
☑ Package arbitrarily complex functions, operations or business logic to be used by less-technical users
☑ Integrate with remote Git repositories such as GitHub

Governance & Security
Dataiku makes data governance easy, bringing enterprise-level security
with fine-grained access rights and advanced monitoring for admins or
project managers.

User Profiles

• Role-based access (fine-grained or custom)

• Authentication management
☑ Use SSO systems
☑ Connect to your corporate directory (LDAP, Active Directory…) to manage users and groups

• Enterprise-grade security
☑ Track and monitor all actions in Dataiku using an audit trail
☑ Authenticate against Hadoop clusters and databases through Kerberos
☑ Support for user impersonation for full traceability and compliance

• Resources management
☑ Dynamically start and stop Hadoop clusters from Dataiku
☑ Control server resource allocation directly from the user interface

• Platform management
☑ Integrate with your corporate workload management tools using the Dataiku CLI and APIs

Custom Policy Framework for Data Protection and External Regulations Compliance
☑ Implement GDPR rules and processes directly
☑ Framework capabilities
◊ Document data sources with sensitive information, and enforce good practices
◊ Restrict access to projects and data sources with sensitive information
◊ Audit the sensitive information in a Dataiku instance

Architecture
Dataiku was built for the modern enterprise, and its architecture
ensures that businesses can stay open (i.e., not tied down to a certain
technology) and that they can scale their data efforts.

• No client installation for Dataiku users

• Dataiku nodes (use dedicated Dataiku environments or nodes to design, run, and deploy your ML applications)

• Integrations
☑ Leverage distributed systems to scale computations through Dataiku
☑ Automatically turn Dataiku jobs into SQL, Spark, MapReduce, Hive, or Impala jobs for in-cluster or in-database processing to avoid unnecessary data movement or copies

• Modern architecture (Docker, Kubernetes, GPU for deep learning)

• Traceability and debugging through full system logs

• Open platform
☑ Native support of Jupyter notebooks
☑ Install and manage any of your favorite Python or R packages and libraries, or integrate with external Git repositories
☑ Freely reuse your existing corporate codebase
☑ Extend the Dataiku platform with custom components

©2021 DATAIKU | DATAIKU.COM
