
JOHNY SHAIK

[email protected]| +1(657) 667 - 6201 | linkedin.com/in/johny|github.com/johnyshaik


ELEVATOR PITCH

Experienced Data Professional skilled in crafting robust Data Models and warehouses and conducting high-level analytics across diverse
industries. Proficient in GenAI, ML, NLP, and LLMs. Thrives in dynamic environments, leveraging emerging technologies to drive project
success. Collaborative team player with exceptional interpersonal and analytical skills.

PROFESSIONAL SUMMARY
 With a wealth of experience spanning 8+ years, I have established myself as a trusted Data Expert proficient in data cleaning,
transformations, profiling, modeling, visualization, engineering, and warehousing.
 My expertise encompasses the fields of Data Science, Data Mining, Advanced Analytics, BI, and Reporting, consistently driving
cost savings and maximizing ROI.
 Within the healthcare industry, I have played a significant role in handling claims data for Medicare, Medicare Advantage, and Commercial Insurance Payers, gaining extensive experience in this domain.
 Experience with Amazon Web Services (Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, Amazon SQS, AWS Identity and Access Management (IAM), Amazon SNS, Amazon CloudWatch, Amazon EBS, Amazon CloudFront, VPC, DynamoDB, Lambda, and Redshift).
 Good knowledge in Data Extraction, Transforming and Loading (ETL) using various tools such as SQL Server Integration
Services (SSIS), Data Transformation Services (DTS).
 Data ingestion to Azure services and processing of the data in Azure Databricks.
 Experience in using AWS Lambda and Glue for creating highly functional Data pipelines.
 Developed and trained deep learning models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), or
Transformers for Gen AI.
 Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto scaling groups.
 Hands-on experience working with GCP services like BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, Dataproc, and Operations Suite (Stackdriver).
 Experience in Designing, Architecting, and implementing cloud-based web applications using AWS and GCP.
 Developed NLP models for tasks such as text classification, named entity recognition (NER), sentiment analysis, document
summarization, and machine translation using machine learning and deep learning techniques.
 Highly skilled at using SQL, NumPy, Pandas, and Spark for data analysis, model building, and cognitive design, as well as deploying and operating highly available, scalable, and fault-tolerant systems using Amazon Web Services (AWS).
 Implemented & maintained the branching and build/release strategies using Version Control tools GIT, Subversion, Bitbucket and
experienced in migrating GIT repositories to AWS.
 Strong Hadoop and staging support experience with major Hadoop distributions like Cloudera, Hortonworks, Amazon EMR, and Azure HDInsight.
 Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
 Experience with Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB).
 Worked on various Azure services like Data Lake, Data Lake Analytics, SQL Database, Synapse, Databricks, Data Factory, Analysis Services, Logic Apps, and SQL.
 Proficiency in extracting structured data from semi-structured JSON format using dynamic SQL allows for seamless integration into
relational data tables, contributing to enhanced data management.
 As an accomplished Data Modeler, I am well-versed in Logical and Physical Data Modeling, Data Profiling, and Data Quality
maintenance. My expertise includes the development of comprehensive data mapping documents and functional specifications.
 Leveraging Python libraries such as Pandas, NumPy, Scikit-learn, seaborn, Matplotlib, and Plotly, I effectively mine, transform,
and analyze data, utilizing advanced techniques and tools.
 I adhere to industry best practices in Data Visualization, ensuring that my work effectively communicates with the intended business
audience and conveys the desired insights.
TECHNICAL SKILLS:

Languages Python, R, SQL, T-SQL, Java.

Databases MySQL, PostgreSQL, Oracle, HBase, Amazon Redshift, MS SQL Server, Teradata.
Statistical Methods Hypothesis Testing, ANOVA, Time Series, Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Auto-correlation.
Machine Learning Regression analysis, Bayesian Method, Decision Tree, Random Forests, Support
Vector Machine, Neural Network, Sentiment Analysis, K-Means Clustering, KNN and Ensemble
Method, Natural Language Processing (NLP)
Cloud Services Azure, AWS, GCP.

Hadoop Ecosystem Hadoop, Spark, MapReduce, Hive, HDFS, Sqoop, Flume

Reporting Tools Tableau Suite of Tools which includes Desktop, Server and Online, Server Reporting Services (SSRS)
Data Visualization Tableau, Matplotlib, Seaborn, ggplot2
Operating Systems PowerShell, UNIX/UNIX Shell Scripting (via PuTTY client), Linux and Windows

Client: BestBuy, Charlotte, NC. Mar 2023 – Present


Role: Senior Data Scientist
Responsibilities:
 Automated Email Response System using GPT-3.5: Developed and implemented an automated email response system leveraging
GPT-3.5, harnessing LLM capabilities to generate personalized and contextually relevant replies, improving response times, and
heightening customer satisfaction.
 Interactive Document Summarization Tool with Transformer Networks: Innovated an interactive document summarization tool
powered by transformer networks, harnessing LLM capabilities for efficient content analysis and customization and seamlessly
integrating the tool into document management systems to streamline operations.
 Led the full machine learning system implementation process: data collection, model design, feature selection, system implementation, and evaluation.
 Leveraged Gen AI for anomaly detection tasks, training models to distinguish between normal and abnormal patterns in data, such as
detecting anomalies in medical images or financial transactions.
 Developed deep learning models and neural networks architectures for advanced Gen AI applications, leveraging techniques such as
convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers.
 Developed predictive models, forecasts, and analysis to turn data into actionable solutions.
 Recommended improvements to underperforming product features and forecasted demand for products using ARIMA (a minimal forecasting sketch follows this list).
 Utilized machine learning algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, &
KNN for data analysis.
 Created synthetic datasets using generative AI models to augment training data, enhancing model robustness, generalization, and performance across various machine learning tasks.
 Worked on Natural Language Processing with NLTK module for application development for automated customer response.
 Led research and development projects focused on Generative AI applications, collaborating with cross-functional teams to explore
innovative solutions and drive business impact.
 Designed and developed conversational agents, chatbots, or virtual assistants using NLP technologies to understand user queries,
generate responses, and provide personalized assistance through natural language interactions.
 Implemented Gen AI models to generate human-like text, including language translation, text summarization, dialogue generation, or
creative writing tasks.

 Advanced text analytics using deep learning techniques such as convolutional neural networks to determine the sentiment of texts.
 Implemented text processing and NLP techniques like topic modeling, aspect-based sentiment analysis.
 Worked closely with other data scientists to assist on feature engineering, model training frameworks, and model deployments
implementing documentation discipline.
 Supported client by developing Machine Learning Algorithms on Big Data using PySpark to analyze transaction fraud, Cluster Analysis
etc.
 Led initiatives to enhance cybersecurity measures by developing robust anomaly detection systems, effectively identifying and mitigating potential threats to organizational infrastructure and data integrity.
 Leveraged operational AI techniques to streamline Level 1 (L1) processes, resulting in significant efficiency improvements and a reduction in manual intervention, thereby optimizing resource allocation and enhancing overall operational performance.
 Implemented automation solutions for L1 processes, effectively reducing response times and minimizing human error, resulting in
enhanced productivity and seamless workflow management across organizational functions.
 Participated in open-source communities, contributing code, research, and insights to advance the field of Generative AI and foster
collaboration among researchers and practitioners.
 Utilized generative AI techniques for artistic and design applications, creating digital artwork, graphic designs, or generative sculptures.
 Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction (a hedged boto3 sketch follows this role's technical environment).
 Collaborated with data scientists and cross-functional teams to identify opportunities for applying Gen AI techniques in business
contexts.
 Worked on tuning Spark applications to enhance processing speed and resource utilization, achieving significant performance gains.
 Leveraged Spark for data ingestion, extraction, transformation, and loading (ETL) processes, ensuring efficient and scalable handling of
diverse data sources.
 Developed MapReduce/Spark modules for machine learning & predictive analytics in Hadoop on AWS.

 Utilized Spark's machine learning capabilities to perform classifications, regressions, and dimensionality reduction on large datasets.

 Developed and implemented robust test automation frameworks using Selenium WebDriver for efficient and maintainable test scripts.
 Applied Bayesian techniques to tackle uncertainty in model parameters and make robust predictions in dynamic environments.
 Conducted sensitivity analyses and model validation using Bayesian methods to ensure the reliability of model outputs.
 Integrated Selenium tests into continuous integration systems such as Jenkins, ensuring automated execution with every build.
 Stayed abreast of the latest Selenium updates and best practices, incorporating new features and improvements into the automation
process.
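The ARIMA bullet above refers to demand forecasting; the snippet below is a minimal, hedged sketch of that kind of workflow using statsmodels. The file name, column names, and (1, 1, 1) order are illustrative assumptions, not details from the engagement.

# Illustrative ARIMA demand-forecasting sketch (assumed file and column names).
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical weekly demand series indexed by week.
demand = pd.read_csv("weekly_demand.csv", parse_dates=["week"], index_col="week")

# Fit a simple ARIMA(1, 1, 1); the order is a placeholder, not a tuned value.
fitted = ARIMA(demand["units_sold"], order=(1, 1, 1)).fit()

# Forecast the next 12 weeks of demand.
print(fitted.forecast(steps=12))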

Technical Environment: MLbase, Pyspark, AWS, Azure, Agile, MapReduce, regression, logistic regression, random forest, neural networks,
Avro, NLTK, XML, MLLib, Git & JSON.
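As a companion to the Step Functions bullet above, here is a minimal boto3 sketch of starting a Step Functions execution that wraps a SageMaker training-and-deployment workflow. The state machine ARN, bucket paths, and payload keys are hypothetical placeholders, not values from the project.

# Hedged sketch: trigger a Step Functions state machine that orchestrates
# SageMaker training and deployment. The ARN and S3 paths are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:train-and-deploy",
    input=json.dumps({
        "TrainingDataS3Uri": "s3://example-bucket/train/",
        "ModelOutputS3Uri": "s3://example-bucket/models/",
    }),
)
print(response["executionArn"])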

Client: Morgan Stanley Jan 2022 – Feb 2023


Role: Senior Data Scientist
Responsibilities:
· Implemented Machine Learning, Computer Vision, Deep Learning and Neural Networks algorithms using TensorFlow, Keras and designed
Prediction Model using Data Mining Techniques with help of Python, and Libraries like NumPy, SciPy, Matplotlib, Pandas, Scikit-learn.
· Used pandas, NumPy, Seaborn, SciPy, matplotlib, Scikit-learn, NLTK for developing various machine learning algorithms.
· Contributed to open-source projects and research publications in the field of Generative AI, staying abreast of the latest advancements and best practices.

· Worked with text feature engineering techniques such as n-grams, TF-IDF, and word2vec (see the TF-IDF sketch after this list).

· Applied Support Vector Machines (SVMs) and their kernels, such as polynomial and RBF kernels, to machine learning problems.
· Worked on imbalanced datasets and used the appropriate metrics while working on the imbalanced datasets.
· Worked with deep neural networks, Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs).
· Developed low-latency applications and interpretable models using machine learning algorithms.
· Participated in all phases of data mining; data collection, data cleaning, developing models, validation, visualization and performed Gap
analysis.
· Applied unsupervised learning techniques to perform cluster analysis on large-scale datasets, uncovering hidden patterns and structures within
the data to drive strategic decision-making processes.
· Programmed a utility in Python that used multiple packages (SciPy, NumPy, pandas).
· Implemented Classification using supervised algorithms like Logistic Regression, SVM, Decision trees, KNN, Naive Bayes.
· Responsible for the design and development of advanced R/Python programs to transform and harmonize data sets in preparation for modeling.

· Worked with Data Architects and IT Architects to understand the movement and storage of data, using ER Studio 9.7.
· Utilized advanced clustering algorithms such as K-means, hierarchical clustering, and DBSCAN to segment data into distinct groups based on similarities, facilitating targeted marketing strategies and personalized customer experiences (a clustering sketch follows this role's technical environment).
· Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each
document a response label for further classification.
· Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
· Implemented Agile Methodology for building an internal application.
· Data manipulation and aggregation from different sources using Nexus, Toad, Business Objects, Powerball, and Smart View.

· Interaction with Business Analyst, SMEs, and other Data Architects to understand Business needs and functionality for various project
solutions.
· Researched, evaluated, architected, and deployed new tools, frameworks, and patterns to build sustainable Big Data platforms for the clients.
· Performed data transformation from various sources, data organization, and feature extraction from raw and stored data.
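The n-gram/TF-IDF and SVM bullets above describe the kind of text-classification pipeline sketched below with scikit-learn; the toy corpus and labels are invented purely for illustration.

# Illustrative TF-IDF (uni/bi-gram) features feeding an RBF-kernel SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny placeholder corpus; 1 = positive sentiment, 0 = negative.
texts = ["great quarterly results", "severe market downturn",
         "steady revenue growth", "unexpected trading loss"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), SVC(kernel="rbf"))
clf.fit(texts, labels)
print(clf.predict(["strong growth expected"]))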
Technical Environment: Python, MLlib, regression, PCA, T-SNE, Cluster analysis, SQL, Scala, NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, PySpark, CNNs, RNNs, Oracle 12c, Netezza, MySQL Server, SSRS, T-SQL, Tableau, Teradata, random forest, OLAP, Azure, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS, Linux.
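For the segmentation bullet referenced above, the following is a small, hedged scikit-learn sketch of K-means and DBSCAN over a synthetic feature matrix; the feature semantics and parameter values are assumptions.

# Hedged clustering sketch: K-means for fixed segments, DBSCAN for density-based groups.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(size=(200, 4))  # e.g., spend, frequency, recency, tenure (synthetic)
scaled = StandardScaler().fit_transform(features)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(scaled)

print(np.bincount(kmeans_labels))  # segment sizes from K-means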

Client: INTEGRIS Health, Remote Aug 2020 – Dec 2021


Role: Data Scientist
Responsibilities:

· Developed a Machine Learning testbed with 24 different model learning and feature learning algorithms.

· Responsible for working with various teams on a project to develop an analytics-based solution to specifically target roaming subscribers.

· Worked with several R packages including knitr, dplyr, SparkR, CausalInfer, Space-Time.

· Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms.

· Combination of these elements (travel prediction & multi-dimensional segmentation) would enable operators to conduct highly targeted and
personalized roaming services campaigns leading to significant subscriber uptake.
· Scaled up Machine Learning pipelines to 4,600 processors and 35,000 GB of memory, achieving 5-minute execution.
· Developed Python, PySpark, and Hive scripts to filter/map/aggregate data, and used Sqoop to transfer data to and from Hadoop (see the PySpark sketch after this list).
· Configured the project on WebSphere 6.1 application servers.
· Through thorough systematic search, demonstrated performance surpassing the state of the art (deep learning).
· Developed on-disk, very large (100 GB+), highly complex Machine Learning models.
· Developed advanced time series models to analyze and forecast complex temporal data patterns.
· Collaborated with cross-functional teams to integrate time series models into business decision-making processes.
· Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction, and utilized the engine to increase user lifetime by 45% and triple user conversions for target categories.
· Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
· Successfully addressed the challenges of modeling intricate time series data with irregular patterns and non-linear trends.
· Extensively worked on the data modeling tool Erwin Data Modeler to design the data models.
· Developed various QlikView data models by extracting and using data from various source files, DB2, Excel, flat files, and Big Data.
· Participated in all phases of data mining: data collection, data cleaning, developing models, validation, and visualization, and performed gap analysis.
· Designed both 3NF data models for ODS, OLTP systems and Dimensional Data Models using Star and Snowflake Schemas.
· Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each
document a response label for further classification.

· Created SQL tables with referential integrity and developed queries using SQL, SQL PLUS and PL/SQL.
· Designed and developed Use Case, Activity Diagrams, Sequence Diagrams, OOD (Object oriented Design) using UML and Visio.
· Interaction with Business Analysts, SMEs, and other Data Architects to understand Business needs and functionality for various project
solutions.
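The PySpark scripting bullet above (filter/map/aggregate) corresponds to the kind of job sketched below; the input path, column names, and thresholds are illustrative assumptions rather than the actual subscriber data.

# Minimal PySpark filter/aggregate sketch with placeholder paths and columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("roaming-usage-aggregation").getOrCreate()

usage = spark.read.parquet("s3://example-bucket/roaming_usage/")

summary = (
    usage.filter(F.col("roaming_minutes") > 0)  # keep roaming activity only
         .groupBy("subscriber_id", "country")
         .agg(F.sum("roaming_minutes").alias("total_minutes"),
              F.countDistinct("session_id").alias("sessions"))
)
summary.write.mode("overwrite").parquet("s3://example-bucket/roaming_summary/")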

Technical Environment: AWS, R, Informatica, Python, HDFS, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, MapReduce, Rational Rose, SQL, and MongoDB.
Client: Novartis Healthcare June 2019 to July 2020
Role: Data Scientist
Responsibilities:
· Developed Spark Applications by using Python and Implemented Apache Spark data processing Project to handle data from various RDBMS
and Streaming sources.
· Responsible for building scalable distributed data solutions using Hadoop.
· Built data pipelines using Airflow in GCP for ETL-related jobs using different Airflow operators.
· Used Apache Airflow in the GCP Composer environment to build data pipelines, using operators such as the Bash operator, Hadoop operators, Python callables, and branching operators (see the DAG sketch after this list).
· Maintained the Hadoop cluster on GCP using Google Cloud Storage, BigQuery, and Dataproc.
· Worked with Spark to improve performance and optimization of the existing algorithms in Hadoop.
· Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Storage, and BigQuery.
· Used the GCP environment for Cloud Functions for event-based triggering, and for Cloud Monitoring and Alerting.
· Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in a GCS bucket.
· Worked on the Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.
· Used Spark Streaming APIs to perform transformations and actions on the fly.
· Developed a Kafka consumer in Python for consuming data from Kafka topics (a consumer sketch follows this role's technical environment).
· Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
· Designed a GCP Cloud Composer DAG to load data from on-prem CSV files to GCP BigQuery tables, and scheduled the DAG to load in incremental mode.
· Configured Snowpipe to pull data from Google Cloud buckets into Snowflake tables.
· Used HiveQL to analyze the partitioned and bucketed data, and executed Hive queries on Parquet tables.
· Stored data in Hive to perform analysis that met the business specification logic.
· Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering roles.
· Performed cluster testing of HDFS, Hive, Pig, and MapReduce, and set up cluster access for new users.
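The Composer/Airflow bullets above describe loading CSV files from GCS into BigQuery; the DAG below is a minimal, hedged sketch of that pattern. The bucket, dataset, and table names are placeholders, not the project's actual resources.

# Hedged Airflow sketch: daily load of CSV files from a GCS bucket into BigQuery.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_csv_to_bigquery",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id="load_csv_to_bq",
        bucket="example-landing-bucket",
        source_objects=["incoming/*.csv"],
        destination_project_dataset_table="example_dataset.staging_table",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",  # append for incremental loads
    )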
Technical Environment: Spark, Spark Streaming, Spark SQL, GCP, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra & Agile Methodologies.
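As mentioned in the Kafka consumer bullet above, below is a minimal sketch using the kafka-python package; the topic name, broker address, and consumer group are assumptions for illustration.

# Hedged kafka-python consumer sketch; topic and broker values are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "weblog-events",
    bootstrap_servers=["localhost:9092"],
    group_id="analytics-consumers",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each message.value is a parsed JSON record from the topic.
    print(message.value)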
Client: Synchrony Technologies, Hyderabad Sep 2015 to Feb 2019
Role: Data Analyst
Responsibilities:
· Identify business, functional, and technical requirements through meetings, interviews, and JAD sessions.
· Define the ETL mapping specification and design the ETL process to source data from source systems and load it into DWH tables.
· Designed the logical and physical schema for data marts and integrated the legacy system data into data marts.
· Integrated DataStage metadata with Informatica metadata and created ETL mappings and workflows.
· Designed mappings and identified and resolved performance bottlenecks in source-to-target mappings.
· Developed Mappings using Source Qualifier, Expression, Filter, Look up, Update Strategy, Sorter, Joiner, Normalizer and Router
transformations.
· Involved in writing, testing, and implementing triggers, stored procedures and functions at Database level using PL/SQL.
· Developed Stored Procedures to test ETL Load per batch and provided performance optimized solution to eliminate duplicate records.
· Provide the team with technical leadership on ETL design and development best practices.
Environment: Informatica Power Center v 8.6.1, Power Exchange, IBM Rational Data Architect, MS SQL Server, Teradata, PL/SQL, IBM
Control Center, TOAD, Microsoft Project Plan.

Education: Malla Reddy Engineering College, Hyderabad, India. Aug 2011 to July 2015.
