DATASCIENCE Capstone
Dec 9 2023
OUTLINE
1. EXECUTIVE SUMMARY
1.1. SUMMARY OF ALL RESULTS
1.2. SUMMARY OF METHODOLOGIES
2. INTRODUCTION
2.1. PROJECT BACKGROUND AND CONTEXT
2.2. PROBLEMS ANSWERED
3. METHODOLOGY
3.1. EXECUTIVE SUMMARY
Data collection methodology
Perform data wrangling
Perform exploratory data analysis (EDA) using visualization and SQL
Perform interactive visual analytics using Folium and Plotly Dash
Perform predictive analysis using classification models
4. DATA COLLECTION
4.1. DATA COLLECTION – SPACEX API
4.2. DATA COLLECTION – SCRAPING
4.3.
5. DATA WRANGLING
6. EXPLORATORY DATA ANALYSIS
6.1. EDA WITH DATA VISUALIZATION
6.2. EXPLORATORY DATA ANALYSIS WITH SQL
7. BUILD AN INTERACTIVE MAP WITH FOLIUM
8. BUILD A DASHBOARD WITH PLOTLY DASH
9. PREDICTIVE ANALYSIS
Flight Number vs. Launch Site
Payload vs. Launch Site
Success Rate vs. Orbit Type
Flight Number vs. Orbit Type
Payload Mass vs Orbit Type
10.
11. LAUNCH SITES PROXIMITY ANALYSIS
11.1. GLOBAL MAP OF ALL LAUNCH SITES’ LOCATION
11.2. LAUNCH OUTCOMES FOR EACH LAUNCH CENTER
11.3. VANDENBERG CA SITE ANALYSIS
12. PLOTLY DASHBOARD
12.1. LAUNCH RECORDS - TOTAL LAUNCHES BY SITE NAME
12.2. LAUNCH RECORDS MOST SUCCESSFUL SITE
12.3. PAYLOAD VS LAUNCH OUTCOME FOR ALL SITES
13. PREDICTIVE ANALYSIS – CLASSIFICATION
13.1. ACCURACY OF THE MODEL
13.2. CONFUSION MATRIX ANALYSIS
1. EXECUTIVE SUMMARY
This is a capstone project for data analysis with Python, delivered on edX. The training
materials were initially well thought out, but the notebooks have not been updated, some
libraries are obsolete, and the data sets are not consistent across all analyses.
We need to see where SpaceX underperforms. All four US launch sites are farther from the
Equator than the European launch site in French Guiana. Due to the current political
situation, the Soyuz launch pad there is not used by the Russian counterparts. ORINOCO could
use the French Guiana site (Figure 22) for heavy payloads, given that less fuel is needed
for lift. If we limit the analysis to the US, we notice that we can launch heavy payloads
from Florida and moderate payloads from California. The California site is not in the
hurricane path and is attractive for customers from the Americas and from the Far East. Its
location is shown in section 11.3, Vandenberg CA Site analysis.
SpaceX does not have a good record for placing payloads in GTO. More recent launches show
that this rate is improving (Figure 10), so we could court potential clients for this orbit
placement. The SpaceX success rate by orbit is shown in Figure 9. We also notice that the
payload for this orbit is between 4000 and 6000 kg, ideal for launching from California
(see Figure 11).
According to the SQL analysis, SpaceX has tremendous success using F9 FT, F9 B4 and F9 B5
boosters. The B5 boosters are used for heavy payloads. The dashboard shows another picture,
with B4 having a 50% success rate (Figure 29), but the data set for the dashboard is
incomplete (see Figure 1).
If we were to predict a successful launch based on historic data, several classification
models can be used. For the moment, three models have the same accuracy of 0.83 (83%); see
section 13.1, Accuracy of the Model.
One can use whichever model one is more familiar with. The confusion matrices show similar
data. False positives are encountered for all three successful models, as can be seen in
Figure 30.
One should not use node clustering, since both false positives and false negatives are
present.
2. INTRODUCTION
2.1. Project background and context
The data analysis department at ORINOCO Space produced this report at the request of CFO
Artu Ditto. The findings of the report will be used to design a strategy to compete with
SpaceX. We have been tasked to analyze whether the launch costs that our competitor, SpaceX,
provides in bidding documents are substantiated by the successful reuse of the first stage
of the rockets the competitor uses.
We need to see where SpaceX underperforms. Falcon 9, a two-stage rocket, is the flagship of
SpaceX. A payload launch costs on average 62 million dollars; other providers charge upward
of 165 million dollars per launch. Much of the savings comes from SpaceX reusing the first
stage. We need to determine whether the first stage will land; then we can determine the
cost of a launch. We can also determine which launch pads can accommodate ORINOCO launches
to compete with SpaceX.
3. METHODOLOGY
I used the data from Skills Network, which is incomplete.
Figure 1
In support of this statement, I am providing this snapshot. As you can see, there are more
than 100 records for the SQL analysis and only 55 for building the Dash dashboard. We used
the pandas library to process data and the plotly library to plot graphs. The dash library
was used to create an interactive dashboard presenting the SpaceX correlation between
payload and success for a specific site and booster type. A slider sets the payload mass
range.
4. DATA COLLECTION
The completed SpaceX API data collection process can be found here: Data
Collection SpaceX API
We queried the SpaceX Application Programming Interface (API) using the Python
library requests and obtained a JSON response. Then we used functions from the
pandas, numpy and datetime Python libraries to transform the SpaceX API data into
data frames. This allows for flexibility both ways, and it is easy to eliminate
unnecessary data. We kept only Falcon 9 booster data and expanded the data frame by
querying the SpaceX API for launch longitude and latitude, payload, payload orbit,
launch date, launch site and outcome, with helper functions provided by IBM Skills
Network.
Figure 2
Figure 3
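The collection step can be illustrated with a minimal sketch. It is not the notebook code: the endpoint URL, the field names and the sample records are assumptions chosen for illustration, and the sample stands in for a live API call.

```python
import pandas as pd
import requests

# Assumed endpoint of the public SpaceX REST API (not named in this report).
SPACEX_URL = "https://api.spacexdata.com/v4/launches/past"


def fetch_launches(url=SPACEX_URL, timeout=30):
    """Download the launch records and return the decoded JSON list."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def launches_to_frame(records):
    """Flatten a list of JSON launch records into a data frame."""
    frame = pd.json_normalize(records)
    # Keep only the calendar date of the launch.
    frame["date"] = pd.to_datetime(frame["date_utc"]).dt.date
    return frame[["flight_number", "date", "rocket", "success"]]


# Hypothetical records shaped like an API response, used instead of a live call.
sample = [
    {"flight_number": 1, "date_utc": "2006-03-24T22:30:00.000Z",
     "rocket": "falcon1", "success": False},
    {"flight_number": 6, "date_utc": "2010-06-04T18:45:00.000Z",
     "rocket": "falcon9", "success": True},
]
launches = launches_to_frame(sample)

# Keep only Falcon 9 rows, as described above.
falcon9 = launches[launches["rocket"] == "falcon9"]
```

Calling `launches_to_frame(fetch_launches())` would run the same transformation on live data, provided the response carries the assumed fields.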
The completed web scraping notebook process can be found here: Data
Collection Web Scraping
4.3. A Wikipedia page on Falcon 9 launches was scraped for information. We retrieved
the page with the Python library requests, obtaining an HTML document. Then we used
the Beautiful Soup library to build a soup object and locate the HTML table with the
launch records. This was the basis for creating a dictionary of useful data.
We used helper functions provided by IBM Skills Network to process the web-scraped
data. We eliminated type errors in the process of creating the dictionary. The
dictionary was then converted to a data frame. We notice that the data columns have
some elements in common with the data from the SpaceX API, but the data is formatted
differently.
Figure 4
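The table-to-dictionary-to-data-frame path can be sketched as follows. The HTML fragment, the table class and the column names are hypothetical stand-ins for the downloaded page, and the Skills Network helper functions are not reproduced.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded wiki page.
html = """
<table class="wikitable">
  <tr><th>Flight No.</th><th>Launch site</th><th>Payload mass</th></tr>
  <tr><td>1</td><td>CCAFS</td><td>0 kg</td></tr>
  <tr><td>2</td><td>CCAFS</td><td>525 kg</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")
rows = table.find_all("tr")

# Column names come from the header cells of the first row.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
launch_dict = {name: [] for name in headers}

# Every remaining row contributes one value per column.
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    for name, value in zip(headers, cells):
        launch_dict[name].append(value)

scraped = pd.DataFrame(launch_dict)
# Scraped values are text; convert the flight number to an integer.
scraped["Flight No."] = scraped["Flight No."].astype(int)
```

The explicit type conversion at the end is one example of the type clean-up mentioned above; the payload column would need similar parsing before it could be compared with the API data.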
5. DATA WRANGLING
The complete Data Wrangling notebook can be found here: Data Wrangling
There are four steps in the data wrangling:
Discovery: define the success rate for Falcon 9 rockets
Transformation:
- correct missing data
- add columns such as Orbit and Class (landing outcome) to the data frame
- convert objects in the data frame to strings or integers
Validation: check the landing success rate
Publishing: output the data frames to a CSV file for Exploratory Data Analysis
Here is a comparison between the data collection output and the data wrangling output
for SpaceX:
Figure 5
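The four wrangling steps can be sketched on a toy data frame. The rows, the column names and the outcome labels are assumptions for illustration, and the output file name in the last comment is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names and outcome labels are assumptions.
df = pd.DataFrame({
    "PayloadMass": [500.0, np.nan, 3100.0, 4500.0],
    "Orbit": ["LEO", "GTO", "ISS", "GTO"],
    "Outcome": ["None None", "False Ocean", "True RTLS", "True ASDS"],
})

# Transformation: replace missing payload values with the column mean.
df["PayloadMass"] = df["PayloadMass"].fillna(df["PayloadMass"].mean())

# Transformation: Class is 1 when the first stage landed, 0 otherwise.
df["Class"] = df["Outcome"].str.startswith("True").astype(int)

# Validation: the mean of the 0/1 Class column is the landing success rate.
success_rate = df["Class"].mean()

# Publishing: write the wrangled frame for the next notebook.
# df.to_csv("wrangled_launches.csv", index=False)  # hypothetical file name
```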
6. EXPLORATORY DATA ANALYSIS
6.1. EDA with Data Visualization
The complete Exploratory Data Analysis notebook can be found here: EDA
with data visualization notebook
Using the wrangled SpaceX data, we visualized how launch success or failure depends
on the independent variables. We used pandas and numpy to process the data, and
matplotlib and seaborn to build the graphs. We created the following plots:
• plot out the FlightNumber vs. PayloadMass and overlay the outcome of
the launch
• plot the relationship between Flight Number and Launch Site and overlay
the outcome of the launch
• plot the relationship between launch sites and the payload mass of each
launch and overlay the outcome of the launch
• plot the success rate of each Orbit type
• plot the relationship between Flight Number and Orbit type and overlay
the outcome of the launch
• plot the relationship between Orbit type and payload mass of each launch
and overlay the outcome of the launch
• plot the trend of launch success over time
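As an illustration of the data behind one of these plots, the success rate per orbit type can be computed with a group-by before it is drawn. The records below are hypothetical and the column names are assumptions.

```python
import pandas as pd

# Hypothetical wrangled records; Class is 1 for a successful landing.
df = pd.DataFrame({
    "Orbit": ["GTO", "GTO", "LEO", "LEO", "ISS", "ISS"],
    "Class": [0, 1, 1, 1, 1, 0],
})

# The mean of the 0/1 Class column per orbit is the success rate for that orbit.
orbit_success = df.groupby("Orbit")["Class"].mean().sort_values(ascending=False)

# With matplotlib installed, the bar chart would be drawn with:
# orbit_success.plot(kind="bar")
```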
The plot template offered by Skills Network has a width-to-height ratio of 5:1, which
makes observation very difficult.
The variation of success/failure depending on Flight Number and Payload can be
better discerned using a 2:1 ratio in the scatter plot.
Figure 6
Based on these visual cues, we generated a data frame containing the variables
that affect the success rate. The variables chosen are:
FlightNumber
PayloadMass
Orbit
LaunchSite
Flights
GridFins
Reused
Legs
LandingPad
Block
ReusedCount
Serial
6.2. Exploratory Data Analysis with SQL
The complete notebook with Exploratory Data Analysis with SQL can be found
here: EDA with SQL
SQL queries run much faster than row-by-row manipulation of CSV data, although for
the current project the same data could have been extracted using pandas data frames.
The queries were:
• Display the names of the unique launch sites in the space mission
• Display 5 records where launch sites begin with the string 'KSC'
• Display the total payload mass carried by boosters launched by NASA
(CRS)
• Display average payload mass carried by booster version F9 v1.1
• List the date where the successful landing outcome in drone ship was
achieved.
• List the names of the boosters which have success in ground pad and
have payload mass greater than 4000 but less than 6000
• List the total number of successful and failure mission outcomes
• List the names of the booster versions which have carried the maximum
payload mass. Use a subquery
• List the records which will display the month names, successful landing
outcomes in ground pad, booster versions, launch site for the months in
year 2017
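Three of the queries above can be sketched against an in-memory SQLite database. The table name, the column names and the rows are assumptions for illustration; the notebook may have used a different database and schema.

```python
import sqlite3

# In-memory database with hypothetical rows; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SPACEXTBL ("
    "Booster_Version TEXT, Launch_Site TEXT, PAYLOAD_MASS__KG_ INTEGER)"
)
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?, ?)",
    [
        ("F9 v1.1", "CCAFS LC-40", 2500),
        ("F9 FT B1031.1", "KSC LC-39A", 5300),
        ("F9 B5 B1048.4", "KSC LC-39A", 15600),
        ("F9 B5 B1049.4", "CCAFS SLC-40", 15600),
    ],
)

# Names of the unique launch sites.
sites = [row[0] for row in conn.execute(
    "SELECT DISTINCT Launch_Site FROM SPACEXTBL ORDER BY Launch_Site"
)]

# Up to 5 records where the launch site begins with 'KSC'.
ksc_rows = conn.execute(
    "SELECT * FROM SPACEXTBL WHERE Launch_Site LIKE 'KSC%' LIMIT 5"
).fetchall()

# Booster versions that carried the maximum payload mass, using a subquery.
heaviest = [row[0] for row in conn.execute(
    "SELECT Booster_Version FROM SPACEXTBL "
    "WHERE PAYLOAD_MASS__KG_ = (SELECT MAX(PAYLOAD_MASS__KG_) FROM SPACEXTBL) "
    "ORDER BY Booster_Version"
)]
conn.close()
```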
7. BUILD AN INTERACTIVE MAP WITH FOLIUM
The complete notebook with Folium Map can be found here: Folium Launch
Site Map
8. BUILD A DASHBOARD WITH PLOTLY DASH
The script generating the dashboard can be found here: Site Launch Dashboard
The data set offered by Skills Network is half the size of the data analyzed with the other
methods, and the results obtained here conflict with the interpretation of the SQL data. We
used the pandas library to process data, plotly to plot graphs and dash to create an
interactive dashboard presenting the SpaceX correlation between payload and success for
a specific site and booster type. A slider sets the payload mass range. Map
visualization is quite cumbersome in this document.
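The selection that the slider and the site choice apply to the data can be sketched in pandas alone. The records, the column names and the function `filter_launches` are hypothetical; in a Dash app this logic would sit inside a callback, which is omitted here so that the sketch runs without a server.

```python
import pandas as pd

# Hypothetical launch records; column names are assumptions.
launches = pd.DataFrame({
    "Launch Site": ["CCAFS SLC-40", "KSC LC-39A", "KSC LC-39A", "VAFB SLC-4E"],
    "Payload Mass (kg)": [2500, 3500, 5300, 9600],
    "Booster Version Category": ["v1.1", "FT", "B4", "B4"],
    "class": [0, 1, 1, 0],
})


def filter_launches(df, site, payload_range):
    """Return the rows selected by a payload range and a site choice."""
    low, high = payload_range
    mask = df["Payload Mass (kg)"].between(low, high)  # inclusive on both ends
    if site != "ALL":
        mask &= df["Launch Site"] == site
    return df[mask]


selected = filter_launches(launches, "KSC LC-39A", (3000, 6000))
# Share of successful launches in the selection, as a pie chart would show it.
success_share = selected["class"].mean()
```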
9. PREDICTIVE ANALYSIS
The complete notebook with Predictive Analysis can be found here: Predicted Analysis
We used pandas and numpy to process data. Seaborn was useful to normalize data. The
sklearn package name on PyPI is deprecated, so we installed scikit-learn instead and used it
to split our data into test and train sets and to sift through classification algorithms to
find the best among them: Support Vector Machine, Classification Trees, Node Clustering and
Logistic Regression.
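A minimal sketch of this comparison is shown below, on synthetic data rather than the launch data. Two points are assumptions: the fourth model, which this report calls node clustering, is represented here by a k-nearest-neighbors classifier, and default hyperparameters are used instead of any tuning done in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the launch features and the landing class.
X, y = make_classification(n_samples=90, n_features=8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Standardize using statistics from the train set only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=2),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),  # assumed fourth model
}

# Fit each model on the train set and record its accuracy on the test set.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

best_model = max(test_accuracy, key=test_accuracy.get)
```

With a test set this small, several models can tie on accuracy, which is one reason to look at the confusion matrices as well.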
SECTION 2
INSIGHTS DRAWN FROM EDA
Flight Number vs. Launch Site
Figure 7
Payload vs. Launch Site
Figure 8
Success Rate vs. Orbit Type
Figure 9
Flight Number vs. Orbit Type
Figure 10
Payload Mass vs Orbit Type
Figure 11
Launch Success Yearly Trend
Figure 12
All Launch Site Names
Figure 13
Launch Site Names Begin with 'KSC'
Figure 14
Total Payload Mass
Figure 15
Average Payload Mass by F9 v1.1
Figure 16
First Successful Ground Landing Date
Figure 17
Successful Booster Names for Successful Ground Pad Landing with Payload between 4 and 6 tons
Figure 18
Total Number of Successful and Failure Mission Outcomes
Figure 19
Boosters Carried Maximum Payload
Figure 20
Impossible requirement
This data science course has been plagued by a chronic lack of data maintenance. This is the original requirement in the
PowerPoint Capstone document: Skills Network competence at its best. Please see the code for 2017 in my EDA with SQL notebook.
Rank Landing Outcomes Between 2010-06-04 and 2017-03-20 – Also for more recent data
Figure 21
10.
11. LAUNCH SITES PROXIMITY ANALYSIS
11.1. Global Map of All launch sites’ location
The map shows the location of the three sites from the data set (one in CA, two in FL). The Equator was drawn on the map, and the location of
another launch site (French Guiana) was plotted. All launch sites are close to the coast. The site farthest from the Equator is the CA site; the
Guiana Space Centre is very close to the Equator.
Figure 22
11.2. Launch outcomes for each launch center
Looking at the data, it seems that failures are concentrated at Space Launch Complex 40 in Florida and at Vandenberg AFB.
The perspective is skewed: keep in mind that the data frame provided by Skills Network does not include data up to 2020.
Figure 23
11.3. Vandenberg CA Site analysis
The coastline, a highway and a railway are very close to the launch site, as shown in the top magnified map. The distance from the
launch site to the closest city, Lompoc, CA, is 14.11 km (less than 10 miles, for those who use imperial units).
The two images showing proximity are not to scale. Given the proximity to the city and the fact that this site is the most distant
from the Equator, payloads should be restricted in size, and security should be provided during launches to prevent
accidents and restrict access.
Figure 24
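A distance such as the one quoted for Lompoc can be computed from two coordinate pairs with the haversine formula. This is a sketch of one common method, not necessarily the one used in the notebook, and the coordinates are approximate assumptions because the report does not list the values it used.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate, assumed coordinates (latitude, longitude).
launch_site = (34.6328, -120.6107)  # Vandenberg SLC-4E
city = (34.6391, -120.4579)         # Lompoc, CA

distance_km = haversine_km(*launch_site, *city)
distance_miles = distance_km / 1.609344  # kilometres per statute mile
```

By the same conversion factor, the 14.11 km quoted above is about 8.8 miles.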
12. PLOTLY DASHBOARD
I have installed Dash on my system and can run the dashboard locally. I advise you to do the same instead of using the virtual environment
offered by Skills Network.
Figure 25
The dashboard displays a success/failure pie chart and a scatter plot that shows the correlation between payload and success. A slider
adjusts the payload range. Booster type is color coded. The data frame contains 87 records.
12.1. Launch Records - Total Launches by Site Name
The payload is adjusted to a range between 2000 and 3000 kg. The pie chart shows 33 launches for the CCAFS SLC 40 site.
This is the site with the most launches, but the number of launches in the scatter plot below is limited by the slider.
Figure 26
12.2. Launch Records Most Successful Site
The most successful launch site is KSC LC 39A, with 17 successful launches.
Figure 27
12.3. Payload vs Launch Outcome for all Sites
This is quite a crowded graph. We notice that most failures are due to early booster versions, and that FT and B5 boosters perform very
well for a wide range of payloads.
Figure 28
If we want to dig deeper, we can focus on a certain booster type. As expected, version v1.1 shows many failures. B4
boosters also perform poorly: only a 50% success rate is shown. Even at the most successful site, B4 has a poor record. But, as we recall
from the SQL analysis, B4 boosters were very successful in the 4000 to 6000 kg range. One explanation is that the data frame offered for SQL
analysis is not the same with the data frame w
Figure 29
13. PREDICTIVE ANALYSIS – CLASSIFICATION
13.1. Accuracy of the Model
The four models and their accuracy on the test and train sets are summarized below. The logistic regression, SVM and decision tree algorithms
share the highest accuracy on the test sample, and the decision tree has the highest train accuracy among the three. Node clustering
should not be used, judging by accuracy alone.
13.2. Confusion Matrix Analysis
For the three most reliable methods, the problem is that some failed landings are predicted as successful: we get false positives.
Figure 30
For the clustering model, we notice both successful landings predicted as failed and failed landings predicted as
successful: we get both false negatives and false positives.
Figure 31
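The terms used above can be made concrete with a small sketch that counts false positives and false negatives. The labels are hypothetical and do not reproduce the matrices in Figures 30 and 31; a successful landing is assumed to be the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test labels: 1 = first stage landed, 0 = did not land.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]

# With labels=[0, 1] the flattened matrix is ordered TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# fp: failed landings predicted as successful (false positives).
# fn: successful landings predicted as failed (false negatives).
accuracy = (tp + tn) / (tp + tn + fp + fn)
```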