update assignment 1

clarab00 · clarab00 · commit 51809a9c9659 · 2023-11-30T13:45:43.000+01:00
diff --git a/Data/csv/assignment1.ipynb b/Data/csv/assignment1.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 67,
+   "execution_count": 2,
    "id": "3dac29da-6756-4bf0-83ee-19fce6251f7e",
    "metadata": {
     "tags": []
@@ -28,7 +28,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 68,
+   "execution_count": 3,
    "id": "8054ee79-e401-4394-891d-c0d8a0d51266",
    "metadata": {
     "tags": []
@@ -47,9 +47,26 @@
     "Source: https://www.kaggle.com/datasets/luisernestogarca/nyc-living-languages-and-distribution"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e4b86c28-3408-4e74-87c3-86ee0e59a0ac",
+   "metadata": {},
+   "source": [
+    "I'm very interested in languages and linguistic diversity and wanted to choose a dataset that reflects that. Finding a dataset that also included geospatial data was harder than expected, but the NYC living languages dataset stood out as comprehensive, usable, and rich in valuable data.\n",
+    "\n",
+    "Research questions I based my EDA on:\n",
+    "- What are the most commonly spoken minority languages in NYC?\n",
+    "- Which world regions do these languages originate from?\n",
+    "- What locations have the highest linguistic diversity? Is there any \"clustering\" of minority language communities in certain boroughs?\n",
+    "\n",
+    "For the next steps, I'm thinking about:\n",
+    "- exploring the history of the language communities in the dataset. There's a column describing each language and how its speakers emigrated to NYC (diversity visa program etc.), which could be used for some NLP tasks to find out how and why minority language communities developed. The dataset could also be expanded with the years of migration streams etc.\n",
+    "- exploring other datasets, e.g. with economic factors, to find correlations and find out why certain neighborhoods are more linguistically diverse than others.\n"
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 69,
+   "execution_count": 4,
    "id": "4a1c38f1-8373-4c35-b0bd-d6b86eb0cd83",
    "metadata": {},
    "outputs": [
@@ -216,7 +233,7 @@
        "4   achi1257   Austronesian       ace  "
       ]
      },
-     "execution_count": 69,
+     "execution_count": 4,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -227,7 +244,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 112,
+   "execution_count": 5,
    "id": "ce6f03cb-cbe0-40f5-83d0-939702effb7b",
    "metadata": {
     "tags": []
@@ -238,27 +255,27 @@
      "output_type": "stream",
      "text": [
       "<class 'pandas.core.frame.DataFrame'>\n",
-      "Index: 1267 entries, 0 to 1273\n",
+      "RangeIndex: 1274 entries, 0 to 1273\n",
       "Data columns (total 15 columns):\n",
-      " #   Column             Non-Null Count  Dtype  \n",
-      "---  ------             --------------  -----  \n",
-      " 0   language           1267 non-null   object \n",
-      " 1   endonym            1267 non-null   object \n",
-      " 2   description        1267 non-null   object \n",
-      " 3   world_region       1196 non-null   object \n",
-      " 4   country            1196 non-null   object \n",
-      " 5   global_speakers    1048 non-null   object \n",
-      " 6   primary_location   1196 non-null   object \n",
-      " 7   add_neighborhoods  286 non-null    object \n",
-      " 8   latitude           1196 non-null   float64\n",
-      " 9   longitude          1196 non-null   float64\n",
-      " 10  size               1196 non-null   object \n",
-      " 11  status             1196 non-null   object \n",
-      " 12  glottocode         1098 non-null   object \n",
-      " 13  lg_family          1191 non-null   object \n",
-      " 14  iso_639_3          1072 non-null   object \n",
-      "dtypes: float64(2), object(13)\n",
-      "memory usage: 158.4+ KB\n"
+      " #   Column             Non-Null Count  Dtype \n",
+      "---  ------             --------------  ----- \n",
+      " 0   language           1274 non-null   object\n",
+      " 1   endonym            1274 non-null   object\n",
+      " 2   description        1274 non-null   object\n",
+      " 3   world_region       1203 non-null   object\n",
+      " 4   country            1201 non-null   object\n",
+      " 5   global_speakers    1054 non-null   object\n",
+      " 6   primary_location   1203 non-null   object\n",
+      " 7   add_neighborhoods  293 non-null    object\n",
+      " 8   latitude           1203 non-null   object\n",
+      " 9   longitude          1201 non-null   object\n",
+      " 10  size               1203 non-null   object\n",
+      " 11  status             1202 non-null   object\n",
+      " 12  glottocode         1103 non-null   object\n",
+      " 13  lg_family          1194 non-null   object\n",
+      " 14  iso_639_3          1072 non-null   object\n",
+      "dtypes: object(15)\n",
+      "memory usage: 149.4+ KB\n"
      ]
     }
    ],
@@ -268,7 +285,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 115,
+   "execution_count": 6,
    "id": "c51dbf56-cc41-4fd1-90f5-eb5e31b8ad00",
    "metadata": {
     "tags": []
@@ -277,25 +294,25 @@
     {
      "data": {
       "text/plain": [
-       "language              object\n",
-       "endonym               object\n",
-       "description           object\n",
-       "world_region          object\n",
-       "country               object\n",
-       "global_speakers       object\n",
-       "primary_location      object\n",
-       "add_neighborhoods     object\n",
-       "latitude             float64\n",
-       "longitude            float64\n",
-       "size                  object\n",
-       "status                object\n",
-       "glottocode            object\n",
-       "lg_family             object\n",
-       "iso_639_3             object\n",
+       "language             object\n",
+       "endonym              object\n",
+       "description          object\n",
+       "world_region         object\n",
+       "country              object\n",
+       "global_speakers      object\n",
+       "primary_location     object\n",
+       "add_neighborhoods    object\n",
+       "latitude             object\n",
+       "longitude            object\n",
+       "size                 object\n",
+       "status               object\n",
+       "glottocode           object\n",
+       "lg_family            object\n",
+       "iso_639_3            object\n",
        "dtype: object"
       ]
      },
-     "execution_count": 115,
+     "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -314,7 +331,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 70,
+   "execution_count": 7,
    "id": "e57c7f23-33b9-40fd-b360-15955b42d66b",
    "metadata": {
     "tags": []
@@ -347,7 +364,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 71,
+   "execution_count": 8,
    "id": "b9cede0b-8163-4795-9284-000f3ba8951a",
    "metadata": {
     "tags": []
@@ -359,7 +376,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 72,
+   "execution_count": 9,
    "id": "9677709f-02c7-4a28-96ec-d790afdcd566",
    "metadata": {
     "tags": []
@@ -397,7 +414,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 73,
+   "execution_count": 10,
    "id": "6c5eab1f-070e-46f4-b85c-dfa987451dd9",
    "metadata": {
     "tags": []
@@ -409,7 +426,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 74,
+   "execution_count": 11,
    "id": "41f0de8c-496d-4e87-8e48-110c4b4e6f76",
    "metadata": {
     "tags": []
@@ -421,7 +438,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 75,
+   "execution_count": 12,
    "id": "337c391c-01bd-4f9b-aa68-ec7de0f0e633",
    "metadata": {
     "tags": []
@@ -446,6 +463,39 @@
     "print(elmhurst_lgs)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8760e53f-894b-49aa-947e-c6561c5f7d89",
+   "metadata": {},
+   "source": [
+    "What's the most common region of origin for minority languages in Elmhurst?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "7a326e0b-48b2-411f-9cd4-46f8f9d885ce",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "world_region\n",
+      "Southeastern Asia    31\n",
+      "Southern Asia        13\n",
+      "Eastern Asia          5\n",
+      "Name: count, dtype: int64\n"
+     ]
+    }
+   ],
+   "source": [
+    "elmhurst_origins = elmhurst_data[\"world_region\"].value_counts()\n",
+    "print(elmhurst_origins)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "5ba835c7-bd4b-42b4-9926-fbd1ffeaba5d",