|
2 | 2 | "cells": [
|
3 | 3 | {
|
4 | 4 | "cell_type": "code",
|
5 |
| - "execution_count": 67, |
| 5 | + "execution_count": 2, |
6 | 6 | "id": "3dac29da-6756-4bf0-83ee-19fce6251f7e",
|
7 | 7 | "metadata": {
|
8 | 8 | "tags": []
|
|
28 | 28 | },
|
29 | 29 | {
|
30 | 30 | "cell_type": "code",
|
31 |
| - "execution_count": 68, |
| 31 | + "execution_count": 3, |
32 | 32 | "id": "8054ee79-e401-4394-891d-c0d8a0d51266",
|
33 | 33 | "metadata": {
|
34 | 34 | "tags": []
|
|
47 | 47 | "Source: https://www.kaggle.com/datasets/luisernestogarca/nyc-living-languages-and-distribution"
|
48 | 48 | ]
|
49 | 49 | },
|
| 50 | + { |
| 51 | + "cell_type": "markdown", |
| 52 | + "id": "e4b86c28-3408-4e74-87c3-86ee0e59a0ac", |
| 53 | + "metadata": {}, |
| 54 | + "source": [ |
| 55 | + "I'm very interested in languages and linguistic diversity and wanted to choose a dataset that reflects that. Finding a dataset that also included geospatial data was harder than expected, but the NYC living languages dataset stood out as a very comprehensive, usable dataset with valuable data.\n", |
| 56 | + "\n", |
| 57 | + "Research questions I based my EDA on:\n", |
| 58 | + "What are the most commonly spoken minority languages in NYC?\n", |
| 59 | + "What regions of origin do they have?\n", |
| 60 | + "What locations have the highest linguistic diversity? Is there any \"clustering\" of minority language communities in certain boroughs?\n", |
| 61 | + "\n", |
| 62 | + "For the next steps, I'm thinking about \n", |
| 63 | + "- exploring the history of the language communities in the dataset – there's a column with a description of the language and how speakers emigrated to NYC (diversity visa program etc.), which could be used for some NLP tasks to find out how and why minority language communities developed. The dataset could be expanded with years of migration streams etc.\n", |
| 64 | + "- exploring other datasets with, e.g., economic factors to look for correlations and understand why certain neighborhoods are more linguistically diverse than others.\n" |
| 65 | + ] |
| 66 | + }, |
50 | 67 | {
|
51 | 68 | "cell_type": "code",
|
52 |
| - "execution_count": 69, |
| 69 | + "execution_count": 4, |
53 | 70 | "id": "4a1c38f1-8373-4c35-b0bd-d6b86eb0cd83",
|
54 | 71 | "metadata": {},
|
55 | 72 | "outputs": [
|
|
216 | 233 | "4 achi1257 Austronesian ace "
|
217 | 234 | ]
|
218 | 235 | },
|
219 |
| - "execution_count": 69, |
| 236 | + "execution_count": 4, |
220 | 237 | "metadata": {},
|
221 | 238 | "output_type": "execute_result"
|
222 | 239 | }
|
|
227 | 244 | },
|
228 | 245 | {
|
229 | 246 | "cell_type": "code",
|
230 |
| - "execution_count": 112, |
| 247 | + "execution_count": 5, |
231 | 248 | "id": "ce6f03cb-cbe0-40f5-83d0-939702effb7b",
|
232 | 249 | "metadata": {
|
233 | 250 | "tags": []
|
|
238 | 255 | "output_type": "stream",
|
239 | 256 | "text": [
|
240 | 257 | "<class 'pandas.core.frame.DataFrame'>\n",
|
241 |
| - "Index: 1267 entries, 0 to 1273\n", |
| 258 | + "RangeIndex: 1274 entries, 0 to 1273\n", |
242 | 259 | "Data columns (total 15 columns):\n",
|
243 |
| - " # Column Non-Null Count Dtype \n", |
244 |
| - "--- ------ -------------- ----- \n", |
245 |
| - " 0 language 1267 non-null object \n", |
246 |
| - " 1 endonym 1267 non-null object \n", |
247 |
| - " 2 description 1267 non-null object \n", |
248 |
| - " 3 world_region 1196 non-null object \n", |
249 |
| - " 4 country 1196 non-null object \n", |
250 |
| - " 5 global_speakers 1048 non-null object \n", |
251 |
| - " 6 primary_location 1196 non-null object \n", |
252 |
| - " 7 add_neighborhoods 286 non-null object \n", |
253 |
| - " 8 latitude 1196 non-null float64\n", |
254 |
| - " 9 longitude 1196 non-null float64\n", |
255 |
| - " 10 size 1196 non-null object \n", |
256 |
| - " 11 status 1196 non-null object \n", |
257 |
| - " 12 glottocode 1098 non-null object \n", |
258 |
| - " 13 lg_family 1191 non-null object \n", |
259 |
| - " 14 iso_639_3 1072 non-null object \n", |
260 |
| - "dtypes: float64(2), object(13)\n", |
261 |
| - "memory usage: 158.4+ KB\n" |
| 260 | + " # Column Non-Null Count Dtype \n", |
| 261 | + "--- ------ -------------- ----- \n", |
| 262 | + " 0 language 1274 non-null object\n", |
| 263 | + " 1 endonym 1274 non-null object\n", |
| 264 | + " 2 description 1274 non-null object\n", |
| 265 | + " 3 world_region 1203 non-null object\n", |
| 266 | + " 4 country 1201 non-null object\n", |
| 267 | + " 5 global_speakers 1054 non-null object\n", |
| 268 | + " 6 primary_location 1203 non-null object\n", |
| 269 | + " 7 add_neighborhoods 293 non-null object\n", |
| 270 | + " 8 latitude 1203 non-null object\n", |
| 271 | + " 9 longitude 1201 non-null object\n", |
| 272 | + " 10 size 1203 non-null object\n", |
| 273 | + " 11 status 1202 non-null object\n", |
| 274 | + " 12 glottocode 1103 non-null object\n", |
| 275 | + " 13 lg_family 1194 non-null object\n", |
| 276 | + " 14 iso_639_3 1072 non-null object\n", |
| 277 | + "dtypes: object(15)\n", |
| 278 | + "memory usage: 149.4+ KB\n" |
262 | 279 | ]
|
263 | 280 | }
|
264 | 281 | ],
|
|
268 | 285 | },
|
269 | 286 | {
|
270 | 287 | "cell_type": "code",
|
271 |
| - "execution_count": 115, |
| 288 | + "execution_count": 6, |
272 | 289 | "id": "c51dbf56-cc41-4fd1-90f5-eb5e31b8ad00",
|
273 | 290 | "metadata": {
|
274 | 291 | "tags": []
|
|
277 | 294 | {
|
278 | 295 | "data": {
|
279 | 296 | "text/plain": [
|
280 |
| - "language object\n", |
281 |
| - "endonym object\n", |
282 |
| - "description object\n", |
283 |
| - "world_region object\n", |
284 |
| - "country object\n", |
285 |
| - "global_speakers object\n", |
286 |
| - "primary_location object\n", |
287 |
| - "add_neighborhoods object\n", |
288 |
| - "latitude float64\n", |
289 |
| - "longitude float64\n", |
290 |
| - "size object\n", |
291 |
| - "status object\n", |
292 |
| - "glottocode object\n", |
293 |
| - "lg_family object\n", |
294 |
| - "iso_639_3 object\n", |
| 297 | + "language object\n", |
| 298 | + "endonym object\n", |
| 299 | + "description object\n", |
| 300 | + "world_region object\n", |
| 301 | + "country object\n", |
| 302 | + "global_speakers object\n", |
| 303 | + "primary_location object\n", |
| 304 | + "add_neighborhoods object\n", |
| 305 | + "latitude object\n", |
| 306 | + "longitude object\n", |
| 307 | + "size object\n", |
| 308 | + "status object\n", |
| 309 | + "glottocode object\n", |
| 310 | + "lg_family object\n", |
| 311 | + "iso_639_3 object\n", |
295 | 312 | "dtype: object"
|
296 | 313 | ]
|
297 | 314 | },
|
298 |
| - "execution_count": 115, |
| 315 | + "execution_count": 6, |
299 | 316 | "metadata": {},
|
300 | 317 | "output_type": "execute_result"
|
301 | 318 | }
|
|
314 | 331 | },
|
315 | 332 | {
|
316 | 333 | "cell_type": "code",
|
317 |
| - "execution_count": 70, |
| 334 | + "execution_count": 7, |
318 | 335 | "id": "e57c7f23-33b9-40fd-b360-15955b42d66b",
|
319 | 336 | "metadata": {
|
320 | 337 | "tags": []
|
|
347 | 364 | },
|
348 | 365 | {
|
349 | 366 | "cell_type": "code",
|
350 |
| - "execution_count": 71, |
| 367 | + "execution_count": 8, |
351 | 368 | "id": "b9cede0b-8163-4795-9284-000f3ba8951a",
|
352 | 369 | "metadata": {
|
353 | 370 | "tags": []
|
|
359 | 376 | },
|
360 | 377 | {
|
361 | 378 | "cell_type": "code",
|
362 |
| - "execution_count": 72, |
| 379 | + "execution_count": 9, |
363 | 380 | "id": "9677709f-02c7-4a28-96ec-d790afdcd566",
|
364 | 381 | "metadata": {
|
365 | 382 | "tags": []
|
|
397 | 414 | },
|
398 | 415 | {
|
399 | 416 | "cell_type": "code",
|
400 |
| - "execution_count": 73, |
| 417 | + "execution_count": 10, |
401 | 418 | "id": "6c5eab1f-070e-46f4-b85c-dfa987451dd9",
|
402 | 419 | "metadata": {
|
403 | 420 | "tags": []
|
|
409 | 426 | },
|
410 | 427 | {
|
411 | 428 | "cell_type": "code",
|
412 |
| - "execution_count": 74, |
| 429 | + "execution_count": 11, |
413 | 430 | "id": "41f0de8c-496d-4e87-8e48-110c4b4e6f76",
|
414 | 431 | "metadata": {
|
415 | 432 | "tags": []
|
|
421 | 438 | },
|
422 | 439 | {
|
423 | 440 | "cell_type": "code",
|
424 |
| - "execution_count": 75, |
| 441 | + "execution_count": 12, |
425 | 442 | "id": "337c391c-01bd-4f9b-aa68-ec7de0f0e633",
|
426 | 443 | "metadata": {
|
427 | 444 | "tags": []
|
|
446 | 463 | "print(elmhurst_lgs)"
|
447 | 464 | ]
|
448 | 465 | },
|
| 466 | + { |
| 467 | + "cell_type": "markdown", |
| 468 | + "id": "8760e53f-894b-49aa-947e-c6561c5f7d89", |
| 469 | + "metadata": {}, |
| 470 | + "source": [ |
| 471 | + "What's the most common region of origin for minority languages in Elmhurst?" |
| 472 | + ] |
| 473 | + }, |
| 474 | + { |
| 475 | + "cell_type": "code", |
| 476 | + "execution_count": 15, |
| 477 | + "id": "7a326e0b-48b2-411f-9cd4-46f8f9d885ce", |
| 478 | + "metadata": { |
| 479 | + "tags": [] |
| 480 | + }, |
| 481 | + "outputs": [ |
| 482 | + { |
| 483 | + "name": "stdout", |
| 484 | + "output_type": "stream", |
| 485 | + "text": [ |
| 486 | + "world_region\n", |
| 487 | + "Southeastern Asia 31\n", |
| 488 | + "Southern Asia 13\n", |
| 489 | + "Eastern Asia 5\n", |
| 490 | + "Name: count, dtype: int64\n" |
| 491 | + ] |
| 492 | + } |
| 493 | + ], |
| 494 | + "source": [ |
| 495 | + "elmhurst_origins = elmhurst_data[\"world_region\"].value_counts()\n", |
| 496 | + "print(elmhurst_origins)" |
| 497 | + ] |
| 498 | + }, |
449 | 499 | {
|
450 | 500 | "cell_type": "markdown",
|
451 | 501 | "id": "5ba835c7-bd4b-42b4-9926-fbd1ffeaba5d",
|
|