Final Project Description

Overview

This course has a semester-long programming project, done in groups of 2-3 students.

Project Timeline

Component                     Due Date
Proposal/Choosing Partners    Monday 03/13/2023 @11:59pm
Code Submission               Wednesday 05/03/2023 @11:59pm
Project Demo                  Wednesday 05/03/2023 @11:59pm
Final Report                  Friday 05/05/2023 @11:59pm

Milestones
• Group formation: find a project partner(s) and begin to discuss project problems and
ideas.

• Project proposal: your proposal should explicitly state the following:

• A description of the problem
• The motivation for the problem (e.g., why is the problem interesting, why is it
challenging, who will benefit from a solution to the problem, etc.)
• A brief discussion of previous work related to this problem
• Your initial ideas on how to attack the problem, including a (rough) methodology
and plan for your project; be sure to structure your plan into a set of
incremental, implementable milestones and include a milestone schedule
• The resources needed to carry out your project (software, hardware, cloud, etc.)
• Citations and links to resources you have used or will use

• Final report and software/source code: the final report should include:
• Team organization (team members and their roles in the project)
• A survey of related work in the related-work section
• Server/system software/hardware configuration
• Project definition
o Present the problem and summarize your contributions
o List the functional and non-functional requirements
• Architecture (architecture outline and architectural diagram)
• A detailed description of your methodology, analysis, and implementation in the
technical section
• The evaluation methodology and significant results in the evaluation section
• A paragraph explaining, for each group member, their contributions and duties in
the project
• A hyperlink at the end of the report through which we can download your source
code and data set for reproducing your experimental results

• Source code submission: please provide your COMPLETE source code, datasets, and
runnable software in one package. Provide a link to GitHub or Canvas and upload the
code to the system. Include a README file specifying how to compile and run your
code. Students may use libraries or online code during implementation, but such
code will not be counted as part of your own work.

• Recorded Final Presentation and Demo:


The presentation will be approximately 10-15 minutes. An ideal presentation will:
• Describe the problem and why it’s challenging
• Describe the methodology, features and functionality of your project
• Include a demo (a recorded demo is also required and must be submitted)
• Discuss the challenges encountered
• Discuss elements of the project that were not covered in class
• Describe what you plan on doing next

Project Ideas
Below are some high-level project ideas. You can search Google Scholar and/or IEEE Xplore to
find reference papers for your proposal and implementation. You are encouraged to improve on
the topic of the paper you choose.
I will be happy to discuss these ideas with you as well as help you refine your own ideas.

• Spatial Database
A spatial database is a database that is optimized for storing and querying data that represents
objects defined in a geometric space. Most spatial databases allow representing simple geometric
objects such as points, lines and polygons. Spatial databases use a spatial index to speed up
database operations. Some basic operations are: Spatial Measurements, Spatial Functions, Spatial
Predicates, and Geometry Constructors. Some databases support only simplified or modified sets
of these operations, especially in cases of NoSQL systems like MongoDB and CouchDB.
The following are some well-known relational spatial databases:
PostGIS, SQL Server, Oracle Spatial & Oracle Locator, IBM DB2 Spatial Extender, and
SpatiaLite.
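
As a flavor of what such a project involves, here is a minimal sketch of a spatial query
against PostGIS from Python. It assumes a PostGIS-enabled PostgreSQL instance and a
hypothetical "places" table with a geometry column named "geom"; the connection string
and coordinates are placeholders.

    # Minimal sketch of a spatial query against PostGIS; table, column, and
    # connection details are illustrative placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=gisdb user=gis")
    cur = conn.cursor()

    # Find all places within 1 km of a given point. ST_DWithin can use a
    # spatial (GiST) index on "geom", the speedup mentioned above.
    cur.execute("""
        SELECT name, ST_AsText(geom)
        FROM places
        WHERE ST_DWithin(
            geom::geography,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            1000)
    """, (-71.06, 42.36))

    for name, wkt in cur.fetchall():
        print(name, wkt)
    cur.close()
    conn.close()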

• Distributed Database
A distributed database is a set of databases stored on multiple computers that typically appears to
applications as a single database. Consequently, an application can simultaneously access and
modify the data in several databases in a network. Each database in the system is controlled by
its local server but cooperates to maintain the consistency of the global distributed database. This
project is very easy to describe: you should design and implement a real, live distributed
database that handles transaction management, concurrency control, and recovery.
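
To make the transaction-management requirement concrete, below is a toy sketch of
two-phase commit, the classic protocol for atomically committing a transaction across
nodes. All class and method names are invented for illustration; a real implementation
would also log to stable storage and handle coordinator failure.

    # Toy sketch of two-phase commit; all names are illustrative, not a real API.
    class Participant:
        def __init__(self, name):
            self.name = name
            self.staged = None

        def prepare(self, writes):
            # Phase 1: stage the writes and vote yes/no.
            self.staged = writes
            return True  # a real participant may vote False (e.g., lock conflict)

        def commit(self):
            # Phase 2a: make the staged writes visible.
            print(f"{self.name}: committed {self.staged}")

        def abort(self):
            # Phase 2b: discard the staged writes.
            self.staged = None
            print(f"{self.name}: aborted")

    def two_phase_commit(participants, writes):
        # Coordinator: commit only if every participant votes yes.
        if all(p.prepare(writes) for p in participants):
            for p in participants:
                p.commit()
            return True
        for p in participants:
            p.abort()
        return False

    two_phase_commit([Participant("node1"), Participant("node2")], {"x": 42})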

• Streaming Database
A data stream is an unbounded data set that is produced incrementally over time, rather than
being available in full before its processing begins. A traditional database management system
typically processes a stream of ad-hoc queries over relatively static data. In contrast, a Data
Stream Management System (DSMS) evaluates static (long-running) queries on streaming data,
making a single pass over the data and using limited working memory. One important feature of
a DSMS is its ability to handle potentially infinite and rapidly changing data streams while still
offering flexible processing, even though resources such as main memory are limited.
Examples of data stream management systems are:
SQLstream, STREAM, AURORA, QSTREAM, TelegraphCQ, SAP Event Stream Processor, and
InfoSphere Streams.
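
The core DSMS constraint (one pass over the data, bounded memory) can be illustrated
with a continuously evaluated sliding-window aggregate. This is only a sketch, with
illustrative names and a synthetic stream.

    # A long-running windowed aggregate over an unbounded stream: one pass,
    # memory bounded by the window size.
    import itertools
    from collections import deque

    def windowed_average(stream, window_size=100):
        window = deque()          # holds at most window_size recent values
        total = 0.0
        for value in stream:      # the stream never has to fit in memory
            window.append(value)
            total += value
            if len(window) > window_size:
                total -= window.popleft()
            yield total / len(window)   # continuous query result per arrival

    # Example: a synthetic (effectively unbounded) stream.
    for avg in itertools.islice(windowed_average(itertools.count()), 5):
        print(avg)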

• Cloud (Google Cloud, AWS, Microsoft Azure, etc.)


Cloud computing is the practice of hosting files, computing operations, or technology services on
remote servers connected via the Internet. It can allow people to access and share information at
any time from multiple devices, rapidly deploy computing services without purchasing hardware,
temporarily leverage massive computing power, and much more. Your project must involve
harnessing a cloud computing system of some kind. The system may be one of those that we
have used or studied in class, or it may be some other commercial cloud computing or storage
system. You must evaluate what you have built for correctness, performance, and scalability. For
correctness, you must develop a procedure to evaluate that the system actually accomplishes
what it intends to do. (e.g. If you store some data in the cloud, you must verify that it is the same
when it is read back.) For performance and scalability, you must select an appropriate measure --
simulations/day, GB/s, transactions/hour -- and then evaluate the performance as the system
scales up to 100X or higher.
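
As one possible shape for the correctness check described above, here is a sketch that
writes a blob to a cloud object store, reads it back, and compares checksums. It uses
AWS S3 through boto3 as an example target; the bucket and key names are placeholders,
and configured AWS credentials are assumed.

    # Correctness probe for cloud storage: write, read back, compare checksums.
    # Bucket/key names are placeholders; assumes AWS credentials are configured.
    import hashlib
    import os
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-project-bucket", "probe/blob.bin"

    payload = os.urandom(1 << 20)              # 1 MiB of random data
    sent = hashlib.sha256(payload).hexdigest()

    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    received = hashlib.sha256(body).hexdigest()

    assert sent == received, "cloud read-back does not match what was written"
    print("round-trip verified:", received)

Timing many such round-trips at increasing concurrency gives the GB/s or
transactions/hour measurements the scalability evaluation calls for.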

• Database and Information Retrieval (DB + IR)


The field of Information Retrieval has become extremely important in recent years due to the
intriguing challenges presented in tapping the Internet and the Web as an inexhaustible source of
information. The success of Web search engines is a testimony to this fact. The technical
challenges of searching and browsing for information in unstructured domains such as the
Web and other document collections are vast. Querying is often a gray and fuzzy process; e.g.,
when many results match a query, IR systems attempt to return top hits ranked by relevance.
Several ingenious query paradigms as well as search algorithms have been developed for these
purposes. However, while the Web is indeed vast as an information source, it is estimated that
much larger amounts of recorded data are locked up in more structured sources such as databases,
which are often the proprietary information of private corporations and government agencies.
Searching for information within databases is currently accomplished in completely different
ways than searching for information in unstructured data sources. Often the data
explorer has to know comprehensive query languages (such as SQL), as well as important
information on how the data is structured into different tables and columns (the database
schema).
A paper about Google's system architecture, The Anatomy of a Large-Scale Hypertextual Web
Search Engine, explains the initial algorithm behind Google's search engine. IBM also has an
interesting paper on the DB+IR problem, Efficiently Linking Text Documents with Relevant
Structured Information.
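
To make "top hits ranked by relevance" concrete, here is a tiny sketch of TF-IDF
scoring, one standard IR ranking ingredient, over a toy in-memory corpus. All documents
and the query are illustrative.

    # Rank documents by TF-IDF relevance to a keyword query.
    import math
    from collections import Counter

    docs = {
        1: "distributed database transaction recovery",
        2: "image retrieval by color histogram",
        3: "database keyword search ranking",
    }
    tokenized = {d: text.split() for d, text in docs.items()}

    def tf_idf_score(query, doc_id):
        terms = tokenized[doc_id]
        tf = Counter(terms)
        score = 0.0
        for q in query.split():
            df = sum(1 for t in tokenized.values() if q in t)
            if df:
                idf = math.log(len(docs) / df)       # rare terms weigh more
                score += (tf[q] / len(terms)) * idf  # term frequency * rarity
        return score

    query = "database ranking"
    for doc_id in sorted(docs, key=lambda d: tf_idf_score(query, d), reverse=True):
        print(doc_id, docs[doc_id], tf_idf_score(query, doc_id))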

• Image Retrieval
While searching for textual data on the World Wide Web and in other databases has become
common practice, search engines for pictorial data are still rare. This comes as no
surprise, since it is a much more difficult task to index, categorize and analyze images
automatically, compared with similar operations on text. An easy way to make a searchable
image database is to label each image with a text description, and to perform the actual search on
those text labels. However, a huge amount of work is required in manually labelling every
picture, and the system would not be able to deal with any new pictures not labelled before.
Furthermore, it is difficult to give complete descriptions for most pictures. Early approaches to
the content-based image retrieval problem include the IBM QBIC (Query-By-Image Content)
System, where users can query an image database by average color, histogram, texture, shape,
sketch, etc. Later, techniques based on query by example were developed. Query by example is a
query technique that involves providing the system with an example image that it will then base
its search upon. Below is a list of publicly available Content-based image retrieval (CBIR)
engines. These image search engines look at the content (pixels) of images in order to return
results that match a particular query:
Pixolution, JustVisual, Elastic Vision, Google Image Search, SearchByImage, Picalike, ID My
Pill, PicScout.
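
A minimal query-by-example sketch along the lines of QBIC's histogram feature might
look like the following, using the Pillow library. The image file names are
placeholders, and histogram intersection is just one of several possible similarity
measures.

    # Query by example via color histograms: rank candidates by similarity
    # to the query image. File paths are placeholders; requires Pillow.
    from PIL import Image

    def color_histogram(path, size=(64, 64)):
        # Downscale and normalize so differently sized images are comparable.
        img = Image.open(path).convert("RGB").resize(size)
        hist = img.histogram()             # 256 bins each for R, G, B
        total = sum(hist)
        return [h / total for h in hist]

    def histogram_intersection(h1, h2):
        # 1.0 means identical color distributions, 0.0 means disjoint.
        return sum(min(a, b) for a, b in zip(h1, h2))

    query = color_histogram("query.jpg")
    candidates = ["a.jpg", "b.jpg", "c.jpg"]
    ranked = sorted(candidates,
                    key=lambda p: histogram_intersection(query, color_histogram(p)),
                    reverse=True)
    print(ranked)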

• Graph Databases and Graph Data Management


Recently, there has been a lot of interest in the application of graphs in different domains. They
have been widely used for data modeling of different application domains such as chemical
compounds, multimedia databases, protein networks, social networks and semantic web. With
the continued emergence and increase of massive and complex structural graph data, a graph
database that efficiently supports elementary data management mechanisms is crucially required
to effectively understand and utilize any collection of graphs.
The following are examples of graph databases:
Neo4j, OrientDB, InfiniteGraph, ArangoDB, SAP HANA, and AllegroGraph
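
For a sense of the elementary data management mechanisms involved, here is a sketch of
creating and querying a small graph in Neo4j through its official Python driver. The
connection details, labels, and property names are placeholders.

    # Create a small social graph and run a traversal query in Neo4j.
    # URI, credentials, labels, and properties are illustrative.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        # Create two people and a friendship edge (idempotent via MERGE).
        session.run(
            "MERGE (a:Person {name: $a}) "
            "MERGE (b:Person {name: $b}) "
            "MERGE (a)-[:FRIEND]->(b)",
            a="Alice", b="Bob")

        # Query: who is reachable from Alice within two hops?
        result = session.run(
            "MATCH (a:Person {name: $a})-[:FRIEND*1..2]->(p) "
            "RETURN DISTINCT p.name",
            a="Alice")
        for record in result:
            print(record["p.name"])

    driver.close()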

• Social and Information Network Analysis


The World Wide Web, blogging platforms, instant messaging, and Facebook can be characterized by
the interplay between rich information content, the millions of individuals and organizations who
create and use it, and the technology that supports it. Recent research has been focused on the
structure and analysis of such large social and information networks and on models and
algorithms that abstract their basic properties. Project topics can include methods for link
analysis and network community detection, diffusion and information propagation on the web,
virus outbreak detection in networks, and connections with work in the social sciences and
economics.
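
As one concrete starting point for the link-analysis topic, here is a small sketch of
PageRank computed by power iteration on a toy directed graph. The graph and parameter
values are illustrative.

    # PageRank by power iteration on a toy directed graph.
    def pagerank(graph, damping=0.85, iterations=50):
        nodes = list(graph)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
            for n, out_links in graph.items():
                if out_links:
                    share = damping * rank[n] / len(out_links)
                    for m in out_links:
                        new_rank[m] += share
                else:
                    # Dangling node: spread its rank evenly.
                    for m in nodes:
                        new_rank[m] += damping * rank[n] / len(nodes)
            rank = new_rank
        return rank

    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))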

• Web Data Analytics and Management


The Internet and the Web have revolutionized access to information. Today, one finds on the
Web primarily HTML (the standard format of the Web), but also documents in PDF, DOC, and plain
text, as well as images, music, and videos. The public Web is composed of billions of pages on millions of
servers. It is a fantastic means of sharing information. Typical projects can include Web data
crawling, integration and retrieval, hidden Web discovery, information extraction and entity
resolution, and dataspaces.
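
A Web-crawling project starts with a step like the following sketch: fetch one page and
extract its outgoing links to build the next crawl frontier. It uses only the Python
standard library, and the seed URL is a placeholder.

    # Fetch one page and extract its links (the next crawl frontier).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links against the page URL.
                        self.links.append(urljoin(self.base_url, value))

    seed = "https://example.com/"
    html = urlopen(seed).read().decode("utf-8", errors="replace")
    parser = LinkExtractor(seed)
    parser.feed(html)
    print(parser.links)   # frontier for the next round of crawling
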
• Fog/Edge Computing
The usual cloud-based architecture, where application intelligence and storage are centralized in
server wire centers, satisfies the needs of most Internet of Things (IoT) applications, but
begins to break down when real-time requirements, high data volumes, or limited network
bandwidth play an important role in the deployment model. The need for decentralized
processing is emerging; some references call it fog or edge computing. Fog computing is
a system-level architecture optimized for the distribution of computational, networking, and
storage capabilities in a hierarchy of levels within an IoE network. It seeks to provide the correct
balance of capacity among the three basic capabilities at exactly the levels of the network where
they are the most optimally located. Fog computing builds upon the basic capabilities of cloud
computing, but extends them toward the edge of the network, and often right up to the intelligent
sensors, actuators, and user devices that constitute the IoE. Many familiar cloud techniques, such
as virtualization, orchestration, hypervisors, multi-tenancy, and excellent security are seamlessly
extended from the cloud through the fog layers. With the fog layers augmenting the cloud,
network capabilities that are difficult or impossible to achieve exclusively in the cloud can be
provided. Fog offers performance, scalability, efficiency, security, and reliability advantages
compared to cloud-only solutions for critical IoE applications. Fog usually doesn't replace the
cloud (which has many advantages due to centralization and scalability); rather, it supplements
the cloud for the most critical aspects of network operations. Cisco has an interesting paper
about this topic, Attaining IoT Value: How to move from Connecting Things to Capturing Insight.
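
One way to picture the decentralized processing described above is an edge node that
aggregates raw sensor readings locally and forwards only compact summaries and
anomalies upstream. This is a sketch with invented names, not a real fog framework.

    # Edge-side processing: ship one summary (plus anomalies) per batch
    # instead of every raw reading, saving upstream bandwidth.
    import statistics

    def edge_node(readings, batch_size=100, anomaly_z=3.0):
        batch = []
        for value in readings:
            batch.append(value)
            if len(batch) == batch_size:
                mean = statistics.mean(batch)
                stdev = statistics.pstdev(batch) or 1.0
                anomalies = [v for v in batch
                             if abs(v - mean) / stdev > anomaly_z]
                # One compact record replaces batch_size raw points.
                yield {"mean": mean, "stdev": stdev, "anomalies": anomalies}
                batch.clear()

    # Simulated sensor: mostly ~20.0, with one spike.
    readings = [20.0] * 99 + [95.0]
    for summary in edge_node(readings):
        print(summary)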

• Big Data Analytics


Our world is being revolutionized by data-driven methods: access to large amounts of data has
generated new insights and opened exciting new opportunities in commerce, science, and
computing applications. Processing the enormous quantities of data necessary for these advances
requires large clusters, making distributed computing paradigms more crucial than ever.
MapReduce is a programming model for expressing distributed computations on massive
datasets and an execution framework for large-scale data processing on clusters of commodity
servers. Spark is also rapidly growing and is replacing Hadoop MapReduce as the technology of
choice for big data analytics.
Potential use cases for big data analytics span many different domains. In the game industry,
processing and discovering patterns from the potential firehose of real-time in-game events and
being able to respond to them immediately is a desired capability. In the e-commerce industry,
real-time transaction information could be passed to a streaming clustering algorithm like k-
means or collaborative filtering like ALS. Results could then even be combined with other
unstructured data sources, such as customer comments or product reviews, and used to
constantly improve and adapt recommendations over time with new trends. In the finance or
security industry, big data analytics could be applied to fraud or intrusion detection
systems or risk-based authentication.
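
For reference, the canonical MapReduce computation (word count) expressed in Spark's
RDD API looks roughly like the sketch below. It assumes the pyspark package is
installed, and the input path is a placeholder.

    # Word count in Spark's RDD API; the input path is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext

    counts = (sc.textFile("hdfs:///data/corpus.txt")     # distributed input
                .flatMap(lambda line: line.split())       # map: emit words
                .map(lambda word: (word, 1))              # map: (key, value) pairs
                .reduceByKey(lambda a, b: a + b))         # reduce: sum per key

    for word, count in counts.take(10):
        print(word, count)
    spark.stop()
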
• Miscellaneous Topics in Data Management
Nowadays, traditional database systems have begun to embrace new technologies from other research
domains, such as data mining, information retrieval, pattern recognition, and machine learning.
There exists an array of research problems on how to bridge the gaps between different data
analytical methods and extend them in database fields, for example, supporting effective and
efficient keyword search on databases, and embedding classification or clustering algorithms
deep into database systems.
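
As one small, concrete instance of keyword search on databases, the sketch below uses
SQLite's FTS5 full-text extension through the standard library. Whether FTS5 is
available depends on how the local SQLite was built, and the sample rows are
illustrative.

    # Keyword search inside a database via SQLite's FTS5 full-text index.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)",
        [("spatial", "indexing points lines and polygons"),
         ("streams", "single pass queries over unbounded data"),
         ("graphs", "community detection over social networks")])

    # Rank matches by relevance with the built-in bm25() scoring function
    # (lower bm25 values mean better matches in SQLite's convention).
    for title, score in conn.execute(
            "SELECT title, bm25(docs) FROM docs "
            "WHERE docs MATCH ? ORDER BY bm25(docs)",
            ("queries OR networks",)):
        print(title, score)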
