Programming Presentation
3. Standardize Capitalization:
Within our data, we need to make sure that the text is consistent. If we
have a mixture of capitalization, the same value could be split into several
erroneous categories. It could also cause problems when we need to
translate before processing, as capitalization can change the meaning. For
example, Bill is a person's name, whereas a bill or to bill is something else
entirely.
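As a minimal sketch of the idea above, the snippet below shows how mixed capitalization can inflate the number of categories and how lowercasing collapses them. The function name `standardize_case` and the sample labels are hypothetical, chosen only for illustration.

```python
from collections import Counter

def standardize_case(values):
    """Trim and lowercase each label so case variants collapse into one category."""
    return [v.strip().casefold() for v in values]

# Hypothetical raw labels with inconsistent capitalization and a stray space.
raw = ["Red", "red", "RED ", "Blue", "blue"]

before = Counter(raw)                    # each spelling counts as its own category
after = Counter(standardize_case(raw))   # variants merged into two categories

print(len(before), "categories before,", len(after), "after")
```

Note that blanket lowercasing also erases the Bill/bill distinction mentioned above, so it may be worth applying it only to columns where case carries no meaning.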
4. Convert Data Types:
Numbers are the most common data type that we will need to convert
when cleaning our data. Often numbers are entered as text; however, in order
to be processed, they need to appear as numerals. For example, if we have
an entry that reads September 24th 2021, we’ll need to change that to read
09/24/2021.
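One possible way to perform this kind of conversion with Python's standard library is sketched below. The helper names `parse_long_date` and `to_number` are hypothetical, and the date parsing assumes English month names (the default C locale) and the "Month Day Year" layout used in the example.

```python
import re
from datetime import datetime

def parse_long_date(text):
    """Parse a date such as 'September 24th 2021' into a datetime object."""
    # Drop ordinal suffixes (st, nd, rd, th) that directly follow a digit.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text)
    return datetime.strptime(cleaned, "%B %d %Y")

def to_number(text):
    """Convert a number stored as text (e.g. '1,250') into a float."""
    # Remove thousands separators and surrounding spaces before converting.
    return float(text.replace(",", "").strip())

parsed = parse_long_date("September 24th 2021")
formatted = parsed.strftime("%m/%d/%Y")   # month/day/year with leading zeros
amount = to_number(" 1,250 ")

print(formatted, amount)
```

Real data usually mixes several date layouts, so a fuller version would need to try more than one format string or reject entries it cannot parse.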
5. Clear Formatting:
Machine learning models can’t process our information if it is heavily
formatted. If we are taking data from a range of sources, it’s likely that there
are a number of different document formats. This can make our data
confusing and incorrect. We should remove any kind of formatting that has
been applied to our documents, so we can start from zero. This is normally
not a difficult process; both Excel and Google Sheets, for example, have a
built-in option to clear formatting.
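When the data arrives as text rather than in a spreadsheet, a similar cleanup can be scripted. The sketch below assumes that "formatting" means leftover HTML-style markup and irregular whitespace, which is only one interpretation; the function name `clear_formatting` and the sample string are hypothetical.

```python
import html
import re

def clear_formatting(text):
    """Strip markup tags, decode HTML entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)   # replace each tag with a space
    unescaped = html.unescape(no_tags)        # e.g. '&amp;' becomes '&'
    return " ".join(unescaped.split())        # collapse runs of whitespace

# Hypothetical cell content copied from a formatted source.
sample = "<b>Sales &amp; Returns</b>\t <i>Q3</i>\n"
cleaned = clear_formatting(sample)

print(cleaned)
```

A regular expression is enough for simple tags like these, but heavily nested or malformed markup would call for a proper HTML parser.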