Your Data Is a Mess. Here's How to Clean It Before Adding AI
A Tampa logistics company fed three years of customer data into an AI tool to predict delivery delays. The tool produced garbage predictions. Not because the algorithm was wrong, but because 47% of the zip codes in the spreadsheet were formatted as numbers instead of text, truncating leading zeros. Every Northeast delivery was mapped to the wrong location.
Most AI projects fail before the AI touches a single record. The data is inconsistent, duplicated, incomplete, or formatted in ways that look fine to humans but break machines. Cleaning it takes a few hours. Skipping it costs weeks of debugging and wrong outputs.
The 2-Hour Data Audit
Before any AI project, spend two hours checking your data. Open your main spreadsheet, CRM export, or database table and run through these five checks. You do not need a data analyst for this. Anyone who has used a spreadsheet can do it.
Check 1: Column headers. Open the file. Do the column headers make sense to someone who did not create the file? "Cust_ID_v2_FINAL" does not count. Rename columns to plain English: "customer_id", "email", "signup_date". AI tools parse headers to understand what each column contains. Ambiguous headers produce ambiguous results.
Check 2: Empty cells. Sort each column and count the blanks. If more than 15% of a column is empty, that column may not be usable for analysis. You have three options: fill in the missing values from your records, drop the column entirely, or flag it so the AI tool knows to ignore those rows.
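If you are comfortable with a little Python, the blank-cell check takes a few lines. This is a sketch, not a finished tool: the rows and column names below are made up to stand in for your own export.

```python
# Sketch: flag columns where more than 15% of cells are blank.
# `rows` is a stand-in for a CSV/CRM export loaded as a list of dicts.
rows = [
    {"customer_id": "1", "email": "a@x.com", "phone": ""},
    {"customer_id": "2", "email": "",        "phone": ""},
    {"customer_id": "3", "email": "b@x.com", "phone": ""},
    {"customer_id": "4", "email": "c@x.com", "phone": "813-555-0100"},
]

def blank_fraction(rows, column):
    """Share of cells in `column` that are empty or whitespace-only."""
    blanks = sum(1 for r in rows if not r[column].strip())
    return blanks / len(rows)

# Columns crossing the 15% threshold may not be usable for analysis.
flagged = [c for c in rows[0] if blank_fraction(rows, c) > 0.15]
print(flagged)  # ['email', 'phone']
```

In this toy data, "phone" is 75% blank and "email" is 25% blank, so both get flagged, matching the 15% rule of thumb above.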
Check 3: Date formats. Dates cause more AI failures than any other data type. Look for "01/02/2025" entries. Is that January 2 or February 1? If your data mixes MM/DD/YYYY and DD/MM/YYYY, the AI will not ask. It will pick one format and misread half your dates. Standardize to YYYY-MM-DD (2025-01-02). This format is unambiguous and sorts correctly.
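The date check can also be scripted. This sketch assumes your source data is in the US MM/DD/YYYY convention; if it mixes conventions, the `is_ambiguous` helper flags the dates a script cannot safely decide on its own.

```python
from datetime import datetime

def standardize(date_str):
    """Convert a US-style MM/DD/YYYY string to ISO YYYY-MM-DD.
    Raises ValueError on anything that does not parse, so bad rows
    surface instead of being silently misread."""
    return datetime.strptime(date_str, "%m/%d/%Y").strftime("%Y-%m-%d")

def is_ambiguous(date_str):
    """True when a slash date could be read as either MM/DD or DD/MM."""
    first, second, _ = date_str.split("/")
    return int(first) <= 12 and int(second) <= 12

print(standardize("01/02/2025"))   # 2025-01-02 (January 2, US reading)
print(is_ambiguous("01/02/2025"))  # True  -- needs a human decision
print(is_ambiguous("01/15/2025"))  # False -- only one valid reading
```

Ambiguous dates should be resolved by a person who knows where the data came from; no script can recover that information after the fact.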
Check 4: Duplicates. Search for duplicate rows. In most CRMs, duplicates come from two sources: someone entered the same customer twice, or the same person submitted a form twice. Duplicates skew averages, inflate counts, and confuse any tool that tracks individuals over time. Use your spreadsheet's "Remove Duplicates" feature or sort by email to find them manually.
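Sorting by email works fine in a spreadsheet; in Python the same duplicate hunt is a one-liner with `Counter`. The addresses below are placeholders.

```python
from collections import Counter

# Case and stray whitespace hide duplicates, so normalize before counting.
emails = ["pat@x.com", "lee@x.com", "PAT@x.com ", "kim@x.com"]
counts = Counter(e.strip().lower() for e in emails)
dupes = [email for email, n in counts.items() if n > 1]
print(dupes)  # ['pat@x.com']
```

Note that "pat@x.com" and "PAT@x.com " only match after lowercasing and trimming, which is exactly why a naive "Remove Duplicates" pass can miss them.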
Check 5: Naming consistency. Search for the same thing spelled multiple ways. "Tampa" vs "Tampa, FL" vs "tampa" vs "Tampa FL 33602." "Cancelled" vs "canceled" vs "CANCELLED." These look identical to a human scanning the column. To an AI tool, they are different categories, and your "Tampa" analysis will miss every record labeled "Tampa, FL."
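Collapsing spelling variants is mechanical once you have listed them. A minimal sketch, using the Tampa example; the variant list is illustrative, and you would build yours by eyeballing the distinct values in your own column.

```python
def normalize_city(value):
    """Map known spelling variants of a city onto one canonical label.
    Unknown values pass through unchanged so nothing is silently lost."""
    canonical = "Tampa"
    variants = {"tampa", "tampa, fl", "tampa fl", "tampa fl 33602"}
    cleaned = value.strip().lower().rstrip(".")
    return canonical if cleaned in variants else value.strip()

print(normalize_city("Tampa, FL"))  # Tampa
print(normalize_city("tampa"))      # Tampa
print(normalize_city("Orlando"))    # Orlando (untouched)
```

The same pattern handles "Cancelled" vs "canceled" vs "CANCELLED": lowercase first, then map the survivors onto one spelling.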
The Three Formats That Break Everything
Phone numbers, zip codes, and currency values cause the majority of formatting failures. Phone numbers and zip codes break because spreadsheets treat them as numbers when they should be text; currency values break because dollar signs and commas keep them from parsing as numbers at all.
Phone numbers lose leading digits and parentheses. (813) 555-1234 becomes 8135551234 or, worse, 8.14E+09 in scientific notation. Store phone numbers as text, always. The same applies to zip codes. 01234 becomes 1234 when stored as a number, and your AI maps that customer to the wrong state.
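If a spreadsheet has already stripped the leading zeros from a column of zip codes, they can be repaired by re-padding to five digits. A sketch, assuming US five-digit zips:

```python
def fix_zip(value):
    """Re-pad a US zip code that lost its leading zeros when a
    spreadsheet stored it as a number. Non-numeric values (ZIP+4,
    foreign postal codes) pass through untouched."""
    digits = str(value).strip()
    return digits.zfill(5) if digits.isdigit() else digits

print(fix_zip(1234))    # 01234 -- the truncated Northeast zip, restored
print(fix_zip("33602")) # 33602 -- already fine, left alone
```

This only works because US zips are fixed-width; truncated phone numbers cannot be repaired the same way, which is why they must be stored as text from the start.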
Currency values get messy when some cells have dollar signs and others do not, or when commas appear inside numbers. "$1,250.00" and "1250" represent the same amount but parse differently. Strip all currency symbols and commas. Store raw numbers, format them for display later.
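Stripping currency formatting is a one-line regex. This sketch assumes plain US-formatted dollar amounts; it is not a general money parser.

```python
import re

def clean_currency(value):
    """Strip currency symbols, commas, and spaces; return a raw number.
    Keeps digits, the decimal point, and a minus sign."""
    return float(re.sub(r"[^0-9.\-]", "", str(value)))

print(clean_currency("$1,250.00"))  # 1250.0
print(clean_currency("1250"))       # 1250.0 -- both cells now agree
```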
Deduplication Without Losing Data
Deleting duplicates sounds simple until you realize both records might have useful information. Customer A submitted a form in January with their phone number. They submitted again in March with their email. Neither record is complete. Delete one and you lose data.
The safe approach: merge before deleting. Sort by email or phone number. For each duplicate pair, keep the record with the most complete data and copy missing fields from the other. Then delete the incomplete version. This takes longer than clicking "Remove Duplicates" but preserves information that might matter when the AI tool starts analyzing patterns.
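The merge-before-delete step can be sketched in a few lines. This assumes simple flat records with string fields; the January/March example from above is recreated with placeholder values.

```python
def merge_records(a, b):
    """Merge two duplicate records field by field: keep the non-blank
    value, preferring record `a` when both are filled."""
    return {k: (a[k].strip() or b.get(k, "").strip()) for k in a}

jan = {"email": "pat@x.com", "phone": "813-555-0100", "company": ""}
mar = {"email": "pat@x.com", "phone": "",             "company": "Acme"}

merged = merge_records(jan, mar)
print(merged)  # phone kept from January, company filled in from March
```

After merging, only the combined record survives; deleting either original no longer loses anything.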
How Much Data Do You Need?
Less than you think for most small business use cases. An AI tool analyzing customer feedback works with 50 reviews. A lead scoring model needs 200-500 past leads to find patterns. A chatbot trained on your FAQ needs your 20 most common questions and answers.
The exception is prediction. If you want AI to forecast sales, predict churn, or project inventory needs, you need 12-24 months of historical data at minimum. Less than that and the model picks up noise instead of signal. Two years of monthly data gives the model enough cycles to distinguish seasonal patterns from one-time events.
If you have less data than you need, start collecting it now. Set up consistent tracking in your CRM or spreadsheet, and come back to AI in six months. A data strategy does not require AI tools. It requires discipline about what you record and how you record it.
After the Cleanup
Once your data is clean, test it before connecting AI tools. Open the business data insights demo and upload a sample. If the AI produces a coherent summary with accurate numbers, your data is ready. If the summary contains obvious errors or misinterpretations, you have a formatting issue that still needs fixing.
The biggest mistake businesses make with messy data is assuming the AI tool will clean it up for them. Some tools handle minor inconsistencies. None handle the kind of structural mess that three years of unmanaged spreadsheets create. Two hours of cleanup saves two months of wrong answers.