ENH: Functionality to aid with Database Imports #61247

Open · 2 of 3 tasks
mwiles217 opened this issue Apr 7, 2025 · 0 comments

Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

mwiles217 commented Apr 7, 2025

Feature Type

  • [x] Adding new functionality to pandas

  • [x] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

I would like the following features to aid in database imports:

  1. Unicode/non-Unicode identification for columns.
  2. Max length of each column, including a correction for the inaccurate length you get when a multi-value cell (like a list of states) is converted and then saved by the DataFrame.
  3. Creation of the CREATE TABLE statement and supporting statements.
  4. Creation of the BCP file (tab-delimited, with some caveats), its supporting FMT file, and the command-line execution.
  5. Replacement of certain characters in the DataFrame that prevent import (namely \r, \n, \t) in the data-load step, where it may be faster than running a regex later (see the one-line sketch after this list).
  6. Renaming of columns by stripping out certain characters or replacing them, similar to the way R has a rename-all-columns function.
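
A minimal sketch of point 5, reusing the same replacement the code further below performs at export time (the proposal is to do it during the load itself):

import pandas as pd

# Hedged sketch: neutralize control characters so a later tab-delimited
# export can't be broken by embedded \t, \n, or \r.
df = pd.DataFrame({"notes": ["line1\nline2", "col1\tcol2"]})
df = df.replace(to_replace=r"\t|\n|\r", value=" ", regex=True)
print(df["notes"].tolist())  # ['line1 line2', 'col1 col2']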

I have code written that does most of this. For context, I used a DataFrame as a key step for importing data into a SQL Server database at a rate of about 2.5 GB per hour, with the philosophy of treating all columns as strings and converting them once in the database, since otherwise important leading zeroes could be dropped (as in routing numbers or other custom indicators). Also note that the methodology was to import into SQL Server using bcp, which essentially consumes a tab-delimited file. The code provided is not directly the code I used, as I lost access to it; it is a recreation and major refactoring that makes it simpler.
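
To make the leading-zero point concrete, a minimal demonstration (the data is made up):

import pandas as pd
from io import StringIO

data = "routing\n012345678\n"
pd.read_csv(StringIO(data))["routing"].iloc[0]             # 12345678 -- inferred as int, zero lost
pd.read_csv(StringIO(data), dtype=str)["routing"].iloc[0]  # "012345678" -- preserved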

  1. Build into, perhaps, describe() whether or not a column contains Unicode characters, so you know whether to make fields VARCHAR or NVARCHAR.
  2. Build into, perhaps, describe() the max length of each column. NOTE: there is a discrepancy between the max length in a DataFrame and the length after the DataFrame is written back to a file, in edge conditions involving multi-value columns like a list of US states. The work-around that works 99% of the time was to multiply the length by 1.3 and then round up to the nearest 100 (math.ceil after dividing by 100, then multiply by 100); see the sketch after this list.
  3. Numbers 1 and 2 should live in the same spot so they are easily consumable when scripting the rest of your own solution.
  4. Perhaps, when loading the DataFrame, have an option to replace certain characters to get it ready for import; it could probably be optimized there and run faster than running a regex after the fact. I'm talking about replacing \r, \n, and \t with a space, or perhaps {n} and {t} respectively, so they can easily be put back after import.
  5. A built-in column-rename functionality that strips or replaces bad characters from column headers while ensuring uniqueness. For example, ( could be stripped, but @ you may want to replace with "at", and Unicode you may want stripped. Essentially, make it so you don't need [] around field names in SQL Server scripts for those columns. Perhaps this function could take options, since many may want renaming done differently. Expose the replace functionality so it can also be used standalone, e.g. for naming a table from its filename.
  6. The above, I think, could provide good building blocks for people to script the rest themselves, with the grunt work completed.
  7. Perhaps auto-add the filename without extension, as well as the row number, when opening or saving, since sometimes you need them for debugging, or you literally need to reference the previous or next row.
  8. Have the ability to draft a CREATE TABLE statement from the information above and save it to a .sql file, optionally adding an ID column, with easy output to a file. Also add renaming of the table if it already exists, by appending its creation timestamp (including milliseconds) and then transferring it to a different schema; extra credit for creating the schema if it doesn't already exist. That part assists with auto-complete tools and with tracking changes to the data, which was helpful for some disputes. Also make sure to strip naughty characters from the table name, but use the filename without extension as the default table name.
  9. Have the ability to create a bcp FMT file, which is a mapping between the table and the file.
  10. Save the appropriate commands to a .bat file for executing a bcp import as well as running the CREATE TABLE statement.
  11. Perhaps a to_bcp option that, in addition to the several points above, also creates the tab-delimited, headerless file, ensuring that things like tabs, newlines, and form feeds are replaced.
  12. Perhaps, per the last several points, the to_bcp function can auto-create the other needed files when called.
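
As a concrete sketch of the length work-around in item 2 (the function name is mine, not an existing pandas API):

import math

def padded_length(max_len: int) -> int:
    # Pad the observed max length by 1.3x, then round up to the nearest 100,
    # with a floor of 100, per the 99%-of-the-time work-around above.
    padded = int(max_len * 1.3)
    return 100 if padded <= 100 else math.ceil(padded / 100) * 100

padded_length(57)   # -> 100
padded_length(250)  # -> 400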

Feature Description

The below is working code that does 95% of what I requested above, so this request ultimately isn't for me, but for the community.

Things requested above that are omitted from the code below:

  1. Folding into the outputted .sql file the creation of a backup schema if it doesn't exist, then renaming the same object if found by appending the creation date of the table (including milliseconds) and transferring the table to that schema (a sketch of such SQL follows this list).
  2. Folding an auto-ID column into the CREATE TABLE, with the appropriate changes to the FMT file.
  3. Folding in the addition of two helpful columns: a) the filename without extension, and b) the row number within the file.
  4. Expansion of the column rename to do smarter replacements instead of just stripping characters (like replacing @ with "at"), and then checking that all column names are still unique.
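
For omitted item 1, a hedged sketch of the T-SQL such a feature might emit (the helper name and the "backup" schema are assumptions, not part of the code below):

from datetime import datetime

def backup_rename_sql(table: str, schema: str = "backup") -> str:
    # Hypothetical helper: creates the backup schema if missing, renames an
    # existing dbo table by appending a millisecond timestamp, and transfers
    # it to that schema.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")[:-3]
    return (
        f"IF SCHEMA_ID(N'{schema}') IS NULL EXEC('CREATE SCHEMA [{schema}]');\n"
        f"IF OBJECT_ID(N'dbo.[{table}]') IS NOT NULL\n"
        "BEGIN\n"
        f"    EXEC sp_rename 'dbo.{table}', '{table}_{stamp}';\n"
        f"    ALTER SCHEMA [{schema}] TRANSFER dbo.[{table}_{stamp}];\n"
        "END;"
    )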

Also, this code generates what appears to be acceptable output, but I haven't tested it with an actual database import into SQL Server.

import pandas as pd
import numpy as np
import os, re, uuid
import math
from typing import List, Dict

# Matches any non-ASCII character (used to flag Unicode content).
rx_unicode_str: str = r"[^\x00-\x7F]"
rx_unicode = re.compile(rx_unicode_str)

rx_space_str: str = r"\s"
rx_space = re.compile(rx_space_str)

# Collapse runs of underscores left behind by the substitutions.
rx_underscore_str: str = "_{2,}"
rx_underscore = re.compile(rx_underscore_str)

# Characters stripped from identifiers, written as a character class so the
# regex specials don't need alternation or escaping one by one.
rx_strip_chars_str = r"""[!@#$%^&*(){}\[\].\|;:'",<>?=+]"""
rx_strip_chars = re.compile(rx_strip_chars_str)
class column_info:
    def __init__(self, arg_column_name):
        self.guid: str = str(uuid.uuid4())
        self.column_name: str = arg_column_name
        self.column_name_orig: str = arg_column_name
        self.max_length: int = 0
        self.max_length_fixed: int = 0
        self.has_unicode: bool = False
        self.sql_max: bool = False
        self.ColumnIndex_1Based: int = 0
        self.last_column: bool = False

    def as_create_table(self):
        # One column definition of the CREATE TABLE statement.
        datatype: str = "NVARCHAR" if self.has_unicode else "VARCHAR"
        comma: str = "," if self.ColumnIndex_1Based > 1 else ""
        data_length: str = "MAX" if self.sql_max else str(self.max_length_fixed)
        return "{c}[{name}] {t}({l})".format(c=comma, name=self.column_name, t=datatype, l=data_length)

    def as_fmt_file(self):
        # One row of a non-XML bcp format file, e.g.:
        # 1     SQLINT               0 4          "\t"     1      "ID"    ""
        datatype: str = "SQLNCHAR" if self.has_unicode else "SQLCHAR"
        data_length: str = str(self.max_length_fixed)
        if self.sql_max:
            # Format files need a fixed length, so MAX columns fall back to
            # the widest non-MAX length.
            data_length = "4000" if self.has_unicode else "8000"
        idx: str = str(self.ColumnIndex_1Based).ljust(6)
        col_type: str = datatype.ljust(20)
        datalen: str = data_length.ljust(10)
        name = ('"' + self.column_name + '"').ljust(75)
        # Terminators are written as literal \t / \r\n text; bcp interprets
        # the escapes itself.
        sep: str = r"\t" if not self.last_column else r"\r\n"
        sep = ('"' + sep + '"').ljust(8)
        # The server column order (idx again) sits between the terminator
        # and the column name, per the example row above.
        return '{idx}{col_type} 0 {datalen} {sep} {idx} {name} ""'.format(
            idx=idx, col_type=col_type, datalen=datalen, sep=sep, name=name)

    def as_dict(self):
        # Serialize attributes, skipping any listed for exclusion.
        return {k: v for k, v in self.__dict__.items() if k not in ["exclude_me"]}

class csv_info:
    def __init__(self, arg_filename: str):
        self.Database: str = "myDB"
        self.Server: str = "myServer"
        self.UserName: str = "myUser"
        self.Password: str = "myPass"
        self.filename: str = arg_filename
        self.output_directory: str = ""
        self.bcp_filename: str = ""
        self.fmt_filename: str = ""
        self.table_name: str = ""
        self.parent_directory: str = ""
        self.filename_with_extension: str = ""
        self.filename_wo_extension: str = ""
        self.file_extension: str = ""
        self.parent_directory, self.filename_with_extension = os.path.split(arg_filename)
        self.filename_wo_extension, self.file_extension = os.path.splitext(self.filename_with_extension)
        self.table_name = self.fix_name(self.filename_wo_extension)
        self.change_output_directory(self.parent_directory)

        # All columns as strings so leading zeroes survive the load.
        self.df: pd.DataFrame = pd.read_csv(arg_filename, dtype=str)
        self.Columns: List[column_info] = []
        max_lengths = self.df.apply(lambda x: x.astype(str).str.len().max())
        column_index: int = -1
        for col in self.df.columns:
            column_index += 1
            new_col = column_info(col)
            new_col.max_length = int(max_lengths.iloc[column_index])
            # Pad by 1.3x, then round up to the nearest 100 (floor of 100).
            new_col.max_length_fixed = int(new_col.max_length * 1.3)
            new_col.max_length_fixed = 100 if new_col.max_length_fixed <= 100 else int(math.ceil(new_col.max_length_fixed / 100) * 100)
            new_col.has_unicode = bool(self.df[col].str.contains(rx_unicode, regex=True, na=False).any())
            new_col.sql_max = new_col.max_length_fixed >= 8000 or (new_col.max_length_fixed >= 4000 and new_col.has_unicode)
            new_col.ColumnIndex_1Based = column_index + 1
            new_col.last_column = len(self.df.columns) == (column_index + 1)
            self.Columns.append(new_col)
        self.fix_column_names()
        self.to_bcp()

    def change_output_directory(self, arg_output_directory: str):
        self.output_directory = arg_output_directory
        self.bcp_filename = os.path.join(self.output_directory, self.filename_wo_extension + ".bcp")
        self.fmt_filename = os.path.join(self.output_directory, self.filename_wo_extension + ".fmt")

    def as_create_table(self):
        create_table: str = "CREATE TABLE [{t}](\n".format(t=self.table_name)
        col: column_info
        for col in self.Columns:
            create_table += col.as_create_table() + "\n"
        of = os.path.join(self.output_directory, self.filename_wo_extension + ".sql")
        with open(of, "w", encoding="utf-8") as f:
            f.write(create_table + ")")

    def as_fmt_file(self):
        # Non-XML bcp format file: version line, column count, one row per column.
        fmt_file: str = "14.0\n{l}\n".format(l=str(len(self.df.columns)))
        col: column_info
        for col in self.Columns:
            fmt_file += col.as_fmt_file() + "\n"
        with open(self.fmt_filename, "w") as f:
            f.write(fmt_file)

    def to_bcp(self):
        # Write the CREATE TABLE script first so the .bat's sqlcmd step has
        # something to run, then the fmt file and the batch file itself.
        self.as_create_table()
        self.as_fmt_file()
        of: str = os.path.join(self.output_directory, self.table_name + ".bat")
        with open(of, "w", encoding="utf-8") as f:
            f.write(self.sql_cmd() + "\n")
            f.write(self.bcp_import_cmd() + "\n")
        # Replace tab/newline/carriage return so they can't break the
        # tab-delimited bcp data file.
        self.df.replace(to_replace=r"\t|\n|\r", value=" ", regex=True, inplace=True)
        self.df.to_csv(path_or_buf=self.bcp_filename, sep="\t", index=False, header=False)

    def fix_name(self, val: str):
        ret: str = val.strip().replace("-", "_")
        ret = rx_unicode.sub("", ret)
        ret = rx_space.sub("_", ret)
        ret = rx_underscore.sub("_", ret)
        ret = rx_strip_chars.sub("", ret)
        return ret

    def fix_column_names(self):
        col: column_info
        for col in self.Columns:
            col.column_name = self.fix_name(col.column_name)

    def bcp_import_cmd(self):
        f_error = os.path.join(self.output_directory, self.filename_wo_extension + "_import_errors.txt")
        # -C 65001 imports the UTF-8 data file; SQL auth flags are used, so
        # the trusted-connection flag (-T) is omitted.
        bcp_str: str = 'bcp {db}.dbo.{t} in "{f_bcp}" -f "{f_fmt}" -C 65001 -S {server_name} -U {username} -P {password} -e "{f_error}"'.format(
            db=self.Database, t=self.table_name, f_bcp=self.bcp_filename, f_fmt=self.fmt_filename, f_error=f_error,
            server_name=self.Server, username=self.UserName, password=self.Password)
        return bcp_str

    def sql_cmd(self):
        # Run the .sql file written by as_create_table (same base name).
        f = os.path.join(self.output_directory, self.filename_wo_extension + ".sql")
        ret: str = 'sqlcmd -S {s} -U {u} -P {p} -i "{f}"'.format(s=self.Server, u=self.UserName, p=self.Password, f=f)
        return ret

input_file:str=r"C:\data\LargeCSVFile\customers-2000000.csv"
cv=csv_info(input_file)
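
A hypothetical run over a two-column CSV (Name, then City as the last column) writes the .sql, .fmt, .bat, and tab-delimited .bcp files next to the source file; the resulting .fmt file would start roughly like this (whitespace compressed for readability, widths illustrative):

14.0
2
1  SQLCHAR  0 100  "\t"    1  "Name"  ""
2  SQLCHAR  0 200  "\r\n"  2  "City"  ""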

Alternative Solutions

See the previous section for the alternative solution of custom-written code.

Additional Context

No response
