Friday, December 12, 2008

Importing and Exporting Data with SQL Server

Authors: Kevin Cox and William Jones
Published: July 1997
Copyright: 1997
Publisher: 29th Street Press



Abstract
Transferring data, whether into or out of a database, can be a complex process. SQL Server provides several tools that help you in a variety of situations, including the Bulk Copy Program (BCP), the Select statement, the Transfer Manager, and saved query results. This chapter examines those options and the situations in which each is particularly useful.




THE BCP UTILITY

When you need to load a large amount of data from or into an external file, the BCP utility can be helpful. BCP is a command-line utility that can read and write files in both ASCII format and in native format. (Native format uses variable-length records with field type and length descriptors embedded in them.) BCP using native format is the fastest way to transfer data between databases in SQL Server or Sybase SQL Server. If you use BCP with ASCII files, the ASCII files can be generated as fixed-length records or delimited files, and you can choose the delimiters.

Although BCP has been supplanted in many of its SQL Server data transfer roles, it still is an efficient way to load data generated from any one of a variety of platforms and provide output that can be used on nearly every platform. BCP can quickly load a table without activating triggers and, under the right conditions, can operate without transaction logging. The Transfer Manager, which was added in SQL Server 6.0, eliminated the need to use BCP when transferring data between two SQL Servers connected by a network or between two databases on the same server. (Our discussion of the Transfer Manager begins on page 272.) However, until the addition of removable media databases to SQL Server, BCP was the only standard way to get data to remote locations (locations with no wire connection).

BCP Parameters
The command line for BCP is complicated. Improper use of the options can cause BCP to fail and create significant headaches. BCP runs in an interactive command-line mode, which lets you specify the format of the table to dump if a format file is not provided for export, but other options default unless they are specified on the command line. Knowing when an option can be left out is as important as knowing what the defaults are for other values. We describe the command line in detail here to try to prevent your frustration when you first attempt to use it. Be aware that the options are case-sensitive. Note also that the login information comes at the end, whereas in most other utilities it is the first thing you supply.

bcp [[database_name.]owner.]table_name {in | out} datafile
[/m maxerrors] [/f formatfile] [/e errfile]
[/F firstrow] [/L lastrow] [/b batchsize]
[/n] [/c] [/E]
[/t field_term] [/r row_term]
[/i inputfile] [/o outputfile]
/U login_id [/P password] [/S servername] [/v] [/a packet_size]
  • [[database_name.]owner.]table_name — The table that data will be imported to or exported from. If the user’s ID does not have the desired database as its default, the database name must be specified. If the user ID supplied does not own the table, the owner must be specified and the user must have the appropriate permission. To import data into a table, the user must have Insert permission on the table. To export data from a table, the user must have Select permission on the table and on several of the system tables (sysobjects, sysindexes, and syscolumns). Usually the DBO or SA performs this task, in which case, permissions are not an issue.
  • {in | out} — Specifies the direction of the data transfer, into the database table (in) or out of the database table (out).
  • datafile — The physical location of the file to import from or export to. The file must be in the path on the machine executing BCP, or you must specify it with a complete path.
  • [/m maxerrors] — The number of errors per batch to ignore before stopping. The default value is 10. If you expect errors, you can raise the maximum number of errors to ignore to let the non-error data load completely.
  • [/f formatfile] — The physical location of a BCP format file to be used for export or import. You must use a format file if you are not using a native format or default character format (see the /n and /c parameters, below). This parameter must be supplied with a complete path. When you export data, the format file is used to create the required format. When you import a file, the format file specifies the fields in the data file and the columns that the data fields are to be imported into. If you do not specify the file when you’re exporting data, BCP will run interactively and create the format file you describe. If a format file is specified during import and the file does not exist, BCP will generate an error.
  • [/e errfile] — The physical location of a file to which BCP should write error messages and records with errors. You should specify the complete path. The default path is the directory from which the BCP executable is run.
  • [/F firstrow] — The row on which the import or export is to start. This parameter is particularly useful if you want to skip headers or break up large files into smaller batches.
  • [/L lastrow] — The row on which to stop the import or export. This parameter is useful for very large loads that need to be broken up.
  • [/b batchsize] — The number of records to load between committing transactions. By default, all the records are inserted in one transaction. If this parameter is not set appropriately for the job, the load can cause transaction logs to become full (see the discussion about fast and slow BCP on page 266). This setting works together with the maximum error count to determine when a load aborts.
  • [/n] [/c] — Native format or character format. These two options are mutually exclusive; you can specify a format with the /f parameter instead of using either. They are discussed in more detail on page 265.
  • [/E] — Preserves identity values, allowing files containing rows with identity column information to be inserted with their existing identity values. Otherwise, the identity column is set according to the seed and increment defined at the time of creation. If you don’t want to preserve the identity column, it is best to use the format file to leave it out of the data export.
  • [/t field_term] — The field delimiter; tab (\t) is the default. The most common delimiters are the double quote (") and the comma (,), but be aware that a comma-delimited load will fail if the data itself contains commas. We recommend that you choose a character that you are sure is not in the data, such as the pipe (|) or tilde (~), if the source of the data supports it.
  • [/r row_term] — The row terminator, typically a carriage return/linefeed (\r\n).
  • [/i inputfile] — A file to replace standard input. You can use the file to provide the information BCP requests interactively.
  • [/o outputfile] — The physical path to the file to which standard output should be redirected. The file logs information such as the number of rows and other non-error output.
  • /U login_id — A valid login ID and user ID for the database(s) the tables are being imported to or exported from. The system administrator login and ID are used by default.
  • [/P password] — The correct password for the above user ID. The BCP utility will prompt for a password if none is provided.
  • [/S servername] — Server name to connect to. The default is the local server. BCP is faster if you use it locally instead of across the network.
  • [/v] — Gives the current version information for BCP. This information is important for the format files.
  • [/a packet_size] — The network packet size to be used in transferring data. Values can be between 512 and 65535. Larger packets can improve performance for larger BCP data transfers; Microsoft recommends a value between 4096 and 8192 for SQL Server running on Windows NT, and the default on Windows NT is now 4096. The packet size can be set for SQL Server under the Configurations menu and overridden on a case-by-case basis through BCP. BCP reports the network packet size used, the number of rows processed, and the rate of processing.
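
Putting several of these options together, a typical character-mode import might look like the following sketch (the database, table, and file names are invented for illustration). The /b and /m settings commit the load in 1,000-row batches and tolerate up to 50 bad rows, which are written to the error file along with their error messages.

bcp mydb..sales in sales.dat /c /t , /r \n /b 1000 /m 50 /e sales.err /Usa /Ppassword /Sservername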

The BCP Format File
The most important aspect of using BCP is the format file. The format file defines how data is stored outside the database and in which columns it belongs when it is loaded back into the database. BCP always needs format information even if you don’t want to save the format afterward.

If no format file is specified, neither /n nor /c is indicated, and you’re exporting data, BCP will query you for field types, prefix lengths, and the field delimiter. Each question is followed by a default value in brackets. These defaults describe a native-format dump, which writes numeric data in binary form. If you prefer an ASCII dump, so that numeric data is exported as strings, specify character (“c” or char) as the type and a prefix length of 0. If the type is given as “c,” BCP will ask for the length in characters of every field, including the fields holding numbers.

The BCP format file is itself an ASCII character file with a very specific layout, as shown in Figure 14.1. The version number appears on the first line of the file. This version number should match the version of SQL Server that you’re loading data into (you still use 6.0 as the value if you’re using either 6.0 or 6.5). You can load data from version 4.21 to version 6.5 in most cases and, using ASCII files, usually into 4.21 from either 6.0 or 6.5. The number of columns appears on the next line of the BCP format file.

The rest of the file contains the description for each column in the data file. It gives the order the columns appear in the data file, the host file data type, the prefix length (0 for ASCII files and fixed-length binary files), the host file data length, the field terminator, and the number and name of the column in the table the field is to be inserted into. You can change the order of the columns, on either export or import, by editing this file, as long as the data types described in the format file accurately represent the data types in both the data file and the database table. The column name does not need to match the column name in the database, but it cannot be left blank.

For example, the following format file exports data from a table, interchanging the second and third columns.

6.0
4
1 SQLDATETIME 0 8 "\t" 1 ErrorDate
2 SQLCHAR 0 255 "\t" 3 ErrorMsg
3 SQLCHAR 0 255 "\t" 2 ErrorCaption
4 SQLCHAR 0 255 "\n" 4 ErrorClass
If the same format file is used on import, the columns show up in the correct order in the table. If the format file is instead changed to the following for the import, the resulting rows will have the second and third columns interchanged.

6.0
4
1 SQLDATETIME 0 8 "\t" 1 ErrorDate
2 SQLCHAR 0 255 "\t" 2 ErrorMsg
3 SQLCHAR 0 255 "\t" 3 ErrorCaption
4 SQLCHAR 0 255 "\n" 4 ErrorClass
Fast BCP
BCP operates in one of two ways, depending on the options set in the database and whether you have indexes in the table. If the Select Into/Bulk Copy database option is set to true and the table being loaded with BCP has no indexes, BCP will run in fast mode, which means that BCP will not log transactions. Not logging transactions not only increases performance but also eliminates the need to break up large data loads so the transaction log can be cleared. (However, large batches can fail if the file causes too many errors, so breaking the data into smaller batches is still recommended if problems are encountered during the load.) If the table has indexes, BCP will operate in slow mode even when the Select Into/Bulk Copy database option is set to true. In slow mode, the transactions for the inserts are logged as normal inserts.

If it is not feasible to perform a non-logged BCP, you have an option besides the BCP command-line switches for breaking up your data loads: perform the non-logged load into a table in a separate database, then use Insert statements to move the data in ranges. You may need to dump the transaction log during the load operation. However, remember that a normal transaction log dump writes only completed transactions to the disk dump; if you are loading a large amount of data in one batch, dumping the transaction log while loading does not guarantee that the log will not fill up and kill the load. To execute a large data load and still log the transactions, break the load into batches using the first-row (/F) and last-row (/L) options on the BCP command line and dump the transaction log between batches.
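
As a sketch of this batching technique (the database, table, and file names are invented), a 100,000-row file could be loaded in two logged batches with the log dumped in between. The Dump Transaction statement is run from ISQL; With Truncate_Only empties the log without saving it, so dump to a device instead if you need the log for recovery.

bcp mydb..big_table in big.dat /c /F 1 /L 50000 /Usa /Ppassword
DUMP TRANSACTION mydb WITH TRUNCATE_ONLY
bcp mydb..big_table in big.dat /c /F 50001 /L 100000 /Usa /Ppassword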

Generally speaking, in cases of mass data transfer between SQL Servers it is faster and less problematic to drop the table indexes, set the Select Into/Bulk Copy flag to true, and load the data using fast-mode BCP. You can then reset the flag, dump the database, and rebuild the indexes once the load has been performed.
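
In outline, the sequence looks like this (the database, table, index, and dump device names are invented; the bcp line is run from the command prompt between the two groups of SQL statements):

sp_dboption mydb, 'select into/bulkcopy', true
DROP INDEX big_table.idx_big_table

bcp mydb..big_table in big.dat /n /Usa /Ppassword

CREATE CLUSTERED INDEX idx_big_table ON big_table (key_col)
sp_dboption mydb, 'select into/bulkcopy', false
DUMP DATABASE mydb TO dumpdevice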

Defaults, Datatypes, Rules, Constraints, and Triggers
Several unique cases can present problems when using BCP. Using BCP to transfer data between SQL Server tables that have defaults bound to columns can present a problem. When BCP loads fields that have defaults, it replaces null values with the default value, which is often what you want. However, in cases where a default exists to supply a value when none is given and the column is expected to hold the null value when it is explicitly set, BCP will substantially change the data. If you want the column to hold the null value, drop the default value and add it again after the load.

Loading a table with BCP will not execute triggers or apply any rules or foreign key constraints. If you want data in the table to conform to constraints or triggers, you might want to load a secondary table and then validate that the data satisfies the constraints. You can then use Insert statements to insert the data, which will execute the relevant triggers.
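
For example (the table and column names here are invented), after BCP fills a staging table, an Insert statement moves only the rows that pass validation, and the insert fires the destination table’s triggers and checks its constraints:

INSERT INTO orders (order_id, cust_id, amount)
SELECT order_id, cust_id, amount
FROM orders_staging
WHERE cust_id IS NOT NULL AND amount >= 0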

Common BCP Mistakes
The most common mistake when using BCP is to use slow-mode loads on very large data sets. Not only does a slow-mode load take a lot longer than a fast-mode load, but the transaction log often fills and causes the load to fail. For large loads, dropping indexes and re-creating them at the end of the BCP is usually faster than updating them during the load. However, it takes time and space within the database to recreate indexes, so you should monitor which method is faster. If you are continually adding data to your table, at some point it is just as fast to do the logged version as the fast-mode load. You need to keep track of the times and resources involved to see if this happens to you. (And then don’t forget to update the index statistics if you are not rebuilding the indexes.)

Another common mistake is failing to preserve or create the format file when the data is dumped. Trying to create a valid format file from scratch can be tricky. If the original source exists, performing a second dump on a very small number of rows (say, one row, by setting the last row command-line option to 1) is an excellent way to generate a format file. If you’re loading a file from an external source, it is often best to load the complete record layout into a secondary table designed to match the record structure and then use SQL to manipulate the data, eliminating values or certain fields as desired. This approach is usually easier than trying to use the format file to selectively load fields.
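
For example, the following command (the output file name is invented) dumps a single row interactively because no format options are specified; answer the prompts to describe the layout, and BCP offers to save your answers as a format file for later runs:

bcp pubs..publishers out publ_one.dat /L 1 /Usa /Ppassword /Sservername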

Some examples of BCP command lines and the resulting data files are below.

BCP Example 1
The following command line specifies a default character-mode dump using tabs to delimit the fields and carriage return/line feeds to terminate the rows. The resulting data file is shown below.

bcp pubs..publishers out publ_out /c /Sservername /Usa /Ppassword

0736 New Moon Books Boston MA USA
0877 Binnet & Hardley Washington DC USA
1389 Algodata Infosystems Berkeley CA USA
1622 Five Lakes Publishing Chicago IL USA
1756 Ramona Publishers Dallas TX USA
9901 GGG&G München Germany
9952 Scootney Books New York NY USA
9999 Lucerne Publishing Paris France
BCP Example 2
This command line produces a character file with comma-delimited fields and a linefeed at the end of each row. The resulting data file is shown below.

bcp pubs..publishers out publ_out /c /t , /r \n /Sservername
/Usa /Ppassword

0736,New Moon Books,Boston,MA,USA
0877,Binnet & Hardley,Washington,DC,USA
1389,Algodata Infosystems,Berkeley,CA,USA
1622,Five Lakes Publishing,Chicago,IL,USA
1756,Ramona Publishers,Dallas,TX,USA
9901,GGG&G,München,,Germany
9952,Scootney Books,New York,NY,USA
9999,Lucerne Publishing,Paris,,France


THE SELECT STATEMENT

In many cases, it is not necessary to use BCP to transfer data between SQL Servers. The Into option of the Select statement, and the Insert statement used in conjunction with a Select statement, are two powerful methods of transferring data and structure between databases and servers.

Adding the Into clause to a Select statement lets you create a new permanent or temporary table. This method is good for straightforward table replication. This syntax creates the table and then inserts the data specified in the Select statement into the newly created table. The table you create with the Select Into statement must not currently exist in the database.

You can also write more complex queries to create a new structure using fields from various tables. You can add fields to the structure by creating variables of the correct type and using them to augment the Select field variables.

You can insert the values into the new table either when you specify the new structure or later by using the Insert statement with the exact Select statement used to create the structure.

The Select statement and its parameters are explained below.

SELECT [ALL | DISTINCT] select_list
[INTO [new_table_name]]
[FROM {table_name | view_name} [(optimizer_hints)]
[, {table_name2 | view_name2} [(optimizer_hints)]
[..., {table_name16 | view_name16} [(optimizer_hints)]]]]
[WHERE clause]
[GROUP BY clause]
[HAVING clause]
[ORDER BY clause]
When you use variables to add fields in a Select Into statement, the added fields must be given explicit column headers if more than one variable is used. If no column header is given, the column is created with a null name. Because each column within a table must have a unique name, using two or more variables without assigning them column headers will cause the Select Into statement to fail. In any case, creating a table with a null column name is a bad idea and is not recommended.
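
For example (the destination table and variable names are invented), each variable below is given an explicit column header, so the new table’s column names are unique:

DECLARE @region char(2), @load_date datetime
SELECT @region = 'CA', @load_date = GETDATE()
SELECT pub_id, pub_name, region = @region, load_date = @load_date
INTO pub_snapshot
FROM publishers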

Using Select Into to replicate data is a two-step process. First the table is created, then any rows that qualify are inserted. You can insert a phony Where clause to create only the table, which is a useful strategy if you want to transfer the data at some other time. For example:

SELECT T1.field1, T2.field2, T3.field3, field4 = @variable1
INTO NewTable
FROM Table1 T1, Table2 T2, Table3 T3
WHERE 1 = 2
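
When you are ready to move the rows, use the same Select statement without the phony Where clause (the join condition shown here is illustrative):

INSERT INTO NewTable
SELECT T1.field1, T2.field2, T3.field3, @variable1
FROM Table1 T1, Table2 T2, Table3 T3
WHERE T1.key1 = T2.key1
AND T2.key1 = T3.key1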
You can create permanent objects with Select Into if the database configuration option Select Into/Bulk Copy is set to true. This option is required because, like a fast BCP, the Select Into statement is a non-logged operation. Once a Select Into has been performed on a database, you should do a full database dump if you want to recover the data.

The Select Into statement can also create global and local temporary tables if you specify a temporary table that is not currently defined. They can be created regardless of the value of the Select Into/Bulk Copy flag.
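
For example, either of the following statements creates a temporary copy of the publishers table; a single # creates a local temporary table, and a double ## creates a global one:

SELECT * INTO #publishers_copy FROM publishers
SELECT * INTO ##publishers_copy FROM publishers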

The ability of the Select Into statement to copy and preserve data structure on the database is very useful when you’re developing or fixing a database. In cases where you need to save test data in its original state to be used again, the Select Into option saves time setting up the test. If you are performing large updates to correct a problem, you can preserve the current state and revert to it if the update does not resolve the problem. In general, you should precede any data adjustment to a database on a production system by creating a duplicate of all current data, either through a data dump or through Select Into and Insert statements.

Using the Insert statement with the Select statement lets you copy data from one table to another and have defaults, rules, triggers, and constraints executed during the copy. This feature is very useful for testing and for data loads. The fields in the Insert and Select statements must match in number and type, but the names of the fields do not need to be the same. Note that the table you’re inserting data into must already be defined for this to work, whereas with the Select Into statement, the table CANNOT already exist.

The Insert statement can assume two formats. The first format allows inserts with stated values. These values can be forced to defaults or can be specified using a Values statement. The second format lets you use a query to specify the data to be inserted. The two formats are as follows:

INSERT [INTO]
{table_name | view_name} [(column_list)]
{DEFAULT VALUES | VALUES (values_list) | select_statement}

INSERT [INTO]
{table_name | view_name} [(column_list)]
SELECT {field_list}
[FROM {table_list}]
[WHERE {where_clause}]
If the (column_list) parameter is left out of the Insert statement, SQL Server takes the whole table, in column ID order, as the default. The Default Values option inserts the default values into the columns of the target table. Timestamp and identity fields take on their next appropriate value. Fields that accept nulls and have no default defined are set to null. If a field in the table has no default and does not accept nulls, the Insert statement will fail. The Values clause lets you insert variables and fixed values and can be used in scripts to load data. The Default keyword can be used in the Values list in place of an actual value if the column has a default or allows nulls.
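
For example, reusing the error table from the format-file example earlier (its definition is assumed; every column must have a default or accept nulls for the first statement to succeed), each of the following is a valid insert. The last statement uses the Default keyword inside the Values list.

INSERT INTO error_log DEFAULT VALUES
INSERT INTO error_log (ErrorDate, ErrorMsg) VALUES (GETDATE(), 'Disk full')
INSERT INTO error_log (ErrorDate, ErrorMsg, ErrorClass) VALUES (GETDATE(), 'Disk full', DEFAULT)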

Stored Procedures
Instead of using a Select statement directly, you can use a stored procedure containing a Select statement to insert data into a table. When Select is used in this way, the result set generated by the stored procedure’s Select statement is loaded into the table. The number and type of the returned fields must match the layout of the table given in the Insert statement, as shown below.

INSERT [INTO]
{table_name | view_name} [(column_list)]
EXECute {procedure_name | @procedure_name_var}
[[@parameter_name =] {value | @variable [OUTPUT] | DEFAULT}
[, [@parameter_name =] {value | @variable [OUTPUT] | DEFAULT}]...]
The stored procedure can have multiple Select statements in it, as long as all of them have the same result format. The parameters provided to the stored procedure can be specified as output types, which allows the stored procedure to return additional information.
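
A minimal sketch (the procedure and destination table are invented; publishers is from the pubs sample database):

CREATE PROCEDURE get_us_publishers AS
SELECT pub_id, pub_name FROM publishers WHERE country = 'USA'
GO
INSERT INTO pub_copy (pub_id, pub_name)
EXECUTE get_us_publishers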

You can use remote stored procedures to load or pull data from a remote source either by taking the field values as arguments and executing the Insert statement on the remote machine or by returning result sets and using the remote procedure in the Insert statement. With these methods, you can perform transfers across servers, which can be very useful in distributed systems.
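
For instance, assuming remoteserver has been configured as a remote server and carries the same procedure and pubs database, qualifying the procedure name pulls the rows across the network:

INSERT INTO pub_copy (pub_id, pub_name)
EXECUTE remoteserver.pubs.dbo.get_us_publishers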


TRANSFER MANAGER

If you don’t need to manipulate data during a transfer between two SQL Servers, the Transfer Manager is the easiest way to transfer data. Go to the Enterprise Manager’s Tools menu and choose Database/Object Transfer to display the Database/Object Transfer dialog box shown in Figure 14.2. From this dialog box, you can transfer data and structure between any two SQL Server databases.

Clicking the Foreign Source button displays the Foreign Source dialog box, where you enter the server name, user ID, and password required to make the connection. The source server can be any SQL server that you can connect to from the destination server. The specified user account must have Select permission on the objects to be transferred. The destination server must be a registered server; you can access the Register Server dialog box with either the New Source or the New Destination button.

From the dialog box in Figure 14.2, you can choose to transfer data, schema, or both. You can generate the schema to include Drop statements, guaranteeing that the new schema will replace any objects that currently exist. Remember that dropping objects eliminates all the data in the destination tables, so if you aren’t transferring the data at the same time, or if you need to preserve the data in the destination objects, you should save that data elsewhere first. If the schema contains Drops and objects in the schema don’t exist in the destination database, errors will be generated, but the transfer process will ignore those errors and transfer the remaining objects. If the schema doesn’t contain Drops, errors will be generated if an object is already present in the destination database. The schema can also be generated to include all the dependencies of the objects specified, thereby ensuring that if the entire database isn’t transferred at once, all the necessary dependent objects are generated when the transfer is performed. The data can be transferred either preserving the current data or overwriting it. Data rows that violate constraints or rules will not be transferred. The data transfer side of the Transfer Manager follows the same rules as BCP in terms of triggers, rules, defaults, and DRI.

All the objects in the source database are transferred by default. By clearing the Transfer All Objects check box and clicking Choose Objects, you can open the Choose Objects to be Transferred dialog box (Figure 14.3). You can choose categories or individual objects.

You can customize scripting options from the Database/Object Transfer dialog box (Figure 14.2) by clearing the Use Default Scripting Options check box and clicking Scripting Options. The Transfer Scripting Options dialog box (Figure 14.4) will appear. In this dialog box, you can specify that scripts include referential integrity, triggers, and column bindings. User-defined types can be forced to their base values, and field identifiers can be placed in quotes so that fields whose names are keywords will be interpreted correctly.

You can also customize the security generated with each script, which controls users, logins, and permissions on transfer, and you can selectively include the clustered and non-clustered indexes in the script. By default, the clustered indexes are created before the data transfer and the non-clustered indexes are created after the transfer, which requires the index pages to be updated during the data load. When you select the Create Clustered Keys/Indexes After Data Transfer With 'Sorted Data' option, the clustered indexes are built with the SORTED_DATA option after the data is transferred. This option causes the statistics to be generated correctly and significantly improves the performance of large data transfers. It also lets the fill factor for the index be retained in the destination table.

The transfer process generates storage files containing the scripts for the transfer. These files are stored in the \Log directory; many files are generated, each corresponding to one of the object type categories. After the scripting is complete, if any errors were encountered, the View Logs button is enabled and you can review the list of errors. If the whole database is not transferred, errors in transferring rights and owners often occur: an object is not transferred if its owner does not exist in the destination database and the scripting options are set not to transfer users.


SAVING QUERY RESULTS

The query interface can be a fast and efficient method of exporting data. After executing a query or stored procedure, you can save the output to a file by clicking the File Save button; the Save option in the Enterprise Manager’s File menu presents the same dialog box. When you use this method, you should generally limit the output to a single result layout, although you can use multiple queries if all of the result sets have the same number of fields. You can display the Query Flags tab of the Query Options dialog box (Figure 14.5) by clicking Query Options on the toolbar in the Query window or by selecting Set Options from the Enterprise Manager’s Query menu.

The second tab of the Query Options dialog box is the Format Options tab (Figure 14.6); from this tab, you can set the characteristics of the output format. The output from a query can be returned as a standard, column-aligned result set or as text delimited with a comma, a tab, or any other single character. The numeric fields in a delimited file can be right-aligned, which helps some import programs recognize them as numeric. The output can also be set to perform verbose prints, which include the print statements produced by Showplan and I/O statistics analysis. You can suppress column headings by clearing the Print Headers check box, and selecting the Output Query box inserts the query that produced the results at the top of the result set. After the data is saved, the file can be loaded into any application capable of processing the format in which the results were generated.

Using query results to transfer data lets you gather specific data quickly with very little preparation. It can also be very useful in troubleshooting. You can create one-time reports and perform very specialized data analysis with this technique, saving hours of development time.


SUMMARY

You can move data and the structure of a database into and out of SQL Server in many ways. Some of the older methods can still be used effectively in appropriate circumstances. When you transfer database information between SQL Servers, the SQL object-naming extensions make writing the transfer as easy as writing the Select statement. The Transfer Manager lets you load remote, connected servers with data from a variety of sources, which is very useful when you create a new database on a registered server. The results window of the query tool provides a quick and easy way to export character data to other applications from any registered server.

Table 14.1 summarizes your data transfer options.


