SQL Remove Duplicate Records Solutions

Today i face a little problem. I wanted to change one of my field into unique however there were already some duplicate records (around 900+ records) in that field. And here are some of the ways i can remove duplicate records i though that i would share it out. However, before doing any fixing on our duplicate records. How about we find out which records were duplicated? If you want to know which records is duplicated, just fire the below command,

SELECT photo
FROM photograph
GROUP BY photo
HAVING count(photo) > 1

This sort of sql command is tested in most interview question. However, it is pretty simple by using groupby and having to show which data have more than 1 records.

Duplicate another table

The easiest way is to extract all unique records into temporary table, remove your current table and rename your temporary table.


CREATE TABLE photograph_tmp AS SELECT * FROM photograph GROUP BY photo
DELETE TABLE photograph
RENAME TABLE photograph_tmp TO photograph

The above will definitely fix your duplication problem but do remember to backup your records before doing anything silly.

Similar Duplicate

Another similar way instead we backup our records first, empty our original table and insert the non duplicated records in.

CREATE TABLE photograph_backup AS SELECT * FROM photograph GROUP BY photo
TRUNCATE photograph
INSERT INTO photograph SELECT * FROM photograph_backup  GROUP BY productid

This way, you don't have to worry about backing up or destorying your precious data.

Delete duplicate records

Another method is to delete all the duplicated data in the table without creating another table.

DELETE FROM photograph
   WHERE photo IN
   (SELECT photo
       FROM photograph
	   GROUP BY photo HAVING count(photo) > 1);

bascailly we search all our duplicated data and delete them from the real table. This is quite risky so don't be lazy and try out the safer way above.

Hope it helps 🙂

15 Ways to Optimize Your SQL Queries

Previous article was on 10 Ways To Destroy A SQL Database that sort of teaches you what mistakes many company might make on their database that will eventually lead to a database destroy. In this article,  you will get to know 15 ways to optimize your SQL queries. Many ways are common to optimize a query while others are less obvious.

Indexes

Index your column is a common way to optimize your search result. Nonetheless, one must fully understand how does indexing work in each database in order to fully utilize indexes. On the other hand, useless and simply indexing without understanding how it work might just do the opposite.

Symbol Operator

Symbol operator such as >,<,=,!=, etc. are very helpful in our query. We can optimize some of our query with symbol operator provided the column is indexed. For example,

SELECT * FROM TABLE WHERE COLUMN > 16

Now, the above query is not optimized due to the fact that the DBMS will have to look for the value 16 THEN scan forward to value 16 and below. On the other hand, a optimized value will be

SELECT * FROM TABLE WHERE COLUMN >= 15

This way the DBMS might jump straight away to value 15 instead. It's pretty much the same way how we find a value 15 (we scan through and target ONLY 15) compare to a value smaller than 16 (we have to determine whether the value is smaller than 16; additional operation).

Wildcard

In SQL, wildcard is provided for us with '%' symbol. Using wildcard will definitely slow down your query especially for table that are really huge. We can optimize our query with wildcard by doing a postfix wildcard instead of pre or full wildcard.

#Full wildcard
SELECT * FROM TABLE WHERE COLUMN LIKE '%hello%';
#Postfix wildcard
SELECT * FROM TABLE WHERE COLUMN LIKE  'hello%';
#Prefix wildcard
SELECT * FROM TABLE WHERE COLUMN LIKE  '%hello';

That column must be indexed for such optimize to be applied.

P.S: Doing a full wildcard in a few million records table is equivalence to killing the database.

NOT Operator

Try to avoid NOT operator in SQL. It is much faster to search for an exact match (positive operator) such as using the LIKE, IN, EXIST or = symbol operator instead of a negative operator such as NOT LIKE, NOT IN, NOT EXIST or != symbol. Using a negative operator will cause the search to find every single row to identify that they are ALL not belong or exist within the table. On the other hand, using a positive operator just stop immediately once the result has been found. Imagine you have 1 million record in a table. That's bad.

COUNT VS EXIST

Some of us might use COUNT operator to determine whether a particular data exist

SELECT COLUMN FROM TABLE WHERE COUNT(COLUMN) > 0

Similarly, this is very bad query since count will search for all record exist on the table to determine the numeric value of field 'COLUMN'. The better alternative will be to use the EXIST operator where it will stop once it found the first record. Hence, it exist.

Wildcard VS Substr

Most developer practiced Indexing. Hence, if a particular COLUMN has been indexed, it is best to use wildcard instead of substr.

#BAD
SELECT * FROM TABLE WHERE  substr ( COLUMN, 1, 1 ) = 'value'.

The above will substr every single row in order to seek for the single character 'value'. On the other hand,

#BETTER
SELECT * FROM TABLE WHERE  COLUMN = 'value%'.

Wildcard query will run faster if the above query is searching for all rows that contain 'value' as the first character. Example,

#SEARCH FOR ALL ROWS WITH THE FIRST CHARACTER AS 'E'
SELECT * FROM TABLE WHERE  COLUMN = 'E%'.

Index Unique Column

Some database such as MySQL search better with column that are unique and indexed. Hence, it is best to remember to index those columns that are unique. And if the column is truly unique, declare them as one. However, if that particular column was never used for searching purposes, it gives no reason to index that particular column although it is given unique.

Max and Min Operators

Max and Min operators look for the maximum or minimum value in a column. We can further optimize this by placing a indexing on that particular columnMisleading We can use Max or Min on columns that already established such Indexes. But if that particular column is frequently use, having an index should help speed up such searching and at the same time speed max and min operators. This makes searching for maximum or minimum value faster. Deliberate having an index just to speed up Max and Min is always not advisable. Its like sacrifice the whole forest for a merely a tree.

Data Types

Use the most efficient (smallest) data types possible. It is unnecessary and sometimes dangerous to provide a huge data type when a smaller one will be more than sufficient to optimize your structure. Example, using the smaller integer types if possible to get smaller tables. MEDIUMINT is often a better choice than INT because a MEDIUMINT column uses 25% less space. On the other hand, VARCHAR will be better than longtext to store an email or small details.

Primary Index

The primary column that is used for indexing should be made as short as possible. This makes identification of each row easy and efficient by the DBMS.

String indexing

It is unnecessary to index the whole string when a prefix or postfix of the string can be indexed instead. Especially if the prefix or postfix of the string provides a unique identifier for the string, it is advisable to perform such indexing. Shorter indexes are faster, not only because they require less disk space, but because they also give you more hits in the index cache, and thus fewer disk seeks.

Limit The Result

Another common way of optimizing your query is to minimize the number of row return. If a table have a few billion records and a search query without limitation will just break the database with a simple SQL query such as this.

SELECT * FROM TABLE

Hence, don't be lazy and try to limit the result turn which is both efficient and can help minimize the damage of an SQL injection attack.

SELECT * FROM TABLE WHERE 1 LIMIT 10

Use Default Value

If you are using MySQL, take advantage of the fact that columns have default values. Insert values explicitly only when the value to be inserted differs from the default. This reduces the parsing that MySQL must do and improves the insert speed.

In Subquery

Some of us will use a subquery within the IN operator such as this.

SELECT * FROM TABLE WHERE COLUMN IN (SELECT COLUMN FROM TABLE)

Doing this is very expensive because SQL query will evaluate the outer query first before proceed with the inner query. Instead we can use this instead.

SELECT * FROM TABLE, (SELECT COLUMN FROM TABLE) as dummytable WHERE dummytable.COLUMN = TABLE.COLUMN;

Using dummy table is better than using an IN operator to do a subquery. Alternative, an exist operator is also better.

Utilize Union instead of OR

Indexes lose their speed advantage when using them in OR-situations in MySQL at least. Hence, this will not be useful although indexes is being applied

SELECT * FROM TABLE WHERE COLUMN_A = 'value' OR COLUMN_B = 'value'

On the other hand, using Union such as this will utilize Indexes.

SELECT * FROM TABLE WHERE COLUMN_A = 'value'
UNION
SELECT * FROM TABLE WHERE COLUMN_B = 'value'

Hence, run faster.

Summary

Definitely, these optimization tips doesn't guarantee that your queries won't become your system bottleneck. It will require much more benchmarking and profiling to further optimize your SQL queries. However, the above simple optimization can be utilize by anyone that might just help save some colleague rich bowl while you learn to write good queries. (its either you or your team leader/manager)

10 Ways To Destroy A SQL Database

Database is the asset of most online or internet based company. Everyone looks at how to improve and secure their databases to protect or improve their company. While everyone is searching for remedies or enhancement pills for their company, there are often simple mistakes made by some companies (especially the small to middle ones) that might just destroy their businesses. Rather than looking at how we can protect our database, this article will look at ways to destroy it instead! (through mistake, of course)

Don't Monitor Error Log

The first line of defense that any database would have. Error log may indicates first time problem occurs or warnings that your database might be facing problem. These troubles can be easily avoided or missed depending on what you do. Be my guest and ignore error log will definitely help to destroy your database.

Many company databases are designed in a way to enforce availability. Hence, there will surely be primary and slaves databases in such company. These databases also contain error files. However, if you would like your secondary databases or slave databases go out of sync with the primary database, be sure to ignore the error file and give it some time. Depending on the size of your company, the amount of data lost caused by the lost of synchronization might just cost you dearly when some hero DBA shutdown the primary database without issuing "stop slave" first on the slave database or wait till some errant SQL come down the line. Although this might take some times to destroy your database but its worth to think about using it.

Don't Fine Tune Queries

You have a big server, lots of memory and fast disk, so don't have to worry. Continue with this attitude and you are on your way to success (destroy it!). Developers writing bad code that caused full table scan and trying their best to trash your query cache, overloading your Innodb buffer cache with useless blocks. Plus hitting disk instead of main memory as much as possible? Well, there won't be ANY problem since every piece of hardware is the latest, fastest and most powerful ones! Let's just wait and see till your database fall on its knees! Especially when data is getting larger! That's the best time we see this happen!

Don't Document Procedures and Configurations

Ah! No documentation won't cause a single problem! No problem at all my friend! Why the need to enforce such tedious job when 'we' can maintain the job? Just you wait when the 'we' becomes 'i' and 'i' becomes 'who'. Employees come and go nowadays. People always look for a better future in life and no matter what happens to you, it's really none of their concern. Man! What so difficult? Let's just hire an expert to take the job. Oh boy! That's a great solution! Let's see when he presses the wrong button 🙂

Don't Backup

Another great way to destroy your database is to avoid making backups! Hardware failure is a common thing in data center. Hard disk fail, power supply down, plugs get pulled, basically anything you can imagine. Don't backup regularly can just do the thing!

Its not only hardware that might assist you in your task. Developers or DBA who accidentally deletes data can also be of help. Deleting columns, table, rows, data or even database! These sort of things do happen and it happens quite frequently. Other than deleting data, mistakes made on program might just transfer the wrong data to the wrong place. All of these are just things that might happen to any IT firms. Well, there is always never happen before situation for some of you. Just follow your instinct (backup suck!) on this and you will do just fine.

Don't Use Memory Wisely

Server nowadays have huge memory installed with it. Technology advancement has made everything more powerful than before. Furthermore, it is that affordable that many companies can afford big and fast memory! With such powerful memory backup, we can assure that whatever developers throw in, the server and database will surely be able to take it! We can safely assume MySQL knows our database memory requirement! We just have to run the wizard installation and viola! Everything is automatic nowadays! There won't be such thing as misallocation of memory. The system is perfect.

Good to know. You have everything that required (even attitude) in preparing to bomb up your database.

Don't Worry About Indexes

Indexes is the most effective way to destroy your database. However, you must know the trick to do this. There are two ways to succeed. The first one required you to do absolutely nothing. No indexing is required purely full scan table. However, this required certain criteria to be meet but it should do the trick. If this doesn't suit your taste, you can try a faster way by creating useless or unwanted indexes and ensure that your table have tons of records. This can surely improve the process of destroying your database.

Don't Normalize Your Database Design

If you are just starting to build a system, you can consider skipping normalization in your database design. Skipping normalization can help contribute to bad database design which is part of the plan to destroy a database. Furthermore, without normalization, there is a good chance your system can be inaccurate, slow, and inefficient, and it might not even produce the data you expect.

Don't Make Policies For Database Patch

Its true that majority company doesn't update their databases immediately after a new security vulnerability patch has been released. It would be irresponsible for a company to deploy a patch in production without first running it through quality assurance. Furthermore, some companies didn't event bother to have policies to update their database. If you think that databases are a little more isolated than the desktop, there's less of a security concern and thinks that your databases are more secure because they're behind firewalls and and have a good perimeter security, you are on the right track of destroying a database.

Don't Bother Caching

My database can take tons of crap anything throw towards it. Its the fastest computer (why bother to have technology advancement when we already have the fastest? Dot) on the planet. Cache or no cache won't destroy my database. Its the fastest (ya ya, i get it.). The dramatics performance gain using cache table might not interest you. Scalability, flexibility, availability and performance are just some benefits that caching can gives. Multitier architecture, what bullshit. Your server will NEVER go down and you will NEVER required cache table to be available. Millions of hits on the server database might just do the trick on helping you achieve what you want in this article. It will definitely work better when your table has few millions record and a few lousy queries. (it might not even required millions of hit to kill it)

Don't Use Fast or Reliable Disk

Using something like a single disk or mirror will definitely makes your I/O the main bottleneck on your system. With the help of a single disk, you can expect OS and your database fighting for resources, serving one user at a time while others waiting for their turn. To make things worst, you can try utilize RAID-5 instead of RAID-10. May be you already are!  Well, to  compare between these two, RAID-5 only performed reasonably well on read while RAID-10 exceed almost two times better than RAID-5 on writes. RAID-5 only can handle 1 fails and any drive die will approximately caused 64% degration in read performance until the faulty drive is discovered. Furthermore during recovery, read performance for a RAID5 array is degraded by as much as 80% compared to RAID-5 which only degrade performance on the faulty disk itself. There are more 'advantages' on RAID-5 but i will just stop here. RAID-5 seems to be good with destroying than RAID-10 don't you think so?

Conclusion

The points discuss here might just happen to large data or traffic internet site (well, your site will eventually grow to have big data, hopefully). However, the conclusion to all these jokes are more valuable; Learn to save your ass. No one will.