@BhanuPratap wrote:
I am working on the San Francisco crime data, and I want to divide my dataset (more than 800,000 rows) into five subsets. I will build one model on each subset and then use these five models to predict the category. Please help me with the sampling. One category has only 6 rows and another has 174,900 rows. How should I handle this imbalance when sampling, so that every subset has the same distribution of categories as the full dataset?
Which sampling method should I use for better prediction? What other ways are there to sample the data (878,049 rows)? Is it necessary to have every category in all five subsets, or can the dataset be divided category-wise?
These are the category frequencies in the train dataset.
| Category | Freq |
|---|---|
| TREA | 6 |
| PORNOGRAPHY/OBSCENE MAT | 22 |
| GAMBLING | 146 |
| SEX OFFENSES NON FORCIBLE | 148 |
| BRIBERY | 289 |
| BAD CHECKS | 406 |
| FAMILY OFFENSES | 491 |
| SUICIDE | 508 |
| EMBEZZLEMENT | 1166 |
| LOITERING | 1225 |
| ARSON | 1513 |
| LIQUOR LAWS | 1903 |
| DRIVING UNDER THE INFLUENCE | 2268 |
| KIDNAPPING | 2341 |
| RECOVERED VEHICLE | 3138 |
| DRUNKENNESS | 4280 |
| DISORDERLY CONDUCT | 4320 |
| SEX OFFENSES FORCIBLE | 4388 |
| STOLEN PROPERTY | 4540 |
| TRESPASS | 7326 |
| PROSTITUTION | 7484 |
| WEAPON LAWS | 8555 |
| SECONDARY CODES | 9985 |
| FORGERY/COUNTERFEITING | 10609 |
| FRAUD | 16679 |
| ROBBERY | 23000 |
| MISSING PERSON | 25989 |
| SUSPICIOUS OCC | 31414 |
| BURGLARY | 36755 |
| WARRANTS | 42214 |
| VANDALISM | 44725 |
| VEHICLE THEFT | 53781 |
| DRUG/NARCOTIC | 53971 |
| ASSAULT | 76876 |
| NON-CRIMINAL | 92304 |
| OTHER OFFENSES | 126182 |
| LARCENY/THEFT | 174900 |

PS: I am using R.
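
To make the question concrete, here is a minimal base R sketch of one possible approach, stratified fold assignment: rows are shuffled within each category and dealt out to the five subsets in turn, so each subset keeps roughly the same category proportions. The toy data frame, the `Category` column name, the fixed seed, and the helper `assign_folds` are all assumptions for illustration, not part of the real dataset or of any package.

```r
# Toy stand-in for the real training data
# (assumption: the label column is called `Category`).
set.seed(42)
train <- data.frame(
  Category = rep(c("TREA", "GAMBLING", "LARCENY/THEFT"), times = c(6, 146, 1000)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assign each row to one of k folds, category by category,
# so every fold receives roughly n_c / k rows of each category c.
assign_folds <- function(labels, k = 5) {
  fold <- integer(length(labels))
  for (idx in split(seq_along(labels), labels)) {
    # Shuffle the row indices within this category, then deal them out round-robin.
    shuffled <- idx[sample.int(length(idx))]
    fold[shuffled] <- rep_len(seq_len(k), length(shuffled))
  }
  fold
}

train$fold <- assign_folds(train$Category, k = 5)

# Rows per category in each fold.
print(table(train$Category, train$fold))
```

With this scheme a category that has fewer rows than there are subsets cannot appear in every subset, and a category with only 6 rows contributes just one or two rows per subset, which is probably too few for a model to learn from. That is the case I am unsure how to handle.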