With the rapid development of information technology, the demands on data quality keep rising, and data cleaning techniques have increasingly become a focus of attention and research.
How to mine useful information quickly and accurately from massive volumes of market information is a text mining research direction with rich potential for development. Current text mining technology still has difficulty quickly and accurately identifying the erroneous or irrelevant "dirty data" within that information.
In general, buying at a low price a stock in which homologous institutional investors hold a high proportion of shares (far greater than the proportion held by the other top ten shareholders) is an investment behavior that obtains high returns relatively easily in the stock market. This thesis therefore takes the practically meaningful problem of mining "the shareholding proportion of homologous institutional investors among the ten largest circulating shareholders" as its specific object of study, in order to examine how data cleaning techniques can be used to solve the "dirty data" problem.
Based on the practical application of mining "the shareholding proportion of homologous institutional investors among the ten largest shareholders" from market information, the thesis first introduces the research background of the data cleaning problem in a stock market information mining system and surveys domestic and international research on text mining and data cleaning techniques. It then gives an overview of data cleaning knowledge and studies how to apply statistical analysis and artificial intelligence techniques to detect and clean abnormal data in the market information mining system. Next, building on the meaning, definition, and basic workflow of duplicate record cleansing, and drawing on the practice of stock market information mining, it analyzes the algorithms involved in the duplicate record cleansing process and proposes improvements to them. Finally, it presents the stock market information mining system whose main function is mining "the shareholding proportion of homologous institutional investors among the ten largest circulating shareholders", covering its application background, source data problems, system framework, experimental process and results, evaluation, and innovations.
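As a rough illustration of the two cleaning steps named above (abnormal data detection and duplicate record cleansing), the following minimal sketch shows one possible shape for them. It is not the thesis's algorithm: the record fields (`stock`, `holder`, `ratio`), the 0 to 100 percent validity rule, the 0.9 similarity threshold, and the use of Python's `difflib.SequenceMatcher` for name matching are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalize(name: str) -> str:
    """Lowercase a shareholder name and drop whitespace and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def is_valid_ratio(record: Dict) -> bool:
    """Assumed validity rule: a shareholding percentage must lie in [0, 100]."""
    return 0.0 <= record["ratio"] <= 100.0


def is_duplicate(a: Dict, b: Dict, threshold: float = 0.9) -> bool:
    """Assumed rule: same stock code and sufficiently similar holder names."""
    if a["stock"] != b["stock"]:
        return False
    similarity = SequenceMatcher(
        None, normalize(a["holder"]), normalize(b["holder"])
    ).ratio()
    return similarity >= threshold


def clean(records: List[Dict], threshold: float = 0.9) -> List[Dict]:
    """Drop records with abnormal ratios, then keep the first of each
    group of near-duplicate records (simple pairwise comparison)."""
    kept: List[Dict] = []
    for record in records:
        if not is_valid_ratio(record):
            continue  # abnormal value: discard
        if any(is_duplicate(record, k, threshold) for k in kept):
            continue  # near-duplicate of a record already kept
        kept.append(record)
    return kept


if __name__ == "__main__":
    # Hypothetical sample records, invented for illustration only.
    sample = [
        {"stock": "600000", "holder": "Alpha Growth Fund", "ratio": 3.2},
        {"stock": "600000", "holder": "alpha growth  fund.", "ratio": 3.2},
        {"stock": "600000", "holder": "Beta Insurance Co", "ratio": 150.0},
        {"stock": "600001", "holder": "Alpha Growth Fund", "ratio": 1.1},
    ]
    for row in clean(sample):
        print(row)
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a top-ten shareholder list per stock. A larger data set would call for a blocking or sorted-neighborhood strategy instead.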