A hash table collision happens when two different keys hash to the same slot. Two items cannot be stored in the same array slot.

**There are many techniques to avoid hash table collisions; here we list two:**

**Separate Chaining:**

It stores all items that hash to the same slot in a secondary data structure, typically a linked list.

**Open addressing:**

It searches for other slots, often using a second function, and stores the item in the first empty slot that is found
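The two strategies above can be sketched as follows. The table size and hash function here are illustrative choices, and linear probing stands in for the probing step (double hashing would use a second hash function for the step size):

```python
SIZE = 8

def h(key):
    return hash(key) % SIZE

# Separate chaining: each slot holds a list (chain) of (key, value) pairs.
chained = [[] for _ in range(SIZE)]

def chain_put(key, value):
    bucket = chained[h(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:                 # update an existing key in place
            bucket[i] = (key, value)
            return
    bucket.append((key, value))      # colliding keys share the slot's list

# Open addressing (linear probing): on a collision, scan forward
# until an empty slot or the same key is found.
open_table = [None] * SIZE

def open_put(key, value):
    i = h(key)
    for _ in range(SIZE):
        if open_table[i] is None or open_table[i][0] == key:
            open_table[i] = (key, value)
            return
        i = (i + 1) % SIZE           # probe the next slot
    raise RuntimeError("table full")
```

Both functions keep colliding keys retrievable: chaining grows the slot's list, probing spills into neighbouring slots.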

**Tools useful for data analysis include:**

- Tableau
- RapidMiner
- OpenRefine
- KNIME
- Google Search Operators
- Solver
- NodeXL
- io
- Wolfram Alpha
- Google Fusion Tables

**N-gram:**

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It is a type of probabilistic language model for predicting the next item in such a sequence, in the form of an (n − 1)-order Markov model.
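As a sketch, bigrams (n = 2) can be extracted and used to pick the most frequent next word; the sample sentence is made up:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-item windows over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
# e.g. ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ...

# Predict the next word after "the" from bigram frequencies.
counts = Counter(b for b in bigrams if b[0] == "the")
next_word = counts.most_common(1)[0][0][1]
```

With real text, the bigram counts form a conditional frequency table: the model predicts the next word from only the previous one, which is the (n − 1)-order Markov assumption in action.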

**Criteria for a good data model include:**

- It can be easily consumed
- It should scale well as the data grows or changes
- It should provide predictable performance
- A good model can adapt to changes in requirements.

During imputation we replace missing data with substituted values.

**The types of imputation techniques are:**

**Single Imputation**

**Hot-deck imputation:** A missing value is imputed from a randomly selected similar record; historically this was done with the help of punch cards

**Cold-deck imputation:** It works the same as hot-deck imputation, but it is more advanced and selects donors from another dataset

**Mean imputation:** It involves replacing a missing value with the mean of that variable for all other cases

**Regression imputation:** It involves replacing a missing value with the predicted value of that variable based on other variables

**Stochastic regression imputation:** It is the same as regression imputation, but it adds a random residual (drawn using the regression variance) to each predicted value

**Multiple Imputation:**

Unlike single imputation, multiple imputation estimates the missing values multiple times and combines the results
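As a minimal sketch of the simplest of these techniques, mean imputation takes only a few lines; `None` marks missing values and the data is made up:

```python
def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None]
filled = mean_impute(ages)   # both gaps filled with the mean, 30.0
```

The price of this simplicity is that it shrinks the variable's variance, which is exactly the problem stochastic regression imputation tries to correct with its random residual.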

Clustering is a method of grouping data. A clustering algorithm divides a data set into natural groups, or clusters.

**Properties of clustering algorithms include:**

- Hierarchical or flat
- Iterative
- Hard or soft
- Disjunctive

**A data scientist must have the following skills:**

**Database knowledge**

- Database management
- Data blending
- Querying
- Data manipulation

**Predictive Analytics**

- Basic descriptive statistics
- Predictive modeling
- Advanced analytics

**Big Data Knowledge**

- Big data analytics
- Unstructured data analysis
- Machine learning

**Presentation skills**

- Data visualization
- Insight presentation
- Report design

**To handle suspected or missing data:**

- Prepare a validation report that gives information on all suspected data, including the validation criteria it failed and the date and time of occurrence
- Experienced personnel should examine the suspicious data to determine its acceptability
- Invalid data should be assigned a validation code and replaced
- To work on missing data, use the best analysis strategy, such as deletion methods, single imputation methods, model-based methods, etc.

An outlier is a term commonly used by analysts for a value that appears far away from and diverges from the overall pattern in a sample.

**There are two types of Outliers:**

- Univariate
- Multivariate

**Tools used in Big Data include:**

- Hadoop
- Hive
- Pig
- Flume
- Mahout
- Sqoop

**Statistical methods that are useful for data scientists are:**

- Bayesian method
- Markov process
- Spatial and cluster processes
- Rank statistics, percentiles, outlier detection
- Imputation techniques, etc.
- Simplex algorithm
- Mathematical optimization

**Various steps in an analytics project include:**

- Problem definition
- Data exploration
- Data preparation
- Modelling
- Validation of data
- Implementation and tracking

**Some of the common problems faced by data analysts are:**

- Common misspelling
- Duplicate entries
- Missing values
- Illegal values
- Varying value representations
- Identifying overlapping data

**To become a data analyst:**

- Robust knowledge of reporting packages (Business Objects), programming languages (XML, JavaScript, or ETL frameworks), and databases (SQL, SQLite, etc.)
- Strong skills, with the ability to analyze, organize, collect, and disseminate big data accurately
- Technical knowledge of database design, data models, data mining, and segmentation techniques
- Strong knowledge of statistical packages for analyzing large datasets (SAS, Excel, SPSS, etc.)

**To deal with multi-source problems:**

- Restructuring of schemas to accomplish a schema integration
- Identify similar records and merge them into a single record containing all relevant attributes without redundancy

**Some of the best practices for data cleaning include:**

- Sort data by different attributes
- For large datasets, clean them stepwise and improve the data with each step until you achieve good data quality
- For large datasets, break them into smaller chunks. Working with less data will increase your iteration speed
- To handle common cleaning tasks, create a set of utility functions/tools/scripts. These might include remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don’t match a regex
- If you have an issue with data cleanliness, arrange the problems by estimated frequency and attack the most common ones first
- Analyze the summary statistics for each column (standard deviation, mean, number of missing values)
- Keep track of every data cleaning operation, so you can alter or remove operations if required.
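The utility functions mentioned above (value remapping, regex search-and-replace, blanking non-matching values) might be sketched like this; the city data is made up:

```python
import re

def remap(values, mapping):
    """Replace values using a lookup table (e.g. loaded from a CSV file)."""
    return [mapping.get(v, v) for v in values]

def regex_replace(values, pattern, repl):
    """Apply a regex search-and-replace to every value."""
    return [re.sub(pattern, repl, v) for v in values]

def blank_non_matching(values, pattern):
    """Blank out all values that don't fully match the given regex."""
    return [v if re.fullmatch(pattern, v) else "" for v in values]

cities = ["NYC", "N.Y.C.", "new york"]
cleaned = remap(regex_replace(cities, r"\.", ""), {"new york": "NYC"})
```

Composing small functions like these keeps each cleaning operation explicit and repeatable, which also makes it easy to log every step as the last best practice recommends.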

**KPI:** It stands for Key Performance Indicator, a metric that consists of any combination of spreadsheets, reports, or charts about a business process

**Design of experiments:** It is the initial process used to split your data, sample it, and set it up for statistical analysis

**80/20 rule:** It means that 80 percent of your income comes from 20 percent of your clients.

**Usually, the methods used by data analysts for data validation are:**

- Data screening
- Data verification

Collaborative filtering is a simple algorithm for creating a recommendation system based on user behavioral data. The most important components of collaborative filtering are users, items, and interests.

A good example of collaborative filtering is when you see a statement like “recommended for you” on online shopping sites, which pops up based on your browsing history.
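A minimal user-based sketch: find the most similar user by cosine similarity over shared item ratings, then recommend that user's unseen items. The ratings below are made up:

```python
from math import sqrt

ratings = {
    "alice": {"book": 5, "film": 3, "game": 4},
    "bob":   {"book": 5, "film": 2, "game": 4, "album": 5},
    "carol": {"film": 5, "album": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user):
    # Find the most similar other user...
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    # ...and suggest their items that this user has not rated yet.
    return [i for i in ratings[nearest] if i not in ratings[user]]
```

Here `recommend("alice")` picks bob as the nearest neighbour and suggests the album he rated highly; item-based variants compute the same similarities between item columns instead of user rows.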

**The difference between data mining and data profiling is that:**

**Data profiling:** It targets the instance analysis of individual attributes. It gives information on various attributes like value range, discrete values and their frequency, occurrence of null values, data type, length, etc.

**Data mining:** It focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relations holding between several attributes, etc.

K-means is a well-known partitioning method. Objects are classified as belonging to one of K groups, with K chosen a priori.

**In the K-means algorithm:**

- The clusters are spherical: the data points in a cluster are centered around that cluster
- The variance/spread of the clusters is similar
- Each data point belongs to the closest cluster
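These assumptions show up directly in a minimal 1-D K-means sketch (K = 2, made-up points): each point is assigned to its closest center, then each center moves to the mean of its cluster:

```python
def kmeans_1d(points, centers, steps=10):
    for _ in range(steps):
        # Assignment step: each point joins its closest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0])
```

After a couple of iterations the centers settle at the means of the two natural groups; with real multi-dimensional data the same two steps run over Euclidean distances between vectors.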

**Responsibilities of a Data Analyst include:**

- Provide support for all data analysis and coordinate with customers and staff
- Resolve business-associated issues for clients and perform audits on data
- Analyze results and interpret data using statistical techniques and provide ongoing reports
- Prioritize business needs and work closely with management and information needs
- Identify new process or areas for improvement opportunities
- Analyze, identify and interpret trends or patterns in complex data sets
- Acquire data from primary or secondary data sources and maintain databases/data systems
- Filter and “clean” data, and review computer reports
- Determine performance indicators to locate and correct code problems
- Secure databases by developing an access system that determines user levels of access.

**The missing-data patterns that are generally observed are:**

- Missing completely at random
- Missing at random
- Missing that depends on the missing value itself
- Missing that depends on unobserved input variable